Showing posts with label w3cexlog. Show all posts
Showing posts with label w3cexlog. Show all posts

Friday, March 23, 2012

Outer Join and WHERE clause strange behavior

I am writing a stored procedure to get aggregrate data out of a web site log
table by joining it with a view
DDL for the table:
CREATE TABLE w3cexlog (
pacifictime datetime,
[cs-method] varchar(20),
target varchar(8000),
query varchar(2000),
username varchar(100),
browser varchar(8000),
referrer varchar(2000),
status int,
substatus int,
win32status int,
[sc-bytes] bigint,
[cs-bytes] bigint,
[time-taken] bigint
)
The view is through ADSI, and pulls the Active Directory OU, samAccountName
(which corresponds to username), and full name for the user.
If the view where a base table, it's DDL would look like this
CREATE TABLE PortalUsers (
name nvarchar(512),
samAccountName nvarchar(512),
OU nvarchar(8000)
)
I want to find out how many hits where generated by users in each OU, along
with visitors from each OU, by joining PortalUsers to w3cexlog on
PortalUsers.samAccountName = w3cexlog.username. (The web site uses AD
authentication, so they must match).
Here is the query:
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visito
rs
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
GROUP BY p.ou
ORDER BY 2 DESC, 1
This query gives the expected results; OUs for which none of its users
produced a "hit" still show in the output, but with zero's in the "hits" and
"visitors" columns. However, if I add a WHERE clause to query a specific dat
e
range:
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visito
rs
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
WERE w.pacifictime BETWEEN '20060102' AND '20060112'
GROUP BY p.ou
ORDER BY 2 DESC, 1
The OUs that produces no hits are now gone from the result set! That WHERE
clause wouldn't have removed any rows for the OUs that didn't have hits; the
y
where never there to begin with; so why does the behavior of the outer join
change?
In order to get around this, I constructed the following query
SELECT t1.OU, t2.hits, t2.visitors FROM
(
SELECT DISTINCT OU FROM PortalUsers
) t1
LEFT JOIN
(
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visito
rs
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
WERE w.pacifictime BETWEEN '20060102' AND '20060112'
GROUP BY p.ou
) t2
ON t1.OU = t2.OU
ORDER BY 2 DESC, 1
But the performance of this is terrible."Mark Williams" <MarkWilliams@.discussions.microsoft.com> wrote in message
news:5002E6BF-4C4F-48B1-B333-D7B398C42430@.microsoft.com...
>I am writing a stored procedure to get aggregrate data out of a web site
>log
> table by joining it with a view
>
> SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS
> visitors
> FROM portalusers p LEFT JOIN w3cexlog w on
> p.samaccountname=w.username
> WERE w.pacifictime BETWEEN '20060102' AND '20060112'
> GROUP BY p.ou
> ORDER BY 2 DESC, 1
> The OUs that produces no hits are now gone from the result set! That WHERE
> clause wouldn't have removed any rows for the OUs that didn't have hits;
> they
> where never there to begin with; so why does the behavior of the outer
> join
> change?
>
Mark,
This is the way things work. An outer join may contain results
that fail the ON condition, but it will never contain results that fail
the WHERE condition. The WHERE condition is a real filter
than can be thought of as being applied to the result of the
outer join.
If you have result rows with 0 hits, all the result columns from
w3cexlog in those result rows contain NULL. NULL will never
be between '20060102' and '20060112', so you'll see no rows
that failed the ON condition.
Maybe what you want is just
... on p.samaccountname=w.username
AND w.pacifictime BETWEEN '20060102' AND '20060112'
...
Steve Kass
Drew University|||Another option is to leave the join unchanged and alter the where caluse lik
e
this:
...
WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is
null)
...
ML
http://milambda.blogspot.com/|||Thanks Steve and ML. I should have seen it, but I was a little tired! I had
the incorrect thinking that the WHERE clause was applied before the join,
instead of after.
"Steve Kass" wrote:

> "Mark Williams" <MarkWilliams@.discussions.microsoft.com> wrote in message
> news:5002E6BF-4C4F-48B1-B333-D7B398C42430@.microsoft.com...
> Mark,
> This is the way things work. An outer join may contain results
> that fail the ON condition, but it will never contain results that fail
> the WHERE condition. The WHERE condition is a real filter
> than can be thought of as being applied to the result of the
> outer join.
> If you have result rows with 0 hits, all the result columns from
> w3cexlog in those result rows contain NULL. NULL will never
> be between '20060102' and '20060112', so you'll see no rows
> that failed the ON condition.
> Maybe what you want is just
> ... on p.samaccountname=w.username
> AND w.pacifictime BETWEEN '20060102' AND '20060112'
> ...
> Steve Kass
> Drew University
>
>|||It will work, but I wouldn't suggest this as a good option, I'm afraid.
This is like adding "and I really meant that is was an outer join" to the
where clause.
Or did you see some advantage with putting it in the where clause, like some
problem that doing it this way might solve?
----
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)
"ML" <ML@.discussions.microsoft.com> wrote in message
news:3F48CD19-D89E-4535-8FFD-0045DF823CA1@.microsoft.com...
> Another option is to leave the join unchanged and alter the where caluse
> like
> this:
> ...
> WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is
> null)
> ...
>
> ML
> --
> http://milambda.blogspot.com/|||Another option is to move the condition to the ON clause:
SELECT t1.OU, t2.hits, t2.visitors FROM
(
SELECT DISTINCT OU FROM PortalUsers
) t1
LEFT JOIN
(
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS
visitors, pacifictime
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
GROUP BY p.ou
) t2
ON (t1.OU = t2.OU )
AND t2.pacifictime BETWEEN '20060102' AND '20060112'
ORDER BY 2 DESC, 1
That might perform better|||On Sat, 14 Jan 2006 04:46:01 -0800, ML wrote:

>Another option is to leave the join unchanged and alter the where caluse li
ke
>this:
>...
>WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is
>null)
>...
Hi ML,
That would also include rows with w.pacifictime equal to NULL in the
base table (assuming the column is nullable); I don't think that this is
what Mark wants.
Hugo Kornelis, SQL Server MVP|||Since both methods yield same results the only argument would be performance
.
So, which one would perform better?
ML
http://milambda.blogspot.com/|||On Sun, 15 Jan 2006 14:09:01 -0800, ML wrote:

>Since both methods yield same results the only argument would be performanc
e.
>So, which one would perform better?
Hi ML,
Have you testeed them?
(I'll gladly admit I didn't - but due to the OR in your method, I'd be
surprised if it turns out to be the faster version).
Of course, they don't yield the same results if the column is nullible.
Hugo Kornelis, SQL Server MVP|||Beyond what was discussed about how if the pacifictime column was nullable
(you would want to check for the join columns being null) the argument is
style and symantics.
I would argue that it is very important to write readable queries that can
be understood by the next reader AND that make sense. Performance is
important, and if the less clear version peformed measurably better (if
anything it would likely be no better than equal, but sometimes there can be
reasons to do things that don't seem kosher for performance sake, but this
is unlikely to be one of those cases.
----
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)
"ML" <ML@.discussions.microsoft.com> wrote in message
news:61118E3B-FAED-4124-86F5-26219564531A@.microsoft.com...
> Since both methods yield same results the only argument would be
> performance.
> So, which one would perform better?
>
> ML
> --
> http://milambda.blogspot.com/