Av rating:

Total votes: 24
Total comments: 23
SQL Server Grouping Workbench
26 April 2007
/*This Workbench is about using the GROUP BY clause in a SQL Statement. It is
designed so it can be pasted straight into SSMS or the Query Analyser, though
to run the examples, you'll need the data file too. Just click the "CODE DOWNLOAD"
link at the top of this article. Also included in the download bundle is the
original SQL file, if you need it.
-- Contents
-----------
Introduction
Removing Duplicates
Simple Reporting
Grouping and Pivot Tables
Enumerating the Group members
Ranging
Grouping as a utility
Questions
-- Introduction
---------------
Like all these workbench articles, the whole point is to try different things
out, improve on the code, and learn in a practical way. If you're like us, and
find you can take in information only after real practice doing things for real
then this article is for you.
If you think that it is too elementary, move to the end of the article and test
out your knowledge by answering all the questions (Yes, the editor is giving
away prizes for the best answers he receives. Insist on the glorious USB Red-Gate
pendant).
Aggregations in SQL are very simple to do, but can cause a lot of confusion
due to the word 'Group'. The result you get from a grouping does not consist
of the individual records but a collection or 'aggregation' of records.
A pack of playing cards is a group of 52 playing cards.
If you were to group a pack of cards by colour, you'd have two 'rows' of 26
cards; if you grouped by suit, there would be four 'rows' with 13 in each. You
could group them by their rank/court attribute, or in other ways. The point is
that the result is always an aggregate, even if there is only one member in the
collection.
Each row in the result of a GROUP BY represents a collection, or grouping.
You can get the values for the number of records in each collection, or grouping,
or the sum, average, and so on of numeric values, or the max or min of strings,
but to ask for individual values makes no sense if you haven't grouped by
that value.
Sometimes, you will use an aggregate function in a SELECT Statement without using
a GROUP BY clause. There is an implicit grouping of the entire result into one
row, which is why you can only use aggregate functions in such a case, and will
get one row returned.
GROUP BY statements are the bedrock of reporting in SQL.
The built-in aggregate functions are...
common ones
sum([all|distinct] expr.) Total of the (distinct) values in the expression
avg([all|distinct] expr.) Average of the (distinct) values in the expression
count([all|distinct] expr.) Number of (distinct) non-null values in the expression
count(*) Number of selected rows
max(expr.) Highest value in the expression
min(expr.) Lowest value in the expression
Statistical functions
stdev(expr.) The standard deviation of the values in the expression
Stdevp(expr.) Population standard deviation of the expression values
Var(expr.) The Variance of the expression values
Varp(expr.) The population Variance of the expression values
Checksum functions
Binary_checksum(expr.) Returns the binary checksum value
Checksum(expr.) Returns the checksum value
Checksum(*) checksum is over all the columns of the table
Checksum_agg
([all|distinct] expr.) Returns the checksum of the NON-NULL values in a group.
(You can use sum and avg with numeric columns only--int, smallint, TinyIntBigint
decimal, numeric, float, SmallMoney and money. You cannot use min and max with bit
datatypes. You cannot use aggregate functions other than count(*) with
text and image datatypes.)
We'll start with some simple manoevres and then finish off with a few wheelies and
handbrake-turns.
As sample data, we'll take a list of all Oil-producing countries in the world
and their oil-production
Create a new database or use your 'practice' development Database and then create
this */
CREATE TABLE [dbo].[OilProducers](
[[OilProducer_ID] [int] IDENTITY(1,1) NOT NULL,
[country] [varchar](30) NOT NULL,
[BarrelsPerDay] [int] NOT NULL,
[Continent] [varchar](80) NOT NULL,
[Region] [varchar](80) NOT NULL
) ON [PRIMARY]
GO
-- And we'll get our data in. Download the data file, and load the data as
follows:
-- substitute the full path to your file for the name 'OilProducers.txt'
BULK INSERT dbo.OilProducers
FROM ' OilProducers.txt'
/* or
Declare @Command varchar(8000)
select @Command='bcp '
+db_name()
+'.dbo.OilProducers in OilProducers.txt -c -S '+@@Servername+' -E -T'
execute xp_cmdshell @command--assumes Windows Security
*/
--Removing Duplicates
---------------------
-- Before we get stuck into doing reporting, I'll illustrate how simple it is
-- to remove duplicates from tables using grouping
-- Of course, if you are going to remove duplicates, you will really want to
-- move them to another table for forensic analysis to find out how it happened
-- and whether they are valid entries. Here is our Duplicates table
CREATE TABLE [dbo].[Duplicates](
[OilProducer_ID] [int] IDENTITY(1,1) NOT NULL,
[Duplicate_ID] [int],
[country] [varchar](30) NOT NULL,
[BarrelsPerDay ] [int] NOT NULL,
[Continent] [varchar](80) NOT NULL,
[Region] [varchar](80) NOT NULL
) ON [PRIMARY]
--let's maliciously create some duplicates!
INSERT INTO OilProducers(Country,BarrelsPerDay,continent, region)
SELECT Country,BarrelsPerDay,continent, region
FROM OilProducers WHERE country LIKE 'A%'
INSERT INTO OilProducers(Country,BarrelsPerDay ,continent, region)
SELECT Country,BarrelsPerDay,continent, region
FROM OilProducers WHERE country BETWEEN 'A' AND 'E'
--you can then easily see these duplicates by
SELECT 'First_ID'=MAX(OilProducer_ID), 'No.InTable'=COUNT(*)
FROM OilProducers
GROUP BY country
HAVING COUNT(*)>1--notice that we use a HAVING clause. This is because
--we are selecting only those rows that contain more than one row in the
--original table (we could have checked only part of the original table
--by using the WHERE clause. A WHERE clause cannot contain an aggregate
-- function of course, but the HAVING clause can.
-- So we move them to another table to check them to make sure. We wrap
-- this up in a transaction just in case anything goes wrong, as we don't
-- want to lose a record that might turn out to be legit.
--start of routine
-------------------------------------------------------------
DECLARE @found INT
DECLARE @Errorhappened INT
DECLARE @OtherErrorhappened INT
BEGIN TRANSACTION
WHILE COALESCE(@found,1)>0
AND COALESCE(@ErrorHappened,0) =0
AND COALESCE(@OtherErrorHappened,0)=0
BEGIN
INSERT INTO Duplicates (Duplicate_ID,country,BarrelsPerDay,continent,region)
SELECT OilProducer_ID,country,BarrelsPerDay,continent,region
FROM OilProducers WHERE oilProducer_ID IN (
SELECT MAX(OilProducer_ID)
FROM OilProducers
GROUP BY country
HAVING COUNT(*)>1)
SELECT @Errorhappened=@@Error, @found=@@Rowcount
DELETE FROM OilProducers WHERE oilProducer_ID IN (
SELECT MAX(OilProducer_ID)
FROM OilProducers
GROUP BY country
HAVING COUNT( *)>1)--the HAVING clause has to be used as
--we have to select rows after the
--aggregation
SELECT @OtherErrorHappened=@@Error
END
IF @errorHappened<>0 or @OtherErrorHappened <>0 ROLLBACK TRANSACTION
ELSE COMMIT TRANSACTION
--end of routine
---------------------------------------------------------------
--Now look in the duplicate table
SELECT * FROM Duplicates
--and check the main table for duplicates with the SQL I've already given
--Simple Reporting
------------------
-- what about getting a summary of oil production per continent?
SELECT [Barrels per day ]=SUM(BarrelsPerDay), continent
FROM OilProducers
GROUP BY continent --each row should be a continent
ORDER BY SUM(BarrelsPerDay) DESC --in descending order
-- we are grouping by continent and get back seven records, each of which
-- represents a collection of oil producers in the same continent
-- ..or by region ordered by production?
SELECT [production]=SUM(BarrelsPerDay),
[average]=AVG(BarrelsPerDay),
Region,
[ Producers in region]=COUNT(*)
FROM OilProducers
GROUP BY region --each row is a collection of all the records for a region
ORDER BY SUM(BarrelsPerDay) DESC --order by production
--Grand totals are easy
SELECT
[continent]=COALESCE(continent,'Total (world production)'),
[Barrels per day]=SUM(BarrelsPerDay)
FROM OilProducers
GROUP BY continent WITH rollup--each row should be a continent
ORDER BY grouping(continent) ASC,SUM(BarrelsPerDay) DESC
-- we used the grouping() function to order the total at the end
-- we used WITH ROLLUP to compute the total. It can be very useful
-- for running sums and running averages.
-- ..but someone is bound to ask for just the top five
SELECT TOP 5 [production]=SUM(BarrelsPerDay),
[average]=AVG(BarrelsPerDay), Region, [Producers in region]=COUNT(*)
FROM OilProducers
GROUP BY region --each row is a collection of all the records for a region
ORDER BY SUM(BarrelsPerDay) DESC --order by production
-- ..or maybe the bottom five!
SELECT * FROM
(SELECT TOP 5 [production]=SUM(BarrelsPerDay), [average]=AVG(BarrelsPerDay),
Region, [Producers in region]=COUNT(*)
FROM OilProducers
GROUP BY region--each row is a collection of all the records for a region
ORDER BY SUM(BarrelsPerDay) ASC)f --order by production
ORDER BY production DESC
-- or, more tiresomely, for the top five, the others as an 'Others' row
-- AND the sum total!
SELECT
[region]=CASE WHEN g.n=1 THEN t ELSE Region END,
[production]=CASE WHEN t='other'
THEN production -
(SELECT SUM(PRODUCTION)
FROM
(SELECT TOP 5 [production]=SUM(BarrelsPerDay)
FROM OilProducers
GROUP BY region
ORDER BY SUM(BarrelsPerDay) DESC)F)
ELSE production END
FROM (SELECT [n]=0,[t]='',[TheOrder]=1 --trick to duplicate the null row the
UNION ALL --'to get 'others' row
SELECT 1,'Other',2
UNION ALL SELECT 1,'Total',3)g
INNER JOIN
(SELECT TOP 6 [production]=SUM(BarrelsPerDay),--the basic data
[n]=grouping(region),
Region
FROM OilProducers
GROUP BY region WITH rollup--
ORDER BY SUM(BarrelsPerDay) DESC
)f
ON g.n=f.n
ORDER BY g.TheOrder ASC, production DESC
--and then there will be the guy who wants a list of just those
--regions who produce more than the average
SELECT Region, [production]=SUM(BarrelsPerDay), [Producers in region]=COUNT(*)
FROM OilProducers
GROUP BY region-- each row is a collection of all the records for a region
HAVING SUM(BarrelsPerDay)>
(SELECT AVG(average) FROM
(SELECT [average]=AVG(BarrelsPerDay) FROM OilProducers GROUP BY region)f)
ORDER BY SUM(BarrelsPerDay) DESC --order by production
--or possibly the five countries closest to the average
SELECT TOP 5 country, BarrelsPerDay, theAverage
FROM OilProducers
CROSS JOIN
(SELECT [theAverage]=AVG(BarrelsPerDay) FROM OilProducers)f
ORDER BY ABS(barrelsPerDay-theaverage) ASC
--or conceivably, region closest to the average
SELECT TOP 5 * FROM
(SELECT region, [sumBPD]=SUM(BarrelsPerDay)--, theAverage
FROM OilProducers GROUP BY region
)h
CROSS JOIN
(SELECT [theAverage]=AVG(production) FROM
(SELECT [production]=SUM(BarrelsPerDay)
FROM OilProducers GROUP BY region
)g
)f
ORDER BY ABS(sumBPD-theaverage) ASC
-- Grouping and Pivot Tables
----------------------------
-- So let's get a bit harder and do a pivot table relating regions and
-- continents with column and row totals.
SELECT [region/continent ]=CONVERT(CHAR(25),COALESCE(region,'Total')),
'Asia'=SUM(CASE WHEN continent = 'Asia' THEN BarrelsPerDay ELSE 0 END),
'North A.'=SUM(CASE WHEN continent ='North America'
THEN BarrelsPerDay ELSE 0 END) ,
'Latin A.'=SUM(CASE WHEN continent ='Latin America and the Caribbean'
THEN BarrelsPerDay ELSE 0 END),
'Africa'=SUM(CASE WHEN continent ='Africa' THEN BarrelsPerDay ELSE 0 END),
'Russia'=SUM(CASE WHEN continent ='Russia' THEN BarrelsPerDay ELSE 0 END),
'Europe'=SUM(CASE WHEN continent ='Europe' THEN BarrelsPerDay ELSE 0 END),
'Oceania'=SUM(CASE WHEN continent ='Oceania' THEN BarrelsPerDay ELSE 0 END),
'sum'=SUM(BarrelsPerDay)
FROM OilProducers
GROUP BY region
WITH rollup
ORDER BY grouping(region),SUM(BarrelsPerDay) DESC
--the 'order by grouping' trick ensures that the total comes in the right place
--on the last row!
--Enumerating the Group members
-------------------------------
--This is all OK but a lot of people want, and expect, to have a list of all
--the constituents of their grouping in the result. They don't like DBAs
--laughing and saying it isn't possible. There are now CLR Aggregation routines
--that do it, but here is a method that works on SQL 2000. Not a cursor in
--sight! (it works a lot better in 2005 with VaRCHAR(MAX))
DECLARE @list VARCHAR(8000)
SELECT @List=COALESCE(@list+',','') +'|'+region +'|'+ country+'|'+region +'|'
FROM OilProducers ORDER BY region
SELECT [region]=CONVERT(CHAR(26),region),
[Oil-producing countries]=LEFT(members,CHARINDEX(',|',members+',|')-1) FROM
(SELECT region, 'members'=
REPLACE(
SUBSTRING(list,CHARINDEX('|'+regions.region+'|',list),8000),
'|'+region+'|',
'')
FROM (SELECT 'list'=@List)f
CROSS JOIN (
SELECT region FROM oilproducers GROUP BY region
)regions
)g
/*
Results (Truncated to the right)
region Oil-producing countries
-------------------------- ----------------------------------------------------...
Australia and New Zealand Australia,New Zealand
Caribbean Virgin Islands,Cuba,Trinidad and Tobago,Aruba,Barbad...
Central America Belize,Nicaragua,Guatemala,Mexico
Eastern Africa Zambia,Madagascar
Eastern Asia Korea, North,Mongolia,Taiwan,Korea, South,China,Japan
Eastern Europe Czech Republic,Poland,Belarus,Ukraine,Hungary,Romani...
Melanesia Papua New Guinea
Middle Africa Congo,Chad,Equatorial Guinea,Gabon,Angola,Cameroon
North America Canada, United States
Northern Africa Algeria,Egypt,Libya,Tunisia,Sudan,Morocco
Northern Europe Finland,Sweden,Lithuania,Estonia,United Kingdom,Norw...
Russia Russia
South America Ecuador,Argentina,Brazil,Colombia,Venezuela,Peru,Sur...
South-central Asia Kyrgyzstan,Bangladesh,Pakistan,Tajikistan,Turkmenist...
South-eastern Asia Malaysia,Indonesia,Brunei,Vietnam,Thailand,Singapore...
Southern Africa South Africa
Southern Europe Italy,Croatia,Spain,Greece,Albania,Slovenia
Western Africa Sierra Leone,Ghana,Nigeria,Mauritania,Cote d'Ivoire
Western Asia Turkey,Bahrain,Yemen,Syria,United Arab Emirates,Saud...
Western Europe Switzerland,Belgium,Germany,Netherlands,France,Austr...
*/
-- now this technique could get quite interesting if combined with 'ranging' where
-- you can impose categories onto the data of your choice (Date ranges are very
-- common)
-- Ranging
----------
--By ranging, I mean imposinbg aribrary value ranges, and grouping by them
--you can, of course, use a helper table to do this much more elegantly
-- Ranging can be by column
SELECT
[range]='No. of Countries',
[< 10,000 bpd]=SUM(CASE WHEN Barrelsperday<10000 THEN 1 ELSE 0 END),
[10,000 to 100,000]=SUM(CASE WHEN Barrelsperday BETWEEN 10000 AND 100000
THEN 1 ELSE 0
END),
[100,000 to 1m bpd]=SUM(CASE WHEN Barrelsperday BETWEEN 100001 AND 1000000
THEN 1 ELSE 0
END),
[1m to 10m bpd]=SUM(CASE WHEN Barrelsperday BETWEEN 1000001 AND 10000000
THEN 1 ELSE 0
END),
[more than 10m bpd]=SUM(CASE WHEN Barrelsperday > 10000000 THEN 1 ELSE 0 END)
FROM OilProducers
--or by row...
SELECT [category]=CASE WHEN Barrelsperday<10000 THEN '1/ > 10,000 bpd'
WHEN Barrelsperday BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000 '
WHEN Barrelsperday BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'
WHEN Barrelsperday BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'
WHEN Barrelsperday > 10000000 THEN '5/ more than 10m bpd' END,
[total]=SUM(BarrelsperDay),
[No. of Countries]=COUNT(*)
FROM OilProducers
GROUP BY CASE WHEN Barrelsperday<10000 THEN '1/ > 10,000 bpd'
WHEN Barrelsperday BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000'
WHEN Barrelsperday BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'
WHEN Barrelsperday BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'
WHEN Barrelsperday > 10000000 THEN '5/ more than 10m bpd' END
--The code above is much more efficient than it looks, but why not make a
--User-defined function to do it?
CREATE FUNCTION OilOutputCategory
(
@OilOutput INT
)
RETURNS VARCHAR(30)
AS
BEGIN
RETURN CASE WHEN @OilOutput<10000 THEN '1/ > 10,000 bpd'
WHEN @OilOutput BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000'
WHEN @OilOutput BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'
WHEN @OilOutput BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'
WHEN @OilOutput > 10000000 THEN '5/ more than 10m bpd' END
END
GO
--so the query becomes...
SELECT [category]=dbo.OilOutputCategory(BarrelsPerDay),
[total]=SUM(BarrelsperDay),
[producers]=COUNT(*)
FROM OilProducers
GROUP BY dbo.OilOutputCategory(BarrelsPerDay)
--so let's combine ranging and enumeration!
--We can list all the countries that fall in each range category....
DECLARE @list VARCHAR( 8000)
SELECT @List=COALESCE(@list+',','')+catfudge
FROM (--the concatenation of the string didn't work until we did this!
SELECT TOP 1000 [catfudge]=
('|'+dbo.OilOutputCategory(BarrelsPerDay) +'|'+ country+'|'
+dbo.OilOutputCategory(BarrelsPerDay) +'|'),
[cat]=dbo.OilOutputCategory(BarrelsPerDay)
FROM OilProducers ORDER BY cat)f
SELECT [category]=CONVERT(CHAR(26),category),
[Oil-producing countries]=LEFT(members,CHARINDEX(',|',members+',|')-1) FROM
(SELECT category, 'members'=
REPLACE(
SUBSTRING(list,CHARINDEX('|'+ categories.category+'|',list),8000),
'|'+categories.category+'|',
'')
FROM (SELECT 'list'=@List)f
CROSS JOIN (
SELECT [category]=dbo.OilOutputCategory(BarrelsPerDay) FROM oilproducers
GROUP BY dbo.OilOutputCategory(BarrelsPerDay)
)categories
)g
/*
Results (Truncated to the right)
category Oil-producing countries
-------------------------- -------------------------------------------------...
1/ > 10,000 bpd Singapore,Burma,Finland,Taiwan,Ghana,Korea, South,...
2/ 10,000 to 100,000 Tunisia,Netherlands,Ukraine,Mauritania,France,Cuba...
3/ 100,000 to 1m bpd Syria,Sudan,Oman,Qatar,Romania,Italy,Malaysia,Chad...
4/ 1m to 10m bpd Venezuela,United Arab Emirates,United Kingdom,Indo...
5/ more than 10m bpd Saudi Arabia
*/
-- Grouping as a utility
------------------------
-- One can use grouping in a variety of ways that have nothing to do with
-- reporting . Here is an example of using GROUP BY to help produce a table by
-- splitting a delimited list. It needs the Number Helper Table (You'll have
-- to refer to the 'Helper Table Workbench to find out about that).
CREATE FUNCTION [dbo].[uftSplitStringGroup]
(
@DelimitedList VARCHAR (8000),
@Delimiter VARCHAR(20)=','
)
RETURNS
@Results TABLE
(
SeqNo INT IDENTITY(1, 1),
Item VARCHAR(8000)
)
AS
BEGIN
DECLARE @String VARCHAR(8000)
DECLARE @LenDelimiter INT
SELECT @String=@Delimiter+@DelimitedList+@Delimiter,
@LenDelimiter=LEN( @Delimiter)
INSERT INTO @results (item)
SELECT SUBSTRING(@string,
s1.number+@LenDelimiter,
MIN(s2.number)-s1.number-@LenDelimiter)
FROM (SELECT [string]= @String )f
CROSS JOIN numbers s1
INNER JOIN numbers s2 ON s1.number< s2.number
WHERE s1.number<=LEN (@string)
AND s2.number<=LEN (@string)
AND SUBSTRING(@string,s1.number,@LenDelimiter)=@Delimiter
AND SUBSTRING(@string,s2.number,@LenDelimiter)=@Delimiter
GROUP BY s1.number
RETURN
END
/*
select * from uftSplitStringGroup('first,second,third',default)
select * from uftSplitStringGroup('<>this is something<>something else<>','<>')
select * from uftSplitStringGroup(
'Turkey,Bahrain,Yemen,Syria,United Arab Emirates,Saudi Arabia,Qatar,Kuwait',',')
Questions
------------
Send your answers to editor@simple-talk.com. The first three correct entries
will receive a much-coveted Simple-Talk goodie bag (polo shirt, USB key, bottle
opener, beermat, pen).
1/ How would you get the count of the number of rows in a table with NULLS in
a particular column, using GROUP BY, but without using a COALESCE or CASE
statement
2/ What is the GROUPING() function useful for?
3/ Can a WHERE clause contain an aggregate function?
4/ When would you need to use a HAVING clause?
5/ What does the ROLLUP do? How would you use it?
6/ Can you use UDFs (user-defined scalar-valued functions) in GROUP BY clauses? */
See also other Workbenches at Simple-Talk
Robyn Page's SQL Server DATE/TIME Workbench
Robyn Page
Date calculation and formatting in SQL Server can be surprisingly tricky. Robyn Page's "hands-on" workbench will lead you through the minefield.
Robyn Page's SQL Server String Manipulation Workbench
Robyn Page
String searching and manipulation in SQL Server can be error-prone and tedious...unless you're armed with the techniques described in Robyn's string manipulation workbench...
SQL Server Error Handling Workbench
Grant Fritchey
Grant Fritchey steps into the workbench arena, with an example-fuelled examination of catching and gracefully handling errors in SQL 2000 and 2005, including worked examples of the new TRY..CATCH capabilities.
Robyn Page's SQL Server Cursor Workbench
Robyn Page
The topic of cursors is the ultimate "hot potato" in the world of SQL Server. Everyone has a view on when they should and mainly should not be used. By example and testing Robyn Page proves that, when handled with care, cursors are not necessarily a "bad thing".
Robyn Page's SQL Server Data Validation Workbench
Robyn Page
Robyn Page provides essential techniques for ensuring the validity of the data being entered into your SQL Server tables.
Robyn Page's Excel Workbench
Robyn Page and Phil Factor
The need to produce Excel reports from SQL Server is very common. Here, Robyn Page and Phil Factor present practical techniques for creating and manipulating Excel spreadsheets from SQL Server, using linked servers and T-SQL. The pièce de résistance is a stored procedure that uses OLE Automation...
Robyn Page's SQL Server Security Workbench
Robyn Page and Phil Factor
Robyn Page and Phil Factor present practical T-SQL techniques for controlling access to sensitive information within the database, and preventing malicious SQL injection attacks.
SQL Server Helper Table Workbench
Robyn Page and Phil Factor
Cursors and iterations are both renowned for slowing down Transact SQL code, but sometimes seem unavoidable. In this workbench, Robyn Page and Phil Factor demonstrate some set-based techniques for string manipulation and time interval-based reporting, which use helper tables instead.
This article has been viewed 59433 times.