Phil Factor's Phrenetic Phoughts

Simple-Talk columnist
The wilder shores of Transact SQL    Phil on Twitter   Phil on SQL Server Central"

The ballad of the tuple relation

Published Thursday, May 10, 2007 4:13 PM

I was peacefully pottering away at my workstation, contentedly spawning a daughter thread or something similar, when Robyn popped in. She'd been wrestling with the Grouping Workbench and trying to come up with a copper-bottomed explanation for what the GROUP BY statement does. I'd been trying to help out in my usual arrogant paternalistic way. This time she innocently asked me to interpret the following passage, which is an explanation of Grouping in O'Reilly's Relational Database Dictionary. ('helping to ensure the success of your database projects' gushes the back cover)

Grouping 

Let r be a relation and let the heading of r be partitioned into subsets {X} and {Y}. Let the attributes of {Y} be Yl, Y2, ..., Yn; also, let {X} not contain any attribute called YR. Then the grouping r GROUP ({Y} AS YR) is another relation s. The heading of s consists of {X} extended with an attribute YR of type RELATION {Y}. The body of s is defined as follows: first, let z be the result of r WRAP ({Y} AS YT). Second, for each distinct X value x in z, (a) let yr be the relation whose tuples are all and only those YT values from tuples in z in which the X value is x; (b) let t be a tuple with X value x and YR value yr (and no other attributes); then, and only then, t is a tuple of s. Note: Given a relation r and some grouping of r, there's always an inverse ungrouping that will yield r again; however, the converse is not necessarily true.
Example: The following expression denotes a grouping of the relation that's the current value of relvar SP:

SP GROUP ( { P#, QTY } AS PQ_REL )

That grouping is a relation spq of type RELATION {S# S#, PQ_REL RELATION {P# P#, QTY QTY}}. Relation spq contains one tuple for each distinct S# value currently appearing in. relvar SP, and no other tuples.

You might be thinking 'Phew, hot stuff', but I couldn't make much sense of it. In fact, were I ever to be abducted by extraterrestrial aliens, I will expect them to address me in similar unfathomable clicks and whirrs before conducting their foul experiments upon my person.

It was whilst I was fruitlessly re-reading the definition for the fifth time trying to gather some meaning from it, I began to wonder whether, in fact, it was a poem. I give the first two resonant stanzas

Let r be a relation and I will tell you why

let the heading of r be partitioned into subsets {X} and {Y}.

Let the attributes of {Y} be Yl, Y2, ..., Yn; also,

let {X} not contain any attribute called YR though.

The heading, now,  of s consists of {X} extended by

an attribute YR of type RELATION {Y}.

 

 The body of s is defined as follows: See!
first, let z be the result of r WRAP ({Y} AS YT).
Second, for each distinct X value x in z,
(a) let yr be the relation whose tuples be
 all and only those YT values, as anyone expects
from tuples in z in which the X value is x;

...and so on. Quite the sort of output one might expect of a Post-modernist Poet.

But then, after having extracted as much amusement out of the paragraphs as I could, I felt a certain twinge of sadness. Set theory is so simple that it can be taught to Primary School children. SQL is merely a formalised dialect of English that describes action. I firmly believe that a relational database can be made so simple that any educated person can use it. Codd formalised the theory for the automation of existing processes that existed long before the first computers. It is the language of information retrieval.

 

Robyn is a firm believer in the idea that there is nothing in Relational Database Theory or practice that cannot be explained to an interested child. If so, then I shall have to watch out for the next generation of Database consultants.

 

I have  bullied the editor of Simple-Talk to offer a Simple-Talk Goodie-bag, and a signed printed copy of my book, to the best interpretation of the passage I’ve quoted that is comprehensible to him. If anyone can turn the entire passage into a Bob-Dylan protest song, then a Phil Factor Excellence Award will be made.

 

 

 

 

 

 

Comments

 

Robyn Page said:

Hang on Phil,
Here are another two definitions from the same book

Tuple
A Tuple Value. Every subset of a tuple (i.e. every subtuple of the tuple in question) is itself a tuple.

SubTuple
A subset of a tuple, hence a tuple.

Everyone clear on that then?

...and who'se this 'Bob Dylan' then?



May 11, 2007 11:11 AM
 

Adam Machanic said:

That's relational closure...

May 11, 2007 11:13 AM
 

Scott Frigard said:

Ahhh, I see that C.J. Date is the author. Maybe we should ask Fabian Pascal for clarification.
May 11, 2007 12:25 PM
 

Adam Machanic said:

No need to clarify, as these guys seem to often follow the central rule of "researchers" everywhere: make everything sound as 'academic' as possible so everyone thinks you're real smart.

Date's more recent book, "Database in Depth", is actually a really good read--not like this at all.  I highly recommend it.

In the meantime, can someone tell me what a "relation spq" is?
May 11, 2007 4:34 PM
 

Brad24 said:

I wonder if this is 'the piece of Codd which passeth all understanding'?
May 12, 2007 5:29 AM
 

Adam Machanic said:

Check out this one...

http://punitive-surgery.lcs.mit.edu/scicache/467/scimakelatex.67204.CJ+Date.html

... and when you're done reading, check out this link:

http://pdos.csail.mit.edu/scigen/

:)
May 13, 2007 12:57 AM
 

Adam Machanic said:

Arrgh, the top link expired!  Oh well, check out the bottom one anyway!

May 13, 2007 9:41 AM
 

Graeme Williams said:

There are two differences between Date's definition and standard SQL.  First, he specifies the ungrouped columns in the "GROUP" expression, not the columns to be grouped that you would list in a standard GROUP BY clause.  Second, Date allows columns to be table-valued; the result of his GROUP expression has one column which is table-valued.  In standard SQL, all the rows for a given value of the GROUP BY columns are collapsed into a single row of the result using aggregrate functions.  Date keeps each set of rows in a table value.

Also, Date uses other operators he has defined, like WRAP, which might make the definition simpler to write but makes it harder to read.  Below, I've applied WRAP rather than using it explicitly.

What I've done below is to give a definition which constructs the same result as Date, using something more like standard SQL.  You could, I guess, translate directly from one to the other.

Anyway, here goes ...

You can group a table by any subset of its columns.  Consider a table T with columns C(1) to C(N).  Then

   T GROUP (C(k+1) ... C(N) AS GC)

is a table with (k+1) columns whose first k columns are:

    SELECT DISTINCT C(1) ... C(k) FROM T

with an additional table-valued column GC.  If in a given row C(1) = V(1) and so on, the value of GC is given by:

   SELECT C(k+1) ... C(N) FROM T WHERE C(1) = V(1) AND ... AND C(k) = V(k).

Note:  GC represents the rows of T before you apply any aggregrate function (like SUM or COUNT).  In standard SQL, you would say:

  SELECT C(1), ... C(k), ... FROM T GROUP BY C(1), ... C(k)

where the second ellipsis represents aggregate functions applied to some or all of the columns C(k+1) ... C(N).
May 15, 2007 7:17 AM
 

kaburu said:

Dunno about Bob Zimmerman aka Dylan, but Shakespeare might be helpful on the subject of tuples:
 
"... a black ram is tupping your white ewe..." (Othello).

Though where that leaves subtuples, I really don't know. If a subtuple is itself a tuple then it must.... no, don't go there.... better yet just pack up and go for a tipple.
May 15, 2007 11:11 PM
 

Timothy Walters said:

That's the most complex explanation of GROUP BY that I have ever seen.

Here's how I explain it to people:

GROUP BY allows you to specify the unique fields that combine to make a "key". All data that matches this "key" will be included in any aggregate functions in the query.

Simplest Example:
SELECT State, SUM( TotalSales )
FROM SalesStaff
GROUP BY State

This will allow you to see the sum of [TotalSales] for each unique [State] entry in the table.

Average Example:
SELECT State, Division, SUM( TotalSales ), MAX( TotalSales )
FROM SalesStaff
GROUP BY State, Division

This will allow you to see [TotalSales] and the highests [TotalSales] value for any each combination of State & Division. It won't list a State & Division combo if there's no records that have that exact combo.
May 16, 2007 12:54 AM
 

kgbj5596 said:

I might be showing my ignorance but can anyone explain the point of have a group by 'clause'.  You are forced to copy and paste  a subset of from your select clause into your group by clause.  From then on have to manually keep the two in synch.  Why can't the compiler just assume that you want to group by non-aggregation clauses seeing as you don't seem to have any other choices anyway - ie your group by clause is always the non-aggregation lines of your select clause...
May 16, 2007 4:31 AM
 

Mark Tremel said:

I think that kgbj5596 hit the nail on the head.  A "group by" is the aggregation of the fields you select but do not perform summary calculations upon.
May 16, 2007 7:10 AM
 

GSquared said:

On the original question, I would need some terms defined to make sense of it.  Hopefully, these are either standard terms or are defined elsewhere in the book in question.  First, what is a "heading" in reference to a relation?  How does it differ from the "body" of a relation?  Once I have that, I could go on from there to decode the paragraph.  (Maybe I'm ignorant on this one, but I've never heard/used those terms when describing sets or relations.)

On the subject of "why have a group by clause at all", it's probably just because an implicit "group by" hasn't been requested often enough from the SQL Server developers at MS.  Would probably be easy enough to take the code that catches that error and make it add a group function to the optimizer/compiler itself instead of popping up an error.  Of course, it could be done at the IDE level of the database developer tool of choice, but it would be better to make it native.  So, let's all ask for it.
May 16, 2007 8:51 AM
 

Robyn Page said:

The Relational Database Dictionary (C.J. Date Aug 2006, O'Reilly) has the answers yet again to the questions of GSquared!


body

A set of tuples that are all of the same type; more specifically, the set of tuples appearing in a given relation, or in a given rel-var at a given time. Every subset of a body is itself a body.

Examples: The set of tuples appearing in relvar S at any given time; any subset of that set.


heading

A set of attributes, in which (by definition) each attribute is of the form <A,T>, where A is an attribute name and T is the type name for attribute A. Within any given heading, (a) distinct attributes are allowed to have the same type name but not the same attribute name; (b) the number of attributes is the degree (of the heading in question). Every subset of a heading is itself a heading. Note: Given that it's common to refer informally to an attribute by its attribute name alone, it's also common to regard a given heading informally as a set of attribute names alone.

Examples: The heading of relvar S is

{ <S#,S#>, <SNAME,MAME>, STATUS, INTEGER), <CITY,CHAR> }

The following (corresponding to a certain projection of relvar S) is also a heading:

{ <CITY,CHAR>,  <SNAME,NAME> }

These two headings might be represented less formally as:

{ S#, SMAME, STATUS, CITY }

{ CITY,  SNAME }
May 16, 2007 10:20 AM
 

The SQL Server Thought Police said:

I'm puzzling over the sentiments that SQL Server do not provide and 'implicit' GROUP BY clause. Surely if you do something like...
       select count(*) from MyTable
aren't you implying an aggregation of every single row? If you did any more than this, how would the query engine know what type of aggregation you wanted on a particular column when there are so many to choose from. Is it MAX, MIN AVG etc. If you mistakenly requested a column should that be treated as adding that column into the GROUP BY clause. What would happen if you'd actually done a Typo?
May 16, 2007 10:36 AM
 

d1abl0 said:

This is a very simple answer to "why do students fail SQL 101 in school?"
May 16, 2007 10:53 AM
 

Matt said:

I'm going for the Phil Factor Excellence Award by turning that incomprehensible mess of gobbledegook into another mess of incomprehensible gobbledegook (aka: A Dylan-ish protest song).

*twang* (guitar)
/mumble/ /mumble/ yeah
*twang*
Grooooup /mumble/ /mumble/ /mumble/ truck /mumble/ moon /mumble/ oh yeeeeah
*twang*
/mumble/ /mumble/ /groan/ select (harmonica solo) /mumble/ Groooooup
*twang*
Yeeeeah....

[repeat as necessary]

Now.... How do I claim my prize? (hey - it makes about as much sense as what Phile provided! :)
May 16, 2007 9:07 PM
 

The SQL Server Thought Police said:

Come gather 'round DBAs, perhaps you should
now admit that poor CJ Day's misunderstood
And accept it that soon You'll be reading his tome.
If your time to you Is worth savin'
Then you better start swimmin'
Or you'll sink like a stone
For the times they are a-changin'.

The writer and critic Phil Factor ignore.
And keep your eyes wide The chance won't come no more
And don't speak too soonFor the wheel's still in spin
And there's no tellin' who That it's namin'.
For the loser now
Will be later to win
For the times they are a-changin'.
May 17, 2007 2:54 AM
 

Ian said:

Interesting, I had to reread it a couple of times, and I think I was actually helped by my ignorance in understanding it. Sounds daft, perhaps, but I teach a bit of Java now and again to Uni students, and my preconceptions interfere with what they are being asked to achieve. My SQL is fine, but I'm no DBA

It didn't strike me as useful though. Group By means when Jon has a bag of fluffy animals I can ask him useful questions like: How many animals do you have? How many mammals do you have? Why did you knock over a pet store? How much food should I buy based on what each eats? Will you please let those animals out of that bag? How many kilos of oats will I need for that pony! What's the average weight of the voles and how many are below it by more than 10 percent?
Before group by I had one and one and one and one voles. Now I have four. They ate three and two and two and one grams of food, now I know I need to get eight grams of food to feed them all.

I had to head into "Having" country there, but I reckon any schoolchild can follow that.

May 17, 2007 9:12 AM
 

Rel_vs_SQL said:

As Graeme pointed out above, the definition given by Date describes a relational grouping operator, not the SQL "GROUP BY" operator. I think this is where much of your confusion stems from.
May 17, 2007 12:46 PM
 

Phil Factor said:

Re: Rel_vs_SQL

The Previous entry says:

GROUP        see grouping
May 17, 2007 1:28 PM
 

Greg Gaughan said:

First of all, this grouping is not the same as SQL's GROUP BY. SQL's GROUP BY specifies all the columns for which a distinct row is required in the result set. Any any other columns that are selected must be aggregated, e.g. via MAX.

Take the SP table:

SELECT * FROM SP
|SNO  |PNO   |QTY  
===================
|S1   |P1    |  300
|S1   |P2    |  200
|S1   |P3    |  400
|S1   |P4    |  200
|S1   |P5    |  100
|S1   |P6    |  100
|S2   |P1    |  300
|S2   |P2    |  400
|S3   |P2    |  200
|S4   |P2    |  200
|S4   |P4    |  300
|S4   |P5    |  400

SQL's GROUP BY on the SNO column would give one row per distinct SNO:

SELECT SNO, MAX(QTY) FROM SP GROUP BY SNO
|SNO  |2    
============
|S1   |  400
|S2   |  400
|S3   |  200
|S4   |  400


The relational GROUP, described very concisely and precisely by Date, adds an attribute to the resulting relation containing an embedded relation comprising the specified columns. The other columns (the ones not specified) indicate that a distinct row for them is required in the result relation.

Take the SP relation:

SP
+----+----+-----+
| S# | P# | QTY |
+====+====+-----+
| S1 | P1 | 300 |
| S1 | P2 | 200 |
| S1 | P3 | 400 |
| S1 | P4 | 200 |
| S1 | P5 | 100 |
| S1 | P6 | 100 |
| S2 | P1 | 300 |
| S2 | P2 | 400 |
| S3 | P2 | 200 |
| S4 | P2 | 200 |
| S4 | P4 | 300 |
| S4 | P5 | 400 |
+----+----+-----+

The result for Date's example should make its behaviour clear (in the syntax of Dee, http://www.quicksort.co.uk):

GROUP(SP, ['P#', 'QTY'], 'PQ_REL')
+----+--------------+
| S# | PQ_REL       |
+====+==============+
| S1 | +----+-----+ |
|    | | P# | QTY | |
|    | +====+=====+ |
|    | | P1 | 300 | |
|    | | P2 | 200 | |
|    | | P3 | 400 | |
|    | | P4 | 200 | |
|    | | P5 | 100 | |
|    | | P6 | 100 | |
|    | +----+-----+ |
| S2 | +----+-----+ |
|    | | P# | QTY | |
|    | +====+=====+ |
|    | | P1 | 300 | |
|    | | P2 | 400 | |
|    | +----+-----+ |
| S3 | +----+-----+ |
|    | | P# | QTY | |
|    | +====+=====+ |
|    | | P2 | 200 | |
|    | +----+-----+ |
| S4 | +----+-----+ |
|    | | P# | QTY | |
|    | +====+=====+ |
|    | | P2 | 200 | |
|    | | P4 | 300 | |
|    | | P5 | 400 | |
|    | +----+-----+ |
+----+--------------+

This is the relational grouping that Date is talking about. It is more flexible than the quirky SQL GROUP BY (and there's no need for the crazy HAVING clause etc.)

We can 'extend' the grouped result to add an aggregated value from the embedded relation and then finally remove the embedded relation to simulate the SQL behaviour:

GROUP(SP, ['P#', 'QTY'], 'PQ_REL').extend('G', lambda t:{'G':MAX(t.PQ_REL, lambda u:u.QTY)}).remove(['PQ_REL'])
+----+-----+
| S# | G   |
+====+=====+
| S1 | 400 |
| S2 | 400 |
| S3 | 200 |
| S4 | 400 |
+----+-----+

Also, GROUP results can be passed to other relational operators and so we can 'ungroup' them to get back to the original relation:

UNGROUP(GROUP(SP, ['P#', 'QTY'], 'PQ_REL'), 'PQ_REL')
+----+----+-----+
| S# | P# | QTY |
+====+====+-----+
| S1 | P1 | 300 |
| S1 | P2 | 200 |
| S1 | P3 | 400 |
| S1 | P4 | 200 |
| S1 | P5 | 100 |
| S1 | P6 | 100 |
| S2 | P1 | 300 |
| S2 | P2 | 400 |
| S3 | P2 | 200 |
| S4 | P2 | 200 |
| S4 | P4 | 300 |
| S4 | P5 | 400 |
+----+----+-----+
May 17, 2007 5:29 PM
 

Rel_vs_SQL said:

Phil Factor Said:

> Re: Rel_vs_SQL

> The Previous entry says:

> GROUP        see grouping

My comment stands. Date's book is on Relational, not SQL. Different beasts. See Greg's excellent post if you still find yourself challenged by this.
May 17, 2007 6:12 PM
 

Phil Factor said:

I am in awe of Greg's wonderful explanation. It is brilliant. Perhaps my confusion is caused by the blurb on the back cover which says 'Whether you are using Oracle. DB2  SQL Server, MySQL, or PostgreSQL, the Relational Database Dictionary will prevent confusion about the precise meaning of Database-related terms.'
May 18, 2007 2:22 AM
 

thisisfutile said:

LMAO...Matt
May 30, 2007 11:37 AM
You need to sign in to comment on this blog

















<May 2007>
SuMoTuWeThFrSa
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789
A SysAdmin's Guide to Change Management
 In the first in a series of monthly articles, ‘Confessions of a Sys Admin’, Matt describes the issues... Read more...

Exchange: Recovery Storage Groups
 It can happen at any time: You get a request, as Admin, from your company, to provide the contents of... Read more...

Build Your Own Virtualized Test Lab
 Desmon explains the fundamentals of building a test lab for Windows servers and Enterprise applications... Read more...

Rendering Hierarchical Data with the Treeview
 It sometimes happens that Web Server controls that visualize data don't quite fit with the way that... Read more...

SQL Server 2008: Performance Data Collector
 With Performance Data Collector in SQL Server 2008, you can now store performance data from a number of... Read more...