MySQL - Common Queries
Contents

Aggregates: Basic aggregation; Aggregate across columns; Aggregates across multiple joins; Aggregates excluding leaders; Aggregates of specified size; All X for which all Y are Z; Avoiding repeat aggregation; Cascading aggregates; Cross-aggregates; Group data by datetime periods; League table; Monthly expenses; Nested aggregation; Pairwise matchmaking; Show only one child row per parent row; Skip repeating values; Track state changes; Values linked with all values of another column; Within-group aggregates; Within-group aggregates with a wrinkle; Within-group quotas (Top N per group)
Aggregates and Statistics: Aggregates from bands of values; Average the top 50% values per group; Correlation; Count unique values of one column; Median; Mode; Rank order
Ordering resultsets: Emulate Row_Number(); Next row; Order by leading digits; Order by month name; Order by numerics then alphas; Suppress repeating ordering values
Pagination: Pagination
Relational division: All possible recipes with given ingredients; All X for which all Y are Z (relational division); Who makes all the parts for a given assembly?
Sequences: Find adjacent unbooked theatre seats; Find blocks of unused numbers; Find missing numbers in a sequence; Find previous and next values in a sequence; Find row with next value of specified column; Find sequence starts and ends; Find specific sequences; Gaps in a time series; Make values of a column sequential; Track stepwise project completion; Winning Streaks
Frequencies: Display column values which occur N times; Display every Nth row
Data comparison: Backslashes in data; Compare data in two tables; Show rows where column value changed; Using bit operators
Spherical geometry: Great circle distance
Database metadata: Add auto-incrementing primary key to a table; Auto-increment: reset next value; Change or drop a foreign key; Compare structures of two tables; Compare two databases; Database size; Find child tables; Find parent tables; Find primary key of a table; Find the size of all databases on the server; List databases, tables, columns; List differences between two databases; List users of a database; Rename Database; Replace text in all string columns of a database; Show Create Trigger; Show Table Status equivalent from information_schema; Show Tables
JOIN: Approximate joins; Cascading JOINs; Classroom scheduling; Data-driven joins; Full Outer Join; Intersection and difference; Many-to-many joins; What else did buyers of X buy?; Join or subquery?; Parents without children; Parties who have contracts with one another; The unbearable slowness of IN(); The [Not] Exists query pattern; What exams did a student not register for?
Stored procedures: A cursor if necessary, but not necessarily a cursor; Emulate sp_exec; Variable-length argument for query IN() clause
Strings: Count delimited substrings; Count substrings; Levenshtein distance; Proper case; Retrieve octets from IP addresses; Return digits or alphas from a string; Strip HTML tags
NULLs: List NULLs at end of query output; Parents with and without children
Basic aggregation
This is the simplest grouping query pattern: for column foo, display the smallest, largest, sum, average or some other statistic of column bar values:
SELECT foo, MIN(bar) AS bar FROM tbl GROUP BY foo
Return the highest bar value for each foo, ordering top to bottom by that value:
SELECT foo, MAX(bar) AS Count FROM tbl GROUP BY foo ORDER BY Count DESC;
Ditto for AVG(), COUNT() etc. The pattern is easily extended for multiple grouping column expressions. MySQL introduced the SQL extension GROUP_CONCAT(), which makes short work of listing items in groups. For example, given a table of suppliers and the parts they make ...
CREATE TABLE supparts(supID char(2),partID char(2)); INSERT INTO supparts VALUES ('s1','p1'),('s1','p2'),('s1','p3'),('s1','p4'),('s1','p5'),('s1','p6'), ('s2','p1'),('s2','p2'),('s3','p2'),('s4','p2'),('s4','p4'),('s4','p5');
SELECT partID, GROUP_CONCAT(supID ORDER BY supID) AS Suppliers
FROM supparts
GROUP BY partID;
+--------+-------------+
| partID | Suppliers   |
+--------+-------------+
| p1     | s1,s2       |
| p2     | s1,s2,s3,s4 |
| p3     | s1          |
| p4     | s1,s4       |
| p5     | s1,s4       |
| p6     | s1          |
+--------+-------------+
If columns other than the GROUP BY column must be retrieved, and if the grouping expression does not have a strictly 1:1 relationship with those columns, then to avoid returning arbitrary values for those non-grouping columns, you must put the GROUP BY query in a subquery and join that result to the other columns, for example:
SELECT s.partID, s.thiscol, s.thatcol, s.anothercol, x.Suppliers
FROM supparts s
JOIN (
  SELECT partID, GROUP_CONCAT(supID ORDER BY supID) AS Suppliers
  FROM supparts
  GROUP BY partID
) x USING (partID)
Last updated 25 Nov 2010
Aggregates across multiple joins

CREATE TABLE packageItemTax(
  packageItemTaxID INT,
  packageItemTaxItemID INT,
  packageItemTaxName CHAR(5),
  packageItemTaxAmount DECIMAL(10,2)
);
INSERT INTO packageItemTax VALUES (1,1,'GST',7.00), (2,1,'HST',10.00);
Joining the parent to more than one child table multiplies the sums: each matching row in one child is repeated once for every matching row in the other, so the query aggregates across each join. How then to get the correct results? One way is with correlated subqueries:
SELECT
  i.packageItemID AS Item,
  SUM(i.packageItemPrice) AS Price,
  ( SELECT SUM(packageCreditAmount) FROM packageCredit c WHERE c.packageCreditItemID = i.packageItemID ) AS Credit,
  ( SELECT SUM(packageItemTaxAmount) FROM packageItemTax t WHERE t.packageItemTaxItemID = i.packageItemID ) AS Tax
FROM packageItem i
GROUP BY i.packageItemID;
+------+--------+--------+-------+
| Item | Price  | Credit | Tax   |
+------+--------+--------+-------+
|    1 | 100.00 | 117.00 | 17.00 |
+------+--------+--------+-------+
Moving the subquery logic to the JOIN level may speed up performance considerably:
SELECT i.packageItemID AS Item, SUM(i.packageItemPrice) AS Price, c.Credit, t.Tax FROM packageItem i JOIN ( SELECT packageCreditItemID, SUM(packageCreditAmount) AS Credit FROM packageCredit GROUP BY packageCreditItemID ) c ON i.packageItemID = c.packageCreditItemID JOIN ( SELECT packageItemTaxItemID, SUM(t.packageItemTaxAmount) AS Tax FROM packageItemTax t GROUP BY packageItemTaxItemID ) t ON i.packageItemID = t.packageItemTaxItemID GROUP BY packageItemID;
If subqueries are unavailable or too slow, replace them with temp tables.
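A minimal sketch of that temp-table variant, reusing the column names above (the temporary table names credits and taxes are ours):

-- pre-aggregate each child table once, then join the results to the parent
DROP TEMPORARY TABLE IF EXISTS credits, taxes;
CREATE TEMPORARY TABLE credits
  SELECT packageCreditItemID, SUM(packageCreditAmount) AS Credit
  FROM packageCredit GROUP BY packageCreditItemID;
CREATE TEMPORARY TABLE taxes
  SELECT packageItemTaxItemID, SUM(packageItemTaxAmount) AS Tax
  FROM packageItemTax GROUP BY packageItemTaxItemID;
SELECT i.packageItemID AS Item, SUM(i.packageItemPrice) AS Price, c.Credit, t.Tax
FROM packageItem i
JOIN credits c ON i.packageItemID = c.packageCreditItemID
JOIN taxes t ON i.packageItemID = t.packageItemTaxItemID
GROUP BY i.packageItemID;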
Last updated 16 Feb 2011
Aggregates excluding leaders

Given a table ranks(grp, rank), suppose you wish to list ranks by group, omitting the leading (smallest) rank in each group. The simplest query for the group leaders is ...

SELECT grp, MIN(rank) AS top
FROM ranks r2
GROUP BY grp;
The simplest way to get a result that omits these is an exclusion join from the ranks table to the above result:
SELECT r1.grp, r1.rank FROM ranks r1 LEFT JOIN ( SELECT grp, MIN(rank) as top FROM ranks r2 GROUP BY grp ) AS r2 ON r1.grp=r2.grp AND r1.rank = r2.top WHERE r2.grp IS NULL ORDER BY grp, rank; +------+------+ | grp | rank | +------+------+ | 1 | 7 | | 1 | 9 | | 2 | 3 | | 2 | 5 | | 2 | 6 | | 2 | 8 | | 4 | 12 | | 4 | 13 | +------+------+
To retrieve a list of c1 values for which there exist specific values in another column c2, you need an IN clause specifying the c2 values and a HAVING clause specifying the required number of different items in the list ...
SELECT c1 FROM tbl WHERE c2 IN (1,2,3,4) GROUP BY c1 HAVING COUNT(DISTINCT c2) = 4;
This is easy to generalise to multiple column expressions, and a HAVING clause specifying any number of items from the IN list.
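For instance, a minimal sketch of the multi-column form (table name and values are hypothetical), requiring that a c1 value be linked with all three of the listed (c2,c3) pairs:

-- hypothetical example: c1 must be paired with every (c2,c3) combination in the list
SELECT c1
FROM tbl
WHERE (c2, c3) IN ((1,'a'), (2,'b'), (3,'a'))
GROUP BY c1
HAVING COUNT(DISTINCT c2, c3) = 3;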
What query shows which parties have candidates in all districts? The simplest solution is to aggregate on party from a join of candidates to districts, and condition the result on each party having a rowcount at least equal to the district count:
SELECT party FROM candidates INNER JOIN districts USING (district) GROUP BY party HAVING COUNT(party) >= (SELECT COUNT(*) FROM districts); +---------+ | party | +---------+ | Liberal | +---------+
The query pattern is: given a table of x values in table X, a table of y values in table Y, and a table of x-y pairs in table Z, use the following query to find all X.x which are paired with every Y.y in Z:
SELECT x FROM Y JOIN Z USING (y) GROUP BY x HAVING COUNT(x) >= (SELECT COUNT(*) FROM Y);
The query needs to aggregate language counts by country twice: once for all language counts by country, and once again to identify countries with the highest number of languages:
SELECT name, population, headofstate, top.num FROM Country JOIN ( SELECT countrycode, COUNT(*) AS num FROM CountryLanguage WHERE isofficial='T' GROUP BY countrycode HAVING num = ( SELECT MAX(summary.nr_official_languages) FROM ( SELECT countrycode, COUNT(*) AS nr_official_languages FROM CountryLanguage WHERE isofficial='T' GROUP BY countrycode ) AS summary ) ) as top ON Country.code=top.countrycode; +--------------+------------+-------------+-----+ | name | population | headofstate | num | +--------------+------------+-------------+-----+ | Switzerland | 7160400 | Adolf Ogi | 4 | | South Africa | 40377000 | Thabo Mbeki | 4 | +--------------+------------+-------------+-----+
In addition, one of the nested subqueries is buried in a HAVING clause. This is fine with small tables, but if the table being aggregated is very large and the aggregation is complex, performance may be unsatisfactory. Substituting a temporary table for the double nesting can improve performance in two ways: the aggregation needs to be done just once, and we can use an exclusion join, which is usually faster than a HAVING clause, to find countries with the maximum counts:
DROP TABLE IF EXISTS top; CREATE TABLE top ENGINE=MEMORY SELECT countrycode, COUNT(*) AS num FROM CountryLanguage l1 WHERE isofficial='T' GROUP BY countrycode; SELECT name,population,headofstate,t3.num FROM country c JOIN (
SELECT t1.countrycode, t1.num FROM top t1 LEFT JOIN top t2 ON t1.num < t2.num WHERE t2.countrycode IS NULL ) AS t3 ON c.code=t3.countrycode; +--------------+------------+-------------+-----+ | name | population | headofstate | num | +--------------+------------+-------------+-----+ | Switzerland | 7160400 | Adolf Ogi | 4 | | South Africa | 40377000 | Thabo Mbeki | 4 | +--------------+------------+-------------+-----+ DROP TABLE top;
Notice that we haven't actually used a TEMPORARY table. That's because of the MySQL limitation that a temporary table cannot be referenced more than once in a query. Until that limitation is lifted, a MEMORY table gives almost as much speed improvement as a temporary table would.
Cascading aggregates
When you have parent-child-grandchild tables, eg companies, users, actions, and your query requirement is for per-parent aggregates from the child table and per-child aggregates from the grandchild table, then cascading joins yield spuriously multiplied counts, and correlated subqueries fail because the second correlated subquery cannot find a visible joining column. One solution is to use derived tables. Assuming ...
DROP TABLE IF EXISTS companies, users, actions;
CREATE TABLE companies (id INT, name CHAR(10));
INSERT INTO companies VALUES (1,'abc ltd'),(2,'xyz inc');
CREATE TABLE users (id INT, companyid INT);
INSERT INTO users VALUES (1,1),(2,1),(3,1),(4,2),(5,2);
CREATE TABLE actions (id INT, userid INT, date DATE);
INSERT INTO actions VALUES
 ( 1, 1, '2009-1-2'),( 2, 1, '2009-1-3'),( 3, 2, '2009-1-4'),( 4, 2, '2009-1-5'),( 5, 3, '2009-1-6'),
 ( 6, 3, '2009-1-7'),( 7, 4, '2009-1-8'),( 8, 5, '2009-1-9'),( 9, 5, '2009-1-9'),(10, 5, '2009-1-9');
then ...
1. Join companies and users once to establish a derived company-user table.
2. Join them a second time, this time aggregating on users.id to retrieve user counts per company.
3. Join the first derived table to the actions table, aggregating on actions.id to retrieve actions per user per company.
Here is the SQL:
SELECT cu1.cname as CpyName, cu2.uCnt as Users, ua.uid as UserNo, ua.aCnt as Actions, ua.Latest FROM ( SELECT c.id AS cid, c.name AS cname, u1.id AS uid FROM companies c INNER JOIN users u1 ON u1.companyid=c.id ) AS cu1 INNER JOIN ( SELECT c.id AS cid, COUNT(u2.id) AS uCnt FROM companies c INNER JOIN users u2 ON u2.companyid=c.id GROUP BY c.id ) AS cu2 ON cu1.cid=cu2.cid INNER JOIN ( SELECT u3.id AS uid, COUNT(a.id) AS aCnt, MAX(a.date) AS latest FROM users u3 INNER JOIN actions a ON u3.id=a.userid GROUP BY u3.id ) AS ua ON ua.uid=cu1.uid; +---------+-------+--------+---------+------------+ | CpyName | Users | UserNo | Actions | Latest | +---------+-------+--------+---------+------------+ | abc ltd | 3 | 1 | 2 | 2009-01-03 | | abc ltd | 3 | 2 | 2 | 2009-01-05 | | abc ltd | 3 | 3 | 2 | 2009-01-07 | | xyz inc | 2 | 4 | 1 | 2009-01-08 | | xyz inc | 2 | 5 | 3 | 2009-01-09 | +---------+-------+--------+---------+------------+
Cross-aggregates
Given the table authorbook(authid INT, bookid INT), what query finds the books whose authors have more than one book in the table? Even one level of recursion can induce a mild trance. Escape the trance by taking the problem one step at a time. First write the query that finds the authors with multiple books. Then join an outer query to that on authid, and have the outer query select bookid:
SELECT a1.bookid
FROM authorbook a1
INNER JOIN (
  SELECT authid, COUNT(bookid) AS bookcount
  FROM authorbook
  GROUP BY authid
  HAVING bookcount > 1
) AS a2 ON a1.authid = a2.authid;
Group data by datetime periods

To group rows by N-minute periods within the hour, group on the expression (60/periodMinutes) * HOUR(thistime) + FLOOR(MINUTE(thistime) / periodMinutes), where thistime is the TIME column and periodMinutes is the period length in minutes. So to group by 15-minute periods, write ...
SELECT ... GROUP BY ( 4 * HOUR( thistime ) + FLOOR( MINUTE( thistime ) / 15 )) ...
A simpler application of the same logic works for hours. For example, list data by three-hour periods:
DROP TABLE IF EXISTS t; CREATE TABLE t(t time,i int); INSERT INTO t VALUES('01:01:01',1),('02:02:02',2),('05:05:05',5); SELECT FLOOR(HOUR(t)/3) AS period, GROUP_CONCAT(i) AS i FROM t GROUP BY period; +--------+------+ | period | i | +--------+------+ | 0 | 1,2 | | 1 | 5 | +--------+------+
The same FLOOR(...) logic can be made to work for other period lengths by modifying the denominator to suit, and for weeks with a function that maps the results of WEEK() to the range 1...52. When the desired grouping period is a value returned directly by a MySQL date-time function, matters become simpler: just group by that value. Thus to group by weeks, write ...

SELECT ... GROUP BY WEEK( datecol ) ...

If no expression built from MySQL date-time functions returns the desired grouping period, you will need to write your own stored function.
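For example, a minimal sketch of such a function (the name period4wk and the anchor date are ours), grouping dates into four-week periods, which no built-in function returns directly:

-- hypothetical helper: numbers four-week periods counted from an arbitrary anchor date
CREATE FUNCTION period4wk( d DATE ) RETURNS INT DETERMINISTIC
  RETURN FLOOR( DATEDIFF(d, '2009-01-01') / 28 );
-- then: SELECT ... GROUP BY period4wk(datecol) ...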
League table
Here is a simple soccer league table setup that was developed in the MySQL Forum by J Williams and a contributor named "Laptop Alias". The teams table tracks team ID and name, the games table tracks home and away team IDs and goal totals for each game. The query for standings is built by aggregating a UNION of home team and away team game results:
DROP TABLE IF EXISTS teams, games; CREATE TABLE teams(id int primary key auto_increment,tname char(32)); CREATE TABLE games(id int primary key auto_increment, date datetime, hteam int, ateam int, hscore tinyint,ascore tinyint); INSERT INTO teams VALUES(1,'Wanderers'),(2,'Spurs'),(3,'Celtics'),(4,'Saxons'); INSERT INTO games VALUES (1,'2008-1-1 20:00:00',1,2,1,0),(2,'2008-1-1 20:00:00',3,4,0,2), (3,'2008-1-8 20:00:00',1,3,1,1),(4,'2008-1-8 20:00:00',2,4,2,1); SELECT * FROM teams; +----+-----------+ | id | tname | +----+-----------+ | 1 | Wanderers | | 2 | Spurs | | 3 | Celtics | | 4 | Saxons | +----+-----------+ SELECT * FROM games; +----+---------------------+-------+-------+--------+--------+ | id | date | hteam | ateam | hscore | ascore | +----+---------------------+-------+-------+--------+--------+ | 1 | 2008-01-01 20:00:00 | 1 | 2 | 1 | 0 | | 2 | 2008-01-01 20:00:00 | 3 | 4 | 0 | 2 |
| 3 | 2008-01-08 20:00:00 | 1 | 3 | 1 | 1 | | 4 | 2008-01-08 20:00:00 | 2 | 4 | 2 | 1 | +----+---------------------+-------+-------+--------+--------+ -- Standings query: SELECT tname AS Team, Sum(P) AS P,Sum(W) AS W,Sum(D) AS D,Sum(L) AS L, SUM(F) as F,SUM(A) AS A,SUM(GD) AS GD,SUM(Pts) AS Pts FROM( SELECT hteam Team, 1 P, IF(hscore > ascore,1,0) W, IF(hscore = ascore,1,0) D, IF(hscore < ascore,1,0) L, hscore F, ascore A, hscore-ascore GD, CASE WHEN hscore > ascore THEN 3 WHEN hscore = ascore THEN 1 ELSE 0 END PTS FROM games UNION ALL SELECT ateam, 1, IF(hscore < ascore,1,0), IF(hscore = ascore,1,0), IF(hscore > ascore,1,0), ascore, hscore, ascore-hscore GD, CASE WHEN hscore < ascore THEN 3 WHEN hscore = ascore THEN 1 ELSE 0 END FROM games ) as tot JOIN teams t ON tot.Team=t.id GROUP BY Team ORDER BY SUM(Pts) DESC ; +-----------+------+------+------+------+------+------+------+------+ | Team | P | W | D | L | F | A | GD | Pts | +-----------+------+------+------+------+------+------+------+------+ | Wanderers | 2 | 1 | 1 | 0 | 2 | 1 | 1 | 4 | | Spurs | 2 | 1 | 0 | 1 | 2 | 2 | 0 | 3 | | Saxons | 2 | 1 | 0 | 1 | 3 | 2 | 1 | 3 | | Celtics | 2 | 0 | 1 | 1 | 1 | 3 | -2 | 1 | +-----------+------+------+------+------+------+------+------+------+
Monthly expenses
You have four tables to track revenue and expenses (bankaccount, cash, accountitem, accountcategory):
drop table if exists accountitem,accountcategory,bankaccount,cash; create table accountitem( itemid int primary key auto_increment,itemname char(32),itemcatid int ); create table accountcategory( categoryid int primary key auto_increment,categoryname char(32),isexpense bool ); create table bankaccount( id int auto_increment primary key,amount decimal(12,2),itemid int,entrydate date ); create table cash( id int auto_increment primary key,amount decimal(12,2),itemid int,date date );
You need monthly expense totals. The solution uses two query patterns: sum across a join of the cash and bank tables for totals, and pivot on month to break out monthly totals:
select
  if(m.Month=1,  m.Amt+n.Amt, 0) As `Jan`,   -- pivot by month
  if(m.Month=2,  m.Amt+n.Amt, 0) As `Feb`,
  if(m.Month=3,  m.Amt+n.Amt, 0) As `Mar`,
  if(m.Month=4,  m.Amt+n.Amt, 0) As `Apr`,
  if(m.Month=5,  m.Amt+n.Amt, 0) As `May`,
  if(m.Month=6,  m.Amt+n.Amt, 0) As `Jun`,
  if(m.Month=7,  m.Amt+n.Amt, 0) As `Jul`,
  if(m.Month=8,  m.Amt+n.Amt, 0) As `Aug`,
  if(m.Month=9,  m.Amt+n.Amt, 0) As `Sep`,
  if(m.Month=10, m.Amt+n.Amt, 0) As `Oct`,
  if(m.Month=11, m.Amt+n.Amt, 0) As `Nov`,
  if(m.Month=12, m.Amt+n.Amt, 0) As `Dec`
from (
  -- expense totals by month from the cash table
  select Month(a.date) As Month, sum(a.Amount) as Amt
  from cash a
  join accountitem b on a.itemid=b.itemid
  join accountcategory c on b.ItemCatID=c.CategoryID
  where c.IsExpense=1
  Group By Month
) As m
join (
  -- join cash & bank data: expense totals by month from the bank table
  select Month(a.EntryDate) As Month, sum(a.Amount) AS Amt
  from BankAccount a
  join accountitem b on a.itemid=b.itemid
  join accountcategory c on b.ItemCatID=c.CategoryID
  where c.IsExpense=1
  Group By Month
) As n on m.Month = n.Month;
Nested aggregation
Employee sales commission rates increase with sales totals according to specified bands of sales total amounts like a graduated income tax in reverse. To compute total commissions due each employee, we need to aggregate twice: first to get sales per employee, then to get commissions per employee:
DROP TABLE IF EXISTS sales, commissions; CREATE TABLE sales(employeeID int,sales int); INSERT INTO sales VALUES(1,2),(1,5),(1,7),(2,9),(2,15),(2,12); SELECT * FROM sales; +------------+-------+ | employeeID | sales | +------------+-------+ | 1 | 2 | | 1 | 5 | | 1 | 7 | | 2 | 9 | | 2 | 15 | | 2 | 12 | +------------+-------+ CREATE TABLE commissions( comstart DECIMAL(6,2), commend DECIMAL(6,2), comfactor DECIMAL(6,2), pct INT ); INSERT INTO commissions VALUES (1.00,10.00,0.10,10),(11.00,20.00,0.20,20),(21.00,30.00,0.30,30),(31.00,40.00,0.40,40); SELECT * FROM commissions; +----------+---------+-----------+------+ | comstart | commend | comfactor | pct | +----------+---------+-----------+------+ | 1.00 | 10.00 | 0.10 | 10 | | 11.00 | 20.00 | 0.20 | 20 | | 21.00 | 30.00 | 0.30 | 30 | | 31.00 | 40.00 | 0.40 | 40 | +----------+---------+-----------+------+
The first problem is to work out how commission ranges map to sales totals to determine base amounts for calculation of each part-commission. We assume the ranges are inclusive, ie a range that starts at 1 euro is meant to include that first euro:
- if amt < comstart, base amount = 0
- if amt <= commend, base amount = amt - comstart + 1
- if amt > commend, base amount = commend - comstart + 1
This is a nested IF():
IF(s.amt < c.comstart, 0, IF(s.amt <= c.commend, s.amt - c.comstart + 1, c.commend - c.comstart + 1))
The second problem is how to apply every commission range row to every employee sales sum. That's a CROSS JOIN between aggregated sales and commissions:
SELECT * FROM ( SELECT employeeID,SUM(sales) AS amt FROM sales GROUP BY employeeID ) AS s JOIN commissions ORDER BY s.employeeID; +------------+------+----------+---------+-----------+------+ | employeeID | amt | comstart | commend | comfactor | pct | +------------+------+----------+---------+-----------+------+ | 1 | 14 | 1.00 | 10.00 | 0.10 | 10 | | 1 | 14 | 11.00 | 20.00 | 0.20 | 20 | | 1 | 14 | 21.00 | 30.00 | 0.30 | 30 | | 1 | 14 | 31.00 | 40.00 | 0.40 | 40 | | 2 | 36 | 31.00 | 40.00 | 0.40 | 40 | | 2 | 36 | 1.00 | 10.00 | 0.10 | 10 | | 2 | 36 | 11.00 | 20.00 | 0.20 | 20 | | 2 | 36 | 21.00 | 30.00 | 0.30 | 30 | +------------+------+----------+---------+-----------+------+
Now check how the formula applies on every commission band for every sales total:
SELECT s.employeeID,s.amt,c.comstart,c.commend, IF(s.amt<=c.comstart,0, IF( s.amt < c.commend, s.amt-c.comstart+1, c.commend-c.comstart+1 ) ) AS base, c.comFactor AS ComPct, IF(s.amt<=c.comstart,0, IF(s.amt<c.commend,s.amt-c.comstart+1,c.commend-c.comstart+1) ) * comFactor AS Comm FROM ( SELECT employeeID,SUM(sales) AS amt
FROM sales GROUP BY employeeID ) AS s JOIN commissions c ORDER BY s.employeeID,comstart; +------------+------+----------+---------+-------+--------+--------+ | employeeID | amt | comstart | commend | base | ComPct | Comm | +------------+------+----------+---------+-------+--------+--------+ | 1 | 14 | 1.00 | 10.00 | 10.00 | 0.10 | 1.0000 | | 1 | 14 | 11.00 | 20.00 | 4.00 | 0.20 | 0.8000 | | 1 | 14 | 21.00 | 30.00 | 0.00 | 0.30 | 0.0000 | | 1 | 14 | 31.00 | 40.00 | 0.00 | 0.40 | 0.0000 | | 2 | 36 | 1.00 | 10.00 | 10.00 | 0.10 | 1.0000 | | 2 | 36 | 11.00 | 20.00 | 10.00 | 0.20 | 2.0000 | | 2 | 36 | 21.00 | 30.00 | 10.00 | 0.30 | 3.0000 | | 2 | 36 | 31.00 | 40.00 | 6.00 | 0.40 | 2.4000 | +------------+------+----------+---------+-------+--------+--------+
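The remaining step is to total the per-band commissions for each employee; a minimal sketch wrapping the calculation above:

-- sum the per-band commissions from the cross join, one total per employee
SELECT s.employeeID,
       SUM( IF(s.amt <= c.comstart, 0,
               IF(s.amt < c.commend, s.amt - c.comstart + 1, c.commend - c.comstart + 1)
            ) * c.comfactor ) AS TotalCommission
FROM ( SELECT employeeID, SUM(sales) AS amt FROM sales GROUP BY employeeID ) AS s
JOIN commissions c
GROUP BY s.employeeID;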
Here is another example. We track passenger flight bookings in three tables: flight, booking, passenger. To report all destinations per passenger, retrieve DISTINCT passenger-destination combos, then count them:
DROP TABLES IF EXISTS flight, booking, passenger; CREATE TABLE flight(flight CHAR(12),source CHAR(12),destination CHAR(12)); INSERT INTO flight VALUES ('ab123','dublin','london'),('bc123','prague','london'),('cd123','stuttgart','paris'), ('de123','paris','madrid'); CREATE TABLE booking(flight CHAR(5),seat CHAR(2),passenger_id INT NOT NULL); INSERT INTO booking VALUES ('ab123','a2',1),('bc123','a1',2),('bc123','a2',1),('cd123','a3',1); CREATE TABLE passenger(passenger_id INT, name VARCHAR(12)); INSERT INTO passenger VALUES (1,'john'),(2,'bill'),(3,'david'); SELECT x.*,COUNT(b.passenger_id) bookings FROM ( SELECT DISTINCT p.passenger_id, p.name, d.destination FROM passenger p CROSS JOIN flight d ) x LEFT JOIN flight d ON d.destination = x.destination LEFT JOIN booking b ON b.passenger_id = x.passenger_id AND b.flight = d.flight GROUP BY passenger_id, destination; +--------------+-------+-------------+----------+ | passenger_id | name | destination | bookings | +--------------+-------+-------------+----------+ | 1 | john | london | 2 | | 1 | john | madrid | 0 | | 1 | john | paris | 1 | | 2 | bill | london | 1 | | 2 | bill | madrid | 0 | | 2 | bill | paris | 0 | | 3 | david | london | 0 | | 3 | david | madrid | 0 | | 3 | david | paris | 0 | +--------------+-------+-------------+----------+
Last updated 15 Nov 2009
Pairwise matchmaking
Given tables tracking users and their hobbies, how do we write a query that ranks pairs of users on hobby similarity?
DROP TABLE IF EXISTS users,hobbies,users_hobbies; CREATE TABLE users( id int, name char(16) ) ; INSERT INTO users VALUES (1,'John'),(2,'Lewis'),(3,'Muhammad'); CREATE TABLE hobbies( id int, title char(16) ) ; INSERT INTO hobbies VALUES (1,'Sports'),(2,'Computing'),(3,'Drinking'),(4,'Racing'),(5,'Swimming'),(6,'Photography'); CREATE TABLE users_hobbies( user_id int, hobby_id int ) ; INSERT INTO users_hobbies VALUES (1,2),(1,3),(1,6),(2,1),(2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),(2,1), (2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),(2,1),(2,5),(2,6),(3,2),(3,5),(3,6);
It's a SQL version of a famous computing problem known as nearest neighbour search or similarity search: given a metric space of K vectors, return the N most similar. Start by defining a similarity measure for one pair of users: if user A has x hobbies and user B has y hobbies, one measure of their similarity is the number of hobbies they share, divided by the number of hobbies that either has. Is that plausible? If A and B both have hobbies 1, 14, 27, they are 100*3/3 = 100% similar. If A has 9,13 while B has 6,9,15, together they have 4 hobbies, one of which they share, so their similarity is 25%.

That's reasonable, but incomplete. If the comparison space has 100 hobbies, and one pair shares 1 of 4 while another pair shares 4 of 16, are those two pairs equally similar? Arguably not, since the second pair shares a greater proportion of the total possible. The similarity measure should take this into account, so if N = the total number of hobbies, S = the number of hobbies shared by a pair, and T = the total number of distinct hobbies they have together, their similarity is (S/T) * (T/N), or simply S/N.

So from first logical principles, the solution is a three-step:
1. Write the above logic in SQL.
2. Write a query that applies that SQL to all pairs.
3. Rank the scores that result.
Fortunately, aggregation collapses these three logical steps to two. Step 1: count hobbies, then collect hobbies pairwise by user:
SET @N = (Select Count(DISTINCT id) FROM hobbies); SELECT @N; +------+ | @N | +------+ | 6 | +------+ SELECT a.user_id, b.user_id, Group_Concat(DISTINCT a.hobby_id) AS 'Pairwise shared hobbies' FROM users_hobbies a JOIN users_hobbies b ON a.user_id<b.user_id AND a.hobby_id=b.hobby_id GROUP BY a.user_id,b.user_id; +---------+---------+-------------------------+ | user_id | user_id | Pairwise shared hobbies | +---------+---------+-------------------------+ | 1 | 2 | 6 | | 1 | 3 | 2,6 | | 2 | 3 | 5,6 | +---------+---------+-------------------------+
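Step 2 ranks the pairs by S/N; a minimal sketch, consistent with the formula above and with the ROUND() expression discussed next:

-- shared hobbies per pair, divided by the total number of hobbies @N
SELECT a.user_id AS user1, b.user_id AS user2,
       ROUND( COUNT(DISTINCT a.hobby_id) / @N, 2 ) AS similarity
FROM users_hobbies a
JOIN users_hobbies b ON a.user_id < b.user_id AND a.hobby_id = b.hobby_id
GROUP BY a.user_id, b.user_id
ORDER BY similarity DESC;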
But there is no hard and fast rule that determines a uniquely correct measure of pairwise similarity. Some circumstances may require the above formula. Others may require simple or complicated weights computed from pair and population sizes. Implement them by applying a corrective calculation to the denominator @N in ROUND( COUNT(DISTINCT a.hobby_id) / @N, 2 ).
Last updated 17 Sep 2010
But is it accurate? No, because it displays only the first c.pid value it happens to find. For further discussion see Within-group aggregates.
Track state changes

To test query efficiency, we need more than six rows. To generate sequential test data, a utility table of ints from 0 through 9 is helpful. We keep general-purpose database objects in a sys database available to all developers on the server:
create database if not exists sys; use sys; drop table if exists ints; create table ints(i int); insert into ints values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
Quassnoi's test data was 10,000 rows tracking ID, state, date and value, with state oscillating between 1 and 2 every one thousand rows. The primary key is ID. There is a covering (date, state) index. Dates are in descending order. To simplify comparative testing, we parameterise the state-change interval:
set @state_interval=1000; drop table if exists history; create table history ( id int primary key, state int not null, date datetime not null, value varchar(16) not null, key(date, state) ) engine=innodb default charset=utf8; insert into history select id, 1 + floor(((id-1)/@state_interval) mod 2), '2009-07-24' - interval id minute, concat('value ', id) from ( select 1 + ( t.i*1000 + u.i*100 + v.i*10 + w.i ) as id from sys.ints t join sys.ints u join sys.ints v join sys.ints w ) tmp;
To the original requirement we add reporting the value associated with each state change. Quassnoi found a set-based solution, but it is painfully slow:
SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, value, COUNT(*) cnt FROM ( SELECT id, date, state,value, ( SELECT id FROM history a WHERE (a.date, a.state, a.id) < (b.date, b.state, b.id) AND a.state <> b.state ORDER BY date DESC, state DESC LIMIT 1 ) AS prev FROM history b ) q GROUP BY prev ORDER BY date; +------+---------------------+-------+-------------+------+ | id | date | state | value | cnt | +------+---------------------+-------+-------------+------+ | 9001 | 2009-07-17 01:20:00 | 2 | value 10000 | 1000 | | 8001 | 2009-07-17 18:00:00 | 1 | value 9000 | 1000 | | 7001 | 2009-07-18 10:40:00 | 2 | value 8000 | 1000 |
| 6001 | 2009-07-19 03:20:00 | 1 | value 7000 | 1000 | | 5001 | 2009-07-19 20:00:00 | 2 | value 6000 | 1000 | | 4001 | 2009-07-20 12:40:00 | 1 | value 5000 | 1000 | | 3001 | 2009-07-21 05:20:00 | 2 | value 4000 | 1000 | | 2001 | 2009-07-21 22:00:00 | 1 | value 3000 | 1000 | | 1001 | 2009-07-22 14:40:00 | 2 | value 2000 | 1000 | | 1 | 2009-07-23 07:20:00 | 1 | value 1000 | 1000 | +------+---------------------+-------+-------------+------+
The correlated SELECT ... ORDER BY ... LIMIT 1 subquery may look dodgy, but a conventional call to MIN() is even slower:
SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, COUNT(*) cnt FROM ( SELECT id, date, state, ( SELECT min(id) FROM history a WHERE (a.date, a.state, a.id) < (b.date, b.state, b.id) AND a.state <> b.state ) AS prev FROM history b ) tmp GROUP BY prev ORDER BY date;
The solution is to build a derived table ordered by date and state with user variables to track state changes. The derived column state_change starts at 0 and increments whenever state changes. Then selecting minimum date and state values from the derived table, and grouping them by state_change, picks out the rows where state changes. This is hundreds of times faster than either of the set-based solutions:
SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, value, COUNT(*) cnt FROM ( SELECT @r := @r + (@state != state OR @state IS NULL) AS state_change, @state := state AS current, h.id, h.date, h.state, h.value FROM ( SELECT @r := 0, @state := NULL ) as vars -- one-row virtual table CROSS JOIN history as h -- cross-joined with history ORDER BY date, state -- ordered to track state changes ) q GROUP BY state_change ORDER BY date;
If how this works puzzles you, run the query with state_change added to the outer SELECT list. Stricter SQL dialects would reject the query as written in any case, since standard SQL requires every non-aggregated column in the SELECT list to appear in the GROUP BY clause.
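For reference, a sketch of that diagnostic variant, identical except that state_change is exposed in the outer SELECT list:

SELECT state_change, MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT
    @r := @r + (@state != state OR @state IS NULL) AS state_change,
    @state := state AS current,
    h.id, h.date, h.state, h.value
  FROM ( SELECT @r := 0, @state := NULL ) AS vars  -- one-row virtual table
  CROSS JOIN history AS h                          -- cross-joined with history
  ORDER BY date, state                             -- ordered to track state changes
) q
GROUP BY state_change
ORDER BY date;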
Last updated 12 Feb 2011
... and you want a list of texts which include every keyword. You might think you have to join and match. You don't. All you need to do is count the distinct keywords which occur for each text, then for each text compare that number with the entire list of distinct keywords:
SELECT txtID, COUNT(DISTINCT keyword) AS N FROM keywords GROUP BY txtID HAVING N = (SELECT COUNT(DISTINCT keyword) FROM keywords); +-------+---+ | txtID | N | +-------+---+ | 2 | 2 | +-------+---+
Within-group aggregates
You have a products table with columns item, supplier, price. Multiple suppliers offer various prices for the same item. You need to find the supplier with the lowest price for each item.
DROP TABLE IF EXISTS products; CREATE TABLE products(item int,supplier int,price decimal(6,2)); INSERT INTO products VALUES(1,1,10),(1,2,15),(2,2,20),(2,1,21),(2,2,18); SELECT * FROM products; +------+----------+-------+ | item | supplier | price | +------+----------+-------+ | 1 | 1 | 10.00 | | 1 | 2 | 15.00 | | 2 | 2 | 20.00 | | 2 | 1 | 21.00 |
| 2 | 2 | 18.00 | +------+----------+-------+
Your first thought may be to GROUP BY item, but that is not guaranteed to return the correct supplier value for each minimum item price. Grouping by both item and supplier will return more information than you want. Nor can you write WHERE price=MIN(...), because the query engine will evaluate the WHERE clause before it knows the MIN value.

This is the problem of aggregating within aggregates. It is sometimes called the 'groupwise aggregates' problem, but the term 'groupwise' is ambiguous. We think better names for it are subaggregates, inner aggregates, or within-group aggregates. It's easy to show that the within-group aggregates problem is a form of the problem of returning values from non-grouping columns in an aggregate query. Suppose you write ...
SELECT item,supplier,MIN(price) FROM products GROUP BY item;
Will this reliably return the correct supplier per item? No. Unless there is exactly one supplier per item, the supplier value returned will be arbitrary. To retrieve the correct supplier for each item, you need more logic. One way to model the within-aggregates problem is via a left self-exclusion join...
SELECT p1.item, p1.supplier, p1.price
FROM products AS p1
LEFT JOIN products AS p2 ON p1.item = p2.item AND p1.price > p2.price
WHERE p2.item IS NULL;
...because in the resultset built by joining on left item = right item and left price > right price, the left-sided rows with no right-sided match are precisely those for which no lower price exists for that item, ie the per-item rows with the smallest prices. Another solution is to derive an intermediate table of aggregated minimum prices. Before MySQL 4.1, it has to be a temporary table:
CREATE TEMPORARY TABLE tmp ( item INT, minprice DECIMAL(6,2) DEFAULT 0.0 ); LOCK TABLES products READ; INSERT INTO tmp SELECT item, MIN(price) FROM products GROUP BY item;
to which you then join the products table to find the matching suppliers:
SELECT products.item, supplier, products.price FROM products JOIN tmp ON products.item = tmp.item WHERE products.price=tmp.minprice; UNLOCK TABLES; DROP TABLE tmp;
From MySQL 4.1 on, the temporary table can be a correlated subquery. This is the most intuitively obvious syntax for the problem. Often it's also the slowest solution:
SELECT item, supplier, price FROM products AS p1 WHERE price = ( SELECT MIN(p2.price) FROM products AS p2 WHERE p1.item = p2.item );
It is usually faster to move the aggregating subquery from the WHERE clause to the FROM clause:
SELECT p.item, p.supplier, p.price FROM products AS p JOIN ( SELECT item, MIN(price) AS minprice FROM products GROUP BY item ) AS pm ON p.item = pm.item AND p.price = pm.minprice;
Some users have trouble mapping elements of this model to their instance of the problem. There are five elements (or sets of them):
(i) a table, which might be a view, a single physical table, or a table derived from joins,
(ii) one or more grouping columns,
(iii) one or more columns to aggregate,
(iv) one or more columns not mentioned in the GROUP BY clause,
(v) an aggregating job to do, typically MIN() or MAX().
In the products/minimum-price solution above:
(i) table tbl = products,
(ii) grouping column grouping_col = item,
(iii) column to aggregate col_to_aggregate = price,
(iv) non-aggregated columns other_detail, ...etc... = supplier,
(v) aggregating function = MIN().
A "tricky" solution is to use ORDER BY in a subquery to find lowest prices, and write an outer GROUP BY query to pick each item's lowest price from the subquery:
SELECT * FROM ( SELECT * FROM products ORDER BY price ASC ) AS s GROUP BY item;
Other methods we've cited for this problem scale poorly because they're O(N*N). If ordering and grouping columns are indexed, this solution is O(N * log N), so it may be your best bet for speed. It's an instance of a useful rule of thumb: simplest is often fastest.

Finally, here is a radically different model of the problem. It can find both within-group minima and within-group maxima in a single query. It concatenates the zero-padded column to be aggregated (price) with the non-grouping column (supplier) into a single string, takes MIN() and MAX() of those strings within each group, then uses SUBSTR() and LEFT() to break the pieces apart in the result:
SELECT item,
  SUBSTR( MIN( CONCAT( LPAD(price,6,0), supplier ) ), 7 )     AS MinSupplier,
  LEFT(   MIN( CONCAT( LPAD(price,6,0), supplier ) ), 6 ) + 0 AS MinPrice,
  SUBSTR( MAX( CONCAT( LPAD(price,6,0), supplier ) ), 7 )     AS MaxSupplier,
  LEFT(   MAX( CONCAT( LPAD(price,6,0), supplier ) ), 6 ) + 0 AS MaxPrice
FROM products
GROUP BY item;
+------+-------------+----------+-------------+----------+
| item | MinSupplier | MinPrice | MaxSupplier | MaxPrice |
+------+-------------+----------+-------------+----------+
|    1 | 1           |       10 | 2           |       15 |
|    2 | 2           |       18 | 1           |       21 |
+------+-------------+----------+-------------+----------+
Try all solutions to find which is fastest for your version of the problem. To find the top or bottom N per group, you might think the LIMIT clause would work, but LIMIT is limited in subqueries. See Within-group quotas.
Last updated 07 Aug 2010
For the above dataset, the result which correctly matches wages and tips would be:
+------+------------+------+------+-------+------+--------+------+------------+ | tid | Date | Hrs | tabs | tips | wid | waiter | rate | start | +------+------------+------+------+-------+------+--------+------+------------+
| 1 | 2008-02-29 | 7 | 21 | 65.25 | 2 | 4 | 6.00 | 2005-03-01 | | 2 | 2008-03-06 | 6 | 15 | 51.75 | 3 | 5 | 7.00 | 2007-01-05 | | 3 | 2008-03-21 | 3 | 5 | 17.85 | 2 | 4 | 6.00 | 2005-03-01 | | 4 | 2008-03-22 | 5 | 10 | 39.00 | 4 | 5 | 8.00 | 2008-03-20 | +------+------------+------+------+-------+------+--------+------+------------+
Why is this different from an ordinary within-groups aggregate? The correct wage rate for a tips row is not the wages row for that waiter with the latest date; it is the wages row having the latest date before the date in the given tips row.
One way to proceed is to build a temporary table from a join of wages to tips on waiter and wages.start <= tips.date, then exclusion-join that result to itself to remove all rows except those with the latest wage rate per tips row. A two-step:
-- wages-tips join removing later wage changes: DROP TABLE IF EXISTS tmp; CREATE TABLE tmp SELECT t.id AS tid, t.date AS Date, t.hours_worked AS Hrs,t.tabs,t.tips, w.id AS wid, w.waiter, w.rate, w.start FROM tips t JOIN wages w ON w.waiter=t.waiter AND w.start<=t.date; -- self-exclusion join to remove obsolete wage rows: SELECT t1.* FROM tmp t1 LEFT JOIN tmp t2 ON t1.tid=t2.tid and t1.start<t2.start WHERE t2.waiter is null ORDER BY t1.Date; +------+------------+------+------+-------+------+--------+------+------------+ | tid | Date | Hrs | tabs | tips | wid | waiter | rate | start | +------+------------+------+------+-------+------+--------+------+------------+ | 1 | 2008-02-29 | 7 | 21 | 65.25 | 2 | 4 | 6.00 | 2005-03-01 | | 2 | 2008-03-06 | 6 | 15 | 51.75 | 3 | 5 | 7.00 | 2007-01-05 | | 3 | 2008-03-21 | 3 | 5 | 17.85 | 2 | 4 | 6.00 | 2005-03-01 | | 4 | 2008-03-22 | 5 | 10 | 39.00 | 4 | 5 | 8.00 | 2008-03-20 | +------+------------+------+------+-------+------+--------+------+------------+ DROP TABLE tmp;
That's fine, but can we skip the temp table? Yes, by adding the condition wages.start <= tips.date to each side of the exclusion join:
SELECT t.id AS tid, t.date, t.hours_worked AS Hrs,t.tabs,t.tips, w.id AS wid, w.waiter, w.rate, w.start FROM tips t JOIN wages w ON w.waiter=t.waiter AND w.start <= t.date LEFT JOIN wages w2 ON w.waiter=w2.waiter AND w2.start<=t.date AND w.start<w2.start WHERE w2.id IS NULL ORDER BY t.date;
Within-group quotas (Top N per group)

One approach is to rank rows with user variables and pick off the top two for each key in the WHERE clause:
SELECT tmp.ID, tmp.entrydate FROM ( SELECT ID, entrydate, IF( @prev <> ID, @rownum := 1, @rownum := @rownum+1 ) AS rank, @prev := ID FROM test t JOIN (SELECT @rownum := NULL, @prev := 0) AS r ORDER BY t.ID ) AS tmp WHERE tmp.rank <= 2 ORDER BY ID, entrydate; +------+------------+ | ID | entrydate | +------+------------+ | 1 | 2007-05-01 | | 1 | 2007-05-02 | | 2 | 2007-06-03 | | 2 | 2007-06-04 | | 3 | 2007-07-01 | | 3 | 2007-07-02 | +------+------------+
This is pretty much the same query pattern as the user variable method of emulating Row_Number(). The join with the one-row subquery is just a device for initialising the user variables before any rows are read.
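For comparison, a minimal sketch of that Row_Number() emulation over the same test table (alias names are ours):

-- number rows 1,2,3,... within each ID, restarting whenever the ID changes
SELECT t.ID, t.entrydate,
       @rownum := IF(@prev = t.ID, @rownum + 1, 1) AS row_num,
       @prev := t.ID AS prev_id
FROM test t
JOIN ( SELECT @rownum := 0, @prev := NULL ) AS init
ORDER BY t.ID, t.entrydate;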
If the groups are fairly small, another feasible approach is to self-join and count. With appropriate ordering, the first two rows per ID are the rows which, for a given ID, have two or fewer rows with earlier dates. If we use an inequality join with the COUNT(*) function to find the earlier rows per ID ...
SELECT t1.id, t1.entrydate, COUNT(*) AS earlier FROM test AS t1 JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate GROUP BY t1.id, t1.entrydate +------+------------+---------+ | id | entrydate | earlier | +------+------------+---------+ | 1 | 2007-05-01 | 1 | | 1 | 2007-05-02 | 2 | | 1 | 2007-05-03 | 3 | | 1 | 2007-05-04 | 4 | | 1 | 2007-05-05 | 5 | | 1 | 2007-05-06 | 6 | | 2 | 2007-06-01 | 1 | | 2 | 2007-06-02 | 2 | | 2 | 2007-06-03 | 3 | | 2 | 2007-06-04 | 4 | | 3 | 2007-07-01 | 1 | | 3 | 2007-07-02 | 2 | | 3 | 2007-07-03 | 3 | +------+------------+---------+
... then we get our result by removing rows where the 'earlier' count exceeds 2:
SELECT t1.id, t1.entrydate, count(*) AS earlier FROM test AS t1 JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate GROUP BY t1.id, t1.entrydate HAVING earlier <= 2; +------+------------+---------+ | id | entrydate | earlier | +------+------------+---------+ | 1 | 2007-05-01 | 1 | | 1 | 2007-05-02 | 2 | | 2 | 2007-06-01 | 1 | | 2 | 2007-06-02 | 2 | | 3 | 2007-07-01 | 1 | | 3 | 2007-07-02 | 2 | +------+------------+---------+
This is about as efficient as the first method with a small table, but it compares every within-group row to every other within-group row, so as the size N of a group increases, execution time increases as N*N. If the query takes one minute for groups of 1,000, it will take 16 minutes for groups of 4,000, and more than four hours for groups of 16,000. The solution does not scale. What to do? Forget GROUP BY! Manually assemble the desired query results in a temporary table from simple indexed queries, in this case two rows per ID:
DROP TEMPORARY TABLE IF EXISTS earliers; CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE); INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=1 ORDER BY entrydate LIMIT 2; INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=2 ORDER BY entrydate LIMIT 2; INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=3 ORDER BY entrydate LIMIT 2;
You need one INSERT statement per grouping value. To print the result, just query the earliers table:
SELECT * FROM earliers ORDER BY id, entrydate; +------+------------+ | id | entrydate | +------+------------+ | 1 | 2007-05-01 | | 1 | 2007-05-02 | | 2 | 2007-06-01 | | 2 | 2007-06-02 | | 3 | 2007-07-01 | | 3 | 2007-07-02 | +------+------------+ DROP TEMPORARY TABLE earliers;
Most useful reports run again and again. If that's the case for yours, automate it in a stored procedure: using a cursor and a prepared statement, auto-generate an INSERT statement for every grouping value, and return the result:
DROP PROCEDURE IF EXISTS listearliers;
DELIMITER |
CREATE PROCEDURE listearliers()
BEGIN
  DECLARE curdone, vid INT DEFAULT 0;
  DECLARE idcur CURSOR FOR SELECT DISTINCT id FROM test;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET curdone = 1;
  DROP TEMPORARY TABLE IF EXISTS earliers;
  CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE );
  SET @sql = 'INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=? ORDER BY entrydate LIMIT 2';
  OPEN idcur;
  REPEAT
    FETCH idcur INTO vid;
    IF NOT curdone THEN
      BEGIN
        SET @vid = vid;
        PREPARE stmt FROM @sql;
        EXECUTE stmt USING @vid;
        DROP PREPARE stmt;
      END;
    END IF;
  UNTIL curdone END REPEAT;
  CLOSE idcur;
  SELECT * FROM earliers ORDER BY id, entrydate;
  DROP TEMPORARY TABLE earliers;
END;
|
DELIMITER ;
CALL listearliers();
Last updated 04 May 2010
Aggregates from bands of values

Here is an example from a MySQL forum. You have a table of IDs and point scores, and you wish to show the distribution of counts across ranges. A traditional way of doing this is to build a table of ranges, then join from ranges to scores:
drop table if exists points,ranges; create table points ( id int not null auto_increment primary key, points int not null ); create table ranges ( low int not null, high int not null ); insert into points (points) values (456),(401),(543),(234),(303),(521),(478),(643),(575),(456),(432), (312),(564),(423),(411),(395),(543); insert into ranges (low, high) values (0,50),(51,100),(101,150),(151,200),(201,250),(251,300),(301,350), (351,400),(401,450),(451,500),(501,550),(551,600),(601,650),(651,700); select r.low, r.high, count(p.id) from ranges r left join points p on p.points between r.low and r.high group by r.low, r.high;
That's fine, but the ranges should be consistent, eg 0-49,50-99,100-149..., and if you keep a simple utility table of ints 0..9, you never need to build another range table:
drop table if exists ints,ranges; create table ints(i tinyint); insert into ints values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); select ranges.low, ranges.high, count(p.id) from ( select ( t.i*100 + u.i*10 + v.i ) as low,( t.i*100 + u.i*10 + v.i ) + 49 as high from ints t join ints u join ints v where (t.i*100 + u.i*10 + v.i) mod 50 = 0 and (t.i*100 + u.i*10 + v.i) < 701 ) ranges left join points p on p.points between ranges.low and ranges.high group by ranges.low; +-----+------+-------------+ | low | high | count(p.id) | +-----+------+-------------+ | 0 | 50 | 0 | | 51 | 100 | 0 | | 101 | 150 | 0 | | 151 | 200 | 0 | | 201 | 250 | 1 | | 251 | 300 | 0 | | 301 | 350 | 2 | | 351 | 400 | 1 | | 401 | 450 | 4 | | 451 | 500 | 3 | | 501 | 550 | 3 | | 551 | 600 | 2 |
Average the top 50% values per group

How would we write a query that returns the average of the top 50% of scores per team? The per-team median value is its middle value--lower than the highest 50% and higher than the lowest 50% of values for that team--so a shortcut is to query the team medians, then aggregate on a join that selects per-team scores above the medians. How to find per-team medians? If a resultset has an odd number of rows, at least one row has the true median score. If it has an even number of rows, the median score is an average of two central values. The following query adapts Joe Celko's formula in "SQL for Smarties", averaging "low" and "high" medians:
DROP TABLE IF EXISTS medians;
CREATE TABLE medians
SELECT p1.teamid, AVG(p1.score) AS median
FROM games AS p1, games AS p2
WHERE p1.teamid = p2.teamid
GROUP BY p1.teamid
HAVING
  (     SUM(CASE WHEN p2.score <= p1.score THEN 1 ELSE 0 END) >= (COUNT(*)+1) / 2
    AND SUM(CASE WHEN p2.score >= p1.score THEN 1 ELSE 0 END) >= (COUNT(*)/2) + 1 )
  OR
  (     SUM(CASE WHEN p2.score >= p1.score THEN 1 ELSE 0 END) >= (COUNT(*)+1) / 2
    AND SUM(CASE WHEN p2.score <= p1.score THEN 1 ELSE 0 END) >= (COUNT(*)/2) + 1 );
+--------+--------+
| teamid | median |
+--------+--------+
|      1 | 4.5000 |
|      2 | 6.8333 |
+--------+--------+
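The final step described above joins the medians back to the scores and averages each team's top half; a minimal sketch, assuming a games(teamid, score) table as used above:

-- average only the scores at or above each team's median
SELECT g.teamid, AVG(g.score) AS avg_top_half
FROM games g
JOIN medians m ON g.teamid = m.teamid
WHERE g.score >= m.median
GROUP BY g.teamid;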
Correlation
Probably more bunk has been written about correlation than about any other statistic. We'll keep this short and straight. At its simplest, correlation is a statistical measure of non-random, linear association between pairs of values in a dataset. It's denoted by r, and varies from -1 through +1, where -1 indicates perfect inverse correlation (the regression line goes down left to right), 0 indicates no correlation (there is no regression line; it's just a scatterplot), and +1 indicates perfect direct correlation (the regression line goes up left to right). For an example we'll use a bit of imaginary data:
drop table if exists t;
create table t (id int, x int, y float);
insert into t values
 (1 , 68, 4.1),(2 , 71, 4.6),(3 , 62, 3.8),(4 , 75, 4.4),(5 , 58, 3.2),
 (6 , 60, 3.1),(7 , 67, 3.8),(8 , 68, 4.1),(9 , 71, 4.3),(10, 69, 3.7),
 (11, 68, 3.5),(12, 67, 3.2),(13, 63, 3.7),(14, 62, 3.3),(15, 60, 3.4),
 (16, 63, 4.0),(17, 65, 4.1),(18, 67, 3.8),(19, 63, 3.4),(20, 61, 3.6);
If you like to think about such problems concretely, you can think of id as a subject's id, x as a subject's height, and y as a subject's score on a self-confidence questionnaire, so we would be computing a correlation between height and self-confidence.

There are many correlation formulas. Most commonly used is the Pearson product-moment correlation coefficient, which is valid only for normally distributed data (data which roughly fits a bell curve). A good Wikipedia page offers several formulas for the Pearson coefficient. Note that when the data skews significantly from a normal distribution, you very likely need a different formula. That's one problem with correlation: it isn't always a valid number to calculate. Another problem is that computing just the correlation coefficient is never enough. You also need to find the probability that the coefficient you calculated is significantly different from 0; that is something you look up in a statistical table, and it's outside our scope here. You will probably also want the slope of the relationship or regression line, its intercept with the Y axis, and the coefficient of regression, which is the proportion of observed variation due to correlation.

For this example we assume a normal distribution and no missing values. Calculation of the correlation coefficient needs two passes: a first to calculate basic statistical quantities, then a second to calculate the slope, intercept and correlation coefficient from those basic quantities. Calculating the coefficient of regression requires a third pass. If we're computing correlation in a database like MySQL, then, you'd expect Views to be useful. Unfortunately we're immediately bitten by two limitations in the MySQL implementation of Views: subqueries in the FROM clause of a View are not supported, and neither are parameters. Then do we need to encapsulate correlation in a stored procedure? We could, but we needn't, because of how r calculates. We like a more direct solution.

The algorithm is:
1. Calculate the required basic statistics.
2. Use the results of #1 to calculate slope, intercept and r.
3. Use the slope and intercept from #2 to calculate the coefficient of regression.
4. Collect and display the results.
In this solution we do it all in one query:
1. The innermost query does step 1.
2. The query that references the inner query does step 2.
3. The next outer query calculates the regression coefficient by cross-joining the step 2 query result, which is one logical row, with the original table.
4. The outermost query displays the result.
Read the query inside out:
SELECT -- Step 4 N, Slope, avgY - slope*avgX AS Intercept, Correlation, CoeffOfReg FROM ( SELECT -- Step 3 N, avgX, avgY, slope, intercept, Correlation, FORMAT( 1 - SUM((y - intercept - slope*x)*(y - intercept - slope*x))/ ((N-1)*varY), 5 ) AS CoeffOfReg FROM t AS t2 JOIN ( SELECT -- Step 2 N, avgX, avgY, varY, slope, Correlation, avgY - slope*avgX AS intercept FROM ( SELECT N, avgX, avgY, varY, FORMAT(( N*sumXY - sumX*sumY ) / ( N*sumsqX - sumX*sumX ), 5 ) AS slope, FORMAT(( sumXY - n*avgX*avgY ) / ( (N-1) * SQRT(varX) * SQRT(varY)), 5 ) AS Correlation FROM ( SELECT -- Step 1 COUNT(x) AS N, AVG(x) AS avgX, SUM(x) AS sumX, SUM(x*x) AS sumsqX, VAR_SAMP(x) AS varX, AVG(y) AS avgY, SUM(y) AS sumY, SUM(y*y) AS sumsqY,
VAR_SAMP(y) AS varY, SUM(x*y) AS sumXY FROM t ) AS sums ) AS calc ) stats ) combined; +----+---------+---------------------+-------------+------------+ | N | Slope | Intercept | Correlation | CoeffOfReg | +----+---------+---------------------+-------------+------------+ | 20 | 0.07066 | -0.8661640047683719 | 0.73064 | 0.53383 | +----+---------+---------------------+-------------+------------+
Last updated 22 May 2009
Median
Statistically, the median is the middle value--the value that is smaller than that found in half of all remaining rows, and larger than that found in the other half:
SELECT l1.hours AS Median FROM BulbLife AS l1, BulbLife AS l2 GROUP BY l1.hours HAVING SUM(CASE WHEN l2.hours <= l1.hours THEN 1 ELSE 0 END) >= (COUNT(*)+1) / 2 AND SUM(CASE WHEN l2.hours >= l1.hours THEN 1 ELSE 0 END) >= (COUNT(*)/2) + 1;
An anonymous reader pointed out that this will cost O(N*N), ie it does not scale, so we posted a MySQL implementation of Torben Mogensen's algorithm for calculating the median (http://ndevilla.free.fr/median/median/node20.html), which is said to be amongst the fastest. It also proved too slow. Now Joe Wynne has offered an algorithm which appears to be correct, and which does scale. Here it is as a MySQL stored procedure:
DROP PROCEDURE IF EXISTS Median; DELIMITER | CREATE PROCEDURE Median( tbl CHAR(64), col CHAR(64), OUT res DOUBLE ) BEGIN DECLARE arg CHAR(64); SET @sql = CONCAT( 'SELECT ((COUNT(*))/2) INTO @c FROM ', tbl ); PREPARE stmt FROM @sql; EXECUTE stmt; DROP PREPARE stmt; SET @a = CONVERT(FLOOR(@c), SIGNED); IF @a = @c THEN BEGIN SET @a = @a-1; SET @b = 2; SET arg = CONCAT( 'AVG(', col, ')' ); END; ELSE BEGIN SET @b = 1; SET arg = col; END; END IF; SET @sql = CONCAT('SELECT ', arg, ' INTO @res FROM (SELECT ', col, ' FROM ', tbl, ' ORDER BY ', col, ' LIMIT ?,?) as tmp'); PREPARE stmt FROM @sql; EXECUTE stmt USING @a, @b; DROP PREPARE stmt; SET res=@res; END; | DELIMITER ;
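A usage sketch, assuming a BulbLife table like the one queried above:

-- table name and column name are passed as strings; the result comes back in the OUT parameter
CALL Median('BulbLife', 'hours', @median);
SELECT @median;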
Why don't we make it a function? Because MySQL functions do not (yet?) allow dynamic SQL.
Mode
Statistically, the mode is the most frequently occurring value. Given tables parent(id int) and child(pid int, cid int), where child.pid references parent.id as a foreign key, what query finds the parent.id most often represented in child.pid, that is, the mode of child.pid?
SELECT pid, COUNT(*) AS frequency FROM child GROUP BY pid ORDER BY frequency DESC LIMIT 1;
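LIMIT 1 returns only one modal pid even when several tie for the top frequency; here is a sketch that returns all of them:

-- keep every pid whose count equals the highest count
SELECT pid, COUNT(*) AS frequency
FROM child
GROUP BY pid
HAVING frequency = (
  SELECT COUNT(*) FROM child GROUP BY pid ORDER BY COUNT(*) DESC LIMIT 1
);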
Rank order
Without MSSQL's RANK() aggregate function, how do we display rank order in a MySQL query, for example from a table like this?
CREATE TABLE votes( name CHAR(10), votes INT ); INSERT INTO votes VALUES ('Smith',10),('Jones',15),('White',20),('Black',40),('Green',50),('Brown',20);
The query is a two-step:
1. Join the table to itself on the value to be ranked, handling ties.
2. Group and order the result of the self-join on rank:
SELECT v1.name, v1.votes, COUNT(v2.votes) AS Rank FROM votes v1 JOIN votes v2 ON v1.votes < v2.votes OR (v1.votes=v2.votes and v1.name = v2.name) GROUP BY v1.name, v1.votes ORDER BY v1.votes DESC, v1.name DESC; +-------+-------+------+ | name | votes | Rank | +-------+-------+------+ | Green | 50 | 1 | | Black | 40 | 2 | | White | 20 | 3 | | Brown | 20 | 3 | | Jones | 15 | 5 | | Smith | 10 | 6 | +-------+-------+------+
Suppose you want to know the vote count for White and the names and tallies for the next highest and next lowest counts:
SELECT a.*
FROM (
  SELECT s1.name, s1.votes, COUNT(s2.name) rank, IF(s1.name = 'White','<-',' ') AS 'Near Ranks'
  FROM votes s1
  JOIN votes s2 ON s1.votes < s2.votes OR (s1.votes = s2.votes AND s1.name = s2.name)
  GROUP BY s1.name, s1.votes
  ORDER BY rank
) a
LEFT JOIN (
  SELECT s1.name, s1.votes, COUNT(s2.name) rank, IF(s1.name = 'White','<-',' ') AS 'Near Ranks'
  FROM votes s1
  JOIN votes s2 ON s1.votes < s2.votes OR (s1.votes = s2.votes AND s1.name = s2.name)
  GROUP BY s1.name, s1.votes
  ORDER BY rank
) b ON a.rank BETWEEN b.rank - 1 AND b.rank + 1
WHERE b.name = 'White';
+-------+-------+------+------------+
| name  | votes | rank | Near Ranks |
+-------+-------+------+------------+
| Black |    40 |    2 |            |
| White |    20 |    3 | <-         |
| Brown |    20 |    3 |            |
+-------+-------+------+------------+
But these methods don't scale; they're O(N*N) or worse, especially if non-trivial calculation is needed to compute rank. If you do not need to identify ties in the ranking column, writing a temporary ordering table and then querying it is often much faster:
DROP TEMPORARY TABLE IF EXISTS tmp;
SET @i=0;
CREATE TEMPORARY TABLE tmp
SELECT idcol, valuecol, @i:=@i+1 AS rank
FROM tbl
ORDER BY valuecol DESC;
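Rank lookups are then simple queries against tmp; a sketch, with a hypothetical id value:

SELECT idcol, valuecol, rank FROM tmp ORDER BY rank LIMIT 10;   -- top ten
SELECT rank FROM tmp WHERE idcol = 42;                          -- 42 is a hypothetical id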
Baron Schwartz has a useful page on optimising rankings for leaderboards. It requires maintenance of a ranking computation table, as described on his page under "How to maintain the rank column". That table has this structure:
CREATE TABLE score_ranked (
  gamer INT NOT NULL,
  game INT NOT NULL,
  score INT NOT NULL,
  rank_in_game INT NOT NULL DEFAULT 0,
  PRIMARY KEY (gamer,game),
  KEY (game,score),
  KEY (score),
  KEY (game,rank_in_game)
) ENGINE=InnoDB;
If you use Baron's schema, you will find the following query, developed on a MySQL forum by Rick James, reasonably efficient for retrieving top score sums for all games:
SELECT @seq := 0,           -- raw 1,2,3,...
       @rank := 0,          -- allow for ties
       @prev := 999999999;  -- catch ties
SELECT Rank, gamer, point
FROM (
  SELECT @seq := IF(@prev = point, @seq, @seq + 1) AS seq,
         @rank := IF(@prev = point, @rank, @seq) AS Rank,
         @prev := point AS prev,   # I forgot this
         gamer, point
  FROM (
    SELECT gamer, SUM(score) AS point
    FROM score_ranked s
    GROUP BY gamer
    ORDER BY point DESC
  ) x
) y
WHERE Rank <= 10
ORDER BY Rank, gamer;
Last updated 19 Nov 2009
Backslashes in data
Backslashes multiply weirdly:
SELECT 'a\b' RLIKE 'a\b';
returns 1, as does...
SELECT 'a\\b' RLIKE 'a\\\\b';
because in a pair of backslashes, the second is not escaped by the first, so to compare two literals you double each backslash in the RLIKE argument. But if you are querying a table for such a string from the MySQL client, this doubling happens twice--once in the client, and once in the database--so to find a column value matching 'a\\b', you need to write...
SELECT `desc` FROM xxx WHERE `desc` RLIKE 'aa\\\\\\\\bb';
Compare data in two tables
To list the rows that differ between two identically structured tables, UNION the two tables, group on all the columns, and keep only the groups with a single row. For convenience, encapsulate the query in a stored procedure with params for the two table names and for a comma-separated list of column names starting with the primary key:
drop procedure if exists comparetables; delimiter go create procedure comparetables( tbla char(64), tblb char(64), columnlist char(255) ) begin set @sql = concat( " SELECT MIN(TableName) as TableName, ", columnlist, " FROM ( ", " SELECT '", tbla, "' as TableName, ", columnlist, " FROM ", tbla, " UNION ALL ", " SELECT '", tblb, "' as TableName, ", columnlist, " FROM ", tblb, ") AS tmp ", " GROUP BY ", columnlist, " HAVING COUNT(*) = 1 ", " ORDER BY 1" ); prepare stmt from @sql; execute stmt; drop prepare stmt; end; go delimiter ; drop table if exists a,b; create table a(i int primary key,j int); insert into a values(1,1),(2,2),(3,3); create table b select * from a; update b set j=4 where i=3; call comparetables('a','b','i,j'); +-----------+------+------+ | TableName | i | j |
+-----------+------+------+ | a | 3 | 3 | | b | 3 | 4 | +-----------+------+------+
Last updated 26 Nov 2010
We want to retrieve only rows whose `p` values differ from immediately previous values (marked by * above). As with running sums, we get the desired listing by tracking row-to-row value changes with user variables:
set @p=''; set @d=''; select p 'Pct Changed',d Date from ( select p, if( p<>@p, d, @d ) as d, @p:=p, @d:=d from t order by d ) as t group by d; +-------------+------------+ | Pct Changed | Date | +-------------+------------+ | 50% | 2008-05-01 | | 30% | 2008-05-02 | | 50% | 2008-05-04 | | 20% | 2008-05-06 | | 50% | 2008-05-08 | | 70% | 2008-05-09 | +-------------+------------+
Last updated 11 Aug 2010
Add auto-incrementing primary key to a table
Create a copy of the table with a user-variable row number as its first column, then ALTER the table to add auto_increment and primary key properties to the new column. So given table t with columns named `dt` and `observed`...
DROP TABLE IF EXISTS t2; SET @id=0; CREATE TABLE t2 SELECT @id:=@id+1 AS id, dt, observed FROM t ORDER BY dt; ALTER TABLE t2 MODIFY id INT AUTO_INCREMENT PRIMARY KEY; DROP TABLE t; RENAME TABLE t2 TO t;
Change or drop a foreign key
Notice that you can omit the CONSTRAINT clause when you declare a foreign key, but the only way to DROP a foreign key is to reference it by the constraint_name, which you probably never specified! There should be a circle of hell reserved for designers who build inconsistencies like this into their tools. The only way round this one is to run SHOW CREATE TABLE to find out what the foreign key's constraint_name is, so you can write the DROP statement. Here is a wee test case:
drop table if exists a,b;
create table a(i int primary key) engine=innodb;
create table b(i int, foreign key(i) references a(i)) engine=innodb;
show create table b\G
CREATE TABLE `b` (
  `i` int(11) DEFAULT NULL,
  KEY `i` (`i`),
  CONSTRAINT `b_ibfk_1` FOREIGN KEY (`i`) REFERENCES `a` (`i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
-- drop and recreate the FK:
alter table b drop foreign key b_ibfk_1;
alter table b add foreign key(i) references a(i) on update cascade;
show create table b\G
Create Table: CREATE TABLE `b` (
  `i` int(11) DEFAULT NULL,
  KEY `i` (`i`),
  CONSTRAINT `b_ibfk_1` FOREIGN KEY (`i`) REFERENCES `a` (`i`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
drop table a,b;
ORDER BY ordinal_position;
For MySQL 5.0.2 or later, here is a query that lists all table structure differences between any two tables. It selects all information_schema.columns rows for one table, does the same for the second table, UNIONs these two queries, then uses HAVING to pick only those rows where the COUNT(*) in the union is 1, that is, where any column of one table differs from its mate. To avoid having to cut and paste database and table names, save it as a stored procedure in any database (other than information_schema):
DROP PROCEDURE IF EXISTS CompareTableStructs; -- Uncomment if MySQL version is 5.0.6-5.0.15: -- SET GLOBAL log_bin_trust_routine_creators=TRUE; DELIMITER | CREATE PROCEDURE CompareTableStructs ( IN db1 CHAR(64), IN tbl1 CHAR(64), IN db2 CHAR(64), IN tbl2 CHAR(64) ) SELECT MIN(TableName) AS TableName, column_name, ordinal_position, column_default, is_nullable, data_type, character_maximum_length, numeric_precision, numeric_scale, character_set_name, collation_name, column_type, column_key, extra, privileges, column_comment FROM ( SELECT tbl1 as TableName, column_name, ordinal_position, column_default, is_nullable, data_type, character_maximum_length, numeric_precision, numeric_scale, character_set_name, collation_name, column_type, column_key, extra, privileges, column_comment FROM information_schema.columns AS i1 WHERE table_schema=db1 AND table_name=tbl1 UNION ALL SELECT tbl2 as TableName, column_name, ordinal_position, column_default, is_nullable, data_type, character_maximum_length, numeric_precision, numeric_scale, character_set_name, collation_name, column_type, column_key, extra, privileges, column_comment FROM information_schema.columns AS i2 WHERE table_schema=db2 AND table_name=tbl2 ) AS tmp GROUP BY column_name HAVING COUNT(*) = 1 ORDER BY column_name ; | DELIMITER ;
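Hypothetical usage, comparing a table that exists under the same name in two databases (the names are placeholders):

CALL CompareTableStructs( 'db1', 'mytable', 'db2', 'mytable' );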
FROM a UNION ALL SELECT 'Table b' as TableName, b.id, b.col1, b.col2, b.col3, ... FROM b ) AS tmp GROUP BY id, col1, col2, col3, ... HAVING COUNT(*) = 1 ORDER BY ID;
To apply this logic to the comparison of two database structures:
1. write temp tables collecting the desired information_schema metadata for each database
2. map the compare-data query template to those two metadata tables
This logic is easiest to re-use when it is parameterised in a stored procedure kept in a system database:
USE sys; DROP PROCEDURE IF EXISTS CompareDBs; DELIMITER | CREATE PROCEDURE CompareDBs( vdb1 VARCHAR(64), vdb2 VARCHAR(64) ) BEGIN DROP TEMPORARY TABLE IF EXISTS desc1,desc2; CREATE TEMPORARY TABLE desc1 SELECT t1.table_schema, t1.table_name, t1.table_type, t1.engine, c1.column_name, c1.ordinal_position, c1.column_type, c1.column_default, c1.is_nullable, c1.column_key FROM information_schema.tables t1 JOIN information_schema.columns c1 USING (table_schema,table_name) WHERE t1.table_schema=vdb1 ORDER BY t1.table_name,c1.column_name; CREATE TEMPORARY TABLE desc2 SELECT t1.table_schema, t1.table_name, t1.table_type, t1.engine, c1.column_name, c1.ordinal_position, c1.column_type, c1.column_default, c1.is_nullable, c1.column_key FROM information_schema.tables t1 JOIN information_schema.columns c1 USING (table_schema,table_name) WHERE t1.table_schema=vdb2 ORDER BY t1.table_name,c1.column_name; SELECT TableName,column_name,MIN(SchemaName),table_type,engine, ordinal_position,column_type,column_default,is_nullable,column_key FROM ( SELECT a.table_schema AS SchemaName,a.table_name AS TableName,a.table_type,a.engine, a.column_name,a.ordinal_position,a.column_type,a.column_default,a.is_nullable,a.column_key FROM desc1 a UNION ALL SELECT b.table_schema AS SchemaName,b.table_name AS TableName,b.table_type,b.engine, b.column_name,b.ordinal_position,b.column_type,b.column_default,b.is_nullable,b.column_key FROM desc2 b ) AS tmp GROUP BY TableName,table_type,engine,column_name,ordinal_position,column_type,column_default,is_nullable,column_key HAVING COUNT(*) = 1 ORDER BY TableName,column_name,SchemaName; DROP TEMPORARY TABLE desc1, desc2; END | DELIMITER ;
Call it as follows:
CALL CompareDBs( 'db1', 'db2' );
MEMORY temporary tables would speed it up, but unfortunately MySQL MEMORY tables do not support BLOB/TEXT columns.
Database size
SELECT table_schema AS 'Db Name', Round( Sum( data_length + index_length ) / 1024 / 1024, 3 ) AS 'Db Size (MB)', Round( Sum( data_free ) / 1024 / 1024, 3 ) AS 'Free Space (MB)' FROM information_schema.tables
GROUP BY table_schema ;
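The same information_schema data answers narrower questions; a minimal sketch reporting a single database, with a hypothetical database name:

SELECT table_schema AS 'Db Name',
       ROUND( SUM( data_length + index_length ) / 1024 / 1024, 3 ) AS 'Db Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'test'   -- hypothetical database name
GROUP BY table_schema;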
Last updated 18 Jun 2010
Find child tables
The obvious way to find a table's child tables (the tables whose foreign keys reference it) is to query information_schema. Unfortunately such a query performs slowly, not because of how it is written, but because of how MySQL has implemented information_schema. Indeed a bug report page is devoted to the issue. It says MySQL AB will not be speeding up information_schema query performance any time soon. In theUsual we worked around this problem with a PHP function that queries information_schema if that is required, but by default parses the results of iterative SHOW TABLES commands. The SHOW TABLES method logic is simple, and will port readily to another application language. It runs 10-50 times faster than the equivalent information_schema query. The following version looks for child tables in one database; it slows down a bit when modified to search all server DBs, but even then it is much faster than its information_schema equivalent. It assumes an available connection object $conn:
function childtables( $db, $table, $via_infoschema=FALSE ) { GLOBAL $conn; $ret = array(); if( $via_infoschema ) { $res = mysql_query( childtablesqry( $db, $table )) || die( mysql_error() ); if( !is_bool( $res )) while( $row = mysql_fetch_row( $res )) $ret[] = $row; } else { $tables = array(); $res = mysql_query( "SHOW TABLES" ); while( $row = mysql_fetch_row( $res )) $tables[] = $row[0]; $res = mysql_query( "SELECT LOCATE('ANSI_QUOTES', @@sql_mode)" ); $ansi_quotes = $res ? mysql_result( $res, 0 ) : 0; $q = $ansi_quotes ? '"' : "`"; $sref = ' REFERENCES ' . $q . $table . $q . ' (' . $q; foreach( $tables as $referringtbl ) { $res = mysql_query( "SHOW CREATE TABLE $referringtbl" ); $row = mysql_fetch_row( $res ); if(( $startref = stripos( $row[1], $sref )) > 0 ) { $endref = strpos( $row[1], $q, $startref + strlen( $sref )); $referencedcol = substr( $row[1], $startref+strlen($sref), $endref-$startref-strlen($sref) ); $endkey = $startref; while( substr( $row[1], $endkey, 1 ) <> $q ) $endkey--; $startkey = --$endkey; while( substr( $row[1], $startkey, 1 ) <> $q ) $startkey--; $referencingcol = substr( $row[1], $startkey+1, $endkey - $startkey );
$ret[] = array( $db, $referringtbl, $referencingcol, $referencedcol ); } } } return $ret; } function childtablesqry( $db, $table ) { return "SELECT c.table_schema,u.table_name,u.column_name,u.referenced_column_name " . "FROM information_schema.table_constraints AS c " . "INNER JOIN information_schema.key_column_usage AS u " . "USING( constraint_schema, constraint_name ) " . "WHERE c.constraint_type = 'FOREIGN KEY' " . "AND u.referenced_table_schema='$db' " . "AND u.referenced_table_name = '$table' " . "ORDER BY c.table_schema,u.table_name"; }
Last updated 28 Jan 2011
Find the size of all databases on the server
SELECT s.schema_name AS 'Schema',
       SUM(t.data_length) AS Data,
       SUM(t.index_length) AS Indexes,
       SUM(t.data_length) + SUM(t.index_length) AS 'Mb Used',
       IF( SUM(t.data_free)=0, '', SUM(t.data_free) ) AS 'Mb Free',
       IF( SUM(t.data_free)=0, '',
           100 * (SUM(t.data_length) + SUM(t.index_length)) /
                 (SUM(t.data_length) + SUM(t.index_length) + SUM(IFNULL(t.data_free,0))) ) AS 'Pct Used',
       COUNT(table_name) AS Tables
FROM information_schema.schemata s
LEFT JOIN information_schema.tables t ON s.schema_name = t.table_schema
GROUP BY s.schema_name WITH ROLLUP;
OR Create_priv = 'Y' OR Drop_priv = 'Y' OR Reload_priv = 'Y' OR Shutdown_priv = 'Y' OR Process_priv = 'Y' OR File_priv = 'Y' OR Grant_priv = 'Y' OR References_priv = 'Y' OR Index_priv = 'Y' OR Alter_priv = 'Y' OR Show_db_priv = 'Y' OR Super_priv = 'Y' OR Create_tmp_table_priv = 'Y' OR Lock_tables_priv = 'Y' OR Execute_priv = 'Y' OR Repl_slave_priv = 'Y' OR Repl_client_priv = 'Y' OR Create_view_priv = 'Y' OR Show_view_priv = 'Y' OR Create_routine_priv = 'Y' OR Alter_routine_priv = 'Y' OR Create_user_priv = 'Y' OR Event_priv = 'Y' OR Trigger_priv = 'Y' UNION SELECT host,user FROM mysql.db WHERE db=dbname AND ( Select_priv = 'Y' OR Insert_priv = 'Y' OR Update_priv = 'Y' OR Delete_priv = 'Y' OR Create_priv = 'Y' OR Drop_priv = 'Y' OR Grant_priv = 'Y' OR References_priv = 'Y' OR Index_priv = 'Y' OR Alter_priv = 'Y' OR Create_tmp_table_priv = 'Y' OR Lock_tables_priv = 'Y' OR Create_view_priv = 'Y' OR Show_view_priv = 'Y' OR Create_routine_priv = 'Y' OR Alter_routine_priv = 'Y' OR Execute_priv = 'Y' OR Event_priv = 'Y' OR Trigger_priv = 'Y' ) UNION SELECT host,user FROM mysql.tables_priv WHERE db=dbname UNION SELECT host,user FROM mysql.columns_priv WHERE db=dbname; | DELIMITER ; CALL ListDbUsers( 'test' );
Rename Database
It's sometimes necessary to rename a database. MySQL 5.0 has no command for it. Simply bringing down the server to rename a database directory is not safe. MySQL 5.1.7 introduced a RENAME DATABASE command, but the command left several unchanged database objects behind, and was found to lose data, so it was dropped in 5.1.23. The job is a natural for a stored procedure using dynamic (prepared) statements, since PREPARE supports CREATE | RENAME TABLE. Precautions:
1. Before calling the sproc, the new database must have been created.
2. The procedure refuses to rename the mysql database.
3. The old database is left behind, minus what was moved.
DROP PROCEDURE IF EXISTS RenameDatabase; DELIMITER go CREATE PROCEDURE RenameDatabase( oldname CHAR (64), newname CHAR(64) ) BEGIN DECLARE version CHAR(32); DECLARE sname CHAR(64) DEFAULT NULL; DECLARE rows INT DEFAULT 1; DECLARE changed INT DEFAULT 0; IF STRCMP( oldname, 'mysql' ) <> 0 THEN REPEAT SELECT table_name INTO sname FROM information_schema.tables AS t WHERE t.table_type='BASE TABLE' AND t.table_schema = oldname LIMIT 1; SET rows = FOUND_ROWS(); IF rows = 1 THEN SET @scmd = CONCAT( 'RENAME TABLE `', oldname, '`.`', sname,
'` TO `', newname, '`.`', sname, '`' ); PREPARE cmd FROM @scmd; EXECUTE cmd; DEALLOCATE PREPARE cmd; SET changed = 1; END IF; UNTIL rows = 0 END REPEAT; IF changed > 0 THEN SET @scmd = CONCAT( "UPDATE mysql.db SET Db = '", newname, "' WHERE Db = '", oldname, "'" ); PREPARE cmd FROM @scmd; EXECUTE cmd; DROP PREPARE cmd; SET @scmd = CONCAT( "UPDATE mysql.proc SET Db = '", newname, "' WHERE Db = '", oldname, "'" ); PREPARE cmd FROM @scmd; EXECUTE cmd; DROP PREPARE cmd; SELECT version() INTO version; IF version >= '5.1.7' THEN SET @scmd = CONCAT( "UPDATE mysql.event SET db = '", newname, "' WHERE db = '", oldname, "'" ); PREPARE cmd FROM @scmd; EXECUTE cmd; DROP PREPARE cmd; END IF; SET @scmd = CONCAT( "UPDATE mysql.columns_priv SET Db = '", newname, "' WHERE Db = '", oldname, "'" ); PREPARE cmd FROM @scmd; EXECUTE cmd; DROP PREPARE cmd; FLUSH PRIVILEGES; END IF; END IF; END; go DELIMITER ;
Last updated 16 Jun 2010
END; go DELIMITER ;
If you have a collection of generic stored procs like these, it's most convenient to keep them in one place for easy accessibility. We keep ours in a sys database.
Last updated 09 Oct 2009
Show Tables
The MySQL SHOW TABLES command is fine, but sometimes we want a little more information. This simple stored procedure lists the table name, engine type, version, collation and rowcount for every table in a database. (Individual databases come and go, so we keep all such database-wide stored routines in a system database.)
DROP PROCEDURE IF EXISTS showtables; CREATE PROCEDURE showtables() SELECT table_name AS 'Table', IFNULL(engine, 'VIEW') AS Engine, version AS Version, table_collation AS Collation, table_rows AS Rows FROM information_schema.tables WHERE table_schema=DATABASE();
If you often want to list tables in a DB not currently USEd, here is a parameter-driven version:
DROP PROCEDURE IF EXISTS showtablesDB; CREATE PROCEDURE showtablesDB( IN dbname CHAR(64) ) SELECT table_name AS 'Table', IFNULL(engine, 'VIEW') AS Engine, version AS Version, table_collation AS Collation, table_rows AS Rows FROM information_schema.tables WHERE table_schema=dbname;
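Hypothetical usage; 'test' stands in for whatever database you want listed:

CALL showtables();
CALL showtablesDB( 'test' );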
Age in years
Given a birthdate in @dob, here are two simple formulae for age in years:
Date_format( From_Days( To_Days(Curdate()) - To_Days(@dob) ), '%Y' ) + 0 Year(Curdate()) - Year(@dob) - ( Right(Curdate(),5) < Right(@dob,5) )
and here is one for age in years to two decimal places, ignoring day of month:
Round((((Year(now()) - Year(@dob)))*12 + (((Month(now()) - Month(@dob)))))/12, 2)
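A quick way to sanity-check the three formulae side by side; the birthdate is a hypothetical value:

SET @dob = '1990-06-15';
SELECT Date_format( From_Days( To_Days(Curdate()) - To_Days(@dob) ), '%Y' ) + 0 AS WholeYears1,
       Year(Curdate()) - Year(@dob) - ( Right(Curdate(),5) < Right(@dob,5) ) AS WholeYears2,
       Round( ( (Year(Now()) - Year(@dob))*12 + (Month(Now()) - Month(@dob)) ) / 12, 2 ) AS YearsTwoDecimals;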
Last updated 18 Jul 2011
Appointments available
Given a clinic of physicians, patients and appointments, how to find an available appointment time for a given physician? This is a variant of the [Not] Exists query pattern. Though we can write it with subqueries, performance will be crisper with a join. But finding data that is not there requires a join to data which is there. So in addition to tables for appointments, doctors and patients, we need a table of all possible appointment datetimes. Here's a schema illustrating the idea ...
CREATE TABLE a_dt    ( d DATE, t TIME );                              -- possible appointment dates and times
CREATE TABLE a_drs   ( did INT );                                     -- doctors
CREATE TABLE a_pts   ( pid INT );                                     -- patients
CREATE TABLE a_appts ( aid INT, did INT, pid INT, d DATE, t TIME );   -- booked appointments
Now we can apply the [Not] Exists query pattern. To find free appointment datetimes for a given doctor in a given datetime range, we left join possible appointments to existing appointments on date and time and doctor, add Where conditions for desired appointment datetimes, and finally add a Where condition that the appointment slot be null, i.e. free...
SELECT d.did, a.d, a.t FROM a_dt AS a LEFT JOIN a_appts AS ap USING (d,t) LEFT JOIN a_drs AS d ON a.d = ap.d AND a.t = ap.t AND ap.did = d.did AND ap.did = 1 WHERE a.d BETWEEN desired_start_date AND desired_end_date AND a.t BETWEEN desired_start_time AND desired_end_time AND ap.aid IS NULL;
If that solution is not available, you have to make do with a weekday count, which this function (corrected 6 Jul 2009) computes:
SET GLOBAL log_bin_trust_function_creators=1; DROP FUNCTION IF EXISTS BizDaysInclusive; DELIMITER | CREATE FUNCTION BizDaysInclusive( d1 DATE, d2 DATE ) RETURNS INT DETERMINISTIC BEGIN DECLARE dow1, dow2, days INT; SET dow1 = DAYOFWEEK(d1); SET dow2 = DAYOFWEEK(d2); SET days = FLOOR( DATEDIFF(d2,d1)/7 ) * 5 + CASE WHEN dow1=1 AND dow2=7 THEN 5 WHEN dow1 IN(7,1) AND dow2 IN (7,1) THEN 0 WHEN dow1=dow2 THEN 1 WHEN dow1 IN(7,1) AND dow2 NOT IN (7,1) THEN dow2-1 WHEN dow1 NOT IN(7,1) AND dow2 IN(7,1) THEN 7-dow1
WHEN dow1<=dow2 THEN dow2-dow1+1 WHEN dow1>dow2 THEN 5-(dow1-dow2-1) ELSE 0 END; RETURN days; END; | DELIMITER ; SELECT BizDaysInclusive('2009-8-1','2009-9-15'); +------------------------------------------+ | BizDaysInclusive('2009-8-1','2009-9-15') | +------------------------------------------+ | 32 | +------------------------------------------+
So does the following algorithm, posted by Elowie Cruz on a MySQL forum; for ease of use we encapsulate it in a stored procedure:
DROP PROCEDURE IF EXISTS daycount; CREATE PROCEDURE DayCount( d1 DATE, d2 DATE ) SELECT dd.iDiff, dd.iDiff - dd.iWeekEndDays AS iWorkDays, dd.iWeekEndDays FROM ( SELECT dd.iDiff, ((dd.iWeeks * 2) + IF(dd.iSatDiff >= 0 AND dd.iSatDiff < dd.iDays, 1, 0) + IF (dd.iSunDiff >= 0 AND dd.iSunDiff < dd.iDays, 1, 0)) AS iWeekEndDays FROM ( SELECT dd.iDiff, FLOOR(dd.iDiff / 7) AS iWeeks, dd.iDiff % 7 iDays, 5 - dd.iStartDay AS iSatDiff, 6 - dd.iStartDay AS iSunDiff FROM ( SELECT 1 + DATEDIFF(d2, d1) AS iDiff, WEEKDAY(d1) AS iStartDay ) AS dd ) AS dd ) AS dd ; CALL DayCount( '2009-8-1','2009-9-15'); +-------+-----------+--------------+ | iDiff | iWorkDays | iWeekEndDays | +-------+-----------+--------------+ | 46 | 32 | 14 | +-------+-----------+--------------+
Last updated 10 Jul 2009
You would not want to use that function on long date spans in a big table, but it will do for testing. Now, how to count the number of Tuesdays, say, between two dates? The basic logic is:
1. Count weeks between the two dates.
2. If the beginning and ending weekdays are the same, then if they're Tuesday, the answer is weeks+1, otherwise it's just weeks.
3. Otherwise, if the beginning weekday <= the ending weekday, then if Tuesday is between them, the answer is weeks+1, otherwise it's just weeks.
4. Otherwise the ending weekday is less than the starting weekday; if Tuesday >= the starting weekday or <= the ending weekday, the answer is weeks+1, otherwise it's just weeks.
For a convenient datasource, we'll use the two date columns orderdate and shippeddate in the orders table of the NorthWind database, and we'll use our brute force function DayCount() to check results:
SET @day = 3; SELECT DATE_FORMAT(orderdate,'%y%m%d') AS OrdDt, DATE_FORMAT(shippeddate,'%y%m%d') AS ShipDt, LEFT(DAYNAME(orderdate),3) AS D1, LEFT(DAYNAME(shippeddate),3) AS D2, @dow1 := DAYOFWEEK(orderdate) AS 'dw1', @dow2 := DAYOFWEEK(shippeddate) AS 'dw2', @days := DATEDIFF(shippeddate,orderdate) AS Days, @wks := FLOOR( @days / 7 ) AS Wks, FLOOR( IF( @dow1 = @dow2, IF( @day = @dow1, @wks+1, @wks), IF( @dow1 < @dow2, IF( @day BETWEEN @dow1 AND @dow2, @wks+1, @wks), IF( @day >= @dow1 OR @day <= @dow2, @wks+1, @wks ) ) ) ) AS Res, DayCount(DATE(orderdate),DATE(shippeddate),@day) AS Chk FROM orders HAVING !ISNULL(res-chk) AND res-chk <> 0; Empty set (0.00 sec)
No errors. We get the same result for @day = 1, 2, 4, 5, 6 and 7. But the formula is buried in the specifics of one table, so abstract it to a reusable function:
DROP FUNCTION IF EXISTS NamedDaysBetween; DELIMITER | CREATE FUNCTION NamedDaysBetween( d1 DATE, d2 DATE, daynum SMALLINT ) RETURNS INT BEGIN DECLARE dow1, dow2, wks, days INT; IF !ISNULL(d1) AND !ISNULL(d2) THEN SET dow1 = DAYOFWEEK( d1 ); SET dow2 = DAYOFWEEK( d2 ); SET days = DATEDIFF( d2, d1 ); SET wks = FLOOR( days / 7 ); SET days = IF( dow1 = dow2, IF( daynum = dow1, wks+1, wks), IF( dow1 < dow2, IF( daynum BETWEEN dow1 AND dow2, wks+1, wks), IF( daynum >= dow1 OR daynum <= dow2, wks+1, wks ) ) ); END IF; RETURN days; END; | DELIMITER ;
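A quick check against the date range used earlier in this section; with daynum 3 (Tuesday, in DAYOFWEEK numbering) the result should be 7, for the Tuesdays of 4, 11, 18, 25 Aug and 1, 8, 15 Sep 2009:

SELECT NamedDaysBetween( '2009-08-01', '2009-09-15', 3 ) AS Tuesdays;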
| 2008-01-28 | +--------------+ set @d='2008-7-15'; select makedate( left(yearweek(@d),4),week( @d, 2 ) * 7 ) as 1stdayOfWeek; +--------------+ | 1stdayOfWeek | +--------------+ | 2008-07-14 | +--------------+
Datetime difference
Find the difference between two datetime values in seconds, minutes, hours or days. If dt1 and dt2 are datetime values of the form 'yyyy-mm-dd hh:mm:ss', the number of seconds between dt1 and dt2 is
UNIX_TIMESTAMP( dt2 ) - UNIX_TIMESTAMP( dt1 )
To get the number of minutes divide by 60, for the number of hours divide by 3600, and for the number of days, divide by 3600 * 24.
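A worked example with two hypothetical datetimes 30 hours apart; note that UNIX_TIMESTAMP() works in the session time zone, so a DST jump inside the interval would shift the result by an hour:

SET @dt1 = '2011-07-30 06:00:00', @dt2 = '2011-07-31 12:00:00';
SELECT UNIX_TIMESTAMP(@dt2) - UNIX_TIMESTAMP(@dt1) AS secs,                   -- 108000
       ( UNIX_TIMESTAMP(@dt2) - UNIX_TIMESTAMP(@dt1) ) / 3600 AS hours,       -- 30
       ( UNIX_TIMESTAMP(@dt2) - UNIX_TIMESTAMP(@dt1) ) / (3600*24) AS days;   -- 1.25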
Reservation systems usually adopt the closed-open convention of representing when reservations begin and end. For example, if you book a hotel room for 22 May through 24 May, the hotel expects you to stay overnight on 22 May and 23 May, but not on 24 May. Apart from that difference, this is the same pattern as Finding missing numbers in a sequence.
SELECT a.enddate AS 'Available From', Min(b.startdate) AS 'To' FROM bookings AS a JOIN bookings AS b ON a.propertyID=b.propertyID AND a.enddate < b.startdate WHERE a.propertyID=1 GROUP BY a.enddate HAVING a.enddate < MIN(b.startdate); +----------------+------------+ | Available From | To | +----------------+------------+
This query cannot see reservation dates earlier than the first existing reservation date, or later than the last. Usually, you would want a calendar table to provide those limits, but you can fake them with a union. If the allowable reservation period is 1 Dec 2006 through 1 Jul 2007, union the left side of the join with a made-up row for 1 Dec 2006, and union the right side of the join with a made-up row for 1 Jul 2007:
SELECT a.enddate AS 'Available From', Min(b.startdate) AS 'To' FROM ( SELECT 0,1 as propertyID,'2006-12-01' as startdate,'2006-12-01' as enddate UNION SELECT * FROM bookings ) AS a JOIN ( SELECT * FROM bookings UNION SELECT 0,1,'2007-07-01' as startdate,'2007-07-02' as enddate ) AS b ON a.propertyID=b.propertyID AND a.enddate < b.startdate WHERE a.propertyID=1 GROUP BY a.enddate HAVING a.enddate < MIN(b.startdate); +----------------+------------+ | Available From | To | +----------------+------------+ | 2006-12-01 | 2007-01-01 | | 2007-01-15 | 2007-01-20 | | 2007-01-31 | 2007-02-10 | | 2007-02-17 | 2007-07-01 | +----------------+------------+
One solution is to use a View to identify starting and stopping events, then define an Overlaps View:
CREATE OR REPLACE VIEW events AS SELECT start AS time, 1 AS value, id FROM visits UNION SELECT end AS time, -1 AS value, id FROM visits; CREATE OR REPLACE VIEW overlaps AS SELECT time t, (SELECT SUM(value) FROM events WHERE time <=t ) as visitcount FROM events; SELECT t, visitcount FROM overlaps WHERE visitcount>1 ORDER BY t; +---------------------+------------+ | t | visitcount | +---------------------+------------+ | 2008-09-01 15:02:00 | 2 | | 2008-09-01 16:19:00 | 2 | | 2008-09-01 16:20:00 | 3 | | 2008-09-01 16:22:00 | 2 | | 2008-09-01 18:18:00 | 2 | +---------------------+------------+
But that doesn't show us when overlap periods end. There is a fuller and more straightforward solution: join visits to itself on the criteria that (i) the first of each joined pair of visits started no earlier than the second,
(ii) the first visit started before the second ended, and (iii) the second visit started before the first ended. (Thanks to Brian Moretta for pointing out the need for LEAST() to catch all endpoints.)
SELECT v1.id, v1.start, v2.id, LEAST(v1.end,v2.end) AS end FROM visits v1 JOIN visits v2 ON v1.id <> v2.id and v1.start >= v2.start and v1.start < v2.end ORDER BY v1.start; +----+---------------------+----+---------------------+ | id | start | id | end | +----+---------------------+----+---------------------+ | 2 | 2008-09-01 15:02:00 | 1 | 2008-09-01 15:04:00 | | 5 | 2008-09-01 16:19:00 | 4 | 2008-09-01 16:23:00 | | 8 | 2008-09-01 16:20:00 | 4 | 2008-09-01 16:22:00 | | 8 | 2008-09-01 16:20:00 | 5 | 2008-09-01 16:22:00 | | 7 | 2008-09-01 18:18:00 | 9 | 2008-09-01 18:22:00 | +----+---------------------+----+---------------------+
Last updated 05 Sep 2009
You need a query indicating whether a given property is available for a given period of time. Hotels & property renters usually adopt what is called the 'closed-open' convention for bookings, eg a booking from 22 May through 24 May means you sleep there the nights of 22 and 23 May. To show that property P is available for the desired closed-open period dStart to dEnd, you need to prove there is no booked period for P that overlaps dStart through dEnd. Until you're used to thinking about periods, it's easier to analyse graphically. There are four ways a booked reservation can overlap the desired date range ...
              dStart            dEnd
                |----------------|
  startDate                              endDate
      |--------------------------------|          (spans the whole period)
      |----------|                                (overlaps the start)
                     |------|                     (falls inside the period)
                              |----------|        (overlaps the end)
but there are just two ways a booked reservation can not overlap:
              dStart            dEnd
                |----------------|
   |-----|                                        (ends before the period starts)
                                        |-----|   (starts after the period ends)
So the period dStart through dEnd is available if there is no row where ...
!(endDate <= dStart OR startDate >= dEnd)
or equivalently ...
endDate > dStart AND startDate < dEnd
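Applied to the bookings(propertyID, startdate, enddate) table used above, a minimal availability check looks like this; the property id and date range are hypothetical:

SET @p = 1, @dStart = '2007-03-01', @dEnd = '2007-03-05';
SELECT NOT EXISTS
       ( SELECT 1 FROM bookings
         WHERE propertyID = @p
           AND enddate  > @dStart
           AND startdate < @dEnd ) AS available;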
Julian date
( Unix_Timestamp( datetimevalue ) / (60*60*24) ) + 2440587.5
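A quick check of the formula; with the session time zone set to UTC, noon on 1 Jan 2000 should come back as 2451545.0, the Julian date of the J2000.0 epoch:

SET time_zone = '+00:00';
SELECT ( Unix_Timestamp('2000-01-01 12:00:00') / (60*60*24) ) + 2440587.5 AS jd;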
An elegant method of generating any desired number of sequential values, posted by Giuseppe Maxia on his blog:
1. Create a View of three dummy rows.
2. Cross join it with itself, adding one more row, to make a 10-row View.
3. Cross join that View with itself as many times as needed to make 100, 1,000 or however many rows you need.
So to give the calendar table a million rows at one-hour intervals starting on 1 Jan 1970:
create or replace view v3 as select 1 n union all select 1 union all select 1; create or replace view v as select 1 n from v3 a, v3 b union all select 1; set @n = 0; drop table if exists calendar; create table calendar(dt datetime primary key); insert into calendar select cast('1970-1-1 00:00:00' + interval @n:=@n+1 hour as datetime) as dt from v a, v b, v c, v d, v e, v;
If you prefer to do it one query without user variables, or if your MySQL version does not support Views, make a general-purpose utility table of integers 0 through 9. We keep such utility objects in a sys database and make it available to all users:
create table sys.ints(i tinyint); insert into sys.ints values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
From sys.ints you can generate any desired sequence of dates or datetimes. For example to report monthly sales totals for 2009 from the table sales(date d, amount decimal(10,2)), write:
select a.yearmonth, sum(ifnull(b.amount,0)) as total from ( select concat( '2009-', lpad((u.i + t.i * 10), 2, 0)) as yearmonth from sys.ints u, sys.ints t where (u.i + t.i * 10) between 1 and 12 ) a left join sales b on a.yearmonth=date_format(b.d,'%Y-%m') group by a.yearmonth;
You can use such a query as a View, as an inline derived table, or as input to a CREATE TABLE statement. A slightly more elaborate method, giving the calendar table an auto-increment key that can also be used as a surrogate for datetime interval calculations:
CREATE TABLE calendar ( id INT AUTO_INCREMENT PRIMARY KEY, date DATE, UNIQUE days (date) );
Suppose the calendar is to run from 1990-01-01 through 2010-12-31, which is 7670 days. Find a table with at least that many rows, add a row to the calendar table for every day in the range, then fill in the dates:
INSERT INTO calendar (id) SELECT NULL FROM [name of table with 7670 rows] LIMIT 7670;
UPDATE calendar SET date = ADDDATE( '1989-12-31', id );
The calendar table now has one row for each day from 1990-01-01 through 2010-12-31. Keep the auto_increment id column for quick day counts in the range, or drop the column if you don't need that.
To make the calendar table a diary, make the period one leap year, add month, day and text columns, update month and day values with MONTH(date) and DAYOFMONTH(date) respectively, and if the diary is to be used from year to year, drop the date field. [Based on a builder.com SQL Tip by Arthur Fuller and a MySQL list tip by Michael Stassen] To automate all this, write a stored procedure, for example:
CREATE TABLE times ( date_hour DATETIME, KEY ( date_hour ) );
DROP PROCEDURE IF EXISTS timespopulate;
DELIMITER |
CREATE PROCEDURE timespopulate( startdate DATETIME, num INT )
BEGIN
  DECLARE ctr INT DEFAULT 0;
  WHILE ctr < num DO
    BEGIN
      INSERT INTO times VALUES ( DATE_ADD( startdate, INTERVAL ctr HOUR ) );
      SET ctr = ctr + 1;
    END;
  END WHILE;
END;
|
DELIMITER ;
CALL timespopulate( '2007-1-1', 31*24 );
To help keep the query as uncluttered as possible, encapsulate the above calculation in a stored function...
SET GLOBAL log_bin_trust_function_creators=1;
DROP FUNCTION IF EXISTS DaySeconds;
CREATE FUNCTION DaySeconds( pStart datetime, pEnd datetime, pDate date ) RETURNS INT
RETURN UNIX_TIMESTAMP( IF( DATE(pEnd) > pDate, CAST(ADDDATE(pDate, 1) AS DATETIME), pEnd ))
     - UNIX_TIMESTAMP( IF( pStart < pDate, CAST(pDate AS DATETIME), pStart ));
SELECT CEIL(DaySeconds('2008-1-1 10:05:00','2008-1-1 10:59:30','2008-1-1')/60) AS Mins;
+------+
| Mins |
+------+
|   55 |
+------+
Now for the query. To report usage per day over a period, we need a calendar table. For a test, make one for the first 100 days of the year 2008:
DROP TABLE IF EXISTS ints,calendar;
CREATE TABLE ints(i int);
INSERT INTO ints VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
CREATE TABLE calendar(date date);
SET @n=0;
INSERT INTO calendar SELECT ADDDATE('2007-12-31',@n:=@n+1) FROM ints a JOIN ints b;
Obtain usage per day for resourceID=1 by running DaySeconds() against a JOIN of calendar and bookings:
SELECT c.date AS date, SUM( CEIL(( DaySeconds( b.startdate, b.enddate, c.date ) / 60 ))) AS 'Mins Used' FROM calendar c JOIN bookings b ON c.date BETWEEN DATE(b.startdate) AND DATE(b.enddate) WHERE b.resourceID = 1 GROUP BY c.date; +------------+-----------+ | date | Mins Used | +------------+-----------+ | 2008-02-03 | 115 | | 2008-02-04 | 85 | | 2008-02-05 | 55 | | 2008-02-06 | 60 | +------------+-----------+
To report over a date range, join the above to the calendar table on a date range:
SELECT c.date, IFNULL( sums.N, 0 ) AS 'Mins Used' FROM calendar AS c LEFT JOIN ( SELECT c.date AS date, SUM( CEIL(( DaySeconds( b.startdate, b.enddate, c.date ) / 60 ))) AS N FROM calendar c JOIN bookings b ON c.date BETWEEN DATE(b.startdate) AND DATE(b.enddate) WHERE b.resourceID = 1 GROUP BY c.date ) AS sums ON sums.date = c.date WHERE c.date BETWEEN '2008-02-01' AND '2008-02-10'; +------------+-----------+ | date | Mins Used | +------------+-----------+ | 2008-02-01 | 0 | | 2008-02-02 | 0 | | 2008-02-03 | 115 | | 2008-02-04 | 85 | | 2008-02-05 | 55 | | 2008-02-06 | 60 | | 2008-02-07 | 0 | | 2008-02-08 | 0 | | 2008-02-09 | 0 | | 2008-02-10 | 0 | +------------+-----------+
drop table if exists t; create table t(interval_id int,start datetime,end datetime, att int); insert into t values (1,'2007-01-01 08:00:00','2007-01-01 12:00:00',5 ), (2,'2007-01-01 13:00:00','2007-01-01 17:00:00',10), (3,'2007-01-01 10:00:00','2007-01-01 15:00:00',15), (4,'2007-01-01 14:00:00','2007-03-07 19:00:00',20); select * from t; +-------------+---------------------+---------------------+------+ | interval_id | start | end | att | +-------------+---------------------+---------------------+------+ | 1 | 2007-01-01 08:00:00 | 2007-01-01 12:00:00 | 5 | | 2 | 2007-01-01 13:00:00 | 2007-01-01 17:00:00 | 10 | | 3 | 2007-01-01 10:00:00 | 2007-01-01 15:00:00 | 15 | | 4 | 2007-01-01 14:00:00 | 2007-03-07 19:00:00 | 20 | +-------------+---------------------+---------------------+------+
In this table, att is a delta: it tracks entrances. Actual attendance is SUM(att) at any given moment. For example, if the attendance sum is x at a given moment, then after we add a row with att=y and start/end datetimes embracing that moment, attendance will be x+y. So to retrieve total attendance at 1330h on 1 Jan 2007, we write:
SELECT SUM(att) FROM t WHERE t.start <= '2007-01-01 13:30:00' AND t.end >= '2007-01-01 13:30:00'; +----------+ | SUM(att) | +----------+ | 25 | +----------+
Then how would we extract maximum attendance during a given period, for example, maximum attendance between 1300h and 1700h? SQL does not deal efficiently with time. Some SQL dialects offer time series enhancements to the language; MySQL does not. And, querying time series data for aggregate statistics gets complicated very quickly. It gets a bit simpler with a calendar table that has a row for every possible datetime value. For our example, assume a granularity of one hour and a query period of one day. Naturally a real system would require a range of dates and perhaps a finer time granularity:
create table cal(id int, dt datetime);
insert into cal values( 1,'2007-1-1 01:00:00');
insert into cal values( 2,'2007-1-1 02:00:00');
insert into cal values( 3,'2007-1-1 03:00:00');
insert into cal values( 4,'2007-1-1 04:00:00');
insert into cal values( 5,'2007-1-1 05:00:00');
insert into cal values( 6,'2007-1-1 06:00:00');
insert into cal values( 7,'2007-1-1 07:00:00');
insert into cal values( 8,'2007-1-1 08:00:00');
insert into cal values( 9,'2007-1-1 09:00:00');
insert into cal values(10,'2007-1-1 10:00:00');
insert into cal values(11,'2007-1-1 11:00:00');
insert into cal values(12,'2007-1-1 12:00:00');
insert into cal values(13,'2007-1-1 13:00:00');
insert into cal values(14,'2007-1-1 14:00:00');
insert into cal values(15,'2007-1-1 15:00:00');
insert into cal values(16,'2007-1-1 16:00:00');
insert into cal values(17,'2007-1-1 17:00:00');
insert into cal values(18,'2007-1-1 18:00:00');
insert into cal values(19,'2007-1-1 19:00:00');
insert into cal values(20,'2007-1-1 20:00:00');
insert into cal values(21,'2007-1-1 21:00:00');
insert into cal values(22,'2007-1-1 22:00:00');
insert into cal values(23,'2007-1-1 23:00:00');
insert into cal values(24,'2007-1-2 00:00:00');  -- midnight ending the day; '24:00:00' is not a valid MySQL time
To accumulate the maximum attendance sum, collect target values for defined periods in an inner query, and sum them from the outer query:
SELECT SUM( att ) FROM ( SELECT t.start AS PeriodStart, t.end AS PeriodEnd, MIN(cal.dt) + INTERVAL 1 HOUR AS CountBegin, MAX(cal.dt) AS CountEnd, t.att FROM t JOIN cal ON cal.dt >= t.start AND cal.dt < t.end GROUP BY PeriodStart, PeriodEnd HAVING CountBegin < '2007-01-01 17:00:00' AND CountEnd > '2007-01-01 11:00:00' ) AS periods; +------------+ | SUM( att ) | +------------+ | 45 | +------------+
If the data is more complicated, eg if we also need to track exits, the period logic needs refinement but the principle remains the same.
While we're at it, we might as well have a function to return the most recent given weekday for a given date ...
delimiter go create function dateoflastweekday( d date, which tinyint ) returns date begin declare today tinyint; set today = dayofweek(d); return adddate( d, if(today=which,-7,if(today>which,which-today,which-today-7) )); end; go delimiter ;
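Hypothetical usage; 2 Dec 2009 was a Wednesday, and weekday 2 is Monday in DAYOFWEEK numbering, so the most recent Monday is 30 Nov 2009:

select dateoflastweekday( '2009-12-02', 2 ) as LastMonday;   -- 2009-11-30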
Last updated 02 Dec 2009
Suppose you have a table changes(time, value) that logs a value over time, and you need a query to show the rows where the value changes. For how to retrieve a rank ordering we can subsequently use, see Rank order. Here we need rank ordering on time. We get it by joining each row R to rows with equal or earlier times and counting those rows:
DROP TABLE IF EXISTS tmp; CREATE TABLE tmp SELECT a.*, COUNT(a.value) AS rank FROM changes a JOIN changes b ON a.time >= b.time GROUP BY a.time; SELECT * FROM tmp; +----------+-------+------+ | time | value | rank | +----------+-------+------+ | 00:00:00 | 0 | 1 | | 01:05:00 | 1 | 2 |
| 01:09:00 | 1 | 3 | | 01:45:00 | 1 | 4 | | 02:24:00 | 0 | 5 | | 12:20:00 | 1 | 6 | | 12:40:00 | 0 | 7 | | 14:32:00 | 0 | 8 | | 18:20:00 | 1 | 9 | | 18:21:00 | 1 | 10 | | 20:40:00 | 0 | 11 | +----------+-------+------+
Now if we left join this result to itself on matching values and the left rank >= the right rank, and if we exclude rows with NULLs on the right side, we are left with just the rows for which the next value is different:
SELECT a.time, a.value FROM tmp a LEFT JOIN tmp b ON a.value=b.value AND a.rank=b.rank+1 WHERE b.rank IS NULL; +----------+-------+ | time | value | +----------+-------+ | 00:00:00 | 0 | | 01:05:00 | 1 | | 02:24:00 | 0 | | 12:20:00 | 1 | | 12:40:00 | 0 | | 18:20:00 | 1 | | 20:40:00 | 0 | +----------+-------+
(Every week or so, someone on a MySQL forum poses this question, and a forum regular who goes by the name 'laptop alias' posts a solution very like the above.)
It's easy to encapsulate into a stored function. We might as well parameterise the daynumber which the function is to use as a criterion:
DROP FUNCTION IF EXISTS MonthOfWeek; SET GLOBAL log_bin_trust_function_creators=1; DELIMITER go CREATE FUNCTION MonthOfWeek( pyear YEAR, pweek SMALLINT, pday SMALLINT ) RETURNS SMALLINT BEGIN DECLARE vdate DATE; DECLARE vday SMALLINT; SET vdate = AddDate(Concat( pyear, '-01-01'), 7*pweek ); SET vday = DayOfWeek( vdate ); SET vdate = IF( vday = 1, vdate, AddDate( vdate, 1-vday )); RETURN Month( vdate ); END; go DELIMITER ; SELECT MonthOfWeek( Year(CurDate()), Week( CurDate() ), 1 );
Last updated 30 Nov 2009
YearMonth()
We often need to compute datetimes based on year and month. This tiny function simplifies such queries:
set global log_bin_trust_function_creators=1; create function yearmonth(d date) returns int return 100*year(d)+month(d);
Then to find date values within the three-month period bounded by the first day of last month and the last day of next month, write ...
select d from tbl where yearmonth(d) between yearmonth(curdate()-interval 1 month) and yearmonth(curdate() +interval 1 month);
Last updated 23 Aug 2010
Game schedule
List all possible home-away encounters of teams listed in a table.
SELECT t1.name AS Visiting, t2.name AS Home FROM teams AS t1 STRAIGHT_JOIN teams AS t2 WHERE t1.ID <> t2.ID;
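A minimal sketch of the assumed teams table, just to try the query; with three teams it returns the 3*2 = 6 possible fixtures:

DROP TABLE IF EXISTS teams;
CREATE TABLE teams( ID INT PRIMARY KEY, name CHAR(20) );
INSERT INTO teams VALUES (1,'Ants'),(2,'Bees'),(3,'Crows');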
MAX() chooses existing over blank entries, and GROUP BY lines everything up on the same row.
or
SELECT a.* FROM tbl a JOIN tbl b ON a.id = b.id AND MOD(b.id, N ) = 0;
or
SELECT * FROM ( SELECT id FROM tbl ) AS tmp WHERE MOD( tmp.id, N ) = 0;
Last updated 14 Feb 2010
call dijaddpath( 'i', 'j', 7 ); call dijaddpath( 'i', 'h', 18 ); call dijaddpath( 'j', 'f', 8 ); SELECT * FROM dijnodes; +--------+----------+------+--------+------------+ | nodeID | nodename | cost | pathID | calculated | +--------+----------+------+--------+------------+ | 1 | a | NULL | NULL | 0 | | 2 | b | NULL | NULL | 0 | | 3 | d | NULL | NULL | 0 | | 4 | c | NULL | NULL | 0 | | 5 | e | NULL | NULL | 0 | | 6 | f | NULL | NULL | 0 | | 7 | j | NULL | NULL | 0 | | 8 | g | NULL | NULL | 0 | | 9 | h | NULL | NULL | 0 | | 10 | i | NULL | NULL | 0 | +--------+----------+------+--------+------------+ SELECT * FROM dijpaths; +--------+------------+----------+------+ | pathID | fromNodeID | toNodeID | cost | +--------+------------+----------+------+ | 1 | 1 | 2 | 4 | | 2 | 1 | 3 | 1 | | 3 | 2 | 1 | 74 | | 4 | 2 | 4 | 2 | | 5 | 2 | 5 | 12 | | 6 | 4 | 2 | 12 | | 7 | 4 | 6 | 74 | | 8 | 4 | 7 | 12 | | 9 | 3 | 5 | 32 | | 10 | 3 | 8 | 22 | | 11 | 5 | 3 | 66 | | 12 | 5 | 6 | 76 | | 13 | 5 | 9 | 33 | | 14 | 6 | 10 | 11 | | 15 | 6 | 7 | 21 | | 16 | 8 | 3 | 12 | | 17 | 8 | 9 | 10 | | 18 | 9 | 8 | 2 | | 19 | 9 | 10 | 72 | | 20 | 10 | 6 | 31 | | 21 | 10 | 7 | 7 | | 22 | 10 | 9 | 18 | | 23 | 7 | 6 | 8 | +--------+------------+----------+------+
Now for the stored procedure, a 6-step:
1. null out path columns in the nodes table
2. find the nodeIDs referenced by the input params
3. loop through all uncalculated one-step paths, calculating costs in each
4. if a node remains uncalculated, the graph is invalid, so quit
5. write the path sequence to a temporary table
6. query the temp table to show the result
DROP PROCEDURE IF EXISTS dijResolve; DELIMITER | CREATE PROCEDURE dijResolve( pFromNodeName VARCHAR(20), pToNodeName VARCHAR(20) ) BEGIN DECLARE vFromNodeID, vToNodeID, vNodeID, vCost, vPathID INT; DECLARE vFromNodeName, vToNodeName VARCHAR(20); -- null out path info in the nodes table UPDATE dijnodes SET PathID = NULL,Cost = NULL,Calculated = 0; -- find nodeIDs referenced by input params SET vFromNodeID = ( SELECT NodeID FROM dijnodes WHERE NodeName = pFromNodeName ); IF vFromNodeID IS NULL THEN SELECT CONCAT('From node name ', pFromNodeName, ' not found.' ); ELSE BEGIN -- start at src node SET vNodeID = vFromNodeID; SET vToNodeID = ( SELECT NodeID FROM dijnodes WHERE NodeName = pToNodeName ); IF vToNodeID IS NULL THEN SELECT CONCAT('From node name ', pToNodeName, ' not found.' ); ELSE BEGIN -- calculate path costs till all are done UPDATE dijnodes SET Cost=0 WHERE NodeID = vFromNodeID; WHILE vNodeID IS NOT NULL DO BEGIN UPDATE dijnodes AS src JOIN dijpaths AS paths ON paths.FromNodeID = src.NodeID JOIN dijnodes AS dest ON dest.NodeID = Paths.ToNodeID SET dest.Cost = CASE WHEN dest.Cost IS NULL THEN src.Cost + Paths.Cost WHEN src.Cost + Paths.Cost < dest.Cost THEN src.Cost + Paths.Cost ELSE dest.Cost END, dest.PathID = Paths.PathID WHERE src.NodeID = vNodeID AND (dest.Cost IS NULL OR src.Cost + Paths.Cost < dest.Cost) AND dest.Calculated = 0; UPDATE dijnodes SET Calculated = 1 WHERE NodeID = vNodeID;
SET vNodeID = ( SELECT nodeID FROM dijnodes WHERE Calculated = 0 AND Cost IS NOT NULL ORDER BY Cost LIMIT 1 ); END; END WHILE; END; END IF; END; END IF; IF EXISTS( SELECT 1 FROM dijnodes WHERE NodeID = vToNodeID AND Cost IS NULL ) THEN -- problem, cannot proceed SELECT CONCAT( 'Node ',vNodeID, ' missed.' ); ELSE BEGIN -- write itinerary to map table DROP TEMPORARY TABLE IF EXISTS map; CREATE TEMPORARY TABLE map ( RowID INT PRIMARY KEY AUTO_INCREMENT, FromNodeName VARCHAR(20), ToNodeName VARCHAR(20), Cost INT ) ENGINE=MEMORY; WHILE vFromNodeID <> vToNodeID DO BEGIN SELECT src.NodeName,dest.NodeName,dest.Cost,dest.PathID INTO vFromNodeName, vToNodeName, vCost, vPathID FROM dijnodes AS dest JOIN dijpaths AS Paths ON Paths.PathID = dest.PathID JOIN dijnodes AS src ON src.NodeID = Paths.FromNodeID WHERE dest.NodeID = vToNodeID; INSERT INTO Map(FromNodeName,ToNodeName,Cost) VALUES(vFromNodeName,vToNodeName,vCost); SET vToNodeID = (SELECT FromNodeID FROM dijPaths WHERE PathID = vPathID); END; END WHILE; SELECT FromNodeName,ToNodeName,Cost FROM Map ORDER BY RowID DESC; DROP TEMPORARY TABLE Map; END; END IF; END; | DELIMITER ; CALL dijResolve( 'a','i'); +--------------+------------+------+ | FromNodeName | ToNodeName | Cost | +--------------+------------+------+ | a | b | 4 | | b | c | 6 | | c | j | 18 | | j | f | 26 | | f | i | 37 | +--------------+------------+------+ Back to the top Browse the book Buy the book Feedback
To list children under their parents, self-join on left parent = right id and order on left parent, right ord:
SELECT t1.id,t1.parent,t2.ord,t1.title,t1.title as parentTitle FROM t t1 LEFT JOIN t t2 ON t1.parent=t2.id ORDER BY t1.parent,t2.ord; +------+--------+------+-------------+-------------+ | id | parent | ord | title | parentTitle | +------+--------+------+-------------+-------------+ | 1 | 0 | NULL | Root | Root | | 8 | 1 | 0 | Example | Example | | 2 | 1 | 0 | Home | Home | | 3 | 1 | 0 | Projects | Projects |
| 5 | 1 | 0 | Secret area | Secret area | | 4 | 1 | 0 | Tutorials | Tutorials | | 6 | 4 | 3 | Computing | Computing | | 7 | 4 | 3 | Life | Life | +------+--------+------+-------------+-------------+
Approximate joins
There are two main ways to reconcile payments against charges:
1. Open Item: match payments against individual charges, typically by carrying the charge number in the payments table.
2. Statement: list and sum all charges and all payments, and show the difference as the outstanding balance.
The Open Item method needs a foolproof way to match payments to charges, but what if the customer neglected to return a copy of the invoice, or to write the invoice number on the cheque? Reconciliation staff spend much of their time resolving such problems. Can we help? Yes! It won't be entirely foolproof, but it will drastically cut down the onerous work of reconciliation. Here is DDL for a test case:
CREATE SCHEMA approx; USE approx; CREATE TABLE charges ( ID INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, custID INT UNSIGNED, amount DECIMAL(10,2) NOT NULL ); CREATE TABLE payments ( ID INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, custID INT UNSIGNED, amount DECIMAL( 10,2) NOT NULL );
Both tables carry a custID column to identify whose charge or payment it is, but there is no foreign key linking payments to specific charges--that is the link we are going to approximate. Now populate the tables with a few rows of sample charges and payments for customer #1, ensuring that you have a variety of payments: some that match the charge exactly, some that are close but not exact, and some that are slight overpayments.
INSERT INTO approx.charges VALUES (NULL,1,100),(NULL,1,12),(NULL,1,56),(NULL,1,43),(NULL,1,59),(NULL,1,998); INSERT INTO approx.payments VALUES (NULL,1,99),(NULL,1,62),(NULL,1,40),(NULL,1,50),(NULL,1,12),(NULL,1,1000); SELECT * FROM charges; +----+--------+--------+ | ID | custID | amount | +----+--------+--------+ | 1 | 1 | 100.00 | | 2 | 1 | 12.00 | | 3 | 1 | 56.00 | | 4 | 1 | 43.00 | | 5 | 1 | 59.00 | | 6 | 1 | 998.00 | +----+--------+--------+ SELECT * FROM payments; +----+--------+---------+ | ID | custID | amount | +----+--------+---------+ | 1 | 1 | 99.00 | | 2 | 1 | 62.00 | | 3 | 1 | 40.00 | | 4 | 1 | 50.00 | | 5 | 1 | 12.00 | | 6 | 1 | 1000.00 | +----+--------+---------+
The first thing to do is define an approximation threshold: how close must the amount paid be to the amount charged before we conclude that the amounts are related? For this example we define the proximity threshold as 2. In a real-world example, it might be 10, or 50, or perhaps a percentage of the charge. It all depends on the nature of the organisation and the typical total purchase. A house builder may make frequent purchases valued at $1000 and more. You scale the threshold to the typical situation. Since the amount paid might be more or less or even equal to the amount charged, to link a payment to a charge we need not an equi-join but a theta-join that tests a range both below and above the charge amount. That might suggest a BETWEEN clause. Here is a better idea: use the ABS() function:
SET @proximity = 2;   -- change this value to suit your situation
SELECT c.ID AS ChargeNo, c.Amount AS Charge, p.ID AS PaymentNo, p.Amount AS Payment
FROM charges c
JOIN payments p ON c.custID = p.custID AND ABS(c.amount - p.amount) <= @proximity WHERE c.custID = 1;
Before you run this query, look at the data to anticipate the result. Here it is:
+----------+--------+-----------+---------+ | ChargeNo | Charge | PaymentNo | Payment | +----------+--------+-----------+---------+ | 1 | 100.00 | 1 | 99.00 | | 2 | 12.00 | 5 | 12.00 | | 6 | 998.00 | 6 | 1000.00 | +----------+--------+-----------+---------+
The solution is correct, as far as it goes, but it doesn't go far enough. We correctly identified the three situations: underpayment, exact payment and overpayment, but we suppressed all charges that don't have a matching payment. Reconciliation staff are probably interested in a bigger picture of the situation. Fix this by changing the INNER JOIN to a LEFT JOIN:
SET @proximity = 2; SELECT c.ID AS ChargeNo, c.amount AS Charge, p.ID AS PaymentNo, p.amount AS Payment FROM charges c LEFT JOIN payments p ON c.custID = p.custID AND ABS(c.amount - p.amount) <= @proximity WHERE c.custID = 1; +----------+--------+-----------+---------+ | ChargeNo | Charge | PaymentNo | Payment | +----------+--------+-----------+---------+ | 1 | 100.00 | 1 | 99.00 | | 2 | 12.00 | 5 | 12.00 | | 3 | 56.00 | NULL | NULL | | 4 | 43.00 | NULL | NULL | | 5 | 59.00 | NULL | NULL | | 6 | 998.00 | 6 | 1000.00 | +----------+--------+-----------+---------+
Much better! The reconciliation people now know that three charges have no matching payment. What if the customer mistakenly pays for something twice? Add a row to the Payments table with an amount of $1000, then re-run the last query:
+----------+--------+-----------+---------+ | ChargeNo | Charge | PaymentNo | Payment | +----------+--------+-----------+---------+ | 1 | 100.00 | 1 | 99.00 | | 2 | 12.00 | 5 | 12.00 | | 3 | 56.00 | NULL | NULL | | 4 | 43.00 | NULL | NULL | | 5 | 59.00 | NULL | NULL | | 6 | 998.00 | 6 | 1000.00 | | 6 | 998.00 | 7 | 1000.00 | +----------+--------+-----------+---------+
How convenient! We can see at once that charge number 6 was paid for twice. Somebody in the reconciliation department owes you lunch.
Last updated 05 May 2010
Cascading JOINs
Show parents, children and grandchildren including parents without children
SELECT parent.id AS ParentID, IFNULL(child.parent_id,'') AS ChildParentID, IFNULL(child.id,'') AS ChildID, IFNULL(grandchild.child_id,'') AS GrandchildChildID FROM parent LEFT JOIN child ON parent.id=child.parent_id LEFT JOIN grandchild ON child.id=grandchild.child_id;
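A minimal schema sketch for trying the query; the table and column names are assumed from the query above:

CREATE TABLE parent     ( id INT PRIMARY KEY );
CREATE TABLE child      ( id INT PRIMARY KEY, parent_id INT );
CREATE TABLE grandchild ( id INT PRIMARY KEY, child_id INT );
INSERT INTO parent VALUES (1),(2);              -- parent 2 has no children
INSERT INTO child VALUES (10,1),(11,1);
INSERT INTO grandchild VALUES (100,10);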
Classroom scheduling
You have n student classes of known size, and m classrooms of known size, where m>=n. What's the best algorithm for assigning as many classes as possible to rooms of adequate size? It's a version of the combinatorial knapsack problem. It's known to be NP-complete, which means it's possible to verify any correct solution but there is no known algorithm for quickly finding a correct solution. How then to proceed?
Early in 2010 Joe Celko resurrected the problem in a Simple Talk column, and challenged readers to improve on SQL Server solutions he'd published in the third edition of his "SQL for Smarties". Here's his small version of the problem modified for MySQL:
DROP TABLE IF EXISTS Rooms, Classes; CREATE TABLE Rooms( room_nbr CHAR(2) NOT NULL PRIMARY KEY, room_size INTEGER NOT NULL ) ENGINE=MyISAM; CREATE TABLE Classes( class_nbr CHAR(2) NOT NULL PRIMARY KEY, class_size INTEGER NOT NULL ) ENGINE=MyISAM; INSERT INTO Classes VALUES ('c1', 80),('c2', 70),('c3', 65),('c4', 55),('c5', 50),('c6', 40); INSERT INTO Rooms VALUES ('r1', 70),('r2', 40),('r3', 50),('r4', 85),('r5', 30),('r6', 65),('r7', 55);
And here is the best solution posted by his contributors. It works in SQL Server 2005 and 2008:
WITH Matches AS ( SELECT class_nbr, class_size, room_nbr, room_size, exact_match = CASE WHEN class_size = room_size THEN 1 ELSE 0 END FROM Classes, Rooms WHERE class_size <= room_size ), Preferences AS ( SELECT class_nbr, class_size, class_room_pref = ROW_NUMBER() OVER ( PARTITION BY class_nbr ORDER BY exact_match, room_size, room_nbr ), room_nbr, room_size, room_class_pref = ROW_NUMBER() OVER ( PARTITION BY room_nbr ORDER BY exact_match, class_size DESC, class_nbr ) FROM Matches m WHERE NOT EXISTS ( SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size ) ), Final AS ( SELECT class_nbr, class_size, room_nbr, room_size, final_pref = ROW_NUMBER() OVER (PARTITION BY class_nbr ORDER BY class_room_pref) FROM Preferences p WHERE NOT EXISTS ( SELECT 1 FROM Preferences WHERE room_nbr = p.room_nbr AND class_room_pref = room_class_pref AND room_class_pref < p.room_class_pref ) ) SELECT c.class_nbr, c.class_size, f.room_nbr, f.room_size FROM Classes c LEFT JOIN Final f ON c.class_nbr = f.class_nbr AND f.final_pref = 1 ORDER BY 1;
As a MySQL user, you may be unfamiliar with two constructs in this query: ROW_NUMBER() OVER (PARTITION BY ...), and Common Table Expressions (CTEs), introduced by the keyword WITH.
ROW_NUMBER() numbers resultset rows based on row values. The Emulate Row_Number() entry on this page shows two MySQL equivalents for it, one relying on user variables, the other on aggregation. For this problem we will use the user variable method.
CTEs provide an elegant syntax for building derived tables. The above SQL Server query builds the derived table Matches, from which it builds the derived table Preferences, from which it builds the table Final, which it joins with Classes for the final result. Can this be done in MySQL? Yes, but not nearly so elegantly. Here we'll lay out an unoptimised step-by-step. Basically, we build the Matches, Preferences and Final tables, one at a time, then copy the final step of the SQL Server query. First the Matches table:
DROP TABLE IF EXISTS Matches; CREATE TABLE Matches( ID INT AUTO_INCREMENT PRIMARY KEY ) SELECT class_nbr, class_size, room_nbr, room_size, IF(class_size=room_size,1,0) AS exact_match FROM Classes JOIN Rooms ON class_size <= room_size;
The Preferences table has two Row_Number() expressions to port, so we build each, then join them:
DROP TABLE IF EXISTS room_prefs; SET @class_nbr_prev='', @ordPrev=0; CREATE TABLE room_prefs
SELECT ID, class_nbr, class_size, room_nbr, room_size, class_room_pref FROM ( SELECT ID, class_size, room_nbr, room_size, @ordPrev := IF( @class_nbr_prev=class_nbr, @ordPrev+1, 1 ) as class_room_pref, @class_nbr_prev := class_nbr AS class_nbr FROM Matches m WHERE NOT EXISTS ( SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size ) ORDER BY class_nbr, exact_match, room_size, room_nbr ) AS tmp ; DROP TABLE IF EXISTS class_prefs; SET @room_nbr_prev = '', @ordPrev=0; CREATE TABLE class_prefs SELECT ID, room_class_pref FROM ( SELECT ID, @ordPrev := IF( @room_nbr_prev=room_nbr, @ordPrev+1, 1 ) as room_class_pref, @room_nbr_prev := room_nbr AS room_nbr FROM Matches m WHERE NOT EXISTS ( SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size ) ORDER BY room_nbr, exact_match, class_size DESC, class_nbr ) AS tmp ; DROP TABLE IF EXISTS Preferences; CREATE TABLE Preferences SELECT a.class_nbr, a.class_size, a.room_nbr, a.class_room_pref, a.room_size, b.room_class_pref FROM room_prefs a JOIN class_prefs b USING(ID);
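The excerpt does not show the step that builds the Final table; here is a plausible sketch, following the same user-variable Row_Number() pattern and mirroring the Final CTE in the SQL Server query above:

DROP TABLE IF EXISTS Final;
SET @class_nbr_prev='', @ordPrev=0;
CREATE TABLE Final
SELECT class_nbr, class_size, room_nbr, room_size, final_pref
FROM (
  SELECT class_size, room_nbr, room_size,
         @ordPrev := IF(@class_nbr_prev=class_nbr, @ordPrev+1, 1) AS final_pref,  -- rank within class
         @class_nbr_prev := class_nbr AS class_nbr
  FROM Preferences p
  WHERE NOT EXISTS (
    SELECT 1 FROM Preferences
    WHERE room_nbr = p.room_nbr
      AND class_room_pref = room_class_pref
      AND room_class_pref < p.room_class_pref
  )
  ORDER BY class_nbr, class_room_pref
) AS tmp;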
The final step is identical to the last step in the SQL Server version:
SELECT c.class_nbr, c.class_size, f.room_nbr, f.room_size FROM Classes c LEFT JOIN Final f ON c.class_nbr = f.class_nbr AND f.final_pref = 1 ORDER BY 1; +-----------+------------+----------+-----------+ | class_nbr | class_size | room_nbr | room_size | +-----------+------------+----------+-----------+ | c1 | 80 | r4 | 85 | | c2 | 70 | r1 | 70 | | c3 | 65 | r6 | 65 | | c4 | 55 | r7 | 55 | | c5 | 50 | r3 | 50 | | c6 | 40 | r2 | 40 | +-----------+------------+----------+-----------+
Data-driven joins
Data-driven table relationships are hard to maintain, but sometimes they cannot be avoided. How do we build joins for them? One way is to use a CASE statement in the SELECT list to handle the joining possibilities. In this example, the parent.linktable column determines the name of the table where a particular parent row's data is. The method is fine when the number of child tables is small:
USE test; DROP TABLE IF EXISTS parent, child1, child2; CREATE TABLE parent ( id INT UNSIGNED PRIMARY KEY, linktable CHAR(64) NOT NULL ); INSERT INTO parent VALUES (1, 'child1'), (2, 'child2'); CREATE TABLE child1 ( id INT UNSIGNED PRIMARY KEY, data CHAR(10) ); INSERT INTO child1 VALUES (1, 'abc'); CREATE TABLE child2 ( id INT UNSIGNED PRIMARY KEY, data CHAR(10) ); INSERT INTO child2 VALUES (2, 'def');
To retrieve all child data for all parents, include in the SELECT list a CASE statement which handles all child table possibilities:
SELECT p.id, p.linktable, CASE linktable WHEN 'child1' THEN c1.data WHEN 'child2' THEN c2.data ELSE 'Error' END AS Data FROM parent AS p LEFT JOIN child1 AS c1 ON p.id=c1.id LEFT JOIN child2 AS c2 ON p.id=c2.id; +----+-----------+------+ | id | linktable | Data | +----+-----------+------+ | 1 | child1 | abc | | 2 | child2 | def | +----+-----------+------+
When the number of child tables is too large for a convenient CASE statement, PREPARE the query in a stored procedure. (Based on a MySQL Forum post by Felix Geerinckx)
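A minimal sketch of that approach for the schema above (the procedure name is illustrative, not from the original): look up linktable for the requested parent row, then PREPARE and run the child query.

DROP PROCEDURE IF EXISTS get_child_data;
DELIMITER |
CREATE PROCEDURE get_child_data( p_id INT UNSIGNED )
BEGIN
  DECLARE v_tbl CHAR(64);
  -- which child table holds this parent's data?
  SELECT linktable INTO v_tbl FROM parent WHERE id = p_id;
  -- build and run the child query dynamically
  SET @sql = CONCAT( 'SELECT data FROM ', v_tbl, ' WHERE id = ', p_id );
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
END;
|
DELIMITER ;
CALL get_child_data(2);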
MySQL does not support FULL OUTER JOIN. How to emulate it? If the joining keys of each table are unique, you can just UNION left and right joins:
SELECT * FROM a LEFT JOIN b ON a.id=b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id=b.id;
But suppose the tables to be joined have duplicate rows, and you wish your result to preserve them. For example, add a duplicate row to table a:
INSERT INTO a VALUES(1,'a');
Now UNION removes the duplicate row you want preserved in the result. How to get back the desired duplicates? A FULL OUTER JOIN consists of: an INNER JOIN between a and b to catch row matches between a and b, a LEFT EXCLUSION JOIN from a to b to catch rows that are in a and not in b, a RIGHT EXCLUSION JOIN from b to a to catch rows in b that are not in a. In SQL:
SELECT * FROM a INNER JOIN b ON a.id=b.id
http://www.artfulsoftware.com/infotree/queries.php?&bw=1108
7/31/2011
Page 56 of 92
UNION ALL SELECT * FROM a LEFT JOIN b ON a.id=b.id WHERE b.id IS NULL UNION ALL SELECT * FROM a RIGHT JOIN b ON a.id=b.id WHERE a.id IS NULL
But the first two joins (the inner join and the left exclusion join) are logically equivalent to a left outer join, so we can write:
SELECT * FROM a LEFT JOIN b ON a.id=b.id UNION ALL SELECT * FROM a RIGHT JOIN b ON a.id=b.id WHERE a.id IS NULL; +------+------+------+------+ | id | name | id | name | +------+------+------+------+ | 1 | a | NULL | NULL | | 2 | b | 2 | b | | 1 | a | NULL | NULL | | NULL | NULL | 3 | c | +------+------+------+------+
Why doesn't MySQL implement FULL OUTER JOIN syntax for this? We don't know.
drop table if exists a,b; create table a(i int,j int); create table b like a; insert into a values(1,1),(2,2); insert into b values(1,1),(3,3); select * from a join b using(i,j); +------+------+ | i | j | +------+------+ | 1 | 1 | +------+------+
Get the DIFFERENCE between tables a and b by UNIONing exclusion joins from a to b, and from b to a:
select * from a left join b using(i,j) where b.i is null union select * from b left join a using(i,j) where a.i is null; +------+------+ | i | j | +------+------+ | 2 | 2 | | 3 | 3 | +------+------+
Last updated 30 Jun 2009
Many-to-many joins
To model a many:many relationship between two tables a and b, you need a bridging table where each row represents one instance of an association between a row in a and a row in b, as in this example:
drop table if exists users,actions,useractions; create table users(userid int primary key, username char(32)); insert into users values(1, 'James'),(2, 'Alex'),(3, 'Justin'); create table actions(actionid int primary key, action char(32)); insert into actions values(1, 'Login'),(2, 'Logout'),(3, 'Delete'),(4, 'Promote'); create table useractions(uaid int primary key, userid int, actionid int); insert into useractions values(1,1,1),(2,1,2),(3,3,4); select u.username, a.action from useractions ua join users u using (userid) join actions a using (actionid); +----------+---------+ | username | action | +----------+---------+ | James | Login | | James | Logout | | Justin | Promote | +----------+---------+
Almost every non-trivial database will have at least one instance of this pattern. We have a collection of articles and users' scores of them. How to report statistics on these scores? We need three tables--one for articles, one for users, and a bridge table where each row represents one score on one article by one user:
DROP TABLE IF EXISTS art_articles; CREATE TABLE art_articles ( ID INT AUTO_INCREMENT PRIMARY KEY, title CHAR(30), txt TEXT, UNIQUE KEY (title) );
INSERT INTO art_articles VALUES (1,'abc',''),(2,'def',''),(3,'ghi',''),(4,'jkl',''); DROP TABLE IF EXISTS art_users; CREATE TABLE art_users( ID INT AUTO_INCREMENT PRIMARY KEY, name CHAR(20) ); INSERT INTO art_users VALUES (1,'A'),(2,'B'); DROP TABLE IF EXISTS art_scores; CREATE TABLE art_scores ( id INT AUTO_INCREMENT PRIMARY KEY, articleID INT NOT NULL, -- references article.articleID userID INT NOT NULL, -- references user.userID score DECIMAL(6,2) ); INSERT INTO art_scores VALUES (1,1,1,80),(2,1,2,90),(3,2,2,60); -- find average score for article titled 'abc' SELECT a.title, AVG( s.score ) AS Art1Avg FROM art_articles a JOIN art_scores s ON a.id=s.articleID WHERE a.title='abc' GROUP BY a.title; -- find average score submitted by user 1 SELECT u.name, AVG( s.score ) AS User1Avg FROM art_users u JOIN art_scores s ON u.id=s.userID WHERE u.id = 1 GROUP BY u.name;
Last updated 16 Mar 2010
One solution begins with self-joins of the purchases table, repeated as often as necessary to get the answer. For example, to list all products bought by customers who'd already bought at least one other product, join userpurchases to itself on matching custIDs and non-matching prodIDs:
SELECT DISTINCT p2.prodid FROM userpurchases p1 JOIN userpurchases p2 ON p1.custID = p2.custID AND p1.prodID <> p2.prodID; +--------+ | prodid | +--------+ | 1 | | 2 | | 3 | +--------+
To find what else buyers of product 1 bought, copy the above join and group by custID:
SELECT p1.custID,GROUP_CONCAT(p2.prodid) as 'Buyers of #1 Also bought' FROM userpurchases p1 JOIN userpurchases p2 ON p1.custID=p2.custID AND p1.prodID <> p2.prodID WHERE p1.prodID = 1 GROUP BY p1.custID; +--------+--------------------------+ | custID | Buyers of #1 Also bought | +--------+--------------------------+ | 1 | 2 | | 3 | 2 | | 5 | 2,3 | +--------+--------------------------+
To find customers who bought both product 1 and product 2, self-join again, scoping one side of the join to product 1 and the other to product 2:
SELECT DISTINCT p1.custID FROM userpurchases p1 JOIN userpurchases p2 ON p1.custID=p2.custID AND p1.prodID=1 AND p2.prodID=2;
+--------+
| custID |
+--------+
|      1 |
|      3 |
|      5 |
+--------+
The problem can also be solved with just Counts and a Having clause. Customers who bought products 1 and 2 ... (i) have custIDs in userpurchases where prodID is in (1,2) and (ii) have a distinct prodID count >= the number of prodIDs in (1,2), which is 2:
SELECT custID FROM ( SELECT custID, COUNT(DISTINCT prodID) AS N FROM userpurchases WHERE prodID IN(1,2) GROUP BY custID HAVING N>=2 -- N of (1,2) = 2 ) x;
When the criterion group has more than 2 items, that aggregating solution is easier to implement. For example, you track members and their projects, and you often have to query which members participated in all of a specified list of projects:
DROP TABLE IF EXISTS members,project_members; CREATE TABLE members(emp_id int,emp_name char(5)); INSERT INTO members VALUES (1,'Emp1'),(2,'Emp2'),(3,'Emp3'),(4,'Emp4'),(5,'Emp5'); CREATE TABLE project_members (project_id int, emp_id int); INSERT INTO project_members VALUES (1, 1),(1, 2),(1, 3),(2, 1),(2, 2),(3, 1),(4, 3),(5, 1),(5, 2),(5, 3),(5, 4),(6, 1); SELECT project_id AS Project, GROUP_CONCAT(emp_id) AS Members FROM project_members GROUP BY project_id; +---------+---------+ | Project | Members | +---------+---------+ | 1 | 1,2,3 | | 2 | 1,2 | | 3 | 1 | | 4 | 3 | | 5 | 1,2,3,4 | | 6 | 1 | +---------+---------+
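The excerpt stops before the final query; a minimal sketch using the same counting-and-HAVING idea, assuming the specified list is projects 1, 2 and 5:

SELECT emp_id
FROM project_members
WHERE project_id IN (1,2,5)
GROUP BY emp_id
HAVING COUNT(DISTINCT project_id) = 3;   -- 3 = number of projects in the specified list

This returns emp_ids 1 and 2, the members who participated in all three listed projects.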
Here is a variation on the pattern. You have tables people(id, name), membership(id, name) and people_membership(id, membership_id, people_id). You need to find names of all persons and all their memberships where one of the memberships has id=X. It's another two-step: 1. Assemble the people who have a membership with id=X 2. Join from that derived result to people_memberships to get their other memberships, then to people to get their personal info:
SELECT pX.id, pX.name, GROUP_CONCAT(m.name ORDER BY m.id SEPARATOR ', ') AS Memberships FROM ( SELECT p.id, p.name FROM people p JOIN ( SELECT people_id FROM people_membership WHERE membership_id=X ) pmX ON p.id=pmX.people_id ) pX JOIN people_membership pm ON pX.id=pm.people_id JOIN membership m ON pm.membership_id=m.id GROUP BY pX.id, pX.name ORDER BY pX.name;
Here is another variation on the pattern. You track companies, their agents, and their sales:
drop table if exists companies,agents,sales; create table companies( id int,name char(5) ); insert into companies values (1,'a ltd'),(2,'b ltd');
create table agents( id int, agentname char(5), companyID int ); insert into agents values (1,'ann',1),(2,'bill',1),(3,'chad',2),(4,'david',2),(5,'ed',3); create table sales( itemname char(5), agentID int ); insert into sales values ('socks',1),('shoes',2),('shirt',3),('tie',4);
You need a report that lists all items sold by a given agent and by all other agents in that agent's company. It's a three-step: 1. Write a query to retrieve agents and their sales. 2. Join the result of [1] with the agents table on company. 3. Scope the result of [2] on a particular agent's name.
SET @agent='ann'; SELECT s.* FROM agents a JOIN ( SELECT x.id,x.agentname,x.companyID,y.itemname FROM agents x JOIN sales y ON x.id=y.agentID ) s ON a.companyID=s.companyID WHERE a.agentname=@agent;
+------+-----------+-----------+----------+
| id   | agentname | companyID | itemname |
+------+-----------+-----------+----------+
|    1 | ann       |         1 | socks    |
|    2 | bill      |         1 | shoes    |
+------+-----------+-----------+----------+
Last updated 28 Jan 2011
Join or subquery?
Usually, a JOIN is faster than an uncorrelated subquery. For example in the sakila test database, customer is a parent of rental (via customer_id) which in turn is a parent of payment (via rental_id). The subquery version of a query for whether a customer has made payments and rentals...
SELECT DISTINCT c.customer_id FROM customer c WHERE c.customer_id IN ( SELECT r.customer_id FROM rental r JOIN payment p USING (rental_id) WHERE c.customer_id = 599 );
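The JOIN version is not shown in this excerpt; a minimal sketch of the equivalent join, assuming the standard sakila column names:

SELECT DISTINCT c.customer_id
FROM customer c
JOIN rental r  ON r.customer_id = c.customer_id
JOIN payment p ON p.rental_id = r.rental_id
WHERE c.customer_id = 599;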
Running EXPLAIN on the two queries reveals why: the subquery version has to read most customer rows, while the join version proceeds inside out and discovers it needs to read just one customer row.
or with a NOT EXISTS subquery, which is logically equivalent to the exclusion join, but usually performs much slower:
SELECT parent.id AS ParentID FROM parent WHERE NOT EXISTS ( SELECT 1 FROM child WHERE child.parent_id = parent.id );
Last updated 28 Jan 2011
You want to list the names of all contractors and their clients.
SELECT clientpartyID, pCli.name AS Client, contractorpartyID, pCon.name AS Contractor FROM contracts INNER JOIN parties AS pCli ON contracts.clientpartyID = pCli.partyID INNER JOIN parties AS pCon ON contracts.contractorpartyID = pCon.partyID;
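The IN() query under discussion below is not shown in this excerpt; judging from the variant given further down, it is presumably the subquery form of "orders having more than one order line":

SELECT orderID
FROM orders
WHERE orderID IN (
  SELECT orderID FROM orderdetails GROUP BY orderID HAVING COUNT(orderID) > 1
);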
Trouble is, in MySQL 5.0 this query takes 130 times longer to execute than a JOIN version of the same logic:
SELECT orderID FROM orders o JOIN orderdetails od USING (orderID) GROUP BY orderID HAVING COUNT(od.orderID) > 1
Why? The query optimiser decides to execute the IN() query outside-in: it runs the subquery once for each outer query row it finds, so if there are 800 orders and 2000 items, it does 1,600,000 reads. On the other hand, it optimises the JOIN very well. But here's a surprise. If a query has a FROM clause subquery, MySQL executes that subquery first. What if we were to add a redundant FROM clause subquery into the IN() subquery?
EXPLAIN EXTENDED SELECT SQL_NO_CACHE orderID FROM orders WHERE orderID IN ( SELECT orderID FROM ( -- redundant FROM subquery SELECT orderID FROM orderdetails GROUP BY orderID HAVING COUNT(orderID) > 1 ) AS tmp );
Sure enough, this version of IN() executes as fast as the JOIN version because the total number of reads is 800+2000=2,800, not 800*2000=1,600,000. Optimisations in MySQL 5.5 reduce the speed difference from 130:1 to 5:1. With large tables, that will still be significant. You wonder if EXISTS() might be better than IN()?
SELECT orderID FROM orders o WHERE EXISTS ( SELECT orderID FROM orderdetails WHERE orderID = o.orderID GROUP BY orderID HAVING COUNT(orderID)>1 );
In 5.0, EXISTS() is much faster than IN(), but slower than JOIN. In 5.5, EXISTS() performs about as well as JOIN. For NOT IN(...), consider a straight exclusion join. The basic pattern is that ...
SELECT ... FROM a WHERE a.x NOT IN ( SELECT y FROM b );
becomes ...
SELECT ... FROM a LEFT JOIN b ON a.x=b.y WHERE b.y IS NULL;
One more surprise. Running EXPLAIN EXTENDED on these queries does not predict all the performance differences. You have to benchmark the queries. So for problems like this, experiment with the MySQL 5.5/5.6
optimizer_switch variable to see whether your version of this problem is best solved by IN(), IN() with the extra subquery, EXISTS(), or JOIN.
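For example, a sketch of such an experiment (flag names vary by version; semijoin and materialization exist only from MySQL 5.6):

SELECT @@optimizer_switch;   -- see which optimisations are currently enabled
SET SESSION optimizer_switch = 'semijoin=off,materialization=off';
-- now re-run the IN(), EXISTS() and JOIN versions and compare timings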
Last updated 25 Apr 2011
...but a JOIN (or decorrelated) version of the logic is usually much faster. This query pattern is simple: Inner join the table (t1) to itself (t2) on the grouping key. Add the condition on which you wish to find existing rows to the Join clause.
SELECT DISTINCT m.employeeID FROM employee AS e INNER JOIN employee AS m ON e.Mgr = m.employeeID AND e.salary > m.salary;
The correlated subquery version of the Not Exists query just inserts a strategic NOT:
SELECT DISTINCT e.employeeID FROM employee AS e WHERE NOT EXISTS ( SELECT 1 FROM employee AS m WHERE e.Mgr = m.employeeID AND e.salary > m.salary );
The decorrelated version of Not Exists uses an exclusion join--a LEFT JOIN with an IS NULL condition imposed on the right side of the join: Left join the table to itself on the grouping key. Add the condition on which you wish to find existing rows to the Join clause. For the condition on which you wish to find missing rows on the right side, (a) add the value condition to the Join clause, and (b) in the Where clause, add an 'is null' condition for the right side:
SELECT DISTINCT e.employeeID FROM employee AS e LEFT JOIN employee AS m ON e.Mgr = m.employeeID AND e.salary > m.salary WHERE m.employeeID IS NULL;
This is another instance of the All X for which there is no Y pattern, easily solved with either a Not Exists subquery or an Exclusion join. Exclusion joins usually perform better, but here is a wrinkle (thanks to Pascal Mitride for this example): we might expect to join registrations to students to get student info into the result, yet the registrations table will be the object of the exclusion join, so how do we retrieve the required student info? A solution is to left join exams to a subquery: students left join registrations, restricted by a WHERE clause specifying the target student. The subquery encapsulates conditions on the joined table. Then we can impose the IS NULL condition on the subquery:
SELECT e.exam_name FROM exams AS e -- all exams LEFT JOIN ( SELECT eid -- Hiro's exams FROM students s LEFT JOIN registrations r ON s.sid=r.sid WHERE s.firstname='Hiro' AND s.lastname='Nakamura' ) AS tmp ON e.eid=tmp.eid WHERE tmp.eid IS NULL; -- exclusion +------------------------+ | exam_name | +------------------------+ | Javascript Expert | | Lost Survival Course | | Zend PHP Certification | +------------------------+
A query showing Hiro's registration or not for all exams proves our logic correct:
SELECT e.exam_name,IF(tmp.sid IS NULL, 'No', 'Yes') AS 'Hiro registered' FROM exams e LEFT JOIN ( SELECT eid FROM students s LEFT JOIN registrations r ON s.sid=r.sid WHERE s.firstname='Hiro' AND s.lastname='Nakamura' ) tmp ON e.eid=tmp.eid; +---------------------------+-----------------+ | exam_name | Hiro registered | +---------------------------+-----------------+ | Javascript Expert | No | | Lost Survival Course | No | | Zend PHP Certification | No | | Superhero Advanced Skills | Yes | | Desperation Certificate | Yes | +---------------------------+-----------------+
Beginners often have trouble working out how to write this kind of query, especially when the LEFT JOIN condition requires comparison with a literal value on the right side, thereby making the LEFT JOIN, effectively, an INNER JOIN and undermining the query design. The method is to subtract one set (here, the exams Hiro registered for) from another (all exams). Start by writing a query expression for the set to be subtracted, then write the query expression to be subtracted from, then LEFT JOIN the two, then add the IS NULL condition: 1. Write a query expression to retrieve rows which positively match the exclusion condition. In simple cases, this may be a simple table reference. Here we need it to be an inner query for the exams Hiro Nakamura did register for...
SELECT eid FROM students s LEFT JOIN registrations r ON s.sid=r.sid WHERE s.firstname='Hiro' AND s.lastname='Nakamura'
2. Write a query to retrieve the rows from which the result of [1] is to be subtracted--here, simply all exams...
SELECT e.exam_name FROM exams e
3. LEFT JOIN the query in [2] to the result of [1], aliased as tmp, on the exam id (ON e.eid=tmp.eid). 4. Add a WHERE clause condition requiring that some column in the inner query evaluates to NULL.
WHERE tmp.eid IS NULL;
Last updated 28 Jan 2011
Emulate Row_Number()
ISO SQL defines a ROW_NUMBER() OVER function with an optional PARTITION clause for generating a derived row number column in a resultset. Several RDBMSs, including DB2, Oracle and SQL Server, implement it. Here is the simplest possible example. Given a table with two columns i and j, generate a resultset that has a derived sequential row_number column taking the values 1,2,3,... for a defined ordering of j which resets to 1 when the value of i changes:
DROP TABLE IF EXISTS test; CREATE TABLE test(i int,j int); INSERT INTO test VALUES (3,31),(1,11),(4,14),(1,13),(2,21),(1,12),(2,22),(3,32),(2,23),(3,33);
Can we have this result in MySQL? Here are two ways to get it. First, with user variables:
SET @iprev=0, @jprev=0; SELECT i, j, row_number FROM ( SELECT j, @jprev := if(@iprev = i, @jprev+1, 1) AS row_number, @iprev := i AS i FROM test ORDER BY i,j ) AS tmp;
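The excerpt omits the second (aggregating) method for this simple case; a sketch consistent with the multi-column aggregating version shown below counts, for each row, how many rows in the same i group have a j value no larger than its own:

SELECT a.i, a.j, COUNT(*) AS row_number
FROM test a
JOIN test b ON a.i = b.i AND a.j >= b.j
GROUP BY a.i, a.j;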
To simplify generalising these query patterns to multiple ordering columns, use self-documenting column names:
DROP TABLE IF EXISTS test; CREATE TABLE test(partition int, ord1 int, ord2 int); INSERT INTO test VALUES (3,31,55),(1,11,19),(4,14,23),(2,22,8),(2,22,42),(1,13,56), (2,21,77),(2,21,7),(1,12,17),(2,23,92),(3,32,24),(3,33,62);
The user variable solution easily handles multiple ordering columns: just add the ordering column names to the outer query's SELECT list and the inner query's ORDER BY list:
SET @partitionPrev=0, @ordPrev=0; SELECT partition, ord1, ord2, row_number FROM ( SELECT ord1, ord2,
@ordPrev := If(@partitionPrev = partition, @ordPrev+1, 1) as row_number, @partitionPrev := partition AS partition FROM test ORDER BY partition, ord1, ord2 ) AS tmp; +-----------+------+------+------------+ | partition | ord1 | ord2 | row_number | +-----------+------+------+------------+ | 1 | 11 | 19 | 1 | | 1 | 12 | 17 | 2 | | 1 | 13 | 56 | 3 | | 2 | 21 | 7 | 1 | | 2 | 21 | 77 | 2 | | 2 | 22 | 8 | 3 | | 2 | 22 | 42 | 4 | | 2 | 23 | 92 | 5 | | 3 | 31 | 55 | 1 | | 3 | 32 | 24 | 2 | | 3 | 33 | 62 | 3 | | 4 | 14 | 23 | 1 | +-----------+------+------+------------+
In the aggregating solution for multiple columns, the theta join gets more complex as ordering columns are added:
SELECT a.partition, a.ord1, a.ord2, count(*) as row_number FROM test a JOIN test b ON a.partition=b.partition AND (a.ord1>b.ord1 OR (a.ord1=b.ord1 AND a.ord2>=b.ord2)) GROUP BY a.partition, a.ord1, a.ord2 ;
Last updated 03 May 2010
Next row
You have a table of names, you have retrieved a row with name $name, and you want the row for the next name in name order. MySQL LIMIT syntax makes this very easy:
SELECT * FROM tbl WHERE name > $name ORDER BY name LIMIT 1
to write ...
SELECT ... ... ORDER BY colname+0, colname;
You want to list authors' books while suppressing repeating authors' names. A simple solution is to use MySQL's extremely useful GROUP_CONCAT() function to group books by author:
SELECT a.name AS Author, GROUP_CONCAT(b.name ORDER BY b.name) AS Books FROM book_author AS ba JOIN book AS b ON ba.book_id=b.id JOIN author AS a ON ba.author_id=a.id GROUP BY a.name;
For a neater-looking result: 1. Retrieve authors and their books. 2. Order them 3. Use a variable to remember and suppress repeating author names:
SET @last=''; SELECT IF(r.author=@last,'',@last:=r.author) AS Author, r.book AS Book FROM ( SELECT DISTINCT a.name AS author,b.name AS book FROM book_author AS ba JOIN book AS b ON ba.book_id=b.id JOIN author AS a ON ba.author_id=a.id ORDER BY a.name,b.name ) AS r; +---------------+---------------------------+ | author | book | +---------------+---------------------------+ | Brad Phillips | Design Patterns | | | MySQL in a bucket | | | PHP Professional | | Don Charles | Java Script Programming | | | MySQL in a bucket | | Kur Silver | Databases for Delinquents | +---------------+---------------------------+
Pagination
Suppose you have a phone book of names, addresses, etc. You are displaying 20 rows per page, you're on page 100, and you want to display page 99. How do you do this knowing only what page you are on? Assuming 1-based page numbers, that you are on page P, and that each page shows N rows, the general formula for translating a 1-based page number into a first LIMIT argument is ...
MAX(0,P-1) * N
which for the 99th 20-row page evaluates to 1960, and the second argument to LIMIT is just N, so to see page 99, write...
SELECT ... LIMIT 1960, 20;
The trouble with this is scaling. MySQL doesn't optimise LIMIT well at all, so the bigger the table, the longer LIMIT takes. What's the alternative? Build pagination into the WHERE clause, and ensure there is a covering index for the paginating column. On a table of 100,000 indexed random integers, SELECT ... WHERE ... for the last 20 integers in the table is twice as fast as the comparable LIMIT query. With a million integers, the ratio is over 500!
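A minimal sketch of the WHERE-clause approach (keyset pagination), assuming a phonebook table indexed on name and that the last name shown on the previous page was 'Mendez' (table and column names are illustrative):

SELECT name, address
FROM phonebook
WHERE name > 'Mendez'   -- last value displayed on the previous page
ORDER BY name
LIMIT 20;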
Last updated 25 Nov 2009
With that simple query, you're halfway toward cross-tabulation, halfway to implementing a simple CUBE, and halfway to basic entity-attribute-value (EAV) logic. This is easier to see if we have two columns, rather than just one, to tabulate against the grouping column:
DROP TABLE IF EXISTS tbl; CREATE TABLE tbl( id INT, colID INT, value CHAR(20) ); INSERT INTO tbl VALUES (1,1,'Sampo'),(1,2,'Kallinen'),(1,3,'Office Manager'), (2,1,'Jakko'),(2,2,'Salovaara'),(2,3,'Vice President');
To tabulate all colID and value values against all id values (that is, to write a reporting CUBE for the table), write a GROUP_CONCAT() instruction for each colID found in the table, then GROUP BY id:
SELECT id, GROUP_CONCAT(if(colID = 1, value, NULL)) AS 'First Name', GROUP_CONCAT(if(colID = 2, value, NULL)) AS 'Last Name', GROUP_CONCAT(if(colID = 3, value, NULL)) AS 'Job Title' FROM tbl GROUP BY id; +------+------------+-----------+----------------+ | id | First Name | Last Name | Title | +------+------------+-----------+----------------+ | 1 | Sampo | Kallinen | Office Manager | | 2 | Jakko | Salovaara | Vice President | +------+------------+-----------+----------------+
Since Paul Spinks first taught a spreadsheet how to do that cross-tabulation trick in 1979, this has also been known as a pivot table: we pivot colID and value against ID. (Of course for a proper EAV representation, we'd add an attributes table:
DROP TABLE IF EXISTS attrs; CREATE TABLE attrs(colID INT,attr CHAR(12)); INSERT INTO attrs VALUES (1,'First Name'),(2,'Last Name'),(3,'Title');
and write a stored procedure to PREPARE the above query from table-based attribute names.) More often, crosstab queries calculate. Here is a simple sales table:
DROP TABLE IF EXISTS sales; CREATE TABLE Sales (empID INT, yr SMALLINT, sales DECIMAL(10,2)); INSERT sales VALUES (1, 2005, 12000),(1, 2006, 18000),(1, 2007, 25000), (2, 2005, 15000),(2, 2006, 6000),(3, 2006, 20000),(3, 2007, 24000);
In Microsoft SQL Server, CUBE/PIVOT syntax for horizontal and vertical totals is:
SELECT EmpId, [2005], [2006], [2007], [ALL] FROM ( SELECT CASE WHEN GROUPING(EmpId)=0 THEN CAST(EmpId AS CHAR(7)) ELSE 'ALL' END AS EmpId, CASE WHEN GROUPING(Yr)=0 THEN CAST(Yr AS CHAR(7)) ELSE 'ALL' END AS Yr, SUM(Sales) AS Sales FROM Sales GROUP BY EmpId, Yr WITH CUBE ) AS s PIVOT( SUM(Sales) FOR Yr IN ([2005], [2006], [2007], [ALL]) ) AS p
In MySQL, it's a two-step: 1. In an inner query, write one aggregating expression per reporting column, 2. In an outer query, build the horizontal sums:
SELECT IFNULL(empId,'Totals') AS EmpId, -- outer query labels rollup row sums.2005, sums.2006, sums.2007, -- and calculates horizontal sums sums.2005 + sums.2006 + sums.2007 AS Sums FROM ( -- inner query groups by employee SELECT -- with an expression for each column EmpID, SUM(IF(Yr=2005,sales,0)) As '2005', SUM(IF(Yr=2006,sales,0)) As '2006', SUM(IF(Yr=2007,sales,0)) As '2007' FROM Sales GROUP BY EmpID WITH ROLLUP ) AS sums; +--------+----------+----------+----------+-----------+ | EmpId | 2005 | 2006 | 2007 | Sums | +--------+----------+----------+----------+-----------+ | 1 | 12000.00 | 18000.00 | 25000.00 | 55000.00 | | 2 | 15000.00 | 6000.00 | 0.00 | 21000.00 |
| 3 | 0.00 | 20000.00 | 24000.00 | 44000.00 | | Totals | 27000.00 | 44000.00 | 49000.00 | 120000.00 | +--------+----------+----------+----------+-----------+
That approach works beautifully for the common need to tabulate monthly amounts by year, say from an order history table orderhist(orderdate date,amount decimal(10,2)):
SELECT IfNull(Year,'Totals') Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, `Dec`, Qty AS Count, Yrly as 'Yrly Total'
FROM (
  SELECT
    year(orderdate) AS 'Year',
    Round(Sum(CASE WHEN Month(orderdate)= 1 THEN amount ELSE 0 END),2) AS Jan,
    Round(Sum(CASE WHEN Month(orderdate)= 2 THEN amount ELSE 0 END),2) AS Feb,
    Round(Sum(CASE WHEN Month(orderdate)= 3 THEN amount ELSE 0 END),2) AS Mar,
    Round(Sum(CASE WHEN Month(orderdate)= 4 THEN amount ELSE 0 END),2) AS Apr,
    Round(Sum(CASE WHEN Month(orderdate)= 5 THEN amount ELSE 0 END),2) AS May,
    Round(Sum(CASE WHEN Month(orderdate)= 6 THEN amount ELSE 0 END),2) AS Jun,
    Round(Sum(CASE WHEN Month(orderdate)= 7 THEN amount ELSE 0 END),2) AS Jul,
    Round(Sum(CASE WHEN Month(orderdate)= 8 THEN amount ELSE 0 END),2) AS Aug,
    Round(Sum(CASE WHEN Month(orderdate)= 9 THEN amount ELSE 0 END),2) AS Sep,
    Round(Sum(CASE WHEN Month(orderdate)=10 THEN amount ELSE 0 END),2) AS Oct,
    Round(Sum(CASE WHEN Month(orderdate)=11 THEN amount ELSE 0 END),2) AS Nov,
    Round(Sum(CASE WHEN Month(orderdate)=12 THEN amount ELSE 0 END),2) AS `Dec`,
    Count(*) AS Qty,
    Round(Sum(amount),2) AS Yrly
  FROM orderhist
  GROUP BY year WITH ROLLUP
) AS sums;
More likely the required sums need to be calculated across joins. Build them in using a good text editor. Here is the above query for the orders and orderdetails table in the Northwind database:
SELECT IfNull(Year,'Totals') Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, `Dec`, quantity AS Count, Yrly as 'Yrly Total'
FROM (
  SELECT
    year(orderdate) AS 'Year',
    Round(Sum(If( Month(o.orderdate)= 1, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Jan,
    Round(Sum(If( Month(o.orderdate)= 2, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Feb,
    Round(Sum(If( Month(o.orderdate)= 3, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Mar,
    Round(Sum(If( Month(o.orderdate)= 4, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Apr,
    Round(Sum(If( Month(o.orderdate)= 5, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS May,
    Round(Sum(If( Month(o.orderdate)= 6, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Jun,
    Round(Sum(If( Month(o.orderdate)= 7, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Jul,
    Round(Sum(If( Month(o.orderdate)= 8, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Aug,
    Round(Sum(If( Month(o.orderdate)= 9, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Sep,
    Round(Sum(If( Month(o.orderdate)=10, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Oct,
    Round(Sum(If( Month(o.orderdate)=11, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS Nov,
    Round(Sum(If( Month(o.orderdate)=12, (d.unitprice*d.quantity)-d.discount, 0 )), 2) AS `Dec`,
    Count(*) AS quantity,
    Round(Sum((unitprice*quantity)-discount),2) AS Yrly
  FROM orders o JOIN orderdetails d USING(orderID)
  GROUP BY year WITH ROLLUP
) AS sums;
With many columns or subtotals, pivot tables get complicated, time-consuming and error-prone. Automation is needed. Oracle and Microsoft SQL Server have CUBE syntax to simplify the job, especially for big pivot tables. MySQL doesn't. See "Automate pivot table queries" for how to roll your own cube with MySQL.
Last updated 16 Feb 2011
If you are asked to tabulate sales amount against salesperson and product, you write a pivot table query:
SELECT product, SUM( CASE salesperson WHEN 'bob' THEN amount ELSE 0 END ) AS 'Bob', SUM( CASE salesperson WHEN 'sam' THEN amount ELSE 0 END ) AS 'Sam', SUM( amount ) AS Total FROM sales GROUP BY product WITH ROLLUP; +---------+--------+--------+--------+ | product | Bob | Sam | Total | +---------+--------+--------+--------+ | radio | 200.00 | 200.00 | 400.00 | | tv | 200.00 | 300.00 | 500.00 | | NULL | 400.00 | 500.00 | 900.00 | +---------+--------+--------+--------+
The query generates one product per row and one column per salesperson. The pivoting CASE expressions assign values of sales.amount to the matching salesperson's column. For two products and two salespersons, it's a snap once you've done it a few times. When there are dozens of products and salespersons, though, writing the query becomes tiresome and error-prone. Some years ago Giuseppe Maxia published a little query that automates writing the pivot expressions. His idea was to embed the syntax for lines like the SUM( CASE ...) lines above in a query for the DISTINCT values. At the time Giuseppe was writing, MySQL did not support stored procedures. Now that it does, we can further generalise Giuseppe's idea by parameterising it in a stored procedure. Admittedly, it's a little daunting. To write a query with variable names rather than the usual literal table and column names, we have to write PREPARE statements. What we propose to do here is to write SQL that writes PREPARE statements. Code which writes code which writes code. Not a job for the back of a napkin. It's easy enough to write the sproc shell. We keep generic queries in a sys database, so the routine needs parameters specifying database, table, pivot column and (in some cases) the aggregating column. Then what? What worked for us was to proceed from back to front:
1. Write the pivot expressions for a specific case.
2. Write the PREPARE statement that generates those expressions.
3. Parameterise the result of #2.
4. Put the result of #3 in an sproc.
Further complicating matters, we soon found that different summary aggregations, for example COUNT and SUM, require different sprocs. Here is the routine for generating COUNT pivot expressions:
USE sys; DROP PROCEDURE IF EXISTS writecountpivot; DELIMITER | CREATE PROCEDURE writecountpivot( db CHAR(64), tbl CHAR(64), col CHAR(64) ) BEGIN DECLARE datadelim CHAR(1) DEFAULT '"'; DECLARE singlequote CHAR(1) DEFAULT CHAR(39); DECLARE comma CHAR(1) DEFAULT ','; SET @sqlmode = (SELECT @@sql_mode); SET @@sql_mode=''; SET @sql = CONCAT( 'SELECT DISTINCT CONCAT(', singlequote, ',SUM(IF(', col, ' = ', datadelim, singlequote, comma, col, comma, singlequote, datadelim, comma, '1,0)) AS `', singlequote, comma, col, comma, singlequote, '`', singlequote, ') AS countpivotarg FROM ', db, '.', tbl, ' WHERE ', col, ' IS NOT NULL' ); -- UNCOMMENT TO SEE THE MIDLEVEL CODE: -- SELECT @sql; PREPARE stmt FROM @sql; EXECUTE stmt; DROP PREPARE stmt; SET @@sql_mode=@sqlmode; END; | DELIMITER ; CALL sys.writecountpivot('test','sales','salesperson');
and returns...
+--------------------------------------------+ | countpivotarg | +--------------------------------------------+ | ,SUM(IF(salesperson = "bob",1,0)) AS `bob` | | ,SUM(IF(salesperson = "sam",1,0)) AS `sam` | +--------------------------------------------+
Not overwhelming for two columns, very convenient if there are 20. (Yes, it could also be written with COUNT( ... 1, NULL)). One point to notice is that the two levels of code generation create quotemark nesting problems. To make the double quotemark '"' available for data value delimiting, we turn off ANSI_QUOTES during code generation, and put it back afterwards.
SUM pivot
USE sys; DROP PROCEDURE IF EXISTS writesumpivot; DELIMITER | CREATE PROCEDURE writesumpivot( db CHAR(64), tbl CHAR(64), pivotcol CHAR(64), sumcol CHAR(64) ) BEGIN DECLARE datadelim CHAR(1) DEFAULT '"'; DECLARE comma CHAR(1) DEFAULT ','; DECLARE singlequote CHAR(1) DEFAULT CHAR(39); SET @sqlmode = (SELECT @@sql_mode); SET @@sql_mode=''; SET @sql = CONCAT( 'SELECT DISTINCT CONCAT(', singlequote, ',SUM(IF(', pivotcol, ' = ', datadelim, singlequote, comma, pivotcol, comma, singlequote, datadelim, comma, sumcol, ',0)) AS `', singlequote, comma, pivotcol, comma, singlequote, '`', singlequote, ') AS sumpivotarg FROM ', db, '.', tbl, ' WHERE ', pivotcol, ' IS NOT NULL' ); -- UNCOMMENT TO SEE THE MIDLEVEL SQL: -- SELECT @sql; PREPARE stmt FROM @sql; EXECUTE stmt; DROP PREPARE stmt; SET @@sql_mode=@sqlmode; END; | DELIMITER ; CALL writesumpivot('test','sales','salesperson','amount'); +-------------------------------------------------+ | sumpivotarg | +-------------------------------------------------+ | ,SUM(IF(salesperson = "bob",amount,0)) AS `bob` | | ,SUM(IF(salesperson = "sam",amount,0)) AS `sam` | +-------------------------------------------------+
There are higher levels of generality beckoning, for example a routine that generates a complete pivot table query, not just the pivot expressions.
There are nine employees, and 830 orders dated from 1996 through 1998. Each order points to an employeeID. Suppose we wish to report counts of orders taken by employees pivoted on year--how would we proceed? We do the pivot table 3-step:
1. Write the basic aggregating query, a GROUP BY query to aggregate the data on desired variables (in this case, employee and year).
2. Write the pivoting query as an outer query that creates a column for each pivot value (year) from #1 written as an inner subquery. (Sometimes it is more efficient to write the results of #1 to a temp table and write #2 to refer to the temp table.)
3. Fix a ROLLUP display glitch by encapsulating #2 in a new outer query that labels the ROLLUP row meaningfully.
Here are the three steps in more detail: 1. Group the joined counts by the two criteria, employee and order year, yielding one result row per employee per year:
SELECT CONCAT(firstname,' ',lastname) AS 'Employee', YEAR(OrderDate) AS col, COUNT(*) AS Data FROM Employees e JOIN Orders o ON e.EmployeeID = o.EmployeeID GROUP BY e.employeeID, YEAR(o.OrderDate); +------------------+------+------+ | Employee | col | Data | +------------------+------+------+ | Nancy Davolio | 1996 | 26 | | Nancy Davolio | 1997 | 55 | | Nancy Davolio | 1998 | 42 | | Andrew Fuller | 1996 | 16 | | Andrew Fuller | 1997 | 41 | | Andrew Fuller | 1998 | 39 | | Janet Leverling | 1996 | 18 | | Janet Leverling | 1997 | 71 | | Janet Leverling | 1998 | 38 | | Margaret Peacock | 1996 | 31 | | Margaret Peacock | 1997 | 81 | | Margaret Peacock | 1998 | 44 | | Steven Buchanan | 1996 | 11 | | Steven Buchanan | 1997 | 18 | | Steven Buchanan | 1998 | 13 | | Michael Suyama | 1996 | 15 | | Michael Suyama | 1997 | 33 | | Michael Suyama | 1998 | 19 | | Robert King | 1996 | 11 | | Robert King | 1997 | 36 | | Robert King | 1998 | 25 | | Laura Callahan | 1996 | 19 | | Laura Callahan | 1997 | 54 | | Laura Callahan | 1998 | 31 | | Anne Dodsworth | 1996 | 5 | | Anne Dodsworth | 1997 | 19 | | Anne Dodsworth | 1998 | 19 | +------------------+------+------+
Nine employees for three years yield 27 aggregated rows. 2. We want one summary row per employee, and one count column for each year when an employee took an order. We pivot the rows of the above resultset on year by querying the above resultset, defining a column for every year found, for example:
SUM( CASE col WHEN '1996' THEN data ELSE 0 END ) AS '1996',
grouping the result by row WITH ROLLUP to provide a row of column sums at the bottom. This gives the following query:
SELECT Employee, SUM( CASE col WHEN '1996' THEN data ELSE 0 END ) AS '1996', SUM( CASE col WHEN '1997' THEN data ELSE 0 END ) AS '1997', SUM( CASE col WHEN '1998' THEN data ELSE 0 END ) AS '1998', SUM( data ) AS Total -- sums across years by employee FROM ( SELECT -- the query from step #1 CONCAT(firstname,' ',lastname) AS 'Employee', YEAR(OrderDate) AS 'col', COUNT(*) AS Data FROM Employees e JOIN Orders o ON e.EmployeeID = o.EmployeeID GROUP BY e.employeeID, YEAR(o.OrderDate) ) AS stats GROUP BY employee WITH ROLLUP; +------------------+------+------+------+-------+ | Employee | 1996 | 1997 | 1998 | Total | +------------------+------+------+------+-------+ | Andrew Fuller | 16 | 41 | 39 | 96 | | Anne Dodsworth | 5 | 19 | 19 | 43 |
| Janet Leverling | 18 | 71 | 38 | 127 | | Laura Callahan | 19 | 54 | 31 | 104 | | Margaret Peacock | 31 | 81 | 44 | 156 | | Michael Suyama | 15 | 33 | 19 | 67 | | Nancy Davolio | 26 | 55 | 42 | 123 | | Robert King | 11 | 36 | 25 | 72 | | Steven Buchanan | 11 | 18 | 13 | 42 | | NULL | 152 | 408 | 270 | 830 | +------------------+------+------+------+-------+
3. The result of #2 is correct except that sums ought not to be reported as NULL! We fix that bit of weirdness by writing query #2 as a derived table, and having the new outer query alias the yearly sums row:
SELECT IFNULL( employee, 'SUMS') AS Employee, `1996`, `1997`, `1998`, Total FROM ( SELECT Employee, SUM( CASE col WHEN '1996' THEN data ELSE 0 END ) AS '1996', SUM( CASE col WHEN '1997' THEN data ELSE 0 END ) AS '1997', SUM( CASE col WHEN '1998' THEN data ELSE 0 END ) AS '1998', SUM( data ) AS Total FROM ( SELECT CONCAT(firstname,' ',lastname) AS 'Employee', YEAR(OrderDate) AS 'col', COUNT(*) AS Data FROM Employees e JOIN Orders o ON e.EmployeeID = o.EmployeeID GROUP BY e.employeeID, YEAR(o.OrderDate) ) AS stats GROUP BY employee WITH ROLLUP ) AS stats2;
+------------------+------+------+------+-------+
| Employee         | 1996 | 1997 | 1998 | Total |
+------------------+------+------+------+-------+
| Andrew Fuller    |   16 |   41 |   39 |    96 |
| Anne Dodsworth   |    5 |   19 |   19 |    43 |
| Janet Leverling  |   18 |   71 |   38 |   127 |
| Laura Callahan   |   19 |   54 |   31 |   104 |
| Margaret Peacock |   31 |   81 |   44 |   156 |
| Michael Suyama   |   15 |   33 |   19 |    67 |
| Nancy Davolio    |   26 |   55 |   42 |   123 |
| Robert King      |   11 |   36 |   25 |    72 |
| Steven Buchanan  |   11 |   18 |   13 |    42 |
| SUMS             |  152 |  408 |  270 |   830 |
+------------------+------+------+------+-------+
(Thanks to Chris Gates for the correction.) With multiple statistics and pivot layers, a pivot table query can get complex, but following this 3-step will keep things clear.
Last updated 15 Apr 2011
MAX(...) decides between an entry and a blank (the entry will win if one exists) while the group by lines everything up on the same row. Friendly caution: If more than one entry exists for the same day and time, you will only see the one that is alphabetically "greater". To see how many classes are scheduled by day for each slot (to check for conflicts) try:
SELECT slot
     , sum(if(day=1,1,0)) as day1
     , sum(if(day=2,1,0)) as day2
     , sum(if(day=3,1,0)) as day3
     , sum(if(day=4,1,0)) as day4
     , sum(if(day=5,1,0)) as day5
FROM schedule
GROUP BY slot;
There is a pattern: columns you want as "row headers" are listed both in the SELECT _and_ in the GROUP BY clauses; values you want as columns are selectively chosen by IF() functions to return something or nothing, so that one of the aggregate functions (MIN, MAX, SUM, AVG, etc.) can condense those rows and columns into single values.
What to count in any column can be as complex as you like. If you don't like the IF() function you can use CASE statements or anything else to help you provide the correct set of values for the aggregate function you are using on that column. You can mix and match aggregate functions to be what you need. Let's say you join the appropriate tables together to form a data set that looks like:
day, slot, subject, student, grade
and you wanted to build a chart showing the statistics of grades vs. days and subject. You could use this type of query to work that out:
SELECT day, subject , AVG(grade) as average , MIN(grade) as lowest , MAX(grade) as highest from <necessary tables> group by day, subject
Now while one needn't choose values for columns, to "pivot" that table by days (each column representing statistics for just one day) change the query to:
SELECT subject , AVG(IF(day=1, grade,null)) as D1_average , MIN(IF(day=1, grade, null)) as D1_lowest , MAX(IF(day=1,grade,null)) as D1_highest , AVG(IF(day=2, grade,null)) as D2_average , MIN(IF(day=2, grade, null)) as D2_lowest , MAX(IF(day=2,grade,null)) as D2_highest , .... (repeat for rest of the days) FROM <necessary tables> GROUP BY subject
The IF ... NULL test prevents AVG() from counting all other grades for the same subject from different days. The same trick works for MIN and MAX functions.
and you wish a resultset that links first names to last names for each ID...
user_id  firstname  lastname
1        Rogier     Marat
2        Jean       Smith
the following query accomplishes the required pivot via an INNER SELF-JOIN:
SELECT u1.user_ID, class_value AS firstname, u2.lastname FROM user_class AS u1 INNER JOIN ( SELECT user_ID, class_value AS lastname FROM user_class WHERE class_id='lastname' ) AS u2 ON u1.user_ID=u2.user_ID AND u1.class_id='firstname'
| 2 | | 3 | +---------+ select id as 'available ingredients' from p; +-----------------------+ | available ingredients | +-----------------------+ | 1 | | 2 | | 3 | +-----------------------+ select rid as recipe, pid as ingredient from ri; +--------+------------+ | recipe | ingredient | +--------+------------+ | 1 | 1 | | 1 | 2 | | 2 | 1 | | 2 | 4 | | 3 | 5 | | 3 | 6 | | 3 | 7 | +--------+------------+
Given our ingredients, what recipes can we make? Inspection shows the answer is recipe #1. SQL has no universal quantifier, so how do we proceed? 'All A is B' is logically equivalent to the double negative 'there is no A that is not B', so we can reformulate the requirement ...
  list the recipes for which we have all ingredients
into terms SQL can handle ...
  list the recipes for which there is no ingredient we do not have
A double negative, so a double query. One inner query, one outer. Tackle the inner one first: find the recipes for which we are missing an ingredient. That's a straight exclusion join, i.e., a left join on ingredient from 'required' to 'available', plus a where clause that restricts the resultset to nulls on the right ('available') side of the join:
SELECT DISTINCT rid AS 'Recipes for which the pantry is missing some ingredients' FROM ri LEFT JOIN p ON ri.pid=p.id WHERE p.id IS NULL; +----------------------------------------------------------+ | Recipes for which the pantry is missing some ingredients | +----------------------------------------------------------+ | 2 | | 3 | +----------------------------------------------------------+
Our outer query has to find the recipes which are not in this list. That's another exclusion join, this time from recipes to the above derived table:
SELECT r.id FROM r LEFT JOIN ( SELECT DISTINCT rid FROM ri LEFT JOIN p ON ri.pid=p.id WHERE p.id IS NULL ) AS rno ON r.id = rno.rid WHERE rno.rid IS NULL; +------+ | id | +------+ | 1 | +------+
It's an example of relational division, one of Codd's eight basic relational operations. Dividing a divisor table into a dividend table yields a quotient or results table: dividend / divisor = quotient. As in arithmetic, multiplication reverses it: divisor * quotient = dividend.
+-------+
|   A   |
+-------+
| key_a |
+-------+
|   2   |
|   4   |
+-------+

+-------+
|   B   |
+-------+
| key_b |
+-------+
|   1   |
|   7   |
|   3   |
+-------+

+-------------+
|  table AxB  |
+-------+-----+
| key_a |key_b|
+-------+-----+
|   2   |  1  |
|   2   |  7  |
|   2   |  3  |
|   4   |  1  |
|   4   |  7  |
|   4   |  3  |
+-------+-----+
When we multiply (CROSS JOIN) tables A and B to yield AxB, AxB gets a row combining every row of A with every row of B, and all the columns from A and B. When we reverse that operation, dividing AxB by B, we get back A: the key_a values that are paired with every key_b value in B.
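A sketch of that division in SQL, assuming the cross product and the divisor are stored in tables axb(key_a, key_b) and b(key_b) as shown above:

SELECT key_a
FROM axb
GROUP BY key_a
HAVING COUNT(DISTINCT key_b) = ( SELECT COUNT(*) FROM b );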
Why is it called relational division? See the All possible recipes with given ingredients entry. Here the dividend is candidates, the divisor is districts and the quotient is a party count. Most NOT EXISTS() queries can be translated into exclusion joins, which are often much faster. An exclusion join from A to B excludes A rows for which the LEFT JOIN condition finds NULLs in B. The query we are translating has two NOT EXISTS clauses, so we need two exclusion joins:
SELECT p.party FROM parties p LEFT JOIN ( SELECT a.party FROM ( SELECT DISTINCT party,district FROM parties CROSS JOIN districts ) a LEFT JOIN candidates c ON a.party=c.party AND a.district=c.district WHERE c.party IS NULL ) b ON p.party=b.party WHERE b.party IS NULL;
Like numeric division, relational division has a gotcha: divide by zero. If the divisor table has zero rows, the quotient counts all distinct dividend instances. If that is not what you want, use aggregation. Most "all Xs for which all Y are Z" queries can be written in any of these three ways. Try each one to see which performs best for your problem.
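A sketch of the aggregation form of the parties query, assuming columns candidates(party, district) and districts(district); unlike the exclusion-join form, it returns no rows when the districts table is empty:

SELECT c.party
FROM candidates c
JOIN districts d ON c.district = d.district
GROUP BY c.party
HAVING COUNT(DISTINCT c.district) = ( SELECT COUNT(*) FROM districts );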
-- two adjacent seats select a.row,a.seat,b.seat from seats a join seats b on a.row=b.row and a.seat=b.seat-1 where a.booked=0 and b.booked=0 limit 1; +------+------+------+ | row | seat | seat | +------+------+------+ | i | 1 | 2 | +------+------+------+ -- three adjacent seats select a.row,a.seat,b.seat,c.seat from seats a join seats b on a.row=b.row and a.seat=b.seat-1 join seats c on a.row=c.row and b.seat=c.seat-1 where a.booked=0 and b.booked=0 and c.booked=0 limit 1; +------+------+------+------+ | row | seat | seat | seat | +------+------+------+------+ | i | 1 | 2 | 3 | +------+------+------+------+
But that approach requires a different query for each different number of adjacent seats. The Find blocks of unused numbers query pattern can be used to find all empty seats. Optionally you can add a HAVING clause specifying the number of adjacent seats required:
SELECT e.row, firstUnused, IF(mincseat IS NULL, IFNULL(dseat,firstUnused),mincseat-1) AS lastUnused FROM ( SELECT first.row, first.seat AS firstUnused, MIN(c.seat) AS mincseat, MAX(d.seat) AS dseat FROM ( SELECT a.row, a.seat FROM seats a LEFT JOIN seats b ON a.row=b.row AND a.seat=b.seat + 1 WHERE a.booked=0 AND (b.seat IS NULL OR b.booked=1) ) AS first LEFT JOIN seats c ON first.seat < c.seat AND c.booked=1 AND c.row = first.row LEFT JOIN seats d ON first.seat < d.seat AND d.booked=0 AND d.row = first.row GROUP BY firstUnused ) AS e HAVING lastUnused-firstUnused >= N-1; -- N = no. of required adjacent seats
Last updated 26 Oct 2010
The first ID in any unused sequence has used=0 and either no immediate predecessor, or an immediate predecessor where used=1. The last ID of any unused sequence either has no successor or the successor has used=1. So:
1. Find the first ID of every unused sequence by left joining each row with used=0 to the immediate predecessor row, conditioning the result on the predecessor row not existing or having used=1.
2. As a basis for finding the last ID of every unused sequence that is followed by a row with used=1, left join first unused rows to rows with larger IDs and used=1.
3. As a basis for finding the last ID of an unused sequence which is also the largest ID in the table, left join first unused rows to rows with larger IDs and used=0.
4. For each first unused ID, the last unused ID in its sequence is one less than the smallest used ID greater than the first ID if it exists, otherwise it is the maximum unused ID greater than the first ID.
SELECT firstUnused, IF(mincid IS NULL, IFNULL(did,firstUnused),mincid-1) AS lastUnused FROM ( SELECT first.id AS firstUnused, MIN(c.id) AS mincid, MAX(d.id) AS did FROM ( SELECT a.id FROM tbl a LEFT JOIN tbl b ON a.id=b.id + 1 WHERE a.used=0 AND (b.id IS NULL OR b.used=1) ) AS first LEFT JOIN tbl c ON first.id<c.id AND c.used=1 LEFT JOIN tbl d ON first.id<d.id AND d.used=0 GROUP BY firstUnused ) AS e; +-------------+------------+ | firstUnused | lastUnused | +-------------+------------+ | 3 | 3 | | 5 | 5 | | 9 | 10 | | 13 | 15 | +-------------+------------+
Thanks to Don Armstrong for finding a case where our previous algorithm failed.
Last updated 25 Sep 2009
For all the gaps, including gaps of more than 1 value, you need something a little more baroque...
SELECT a.id+1 AS 'Missing From', MIN(b.id) - 1 AS 'To' FROM tbl AS a, tbl AS b WHERE a.id < b.id GROUP BY a.id HAVING `Missing From` < MIN(b.id); +--------------+------+ | Missing From | To | +--------------+------+ | 3 | 3 | | 5 | 17 | +--------------+------+
We often need such lists, so the query is a natural for a stored procedure that finds missing sequence values in any table:
DROP PROCEDURE IF EXISTS MissingInSeq; DELIMITER | CREATE PROCEDURE MissingInSeq( db VARCHAR(64), tbl VARCHAR(64), col VARCHAR(64) ) BEGIN SET @sql = CONCAT( "SELECT a.", col, "+1 AS 'Missing From',", "MIN(b.", col, ") - 1 AS 'To' FROM ", db, ".", tbl, " AS a,", db, ".", tbl, " AS b WHERE a.", col, " < b.", col, " GROUP BY a.", col, " HAVING a.", col, " < MIN(b.", col, ") - 1" ); -- SELECT @sql; PREPARE stmt FROM @sql; EXECUTE stmt; DROP PREPARE stmt; END; | DELIMITER ;
The same logic can retrieve next lowest values. Suppose we wish to track daily changes in closing price:
drop table if exists t; create table t (date date, closing_price decimal(8,2) ); insert into t values ('2009-01-02', 5),('2009-01-03', 5.5),('2009-01-04', 4),('2009-01-05', 6);
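A sketch of one way to report the daily change (the original query is not shown in this excerpt): for each row, look up the closing price on the latest earlier date and subtract.

SELECT a.date,
       a.closing_price,
       a.closing_price - b.closing_price AS change_from_previous_close
FROM t a
JOIN t b ON b.date = ( SELECT MAX(date) FROM t WHERE date < a.date );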
Here is another algorithm, by Baron Schwartz (xaprb.com), for retrieving the previous and next column values in a sequence, given a particular column value thisvalue. The previous value is the maximum value less than thisvalue, and the next value is the minimum value greater than thisvalue:
SELECT IF(col > thisvalue,'next','prev') AS Direction, IF(col > thisvalue,MIN(col),MAX(col)) AS 'Prev/Next' FROM tablename WHERE col <> thisvalue GROUP BY SIGN(col - thisvalue);
So, to find the previous and next order ids in the Northwind database table orders (nwib.orders), starting from order number 10800:
SELECT IF(orderid > 10800,'next','prev') AS Direction, IF(orderid > 10800,MIN(orderid),MAX(orderid)) AS 'Prev/Next' FROM nwib.orders WHERE orderid <> 10800 GROUP BY SIGN(orderid - 10800); +-----------+-----------+ | Direction | Prev/Next | +-----------+-----------+ | prev | 10799 | | next | 10801 | +-----------+-----------+
Or, it can be embedded in the FROM clause of another query, for example ...
SELECT o2.OrderID,o2.Value,o.customerid FROM orders o JOIN ( SELECT 'This' AS 'OrderId', 10800 AS 'Value' UNION SELECT IF( orderid > 10800, 'Next', 'Prev') AS Which, IF( orderid > 10800, MIN(orderid), MAX(orderid )) AS 'Value' FROM orders WHERE orderid <> 10800 GROUP BY SIGN( orderid - 10800 ) ) AS o2 ON o.orderid=o2.value ORDER BY o.orderid; +---------+-------+------------+ | OrderID | Value | customerid | +---------+-------+------------+ | Prev | 10799 | KOENE | | This | 10800 | SEVES | | Next | 10801 | BOLID | +---------+-------+------------+
Last updated 22 May 2009
2. For each row in this resultset, find the row with the lowest ordering value amongst the higher values. For the example data, applying this step to the above resultset gives:
+-----+-----+ | 2 | 4 | | 4 | 6 | | 6 | 8 | | 8 | 10 | +-----+-----+
The fastest way to do step #2 is a self-exclusion join (see self-exclusion join examples elsewhere on this page). Put it together:
drop table if exists t; create table t(id int); insert into t values(2),(4),(6),(8),(10); select x.aid as id,x.bid as nextvalue from ( select a.id as aid,b.id as bid from t a join t b on a.id<b.id ) x left join ( select a.id as aid,b.id as bid from t a join t b on a.id<b.id ) y on x.aid=y.aid and x.bid>y.bid where y.bid is null order by x.aid,x.bid; +------+-----------+ | id | nextvalue | +------+-----------+ | 2 | 4 | | 4 | 6 | | 6 | 8 | | 8 | 10 | +------+-----------+
Modify the algorithm to suit for next lowest values. Here is an example:
select x.* from ( select a.d as thisdate, a.i as thisvalue, b.d as nextdate, b.i as nextvalue from t a join t b on a.d < b.d ) x left join ( select a.d as thisdate, b.d as nextdate from t a join t b on a.d < b.d ) y on x.thisdate = y.thisdate and x.nextdate > y.nextdate where y.nextdate is null order by x.thisdate, x.nextdate; +---------------------+-----------+---------------------+-----------+ | thisdate | thisvalue | nextdate | nextvalue | +---------------------+-----------+---------------------+-----------+ | 2009-12-01 00:00:00 | 1 | 2009-12-03 00:00:00 | 3 | | 2009-12-03 00:00:00 | 3 | 2009-12-05 00:00:00 | 5 | | 2009-12-05 00:00:00 | 5 | 2009-12-08 00:00:00 | 8 | +---------------------+-----------+---------------------+-----------+
Last updated 11 Dec 2009
... an exclusion join on the previous sequential value finds the first value of each sequence, and the minimum candidate value, taken from a left join whose exclusion join on the next sequential value finds no successor, gives the end of each sequence:
SELECT a.id AS Start, MIN( c.id ) AS End FROM tbl AS a LEFT JOIN tbl AS b ON a.id = b.id + 1 LEFT JOIN tbl AS c ON a.id <= c.id LEFT JOIN tbl AS d ON c.id = d.id - 1 WHERE b.id IS NULL AND c.id IS NOT NULL AND d.id IS NULL GROUP BY a.id; +-------+------+ | Start | End | +-------+------+ | 1 | 4 | | 6 | 8 | +-------+------+
Thanks to Scott Noyes for noticing that a.id<c.id fails to pick up sequences of 1 followed by skips of 1, but a.id<=c.id does. To see how that query works, look at the output of this version of the query with exclusion and aggregation clauses removed:
SELECT a.id AS aid, b.id AS bid, c.id AS cid, d.id AS did FROM tbl AS a LEFT JOIN tbl AS b ON a.id = b.id + 1 LEFT JOIN tbl AS c ON a.id <= c.id LEFT JOIN tbl AS d ON c.id = d.id - 1 ORDER BY a.id,b.id,c.id,d.id; +------+------+------+------+ | aid | bid | cid | did | +------+------+------+------+ | 1 | NULL | 1 | 2 | | 1 | NULL | 2 | 3 | | 1 | NULL | 3 | 4 | | 1 | NULL | 4 | NULL | <-- end of sequence starting with 1 | 1 | NULL | 6 | 7 | | 1 | NULL | 7 | 8 | | 1 | NULL | 8 | NULL | | 2 | 1 | 2 | 3 | | 2 | 1 | 3 | 4 | | 2 | 1 | 4 | NULL | | 2 | 1 | 6 | 7 | | 2 | 1 | 7 | 8 | | 2 | 1 | 8 | NULL | | 3 | 2 | 3 | 4 | | 3 | 2 | 4 | NULL | | 3 | 2 | 6 | 7 | | 3 | 2 | 7 | 8 | | 3 | 2 | 8 | NULL | | 4 | 3 | 4 | NULL | | 4 | 3 | 6 | 7 |
A variant of the problem: when some IDs are used and some are not, find blocks of unused IDs:
DROP TABLE IF EXISTS tbl; CREATE TABLE tbl(id INT,used BOOL); INSERT INTO tbl VALUES(1,1),(2,0),(3,0),(4,1),(5,0),(6,0); SELECT a.id AS Start, MIN( c.id ) AS End FROM tbl AS a LEFT JOIN tbl AS b ON a.id=b.id + 1 AND a.used=0 AND b.used=0 LEFT JOIN tbl AS c ON a.id<=c.id AND a.used=0 AND c.used=0 LEFT JOIN tbl AS d ON c.id=d.id-1 AND c.used=0 AND d.used=0 WHERE b.id IS NULL AND c.id IS NOT NULL AND d.id IS NULL GROUP BY a.id; +-------+------+ | Start | End | +-------+------+ | 2 | 3 | | 5 | 6 | +-------+------+
Here's another variation on the pattern from a MySQL forum. You have a history of prescription dose changes ...
DROP TABLE IF EXISTS dose_change; CREATE TABLE dose_change ( oid INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT, dose_date DATETIME NOT NULL, dose INTEGER UNSIGNED ); INSERT INTO dose_change (dose_date, dose) values ('2000-01-01', 10),('2000-01-02', 10),('2000-01-03', 20),('2000-01-04', 20), ('2000-01-05', 10),('2000-01-06', 10),('2000-01-07', 10),('2000-01-08', NULL), ('2000-01-09', NULL),('2000-01-10', 30),('2000-01-11', 30),('2000-01-12', 30), ('2000-01-13', 10),('2000-01-14', 20),('2000-01-15', 10),('2000-01-16', NULL), ('2000-01-17', 10); SELECT * FROM dose_change; +-----+---------------------+------+ | oid | dose_date | dose | +-----+---------------------+------+ | 1 | 2000-01-01 00:00:00 | 10 | | 2 | 2000-01-02 00:00:00 | 10 | | 3 | 2000-01-03 00:00:00 | 20 | | 4 | 2000-01-04 00:00:00 | 20 | | 5 | 2000-01-05 00:00:00 | 10 | | 6 | 2000-01-06 00:00:00 | 10 | | 7 | 2000-01-07 00:00:00 | 10 | | 8 | 2000-01-08 00:00:00 | NULL | | 9 | 2000-01-09 00:00:00 | NULL | | 10 | 2000-01-10 00:00:00 | 30 | | 11 | 2000-01-11 00:00:00 | 30 | | 12 | 2000-01-12 00:00:00 | 30 | | 13 | 2000-01-13 00:00:00 | 10 | | 14 | 2000-01-14 00:00:00 | 20 | | 15 | 2000-01-15 00:00:00 | 10 | | 16 | 2000-01-16 00:00:00 | NULL | | 17 | 2000-01-17 00:00:00 | 10 | +-----+---------------------+------+
( SELECT x.dose_date, x.dose, COUNT(*) id FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date GROUP BY x.oid ) AS c ON a.id <= c.id AND c.dose = a.dose LEFT JOIN ( SELECT x.dose_date, x.dose, COUNT(*) id FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date GROUP BY x.oid ) AS d ON c.id = d.id - 1 AND d.dose = c.dose WHERE b.id IS NULL AND c.id IS NOT NULL AND d.id IS NULL GROUP BY start; +---------------------+------------+------+ | start | end | dose | +---------------------+------------+------+ | 2000-01-01 00:00:00 | 2000-01-03 | 10 | | 2000-01-03 00:00:00 | 2000-01-05 | 20 | | 2000-01-05 00:00:00 | 2000-01-08 | 10 | | 2000-01-10 00:00:00 | 2000-01-13 | 30 | | 2000-01-13 00:00:00 | 2000-01-14 | 10 | | 2000-01-14 00:00:00 | 2000-01-15 | 20 | | 2000-01-15 00:00:00 | 2000-01-16 | 10 | | 2000-01-17 00:00:00 | 2000-01-18 | 10 | +---------------------+------------+------+
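Written out in full, the statement row-numbers dose_change four times (the count of rows on or before each date serves as a sequential id) and exclusion-joins the copies exactly as in the unused-ID query above. A sketch along those lines; the start and end expressions are assumptions chosen to match the output shown:
SELECT a.dose_date AS 'start',
       DATE(MIN(c.dose_date)) + INTERVAL 1 DAY AS 'end',   -- assumed: the day after the last day of the run
       a.dose
FROM ( SELECT x.dose_date, x.dose, COUNT(*) id
       FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date
       GROUP BY x.oid ) AS a
LEFT JOIN ( SELECT x.dose_date, x.dose, COUNT(*) id
            FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date
            GROUP BY x.oid ) AS b ON a.id = b.id + 1 AND b.dose = a.dose
LEFT JOIN ( SELECT x.dose_date, x.dose, COUNT(*) id
            FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date
            GROUP BY x.oid ) AS c ON a.id <= c.id AND c.dose = a.dose
LEFT JOIN ( SELECT x.dose_date, x.dose, COUNT(*) id
            FROM dose_change x JOIN dose_change y ON y.dose_date <= x.dose_date
            GROUP BY x.oid ) AS d ON c.id = d.id - 1 AND d.dose = c.dose
WHERE b.id IS NULL AND c.id IS NOT NULL AND d.id IS NULL
GROUP BY a.dose_date, a.dose;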
Suppose you need a count of the sessions in which a user moved from one particular page directly to another, for example from 'A' to 'C'. To find the next hit in a given session, scope on id, order by time, and limit the output to one row, then count the distinct sessions whose next hit meets the page criterion:
SELECT COUNT(DISTINCT h1.id) AS 'Moves from A to C' FROM hits AS h1 WHERE h1.page = 'A' AND 'C' = ( SELECT h2.page FROM hits AS h2 WHERE h2.id = h1.id AND h2.time > h1.time ORDER BY h2.time LIMIT 1 );
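On MySQL 8.0 and later (an assumption beyond this article's vintage), LEAD() can replace the correlated subquery over the same hits(id, page, time) columns:
SELECT COUNT(DISTINCT id) AS 'Moves from A to C'
FROM ( SELECT id, page,
              LEAD(page) OVER (PARTITION BY id ORDER BY time) AS next_page   -- page of the next hit in the same session
       FROM hits ) AS x
WHERE page = 'A' AND next_page = 'C';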
(8, 2, 1);
The requirement is a query which, for every project, reports 'OK' if there is at least one detail row for every project step, and otherwise reports the number of the last sequentially completed step. Here is one way to build such a query:
1. Join t1 to t2 on projectID.
2. Left join t2 to itself on projectID and integer succession.
3. Add a WHERE condition which turns the left self-join into an exclusion join that finds the first missing sequential xid value.
4. To the SELECT list add this item:
IF( a.xid < p.projectsteps, a.xid, 'OK' ) AS CompletionState
so that when the first missing sequential xid is not less than the number of project steps, the query displays 'OK', and otherwise displays the xid value just before the first missing one.
5. Remove dupes with a GROUP BY clause.
SELECT p.projectname,p.projectsteps,a.xid, IF(a.xid < p.projectsteps, a.xid, 'OK') AS CompletionState FROM t1 p JOIN t2 a ON p.id = a.projectID LEFT JOIN t2 AS b ON a.projectID = b.projectID AND a.xid+1 = b.xid WHERE b.xid IS NULL GROUP BY p.projectname; +-------------+--------------+------+-----------------+ | projectname | projectsteps | xid | CompletionState | +-------------+--------------+------+-----------------+ | xx | 3 | 3 | OK | | yy | 3 | 1 | 1 | | zz | 5 | 2 | 2 | +-------------+--------------+------+-----------------+
Winning Streaks
Given a table of IDs and won-lost results, how do we find the longest winning streak?
drop table if exists results; create table results(id int,result char(1)); insert into results values (1,'w'),(2,'l'),(3,'l'),(4,'w'),(5,'w'),(6,'w'),(7,'l'),(8,'w'),(9,'w'); select * from results; +------+--------+ | id | result | +------+--------+ | 1 | w | | 2 | l | | 3 | l | | 4 | w | | 5 | w | | 6 | w | | 7 | l | | 8 | w | | 9 | w | +------+--------+
We can find streaks of two with a left join on a.id=b.id+1. To count streak lengths, initialise a counter to 0, then increment it each time a row's result repeats the previous row's result, resetting it to 1 when it does not:
set @count=0; select a.id, a.result, b.result, @count := IF(a.result = b.result, @count + 1, 1) as Streak from results a left join results b on a.id = b.id + 1 where a.result = 'w';
That solution is from a response by Jon Roshko to a question by Ed Ball on the MySQL Newbie Forum. Scott Noyes points out that our query pattern for sequence starts and ends also works for winning streaks:
SELECT MIN( c.id ) - a.id + 1 as LongestStreak FROM results AS a LEFT JOIN results AS b ON a.id = b.id + 1 AND b.result = 'w'
LEFT JOIN results AS c ON a.id <= c.id AND c.result = 'w' LEFT JOIN results AS d ON c.id = d.id - 1 AND d.result = 'w' WHERE a.result = 'w' AND b.id IS NULL AND c.id IS NOT NULL AND d.id IS NULL GROUP BY a.id ORDER BY LongestStreak DESC LIMIT 1;
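On MySQL 8.0 and later, the difference-of-row-numbers trick (gaps and islands) finds the longest streak without a self-join; a sketch, assuming only the newer server version:
SELECT COUNT(*) AS LongestStreak
FROM ( SELECT id, result,
              ROW_NUMBER() OVER (ORDER BY id)
            - ROW_NUMBER() OVER (PARTITION BY result ORDER BY id) AS grp   -- constant within each unbroken run
       FROM results ) AS x
WHERE result = 'w'
GROUP BY grp
ORDER BY LongestStreak DESC
LIMIT 1;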
Great circle distance
We need to convert degrees of latitude and longitude to radians, and we need the length in km of one radian of arc on the earth's surface, which is the earth's radius, 6378.388 km. The function:
set log_bin_trust_function_creators=TRUE; DROP FUNCTION IF EXISTS GeoDistKM; DELIMITER | CREATE FUNCTION GeoDistKM( lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT ) RETURNS float BEGIN DECLARE pi, q1, q2, q3 FLOAT; DECLARE rads FLOAT DEFAULT 0; SET pi = PI(); SET lat1 = lat1 * pi / 180; SET lon1 = lon1 * pi / 180; SET lat2 = lat2 * pi / 180; SET lon2 = lon2 * pi / 180; SET q1 = COS(lon1-lon2); SET q2 = COS(lat1-lat2); SET q3 = COS(lat1+lat2); SET rads = ACOS( 0.5*((1.0+q1)*q2 - (1.0-q1)*q3) ); RETURN 6378.388 * rads; END; | DELIMITER ; -- toronto to montreal (505km): select geodistkm(43.6667,-79.4167,45.5000,-73.5833); +----------------------------------------------+ | geodistkm(43.6667,-79.4167,45.5000,-73.5833) | +----------------------------------------------+ | 505.38836669921875 | +----------------------------------------------+
(Setting log_bin_trust_function_creators is the most convenient way to work around the binary-logging determinism requirements introduced in MySQL 5.0.6.)
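Once defined, the function drops into ordinary queries; for example, against a hypothetical places(name, lat, lon) table, the ten places nearest Toronto:
SELECT name, GeoDistKM( 43.6667, -79.4167, lat, lon ) AS km
FROM places
ORDER BY km
LIMIT 10;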
Moving average
Given a table of dates and daily values, retrieve their moving 5-day average:
DROP TABLE IF EXISTS t; CREATE TABLE t (dt DATE, qty INT); INSERT INTO t VALUES ('2007-1-1',5), ('2007-1-2',6), ('2007-1-3',7), ('2007-1-4',8), ('2007-1-5',9), ('2007-1-6',10), ('2007-1-7',11), ('2007-1-8',12), ('2007-1-9',13); SELECT t1.dt, ( SELECT SUM(t2.qty) / COUNT(t2.qty) FROM t AS t2 WHERE DATEDIFF(t1.dt, t2.dt) BETWEEN 0 AND 4 ) AS '5dayMovingAvg' FROM t AS t1 ORDER BY t1.dt;
To do the same per item, add an item column, scope the correlated subquery to it, and group by item and date:
DROP TABLE IF EXISTS t; CREATE TABLE t (item INT, dt DATE, qty INT); INSERT INTO t VALUES (1,'2007-1-1',5), (1,'2007-1-2',6), (1,'2007-1-3',7), (1,'2007-1-4',8), (1,'2007-1-5',9), (1,'2007-1-6',10), (1,'2007-1-7',11), (1,'2007-1-8',12), (1,'2007-1-9',13), (2,'2007-1-1',6), (2,'2007-1-2',7), (2,'2007-1-3',8), (2,'2007-1-4',9), (2,'2007-1-5',10), (2,'2007-1-6',11), (2,'2007-1-7',12), (2,'2007-1-8',13), (2,'2007-1-9',14); SELECT t1.item,t1.dt, ( SELECT SUM(t2.qty) / COUNT(t2.qty) FROM t AS t2 WHERE item=t1.item AND DATEDIFF(t1.dt, t2.dt) BETWEEN 0 AND 4 ) AS '5dayMovingAvg' FROM t AS t1 GROUP BY t1.item,t1.dt; +------+------------+---------------+ | item | dt | 5dayMovingAvg | +------+------------+---------------+ | 1 | 2007-01-01 | 5.0000 | | 1 | 2007-01-02 | 5.5000 | | 1 | 2007-01-03 | 6.0000 | | 1 | 2007-01-04 | 6.5000 | | 1 | 2007-01-05 | 7.0000 | | 1 | 2007-01-06 | 8.0000 | | 1 | 2007-01-07 | 9.0000 | | 1 | 2007-01-08 | 10.0000 | | 1 | 2007-01-09 | 11.0000 | | 2 | 2007-01-01 | 6.0000 | | 2 | 2007-01-02 | 6.5000 | | 2 | 2007-01-03 | 7.0000 | | 2 | 2007-01-04 | 7.5000 | | 2 | 2007-01-05 | 8.0000 | | 2 | 2007-01-06 | 9.0000 | | 2 | 2007-01-07 | 10.0000 | | 2 | 2007-01-08 | 11.0000 | | 2 | 2007-01-09 | 12.0000 | +------+------------+---------------+
or more simply...
SELECT t1.item,t1.dt, AVG(t2.qty) AS 5DayAvg FROM t t1 JOIN t t2 ON t1.item=t2.item AND DATEDIFF(t1.dt, t2.dt) BETWEEN 0 AND 4 GROUP BY t1.item,t1.dt;
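On MySQL 8.0 and later, a window frame expresses the same calendar-based 5-day average directly; a sketch, assuming only the newer server version:
SELECT item, dt, qty,
       AVG(qty) OVER ( PARTITION BY item ORDER BY dt
                       RANGE BETWEEN INTERVAL 4 DAY PRECEDING AND CURRENT ROW ) AS '5dayMovingAvg'
FROM t
ORDER BY item, dt;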
Percentiles
In the Sakila table film, retrieve a top-down percentile ranking of film lengths:
SELECT a.film_id , ROUND( 100.0 * ( SELECT COUNT(*) FROM film AS b WHERE b.length <= a.length ) / total.cnt, 1 ) AS percentile FROM film a CROSS JOIN ( SELECT COUNT(*) AS cnt FROM film ) AS total ORDER BY percentile DESC;
If there are NULLs, filter them out before computing percentiles. On his blog, Roland Bouman shows a much faster query; here is a version retrieving the first film at or above the 90th percentile:
SELECT
SUM(g1.r) sr, g2.length l, SUM(g1.r)/(SELECT COUNT(*) FROM film) p FROM ( SELECT COUNT(*) r, length FROM film GROUP BY length ) g1 JOIN ( SELECT COUNT(*) r, length FROM film GROUP BY length ) g2 ON g1.length < g2.length GROUP BY g2.length HAVING p >= 0.9 ORDER BY p LIMIT 1
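On MySQL 8.0 and later, CUME_DIST() returns the same cumulative fraction (the share of films of this length or shorter) in a single pass; a sketch:
SELECT film_id, length,
       ROUND( 100 * CUME_DIST() OVER (ORDER BY length), 1 ) AS percentile
FROM film
ORDER BY percentile DESC;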
Last updated 07 Oct 2009
Chequebook balancing programs often use this running-total pattern. This one tracks the running balance and how much money from the most recent deposit remains:
DROP TABLE IF EXISTS chequebook; CREATE TABLE chequebook ( entry_date timestamp default now() PRIMARY KEY, entry_item varchar(48) NOT NULL DEFAULT '', entry_amount decimal(10,2) NOT NULL DEFAULT 0.00 ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO chequebook (entry_date,entry_item,entry_amount) VALUES ('2010-01-02 12:34:00','Deposit A',215.56), ('2010-01-02 21:44:00','Withdrawal A' ,-23.34), ('2010-01-03 10:44:00','Withdrawal B',-150.15), ('2010-01-03 15:44:00','Deposit B',154.67), ('2010-01-04 18:44:00','Withdrawal C',-65.09), ('2010-01-05 08:44:00','Withdrawal D',-74.23), ('2010-01-06 14:44:00','Deposit C',325.12), ('2010-01-06 20:44:00','Withdrawal E',-80.12), ('2010-01-07 04:44:00','Withdrawal F',-110.34), ('2010-01-07 16:44:00','Withdrawal G',-150.25), ('2010-01-08 16:44:00','Withdrawal H',-23.90), ('2010-01-08 21:44:00','Withdrawal I',-75.66), ('2010-01-08 22:44:00','Deposit C',275.78), ('2010-01-09 11:44:00','Withdrawal K',-85.99), ('2010-01-09 21:44:00','Withdrawal J',-100.00); set @depos=0; set @total=0; select entry_date, entry_item, entry_amount, if( entry_amount>0, @depos:=entry_amount, @depos:=@depos+entry_amount ) as depos_bal, @total:=@total+entry_amount as net_bal from chequebook order by entry_date; +---------------------+--------------+--------------+-----------+---------+ | entry_date | entry_item | entry_amount | depos_bal | net_bal | +---------------------+--------------+--------------+-----------+---------+ | 2010-01-02 12:34:00 | Deposit A | 215.56 | 215.56 | 215.56 | | 2010-01-02 21:44:00 | Withdrawal A | -23.34 | 192.22 | 192.22 | | 2010-01-03 10:44:00 | Withdrawal B | -150.15 | 42.07 | 42.07 | | 2010-01-03 15:44:00 | Deposit B | 154.67 | 154.67 | 196.74 | | 2010-01-04 18:44:00 | Withdrawal C | -65.09 | 89.58 | 131.65 | | 2010-01-05 08:44:00 | Withdrawal D | -74.23 | 15.35 | 57.42 | | 2010-01-06 14:44:00 | Deposit C | 325.12 | 325.12 | 382.54 | | 2010-01-06 20:44:00 | Withdrawal E | -80.12 | 245.00 | 302.42 | | 2010-01-07 04:44:00 | Withdrawal F | -110.34 | 134.66 | 192.08 | | 2010-01-07 16:44:00 | Withdrawal G | -150.25 | -15.59 | 41.83 | | 2010-01-08 16:44:00 | Withdrawal H | -23.90 | -39.49 | 17.93 | | 2010-01-08 21:44:00 | Withdrawal I | -75.66 | -115.15 | -57.73 | | 2010-01-08 22:44:00 | Deposit C | 275.78 | 275.78 | 218.05 | | 2010-01-09 11:44:00 | Withdrawal K | -85.99 | 189.79 | 132.06 |
If your platform does not permit multiple queries per connection, and if you can tolerate the O(N²) inefficiency of a self-join, a self-join does the same job as an accumulating user variable:
SELECT c.id, c.value, d.RunningSum FROM tbl c JOIN ( SELECT a.id, SUM(b.value) AS RunningSum FROM tbl a LEFT JOIN tbl b ON b.id <= a.id GROUP BY a.id ) d USING (id);
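On MySQL 8.0 and later, a window function avoids both the user variable and the self-join; a sketch over the same tbl(id, value):
SELECT id, value, SUM(value) OVER (ORDER BY id) AS RunningSum
FROM tbl;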
Last updated 06 Jan 2011
Top ten
We often want to know the top 1, 2, 10 or whatever values from a query. This is dead simple in MySQL. However many JOINs and WHEREs the query has, simply ORDER BY the column(s) whose highest values are sought, and LIMIT the resultset:
SELECT (somecolumn), (othercolumns) ... FROM (some tables) ... ORDER BY somecolumn DESC LIMIT 10;
but a single UPDATE with a correlated subquery does exactly the same job at much less cost:
UPDATE photos SET tally = ( SELECT COUNT(*) FROM votes WHERE votes.photoid = photos.id );
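If you prefer a literal join, the same update can be written against a grouped derived table; a sketch using the photos and votes columns above:
UPDATE photos AS p
JOIN ( SELECT photoid, COUNT(*) AS cnt
       FROM votes
       GROUP BY photoid ) AS v ON v.photoid = p.id
SET p.tally = v.cnt;
Unlike the correlated form, the inner join leaves photos with no votes untouched; a LEFT JOIN with IFNULL(v.cnt,0) zeroes them instead.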
Before you burden your app with a cursor, see if you can simplify the processing to a straightforward join.
Emulate sp_exec
Sometimes it is desirable to call multiple stored procedures in one command. In SQL Server this can be done with sp_exec. In MySQL we can easily write such an sproc that calls as many sprocs as we please, for example...
USE sys;
DROP PROCEDURE IF EXISTS sp_exec;
DELIMITER |
CREATE PROCEDURE sp_exec( p1 CHAR(64), p2 CHAR(64) )
BEGIN
  -- permit doublequotes to delimit data
  SET @sqlmode=(SELECT @@sql_mode);
  SET @@sql_mode='';
  SET @sql = CONCAT( "CALL ", p1 );
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
  SET @sql = CONCAT( "CALL ", p2 );
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
  SET @@sql_mode=@sqlmode;
END;
|
DELIMITER ;
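Calling it is then a matter of passing the two procedure invocations as strings; the procedure names here are placeholders:
CALL sp_exec( 'report_daily()', 'report_weekly(2011)' );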
END; | DELIMITER ;
but so long as it has those elements, it can be as complex as you like. When you call the sproc:
1. Quote each argument with a pair of single quotes,
2. Separate these quoted arguments with commas,
3. Surround the whole param string with another set of single quotes:
CALL passInParam( 'SELECT * FROM tbl WHERE colval IN (', ('''abc'',''def'',''ghi''' ));
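The body of such a procedure need only concatenate the query fragment, the parameter list and a closing parenthesis into a prepared statement; a minimal sketch consistent with the calling convention above, not the original listing:
DROP PROCEDURE IF EXISTS passInParam;
DELIMITER |
CREATE PROCEDURE passInParam( qry VARCHAR(1024), param VARCHAR(1024) )
BEGIN
  -- builds e.g. SELECT * FROM tbl WHERE colval IN ('abc','def','ghi')
  SET @sql = CONCAT( qry, param, ')' );
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
END;
|
DELIMITER ;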
Count substrings
To count instances of a search string in a target string: in the target string, replace every occurrence of the search string with the empty string, subtract the length of the result from the length of the original target string, and divide that difference by the length of the search string:
SET @str = "The quick brown fox jumped over the lazy dog"; SET @find = "the"; SELECT ROUND(((LENGTH(@str) - LENGTH(REPLACE(LCASE(@str), @find, '')))/LENGTH(@find)),0) AS COUNT; +-------+ | COUNT | +-------+ | 2 | +-------+
Note that REPLACE() does a case-sensitive search; to get a case-insensitive result you must coerce target and search strings to one case. To remove decimals from the result:
SELECT CAST((LENGTH(@str) - LENGTH(REPLACE(LCASE(@str), @find, ''))) / LENGTH(@find) AS SIGNED) AS COUNT;
Levenshtein distance
The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR; -- max strlen=255 DECLARE cv0, cv1 VARBINARY(256); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len DO SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = c_temp; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; IF c > c_temp THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; END IF; RETURN c; END;
Helper function:
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, max_len INT; SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF; RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); END;
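A quick sanity check, assuming both functions above have been created (in the mysql client, set a non-default DELIMITER around each CREATE FUNCTION first); the classic kitten/sitting pair has a distance of 3:
SELECT levenshtein( 'kitten', 'sitting' ) AS distance,        -- expect 3
       levenshtein_ratio( 'kitten', 'sitting' ) AS similarity; -- expect 57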
Proper case
The basic idea is: lower-case the string, then upper-case the first character if it is a-z, and any other a-z character that follows a punctuation character. Here is the function. To make it work with strings longer than 128 characters, change its input and return declarations accordingly:
DROP FUNCTION IF EXISTS proper;
SET GLOBAL log_bin_trust_function_creators=TRUE;
DELIMITER |
CREATE FUNCTION proper( str VARCHAR(128) ) RETURNS VARCHAR(128)
BEGIN
  DECLARE c CHAR(1);
  DECLARE s VARCHAR(128);
  DECLARE i INT DEFAULT 1;
  DECLARE bool INT DEFAULT 1;
  DECLARE punct CHAR(18) DEFAULT ' ()[]{},.-_!@;:?/'; -- David Rabby added \\'
  SET s = LCASE( str );
  WHILE i <= LENGTH( str ) DO -- Jesse Palmer corrected from < to <= for last char
    BEGIN
      SET c = SUBSTRING( s, i, 1 );
      IF LOCATE( c, punct ) > 0 THEN
        SET bool = 1;
      ELSEIF bool=1 THEN
        BEGIN
          IF c >= 'a' AND c <= 'z' THEN
            BEGIN
              SET s = CONCAT(LEFT(s,i-1),UCASE(c),SUBSTRING(s,i+1));
              SET bool = 0;
            END;
          ELSEIF c >= '0' AND c <= '9' THEN
            SET bool = 0;
          END IF;
        END;
      END IF;
      SET i = i+1;
    END;
  END WHILE;
  RETURN s;
END;
|
DELIMITER ;
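For example (a made-up name, assuming the function above is in place):
SELECT proper( 'jane de la MARE-smith' ) AS name;   -- returns 'Jane De La Mare-Smith'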
But there are always exceptions, for example some guy with that name will want it spelled "d'Arcy".
Last updated 10 Nov 2009
Retrieve octets from IP addresses
Hamilton Turner notes we can find the first octet with LEFT(ip,LOCATE('.',ip)-1).
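SUBSTRING_INDEX() offers another route; a sketch that splits a dotted-quad string into all four octets:
SET @ip = '192.168.1.10';
SELECT SUBSTRING_INDEX( @ip, '.', 1 )                             AS octet1,
       SUBSTRING_INDEX( SUBSTRING_INDEX( @ip, '.', 2 ), '.', -1 ) AS octet2,
       SUBSTRING_INDEX( SUBSTRING_INDEX( @ip, '.', 3 ), '.', -1 ) AS octet3,
       SUBSTRING_INDEX( @ip, '.', -1 )                            AS octet4;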