How would you like to optimize a 10 minute query down to 24 seconds? This was an optimization I had to do at work and I thought I’d share this with anyone. Even though I was using MySQL as an example, I think the technique should work on any database. Let’s say you have a table to hold items and another table that holds the logs (could be logs for inventory count at a given time, or item status, etc). Let’s say you want the latest status of an item. For our example, the schema and data for the tables are like below:
To get the latest inventory count of an item, what may comes naturally to you is to write query something like this:
SELECT item.item_id, item.item_name, inv.inventory_count FROM item JOIN inventory inv ON item.item_id = inv.item_id AND inv.inventory_date IN ( SELECT MAX(inv2.inventory_date) FROM inventory inv2 WHERE inv2.item_id = inv.item_id ) -- If you do explain on query above: -- id select_type table type possible_keys key key_len ref rows Extra -- -- ------------------ ------ ------ ------------- ------ ------- ------ ------ ----------- -- 1 PRIMARY inv ALL (NULL) (NULL) (NULL) (NULL) 16 Using where -- 1 PRIMARY item ALL PRIMARY (NULL) (NULL) (NULL) 4 Using where -- 2 DEPENDENT SUBQUERY inv2 ALL (NULL) (NULL) (NULL) (NULL) 16 Using where
Here’s the result of running the query above:
I feel that the above code is how we naturally think; for each items, get the latest inventory and select it. However, if you see the sub query where we select from inventory table the 2nd time (line 7 – 10), it is dependent on each item_id (see line 18 above about DEPENDENT SUBQUERY on MySQL explain). What’s happening is that the database server is now forced to run the sub query once for every item records found. If you have only a few hundreds items, then the performance may not differ that much.
So how to optimize this query? Well, we know the bottle neck. We need to somehow think of a way to rewrite the query so that database server doesn’t need to execute sub query once per item records. One way to do that is since we know we want the latest inventory count, we can generate inventory table that has only the latest data. Below is one way to do it:
SELECT MAX(inventory_date) AS inventory_date, item_id FROM inventory GROUP BY item_id
On the code above, we can’t get the inventory count because if we do, we’ll have to group by inventory_count as well, which defeats the whole purpose of getting the maximum date. (Try it and you’ll get the whole table). Since we can’t get the inventory_count, we’ll have to join inventory table again. So the query looks like below:
SELECT * FROM inventory inv JOIN ( SELECT MAX(inventory_date) AS inventory_date, item_id FROM inventory GROUP BY item_id ) AS inv2 ON inv.item_id = inv2.item_id AND inv.inventory_date = inv2.inventory_date
But wait a minute … we don’t have item_name yet … well you can just add another join to item table. You can do it on the outer SQL or on the inner SQL. I think performance wise, it’s about the same since the end result is you’re joining 3 tables. I choose to join with the inner query. So this is what we end up with:
SELECT inv2.item_id, inv2.item_name, inv.inventory_count FROM inventory inv JOIN ( SELECT MAX(i.inventory_date) AS inventory_date, i.item_id, item.item_name FROM inventory i JOIN item ON item.item_id = i.item_id GROUP BY item_id ) AS inv2 ON inv.item_id = inv2.item_id AND inv.inventory_date = inv2.inventory_date -- If you do explain on query above: -- id select_type table type possible_keys key key_len ref rows Extra ------ ----------- ---------- ------ ------------- ------- ------- -------------------------------- ------ -------------------------------------------- -- 1 PRIMARY <derived2> ALL (NULL) (NULL) (NULL) (NULL) 5 -- 1 PRIMARY inv eq_ref PRIMARY PRIMARY 7 inv2.inventory_date,inv2.item_id 1 -- 2 DERIVED i index (NULL) PRIMARY 7 (NULL) 16 Using index; Using temporary; Using filesort -- 2 DERIVED item ALL PRIMARY (NULL) (NULL) (NULL) 4 Using where
At work, I optimized a query in similar situation before for table about 180K rows and about 500K inventory rows. Using sub query like our original case, the query runs for about 10 minutes. After I did the optimization like above, the query ran for 24 seconds. I hope this article helps you. As always, I welcome comments / questions / critics that will help me and other readers understand better.
Hi Maresa
First of all thanks for your post. I’m a newbie with SQL. I know a bit more building MSAccess Queries. I wanted to give it a try and learn Subqueries and some SQL.
I use your example to try to replicate my case. It is a single table where I have information of contracts by supplier. In that table a material-supplier can have more than one entry (contracts past and current). I just want to pick the most recent date. In a sense this is similar to your example.
What puzzles me is that I created the first step like in your case (it worked); even I saved those instructions as a MSAccess query (I called it “Query4”). Here are the instructions:
SELECT
MAX(SourceListIntl.BDATU) AS ContractDate, SourceListIntl.MATNR, SourceListIntl.WERKS
FROM
SourceListIntl
GROUP BY
SourceListIntl.MATNR, SourceListIntl.WERKS;
Then I wrote the instructions as per your second SQL query, but MSAccess returns an error message (Syntax error in JOIN operation). Here is the code:
SELECT SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN
FROM
SourceListInt INNER JOIN (SELECT MAX(SourceListIntl.BDATU) AS ContractDate, SourceListIntl.MATNR, SourceListIntl.WERKS
FROM
SourceListIntl
GROUP BY SourceListIntl.MATNR, SourceListIntl.WERKS
) AS LastDate
ON (SourceListIntl.MATNR= LastDate.MATNR) AND (SourceListIntl.WERKS= LastDate.WERKS) AND (SourceListIntl.BDATU= LastDate.ContractDate);
This does not works 🙁
BUT If I substitute the SELECT MAX … instructions by Query4 it works !!! Like this:
SELECT SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN
FROM SourceListIntl INNER JOIN Query4 ON (SourceListIntl.WERKS=Query4.WERKS) AND (SourceListIntl.MATNR=Query4.MATNR) AND (SourceListIntl.BDATU=Query4.ContractDate)
GROUP BY SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN;
This last set of instructions works but I would like it to work like yours
Any comments will be appreciated
Thanks in advance
Robert
Hi Robert … Just out of curiosity, when you say it doesn’t work, does it give error message? I noticed a typo on your 2nd query above. It’s near “SourceListInt INNER JOIN (SELECT MAX”. Notice that your table name is missing an l (el). Looks like there’s a typo there.
You make tnhigs so clear. Thanks for taking the time!