Bookmark and Share

Optimizing SQL that selects the max/min/etc from a group.

Posted: Thursday, January 8th, 2009 at 7:42 amUpdated: Friday, January 30th, 2009 at 10:12 pm

How would you like to optimize a 10 minute query down to 24 seconds? This was an optimization I had to do at work and I thought I’d share this with anyone. Even though I was using MySQL as an example, I think the technique should work on any database. Let’s say you have a table to hold items and another table that holds the logs (could be logs for inventory count at a given time, or item status, etc). Let’s say you want the latest status of an item. For our example, the schema and data for the tables are like below:

optimize_max_min_sql_schema

optimize_max_min_item_tableoptimize_max_min_inventory_table

To get the latest inventory count of an item, what may comes naturally to you is to write query something like this:

SELECT
   item.item_id, item.item_name, inv.inventory_count
FROM
   item
   JOIN inventory inv ON item.item_id = inv.item_id
      AND inv.inventory_date IN (
         SELECT
            MAX(inv2.inventory_date)
         FROM
            inventory inv2 WHERE inv2.item_id = inv.item_id
      )

-- If you do explain on query above:
-- id select_type         table   type    possible_keys  key     key_len  ref       rows  Extra
-- --  ------------------  ------  ------  -------------  ------  -------  ------  ------  -----------
-- 1  PRIMARY             inv     ALL     (NULL)         (NULL)  (NULL)   (NULL)      16  Using where
-- 1  PRIMARY             item    ALL     PRIMARY        (NULL)  (NULL)   (NULL)       4  Using where
-- 2  DEPENDENT SUBQUERY  inv2    ALL     (NULL)         (NULL)  (NULL)   (NULL)      16  Using where

Here’s the result of running the query above:

original_result

I feel that the above code is how we naturally think; for each items, get the latest inventory and select it. However, if you see the sub query where we select from inventory table the 2nd time (line 7 – 10), it is dependent on each item_id (see line 18 above about DEPENDENT SUBQUERY on MySQL explain). What’s happening is that the database server is now forced to run the sub query once for every item records found. If you have only a few hundreds items, then the performance may not differ that much.

So how to optimize this query? Well, we know the bottle neck. We need to somehow think of a way to rewrite the query so that database server doesn’t need to execute sub query once per item records. One way to do that is since we know we want the latest inventory count, we can generate inventory table that has only the latest data. Below is one way to do it:

SELECT
   MAX(inventory_date) AS inventory_date, item_id
FROM
   inventory
GROUP BY
   item_id

On the code above, we can’t get the inventory count because if we do, we’ll have to group by inventory_count as well, which defeats the whole purpose of getting the maximum date. (Try it and you’ll get the whole table). Since we can’t get the inventory_count, we’ll have to join inventory table again. So the query looks like below:

SELECT
   *
FROM
   inventory inv
   JOIN (
      SELECT
         MAX(inventory_date) AS inventory_date, item_id
      FROM
         inventory
      GROUP BY
         item_id
   ) AS inv2 ON inv.item_id = inv2.item_id
      AND inv.inventory_date = inv2.inventory_date

But wait a minute … we don’t have item_name yet … well you can just add another join to item table. You can do it on the outer SQL or on the inner SQL. I think performance wise, it’s about the same since the end result is you’re joining 3 tables. I choose to join with the inner query. So this is what we end up with:

SELECT
   inv2.item_id, inv2.item_name, inv.inventory_count
FROM
   inventory inv
   JOIN (
      SELECT
         MAX(i.inventory_date) AS inventory_date, i.item_id, item.item_name
      FROM
         inventory i
         JOIN item ON item.item_id = i.item_id
      GROUP BY
         item_id
   ) AS inv2 ON inv.item_id = inv2.item_id
      AND inv.inventory_date = inv2.inventory_date

-- If you do explain on query above: 
--  id  select_type  table       type    possible_keys  key      key_len  ref                                 rows  Extra                                       
------  -----------  ----------  ------  -------------  -------  -------  --------------------------------  ------  --------------------------------------------
--   1  PRIMARY      <derived2>  ALL     (NULL)         (NULL)   (NULL)   (NULL)                                 5                                              
--   1  PRIMARY      inv         eq_ref  PRIMARY        PRIMARY  7        inv2.inventory_date,inv2.item_id       1                                              
--   2  DERIVED      i           index   (NULL)         PRIMARY  7        (NULL)                                16  Using index; Using temporary; Using filesort
--   2  DERIVED      item        ALL     PRIMARY        (NULL)   (NULL)   (NULL)                                 4  Using where                                 

At work, I optimized a query in similar situation before for table about 180K rows and about 500K inventory rows. Using sub query like our original case, the query runs for about 10 minutes. After I did the optimization like above, the query ran for 24 seconds. I hope this article helps you. As always, I welcome comments / questions / critics that will help me and other readers understand better.

3 Responses to “Optimizing SQL that selects the max/min/etc from a group.”

  1. Robert Says:

    Hi Maresa
    First of all thanks for your post. I’m a newbie with SQL. I know a bit more building MSAccess Queries. I wanted to give it a try and learn Subqueries and some SQL.

    I use your example to try to replicate my case. It is a single table where I have information of contracts by supplier. In that table a material-supplier can have more than one entry (contracts past and current). I just want to pick the most recent date. In a sense this is similar to your example.

    What puzzles me is that I created the first step like in your case (it worked); even I saved those instructions as a MSAccess query (I called it “Query4”). Here are the instructions:
    SELECT
    MAX(SourceListIntl.BDATU) AS ContractDate, SourceListIntl.MATNR, SourceListIntl.WERKS
    FROM
    SourceListIntl
    GROUP BY
    SourceListIntl.MATNR, SourceListIntl.WERKS;

    Then I wrote the instructions as per your second SQL query, but MSAccess returns an error message (Syntax error in JOIN operation). Here is the code:

    SELECT SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN
    FROM
    SourceListInt INNER JOIN (SELECT MAX(SourceListIntl.BDATU) AS ContractDate, SourceListIntl.MATNR, SourceListIntl.WERKS
    FROM
    SourceListIntl
    GROUP BY SourceListIntl.MATNR, SourceListIntl.WERKS
    ) AS LastDate
    ON (SourceListIntl.MATNR= LastDate.MATNR) AND (SourceListIntl.WERKS= LastDate.WERKS) AND (SourceListIntl.BDATU= LastDate.ContractDate);
    This does not works :-(

    BUT If I substitute the SELECT MAX … instructions by Query4 it works !!! Like this:

    SELECT SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN
    FROM SourceListIntl INNER JOIN Query4 ON (SourceListIntl.WERKS=Query4.WERKS) AND (SourceListIntl.MATNR=Query4.MATNR) AND (SourceListIntl.BDATU=Query4.ContractDate)
    GROUP BY SourceListIntl.MATNR, SourceListIntl.WERKS, SourceListIntl.BDATU, SourceListIntl.LIFNR, SourceListIntl.EBELN;

    This last set of instructions works but I would like it to work like yours

    Any comments will be appreciated

    Thanks in advance

    Robert

  2. Maresa Says:

    Hi Robert … Just out of curiosity, when you say it doesn’t work, does it give error message? I noticed a typo on your 2nd query above. It’s near “SourceListInt INNER JOIN (SELECT MAX”. Notice that your table name is missing an l (el). Looks like there’s a typo there.

  3. Karcy Says:

    You make tnhigs so clear. Thanks for taking the time!

Leave a Reply