MySQL Hash Indexes for Optimization

Hituptony picture Hituptony · Jan 10, 2014 · Viewed 14.6k times · Source

So maybe this is noob, but I'm messing with a couple tables.

I have TABLE A roughly 45,000 records

I have TABLE B roughly 1.5 million records

I have a query:

update
    schema1.tablea a
    inner join (
        SELECT DISTINCT
            ID, Lookup,
            IDpart1, IDpart2
        FROM
            schema1.tableb
        WHERE
            IDpart1 is not NULL
        AND
            Lookup is not NULL
        ORDER BY
            ID,Lookup
    ) b Using(ID,Lookup)
set 
    a.Elg_IDpart1 = b.IDpart1, 
    a.Elg_IDpart2 = b.IDpart2
where
    a.ID is NOT NULL
AND
    a.Elg_IDpart1 is NULL

So I am forcing the index on ID, Lookup. Each table does have a index on those columns as well but because of the sub-query I forced it.

It is taking FOR-EVER to run, and it really should take, i'd imagine under 5 minutes...

My questions are in regards to the indexes, not the query.

I know that you can't use hash index in ordered index.

I currently have indexes on both ID, Lookup sperately, and as one index, and it is a B-Tree index. Based on my WHERE Clause, does a hash index fit for as an optimization technique??

Can I have a single hash index, and the rest of the indexes b B-tree index?

This is not a primary key field.

I would post my explain but i changed the name on these tables. Basically it is using the index only for ID...instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?

Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing? The Lookup field maps the first and second part of the ID...

Any help or insight on this is appreciated.


 UPDATE

An EXPLAIN on the UPDATE after I took out sub-query.

+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
| id | select_type | table | type |        possible_keys        |     key      | key_len |        ref        | rows  |    Extra    |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
|  1 | SIMPLE      | m     | ALL  | Lookup_Idx,ID_Idx,ID_Lookup |              |         |                   | 44023 | Using where |
|  1 | SIMPLE      | c     | ref  | ID_LookupIdx                | ID_LookupIdx |       5 | schema1.tableb.ID |     4 | Using where |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+

tablea relevant indexes:

  • ID_LookupIdx (ID, Lookup)

tableb relevant indexes:

  • ID (ID)
  • Lookup_Idx (Lookup)
  • ID_Lookup_Idx (ID, Lookup)

All of the indexes are normal B-trees.

Answer

eggyal picture eggyal · Jan 10, 2014

Firstly, to deal with the specific questions that you raise:

  1. I currently have indexes on both ID, Lookup sperately, and as one index, and it is a B-Tree index. Based on my WHERE Clause, does a hash index fit for as an optimization technique??

    As documented under CREATE INDEX Syntax:

    +----------------+--------------------------------+
    | Storage Engine |    Permissible Index Types     |
    +----------------+--------------------------------+
    | MyISAM         | BTREE                          |
    | InnoDB         | BTREE                          |
    | MEMORY/HEAP    | HASH, BTREE                    |
    | NDB            | BTREE, HASH (see note in text) |
    +----------------+--------------------------------+
    

    Therefore, before even considering HASH indexing, one should be aware that it is only available in the MEMORY and NDB storage engines: so may not even be an option to you.

    Furthermore, be aware that indexes on combinations of ID and Lookup alone may not be optimal, as your WHERE predicate also filters on tablea.Elg_IDpart1 and tableb.IDpart1—you may benefit from indexing on those columns too.

  2. Can I have a single hash index, and the rest of the indexes b B-tree index?

    Provided that the desired index types are supported by the storage engine, you can mix them as you see fit.

  3. instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?

    You could use an index hint to force MySQL to use different indexes to those that the optimiser would otherwise have selected.

  4. Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?

    It is usually smart enough, but not always. In this case, however, it has probably determined that the cardinality of the indexes is such that it is better to use those that it has chosen.


Now, depending on the version of MySQL that you are using, tables derived from subqueries may not have any indexes upon them that can be used for further processing: consequently the join with b may require a full scan of that derived table (there's insufficient information in your question to determine exactly how much of a problem this might be, but schema1.tableb having 1.5 million records suggests it could be a significant factor).

See Subquery Optimization for more information.

One should therefore try to avoid using derived tables if at all possible. In this case, there does not appear to be any purpose to your derived table as one could simply join schema1.tablea and schema1.tableb directly:

UPDATE   schema1.tablea a
    JOIN schema1.tableb b USING (ID, Lookup)
SET      a.Elg_IDpart1 = b.IDpart1, 
         a.Elg_IDpart2 = b.IDpart2
WHERE    a.Elg_IDpart1 IS     NULL
     AND a.ID          IS NOT NULL
     AND b.IDpart1     IS NOT NULL
     AND b.Lookup      IS NOT NULL
ORDER BY ID, Lookup

The only thing that has been lost is the filter for DISTINCT records, but duplicate records will simply (attempt to) overwrite updated values with those same values again—which will have no effect, but may have proved very costly (especially with so many records in that table).

The use of ORDER BY in the derived table was pointless as it could not be relied upon to achieve any particular order to the UPDATE, whereas in this revised version it will ensure that any updates that overwrite previous ones take place in the specified order: but is that necessary? Perhaps it can be removed and save on any sorting operation.

One should check the predicates in the WHERE clause: are they all necessary (the NOT NULL checks on a.ID and b.Lookup, for example, are superfluous given that any such NULL records will be eliminated by the JOIN predicate)?

Altogether, this leaves us with:

UPDATE   schema1.tablea a
    JOIN schema1.tableb b USING (ID, Lookup)
SET      a.Elg_IDpart1 = b.IDpart1, 
         a.Elg_IDpart2 = b.IDpart2
WHERE    a.Elg_IDpart1 IS     NULL
     AND b.IDpart1     IS NOT NULL

Only if performance is still unsatisfactory should one look further at the indexing. Are relevant columns (i.e. those used in the JOIN and WHERE predicates) indexed? Are the indexes being selected for use by MySQL (bear in mind that it can only use one index per table for lookups: for testing both the JOIN predicate and the filter predicates: perhaps you need an appropriate composite index)? Check the query execution plan by using EXPLAIN to investigate such issues further.