So maybe this is noob, but I'm messing with a couple tables.
I have TABLE A roughly 45,000 records
I have TABLE B roughly 1.5 million records
I have a query:
update
schema1.tablea a
inner join (
SELECT DISTINCT
ID, Lookup,
IDpart1, IDpart2
FROM
schema1.tableb
WHERE
IDpart1 is not NULL
AND
Lookup is not NULL
ORDER BY
ID,Lookup
) b Using(ID,Lookup)
set
a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
where
a.ID is NOT NULL
AND
a.Elg_IDpart1 is NULL
So I am forcing the index on ID, Lookup. Each table does have a index on those columns as well but because of the sub-query I forced it.
It is taking FOR-EVER to run, and it really should take, i'd imagine under 5 minutes...
My questions are in regards to the indexes, not the query.
I know that you can't use hash index in ordered index.
I currently have indexes on both ID, Lookup sperately, and as one index, and it is a B-Tree index. Based on my WHERE
Clause, does a hash index fit for as an optimization technique??
Can I have a single hash index, and the rest of the indexes b B-tree index?
This is not a primary key field.
I would post my explain but i changed the name on these tables. Basically it is using the index only for ID...instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing? The Lookup field maps the first and second part of the ID...
Any help or insight on this is appreciated.
An EXPLAIN
on the UPDATE
after I took out sub-query.
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+ | 1 | SIMPLE | m | ALL | Lookup_Idx,ID_Idx,ID_Lookup | | | | 44023 | Using where | | 1 | SIMPLE | c | ref | ID_LookupIdx | ID_LookupIdx | 5 | schema1.tableb.ID | 4 | Using where | +----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
tablea
relevant indexes:
ID_LookupIdx (ID, Lookup)
tableb
relevant indexes:
ID (ID)
Lookup_Idx (Lookup)
ID_Lookup_Idx (ID, Lookup)
All of the indexes are normal B-trees.
Firstly, to deal with the specific questions that you raise:
I currently have indexes on both ID, Lookup sperately, and as one index, and it is a B-Tree index. Based on my
WHERE
Clause, does a hash index fit for as an optimization technique??
As documented under CREATE INDEX
Syntax:
+----------------+--------------------------------+ | Storage Engine | Permissible Index Types | +----------------+--------------------------------+ | MyISAM | BTREE | | InnoDB | BTREE | | MEMORY/HEAP | HASH, BTREE | | NDB | BTREE, HASH (see note in text) | +----------------+--------------------------------+
Therefore, before even considering HASH
indexing, one should be aware that it is only available in the MEMORY
and NDB
storage engines: so may not even be an option to you.
Furthermore, be aware that indexes on combinations of ID
and Lookup
alone may not be optimal, as your WHERE
predicate also filters on tablea.Elg_IDpart1
and tableb.IDpart1
—you may benefit from indexing on those columns too.
Can I have a single hash index, and the rest of the indexes b B-tree index?
Provided that the desired index types are supported by the storage engine, you can mix them as you see fit.
instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
You could use an index hint to force MySQL to use different indexes to those that the optimiser would otherwise have selected.
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
It is usually smart enough, but not always. In this case, however, it has probably determined that the cardinality of the indexes is such that it is better to use those that it has chosen.
Now, depending on the version of MySQL that you are using, tables derived from subqueries may not have any indexes upon them that can be used for further processing: consequently the join with b
may require a full scan of that derived table (there's insufficient information in your question to determine exactly how much of a problem this might be, but schema1.tableb
having 1.5 million records suggests it could be a significant factor).
See Subquery Optimization for more information.
One should therefore try to avoid using derived tables if at all possible. In this case, there does not appear to be any purpose to your derived table as one could simply join schema1.tablea
and schema1.tableb
directly:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND a.ID IS NOT NULL
AND b.IDpart1 IS NOT NULL
AND b.Lookup IS NOT NULL
ORDER BY ID, Lookup
The only thing that has been lost is the filter for DISTINCT
records, but duplicate records will simply (attempt to) overwrite updated values with those same values again—which will have no effect, but may have proved very costly (especially with so many records in that table).
The use of ORDER BY
in the derived table was pointless as it could not be relied upon to achieve any particular order to the UPDATE
, whereas in this revised version it will ensure that any updates that overwrite previous ones take place in the specified order: but is that necessary? Perhaps it can be removed and save on any sorting operation.
One should check the predicates in the WHERE
clause: are they all necessary (the NOT NULL
checks on a.ID
and b.Lookup
, for example, are superfluous given that any such NULL
records will be eliminated by the JOIN
predicate)?
Altogether, this leaves us with:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND b.IDpart1 IS NOT NULL
Only if performance is still unsatisfactory should one look further at the indexing. Are relevant columns (i.e. those used in the JOIN
and WHERE
predicates) indexed? Are the indexes being selected for use by MySQL (bear in mind that it can only use one index per table for lookups: for testing both the JOIN
predicate and the filter predicates: perhaps you need an appropriate composite index)? Check the query execution plan by using EXPLAIN
to investigate such issues further.