I've been importing large CSV data files, usually fewer than 100,000 records each, with PHP and MySQL (InnoDB tables). I needed to use PHP to transform some fields and do some text processing prior to the MySQL INSERTs (part of process_note_data() in the code below). MySQL's LOAD DATA was not feasible, so please do not suggest it.
I recently tried to improve the speed of this process by wrapping the inserts in a MySQL transaction using START TRANSACTION and COMMIT. The performance increase was surprising: processing time dropped by a factor of 20, so a 20-minute import now takes about 1 minute.
QUESTIONS:
1.) Does anyone understand why there was such a large performance increase (20 minutes down to 1 minute)?
2.) Should I be concerned about how big the transaction may get with 100,000 records?
3.) Should I be concerned about a large number of INSERTs and/or UPDATEs in the transaction?
/*
 * Customer Notes Data:
 * Rows are either a meeting, call or note!
 */
$row = 1;
$data = array();
$fields = array();

$db->query('SET autocommit=0;');
$db->query('START TRANSACTION;');

if (($handle = fopen("modules/".$currentModule."/Data/customernote.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 4096, ',', '"')) !== FALSE && $row < 999000) {
        if ($row == 1) {
            // Row 1 - CSV header row with field names
            $csv_fields = $data;
        } else {
            // Transform fields and INSERT (see note below)
            $fields = $this->process_note_data($data, $csv_fields, $row);
        }
        $row++;
    } // end while
    fclose($handle);
}

$db->query('COMMIT;');
$db->query('SET autocommit=1;');
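One caveat about the code above: if the import fails partway through, the open transaction holds uncommitted work that should be explicitly rolled back. A minimal sketch of guarding the same loop with ROLLBACK (this assumes process_note_data() and $db->query() throw an exception on failure; if your wrapper returns FALSE instead, check the return value and throw yourself):

$db->query('SET autocommit=0;');
$db->query('START TRANSACTION;');

try {
    if (($handle = fopen("modules/".$currentModule."/Data/customernote.csv", "r")) === FALSE) {
        throw new Exception('Could not open CSV file.');
    }

    $row = 1;
    while (($data = fgetcsv($handle, 4096, ',', '"')) !== FALSE) {
        if ($row == 1) {
            $csv_fields = $data;                                // header row
        } else {
            $this->process_note_data($data, $csv_fields, $row); // may throw
        }
        $row++;
    }
    fclose($handle);

    $db->query('COMMIT;');      // all-or-nothing: every row lands at once
} catch (Exception $e) {
    $db->query('ROLLBACK;');    // undo every INSERT from this run
    // log or re-throw $e as appropriate
}

$db->query('SET autocommit=1;');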
Note: The text/field processing is done in the call to $this->process_note_data(), which in turn calls another helper class containing the INSERT statement code. I didn't have enough room to include all of the code. $db->query() is a typical database object wrapper for MySQL queries.
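For context, a purely hypothetical sketch of what such an INSERT helper could look like using a prepared statement (the class, table, and column names here are invented for illustration and are not the asker's actual code; it assumes a mysqli connection):

// Hypothetical helper -- illustrative only; the schema is invented.
class NoteInserter
{
    private $stmt;

    public function __construct(mysqli $conn)
    {
        // Prepare once; re-binding per row avoids re-parsing the SQL
        // for each of the ~100,000 INSERTs.
        $this->stmt = $conn->prepare(
            'INSERT INTO customer_notes (customer_id, note_type, note_text)
             VALUES (?, ?, ?)'
        );
    }

    public function insert($customerId, $noteType, $noteText)
    {
        $this->stmt->bind_param('iss', $customerId, $noteType, $noteText);
        $this->stmt->execute();
    }
}

Preparing the statement once and re-binding per row means MySQL parses the SQL a single time, which is an independent speedup on top of the transaction batching.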
Please check this link:
https://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb-transaction-management.html
1.) The speedup comes from the cost of commits. From the manual:
"InnoDB must flush the log to disk at each transaction commit if that transaction made modifications to the database. When each change is followed by a commit (as with the default autocommit setting), the I/O throughput of the storage device puts a cap on the number of potential operations per second."
With autocommit on, every INSERT is its own transaction and forces its own log flush to disk; inside one explicit transaction, that flush happens once at COMMIT. For scale: 100,000 rows in 20 minutes is roughly 83 commits per second, which is about what one disk flush per row allows on typical storage.
2.) Big transactions may affect performance during commit (see the quote above).
3.) Only in case of a rollback; however, rollback performance may be optimized using some settings (check the link).
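If the size of a single 100,000-row transaction is a concern (questions 2 and 3), a common middle ground is to commit in batches. A sketch reusing the question's loop structure (it assumes $handle is already open and the same $db wrapper; $batch_size is an invented tuning knob):

// Commit every $batch_size rows instead of holding one giant transaction.
$batch_size = 10000;   // rows per transaction; tune for your workload

$db->query('SET autocommit=0;');
$db->query('START TRANSACTION;');

$row = 1;
while (($data = fgetcsv($handle, 4096, ',', '"')) !== FALSE) {
    if ($row == 1) {
        $csv_fields = $data;    // header row
    } else {
        $this->process_note_data($data, $csv_fields, $row);

        // One log flush per batch instead of per row, while the undo log
        // (and any potential rollback) stays bounded at $batch_size rows.
        if (($row - 1) % $batch_size == 0) {
            $db->query('COMMIT;');
            $db->query('START TRANSACTION;');
        }
    }
    $row++;
}

$db->query('COMMIT;');   // commit the final, possibly partial batch
$db->query('SET autocommit=1;');

The linked manual page also covers server settings such as innodb_flush_log_at_trx_commit, which can relax the flush-at-commit behavior at some cost to durability.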