I am using C# to import a CSV with 6-8million rows.
My table looks like this:
CREATE TABLE [Data] ([ID] VARCHAR(100) NULL,[Raw] VARCHAR(200) NULL)
CREATE INDEX IDLookup ON Data(ID ASC)
I am using System.Data.SQLite to do the import.
Currently to do 6 millions rows its taking 2min 55 secs on a Windows 7 32bit, Core2Duo 2.8Ghz & 4GB RAM. That's not too bad but I was just wondering if anyone could see a way of importing it quicker.
Here is my code:
public class Data
{
public string IDData { get; set; }
public string RawData { get; set; }
}
string connectionString = @"Data Source=" + Path.GetFullPath(AppDomain.CurrentDomain.BaseDirectory + "\\dbimport");
System.Data.SQLite.SQLiteConnection conn = new System.Data.SQLite.SQLiteConnection(connectionString);
conn.Open();
//Dropping and recreating the table seems to be the quickest way to get old data removed
System.Data.SQLite.SQLiteCommand command = new System.Data.SQLite.SQLiteCommand(conn);
command.CommandText = "DROP TABLE Data";
command.ExecuteNonQuery();
command.CommandText = @"CREATE TABLE [Data] ([ID] VARCHAR(100) NULL,[Raw] VARCHAR(200) NULL)";
command.ExecuteNonQuery();
command.CommandText = "CREATE INDEX IDLookup ON Data(ID ASC)";
command.ExecuteNonQuery();
string insertText = "INSERT INTO Data (ID,RAW) VALUES(@P0,@P1)";
SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
Stopwatch sw = new Stopwatch();
sw.Start();
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
foreach (var item in f)
{
command.Parameters.AddWithValue("@P0", item.IDData);
command.Parameters.AddWithValue("@P1", item.RawData);
command.ExecuteNonQuery();
}
}
trans.Commit();
sw.Stop();
Debug.WriteLine(sw.Elapsed.Minutes + "Min(s) " + sw.Elapsed.Seconds + "Sec(s)");
conn.Close();
This is quite fast for 6 million records.
It seems that you are doing it the right way, some time ago I've read on sqlite.org that when inserting records you need to put these inserts inside transaction, if you don't do this your inserts will be limited to only 60 per second! That is because each insert will be treated as a separate transaction and each transaction must wait for the disk to rotate fully. You can read full explanation here:
http://www.sqlite.org/faq.html#q19
Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.
Comparing your time vs Average stated above: 50,000 per second => that should take 2m 00 sec. Which is only little faster than your time.
Transaction speed is limited by disk drive speed because (by default) SQLite actually waits until the data really is safely stored on the disk surface before the transaction is complete. That way, if you suddenly lose power or if your OS crashes, your data is still safe. For details, read about atomic commit in SQLite..
By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.
There is some hint in next paragraph that you could try to speed up the inserts:
Another option is to run PRAGMA synchronous=OFF. This command will cause SQLite to not wait on data to reach the disk surface, which will make write operations appear to be much faster. But if you lose power in the middle of a transaction, your database file might go corrupt.
I always thought that SQLite was designed for "simple things", 6 millions of records seems to me is a job for some real database server like MySQL.
Counting records in a table in SQLite with so many records can take long time, just for your information, instead of using SELECT COUNT(*), you can always use SELECT MAX(rowid) which is very fast, but is not so accurate if you were deleting records in that table.
EDIT.
As Mike Woodhouse stated, creating the index after you inserted the records should speed up the whole thing, that is a common advice in other databases, but can't say for sure how it works in SQLite.