How to speed up bulk insert to MS SQL Server from CSV using pyodbc

TangoAlee · Apr 15, 2015 · Viewed 64.9k times

Below is my code that I'd like some help with. I have to run it over 1,300,000 rows, and at the current pace it takes up to 40 minutes to insert ~300,000 rows.

I figure bulk insert is the route to go to speed it up. Or is it slow because I'm iterating over the rows in the for data in reader: loop?

import csv
import os

#Opens the prepped csv file
with open(os.path.join(newpath, outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'insert into SpikeData123({0}) values ({1})'
    #puts column names and parameter placeholders in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print('Query is: %s' % query)
    #starts cursor from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything row by row, committing after each insert
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am picking the column headers dynamically on purpose (as I would like to keep the code as pythonic as possible).

SpikeData123 is the table name.

Answer

Gord Thompson · Nov 1, 2017

As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.
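
If the file can be staged somewhere the server can read, a BULK INSERT can be issued through pyodbc like any other statement. A minimal sketch, assuming a hypothetical share path and connection string, and the SpikeData123 table from the question:

import pyodbc

# hypothetical connection string; adjust driver/server/database for your environment
conn_str = (
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()

# the path must be readable by the SQL Server service account,
# e.g. a drive local to the server or an SMB share
crsr.execute(r"""
    BULK INSERT SpikeData123
    FROM '\\fileserver\share\SpikeData123.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
""")

Here FIRSTROW = 2 skips the header row that the question's script reads for column names.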

pyodbc 4.0.19 added a Cursor#fast_executemany feature, which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...

import time
import pyodbc

cnxn = pyodbc.connect(conn_str, autocommit=True)  # conn_str: a standard ODBC connection string
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...

import time
import pyodbc

cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

crsr.fast_executemany = True  # new in pyodbc 4.0.19

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... reduced the execution time to just over 1 second.
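
Applied to the original CSV loop, the same idea means collecting rows into batches and calling executemany instead of executing and committing one row at a time. A rough sketch along those lines, reusing the names from the question (the 10,000-row batch size is an arbitrary starting point, not a tuned value):

import csv
import os
import pyodbc

cnxn = pyodbc.connect(conn_str, autocommit=True)  # conn_str as above
crsr = cnxn.cursor()
crsr.fast_executemany = True

with open(os.path.join(newpath, outfile), 'r') as f:
    reader = csv.reader(f)
    # header row supplies the column names, as in the question
    columns = [x.strip(' ') for x in next(reader)]
    query = 'insert into SpikeData123({0}) values ({1})'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    batch = []
    for data in reader:
        batch.append(data)
        if len(batch) >= 10000:  # arbitrary batch size; tune for your workload
            crsr.executemany(query, batch)
            batch = []
    if batch:  # flush any remaining rows
        crsr.executemany(query, batch)

With autocommit=True each executemany call is committed as it completes, so the per-row cursor.commit() from the original code is no longer needed.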