I have a very large Redshift database that contains billions of rows of HTTP request data.
I have a table called requests
which has a few important fields:
ip_address
city
state
country
I have a Python process running once per day, which grabs all distinct rows which have not yet been geocoded (do not have any city / state / country information), and then attempts to geocode each IP address via Google's Geocoding API.
This process (pseudocode) looks like this:
for ip_address in ips_to_geocode:
country, state, city = geocode_ip_address(ip_address)
execute_transaction('''
UPDATE requests
SET ip_country = %s, ip_state = %s, ip_city = %s
WHERE ip_address = %s
''')
When running this code, I often receive errors like the following:
psycopg2.InternalError: 1023
DETAIL: Serializable isolation violation on table - 108263, transactions forming the cycle are: 647671, 647682 (pid:23880)
I'm assuming this is because I have other processes constantly logging HTTP requests into my table, so when I attempt to execute my UPDATE statement, it is unable to select all rows with the ip address I'd like to update.
My question is this: what can I do to update these records in a sane way that will stop failing regularly?
Your code is violating the serializable isolation level of Redshift. You need to make sure that your code is not trying to open multiple transactions on the same table before closing all open transactions.
You can achieve this by locking the table in each transaction so that no other transaction can access the table for updates until the open transaction gets closed. Not sure how your code is architected (synchronous or asynchronous), but this will increase the run time as each lock will force others to wait till the transaction gets over.
Refer: http://docs.aws.amazon.com/redshift/latest/dg/r_LOCK.html