I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop through each row in the feed, do a select query to check if the item already exists, and insert or update the item depending on whether it exists or not.
I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal but store them in memory. Then, for every 100 or so items, do a single select on those 100 items and get a list of the ones that already exist in the database. Then concatenate the insert/update statements together and run them against the database in one batch. This would essentially cut down on the trips to the database.
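To illustrate what I mean, the batch I'd send for each chunk of ~100 items would look something like this (table and column names are just made up for the example):

    -- rows that already exist in the target table become UPDATEs
    UPDATE Products SET Name = 'Widget', Price = 9.99 WHERE ProductID = 101;
    UPDATE Products SET Name = 'Gadget', Price = 4.50 WHERE ProductID = 102;
    -- rows that don't exist yet become INSERTs
    INSERT INTO Products (ProductID, Name, Price) VALUES (103, 'Gizmo', 2.25);
    INSERT INTO Products (ProductID, Name, Price) VALUES (104, 'Doohickey', 7.00);
    -- ...and so on for the rest of the chunk, all sent in a single round trip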
Is this a scalable enough solution, and are there any example tutorials on importing large feeds into a production environment?
Thanks
Seeing that you're using SQL Server 2008, I would recommend this approach:
Check out the MSDN docs and a great blog post on how to use the MERGE command.
Basically, you create a link between your actual data table and the staging table (the table your CSV feed gets loaded into) on a common criterion (e.g. a common primary key), and then you can define what to do when the rows match (the row exists in both tables) and what to do when they don't (the row exists only in the staging table).
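Getting the CSV feed into that staging table is up to you, but as a rough sketch (the file path, the ProductStaging table, and the assumption of a comma-delimited feed with a header row are all just for illustration), something like BULK INSERT can load it in one shot:

    BULK INSERT dbo.ProductStaging
    FROM 'C:\feeds\products.csv'
    WITH (
        FIELDTERMINATOR = ',',   -- column separator in the feed
        ROWTERMINATOR = '\n',    -- line separator in the feed
        FIRSTROW = 2             -- skip the header row
    );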
You would have a MERGE statement something like this:
MERGE TargetTable AS t
USING SourceTable AS src
    ON t.PrimaryKey = src.PrimaryKey
WHEN NOT MATCHED THEN
    INSERT (list of fields)
    VALUES (list of values)
WHEN MATCHED THEN
    UPDATE
        SET (list of SET statements)
;
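Filled in with made-up table and column names (ProductStaging as the source, Products as the target, matched on ProductID), the whole thing could look like this:

MERGE dbo.Products AS t
USING dbo.ProductStaging AS src
    ON t.ProductID = src.ProductID
WHEN NOT MATCHED THEN
    -- the row exists only in the staging table: insert it
    INSERT (ProductID, Name, Price)
    VALUES (src.ProductID, src.Name, src.Price)
WHEN MATCHED THEN
    -- the row already exists in the target table: update it
    UPDATE
        SET Name  = src.Name,
            Price = src.Price
;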
Of course, the ON clause can be much more involved if needed. And of course, your WHEN statements can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ...

and so forth.
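For instance (again with made-up column names), you could swap the WHEN MATCHED branch in the example above for one that skips rows where nothing actually changed:

WHEN MATCHED AND (t.Name <> src.Name OR t.Price <> src.Price) THEN
    -- only touch rows whose values differ from the feed
    UPDATE
        SET Name  = src.Name,
            Price = src.Price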
MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!