I have been tasked with downloading around 100 million rows of data from Azure Table Storage. The important thing here being speed.
The process we are using is downloading 10,000 rows from Azure Table storage. Process them into a local instance of Sql Server. While processing the rows it deletes 100 rows at a time from the Azure table. This process is threaded to have 8 threads downloading 10,000 rows at a time.
The only problem with this is that according to our calculations. It will take around 40 days to download and process the around 100 million rows we have stored. Does anyone know a faster way to accomplish this task?
A side question: During the download process Azure will send back xml that just does not have any data. It doesn't send back an error. But it sends this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="azure-url/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
<title type="text">CommandLogTable</title>
<id>azure-url/CommandLogTable</id>
<updated>2010-07-12T19:50:55Z</updated>
<link rel="self" title="CommandLogTable" href="CommandLogTable" />
</feed>
0
Does anyone else have this problem and have a fix for it?
In addition to the suggestions of Disabling Nagling, there is an extremely nice post on improving performance of Azure Table Storage. Actually improving the speed of ADO.NET Deserialization provided 10x speed-up for Sqwarea (massive online multiplayer game built with Lokad.Cloud framework).
However, table storage might not be the best solution for huge storage scenarios (more than millions of records). Latency is the killing factor here. To work around that, I've been successfully using file-based database storages, where changes are done locally (without any network latency of CLAP) and are committed to BLOB by uploading the file back (concurrency and scaling out was enforced here by Lokad.CQRS App Engine for Windows Azure).
Inserting 10 millions of records to SQLite database at once (within transaction, where each record was indexed by 2 fields and had arbitrary schema-less data serialized via ProtoBuf) took only 200 seconds in total on the average. Uploading/downloading resulting file - roughly 15 seconds on the average. Random reads by index - instantaneous (provided the file is cached in the local storage and ETag is matching).