views: 500
answers: 6
I have been tasked with downloading around 100 million rows of data from Azure Table Storage. The important thing here being speed.

The process we are using downloads 10,000 rows at a time from Azure Table Storage and loads them into a local instance of SQL Server. While processing the rows, it deletes them from the Azure table 100 rows at a time. The process runs 8 threads, each downloading 10,000 rows at a time.
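The batching described above (carving each 10,000-row download into 100-row delete batches) can be sketched generically. This is illustrative Python, not the poster's actual .NET code; the helper name is mine:

```python
from itertools import islice

def batches(rows, size=100):
    """Yield successive fixed-size batches from an iterable of rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# e.g. grouping one 10,000-row download into 100-row delete batches:
groups = list(batches(range(10_000), 100))
```

Each batch would then be issued as one delete round trip, which is where the per-request latency discussed below starts to dominate.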

The only problem is that, according to our calculations, it will take around 40 days to download and process the roughly 100 million rows we have stored. Does anyone know a faster way to accomplish this task?

A side question: during the download process, Azure sometimes sends back XML that contains no data. It doesn't send back an error; it just sends this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="azure-url/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">CommandLogTable</title>
  <id>azure-url/CommandLogTable</id>
  <updated>2010-07-12T19:50:55Z</updated>
  <link rel="self" title="CommandLogTable" href="CommandLogTable" />
</feed>

Does anyone else have this problem and have a fix for it?

A: 

Most likely, your limiting factor is network bandwidth, not processing. If that's the case, your only real hope is to scale out: more machines running more threads to download data.

BTW, doesn't Azure expose some "export" mechanism that will remove the need to download all of the rows manually?

JSBangs
From what I can tell, the limiting factor is not bandwidth. It's the latency of getting and deleting rows from Azure that's the problem.
jWoose
@jWoose: How are you determining this? I have a hard time believing that you're NOT I/O bound.
Esteban Araya
+4  A: 

As to your side question, I expect you're getting a "continuation token." If you're using the .NET storage client library, try adding .AsTableServiceQuery() to your query.

As to your main question, fanning out the query is the best thing that you can do. It sounds like you're accessing storage from a local machine (not in Windows Azure). If so, I would imagine you can speed things up quite a bit by deploying a small service to Windows Azure which fetches the data from table storage (much faster, since there's higher bandwidth and lower latency within the data center), and then compresses the results and sends them back down to your local machine. There's a lot of overhead to the XML Windows Azure tables send back, so stripping that out and bundling up rows would probably save a lot of transfer time.
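The continuation-token behavior behind the side question can be sketched as a paging loop. This is a hypothetical Python stand-in (`query_page` is an invented wrapper for one Table Storage request); in the .NET storage client library, `.AsTableServiceQuery()` does this looping for you:

```python
def fetch_all(query_page, token=None):
    """Drain a paged table query by following continuation tokens.

    `query_page(token)` is assumed to return `(rows, next_token)` for one
    request; `next_token` is None when the server has no more pages.
    A page may legitimately contain zero rows, as in the empty <feed>
    above, while still carrying a continuation token.
    """
    while True:
        rows, token = query_page(token)
        yield from rows          # may be empty; keep going regardless
        if token is None:
            return

# demo with a fake three-page query (one page is empty, as observed):
_pages = {None: (["a", "b"], 1), 1: ([], 2), 2: (["c"], None)}
rows = list(fetch_all(_pages.get))
```

The key point is that an empty page is not the end of the data; only a missing continuation token is.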

smarx
I agree with Steve's suggested approach. Additionally, consider writing your compressed images to blob storage. That makes them very easy to retrieve from your on-premise environment.
David Makogon
You are correct about my side question. The continuation token gets sent back if your request takes longer than 5 seconds.
jWoose
+1  A: 

Aside from suggestions about bandwidth limits, you could easily be running into storage account limits, as each table partition is limited to roughly 500 transactions per second.

Further: Nagle's algorithm, a TCP optimization that batches small writes, can actually slow things down for small reads (such as your 1K data reads). Here's a blog post about disabling Nagling, which could potentially speed up your reads considerably, especially if you're running directly in an Azure service without Internet latency in the way.
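In the .NET client this is done via `ServicePointManager.UseNagleAlgorithm = false;`. The same idea at the raw-socket level, shown here in Python since the poster's code isn't available, is setting the `TCP_NODELAY` option:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes are sent immediately instead
# of being buffered while an earlier packet's ACK is still outstanding.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
flag = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
s.close()
```

For a workload of many small HTTP requests, leaving Nagle on means each small request can wait on the previous ACK, which compounds across millions of round trips.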

David Makogon
+5  A: 

In addition to the suggestion about disabling Nagling, there is an extremely nice post on improving the performance of Azure Table Storage: improving the speed of ADO.NET deserialization provided a 10x speed-up for Sqwarea (a massively multiplayer online game built with the Lokad.Cloud framework).

However, table storage might not be the best solution for huge storage scenarios (millions of records or more). Latency is the killing factor here. To work around that, I've successfully used file-based database storage, where changes are made locally (without any network latency) and are committed back to a blob by uploading the file (concurrency and scaling out were enforced by the Lokad.CQRS App Engine for Windows Azure).

Inserting 10 million records into a SQLite database at once (within a transaction, where each record was indexed by 2 fields and had arbitrary schema-less data serialized via ProtoBuf) took only 200 seconds in total on average. Uploading/downloading the resulting file took roughly 15 seconds on average. Random reads by index were instantaneous (provided the file is cached in local storage and the ETag matches).
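A scaled-down sketch of the single-transaction bulk insert described here (100,000 rows instead of 10 million, with dummy blobs standing in for the ProtoBuf payloads, and an in-memory database; the table and index names are mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (pk TEXT, rk TEXT, body BLOB)")
conn.execute("CREATE INDEX ix_pk ON log (pk)")  # the two index fields
conn.execute("CREATE INDEX ix_rk ON log (rk)")

rows = [(f"p{i % 10}", f"r{i}", b"\x00" * 16) for i in range(100_000)]
with conn:  # one transaction around the whole batch, as described
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

count, = conn.execute("SELECT COUNT(*) FROM log").fetchone()
```

Wrapping the whole batch in a single transaction is what makes this fast: SQLite fsyncs once per transaction, not once per row.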

Rinat Abdullin
Thanks for your advice. This should end up helping a lot. And I just wanted to say that, yes, table storage is not ideal for this many records. It was a workaround for being throttled by SQL Azure. The SQL Azure problem has been fixed, and while we are no longer writing to table storage, we still need the data that is already stored there.
jWoose
I'm glad I've helped. Table storage is good (although the API could've been much better) and irreplaceable for things like storing view data for highly scalable web applications. Yet in scenarios that require extremely low latency and high throughput, it's not the best fit (just like SQL Azure).
Rinat Abdullin
Rinat and jWoose: Azure Table Storage is NOT relational. It is a NoSQL, schema-less, distributed database, probably implemented in a way similar to what you describe. Azure Table Storage is specifically designed for gazillions of records.
Panagiotis Kanavos
Panagiotis, nobody argued ATS to be RDB.
Rinat Abdullin
A: 

The big factor here is how the data is spread across partitions. A query that spans partition boundaries will return at each boundary, requiring a resubmit, even if the partition in question has 0 rows. If the data is laid out as 1 partition = 1 row, it will be slow, but you could increase the thread count well above 8. If the data is laid out as n partitions = m rows, then the ideas below should speed you up.

Assuming that you have multiple partitions, each with some number of rows, the fastest way to go will be to spin up as many threads as possible (in .NET, via PLINQ, Parallel.ForEach, or ThreadPool.QueueUserWorkItem) and have each thread scan its partition for all rows, process them, post them to SQL, and delete them before returning.

Given the latencies involved (tens of milliseconds) and the multiple round trips, even with 8 threads you are probably not as busy as you might think. Also, you don't mention which VM size you are using, but you may want to profile different sizes.

Alternatively, you could leverage a queue and some number of workers. For each partition (or set of partitions), put a message in the queue. Have the workers pull from the queue (multi-threaded) and query/process/post/repeat. You could spin up as many workers as needed and spread the work across more of the data center (i.e., more throughput).
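The queue-and-workers pattern above can be sketched in-process. This Python version uses local threads and a local queue purely for illustration; in the real design the queue would be an Azure queue and each worker a separate role instance, and `process_partition` is a hypothetical placeholder for the query/post/delete work:

```python
import queue
import threading

work = queue.Queue()
for partition in (f"partition-{i}" for i in range(20)):
    work.put(partition)          # one message per partition

results = []
lock = threading.Lock()

def process_partition(name):
    # placeholder for: query the partition, post rows to SQL, delete them
    return f"done:{name}"

def worker():
    while True:
        try:
            p = work.get_nowait()
        except queue.Empty:
            return               # queue drained; worker exits
        r = process_partition(p)
        with lock:
            results.append(r)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the queue decouples producers from consumers, you can raise throughput just by adding workers, without repartitioning the work.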

Pat
A: 

The fastest way to get your data, supported by Amazon but not yet by Azure, is to ship them a USB disk (even a USB stick), have them put the data on the disk, and ship it back to you.

Another option is to use the AppFabric Service Bus to push the data out to another system as it is created, instead of waiting to download it all at once.

Panagiotis Kanavos