Hello,

I am trying to use SimpleDB in the following way.

I want to keep 48 hours' worth of data in SimpleDB at any time and query it for different purposes. Each domain holds 1 hour of data, so at any time there are 48 domains in SimpleDB. As new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.

Each domain is about 50MB in size; the total size of all the domains is around 2.2 GB. Each item in a domain has the following attributes (a sketch of the whole setup follows the list):
identifier - around 50 characters long -- 1 per item
timestamp - timestamp value -- 1 per item
serial_n_data - 500-1000 bytes of data -- 200 per item
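In boto, the rotation and upload look roughly like this (a minimal sketch; the domain naming scheme and exact attribute layout here are placeholders, and credentials are read from the environment):

    import time
    import boto

    sdb = boto.connect_sdb()  # credentials from the environment

    def rotate(new_hour, oldest_hour):
        """Create the domain for the new hour, drop the oldest of the 48."""
        sdb.create_domain('data_%s' % new_hour)
        sdb.delete_domain('data_%s' % oldest_hour)

    def put_item(domain_name, identifier, serial_chunks):
        """One item per second: identifier, timestamp, ~200 data values."""
        domain = sdb.get_domain(domain_name)
        domain.put_attributes(identifier, {
            'identifier': identifier,
            'timestamp': str(int(time.time())),
            # multi-valued attribute: a list of ~200 values, 500-1000 bytes each
            'serial_n_data': serial_chunks,
        })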

I'm using the Python boto library to upload and query the data. I insert 1 item per second, with around 200 attributes per item, into the current domain.

For one application of this data, I need to get all the data from all 48 domains. The query looks like "SELECT * FROM domain", run against each domain. I use 8 threads to query the data, with each thread taking responsibility for a few domains (see the sketch below), e.g.
domain 1-6: thread 1
domain 7-12: thread 2, and so on
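A simplified version of what each thread does (domain names are placeholders; boto's domain.select iterates through all result pages internally):

    import threading
    import boto

    sdb = boto.connect_sdb()
    results = {}
    all_domains = ['data_%02d' % h for h in range(48)]

    def fetch_domains(domain_names):
        for name in domain_names:
            domain = sdb.get_domain(name)
            # domain.select follows NextToken pages as you iterate
            results[name] = list(domain.select('SELECT * FROM `%s`' % name))

    threads = [threading.Thread(target=fetch_domains,
                                args=(all_domains[i * 6:(i + 1) * 6],))
               for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()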

It takes close to 13 minutes to get the entire data set. I am using boto's select method for this. I need much faster performance than this. Any suggestions for speeding up the querying process? Is there another language I could use that would speed things up?

+1  A: 

Use more threads

I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.

One of the limits of SimpleDB is the query response size cap at 1MB. This means pulling down the 50MB in a single domain will take a minimum of 50 Selects (the original + 49 additional pages). These must occur sequentially because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume) you spend 2 minutes on each domain. If every thread has to iterate through each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should cut that down to about 2 minutes easily.
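To make that sequential dependency concrete, here is what paging through one domain looks like with boto's lower-level select call (the domain name is a placeholder):

    import boto

    sdb = boto.connect_sdb()

    def select_all(domain_name):
        """Page through one domain; each response is capped at 1MB."""
        items, next_token = [], None
        while True:
            rs = sdb.select(domain_name,
                            'SELECT * FROM `%s`' % domain_name,
                            next_token=next_token)
            items.extend(rs)
            # The NextToken from this response is required before the
            # next page can be requested, so the ~50 Selects can't overlap.
            next_token = rs.next_token
            if not next_token:
                break
        return items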

But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:

SELECT * FROM domain WHERE timestamp between '12:00' and '12:02'

(Obviously, you'd use real timestamp values.) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you can get a lot more concurrency. You will have to test for yourself how many threads gives you the best throughput. I would encourage you to try up to 60 per domain, breaking the Select conditions down to one-minute increments. If that works for you, you will have fully parallel queries and most likely have eliminated all follow-up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
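As a sketch of that idea (the timestamp format, domain name, and two-minute slice width are assumptions to adapt to your data):

    import threading
    import boto

    sdb = boto.connect_sdb()

    def fetch_slice(domain_name, start, end, out, i):
        """Fetch one time slice, following any continuation pages."""
        query = ("SELECT * FROM `%s` WHERE timestamp >= '%s' "
                 "AND timestamp < '%s'" % (domain_name, start, end))
        items, next_token = [], None
        while True:
            rs = sdb.select(domain_name, query, next_token=next_token)
            items.extend(rs)
            next_token = rs.next_token
            if not next_token:
                break
        out[i] = items

    def fetch_domain(domain_name, hour):
        # 30 slices of 2 minutes each; '<hour>:60' works as a
        # lexicographic upper bound for the final slice
        slices = [('%s:%02d' % (hour, m), '%s:%02d' % (hour, m + 2))
                  for m in range(0, 60, 2)]
        out = [None] * len(slices)
        threads = [threading.Thread(target=fetch_slice,
                                    args=(domain_name, s, e, out, i))
                   for i, (s, e) in enumerate(slices)]
        for t in threads: t.start()
        for t in threads: t.join()
        return [item for chunk in out for item in chunk]

Ideally each slice comes back under 1MB, so the threads never need a second page.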

The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region. But the actual time it takes will depend on a number of other factors.

Cost Concerns

As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time, so creating and deleting one each hour adds up to about $70 per month just for domain management.

You can store orders of magnitude more data in a domain than the 50MB you mention, so you might want to let the data accumulate longer before you delete. If your queries include the timestamp (or could be made to include the timestamp), query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes never suffer a performance hit from a large domain size; it is only queries that don't make good use of a selective index that slow down. You'd have to test your queries to see. That is just a suggestion; I realize the create/delete approach is cleaner conceptually.

Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula. The box usage for writes is proportional to the number of attributes raised to the power of 3! The formula, in hours, is:

0.0000219907 + 0.0000000002 N^3

where N is the number of attributes (the first term is the base charge, the second the per-attribute charge). In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume, but reduce the box usage charges by an order of magnitude, to about $28 per million items. If you are able to break the requests down, it may be worth doing. If you cannot (due to request volume, or just the nature of your app), it means that SimpleDB can end up being extremely unappealing from a cost standpoint.
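To sanity-check those numbers (the $0.14 per box-usage hour rate is my assumption based on SimpleDB's published pricing at the time; plug in the current rate, which also accounts for the small differences from the round figures above):

    # box usage in hours for one PutAttributes call of N attributes,
    # per the formula above
    def put_box_usage(n):
        return 0.0000219907 + 0.0000000002 * n ** 3

    RATE = 0.14  # assumed $ per box-usage hour

    # 200 attributes in one request, per million items:
    print(1e6 * put_box_usage(200) * RATE)      # ~$227
    # 4 requests of 50 attributes each, per million items:
    print(1e6 * 4 * put_box_usage(50) * RATE)   # ~$26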

Mocky
A: 

Mocky, currently using our 8 thread approach (as opposed to your 30 thread approach), we are getting very high CPU usage and load averages (we are running on a High-CPU Medium Instance, by the way). Should we be seeing this high CPU load with 8 threads (Python accessing SimpleDB via the boto library)? Do you think the high CPU load is due more to thread overhead or to network connectivity between our EC2 instance and the SimpleDB server farm?

Thanks, Charlie

Charlie Benton
I wouldn't think that thread overhead from 8 threads or network connectivity would saturate your CPU. I haven't profiled boto, but it's possible that parsing large XML responses is the culprit. I have seen this happen with other libraries.
Mocky
Mocky, I tried pretty much every possible approach to process the SimpleDB data within my time constraints. I have around 1.5 million rows to process, and the boto and Python combination takes a lot of time to process this much data. It takes around 6.5 minutes to get the data into the desired format before I can start the calculations on it. Do you think that using C or C++ would give some performance increase? Is there a good SimpleDB library for C/C++?
Sujit