I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.

However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each individual play of each second in its own row is unrealistic (it would result in 1+ billion rows), as I want to be able to fetch this raw data fast.

So my question is: how do I best go about storing these massive amounts of timeline data?

One idea I had was to use a text/blob column and comma-separate the play counts, each position representing one second (in sequence) and holding the number of times that second has been played. So if there are 100,000 plays in second 1, 90,000 plays in second 2 and 95,000 plays in second 3, I would store it like this: "100000,90000,95000,[...]" in the text/blob column.

Is this a feasible way to store such data? Is there a better way?

Thanks!

Edit: the data is being tracked in another source and I only need to update the raw graph data every 15 min or so. Hence, fast reads are the main concern.

Note: due to the nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).

A: 

Would it be problematic to store one row per second, holding how many plays that second got? That means about 10K rows per podcast, which isn't bad, and you just have to INSERT a row for each second with the current data.

EDIT: I would say that that solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
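A minimal sketch of what that row-per-second structure could look like (table, column, and value names here are just placeholders):

create table podcast_plays (
    podcast_id int not null,
    second     int not null,
    plays      int not null default 0,
    primary key (podcast_id, second)
);

-- graph read: one podcast's ~10K rows, returned in order via the primary key
select second, plays
from podcast_plays
where podcast_id = 123
order by second;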

zebediah49
First off, thanks for your quick reply! I should perhaps clarify that I'm not concerned too much with the update logic when the data structure itself is this simple, nor am I too concerned with the update frequency (only updating every 15 min is acceptable). As far as I can see, I'd need three columns: podcast_id, second, and plays. To retrieve the graph data, I'll need to retrieve 10K records, queried on a foreign key and sorted by an integer. Wouldn't this take a second or two just to retrieve?
Jamie Appleseed
A: 

I would view it as a key-value problem.

for each second played

   Song[second] += 1

end

As a relational database -

song
----
name | second | plays

And a hack pseudo-SQL statement to start a second:

insert into song(name, second, plays) values("xyz", "abc", 0)

and another to update the second

update song set plays = plays + 1 where name = "xyz" and second = "abc"

A 3-hour podcast would have 11K rows.
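Assuming a unique key on (name, second), the two statements could also be collapsed into a single upsert; MySQL syntax shown here as one possibility (other databases have their own equivalents), with 42 standing in for the second offset:

insert into song (name, second, plays)
values ('xyz', 42, 1)
on duplicate key update plays = plays + 1;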

Paul Nathan
Thanks for your reply. 10-11K rows seems like a lot for each podcast if I need to retrieve the raw graph data fast (preferably around 200ms). Especially since the rows need to be queried and sorted. Do you have any experience with how long 10-11K rows take to retrieve when queried on a foreign key and sorted by an integer column? Most of the stuff I've been working with so far has been around 100 rows (representing pages in a CMS), which is a whole different story!
Jamie Appleseed
If the data is only updated every 15 minutes, then why do you need to retrieve the raw data in 200ms? Retrieve it in 3 seconds and cache it until the next update.
WW
Having to wait 3 seconds on the first load is undesirable, especially since most graphs will only be opened a few times a day (meaning most users will experience the 3 sec load time). I guess I could look into "pre-caching" it (generating the cache immediately after each update); however, that would require an awful lot of computing power to do for all updated podcasts every 15 min.
Jamie Appleseed
A: 

It really depends on what is generating the data...

As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.

What are the pieces in the event, unit of work, or transaction you are loading?

Can I assume you have a play event with the podcast name, start and stop times, and you want to load it into the map for analysis and presentation?

If that's the case, you can have a table with:

  • podcastId
  • secondOffset
  • playCount

Each event would do an update of the rows between the start and ending positions:

update t set playCount = playCount + 1 where podcastId = x and secondOffset between y and z

and then followed by an insert to add those rows between the start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
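A sketch of that follow-up insert, assuming a hypothetical numbers helper table with one row per possible secondOffset, and reusing the x/y/z placeholders from above:

insert into t (podcastId, secondOffset, playCount)
select x, n.secondOffset, 1
from numbers n
where n.secondOffset between y and z
  and not exists (
      select 1 from t
      where t.podcastId = x
        and t.secondOffset = n.secondOffset
  );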

Depending on the DB, you may have the ability to set up a sparse table where empty columns are not stored, making it more efficient.

Rawheiser
A: 

The problem with the blob storage is that you need to update the entire blob for all of the changes. This is not necessarily a bad thing. Using your format ("100000,90000,..."), that's roughly 7 bytes per second (six digits plus a comma), so 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play of every second.

And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).

And there's a lot of parsing overhead marshalling that data in and out.

On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned, allowing podcasts up to 18 hrs long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3 hr song is 3600 * 3 * 10 = ~108K bytes per song.

If you stored it as a binary blob rather than text (a block of 4-byte longs), it's 4 * 3600 * 3 = ~43K.

So, the second/row structure is "only" twice the size (in a perfect world, consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.

The only downside of second-per-row is if you need to do a lot of updates (several seconds at once for one song); that's a lot of UPDATE traffic to the DB, whereas with the blob method it's likely a single update.
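Since you say the data arrives in 15-minute batches, one way to cut that UPDATE traffic with the row approach is a multi-row upsert; MySQL syntax shown as one possibility, with placeholder table/column names and values, assuming a primary key on (podcast_id, second):

insert into podcast_plays (podcast_id, second, plays)
values (123, 41, 17), (123, 42, 3), (123, 43, 9)
on duplicate key update plays = plays + values(plays);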

Your traffic patterns will influence that more than anything.

Will Hartung
Thanks for this pros/cons list – very helpful! I'll be able to receive the tracking data in batches at 15 min intervals (= easily 1,000 updates), so that's definitely a plus for the blob approach. Also, I only need to use this data for the timeline graph, so being able to query the data is unimportant. That being said, I do find the flexibility of separate rows appealing (and looking at the answers, a lot of people seem to feel the same way). It does seem like the blob approach is feasible though, so I'll do some testing on both approaches and see which one works best in practice.
Jamie Appleseed