I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.

However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each individual play of each second in its own row is unrealistic (it would result in 1+ billion rows), as I want to be able to fetch this raw data fast.

So my question is: how do I best go about storing these massive amounts of timeline data?

One idea I had was to use a text/blob column and comma-separate the play counts, each position representing one second (in sequence) and holding the number of times that second has been played. So if there are 100,000 plays in second 1, 90,000 plays in second 2 and 95,000 plays in second 3, I would store it like this: "100000,90000,95000,[...]" in the text/blob column.

Is this a feasible way to store such data? Is there a better way?

Thanks!

Edit: the data is being tracked in another source and I only need to update the raw graph data every 15 min or so. Hence, fast reads are the main concern.

Note: due to the nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).

A: 

Would it be problematic to store one row per second, holding how many plays that second got? That means about 10K rows per podcast, which isn't bad, and you just have to INSERT a row for each second with the current data.

EDIT: I would say that that solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
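A minimal sketch of what that row-per-second structure could look like (table, column, and value names here are just placeholders):

create table podcast_plays (
    podcast_id int not null,
    second     int not null,
    plays      int not null default 0,
    primary key (podcast_id, second)
);

-- graph read: one podcast's ~10K rows, returned in order via the primary key
select second, plays
from podcast_plays
where podcast_id = 123
order by second;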

zebediah49
First off, thanks for your quick reply! I should perhaps clarify that I'm not concerned too much with the update logic when the data structure itself is this simple, nor am I too concerned with the update frequency (only updating every 15 min is acceptable). As far as I can see, I'd need three columns: podcast_id, second, and plays. To retrieve the graph data, I'll need to retrieve 10K records, queried on a foreign key and sorted by an integer. Wouldn't this take a second or two just to retrieve?
Jamie Appleseed
A: 

I would view it as a key-value problem.

for each second played

   Song[second] += 1

end

As a relational database -

song
----
name | second | plays

And a hack pseudo-SQL statement to start a second:

insert into song(name, second, plays) values("xyz", "abc", 0)

and another to update the second

update song set plays = plays + 1 where name = "xyz" and second = "abc"

A 3-hour podcast would have 11K rows.
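Assuming a unique key on (name, second), the two statements could also be collapsed into a single upsert; MySQL syntax shown here as one possibility (other databases have their own equivalents), with 42 standing in for the second offset:

insert into song (name, second, plays)
values ('xyz', 42, 1)
on duplicate key update plays = plays + 1;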

Paul Nathan
Thanks for your reply. 10-11K rows seems like a lot for each podcast if I need to retrieve the raw graph data fast (preferably around 200ms). Especially since the rows need to be queried and sorted. Do you have any experience with how long 10-11K rows take to retrieve when queried on a foreign key and sorted by an integer column? Most of the stuff I've been working with so far has been around 100 rows (representing pages in a CMS), which is a whole different story!
Jamie Appleseed
If the data is only updated every 15 minutes, then why do you need to retrieve the raw data in 200ms? Retrieve it in 3 seconds and cache it until the next update.
WW
Having to wait 3 seconds on the first load is undesirable, especially since most graphs will only be opened a few times a day (meaning most users will experience the 3 sec load time). I guess I could look into "pre-caching" it (generating the cache immediately after each update); however, that would require an awful lot of computing power to do for all updated podcasts every 15 min.
Jamie Appleseed
A: 

It really depends on what is generating the data...

As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.

What are the pieces in the event, unit of work, or transaction you are loading?

Can I assume you have a play event with the podcast name, start and stop times, and you want to load it into the map for analysis and presentation?

If that's the case, you can have a table with:

  • podcastId
  • secondOffset
  • playCount

Each event would do an update of the rows between the start and ending positions:

update t set playCount = playCount + 1 where podcastId = x and secondOffset between y and z

and then followed by an insert to add those rows between the start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
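A sketch of that follow-up insert, assuming a hypothetical numbers helper table with one row per possible secondOffset, and reusing the x/y/z placeholders from above:

insert into t (podcastId, secondOffset, playCount)
select x, n.secondOffset, 1
from numbers n
where n.secondOffset between y and z
  and not exists (
      select 1 from t
      where t.podcastId = x
        and t.secondOffset = n.secondOffset
  );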

Depending on the DB, you may have the ability to set up a sparse table where empty columns are not stored, making it more efficient.

Rawheiser
A: 

The problem with the blob storage is that you need to update the entire blob for all of the changes. This is not necessarily a bad thing. Using your format ("100000,90000,..."), that's roughly 7 bytes per second (six digits plus a comma), so 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play of every second.

And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).

And there's a lot of parsing overhead marshalling that data in and out.

On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned, allowing podcasts up to 18 hrs long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3 hr song is 3600 * 3 * 10 = ~108K bytes per song.

If you stored it as a binary blob rather than text (a block of 4-byte longs), it's 4 * 3600 * 3 = ~43K.

So, the second/row structure is "only" twice the size (in a perfect world, consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.

The only downside of second-per-row is if you need to do a lot of updates (several seconds at once for one song); that's a lot of UPDATE traffic to the DB, whereas with the blob method it's likely a single update.
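Since you say the data arrives in 15-minute batches, one way to cut that UPDATE traffic with the row approach is a multi-row upsert; MySQL syntax shown as one possibility, with placeholder table/column names and values, assuming a primary key on (podcast_id, second):

insert into podcast_plays (podcast_id, second, plays)
values (123, 41, 17), (123, 42, 3), (123, 43, 9)
on duplicate key update plays = plays + values(plays);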

Your traffic patterns will influence that more than anything.

Will Hartung
Thanks for this pros/cons list – very helpful! I'll be able to receive the tracking data in batches at 15 min intervals (= easily 1,000 updates), so that's definitely a plus for the blob approach. Also, I only need to use this data for the timeline graph, so being able to query the data is unimportant. That being said, I do find the flexibility of separate rows appealing (and looking at the answers, a lot of people seem to feel the same way). It does seem like the blob approach is feasible though, so I'll do some testing on both approaches and see which one works best in practice.
Jamie Appleseed