Hi all. Basically I have an XML feed from an offsite server.

The XML feed takes one parameter, ?value=n, where n can only be between 1 and 30.

Whatever value I pick, there will always be 4000 rows returned from the XML file. My script will call this feed 30 times, once for each value, once a day. So that's 120,000 rows. I will be doing quite complicated queries on these rows, but the main thing is I will always filter by value first, so SELECT * ... WHERE value = 'N', etc. That filter will ALWAYS be used.

Now, is it better to have one table where all 120k rows are stored, or 30 tables where 4k rows are stored each?

EDIT: the SQL database in question will be MySQL

EDIT: Just to make it a bit clearer: the data will be UPDATED every day, so the old tables will be overwritten. I do not want any archive solutions, just the best way to store the data with as few performance bottlenecks as possible. The database results, once output, will be cached and also updated daily.

EDIT: I guess I was being too vague for my own good :( Basically the feeds are leaderboards; each value is a different leaderboard location.

The values will only be updated if a leaderboard position changes, and there will always be ONLY 120k rows. No more, no less.

Let's say:

  1. Blue
  2. Green
  3. Red

That's the current leaderboard, and on the next update the feed returns:

  1. Blue
  2. Red
  3. Green

Only rows 2 and 3 will change. That is my plan anyway :)
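
Something like this is what I have in mind for the update step, e.g. (untested; the table and column names are just placeholders):

    -- rewrite only the rows whose position changed;
    -- assumes a unique key on (value, position)
    INSERT INTO leaderboard (`value`, position, name)
    VALUES (1, 2, 'Red'),
           (1, 3, 'Green')
    ON DUPLICATE KEY UPDATE name = VALUES(name);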

ANOTHER EDIT >.<: The rows will have at most 12 columns and come to less than 1 KB per row. And the update will only happen ONCE a day, because the server the feeds come from is slow and it takes my server 80 minutes to pull all the feed values from it.

+3  A: 

In terms of storage, there is little difference between one 120k-row table and 30 tables of 4k rows each.

In terms of maintenance, I would always go with one table. It makes your code and SQL much easier to work with, and since you are already filtering with a WHERE clause, I don't see any reason to split the table.

Oded
I'm just wondering if 30 tables would be better for performance reasons? I don't work with databases often and I don't want this to be an unnecessary bottleneck.
Ozzy
It is not a problem for that amount of data. Get into the tens of millions of rows, and then you can start talking about performance issues ;)
Oded
A: 

As long as you index properly, one table will be faster. You'll definitely need an index on your value column (which you should call something else, as 'value' is a reserved word in SQL).

At the volume you're considering, storage should not be an issue. If you're doing this long term, you may want to investigate archiving solutions for old data.
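
For illustration, a minimal sketch of such a schema (all names here are placeholders, with 'n' standing in for the renamed value column):

    CREATE TABLE leaderboard (
        n        TINYINT UNSIGNED NOT NULL,   -- the feed parameter, 1-30
        position SMALLINT UNSIGNED NOT NULL,  -- rank within that board
        name     VARCHAR(255) NOT NULL,
        -- ...the rest of the (at most 12) columns...
        PRIMARY KEY (n, position)  -- n leftmost, so WHERE n = ... is indexed
    ) ENGINE=InnoDB;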

dnagirl
Is indexing really worth it when I will be updating the 120k rows every day? Unless the rows don't change*
Ozzy
@Ozzy: if you do queries with conditions, you need to have indices on the fields in the conditions. Make a minimal dataset (1 day's data). Run EXPLAIN on your query. Then add appropriate indices and run EXPLAIN again. If your indices are good, the query speed will be vastly improved. If you're concerned that you'll be replacing your data everyday and think that inserts will be slowed down by indices, I don't think the scale of your data really merits that concern.
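
For example, against the hypothetical leaderboard table sketched above:

    EXPLAIN SELECT * FROM leaderboard WHERE n = 5;
    -- good plan: key = PRIMARY, type = ref, rows ~ 4000
    -- bad plan:  key = NULL, type = ALL (a full scan of all 120k rows)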
dnagirl
A: 

A single table is my preferred choice.

I understand it will not contain data for just a single import, but that's where a WHERE clause comes in.

Queries may not initially come back as fast as you'd like, but you can solve that with proper indexing.

More importantly, what will you do if for some reason you go to 45 times a day, or 90 times a day, or once every 5 minutes (12 * 24 = 288 times a day)? Creating 288 tables and changing all the queries that tie to those tables would be a huge exercise.

Raj More
A: 

You want one table. Otherwise, you'd have to write 30 different queries, or build a dynamic query solution (yuck).
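
That is, with one table the board number is just a query parameter (hypothetical names again):

    -- one table: a single prepared query, n supplied at runtime
    SELECT * FROM leaderboard WHERE n = ?;
    -- 30 tables: the right one of leaderboard_1 ... leaderboard_30
    -- has to be chosen by splicing the table name into the SQL string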

How wide are the rows? More to the point, how many rows fit on an 8k SQL page? (You can guesstimate your disk I/O based on that.) How long does it take your hardware to read that much data? Or can it all fit in memory, so that you're not hitting the disk all that often? My point being, do you actually have a performance problem?
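
Back-of-envelope, with the numbers from the question: at under 1 KB per row, roughly 8 rows fit on an 8K page, so 120k rows come to about 15,000 pages, or at most ~120 MB in total; a working set that small will generally stay cached in memory after the first read.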

Putting a compound clustered index on the table such that your "n" value is the first column would optimize these reads (but only if you always have an "n" value in a WHERE clause). Alternatively, if "n" always ranges between the fixed values of 1 and 30 and you are using SQL 2005 and up, you could implement table partitioning, which would give you the same performance boost, and possibly a bit more flexibility when it comes to loading or unloading data.

Philip Kelley
The "MySQL" tag got added after I wrote this. This still holds true, though I don't know if MySql has table partitioning-type structures.
Philip Kelley
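
For what it's worth, MySQL has supported table partitioning since 5.1. A rough sketch for the hypothetical leaderboard table from above, whose primary key already starts with the partition column:

    -- one partition per board; every unique key must include n
    ALTER TABLE leaderboard
        PARTITION BY HASH (n) PARTITIONS 30;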
A: 

As all the others have said, go with one table. There will not be any performance bottleneck on the database side because of this one table, unless your database is already badly set up, in which case this will reveal the situation, not cause it. If you do a performance analysis detailed enough to cover all the components in the flow (from when the user starts the request to when the results are returned), you will see that in your example the database component will not add any significant performance hit. And as the other answers have pointed out, you must define the right index or indexes based on your specific queries.

JorgeLarre
A: 

As Oded said, there's no real scaling/performance issue at 120k rows, so I'd go for a single table too (to keep things simple).

If in the future you need to scale a lot, just keep in mind this article on "why SQL databases don't scale". Among other things, the article explains why "partitioning" (or "sharding") is bad for an SQL database:

Sharding divides your data along some kind of application-specific boundary. For example, you might store users whose names start with A-M on one database, and N-Z on another. Or use a modulo of the user id by the number of databases.

This requires deep integration into the application and careful planning of the partitioning scheme relative to the database schema and the kinds of queries you want to do. Summary: big pain in the ass.

So while sharding is a form of horizontal scaling, it fails point #2: it is not transparent to the business logic of the application.

The deeper problem with sharding is that SQL databases are relational databases, and most of the value in a relational database is that it stores relationships. Once you split records across multiple servers, you're severing many of those relations; they now have to be reconstructed on the client side. Sharding kills most of the value of a relational database.

Even if this originally referred to partitioning across multiple databases, the same concept applies to your case, in which you're trying to achieve some sort of "internal" partitioning.

Conclusion: for real scaling, the answer is NoSQL. Again, not with 120k rows :)

Roberto Aloi