tags:

views:

96

answers:

3

For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?

1) Single XML files for each type of features, on the user's directory: basicinfo.xml, comments.xml, photos.xml, ...

2) MySQL, although not sure how to organize on this. Maybe separated databases for each feature? E.g. a database for Comments, with a table for each user, where columns are id,from,message,time?

I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store in memory before it is used.

But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):

1) If I have user accounts' paths organized in this way

User ID 2342:
/users/00/00/00/00/00/00/00/23/42/

I think it's faster to find the Comments of a user by file path than seeking in a large database (note that user IDs do not increment regularly, so they can't be used as an index of MySQL).
Also, if each feature is splitted in databases, each user profile will seek more than once, to display comments, photos, basic info, etc.

2) I heard MySQL is globaly locked when writing on it. Is this true? If yes, I rather to lock a single file than everything.

3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Unix)
It is ok that it is about the same by using XML files, but it is easier to split between disks, because structure is splitted by account IDs, not by feature as it would be in a database.

4) Note that I don't store each comment on the comments.xml. I just note their attributes in each XML tag, and the messages are in separated text files commentid.txt. Once each XML should not be much large, there should not be problems with memory/time.

As for the problem of parsing entire XML, maybe I should think on using XMLReader/Writer instead of SimpleXML/DOM? Or, will it decrease performance allot?

Thank you!

+4  A: 

Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...

For a very large site, in the worst case scenario the data files end up pretty big too. If it is any kind of community site this may easily happen for any account go to any forum with a true number of old-guard members in its community and you'll find a couple of posters that have say 10K posts... This means you will wish for SQL style result sets which are implemented using a memory efficient model, rather than a speed efficient one. To the end user 1s versus 1.1s response time is not that much of a deal; but to you 1K of simultaneous requests versus 1.5K or better definitely is.

Then there is the aspect that if you are mostly reading data XML may be fine if somewhat crude for large data sets and DOM based implementations. But if you are writing a lot, things become much much worse. Caching of data is still possible, but giving ACID like guarantees on these file transactions requires you to pretty much write your own database software.

And then there is storage requirements and such like which mean you may need a distributed approach for storing your data. These kind of setups are relatively well understood in the database world, and they bring a lot of interesting problems with them to the table (like what do you do if a single disk fails?, how do you know on what disk to find the data and how do you implement efficient caching?) that essentially amount to again writing your own mini-database software from scratch.

So for a very large site I think the hard technical requirements of performance at not too great a cost in terms of memory and also a certain reliability and not needing to reinvent 21 wheels at the same time means that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, where you can easily make changes and roll them out across the entire site.

+3  A: 

IME: An in-house application using a single XML file for persistence didn't stand up to use by a single user...

1) What you're suggesting is that an XML file system with a manager application... There are XML databases, and XML there's been increasing support for storing XML within RDBMS. You're looking at re-inventing the wheel...

That's besides the normalization that would come out of storing the data in a RDBMS, which would enforce referential integrity that XML will never do...

2) "Global locking" is without any contextual scope. No database I know of locks globally when writing; most support degrees of locking (table/row/etc, varies between vendors) for sake of retaining concurrency when directed to - not by default.

3) Without a database, data or actual users--being concerned about clustering is definitely premature optimization.

4) If the system crashes without having written the referential integrity to some sort of persistence that will survive the application being turned off, the data will be useless.

OMG Ponies
+2  A: 

Facebook uses MySQL.

That being said. Here's the long version:

I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be use a relational datastore. XML was first introduced in order to provide a standard way of transmitting data from system to system w/o giving access to the originating systems.

Since you are talking about a large application, I would strongly urge you to use MySQL (or other RDBMS), as your dataset grows and grows the XML will be increasingly slower and slower unless you always keep a fresh copy in memory and only read the XML files upon service reboot.

Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.

shamittomar
Not anymore I believe: http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf [PDF].
Daniel Vassallo
@Daniel, they use Casandra in combination with MySQL: http://www.facebook.com/note.php?note_id=24413138919 Anyways, NOT using XML.
shamittomar
@shamittomar: Yes definitely :)
Daniel Vassallo