Hello,

My application creates pieces of data that, in xml, would look like this:

<resource url="someurl">
   <term>
      <name>somename</name>
      <frequency>somenumber</frequency>
   </term>    
   ...
   ...
   ...
</resource>

This is how I'm storing these "resources" now: one resource per XML file, with as many "term" elements per resource as needed. The problem is that I'll need to generate about 2 million of these resources. I've generated almost 500,000 so far, and my Mac isn't very happy about it. So my question is: how should I store this data?

  • A database? That would be hard, because the structure of the data isn't fixed...
  • Maybe merge some resources into larger XML files?
  • ...?

I don't need to change the data once it's created. Right now I'm accessing a specific resource by the name of that resource's file.

Any suggestions are greatly appreciated!

A: 

Not all databases are relational. Have a look at MongoDB, for example. It stores your data as JSON-like objects, similar to your resources.

An example using the shell:

$ mongo
> db.resources.save({url: "someurl", 
                     terms: [{name: "name1", frequency: 17.0},
                             {name: "name2", frequency: 42.0}]})
> db.resources.find()
{"_id" :  ObjectId( "4b00884b3a77b8b2fa3a8f77"), 
 "url" : "someurl" , 
 "terms" : [{"name" : "name1" , "frequency" : 17},
            {"name" : "name2" , "frequency" : 42}]}
Serbaut
OK, I'm going to give MongoDB or CouchDB a try. I'm guessing these can handle large datasets fairly well?
pns
Also, can anyone confirm that I won't have any problems moving the datasets across different operating systems?
pns
Without knowing the details, I think MongoDB should handle your case well. You can access MongoDB via the API from any supported platform, and I think you can just copy the data files if you want to move the database to another platform.
Serbaut
A: 

You should definitely put several resources in each XML file, but only if you expect to need all the resources together at the same time. If you only need to send a handful of resources to someone, keep generating the individual XML files.

Even in that situation, you could keep the large XML file and generate the smaller ones on demand from the original dataset.
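
As a rough sketch of pulling one small resource out of a merged file on demand, something like the following could work in Python. The `<resources>` wrapper element, the sample URLs, and the `extract_resource` helper are assumptions made up for illustration; only the `<resource>`/`<term>` layout comes from the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical merged file: a <resources> root wrapping many <resource> elements.
MERGED = """<resources>
  <resource url="url-a">
    <term><name>alpha</name><frequency>3</frequency></term>
  </resource>
  <resource url="url-b">
    <term><name>beta</name><frequency>7</frequency></term>
    <term><name>gamma</name><frequency>1</frequency></term>
  </resource>
</resources>"""


def extract_resource(source, url):
    """Stream through the merged XML and return one <resource> as an XML string.

    Returns None if no resource with that url attribute is found.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "resource":
            if elem.get("url") == url:
                return ET.tostring(elem, encoding="unicode")
            # Release the contents of resources we skip over.
            elem.clear()
    return None


# A real file path or open file object could be passed instead of BytesIO.
small_xml = extract_resource(io.BytesIO(MERGED.encode("utf-8")), "url-b")
```

Note that this scans the file from the start on every lookup, so with millions of resources it would probably only be practical if the data is split across several merged files or lookups are rare.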

Using a database like SQLite3 would allow you to have faster seek times and easier manipulation of the data, using SQL syntax.
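
A minimal sketch of what that might look like with Python's built-in `sqlite3` module is shown here. The table names, column names, and the `add_resource`/`get_terms` helpers are my own assumptions, not anything from the question. A variable number of terms per resource fits a relational layout if the terms go in their own table.

```python
import sqlite3

# In-memory database for the sketch; a file path would be used for real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE term (
        resource_id INTEGER NOT NULL REFERENCES resource(id),
        name        TEXT NOT NULL,
        frequency   REAL NOT NULL
    );
    CREATE INDEX term_by_resource ON term(resource_id);
""")


def add_resource(conn, url, terms):
    """Insert one resource and its (name, frequency) pairs in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO resource (url) VALUES (?)", (url,))
        conn.executemany(
            "INSERT INTO term (resource_id, name, frequency) VALUES (?, ?, ?)",
            [(cur.lastrowid, name, freq) for name, freq in terms],
        )


def get_terms(conn, url):
    """Return the (name, frequency) rows for the resource with this url."""
    rows = conn.execute(
        "SELECT t.name, t.frequency FROM term t "
        "JOIN resource r ON r.id = t.resource_id "
        "WHERE r.url = ? ORDER BY t.rowid",
        (url,),
    )
    return rows.fetchall()


add_resource(conn, "someurl", [("name1", 17.0), ("name2", 42.0)])
terms = get_terms(conn, "someurl")
```

The `UNIQUE` constraint on `url` gives an index for looking a resource up by its identifier, which roughly replaces the current lookup by file name.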

voyager
+2  A: 

If you can't predict how your data is going to be organized, maybe http://couchdb.apache.org/ could be interesting for you. It is a schema-less database.

Anyway, XML is maybe not the best choice for large amounts of data.

Maybe trying JSON or YAML would work out better? They need less space and are easier to parse. (However, I have no experience using those formats on a larger scale, so maybe I'm wrong.)
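
To get a feel for the size difference, here is a small Python sketch that converts one resource from the question's XML layout into compact JSON and measures both. The sample values and the `resource_to_dict` helper are assumptions for illustration, and it treats `frequency` as a number, which the question does not actually specify.

```python
import json
import xml.etree.ElementTree as ET

# Sample resource in the layout from the question (values are made up).
XML_DOC = (
    '<resource url="someurl">'
    "<term><name>name1</name><frequency>17</frequency></term>"
    "<term><name>name2</name><frequency>42</frequency></term>"
    "</resource>"
)


def resource_to_dict(xml_text):
    """Convert one <resource> element into a plain dict ready for json.dumps."""
    root = ET.fromstring(xml_text)
    return {
        "url": root.get("url"),
        "terms": [
            {
                "name": term.findtext("name"),
                # Assumption: frequency is numeric.
                "frequency": float(term.findtext("frequency")),
            }
            for term in root.findall("term")
        ],
    }


record = resource_to_dict(XML_DOC)
# Compact separators drop the optional whitespace from the JSON output.
json_text = json.dumps(record, separators=(",", ":"))

xml_size = len(XML_DOC.encode("utf-8"))
json_size = len(json_text.encode("utf-8"))
```

Comparing `xml_size` and `json_size` on a sample of real resources would show how much is actually saved for this data; the closing tags are most of what JSON avoids here.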

Tristram Gräbener