I have a very large XML file, 2.8 GB. It's a dump of Polish Wikipedia's articles. The size of this file is very problematic for me. The task is to search this file for a large amount of data. All I have are the titles of the articles. My first thought was to sort those titles and make one linear pass through the file. The idea is not so bad, but the articles are not sorted alphabetically; they are sorted by ID, which I don't know a priori.

So, my second thought was to build an index of the file: to store, in another file (or database), lines in the following format: title;id;index (maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the index of the tag I needed, I could use a simple Seek to move the cursor within the file without reading all the content, etc. For smaller files I think this could work fine. But on my computer (laptop, C2D processor, Win7, VS2008) I get an error that the application is not responding.

In my program, I read each line from the file and check whether it contains a tag that I need. I also count all the bytes I have read and save lines in the format mentioned above. So while indexing, the program hangs. But up to that point the resulting index file is 36.2 MB, and the last recorded index is around 2,872,765,202 B, while the whole XML file is 3,085,439,630 B long.

My third thought was to split the file into smaller pieces; to be precise, into 26 pieces (there are 26 letters in the Latin alphabet), each containing only entries starting with the same letter, e.g. a.xml would hold all entries whose titles start with the letter "A". The final files would be tens of MB, around 200 MB at most, I think. But there's the same problem of reading the whole file.

To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. It's obvious that I can't load the whole file into memory: I have only 3 GB of RAM installed, and Win7 takes around 800 MB-1 GB when fully loaded.

So I'm asking for help. What is the best thing to do? The point is that searching this XML file has to be fast; faster than downloading single Wikipedia pages in HTML format. I'm not even sure whether that is possible.

Maybe I should load all the needed content into a database? Maybe that would be faster? But I will still need to read the whole file at least once.

I'm not sure whether there are limits on question length, but I will also include a sample of my indexing source code here.

while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; //+2 to cover the \r\n characters not included in the line
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }

    if (currentArea == Area.nothing)    //nothing interesting at the moment
    {
        //search for the position of the <title> tag
        position = MoveAfter("<title>", line, position);    //searches until it finds the <title> tag
        if (position >= 0) currentArea = Area.title;
        else continue;
    }

    // (...) handling of the title and id areas omitted

    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            //rewind to the start of the current line: subtract its byte count
            //(not line.Length, which counts characters) plus the \r\n
            long index = fileIndex - (enc.GetByteCount(line) + 2);
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}

reader.Close();
reader.Dispose();
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}

Best Regards and Thanks in advance,

ventus

Edit: Link to this Wikipedia dump: http://download.wikimedia.org/plwiki/20100629/

+6  A: 

Well... if you're going to search it, I would highly recommend you find a better way than dealing with the file itself. As you mention, I suggest putting it into a well-normalized and indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.

Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be a fairly efficient way to process it, but I haven't tried it with a large file and so can't comment.

May I ask: where did this huge XML file come from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.

Randolpho
+1, though he did mention the source (right?): "It's a dump of Polish Wikipedia's articles."
Jeff Sternal
@Jeff: ahh, missed that.
Randolpho
XmlTextReader is just an extension of the XmlReader class, so the speed is the same. LINQ to XML is slower than Xml(Text)Reader, that's for sure.
Ventus
@Ventus, well, LINQ to XML can be used to stream, http://lennilobel.wordpress.com/2009/09/02/streaming-into-linq-to-xml-using-c-custom-iterators-and-xmlreader/ , which may be what Randolpho meant.
Richard Morgan
A: 

XmlReader will be fast, but you need to verify whether it is fast enough for your scenario. Suppose we are looking for a value located in a node called Item:

using System.Xml;

using (var reader = XmlReader.Create("data.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
        {
            // note: ReadElementContentAsString advances the reader
            // past the element's end tag
            string value = reader.ReadElementContentAsString();
            if (value == "ValueToFind")
            {
                // value found
                break;
            }
        }
    }
}
Darin Dimitrov
A: 

I would do this:

1) Break the XML into smaller files. For example, if the XML looks like the sample below, then I would create one file per article node, with a name that matches the title attribute. If the title isn't unique, I would just number the files (see the sketch after this list).

<root>
    <article title="aaa"> ... </article>
    <article title="bbb"> ... </article>
    <article title="ccc"> ... </article>
</root>

Since that is a lot of files, I would break them into subdirectories, with each holding no more than 1,000 files.

2) Create an index table with the file names and the columns you want to search on.

3) As an option, you could store the XML fragments in the database instead of on the hard drive. SQL Server's varchar(MAX) type is good for this.
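
A minimal sketch of the splitter in step 1, assuming the <root>/<article> layout above; the input file name and output directory are placeholders:

using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

class Splitter
{
    static void Main()
    {
        int count = 0;
        using (var reader = XmlReader.Create("articles.xml"))
        {
            reader.MoveToContent();   // position on <root>
            reader.Read();            // step inside to its first child
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "article")
                {
                    // materialize just this one article, never the whole document
                    var article = (XElement)XNode.ReadFrom(reader);

                    // bucket into subdirectories of at most 1,000 files each,
                    // numbering the files to sidestep duplicate or invalid titles
                    string dir = Path.Combine("articles", (count / 1000).ToString());
                    Directory.CreateDirectory(dir);
                    article.Save(Path.Combine(dir, count + ".xml"));
                    count++;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

XNode.ReadFrom consumes one article element at a time, so memory use stays flat no matter how big the input is.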

Jonathan Allen
Why store the raw XML? He should just parse the XML and store the data in a database.
Byron Whitlock
Even if the XML is clean enough to store as relational tables, and it probably isn't, you should only store the data you actually intend to search on. Anything more and you are just wasting space and potentially causing performance problems.
Jonathan Allen
The problem is that I then need to store a very large amount of data. Everybody knows that a Wiki dump will have a lot of text, and I need to retrieve that text. So will storing all those text nodes in a database make searching faster?
Ventus
Are you searching on the whole article or just titles? If just the titles, then the correct database index will make that fast. If the whole article, then you should look at something like SQL Server's Full Text Search. Even SQL Server Express includes this, but I don't know how to use it.
Jonathan Allen
I need to search titles, and then for a particular title get the page content. That's all I want to achieve.
Ventus
In that case your database table would have two columns, `Title` and `Filename`. The filename would point to the small XML file I talked about in step 1. Use the database to find the right file(s) then load each from the hard drive.
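
A minimal sketch of that lookup, assuming an SQLite index database through the System.Data.SQLite ADO.NET provider; the table, column, and file names are hypothetical:

using System;
using System.Data.SQLite; // third-party ADO.NET provider for SQLite
using System.IO;

class TitleLookup
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=index.db"))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "SELECT Filename FROM Articles WHERE Title = @title", conn))
            {
                cmd.Parameters.AddWithValue("@title", "Brzeg");
                var filename = cmd.ExecuteScalar() as string;
                if (filename != null)
                    Console.WriteLine(File.ReadAllText(filename)); // load the small XML file
            }
        }
    }
}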
Jonathan Allen
+5  A: 

Well, if it fits your requirements, I would first import this XML into an RDBMS like SQL Server and then query against that SQL Server.

With the right indexes (full-text indexes if you want to search through a lot of text), it should be pretty fast...

It would reduce a lot of the overhead coming from the parsing of the XML file by the libraries...
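
A minimal sketch of such a query, assuming the dump has already been imported into an Articles table with a full-text index on its Text column; the connection string and names are assumptions:

using System;
using System.Data.SqlClient;

class FullTextSearch
{
    static void Main()
    {
        using (var conn = new SqlConnection(
            "Server=.;Database=Wiki;Integrated Security=true"))
        {
            conn.Open();
            // CONTAINS requires a full-text index on the column being searched
            using (var cmd = new SqlCommand(
                "SELECT Title FROM Articles WHERE CONTAINS([Text], @term)", conn))
            {
                cmd.Parameters.AddWithValue("@term", "\"Brzeg\"");
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}

The CONTAINS predicate uses the full-text index instead of scanning every row, which is what makes the search fast.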

Kharlos Dominguez
Thanks, but I'm not sure if I will have access to any storage engine at all.
Ventus
@Ventus This is the only sane way to go. Take a look at SQLite, Derby, or some other embedded DB solution.
Byron Whitlock
Oh, right, I forgot about SQLite... :)
Ventus
There is also SQL Server Compact 3.5, which is a sort of file-based SQL Server that only needs the .NET 3.5 framework or above to run... I have never used it, so I'm not sure how it fares speed-wise. SQLite has the reputation of being pretty fast, and there are ADO.NET classes for it around...
Kharlos Dominguez
Finally, I exported all the needed content into an SQLite database. But the topic is still about working with huge XML; after all, a DB is not XML (unless you use an XML database). +1 for you, Kharlos :)
Ventus
A: 

I like the idea of creating an index - you get to keep your code super simple and you don't need any horrible dependencies like databases :)

So - Create an index where you store the following

[content to search]:[byte offset to the start of the xml node that contains the content]

To capture the byte offset, you'll need to create your own stream over the input file and create a reader from that; you'll query the position on every reader.Read(..). An example index record would be:

"Now is the winter of our discontent":554353

This means that the entry in the XML file that contains "Now is the winter of our discontent" is at the node at byte position 554,353. Note: I'd be tempted to encode the search portion of the index so that it doesn't collide with the separators you use.

To read the index, you scan through the index on disk (i.e. don't bother loading it into memory) looking for the appropriate record. Once found, you'll have the byte offset. Now create a new Stream over the .xml file, set its position to the byte offset, create a new reader, and read the document from that point.
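
A minimal sketch of that scan-then-seek, using the title;id;index line format from the question; the file names and the search title are placeholders:

using System;
using System.IO;
using System.Text;

class SeekLookup
{
    // scan the index file on disk for the title; returns null if not found
    static long? FindOffset(string indexPath, string title)
    {
        using (var index = new StreamReader(indexPath))
        {
            string line;
            while ((line = index.ReadLine()) != null)
            {
                var parts = line.Split(';');
                if (parts.Length == 3 && parts[0] == title)
                    return long.Parse(parts[2]);
            }
        }
        return null;
    }

    static void Main()
    {
        long? offset = FindOffset("index.txt", "Brzeg");
        if (offset == null) return;

        using (var stream = File.OpenRead("plwiki.xml"))
        {
            stream.Seek(offset.Value, SeekOrigin.Begin); // jump straight to the node
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                    if (line.Contains("</page>")) break; // stop at the end of this page
                }
            }
        }
    }
}

Seeking works here because each stored offset points at the start of a line, so the StreamReader begins decoding on a character boundary.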

headsling
About the index: in my question I wrote that I wanted to make one. The problem was that the file is so huge that the program hung up. But I think I will try the `Read()` method; reading fixed-size blocks of data may be faster than `ReadLine()`. Worth checking :)
Ventus
The general idea with my proposal is that you'll only be reading a tiny fraction of both the index and the data file at any time. The XmlReader will read a file of any size while using a very small amount of memory. Where are you hanging?
headsling
A: 

Dump it into a Solr index and use that to search it. You can run Solr as a standalone search engine, and a simple bit of scripting can loop over the file and dump every article into the index. Solr then gives you full-text search over whichever fields you decided to index...
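
A minimal sketch of the indexing half, assuming a local Solr instance whose schema defines id, title, and text fields; the URL and field names are assumptions:

using System.Net;
using System.Security;

class SolrIndexer
{
    static void AddDocument(string id, string title, string text)
    {
        // Solr's XML update message: one <doc> per article
        string doc =
            "<add><doc>" +
            "<field name=\"id\">" + SecurityElement.Escape(id) + "</field>" +
            "<field name=\"title\">" + SecurityElement.Escape(title) + "</field>" +
            "<field name=\"text\">" + SecurityElement.Escape(text) + "</field>" +
            "</doc></add>";
        Post(doc);
    }

    static void Commit()
    {
        Post("<commit/>"); // make the newly added documents searchable
    }

    static void Post(string xml)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "text/xml";
            client.UploadString("http://localhost:8983/solr/update", xml);
        }
    }
}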

Graham
A: 

The only way you're going to be able to search through this quickly is to store it in a database, as others have suggested. If a database is not an option, then it's going to take a long time, no doubt about it. What I would do is create a multithreaded application. Create worker threads that read in the data and stick it in a string queue. Have, say, 5 threads doing this, segmented through the whole file (so one thread starts at the beginning, the second thread starts 1/5 of the way into the file, the third thread starts 2/5 of the way in, and so on). Meanwhile, have another thread that reads the string queue and searches for whatever it is you're looking for, dequeuing as it goes. This will take a while, but it shouldn't crash or eat up tons of memory.

If you find it is eating a lot of memory, then set a limit on the number of items the queue can hold, and have the reader threads suspend until the queue size is below this threshold.
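
A minimal sketch of that bounded queue, simplified to a single reader thread and assuming .NET 4's BlockingCollection; the file name and search term are placeholders:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class QueuedSearch
{
    static void Main()
    {
        // a bounded capacity suspends the reader whenever the queue is full
        var queue = new BlockingCollection<string>(10000);

        var producer = Task.Factory.StartNew(() =>
        {
            using (var reader = new StreamReader("plwiki.xml"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    queue.Add(line); // blocks while the queue is at capacity
            }
            queue.CompleteAdding(); // tell the consumer no more lines are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            foreach (var line in queue.GetConsumingEnumerable())
                if (line.Contains("Brzeg"))
                    Console.WriteLine(line);
        });

        Task.WaitAll(producer, consumer);
    }
}

(Segmenting the file across five readers is trickier than it sounds for a line-based format, since a segment boundary can fall in the middle of a line.)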

icemanind
@icemanind Multi-threaded XML reading is going to be non-trivial. Perhaps it's best to keep the XML shredding single-threaded and look at multi-threading the rest.
chibacity
@chibacity It is not that hard. Look here: http://www.c-sharpcorner.com/UploadFile/jbailo/MultithreadedXmlDoc11212005065943AM/MultithreadedXmlDoc.aspx or here: http://www.c-sharpcorner.com/UploadFile/mmehta/LoadingXmlInTreeView11172005011544AM/LoadingXmlInTreeView.aspx
icemanind
@icemanind No offence, but I don't think you read that article properly in the context of the question.
chibacity
@chibacity You may be right. Maybe a single thread is the way to go. I just didn't want to see his interface lock up again.
icemanind
@icemanind Anyway, indexing the data is the only sane way to go. It's quite a lot of data...
chibacity
+1  A: 

Hi there,

You could store the file in CouchDB. I wrote a Python script to do it:

import couchdb
from lxml import etree

couch = couchdb.Server()
db = couch["wiki"]

infile = '/Users/johndotnet/Downloads/plwiki-20100629-pages-articles.xml'

# iterate over complete <page> elements without loading the whole file
context = etree.iterparse(source=infile, events=("end",),
                          tag='{http://www.mediawiki.org/xml/export-0.4/}page')

for event, elem in context:
    couchEle = {}
    for ele in elem.getchildren():
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}id":
            couchEle['id'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}title":
            couchEle['title'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}revision":
            for subEle in ele.getchildren():
                if subEle.tag == "{http://www.mediawiki.org/xml/export-0.4/}text":
                    couchEle['text'] = subEle.text

    db[couchEle['title']] = couchEle
    elem.clear()  # free the parsed element, or iterparse will keep the whole tree in memory

This should import all the article with id, title and text into couchDB.

Now you can run a query like this:

code = '''
    function(doc) {
        if (doc.title.indexOf("Brzeg") > -1) {
            emit(doc._id, doc);
        }
    }
    '''
results = db.query(code)

Hope it helps!

+1  A: 

I just did my thesis on using Wikipedia to create training data for named-entity recognition. I have some software that could probably help you out. I was working with the English and Dutch dumps, which are 26 GB and 5 GB respectively. So I created an XML file splitter (it creates a bunch of smaller files, split on Wikipedia pages) and a parser which can optionally rip out a lot of the useless formatting; it also stores the title, text, id, categories, etc. in a database. I haven't put it online yet, but I could do that if you'd be interested in using it.

Jessica
One thing - my software is in Java...
Jessica
Nice! I would appreciate it if you sent me your software :) It doesn't really matter that it's a Java application, since I only need to parse the XML file. Then I could use it in my C# software (which, by the way, is also my thesis). Please contact me via e-mail: k.wegrzynowicz [at] [remove-this-tag] gmail.com
Ventus