I have a very large XML file, 2.8 GB. It's a dump of Polish Wikipedia's articles. The size of this file is very problematic for me. The task is to search this file for a large amount of data. All I have are the titles of the articles. My first thought was to sort those titles and make one linear pass through the file. The idea is not so bad, but the articles are not sorted alphabetically; they are sorted by ID, which I don't know a priori.

So, my second thought was to build an index of the file: to store, in another file (or database), lines in the following format: title;id;index (maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the index of the tag I needed, I could use a simple Seek to move the cursor within the file without reading all the content, etc. For smaller files I think this could work fine. But on my computer (laptop, C2D processor, Win7, VS2008) I get an error that the application is not responding.

In my program, I read each line from the file and check whether it contains a tag that I need. I also count all the bytes I have read and save lines in the format mentioned above. So while indexing, the program hangs. But up to that point the resulting index file is 36.2 MB, and the last recorded index is around 2,872,765,202 B, while the whole XML file is 3,085,439,630 B long.

My third thought was to split the file into smaller pieces; to be precise, into 26 pieces (there are 26 letters in the Latin alphabet), each containing only entries starting with the same letter, e.g. a.xml would hold all entries whose titles start with the letter "A". The final files would be tens of MB, around 200 MB at most, I think. But there's the same problem of reading the whole file.

To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. It's obvious that I can't load the whole file into memory: I have only 3 GB of RAM installed, and Win7 takes around 800 MB-1 GB when fully loaded.

So I'm asking for help. What is the best thing to do? The point is that searching this XML file has to be fast; faster than downloading single Wikipedia pages in HTML format. I'm not even sure whether that is possible.

Maybe I should load all the needed content into a database? Maybe that would be faster? But I will still need to read the whole file at least once.

I'm not sure whether there are limits on question length, but I will also include a sample of my indexing source code here.

while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; //+2 to cover the \r\n characters not included in the line
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }

    if (currentArea == Area.nothing)    //nothing interesting at the moment
    {
        //search for the position of the <title> tag
        position = MoveAfter("<title>", line, position);    //searches until it finds the <title> tag
        if (position >= 0) currentArea = Area.title;
        else continue;
    }

    // (...) handling of the title and id areas omitted

    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            //rewind to the start of the current line: subtract its byte count
            //(not line.Length, which counts characters) plus the \r\n
            long index = fileIndex - (enc.GetByteCount(line) + 2);
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}

reader.Close();
reader.Dispose();
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}

Best Regards and Thanks in advance,

ventus

Edit: Link to this Wikipedia dump: http://download.wikimedia.org/plwiki/20100629/

+6  A: 

Well... if you're going to search it, I would highly recommend you find a better way than dealing with the file itself. As you mention, I suggest putting it into a well-normalized and indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.

Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be a fairly efficient way to process it, but I haven't tried it with a large file and so can't comment.

May I ask: where did this huge XML file come from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.

Randolpho
+1, though he did mention the source (right?): "It's a dump of Polish Wikipedia's articles."
Jeff Sternal
@Jeff: ahh, missed that.
Randolpho
XmlTextReader is just an extension of the XmlReader class, so the speed is the same. LINQ to XML is slower than Xml(Text)Reader, that's for sure.
Ventus
@Ventus, well, LINQ to XML can be used to stream, http://lennilobel.wordpress.com/2009/09/02/streaming-into-linq-to-xml-using-c-custom-iterators-and-xmlreader/ , which may be what Randolpho meant.
Richard Morgan
A: 

XmlReader will be fast, but you need to verify whether it is fast enough for your scenario. Suppose we are looking for a value located in a node called Item:

using System.Xml;

using (var reader = XmlReader.Create("data.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
        {
            // note: ReadElementContentAsString advances the reader
            // past the element's end tag
            string value = reader.ReadElementContentAsString();
            if (value == "ValueToFind")
            {
                // value found
                break;
            }
        }
    }
}
Darin Dimitrov
A: 

I would do this:

1) Break the XML into smaller files. For example, if the XML looks like the sample below, then I would create one file per article node, with a name that matches the title attribute. If the title isn't unique, I would just number the files (see the sketch after this list).

<root>
    <article title="aaa"> ... </article>
    <article title="bbb"> ... </article>
    <article title="ccc"> ... </article>
</root>

Since that is a lot of files, I would break them into subdirectories, with each holding no more than 1,000 files.

2) Create an index table with the file names and the columns you want to search on.

3) As an option, you could store the XML fragments in the database instead of on the hard drive. SQL Server's varchar(MAX) type is good for this.
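
A minimal sketch of the splitter in step 1, assuming the <root>/<article> layout above; the input file name and output directory are placeholders:

using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

class Splitter
{
    static void Main()
    {
        int count = 0;
        using (var reader = XmlReader.Create("articles.xml"))
        {
            reader.MoveToContent();   // position on <root>
            reader.Read();            // step inside to its first child
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "article")
                {
                    // materialize just this one article, never the whole document
                    var article = (XElement)XNode.ReadFrom(reader);

                    // bucket into subdirectories of at most 1,000 files each,
                    // numbering the files to sidestep duplicate or invalid titles
                    string dir = Path.Combine("articles", (count / 1000).ToString());
                    Directory.CreateDirectory(dir);
                    article.Save(Path.Combine(dir, count + ".xml"));
                    count++;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

XNode.ReadFrom consumes one article element at a time, so memory use stays flat no matter how big the input is.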

Jonathan Allen
Why store the raw XML? He should just parse the XML and store the data in a database.
Byron Whitlock
Even if the XML is clean enough to store as relational tables, and it probably isn't, you should only store the data you actually intend to search on. Anything more and you are just wasting space and potentially causing performance problems.
Jonathan Allen
The problem is that I then need to store a very large amount of data. Everybody knows that a Wiki dump will have a lot of text, and I need to retrieve that text. So will storing all those text nodes in a database make searching faster?
Ventus
Are you searching on the whole article or just titles? If just the titles, then the correct database index will make that fast. If the whole article, then you should look at something like SQL Server's Full Text Search. Even SQL Server Express includes this, but I don't know how to use it.
Jonathan Allen
I need to search titles, and then for a particular title get the page content. That's all I want to achieve.
Ventus
In that case your database table would have two columns, `Title` and `Filename`. The filename would point to the small XML file I talked about in step 1. Use the database to find the right file(s) then load each from the hard drive.
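
A minimal sketch of that lookup, assuming an SQLite index database through the System.Data.SQLite ADO.NET provider; the table, column, and file names are hypothetical:

using System;
using System.Data.SQLite; // third-party ADO.NET provider for SQLite
using System.IO;

class TitleLookup
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=index.db"))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "SELECT Filename FROM Articles WHERE Title = @title", conn))
            {
                cmd.Parameters.AddWithValue("@title", "Brzeg");
                var filename = cmd.ExecuteScalar() as string;
                if (filename != null)
                    Console.WriteLine(File.ReadAllText(filename)); // load the small XML file
            }
        }
    }
}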
Jonathan Allen
+5  A: 

Well, if it fits your requirements, I would first import this XML into an RDBMS like SQL Server and then query against that SQL Server.

With the right indexes (full-text indexes if you want to search through a lot of text), it should be pretty fast...

It would reduce a lot of the overhead coming from the parsing of the XML file by the libraries...
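
A minimal sketch of such a query, assuming the dump has already been imported into an Articles table with a full-text index on its Text column; the connection string and names are assumptions:

using System;
using System.Data.SqlClient;

class FullTextSearch
{
    static void Main()
    {
        using (var conn = new SqlConnection(
            "Server=.;Database=Wiki;Integrated Security=true"))
        {
            conn.Open();
            // CONTAINS requires a full-text index on the column being searched
            using (var cmd = new SqlCommand(
                "SELECT Title FROM Articles WHERE CONTAINS([Text], @term)", conn))
            {
                cmd.Parameters.AddWithValue("@term", "\"Brzeg\"");
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}

The CONTAINS predicate uses the full-text index instead of scanning every row, which is what makes the search fast.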

Kharlos Dominguez
Thanks, but I'm not sure if I will have access to any storage engine at all.
Ventus
@Ventus This is the only sane way to go. Take a look at SQLite, Derby, or some other embedded DB solution.
Byron Whitlock
Oh, right, I forgot about SQLite... :)
Ventus
There is also SQL Server Compact 3.5, which is a sort of file-based SQL Server that only needs the .NET 3.5 framework or above to run... I have never used it, so I'm not sure how it fares speed-wise. SQLite has the reputation of being pretty fast, and there are ADO.NET classes for it around...
Kharlos Dominguez
Finally, I exported all the needed content into an SQLite database. But the topic is still about working with huge XML; after all, a DB is not XML (unless you use an XML database). +1 for you, Kharlos :)
Ventus
A: 

I like the idea of creating an index - you get to keep your code super simple and you don't need any horrible dependencies like databases :)

So - Create an index where you store the following

[content to search]:[byte offset to the start of the xml node that contains the content]

To capture the byte offset, you'll need to create your own stream over the input file and create a reader from that; you'll query the position on every reader.Read(..). An example index record would be:

"Now is the winter of our discontent":554353

This means that the entry in the XML file that contains "Now is the winter of our discontent" is at the node at byte position 554,353. Note: I'd be tempted to encode the search portion of the index so that it doesn't collide with the separators you use.

To read the index, you scan through the index on disk (i.e. don't bother loading it into memory) looking for the appropriate record. Once found, you'll have the byte offset. Now create a new Stream over the .xml file, set its position to the byte offset, create a new reader, and read the document from that point.
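
A minimal sketch of that scan-then-seek, using the title;id;index line format from the question; the file names and the search title are placeholders:

using System;
using System.IO;
using System.Text;

class SeekLookup
{
    // scan the index file on disk for the title; returns null if not found
    static long? FindOffset(string indexPath, string title)
    {
        using (var index = new StreamReader(indexPath))
        {
            string line;
            while ((line = index.ReadLine()) != null)
            {
                var parts = line.Split(';');
                if (parts.Length == 3 && parts[0] == title)
                    return long.Parse(parts[2]);
            }
        }
        return null;
    }

    static void Main()
    {
        long? offset = FindOffset("index.txt", "Brzeg");
        if (offset == null) return;

        using (var stream = File.OpenRead("plwiki.xml"))
        {
            stream.Seek(offset.Value, SeekOrigin.Begin); // jump straight to the node
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                    if (line.Contains("</page>")) break; // stop at the end of this page
                }
            }
        }
    }
}

Seeking works here because each stored offset points at the start of a line, so the StreamReader begins decoding on a character boundary.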

headsling
About the index: in my question I wrote that I wanted to make one. The problem was that the file is so huge that the program hung up. But I think I will try the `Read()` method; reading fixed-size blocks of data may be faster than `ReadLine()`. Worth checking :)
Ventus
The general idea with my proposal is that you'll only be reading a tiny fraction of both the index and the data file at any time. The XmlReader will read a file of any size while using a very small amount of memory. Where are you hanging?
headsling
A: 

Dump it into a Solr index and use that to search it. You can run Solr as a standalone search engine, and a simple bit of scripting can loop over the file and dump every article into the index. Solr then gives you full-text search over whichever fields you decided to index...
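
A minimal sketch of the indexing half, assuming a local Solr instance whose schema defines id, title, and text fields; the URL and field names are assumptions:

using System.Net;
using System.Security;

class SolrIndexer
{
    static void AddDocument(string id, string title, string text)
    {
        // Solr's XML update message: one <doc> per article
        string doc =
            "<add><doc>" +
            "<field name=\"id\">" + SecurityElement.Escape(id) + "</field>" +
            "<field name=\"title\">" + SecurityElement.Escape(title) + "</field>" +
            "<field name=\"text\">" + SecurityElement.Escape(text) + "</field>" +
            "</doc></add>";
        Post(doc);
    }

    static void Commit()
    {
        Post("<commit/>"); // make the newly added documents searchable
    }

    static void Post(string xml)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "text/xml";
            client.UploadString("http://localhost:8983/solr/update", xml);
        }
    }
}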

Graham
A: 

The only way you're going to be able to search through this quickly is to store it in a database, as others have suggested. If a database is not an option, then it's going to take a long time, no doubt about it. What I would do is create a multithreaded application. Create worker threads that read in the data and stick it in a string queue. Have, say, 5 threads doing this, segmented through the whole file (so one thread starts at the beginning, the second thread starts 1/5 of the way into the file, the third thread starts 2/5 of the way in, and so on). Meanwhile, have another thread that reads the string queue and searches for whatever it is you're looking for, dequeuing as it goes. This will take a while, but it shouldn't crash or eat up tons of memory.

If you find it is eating a lot of memory, then set a limit on the number of items the queue can hold, and have the reader threads suspend until the queue size is below this threshold.
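
A minimal sketch of that bounded queue, simplified to a single reader thread and assuming .NET 4's BlockingCollection; the file name and search term are placeholders:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class QueuedSearch
{
    static void Main()
    {
        // a bounded capacity suspends the reader whenever the queue is full
        var queue = new BlockingCollection<string>(10000);

        var producer = Task.Factory.StartNew(() =>
        {
            using (var reader = new StreamReader("plwiki.xml"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    queue.Add(line); // blocks while the queue is at capacity
            }
            queue.CompleteAdding(); // tell the consumer no more lines are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            foreach (var line in queue.GetConsumingEnumerable())
                if (line.Contains("Brzeg"))
                    Console.WriteLine(line);
        });

        Task.WaitAll(producer, consumer);
    }
}

(Segmenting the file across five readers is trickier than it sounds for a line-based format, since a segment boundary can fall in the middle of a line.)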

icemanind
@icemanind Multi-threaded XML reading is going to be non-trivial. Perhaps it's best to keep the XML shredding single-threaded and look at multi-threading the rest.
chibacity
@chibacity It is not that hard. Look here: http://www.c-sharpcorner.com/UploadFile/jbailo/MultithreadedXmlDoc11212005065943AM/MultithreadedXmlDoc.aspx or here: http://www.c-sharpcorner.com/UploadFile/mmehta/LoadingXmlInTreeView11172005011544AM/LoadingXmlInTreeView.aspx
icemanind
@icemanind No offence, but I don't think you read that article properly in the context of the question.
chibacity
@chibacity You may be right. Maybe a single thread is the way to go. I just didn't want to see his interface lock up again.
icemanind
@icemanind Anyway, indexing the data is the only sane way to go. It's quite a lot of data...
chibacity
+1  A: 

Hi there,

You could store the file in CouchDB. I wrote a Python script to do it:

import couchdb
from lxml import etree

couch = couchdb.Server()
db = couch["wiki"]

infile = '/Users/johndotnet/Downloads/plwiki-20100629-pages-articles.xml'

# iterate over complete <page> elements without loading the whole file
context = etree.iterparse(source=infile, events=("end",),
                          tag='{http://www.mediawiki.org/xml/export-0.4/}page')

for event, elem in context:
    couchEle = {}
    for ele in elem.getchildren():
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}id":
            couchEle['id'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}title":
            couchEle['title'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}revision":
            for subEle in ele.getchildren():
                if subEle.tag == "{http://www.mediawiki.org/xml/export-0.4/}text":
                    couchEle['text'] = subEle.text

    db[couchEle['title']] = couchEle
    elem.clear()  # free the parsed element, or iterparse will keep the whole tree in memory

This should import all the article with id, title and text into couchDB.

Now you can run a query like this:

code = '''
    function(doc) {
        if (doc.title.indexOf("Brzeg") > -1) {
            emit(doc._id, doc);
        }
    }
    '''
results = db.query(code)

Hope it helps!

+1  A: 

I just did my thesis on using Wikipedia to create training data for named-entity recognition. I have some software that could probably help you out. I was working with the English and Dutch dumps, which are 26 GB and 5 GB respectively. So I created an XML file splitter (it creates a bunch of smaller files, split on Wikipedia pages) and a parser which can optionally rip out a lot of the useless formatting; it also stores the title, text, id, categories, etc. in a database. I haven't put it online yet, but I could do that if you'd be interested in using it.

Jessica
One thing - my software is in Java...
Jessica
Nice! I would appreciate it if you sent me your software :) It doesn't really matter that it's a Java application, since I only need to parse the XML file. Then I could use it in my C# software (which, by the way, is also my thesis). Please contact me via e-mail: k.wegrzynowicz [at] [remove-this-tag] gmail.com
Ventus