I have a huge XML file, 2.8 GB: the Polish Wikipedia articles dump. Its size is very problematic for me, because the task is to look up a large number of entries in it, and all I have are the titles of the articles. My first thought was to sort the titles and make a single linear pass through the file. The idea isn't bad, but the articles are not sorted alphabetically; they are sorted by ID, which I don't know a priori.
So my second thought was to build an index of the file: to store, in another file (or a database), lines in the following format: title;id;offset
(maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the byte offset of the tag I need, I could use a simple Seek to move the cursor within the file without reading all the content in between. For smaller files I think this would work fine, but on my computer (a laptop, C2D processor, Win7, VS2008) the application stops responding.
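To make the lookup side concrete, here is a minimal sketch of what I have in mind (the method name and parameters are invented for illustration; it assumes the index stores the byte offset where the relevant tag starts):

```csharp
using System.IO;
using System.Text;

static class DumpLookup
{
    // Jump straight to a stored byte offset instead of scanning the whole
    // file; Seek is O(1) and does not read the bytes it skips over.
    public static string ReadAtOffset(string xmlPath, long byteOffset, int length)
    {
        using (FileStream fs = new FileStream(xmlPath, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(byteOffset, SeekOrigin.Begin);
            byte[] buffer = new byte[length];
            int read = fs.Read(buffer, 0, length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}
```

The important point is that the offsets stored in the index must be counted in bytes, not characters, since Seek works on bytes.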
In my program I read each line from the file and check whether it contains a tag I need. I also count all the bytes read so far and save lines in the format mentioned above. While indexing, the program hangs. By that point the resulting index file is 36.2 MB and the last recorded offset is around 2,872,765,202 B, while the whole XML file is 3,085,439,630 B long.
My third thought was to split the file into smaller pieces. To be precise, into 26 pieces (one per letter of the Latin alphabet), each containing only the entries whose titles start with the same letter, e.g. a.xml would hold all entries whose titles start with the letter "A". The resulting files would be tens of MB each, 200 MB at most I think. But there's the same problem of having to read the whole file.
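The bucketing itself could look roughly like this (a sketch; note that Polish titles starting with Ż, Ł, Ś etc. fall outside A-Z, so they would need an extra bucket, which I handle here as "other"):

```csharp
using System;

static class TitleBuckets
{
    // Route a title to a per-letter bucket by its first letter.
    // Anything outside A-Z (Polish diacritics, digits, empty) goes to "other".
    public static string BucketFor(string title)
    {
        if (string.IsNullOrEmpty(title)) return "other";
        char c = char.ToUpperInvariant(title[0]);
        return (c >= 'A' && c <= 'Z') ? c.ToString() : "other";
    }
}
```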
To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest options, StreamReader being even faster than XmlReader. It's obvious that I can't load the whole file into memory: I have only 3 GB of RAM installed, and Win7 takes 800 MB-1 GB of it when fully loaded.
So I'm asking for help: what is the best thing to do? The point is that searching this XML file has to be fast, faster than downloading individual Wikipedia pages in HTML format. I'm not even sure whether that is possible.
Maybe I should load all the needed content into a database? Maybe that would be faster? But I would still need to read the whole file at least once.
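One middle ground I'm considering: the index file alone (~36 MB of title;id;offset lines) would fit comfortably in RAM as a dictionary, even though the dump itself does not. A sketch (assumes titles contain no ";", which may not hold for every Wikipedia title):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class IndexLoader
{
    // Load "title;id;offset" lines into a title -> byte-offset map.
    // Assumption: titles themselves contain no ';' separator.
    public static Dictionary<string, long> LoadIndex(string indexPath)
    {
        var map = new Dictionary<string, long>();
        foreach (string line in File.ReadAllLines(indexPath))
        {
            string[] parts = line.Split(';');
            if (parts.Length == 3)
                map[parts[0]] = long.Parse(parts[2]);
        }
        return map;
    }
}
```

After this one-time load, each lookup is a dictionary hit plus a single Seek into the dump.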
I'm not sure whether there is a limit on question length, but here is a sample of my indexing code.
while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        // +2 covers the "\r\n" stripped by ReadLine (assumes Windows line endings)
        fileIndex += enc.GetByteCount(line) + 2;
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }
    if (currentArea == Area.nothing) // nothing interesting at the moment
    {
        // search for the position of the <title> tag
        position = MoveAfter("<title>", line, position); // searches until it finds <title>
        if (position >= 0) currentArea = Area.title;
        else continue;
    }
    (...)
    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            // rewind to the start of this line, in bytes; note that
            // line.Length counts characters, not bytes, so it must not
            // be subtracted here
            long index = fileIndex - (enc.GetByteCount(line) + 2);
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}
reader.Close();
reader.Dispose();
writer.Close();
}
private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}
Best Regards and Thanks in advance,
ventus
Edit: Link to this Wiki's dump http://download.wikimedia.org/plwiki/20100629/