I want to merge several XML files in Perl. Each file is composed of many elements, and I need to merge the data for elements that appear in more than one file. For example:

file1 has elements {e1, e2, e4}

file2 has elements {e1, e3, e4}

file3 has elements {e2, e4, e5}

So I need to merge e1 of file1 with e1 of file2, e2 of file1 with e2 of file3, and so on. The merged result will be saved in another file.

Since these files are big, it is not a good idea to merge them file by file (i.e. parse the whole of file1, then parse the whole of file2 and merge it with file1, and so on), because that would consume a lot of memory.

So I plan to merge the data element by element: parse e1 in all files, release the memory, then parse e2 in all files, release the memory, and so on.

Currently I use XML::Parser (a SAX-style parser) to parse and handle the files.

My question is how to implement the element-by-element merge. I do not know how the parsers can be coordinated so that they process the same element at the same time. Using condition signals? fork()? Or something else? Can anybody give me an example? I am not familiar with either approach. Thanks.

Here is an example of how the data should be merged.

file1:

<class name="math">

<string>luke1</string>

<string>luke2</string>

</class name>

<class name="music">

<string>mary1</string>

<string>mary2</string>

</class name>

file2:

<class name="math">

<string>luke1</string>

<string>luke3</string>

</class name>

<class name="music">

<string>mary1</string>

<string>mary3</string>

</class name>

<class name="english">

<string>tom1</string>

<string>tom2</string>

</class name>

These should be merged into another file as:

<class name="math">

<string>luke1</string>

<string>luke2</string>

<string>luke3</string>

</class name>

<class name="music">

<string>mary1</string>

<string>mary2</string>

<string>mary3</string>

</class name>

<class name="english">

<string>tom1</string>

<string>tom2</string>

</class name>

Note that I want to merge the math element across all files, then the music element across all files, and then the english element across all files.

+2  A: 

UPDATE:

Yes, you can try to process the 3 files in "parallel" using SAX parsers, if your callbacks implement a "sleep / wake up / check whether the other SAX parsers said to proceed" mechanism. It is basically a poor approximation of threads and messaging.

It would only work if the elements in each XML file appeared in the exact same order, ideally alphabetical - that way you can move linearly through each file with its SAX parser and guarantee that you hit identical elements at the same time, and thus hold only 3-6 elements in memory at once. It is basically merging 3 sorted arrays into 1 sorted array.

I seriously doubt this approach would be even remotely superior to the original algorithm I describe below, but if that is what you want to try to implement, go for it.
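For intuition, here is that merge in miniature, using the question's element sets as plain pre-sorted Perl lists - just the control flow, no XML:

    use strict;
    use warnings;

    # The question's element sets, already sorted:
    my @streams = ([qw(e1 e2 e4)], [qw(e1 e3 e4)], [qw(e2 e4 e5)]);
    my @merged;
    while (grep { @$_ } @streams) {                   # any stream left?
        # smallest element among the current heads
        my ($min) = sort map { $_->[0] } grep { @$_ } @streams;
        push @merged, $min;                           # emit it once
        # advance every stream whose head was just emitted
        shift @$_ for grep { @$_ && $_->[0] eq $min } @streams;
    }
    print "@merged\n";    # e1 e2 e3 e4 e5

Each step holds only the stream heads, which is where the 3-6 elements-in-memory figure comes from.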

ORIGINAL:

Basically, the best (if not the only) way of doing what you want is to build a database of all the elements in need of merging.

It would probably map an element name or ID to N true/false fields, one for each XML file; or even to a single yes/no value for "already merged" - I will use the latter option in my example logic below.

Whether that database is implemented as an in-memory hash, a tied hash stored in a file to avoid memory issues, or a proper database (XML, SQLite, DBM, or a real database backend) is less important - except that the first option obviously sucks memory-consumption-wise.

Please note the XML database option, since you MIGHT manage to use the resulting XML file itself as the database. That might actually be your easiest option, though I am not sure - I would personally recommend a tied hash or a real database back-end if you have one.
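For instance, the tied-hash flavor of that registry might look like this (a minimal sketch - the file name merged.db and the variable names are just illustrative):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # The registry lives in a file on disk, not in memory.
    tie my %merged, 'DB_File', 'merged.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "cannot tie merged.db: $!";

    # ...then, inside the per-element handler:
    # next if $merged{$element_name};    # already merged - skip
    # $merged{$element_name} = 1;        # mark as merged

    untie %merged;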

Having done that, the algorithm is obvious:

  • Loop over each file using a SAX parser.

  • For each element found, look that element up in the database. If it is already marked as processed, skip it. If not, add it to the database as processed.

  • Find that same element in all the subsequent files, using XPath. E.g. when processing file2.xml, only search file3.xml, since file1.xml could not still have the element (or else it would already have been processed out of file1.xml and would appear in the database).

  • Merge all the elements found via XPath with the element from the current file, insert the result into the output XML file, and save it.

  • End both loops. (A sketch of this logic follows the list.)
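Here is a sketch of that logic in Perl, using XML::LibXML's DOM and XPath for all the parsing and searching (I am substituting DOM for the outer SAX loop for brevity; it assumes each file wraps its <class> elements in a single root element so the input is well-formed, and uses a plain hash for the registry - swap in the tied hash from above for big inputs):

    use strict;
    use warnings;
    use XML::LibXML;

    my @files = @ARGV;
    my %merged;    # element name => already written to the output?

    print qq{<classes>\n};
    for my $i (0 .. $#files) {
        my $doc = XML::LibXML->load_xml(location => $files[$i]);
        for my $class ($doc->findnodes('//class')) {
            my $name = $class->getAttribute('name');
            next if $merged{$name}++;    # an earlier file already handled it

            # start with this file's own <string> values...
            my (@strings, %seen);
            for my $s ($class->findnodes('string')) {
                my $text = $s->textContent;
                push @strings, $text unless $seen{$text}++;
            }
            # ...then search only the SUBSEQUENT files via XPath
            for my $j ($i + 1 .. $#files) {
                my $other = XML::LibXML->load_xml(location => $files[$j]);
                for my $s ($other->findnodes(qq{//class[\@name="$name"]/string})) {
                    my $text = $s->textContent;
                    push @strings, $text unless $seen{$text}++;
                }
            }

            print qq{  <class name="$name">\n};
            print qq{    <string>$_</string>\n} for @strings;
            print qq{  </class>\n};
        }
    }
    print qq{</classes>\n};

Note that every new element triggers a fresh parse of each subsequent file - that is the extra-parsing cost I mention in the comments below.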

Please note that this answer does not directly address which modules to use for each step - presumably XML::Parser or any other SAX parser for the parsing, XML::XPath for searching the other files, and something like XML::SAX::Writer to write the resulting file. As I have never had to write a file in a non-DOM model, though, I do not want to make the latter an official recommendation; if you want to know which module is best for that, you may want to ask it as a separate question, or hope someone answers this one with more precise module recommendations.

DVK
A: 

(Sorry, I could not add a comment somehow, so I have to post my comment as an answer.)

Hi DVK,

I do not understand what you mean. As I said, I do not want to parse file by file, i.e. parse all elements in file1 and record the data in memory, then parse all elements in file2, record its data in memory and merge it with the data from file1, then parse all elements in file3, and so on, finally saving the merged data to the result file. That approach eats a lot of memory.

Instead I want to process one element across all files, save it, free that element's memory, then process the next element across all files, save it, and so on.

I do not understand what "loop over each file" means. Are you still suggesting the first approach I mentioned? And what is "find that same element in all the subsequent files"? You parse all the files, and now you want to search for each element in each file using XPath again?

You can't avoid parsing each file in turn. But you don't have to store all the parsed elements in memory if you use my suggestion of a database/registry that's NOT an in-memory hash.
DVK
Yes, I know you suggested using something like a tied hash rather than an in-memory hash, but I still do not quite understand your suggestion of checking whether an element is processed or not (yes/no) in a mapping table. Do you suggest parsing these files multiple times, or once or twice?
@lilli - either parse/loop over each file only once and accumulate the intended contents of the final XML in a tied hash, or loop over each file once but parse them a lot more due to the searching via XPath.
DVK
@lili - see my update. You CAN try to do it with 3 parallel parses under a very specific data domain, but I strongly dislike that approach compared to my original one, as it is too brittle/complicated/fragile.
DVK
thanks, I will try
A: 

I like XML::LibXML, so I'd use XML::LibXML::Reader. Open a separate XML::LibXML::Reader on each input file specified as an argument to your script, and just call ->read on each of them in turn, reproducing the input on output just once for each round, with slightly more complicated logic at the merge points. If you have more input files than file descriptors, you're going to have to merge them in batches; I'd do this in a shell script or Makefile.
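A minimal sketch of that approach, under the same assumptions as the update in the first answer (each file's <class> elements sorted by name, plus a single wrapping root element per file so the input is well-formed). Instead of raw ->read calls it advances each reader one <class> element at a time with nextElement, which keeps the merge-point logic compact:

    use strict;
    use warnings;
    use XML::LibXML::Reader;

    # One reader per input file; each "head" is that file's current <class>.
    my @heads;
    for my $file (@ARGV) {
        my $r = XML::LibXML::Reader->new(location => $file)
            or die "cannot open $file";
        push @heads,
            [$r, $r->nextElement('class') ? $r->copyCurrentNode(1) : undef];
    }

    print qq{<classes>\n};
    while (grep { defined $_->[1] } @heads) {
        # smallest class name among the current heads
        my ($min) = sort map  { $_->[1]->getAttribute('name') }
                         grep { defined $_->[1] } @heads;

        # union of <string> values from every file whose head matches
        my (@strings, %seen);
        for my $h (@heads) {
            next unless defined $h->[1]
                     && $h->[1]->getAttribute('name') eq $min;
            for my $s ($h->[1]->findnodes('string')) {
                my $text = $s->textContent;
                push @strings, $text unless $seen{$text}++;
            }
            # advance this file to its next <class> (or mark it done)
            $h->[1] = $h->[0]->nextElement('class')
                    ? $h->[0]->copyCurrentNode(1) : undef;
        }

        print qq{  <class name="$min">\n};
        print qq{    <string>$_</string>\n} for @strings;
        print qq{  </class>\n};
    }
    print qq{</classes>\n};

Each reader only ever materializes its current <class> element (via copyCurrentNode), so memory use grows with the number of files, not their sizes.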

reinierpost