I've got a huge file (500 MB) that is organized like this:

<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

I'd like to transform this into a new format where each s1 goes to one new file, one per line, and each s2 goes to a second new file, one per line.

Is Perl the way to go here? If so, can someone let me know how I can accomplish this?

+4  A: 

Yes, Perl is the (or maybe "a") way to go.

You need an XML parser. There are several choices on CPAN, so have a look.

XML::LibXML::Parser looks like it has something for parsing parts of files, which sounds like what you need.
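
For instance, the XML::LibXML::Reader pull-parser interface reads one node at a time, so it should not need the whole 500 MB in memory. A rough sketch (untested; it assumes the <link> records are wrapped in a single root element so the input is well-formed XML, and the file names are made up):

use strict;
use warnings;
use XML::LibXML::Reader;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $reader = XML::LibXML::Reader->new( location => 'bigfile.xml' );

# Hop from <link> to <link>, expanding only one element at a time.
while ( $reader->nextElement('link') ) {
    my $link = $reader->copyCurrentNode(1);    # 1 = deep copy, with children
    print {$s1} $_->textContent, "\n" for $link->getElementsByTagName('s1');
    print {$s2} $_->textContent, "\n" for $link->getElementsByTagName('s2');
}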

Kinopiko
+6  A: 

Use an XML parser. This problem is well-suited to an event-based parser, so I'd recommend looking into how the built-in XML::Parser or XML::SAX modules work. You should be able to register an event handler for each of the two tags you want to process and direct the matching content to two separate files.
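
For example, a bare-bones sketch with XML::Parser (untested; like the other parser-based answers it assumes the <link> records are wrapped in a single root element so the file parses as XML, and the file names are made up):

use strict;
use warnings;
use XML::Parser;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";
my %out = ( s1 => $s1, s2 => $s2 );

my ( $current, $buffer );

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {    # entering <s1> or <s2>: start collecting text
            my ( undef, $tag ) = @_;
            if ( $out{$tag} ) { $current = $tag; $buffer = ''; }
        },
        Char => sub {     # accumulate character data while inside one
            my ( undef, $text ) = @_;
            $buffer .= $text if defined $current;
        },
        End => sub {      # leaving </s1> or </s2>: write one line
            my ( undef, $tag ) = @_;
            if ( defined $current and $tag eq $current ) {
                print { $out{$tag} } $buffer, "\n";
                undef $current;
            }
        },
    },
);

$parser->parsefile('bigfile.xml');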

Matt Ryall
Can you provide a code example?
PP
There are examples in their documentation.
brian d foy
A: 

If the file is huge, an XML parser could cause a significant slowdown or even an application crash, since DOM-style XML parsers require the entire file in memory before any operations can be performed on it (something developers often forget when dealing with recursive structures).

Instead, you can be pragmatic: your data appears to follow a fairly consistent pattern, and this is a one-time transformation.

Try something like:


BEGIN {
  # Open the two output files once, before the input loop starts.
  open( FOUT1, ">s1.txt" ) or die( "Cannot open s1.txt: $!" );
  open( FOUT2, ">s2.txt" ) or die( "Cannot open s2.txt: $!" );
}
while ( defined( my $line = <> ) ) {
  # Capture the text between the tags and write it to the matching file.
  if ( $line =~ m{<s1>(.+?)</s1>} ) {
    print( FOUT1 "$1\n" );
  } elsif ( $line =~ m{<s2>(.+?)</s2>} ) {
    print( FOUT2 "$1\n" );
  }
}
END {
  close( FOUT2 );
  close( FOUT1 );
}

Then run this script as `perl myscript.pl < bigfile.txt`.

Update 1: corrected reference to matched section as $1 from $2.

PP
Many XML parsers would not require this entire structure to be in memory. Especially SAX parsers.
jrockway
That's good to hear; however, is there a risk of the XML parser doing bad things if a tag is missing or there's a syntax error anywhere in the huge file? The concern is the size of the input file.
PP
@PP: If there's a syntax error, what makes you think any other technique is going to work?
brian d foy
Either you are dealing with huge XML files or you aren't. If you are, you'd better make sure that the XML is valid and that you have the tools to deal with it.
innaM
Those `BEGIN` and `END` blocks are complete eyesores. You do not need `defined` in the `while` statement. And, ditto @jrockway, @brian d foy and @Manni.
Sinan Ünür
Wow. I hadn't even noticed the `BEGIN` and `END` blocks yet. I smell a cargo cult.
innaM
@brian d foy - there's a general principle on the internet: be conservative in what you send, be liberal in what you accept. You're right, we make assumptions, and a syntax error in the middle of a large file could cause problems regardless of the parsing solution. @Sinan I believe `defined` is good practice when reading input, since only undef is a valid end-of-file marker; however, you're right that the `BEGIN` and `END` blocks are unnecessary, although it's pretty clear what's happening there. Feel free to optimise in your detailed and educational answer!
PP
@PP `defined` is **completely and utterly** unnecessary in the `while` condition above. `while ( my $line = <$input> )` is always exactly the same as `while ( defined(my $line = <$input>) )`.
Sinan Ünür
+5  A: 

You can use Perl, but it's NOT the only way. Here's one with gawk:

gawk -F">" '/<s[12]>/{o=$0;sub(/.*</,"",$1);print o > "file_"$1 }' file

Or, if your task is very simple, then:

awk '/<s1>/' file > file_s1
awk '/<s2>/' file > file_s2

or grep:

grep "<s1>" file > file_s1
grep "<s2>" file > file_s2
ghostdog74
This also solves the problem. Why the downvote?
ghostdog74
Because attempting to parse XML with anything less than an XML parser is a bad idea, no exceptions.
squeeks
Yes, there are exceptions. Don't assume on behalf of the OP. The stated requirement is very simple, and in this case an XML parser is not needed. Also, the OP's file may not always be compliant XML, and an XML parser will flag errors, which is sometimes not desired. Of course, I'm not saying to forget about XML parsers entirely, but for simple stuff like this they are not necessary.
ghostdog74
Ignore the religious zealots who think all XML problems are solved with XML parsers. Anyone who says "there is only one way to do it" (a) hasn't used Perl, and (b) is fairly new to the world of technology. XML parsers have their uses, don't get me wrong, and when you need accurate parsing of unknown data an XML parser is often the best way. However, if you simply need to transform data you have control over, there are often better, faster, and less memory-hungry methods. Sometimes you have to be prepared to be downvoted, because the fact is there are fewer wizards than apprentices.
PP
Not being an expert in awk, I'm wondering whether the print statement needs two right arrows (`>>`, append) instead of one?
PP
No need for two arrows: within a single run, awk's `>` opens the file once and keeps writing to it. But if, on the next run of the script, you want to append to the previous results, then you should use `>>`.
ghostdog74
The OP has presented this as a text-processing problem. He has not indicated that the output should be XML. It is therefore appropriate to approach it as a text-processing task. The fact that the input text happens to be XML is inconsequential.
Dave Sherohman
by "previous results", i mean the previous same file name
ghostdog74
Why does everyone seem to think that someone who has to ask a question like this on Stack Overflow actually has a grasp of the requirements? Most often they don't; if they knew what they were doing, they wouldn't be asking the question.
brian d foy
Jeff Atwood had something to say on this topic a few days ago on his blog at http://www.codinghorror.com/blog/archives/001311.html
PP
Try again with XML entity support.
J-16 SDiZ
+7  A: 

I warmly recommend using XML::Twig, since it is capable of handling streams of XML data. You can use it something like this:

use XML::Twig;
my $xml = XML::Twig->new( twig_handlers => { link => \&process_link } );

$xml->parsefile('Your file here');

sub process_link
{
    my($xml, $link) = @_;
    # You can now handle each individual block here..

One trick is to do something like:

my $structure = $link->simplify;

Now it's a mixture of hashrefs and arrayrefs, depending on the structure. Everything, including attributes, is there:

use Data::Dumper; print Dumper $structure; exit;

You can use Data::Dumper to inspect it and take what you need.

Just remember to purge the twig when you're done with each block, to free up the memory it used:

    $xml->purge;
}
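
Putting it together, a handler that routes the text into the two output files might look like this (a sketch, untested; the output file names are made up):

use strict;
use warnings;
use XML::Twig;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        link => sub {
            my ( $t, $link ) = @_;
            # first_child_text returns the text of the first matching child.
            print {$s1} $link->first_child_text('s1'), "\n";
            print {$s2} $link->first_child_text('s2'), "\n";
            $t->purge;    # release the memory used so far
        },
    },
);
$twig->parsefile('bigfile.xml');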
squeeks
How does the input content end up in two different output files using this approach?
Peter Mortensen
A: 
> Is Perl the way to go here?

Perl is definitely not the only way to go. Here's one in Python:

f = open("xmlfile")
out1 = open("file_s1", "a")
out2 = open("file_s2", "a")
for line in f:
    # Copy each tagged line, as-is, to the matching output file.
    if "<s1>" in line:
        out1.write(line)
    elif "<s2>" in line:
        out2.write(line)
f.close()
out1.close()
out2.close()
lol, downvote because it's a Python solution? Or is it because I didn't use an XML parser?
-1 How many times are you going to open and close `file_s1` and `file_s2`?
Sinan Ünür
All right, all right, changed. It's just a 500 MB file, and not every line is s1 or s2! Gosh.
Everybody thinks their way is the best.
It's interesting that even after you corrected this, you are still getting downvotes.
Kinopiko
Well, I can't really do anything right? :) I think I can downvote once I get more reputation, so I am going to go on a downvoting spree. :) Kidding ....
+5  A: 

First, if you are going to ignore the fact that the input is XML, then there is no need for Perl or Python or gawk or any other language. Just use

$ grep '<s1>' input_file > s1.txt
$ grep '<s2>' input_file > s2.txt

and be done with it. This seems inefficient, but given the time it takes to write a script and then invoke it, the inefficiency is insignificant. Worse, if you do not know how to write that particularly simple script, you have to post on SO and wait for an answer, which exceeds the inefficiency of the grep solution by many orders of magnitude.

Now, if the fact that the input is XML matters in the slightest, you should use an XML parser. Contrary to the incorrect claim made elsethread, there are plenty of XML parsers that do not have to load the whole file into memory. Such a parser would have the advantage of being extensible and correct.

The example I give below is intended to replicate the structure of the answer you have already accepted to show you that it is no more complicated to use a proper solution.

Just to give fair warning, the script below is likely to be the slowest possible way. I wrote it to exactly mimic the accepted solution.

#!/usr/bin/perl

use strict; use warnings;
use autodie;

# Open both output files, keyed by name, so the print below can pick one.
my %fh = map { open my $f, '>', $_; $_ => $f } qw{ s1.txt s2.txt };

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);
$parser->xml_mode(1);

while ( my $tag = $parser->get_tag('s1', 's2') ) {
    my $type = $tag->get_tag;                  # 's1' or 's2'
    my $text = $parser->get_text("/$type");    # text up to the closing tag
    print { $fh{"$type.txt"} } $text, "\n";
}
__DATA__
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

Output:

C:\Temp> cat s1.txt
bunch of text here
bunch of text here
bunch of text here

C:\Temp> cat s2.txt
some more here
some more here
some more here
Sinan Ünür
You have given a good example, but your arrogance is too much. Personal attacks are not warranted, quite apart from being completely incorrect. I suggest you stick to providing _good quality_ and _helpful_ assistance and let the good solutions stand up for themselves. If you feel the need to market yourself too much, you leave people wondering why...
PP
@PP There is nothing personal in my answer. However, I seem to recall that you need to make an effort to understand that critical comments about your answers imply nothing about your personal worth. Votes on SO are cast for answers, not people.
Sinan Ünür
grep can be used for such a simple requirement, no doubt, BUT you go through the 500 MB file twice, once for s1 and once for s2. What if there are s3 and s4? The awk solution I posted, for example, processes the file only once. Similarly, a Perl or Python solution can be used in the same manner. In terms of efficiency on large files, grep on separate patterns is still not the best choice!
ghostdog74
@ghostdog74 There are two options: (1) This is a one-off situation with a **given** file containing exactly what is shown above: In that case, go with `grep`. Make a cup of coffee while the processing is under way. (2) This is a recurring situation with a variety of files and content. In that case, use an XML parser so the solution remains robust and generalizable.
Sinan Ünür
Fair enough. Hypothetically, and off-topic: if the OP has to produce this result for potential clients, I guess making a cup of coffee and waiting for it to finish is not an option. :)
ghostdog74
@ghostdog74 ;-))))) How about making coffee for the clients? Add a few Krispy Kreme's to the mix and you're golden.
Sinan Ünür
+1  A: 

You can use one of these approaches for this task (a sketch of the last one follows the list):

  1. Regular expressions
  2. The HTML::TreeBuilder module
  3. The HTML::TokeParser module
  4. The XML::LibXML module
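
For example, with XML::LibXML (a sketch, untested; it assumes the <link> records sit under a single root element, and note that this approach builds a DOM, so the whole document ends up in memory):

use strict;
use warnings;
use XML::LibXML;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $doc = XML::LibXML->load_xml( location => 'bigfile.xml' );

# XPath picks up every <s1> and <s2> anywhere in the document.
print {$s1} $_->textContent, "\n" for $doc->findnodes('//s1');
print {$s2} $_->textContent, "\n" for $doc->findnodes('//s2');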
Thor