I've got a huge file (500 MB) that is organized like this:

<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

I'd like to transform this into a new format where each s1 goes to one new file, one per line, and each s2 goes to a second new file, one per line.

Is Perl the way to go here? If so, can someone let me know how I can accomplish this?

+4  A: 

Yes, Perl is the (or maybe "a") way to go.

You need an XML parser. There are several choices on CPAN, so have a look.

XML::LibXML::Parser looks like it has something for parsing parts of files, which sounds like what you need.
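
For instance, the XML::LibXML::Reader pull-parser interface reads one node at a time, so it should not need the whole 500 MB in memory. A rough sketch (untested; it assumes the <link> records are wrapped in a single root element so the input is well-formed XML, and the file names are made up):

use strict;
use warnings;
use XML::LibXML::Reader;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $reader = XML::LibXML::Reader->new( location => 'bigfile.xml' );

# Hop from <link> to <link>, expanding only one element at a time.
while ( $reader->nextElement('link') ) {
    my $link = $reader->copyCurrentNode(1);    # 1 = deep copy, with children
    print {$s1} $_->textContent, "\n" for $link->getElementsByTagName('s1');
    print {$s2} $_->textContent, "\n" for $link->getElementsByTagName('s2');
}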

Kinopiko
+6  A: 

Use an XML parser. This problem is well-suited to an event-based parser, so I'd recommend looking into how the built-in XML::Parser or XML::SAX modules work. You should be able to register an event handler for each of the two tags you want to process and direct the matching content to two separate files.
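
For example, a bare-bones sketch with XML::Parser (untested; like the other parser-based answers it assumes the <link> records are wrapped in a single root element so the file parses as XML, and the file names are made up):

use strict;
use warnings;
use XML::Parser;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";
my %out = ( s1 => $s1, s2 => $s2 );

my ( $current, $buffer );

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {    # entering <s1> or <s2>: start collecting text
            my ( undef, $tag ) = @_;
            if ( $out{$tag} ) { $current = $tag; $buffer = ''; }
        },
        Char => sub {     # accumulate character data while inside one
            my ( undef, $text ) = @_;
            $buffer .= $text if defined $current;
        },
        End => sub {      # leaving </s1> or </s2>: write one line
            my ( undef, $tag ) = @_;
            if ( defined $current and $tag eq $current ) {
                print { $out{$tag} } $buffer, "\n";
                undef $current;
            }
        },
    },
);

$parser->parsefile('bigfile.xml');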

Matt Ryall
Can you provide a code example?
PP
There are examples in their documentation.
brian d foy
A: 

If the file is huge, an XML parser could cause a significant slowdown or even an application crash, since DOM-style XML parsers require the entire file in memory before any operations can be performed on it (something developers often forget when dealing with recursive structures).

Instead, you can be pragmatic: your data appears to follow a fairly consistent pattern, and this is a one-time transformation.

Try something like:


BEGIN {
  # Open the two output files once, before the input loop starts.
  open( FOUT1, ">s1.txt" ) or die( "Cannot open s1.txt: $!" );
  open( FOUT2, ">s2.txt" ) or die( "Cannot open s2.txt: $!" );
}
while ( defined( my $line = <> ) ) {
  # Capture the text between the tags and write it to the matching file.
  if ( $line =~ m{<s1>(.+?)</s1>} ) {
    print( FOUT1 "$1\n" );
  } elsif ( $line =~ m{<s2>(.+?)</s2>} ) {
    print( FOUT2 "$1\n" );
  }
}
END {
  close( FOUT2 );
  close( FOUT1 );
}

Then run this script as `perl myscript.pl < bigfile.txt`.

Update 1: corrected reference to matched section as $1 from $2.

PP
Many XML parsers would not require this entire structure to be in memory. Especially SAX parsers.
jrockway
That's good to hear; however, is there a risk of the XML parser doing bad things if a tag is missing or there's a syntax error anywhere in the huge file? The concern is the size of the input file.
PP
@PP: If there's a syntax error, what makes you think any other technique is going to work?
brian d foy
Either you are dealing with huge XML files or you aren't. If you are, you'd better make sure that the XML is valid and that you have the tools to deal with it.
innaM
Those `BEGIN` and `END` blocks are complete eyesores. You do not need `defined` in the `while` statement. And, ditto @jrockway, @brian d foy and @Manni.
Sinan Ünür
Wow. I hadn't even noticed the `BEGIN` and `END` blocks yet. I smell a cargo cult.
innaM
@brian d foy - there's a general principle on the internet: be conservative in what you send, be liberal in what you accept. You're right, we make assumptions, and a syntax error in the middle of a large file could cause problems regardless of the parsing solution. @Sinan I believe `defined` is good practice when reading input, since only undef is a valid end-of-file marker; however, you're right that the `BEGIN` and `END` blocks are unnecessary, although it's pretty clear what's happening there. Feel free to optimise in your detailed and educational answer!
PP
@PP `defined` is **completely and utterly** unnecessary in the `while` condition above. `while ( my $line = <$input> )` is always exactly the same as `while ( defined(my $line = <$input>) )`.
Sinan Ünür
+5  A: 

You can use Perl, but it's NOT the only way. Here's one with gawk:

gawk -F">" '/<s[12]>/{o=$0;sub(/.*</,"",$1);print o > "file_"$1 }' file

Or, if your task is very simple, then:

awk '/<s1>/' file > file_s1
awk '/<s2>/' file > file_s2

or grep:

grep "<s1>" file > file_s1
grep "<s2>" file > file_s2
ghostdog74
This also solves the problem. Why the downvote?
ghostdog74
Because attempting to parse XML with anything less than an XML parser is a bad idea, no exceptions.
squeeks
Yes, there are exceptions. Don't assume on behalf of the OP. The stated requirement is very simple, and in this case an XML parser is not needed. Also, the OP's file may not always be compliant XML, and an XML parser will flag errors, which is sometimes not desired. Of course, I'm not saying to forget about XML parsers entirely, but for simple stuff like this they are not necessary.
ghostdog74
Ignore the religious zealots who think all XML problems are solved with XML parsers. Anyone who says "there is only one way to do it" (a) hasn't used Perl, and (b) is fairly new to the world of technology. XML parsers have their uses, don't get me wrong, and when you need accurate parsing of unknown data an XML parser is often the best way. However, if you simply need to transform data you have control over, there are often better, faster, and less memory-hungry methods. Sometimes you have to be prepared to be downvoted, because the fact is there are fewer wizards than apprentices.
PP
Not being an expert in awk, I'm wondering whether the print statement needs two right arrows (`>>`, append) instead of one?
PP
No need for two arrows: within a single run, awk's `>` opens the file once and keeps writing to it. But if, on the next run of the script, you want to append to the previous results, then you should use `>>`.
ghostdog74
The OP has presented this as a text-processing problem. He has not indicated that the output should be XML. It is therefore appropriate to approach it as a text-processing task. The fact that the input text happens to be XML is inconsequential.
Dave Sherohman
by "previous results", i mean the previous same file name
ghostdog74
Why does everyone seem to think that someone who has to ask a question like this on Stack Overflow actually has a grasp of the requirements? Most often they don't; if they knew what they were doing, they wouldn't be asking the question.
brian d foy
Jeff Atwood had something to say on this topic a few days ago on his blog at http://www.codinghorror.com/blog/archives/001311.html
PP
Try again with XML entity support.
J-16 SDiZ
+7  A: 

I warmly recommend using XML::Twig, since it is capable of handling streams of XML data. You can use it something like this:

use XML::Twig;
my $xml = XML::Twig->new( twig_handlers => { link => \&process_link } );

$xml->parsefile('Your file here');

sub process_link
{
    my($xml, $link) = @_;
    # You can now handle each individual block here..

One trick is to do something like:

my $structure = $link->simplify;

Now it's a mixture of hashrefs and arrayrefs, depending on the structure. Everything, including attributes, is there:

use Data::Dumper; print Dumper $structure; exit;

You can use Data::Dumper to inspect it and take what you need.

Just remember to purge the twig when you're done with each block, to free up the memory it used:

    $xml->purge;
}
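
Putting it together, a handler that routes the text into the two output files might look like this (a sketch, untested; the output file names are made up):

use strict;
use warnings;
use XML::Twig;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        link => sub {
            my ( $t, $link ) = @_;
            # first_child_text returns the text of the first matching child.
            print {$s1} $link->first_child_text('s1'), "\n";
            print {$s2} $link->first_child_text('s2'), "\n";
            $t->purge;    # release the memory used so far
        },
    },
);
$twig->parsefile('bigfile.xml');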
squeeks
How does the input content end up in two different output files using this approach?
Peter Mortensen
A: 
> Is Perl the way to go here?

Perl is definitely not the only way to go. Here's one in Python:

f = open("xmlfile")
out1 = open("file_s1", "a")
out2 = open("file_s2", "a")
for line in f:
    # Copy each tagged line, as-is, to the matching output file.
    if "<s1>" in line:
        out1.write(line)
    elif "<s2>" in line:
        out2.write(line)
f.close()
out1.close()
out2.close()
lol, downvote because it's a Python solution? Or is it because I didn't use an XML parser?
-1 How many times are you going to open and close `file_s1` and `file_s2`?
Sinan Ünür
All right, all right, changed. It's just a 500 MB file, and not every line is s1 or s2! Gosh.
Everybody thinks their way is the best.
It's interesting that even after you corrected this, you are still getting downvotes.
Kinopiko
Well, I can't really do anything right? :) I think I can downvote once I get more reputation, so I am going to go on a downvoting spree. :) Kidding ....
+5  A: 

First, if you are going to ignore the fact that the input is XML, then there is no need for Perl or Python or gawk or any other language. Just use

$ grep '<s1>' input_file > s1.txt
$ grep '<s2>' input_file > s2.txt

and be done with it. This seems inefficient, but given the time it takes to write a script and then invoke it, the inefficiency is insignificant. Worse, if you do not know how to write that particularly simple script, you have to post on SO and wait for an answer, which exceeds the inefficiency of the grep solution by many orders of magnitude.

Now, if the fact that the input is XML matters in the slightest, you should use an XML parser. Contrary to the incorrect claim made elsethread, there are plenty of XML parsers that do not have to load the whole file into memory. Such a parser would have the advantage of being extensible and correct.

The example I give below is intended to replicate the structure of the answer you have already accepted to show you that it is no more complicated to use a proper solution.

Just to give fair warning, the script below is likely to be the slowest possible way. I wrote it to exactly mimic the accepted solution.

#!/usr/bin/perl

use strict; use warnings;
use autodie;

# Open both output files, keyed by name, so the print below can pick one.
my %fh = map { open my $f, '>', $_; $_ => $f } qw{ s1.txt s2.txt };

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);
$parser->xml_mode(1);

while ( my $tag = $parser->get_tag('s1', 's2') ) {
    my $type = $tag->get_tag;                  # 's1' or 's2'
    my $text = $parser->get_text("/$type");    # text up to the closing tag
    print { $fh{"$type.txt"} } $text, "\n";
}
__DATA__
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

Output:

C:\Temp> cat s1.txt
bunch of text here
bunch of text here
bunch of text here

C:\Temp> cat s2.txt
some more here
some more here
some more here
Sinan Ünür
You have given a good example, but your arrogance is too much. Personal attacks are not warranted, quite apart from being completely incorrect. I suggest you stick to providing _good quality_ and _helpful_ assistance and let the good solutions stand up for themselves. If you feel the need to market yourself too much, you leave people wondering why...
PP
@PP There is nothing personal in my answer. However, I seem to recall that you need to make an effort to understand that critical comments about your answers imply nothing about your personal worth. Votes on SO are cast for answers, not people.
Sinan Ünür
grep can be used for such a simple requirement, no doubt, BUT you go through the 500 MB file twice, once for s1 and once for s2. What if there are s3 and s4? The awk solution I posted, for example, processes the file only once. Similarly, a Perl or Python solution can be used in the same manner. In terms of efficiency on large files, grep on separate patterns is still not the best choice!
ghostdog74
@ghostdog74 There are two options: (1) This is a one-off situation with a **given** file containing exactly what is shown above: In that case, go with `grep`. Make a cup of coffee while the processing is under way. (2) This is a recurring situation with a variety of files and content. In that case, use an XML parser so the solution remains robust and generalizable.
Sinan Ünür
Fair enough. Hypothetically, and off-topic: if the OP has to produce this result for potential clients, I guess making a cup of coffee and waiting for it to finish is not an option. :)
ghostdog74
@ghostdog74 ;-))))) How about making coffee for the clients? Add a few Krispy Kreme's to the mix and you're golden.
Sinan Ünür
+1  A: 

You can use one of these approaches for this task (a sketch of the last one follows the list):

  1. Regular expressions
  2. The HTML::TreeBuilder module
  3. The HTML::TokeParser module
  4. The XML::LibXML module
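
For example, with XML::LibXML (a sketch, untested; it assumes the <link> records sit under a single root element, and note that this approach builds a DOM, so the whole document ends up in memory):

use strict;
use warnings;
use XML::LibXML;

open my $s1, '>', 's1.txt' or die "Cannot open s1.txt: $!";
open my $s2, '>', 's2.txt' or die "Cannot open s2.txt: $!";

my $doc = XML::LibXML->load_xml( location => 'bigfile.xml' );

# XPath picks up every <s1> and <s2> anywhere in the document.
print {$s1} $_->textContent, "\n" for $doc->findnodes('//s1');
print {$s2} $_->textContent, "\n" for $doc->findnodes('//s2');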
Thor