ansaurus

Question

Perl splitting text string (from HTML page, text document, etc.) by line into array?

Answer 1

A:

It's hard to tell what your code is doing since we don't have it in front of us; it would be easier to help if you posted what you have. However, I'll give it a shot. If you scrape the text into a variable, you will have a string that may contain embedded line breaks. These will be either \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can split on the newline sequence to get the lines in the text. Thus, the single line you'll need should be

my @lines = split(/\r?\n/, $scraped_text);
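As a quick check of that one-liner, here is a minimal sketch. The sample string is made up for illustration and deliberately mixes both kinds of line ending.

```perl
use strict;
use warnings;

# Hypothetical sample text mixing Windows (\r\n) and Unix (\n) line endings.
my $scraped_text = "first line\r\nsecond line\nthird line\n";

# \r?\n matches either ending, so no stray \r is left on any line.
# split also drops the empty trailing field after the final newline.
my @lines = split /\r?\n/, $scraped_text;

print scalar(@lines), " lines\n";
print "[$_]\n" for @lines;
```

Note that split discards trailing empty fields, so blank lines at the very end of the text are dropped, while blank lines in the middle are kept as empty strings.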

Antal S-Z 2010-07-17 12:17:20

Answer 2

A:

Use the $/ variable (the input record separator), which determines what a line read breaks on. So:

local $/ = " ";
while(<FILE>)...

would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was - or better yet, go out of the local $/ scope and let the global one come back, just in case it was something other than "\n" to begin with.
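A minimal sketch of that scoping advice, assuming the text is already in a string: the variable names and sample text are invented, and an in-memory filehandle stands in for FILE so that no real file is needed.

```perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for a real file.
my $text = "alpha beta gamma";
open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";

my @chunks;
{
    # The change to $/ lasts only until the end of this block.
    local $/ = " ";
    while (my $chunk = <$fh>) {
        chomp $chunk;    # chomp removes the current $/ (a space here)
        push @chunks, $chunk;
    }
}
close $fh;

# Outside the block, $/ is back to whatever it was before (normally "\n").
print "$_\n" for @chunks;
```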

You can also undefine it altogether:

local $/ = undef;

That reads the whole file in one slurp, and you can then iterate through it however you like. Just be aware that if you split a large string, every line is copied into a new list, so you may end up using lots of CPU and lots of memory. One way to do it with less is:

# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'

That example breaks the string apart at newlines. Within a /g-tagged regex, \G matches either the beginning of the string or the position where the last match ended. Note that the pattern requires a trailing \n, so a final line without one is skipped.
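If the last line might lack a trailing newline, one possible variation (a sketch, not the only way to write it) adds a second alternative for a non-empty final line. The sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical input: contains a blank line and has no trailing newline.
my $text = "foo\nbar\n\nbaz";

my @lines;
# First alternative: any line (possibly empty) terminated by \n.
# Second alternative: a non-empty last line that runs to the end of the string.
# Requiring [^\n]+ there avoids a spurious empty match at the very end.
while ($text =~ /\G(?:([^\n]*)\n|([^\n]+)\z)/g) {
    push @lines, defined $1 ? $1 : $2;
}

print "line='$_'\n" for @lines;
```

This walks the string in place and copies out one line per iteration, so the loop body could process each line directly instead of collecting them into @lines.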

Another weird tidbit is $/ = \10: if you give it a reference to an integer (here 10), you get fixed-length records of that size instead of lines:

# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'

More info: http://www.perl.com/pub/a/2004/06/18/variables.html

If you combine this with FM's answer of using:

$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }

I think you can get every combination of what you need...
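For instance, a self-contained sketch that combines the fixed-length record trick with an in-memory filehandle. It reuses the sample string from above; the lexical handle and the surrounding block are my own choices, not part of the original snippets.

```perl
use strict;
use warnings;

my $data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";

my @chunks;
{
    # A reference to an integer makes each read return a record of that size.
    local $/ = \10;
    @chunks = <$fh>;
}
close $fh;

print "chunk='$_'\n" for @chunks;
```

Because the string here has no trailing newline, there is no short final chunk like the one in the file-based session above.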

eruciform 2010-07-17 16:17:35

Thanks for the post. However, the issue is that I'm not reading from a file; the text is taken from a string variable, and I want to avoid using a file at all.

Rick 2010-07-18 02:12:19

@rick: The first and last parts are all about strings: the bit about `\G` and the bit about handing a scalar reference to `open`. No files necessary. Hope it helps.

eruciform 2010-07-18 03:43:26

Answer 3

+1 A:

Here's an idea that might help you: you can open from strings as well as files.

So if you used to do this:

open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my @list = <$io>;

You can just do this:

open( my $io, '<', \$text_I_captured ) or die "Could not open string! - $!";
my @list = <$io>;
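Lines read this way keep their trailing newlines. A minimal sketch of the same idea with the line endings stripped; the sample text is invented, and the \r cleanup is only needed if the captured text may use Windows line endings.

```perl
use strict;
use warnings;

# Hypothetical captured text with mixed line endings.
my $text_I_captured = "line one\r\nline two\nline three\n";

open( my $io, '<', \$text_I_captured ) or die "Could not open string! - $!";
chomp( my @list = <$io> );    # chomp removes the trailing "\n" from every line
close $io;

s/\r\z// for @list;           # drop a leftover \r from Windows-style endings

print scalar(@list), " lines\n";
print "[$_]\n" for @list;
```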

Axeman 2010-07-17 18:13:38

Thanks, that's exactly the sort of thing I am looking for. I will try that.

Rick 2010-07-18 02:13:08
