I am still new to regex and I've run into a bit of a problem. I am building a parsing script and I need to be able to pull lines of a certain length out of a file.

How would I write a regex to match lines that have a certain number of words? E.g., I want to match all lines in a file that have 3 words.

Could I extend that to find all lines within certain parameters? E.g., I want to match all lines in a file that have between 2 and 5 words.

I am using Perl, in case that matters. Thanks!

+1  A: 

This depends on what you consider to be a word. Perl 5 considers a word to be /\w+/. If you have a different definition you will need to supply it.

You can find the number of times a regex matched by using the Count Of secret operator: ()=:

my $count = ()= $line =~ /\w+/g;

Once you know the number of words, you can easily construct an if statement that prints a line when the number of words is between two numbers, using the >= and <= operators.
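
A minimal sketch of that if statement, assuming the /\w+/ definition of a word. The helper name count_words and the sample lines are illustrative only and are not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper: count /\w+/ matches by assigning the match list
# to an empty list and reading that assignment in scalar context.
sub count_words {
    my ($line) = @_;
    my $count = () = $line =~ /\w+/g;
    return $count;
}

# Illustrative sample lines, not taken from the question.
for my $line ("one", "one two", "one, two; three!", "a b c d e f") {
    my $count = count_words($line);
    # Keep only lines with between 2 and 5 words, inclusive.
    print "$line\n" if $count >= 2 && $count <= 5;
}
```

With these samples, only "one two" and "one, two; three!" are printed, since punctuation does not count toward /\w+/.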

In Perl 5.10 and later, it is possible to match two to five words using the possessive quantifier:

#!/usr/bin/perl

use strict;
use warnings;

while (my $line = <DATA>) {
    next unless $line =~ /^(?:\W*+\w++){2,5}$/;
    print $line;
}

__DATA__
one
one two
one two three
one two three four
one two three four five
one two three four five six
Chas. Owens
/\b\w+\b/ would be an acceptable word definition for me. How can I get Perl to match that a number of times? something like (/\b\w+\b/){2,5} doesn't seem to work for me. What am I missing?
The slashes need to go on the outside of the entire expression, for starters: /(\b\w+\b){2,5}/. Can you be more specific than "doesn't seem to work"? What is your input and expected outcome?
Matt Kane
I like your idea Chas. Owens, however, my $count = ()= $line =~ /\w+/; for me produces a result of 1 no matter what I feed into it. Looks like changing it a bit fixed it. This seems to work for me: my $count = ()= $line =~ m/\w+/g;
Matt, I did try with the slashes outside, I just retyped it wrong, my apologies. My input is a long text file (ebook), and I am looking to create a regex that will pull out chapter titles so I can create bookmarks. The chapter titles are always between 2 and 5 words long. So my output should be the line that matches the regex.
@user436157: `/(\b\w+\b){2,5}/` is self-contradictory and will never match; it says each word must not have a word character after it (the \b) but must have a word immediately after. Try `/(\b\w++\b\W*+){2,5}/` (assuming 5.10+)
ysth
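
The point in the comment above can be checked directly. A minimal sketch, assuming Perl 5.10+ for the possessive quantifiers; the sample line and the variable names are illustrative only:

```perl
use strict;
use warnings;

my $line = "one two three";   # illustrative input, not from the thread

# Each repetition ends at a word boundary, and the next repetition must
# start with a word character at that same position, so a second
# repetition can never begin.
my $strict = $line =~ /(\b\w+\b){2,5}/ ? "match" : "no match";

# Letting each repetition also consume the trailing non-word characters
# allows the next word to start where the previous repetition ended.
my $loose = $line =~ /(\b\w++\b\W*+){2,5}/ ? "match" : "no match";

print "strict: $strict\n";
print "loose:  $loose\n";
```

For this input the first pattern reports "no match" and the second reports "match". Note that the second pattern is unanchored, so it also accepts lines with more than five words.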
@user436157 My fault, I forgot the `/g` to make it match more than once.
Chas. Owens
Isn't what you are calling the `Count Of` operator, `()=`, really just a coercion to an empty list assignment which then gets assigned in scalar context to `$count` to produce the count of matches? To be clear: it is a great Perlism, commonly seen, wonderful shortcut, but not an "operator" per se... Let's not lose the newbie Perlies on the first hello!
drewk
@ysth That is unanchored, so it matches two or more words. I think `/^(?:\w++\W*+){2,5}$/` is probably better.
Chas. Owens
Looks like your solution does the trick for me Chas. Thank you!
@drewk I called it the `Count Of` **secret** operator and pointed him or her to a document that explains what is going on in detail. At some point in the near future that document should be replacing `perlop`.
Chas. Owens
`()=` is a "secret" operator in the same way that `-->` is a secret operator -- they are both just a grouping of two operators, but have a useful semantic meaning together, so each is a useful addition to one's toolbox and worthy of a special mention in the documentation.
Ether
@Chas. Owens: oh, right; lost track of the overall goal. But yours now requires starting with \w; `/^(?:\W*+\w++){2,5}+\W*+\z/`
ysth
@ysth Good point. Fixing now.
Chas. Owens
@Chas: To quote from the file link you supplied: "Secret Operators: There are idioms in Perl 5 that appear to be operators, but are really a combination of several operators or pieces of syntax. These Secret Operators have the precedence of the constituent parts. N.B. There are often better ways of doing all of these things and those ways will be easier to understand for the people who will maintain your code after you." My point is that the `()=` idiom is slick; I like it; I use it. But it is an idiom not an operator. To call it an Operator can be really confusing to non Perl programmers.
drewk
@drewk I didn't make up the term secret operator. That is the community term for them. If you don't like the term then you just need to come up with a different one and start getting people to use it. Personally I like pseudo-operator, but the term secret operator was already in wide use (http://www.google.com/search?q=perl+secret+operator), so I changed my document and usage to reflect what the community is already doing.
Chas. Owens
@Chas: I disagree that this is the common and accepted community term for these constructs. Perlmonks calls them `composite operators`, `=()=` which is the form of what you used is often called the "Goatse" here on SO and on Perlmonks. My point is not to criticize you at all. My point is that with beginning Perl coders the usage is potentially confusing. It is absolutely useful to understand the idioms of Perl like `}{` in single line use; `=()=` `~~` `~-$i`; `@{[]}` and `-+-` and their effects on strings and lists. But these are idioms unlike `++` `--` and `<=>` which are real operators. :-)
drewk
@drewk Your proof that I should not have called it a secret operator is that others in the community have called them composite operators? A simple Google search brings me no mention of composite operators except in articles about **secret** operators. But that is beside the point, it is still calling them operators, which you object to. Unless you have a better name that doesn't use the term operator and get a bunch of people to agree with you, I am not changing my ways or the `perlopquick` document.
Chas. Owens
@Chas: I am not trying to 'prove' or criticize. Really! Promise! I am asking you to consider that a language that already has a 'read only' reputation is not necessarily helped by this blurry distinction in key documentation. Your earlier 'Pseudo operators' in your previous SO post was better; "secret" in quotes is better; `Common perl operator idioms` is way better; Secret operators with no quotes and no distinction that it is not a true operator is a confusing choice IMHO. I respect what you are doing with perlopquick. I just humbly ask you to consider this... :-))
drewk
@drewk: speaking as a perlmonks admin, perlmonks doesn't call them anything, but perl monks call them whatever they please.
ysth
@ysth: yes, granted, true. I really am just speaking anecdotally that the term I see most often there is `composite operators` vs `secret operators`. Both are less than ideal IMHO. `Perl operator idiom` is better or at least use quotes. I just think calling these `operators` has high potential to be confusing to Perl newcomers. You and Chas are in a position to be influencers and teachers rather than just describing what Google sees. Heck! Google "Bush+responsible+for+9+11" has 12 million hits, and that doesn't make the conspiracy theories true!!! >;-)
drewk
A: 

(Chas's answer wasn't quite right -- he missed a flag on the m// operator.) :)

use strict;
use warnings;

use Data::Dumper;

my @good;
foreach my $line (<DATA>)
{
    chomp $line;
    my $matches =()= ($line =~ /\b\w+\b/g);
    print "(debugging) found matches $matches\n";
    push @good, $line if $matches >= 2 and $matches <= 5;
}

print "matching lines: ", Dumper(\@good);

__DATA__
foo bar baz bap
foo bar baz
blah blah blah foooo

bip

produces:

(debugging) found matches 4
(debugging) found matches 3
(debugging) found matches 4
(debugging) found matches 0
(debugging) found matches 1
matching lines: $VAR1 = [
          'foo bar baz bap',
          'foo bar baz',
          'blah blah blah foooo'
        ];
Ether
A: 

Replace the 3 with how many words you are looking for. This regex assumes no spaces or tabs start the line:

^(?=(\b[A-Za-z0-9.]+\b[\x20]){3})(.)*

This says: starting at the beginning of each line, look ahead for 3 words made up of alphanumeric characters or periods, each followed by a single space. If the lookahead succeeds, select the entire line, no matter what else is on it.

Note: \x20 matches a space character. The regex was written by hand, from memory, in Notepad++.
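
A minimal Perl sketch for trying this pattern out. The variable names and sample lines are illustrative assumptions, and the word count is pulled into $n only for convenience:

```perl
use strict;
use warnings;

my $n  = 3;   # number of words to look ahead for
my $re = qr/^(?=(\b[A-Za-z0-9.]+\b[\x20]){$n})(.)*/;

# Illustrative sample lines, not from the original post.
my @samples = (
    "one two",
    "one two three",
    "one two three four",
    " one two three four",
);

my %result;
for my $line (@samples) {
    $result{$line} = $line =~ $re ? "match" : "no match";
    print "'$line': $result{$line}\n";
}
```

With these inputs only "one two three four" matches. That suggests the pattern, as written, selects lines that start with at least three words each followed by a space, so a line of exactly three words with nothing after the last one is not selected.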

Mike Cheel
It assumes a word starts the line. You will need to add some type of match to the front of the regex to cover that. I can help if need be.
Mike Cheel
A: 

Here's a KISS way.

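while (<>) {
  # assumption: words are separated by whitespace
  # split ' ' also skips leading whitespace, so no empty first field
  my @s = split ' ';
  # now check the length of @s, e.g. print lines with 2 to 5 words
  print if @s >= 2 && @s <= 5;
}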
ghostdog74
This one actually works great for my needs also! Thanks!