Hello, I am trying to capture "Rio Grande Do Leste" from:

...
<h1>Rio Grande Do Leste<br />
...

using

var myregexp = /<h1>()<br/;

var nomeAldeiaDoAtaque = myregexp.exec(document);

what am I doing wrong?

update:

2 questions remain:

1) Searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?

2) I had to change it to myregexp.exec(document.body.innerHTML)[1]; to get what I want; otherwise it gave me a result that includes <h1>. Why is that?

3) (answered) Why do I need to use ".*"? I thought it would collect anything between ()?

+8  A: 

Try /<h1>(.*?)<br/.

Will A
it worked, I have updated the first post with some info, can you review it please? thanks
Fernando SBS
A: 

or

^(<h1>)(.)+(<br />)

go here to test gskinner.com

PurplePilot
Plus outside brackets - will not capture the text.
Amadan
+5  A: 

On capturing group

A capturing group attempts to capture what it matches. This has some important consequences:

  • A group that matches nothing can never capture anything.
  • A group that only matches an empty string can only capture an empty string.
  • A group that captures repeatedly in a match attempt only gets to keep the last capture.
    • Generally true for most flavors, but .NET regex is an exception (see related question)

Here's a simple pattern that contains 2 capturing groups:

(\d+) (cats|dogs)
\___/ \_________/
  1        2

Given i have 16 cats, 20 dogs, and 13 turtles, there are 2 matches (as seen on rubular.com):

  • 16 cats is a match: group 1 captures 16, group 2 captures cats
  • 20 dogs is a match: group 1 captures 20, group 2 captures dogs

Now consider this slight modification on the pattern:

(\d)+ (cats|dogs)
\__/  \_________/
 1         2

Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):

  • 16 cats is a match: group 1 captures 6, group 2 captures cats
  • 20 dogs is a match: group 1 captures 0, group 2 captures dogs
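
The two patterns above can be compared directly in JavaScript. This is a minimal sketch; the variable names (sentence, insideGroup, outsideGroup) are made up for illustration, and only the first match is inspected.

```javascript
var sentence = "i have 16 cats, 20 dogs, and 13 turtles";

// (\d+): the + is inside the group, so group 1 captures every digit it matched.
var insideGroup = /(\d+) (cats|dogs)/.exec(sentence);

// (\d)+: the + is outside the group, so the group is re-entered once per digit
// and only the last repetition is kept.
var outsideGroup = /(\d)+ (cats|dogs)/.exec(sentence);

// Index 0 is the whole match; indexes 1 and 2 are groups 1 and 2.
console.log(insideGroup[0], "|", insideGroup[1], "|", insideGroup[2]);    // 16 cats | 16 | cats
console.log(outsideGroup[0], "|", outsideGroup[1], "|", outsideGroup[2]); // 16 cats | 6 | cats
```

Both patterns match the same text, 16 cats; only what group 1 keeps is different.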

On greedy vs reluctant vs negated character class

Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.

We use the following as input:

eeAiiZooAuuZZeeeZZfff

We use 3 different patterns:

  • A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
    • This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
  • A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
    • This is the reluctant variant; group 1 matched and captured iiZooAuu
  • A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
    • This is the negated character class variant; group 1 matched and captured uu

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
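
The same three matches can be reproduced in JavaScript. This is only a sketch to check the results listed above; the variable names are made up for illustration.

```javascript
var input = "eeAiiZooAuuZZeeeZZfff";

// Greedy: .* takes as much as possible, then backtracks to the LAST "ZZ".
var greedy = /A(.*)ZZ/.exec(input);

// Reluctant: .*? takes as little as possible, so it stops at the FIRST "ZZ"
// that follows the first "A".
var reluctant = /A(.*?)ZZ/.exec(input);

// Negated character class: [^Z]* can never step over a "Z", so the attempt
// starting at the first "A" fails and the match starts at the second "A".
var negated = /A([^Z]*)ZZ/.exec(input);

console.log(greedy[0], "|", greedy[1]);       // AiiZooAuuZZeeeZZ | iiZooAuuZZeee
console.log(reluctant[0], "|", reluctant[1]); // AiiZooAuuZZ | iiZooAuu
console.log(negated[0], "|", negated[1]);     // AuuZZ | uu
```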

See related question for a more in-depth treatment on the difference between these 3 techniques.

Going back to the question

So let's go back to the question and see what's wrong with the pattern:

<h1>()<br
    \/
     1

Group 1 matches the empty string, therefore the whole pattern overall can only match <h1><br, and group 1 can only capture the empty string.

One can try to "fix" this in many different ways. The 3 obvious ones to try are:

  • <h1>(.*)<br; greedy
  • <h1>(.*?)<br; reluctant
  • <h1>([^<]*)<br; negated character class
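
Here is a rough sketch of how the original pattern and the three candidates behave in JavaScript. The html string is a hypothetical sample: the question only shows the first <br />, and the second one is added here so that the variants disagree. It also touches on the two open questions from the update: exec() returns an array whose index 0 is the whole match and whose index 1 is group 1, and it converts a non-string argument to a string before matching.

```javascript
// Hypothetical sample input; everything after the first "<br />" is assumed.
var html = "<h1>Rio Grande Do Leste<br />more text<br /></h1>";

// Original pattern: the empty group means "<h1>" must be followed directly by "<br".
var original = /<h1>()<br/.exec(html);        // null, there is no "<h1><br" in html

var greedyFix = /<h1>(.*)<br/.exec(html);
var reluctantFix = /<h1>(.*?)<br/.exec(html);
var negatedFix = /<h1>([^<]*)<br/.exec(html);

// Index 0 is the whole match (it includes "<h1>" and "<br"); index 1 is group 1.
console.log(reluctantFix[0]); // <h1>Rio Grande Do Leste<br
console.log(reluctantFix[1]); // Rio Grande Do Leste
console.log(negatedFix[1]);   // Rio Grande Do Leste
console.log(greedyFix[1]);    // Rio Grande Do Leste<br />more text

// exec() stringifies its argument first. A plain object becomes
// "[object Object]", which contains no "<h1>", so the result is null.
// Passing a DOM document object instead of its HTML text fails the same way.
var notAString = /<h1>(.*?)<br/.exec({});     // null
```

On this sample the greedy variant runs on to the last <br, which is one way these patterns can go wrong on real pages.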

You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it still probably won't work "right" 100% of the time.
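
If the page is already loaded in a browser, one alternative is to ask the DOM instead of the markup. The following is only a sketch under assumptions: firstTextInH1 is a made-up helper name, and it assumes the wanted text is the first child node of the first h1 on the page, which the question does not confirm.

```javascript
// Returns the text that comes first inside the first <h1>, or null.
// "doc" is expected to be a DOM Document (for example the global "document").
function firstTextInH1(doc) {
  var h1 = doc.querySelector("h1");
  if (!h1 || !h1.firstChild) return null;
  var node = h1.firstChild;
  // nodeType 3 is a text node; anything else (such as an element) is ignored.
  return node.nodeType === 3 ? node.nodeValue.trim() : null;
}
```

In a browser this would be called as firstTextInH1(document); for the markup in the question it should give the text before the first <br />.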

polygenelubricants
+1 for nice detailed answer
Sarfraz
yes, I like the detail, but most of my questions remain unsolved.
Fernando SBS