I have several HTML blocks on a page set up like:

<p class="something">
    <a href="http://example.com/9999">text 1 2 3</a>
    <a href="http://example.com/2346saasdf">text 3 4 5</a>
    (9999)
    <a href="http://example.com/sad3ws">text 5 6 7random</a>
</p>

I want to get the number that is in the parentheses, including the parentheses themselves. I have to admit I've never really used regex before: I've read about it and seen examples, but I haven't used it myself. Anyway, after a little looking around, I came up with this:

<p class="something">(.*?)</p>

That correctly gets the entire <p> block, but again, I just want the (9999) (with parentheses intact). I'm not really sure how to get it.

Assuming that other elements on the page could also have digits in parentheses (but they won't be included in this exact format), and that the HTML will remain valid and consistent, how can I get it?

I understand this is probably easy for someone who has used regex before, but for the solution, I'd appreciate a little detail on what each character captures so I can learn from it.

+6  A: 

Don't use regex to parse HTML.

Instead, use an HTML parser, then simply read the text (non-tag) content within the desired <p> block.

jQuery is a pretty decent HTML parser, so you can get the desired text stored in a variable x using:

var x = $('p').clone().find('a').remove().end().text();

If you can't use jQuery to make your life easy for whatever reason, you can use raw JavaScript at the DOM:

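// Clone the first <p> so the live DOM is left untouched.
var y = document.getElementsByTagName("p")[0].cloneNode(true);
var x = "";
// Index loop: for...in on a NodeList also visits non-node properties such as length.
for (var k = 0; k < y.childNodes.length; k++) {
    if (y.childNodes[k].nodeType === 3) { // 3 = text node
        x += y.childNodes[k].textContent;
    }
}
x = x.trim();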

Mark E
I'm not trying to parse the entire internet. I'm only parsing one page whose content will remain consistent.
Corey
@Corey, if you choose to use a regex, you're still doing it the hard way.
jball
@Corey: The easiest way to do this is with an HTML parser, and that's particularly easy in JavaScript since the browser does all the heavy lifting. (see my edited post for an example of how trivial it is)
Mark E
A: 

If you really want to use Regex, the following pattern might work for you.

var re = /<\/a>\s*([^\s]+)\s*<a /ig;
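
As a minimal sketch of how this pattern could be applied, the snippet below runs it against the markup from the question. The variable names `html`, `match`, and `result` are illustrative only, and the `g` flag is dropped here on the assumption that only the first occurrence is needed.

```javascript
// Sample markup copied from the question (html is a hypothetical variable name).
const html = `<p class="something">
    <a href="http://example.com/9999">text 1 2 3</a>
    <a href="http://example.com/2346saasdf">text 3 4 5</a>
    (9999)
    <a href="http://example.com/sad3ws">text 5 6 7random</a>
</p>`;

// Same pattern as above without the g flag: one exec() call is enough,
// and a non-global regex keeps no lastIndex state between calls.
const re = /<\/a>\s*([^\s]+)\s*<a /i;

const match = re.exec(html);

// match[1] is the first capture group: the run of non-whitespace text
// sitting between a closing </a> and the next opening <a.
const result = match ? match[1] : null;

console.log(result); // prints "(9999)"
```

Note that the pattern is not tied to the `<p class="something">` block, so on a full page it would also pick up any other non-whitespace text that happens to sit between two links.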
z1x2
+1  A: 

With most regex engines, parentheses group parts of the expression; they do not match literal parentheses in the input.

As such, in this pattern (which you say works, somewhat):

<p class="something">(.*?)</p>
                     ^   ^
                     |   |
                     +---+--- creates a group

Since this "works", you can just extract the contents of that group, but that would give you the parentheses as well.

I would try this:

<p class="something">\((.*?)\)</p>
                     ^^     ^^
                      |     |
                      +-----+-- matches (...)

And then extract the contents of the first group.

Now, as for what each character means:

<p class="something">\((.*?)\)</p>

<p class="something">                 match <p class="something">
                     \(               match (, without the \ it would be a group
                       (              create a group
                        .             match one character (usually not newlines)
                         *            ... repeated zero or more times
                          ?           ... in a non-greedy way
                           )          end the group
                            \)        match )
                              </p>    match </p>
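
The pattern above expects the parentheses to sit directly against the <p> and </p> tags. In the question's markup the block also contains links and line breaks, so a two-step variation may be easier to work with: first isolate the block, then look for the parenthesized number inside it. This is only a sketch, assuming JavaScript (the question does not name a language); the names `page`, `blockMatch`, `numberMatch`, and `parenthesized` are made up for illustration.

```javascript
// Sample markup copied from the question (page is a hypothetical variable name).
const page = `<p class="something">
    <a href="http://example.com/9999">text 1 2 3</a>
    <a href="http://example.com/2346saasdf">text 3 4 5</a>
    (9999)
    <a href="http://example.com/sad3ws">text 5 6 7random</a>
</p>`;

// Step 1: isolate the block. [\s\S] is used instead of . so the
// lazy match can span line breaks.
const blockMatch = /<p class="something">([\s\S]*?)<\/p>/.exec(page);

// Step 2: inside the block, find a literal "(", one or more digits,
// and a literal ")". The backslashes make the parentheses literal.
const numberMatch = blockMatch ? /\(\d+\)/.exec(blockMatch[1]) : null;

// numberMatch[0] is the whole match, parentheses included.
const parenthesized = numberMatch ? numberMatch[0] : null;

console.log(parenthesized); // prints "(9999)"
```

Splitting the work in two keeps the search for digits confined to the one block, so parenthesized numbers elsewhere on the page are not picked up.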
Lasse V. Karlsen