ansaurus

Question

Answer 1

A:

This depends a little on the syntax of your actual regex library or tool, but basically use something like this:

<span id="label_85110"><b>([^<]+)</b>

Then you can access the first match group via some API.

Extract the last name in a similar way.
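
A rough sketch of what "access the first match group via some API" can look like, here with Python's re module. Both the choice of Python and the sample markup are assumptions (the question's actual HTML isn't reproduced in this thread), so treat it as an illustration rather than a drop-in solution:

```python
import re

# Hypothetical input: the real markup from the question is not shown here.
html = '<span id="label_85110"><b>First Name</b></span>'

# Same pattern as above; the parentheses form capture group 1.
pattern = re.compile(r'<span id="label_85110"><b>([^<]+)</b>')

match = pattern.search(html)
# group(0) is the whole match, group(1) is the text captured between <b> and </b>.
captured = match.group(1) if match else None
print(captured)
```

Whatever sits between the b tags of that span ends up in group 1. If the pattern does not match at all, search() returns None, which is why the sketch checks before calling group().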

Btw, some may argue: 'regex are the wrong tool for extracting data from HTML !!elf!1!'

Well, that is up to the poster. We don't know the details. Perhaps for his restricted use case everything else is overkill (e.g. a one-time analysis where the input data is guaranteed to always use the posted skeleton).

maxschlepzig 2010-08-31 19:54:40

-1 as "we don't know the details" is exactly why we can't encourage any poster to use RegEx to parse HTML. In the absence of other information, the norm is that you shouldn't parse (X)HTML with RegEx.

LeguRi 2010-08-31 20:29:09

I am not encouraging the poster. The poster got a few comments about possible disadvantages/pitfalls of using regexes. It is his decision what to do. I posted the answer so that, if he decides in favor of regex, he gets a hint on how to use regex group matching.

maxschlepzig 2010-08-31 21:21:55

+1 for trying to give the poster what he's asking for. He's not asking for a diatribe on why he shouldn't... he wants to know how. He's not even asking for a full blown parser... he just wants to extract some text.

bgould 2010-08-31 21:51:40

@bgould - I agree that the OP doesn't need a diatribe; linking to [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) has never been useful for anyone. Instead we have to explain _why_ what the OP is doing is not a good idea; your answer is good (and not voted down) as you emphasize that your solution is limited in scope and not a general solution. @maxschlepzig's answer doesn't concede that point and instead challenges the competence of those who suggest otherwise.

LeguRi 2010-09-01 14:25:35

Using an HTML/XML parser is hardly overkill. It is only a few more lines of code. The hardest part is conceptual. It means the OP gets a shiny new tool for his belt.

Byron Whitlock 2010-09-01 18:47:57

Answer 2

+3 A:

@oliver1,

Please note that the keyword in Regular Expression is "Regular." Regular Expressions are used with Regular Languages.

Unfortunately, (X)HTML is not a Regular Language. Rather, it is a Context Free Language.

You cannot write a RegEx which can properly parse a Context Free Language with arbitrarily nested structure; this is a mathematically proven reality.
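
To see the practical symptom (this is an illustration, not a proof), here is a small Python sketch. The div markup is made up for the example and is not taken from the question:

```python
import re

# Made-up inputs: one nested structure, one pair of siblings.
nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# A lazy match stops at the FIRST closing tag, cutting the outer element short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# A greedy match runs to the LAST closing tag, swallowing both siblings.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))
print(repr(greedy))
```

Neither variant can pair each opening tag with its own closing tag, because that requires counting nesting depth, which a regular expression has no way to do.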

The Solution: Use XPath

Instead you should use an XML parser; you are already using XHTML, which means you could use XPath (although you're missing an opening tag at the beginning of your code snippet).

How can any parser, RegEx or query identify the first names and last names? The best I see is "<span> elements which come after a <br/>", which is pretty weak.

You can nonetheless write an XPath query to find "<span> elements which come after a <br/>".

//br/following-sibling::span/text()

... but that also finds the values of Email and Phone, so you'll want only the first two results.
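
A hedged sketch of the same idea in Python. The standard library's xml.etree.ElementTree only implements a subset of XPath and has no following-sibling axis, so the sketch walks the children by hand instead of running the query above. The sample markup, the wrapping p element, and the ids label_1/value_1 are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the question's markup.
SAMPLE = (
    '<p>'
    '<span id="label_85110"><b>First Name</b></span><br/>'
    '<span id="value_85110">Aweber- Email Parser</span><br/>'
    '<span id="label_86004"><b>Last Name</b></span><br/>'
    '<span id="value_86004">Submission</span><br/>'
    '<span id="label_1"><b>Phone</b></span><br/>'
    '<span id="value_1">919-923-7017</span>'
    '</p>'
)

def span_texts_after_br(root):
    """Collect the direct text of every <span> that follows a <br/> sibling."""
    texts = []
    for parent in root.iter():
        seen_br = False
        for child in parent:
            if child.tag == "br":
                seen_br = True
            elif child.tag == "span" and seen_br and child.text:
                # Label spans keep their text inside <b>, so child.text is None there.
                texts.append(child.text)
    return texts

values = span_texts_after_br(ET.fromstring(SAMPLE))
first_name, last_name = values[:2]
print(first_name, "|", last_name)
```

With a full XPath engine you would run the query directly; the manual walk only stands in for the missing axis.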

Alternately, you could instead use the id attributes on the <span> elements:

//span[@id='value_85110']/text()|//span[@id='value_86004']/text()
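
For example, a minimal Python sketch of the id lookup using the standard library's xml.etree.ElementTree. Its XPath subset has no text() step, so the sketch reads the .text attribute instead. The sample markup and the wrapping p element are assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the question's markup.
SAMPLE = (
    '<p>'
    '<span id="label_85110"><b>First Name</b></span><br/>'
    '<span id="value_85110">Aweber- Email Parser</span><br/>'
    '<span id="label_86004"><b>Last Name</b></span><br/>'
    '<span id="value_86004">Submission</span>'
    '</p>'
)

root = ET.fromstring(SAMPLE)

def text_by_id(root, span_id):
    """Return the text of the <span> with the given id, or None if absent."""
    node = root.find(f".//span[@id='{span_id}']")
    return node.text if node is not None else None

first_name = text_by_id(root, "value_85110")
last_name = text_by_id(root, "value_86004")
print(first_name, "|", last_name)
```

This only works if the input really is well-formed XML; for sloppy real-world HTML you would need a forgiving HTML parser in front of it.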

If You Can Modify The HTML

Ideally, my suggestion is to make your XHTML more semantic:

<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">Aweber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">[email protected]</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>

Enhance it with CSS (instead of using <b> and <br/> all over the place)...

label {
    font-weight:bolder;
    display:block;
    margin-top:5px;
}
span {
    display:block;
    margin-bottom:5px;
}

... and then use an XPath query like so:

//span[@class='first-name'] | //span[@class='last-name']
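
A short Python sketch of querying the semantic markup by class. As far as I know, the XPath subset in xml.etree.ElementTree has no union operator, so the sketch runs the two halves of the query separately. The div wrapper is a hypothetical root element added so the fragment parses:

```python
import xml.etree.ElementTree as ET

# The <div> wrapper is an assumption; the spans are the ones suggested above.
SEMANTIC = (
    '<div>'
    '<label for="first-name-1">First Name</label>'
    '<span id="first-name-1" class="first-name">Aweber- Email Parser</span>'
    '<label for="last-name-1">Last Name</label>'
    '<span id="last-name-1" class="last-name">Submission</span>'
    '</div>'
)

root = ET.fromstring(SEMANTIC)

def texts_by_class(root, css_class):
    """Return the text of every <span> whose class attribute equals css_class."""
    return [span.text for span in root.findall(f".//span[@class='{css_class}']")]

first_names = texts_by_class(root, "first-name")
last_names = texts_by_class(root, "last-name")
print(first_names, last_names)
```

Note that the predicate compares the whole class attribute, so a span with several classes (e.g. class="first-name highlighted") would not match.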

LeguRi 2010-08-31 20:43:54

Why do you assume that the poster can influence the generation of the HTML? If he could, he would not need to parse it in the first place ... He could just do a normal DB query then ...

maxschlepzig 2010-08-31 21:24:57

@maxschlepzig - Edited to emphasize "Use XPath" as opposed to "Fix the HTML"

LeguRi 2010-09-01 03:17:49

Answer 3

A:

Disclaimer: This is just an answer to the problem, not an endorsement of using regex for this purpose.

<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)[\S\s]+?Phone[\S\s]+?<\/p>

Then just grab groups 1 and 2 for each match. I tested this with Firefox's JavaScript flavor of regex.
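
A sketch of "grab groups 1 and 2 for each match", using Python's re module rather than the JavaScript flavor the pattern was tested with. The sample markup, including the ids label_1/value_1, is an assumption, since the question's HTML is not reproduced here:

```python
import re

# Hypothetical input shaped like the skeleton the pattern seems to expect.
html = (
    '<p>'
    '<span id="label_85110"><b>First Name</b></span><br/>'
    '<span id="value_85110">Aweber- Email Parser</span><br/>'
    '<span id="label_86004"><b>Last Name</b></span><br/>'
    '<span id="value_86004">Submission</span><br/>'
    '<span id="label_1"><b>Phone</b></span><br/>'
    '<span id="value_1">919-923-7017</span>'
    '</p>'
)

# The pattern from the answer, unchanged, as a raw string.
pattern = re.compile(
    r'<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)'
    r'(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)'
    r'[\S\s]+?Phone[\S\s]+?<\/p>'
)

# Each match yields one (first name, last name) pair from groups 1 and 2.
names = [(m.group(1), m.group(2)) for m in pattern.finditer(html)]
print(names)
```

If the input contains several such p blocks, finditer yields one match per block, so the list ends up with one pair per record.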

From a philosophical standpoint XPath is probably a more robust solution if you have an XPath-capable HTML parser or if you are sure that you are working with valid XML, which what you posted is not (missing a document root node and an opening tag at the beginning).

bgould 2010-09-01 14:06:32


Big Regular expression help
