@oliver1,
Please note that the keyword in Regular Expression is "Regular." Regular Expressions are used with Regular Languages.
Unfortunately, (X)HTML is not a Regular Language. Rather, it is a Context Free Language.
You cannot write a RegEx which can properly parse a Context Free Language. This is a mathematically proven reality: a regular expression cannot keep track of arbitrarily deep nesting, and matching opening tags to their closing tags requires exactly that.
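To make the nesting problem concrete, here is a minimal sketch using Python's standard library. The markup is a hypothetical example, not taken from your snippet; it only illustrates how a simple pattern pairs the wrong tags while a parser does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nested markup, made up for this illustration.
markup = "<div>outer <div>inner</div> tail</div>"

# The lazy pattern stops at the first closing tag it meets, so it pairs
# the outer opening tag with the inner closing tag.
regex_capture = re.search(r"<div>(.*?)</div>", markup).group(1)

# A parser tracks nesting and builds a proper tree instead.
root = ET.fromstring(markup)
inner_text = root.find("div").text

print(repr(regex_capture))
print(repr(inner_text))
```

The regex captures `outer <div>inner`, which is not the content of either element, while the parser reports `inner` for the nested element.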
The Solution: Use XPath
Instead, you should use an XML parser. You are already using XHTML, which means you can query it with XPath (although you're missing a <p> at the beginning of your code snippet).
How can any parser, RegEx, or query identify the first names and last names? The best I see is "<span> elements which come after a <br />", which is pretty weak.
You can nonetheless write an XPath query to find "<span> elements which come after a <br />":
//br/following-sibling::span/text()
... but that also finds the values of Email and Phone, so you'll want only the first two results.
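As a rough sketch of the "first two results" idea in Python: the standard library's ElementTree does not implement the following-sibling axis, so the helper below (a hypothetical `spans_after_br`) emulates the query by hand for flat markup. The sample markup is an assumption, since your original snippet is not reproduced here; the email value and the `value_email` / `value_phone` ids are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the layout, the placeholder email, and the last two ids
# are guesses made for illustration only.
SAMPLE = """
<p>
  <b>First Name</b><br />
  <span id="value_85110">Aweber- Email Parser</span>
  <b>Last Name</b><br />
  <span id="value_86004">Submission</span>
  <b>Email</b><br />
  <span id="value_email">someone@example.com</span>
  <b>Phone</b><br />
  <span id="value_phone">919-923-7017</span>
</p>
"""


def spans_after_br(root):
    """Approximate //br/following-sibling::span for simple, flat markup."""
    found = []
    for parent in root.iter():
        seen_br = False
        for child in parent:
            if child.tag == "br":
                seen_br = True
            elif child.tag == "span" and seen_br:
                found.append(child)
    return found


root = ET.fromstring(SAMPLE.strip())
values = [span.text for span in spans_after_br(root)]

# Email and Phone are matched too, so keep only the first two results.
first_name, last_name = values[:2]
print(first_name, "|", last_name)
```

A library with full XPath 1.0 support could run the query directly; the slicing step would stay the same.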
Alternatively, you could use the id attributes on the <span> elements:
//span[@id='value_85110']/text()|//span[@id='value_86004']/text()
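A minimal sketch of the id-based lookup in Python, assuming markup roughly like the sample below (which id belongs to which field is a guess). ElementTree's XPath subset has no `|` union operator, so the hypothetical `text_by_id` helper runs one query per id.

```python
import xml.etree.ElementTree as ET

# Assumed markup; the mapping of ids to fields is illustrative only.
SAMPLE = """<p>
  <b>First Name</b><br />
  <span id="value_85110">Aweber- Email Parser</span>
  <b>Last Name</b><br />
  <span id="value_86004">Submission</span>
</p>"""

root = ET.fromstring(SAMPLE)


def text_by_id(root, element_id):
    """Return the text of the <span> with the given id, or None if absent."""
    span = root.find(f".//span[@id='{element_id}']")
    return span.text if span is not None else None


names = [text_by_id(root, "value_85110"), text_by_id(root, "value_86004")]
print(names)
```

Returning None for a missing id keeps the caller in control of how to handle markup that changes.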
If You Can Modify The HTML
Ideally, my suggestion is to make your XHTML more semantic:
<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">Aweber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">[email protected]</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>
Enhance it with CSS (instead of using <b> and <br/> all over the place)...
label {
  font-weight: bolder;
  display: block;
  margin-top: 5px;
}
span {
  display: block;
  margin-bottom: 5px;
}
... and then use an XPath query like so:
//span[@class='first-name'] | //span[@class='last-name']
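For illustration, the class-based query could be run from Python along these lines. This is a sketch under assumptions: the wrapping `<div>` is added only so the fragment has a single root, and the email value is a placeholder.

```python
import xml.etree.ElementTree as ET

# The semantic markup from above, wrapped in an assumed <div> root.
FORM = """<div>
  <label for="first-name-1">First Name</label>
  <span id="first-name-1" class="first-name">Aweber- Email Parser</span>
  <label for="last-name-1">Last Name</label>
  <span id="last-name-1" class="last-name">Submission</span>
  <label for="email-address-1">Email</label>
  <span id="email-address-1" class="email-address">someone@example.com</span>
</div>"""

root = ET.fromstring(FORM)

# One query per class, since ElementTree has no union operator.
wanted = ("first-name", "last-name")
fields = {cls: root.find(f".//span[@class='{cls}']").text for cls in wanted}
print(fields)
```

Note that `@class='first-name'` compares the whole attribute value, so a span carrying several classes (for example `class="first-name required"`) would not match this query.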