Consider the following two regular expression snippets and the dummy HTML that they should match:

Apparently, I can only post one link until I get more reputation, so the link below contains the three links I referenced above:

http://pastebin.com/Qj1uxfdk

The difference between the two snippets, if anyone is wondering, is a removed (((.{2,20}?), (.{2,20}?))?) about half-way through the snippet.

The first snippet does not match the text, but the second one does, and I cannot figure out why. I tried putting a dummy expression that should match anything in its place (such as (.{1})?) and it still fails to match it, but when I remove it, it suddenly matches again.

I've been toiling with this stupid expression for the last 4 hours and I'm about at my wits' end. Can anybody help?

A: 

I am terribly sorry. I know this answer won't be much appreciated, for various reasons, but I feel that I have to say it anyway.

It seems to me that you are probably using the wrong tool. I suggest that you use a real parser, one that is intended to parse (X)HTML/XML. I think HTML contains far more subtleties than you can realistically catch with a regular expression. I haven't written any PHP for quite some time, but I am sure it has the necessary tools to do the parsing for you (maybe this?).

Of course it is exciting to do everything yourself, but it is more practical to take advantage of what's been done (and tested) for you.
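
To make the suggestion concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The sample markup, the `view.php` URL, the `clean_cell` helper, and the assumption that the second cell holds the name are all made up for illustration; only the `employee_id` parameter comes from this thread. The second row deliberately omits a `</td>` to show that the parser tolerates sloppy markup.

```php
<?php
// Hypothetical sample rows; the second one is deliberately missing a </td>.
$html = '<table>
  <tr>
    <td><a href="view.php?employee_id=42">jdoe</a>&nbsp;</td>
    <td>Doe, John&nbsp;</td>
  </tr>
  <tr>
    <td><a href="view.php?employee_id=7">asmith</a>&nbsp;</td>
    <td>Smith, Anna
  </tr>
</table>';

// Hypothetical helper: strips ordinary whitespace and the UTF-8
// non-breaking space that &nbsp; is decoded to.
function clean_cell(string $text): string
{
    return trim(str_replace("\xC2\xA0", ' ', $text));
}

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$employees = [];
foreach ($xpath->query('//tr') as $row) {
    $link = $xpath->query('.//a[contains(@href, "employee_id=")]', $row)->item(0);
    $cells = $xpath->query('./td', $row);
    if ($link === null || $cells->length < 2) {
        continue; // not an employee row
    }
    // Read the id from the link's query string rather than pattern-matching it.
    $query = (string) parse_url($link->getAttribute('href'), PHP_URL_QUERY);
    parse_str($query, $params);
    $employees[] = [
        'id'    => $params['employee_id'] ?? null,
        'login' => clean_cell($link->textContent),
        'name'  => clean_cell($cells->item(1)->textContent), // assumed column order
    ];
}

print_r($employees);
```

Each additional column is then one more `$cells->item(n)` lookup instead of another group in an already long pattern.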

I hope that you will keep this in mind.

PS: Yes, I know that the usual "Do not parse XML with regex" statement is extremely trite, but that doesn't stop it from being true in the majority of cases.

shylent
Congratulations, you have convinced me to try it this way instead. Thanks for being nice about it instead of telling me 'LOL UR DOIN IT RONG'.
Paragone
A: 

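It was a bit easier to rewrite it than to debug it, so here is my approach:

// Flags: s lets . match newlines, x ignores whitespace inside the pattern.
// PREG_SET_ORDER makes $result[n] hold all the captures of the n-th matched row.
preg_match_all(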
    '%<tr>[^<]*
      <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*><a[^>]*>(.*?)</a>.*?&nbsp;</td>[^<]*
      <td[^>]*>(\d{3}\.\d{3}\.\d{4}).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
    </tr>%sx', 
    $subject, $result, PREG_SET_ORDER);

It works for your example, and you can tweak it if you want more or less validation.
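
In case the shape of `$result` is not obvious, here is a rough sketch of reading the matches back out. To keep it short, the pattern is cut down to the first two `<td>` lines of the one above, and the sample rows (including the `view.php` URL and the `class` attribute) are invented for the demonstration, not taken from the real page.

```php
<?php
// Hypothetical sample rows shaped like the cells the pattern expects.
$subject = '<tr>
  <td class="c"><a href="view.php?employee_id=42">jdoe</a>&nbsp;</td>
  <td>Doe, John&nbsp;</td>
</tr>
<tr>
  <td class="c"><a href="view.php?employee_id=7">asmith</a>&nbsp;</td>
  <td>Smith, Anna&nbsp;</td>
</tr>';

// Only the first two <td> lines of the full pattern, for brevity.
$count = preg_match_all(
    '%<tr>[^<]*
      <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
    </tr>%sx',
    $subject, $result, PREG_SET_ORDER);

// With PREG_SET_ORDER, each $row is one matched <tr>:
// index 0 is the whole match, 1..4 are the capture groups in order.
foreach ($result as $row) {
    list(, $id, $login, $last, $first) = $row;
    echo "$id: $login ($first $last)\n";
}
```

Without `PREG_SET_ORDER` the default is `PREG_PATTERN_ORDER`, where `$result[1]` would instead hold every id, `$result[2]` every login, and so on.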

Diadistis
Thank you for actually attempting to answer my question as opposed to telling me I'm doing it wrong. If I don't succeed in rewriting it to parse the whole DOM tree (which I suspect I might not, due to the poor quality of the HTML that I have to parse), I will definitely come back to this.
Paragone
A: 
Kuchen
The reason I originally wanted to use regex is that I only needed to pull one matching HTML group from a page, and it seemed a more efficient use of my time to brew up a regex string than to write a whole DOM parser. I also wasn't sure a parser would be feasible, because the HTML I'm parsing is *very* poorly formed, but supposedly DOMDocument works even for malformed HTML documents, which I didn't know. In any event, regex is being too much of a pain, so I've officially given up on it.
Paragone