ansaurus

Question

Regex fails to match for no obvious reason.

Answer 1

A:

I am terribly sorry; I know this answer won't be much appreciated, for various reasons, but I feel I have to say it anyway.

It seems to me that you are probably using the wrong tool. I suggest that you use a real parser, one that is intended to parse (X)HTML/XML. I think HTML contains far more subtleties than you can realistically catch with a regular expression. I haven't written any PHP for quite a while, but I am sure it has the necessary tools to do the parsing for you (maybe this?).

Of course it is exciting to do everything yourself, but it is more practical to take advantage of what's been done (and tested) for you.

I hope you will keep this in mind.

PS: Yes, I know that the usual "Do not parse XML with regex" statement is extremely trite, but that doesn't stop it from being true in the majority of cases.
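As a rough illustration of the parser-based route, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, the `view.php` URL and the `$employees` array layout are assumptions made up for this example, since the asker's real HTML is not shown in the thread.

```php
<?php
// Hypothetical sample markup; the real page is not shown in the thread.
$html = <<<'HTML'
<table>
  <tr>
    <td><a href="view.php?employee_id=42">jdoe</a>&nbsp;</td>
    <td>Doe, John&nbsp;</td>
    <td>555.123.4567&nbsp;</td>
  </tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$employees = [];

foreach ($xpath->query('//tr') as $row) {
    // Only keep rows that contain an employee link.
    $link = $xpath->query('.//a[contains(@href, "employee_id=")]', $row)->item(0);
    if ($link === null) {
        continue;
    }

    // Pull employee_id out of the link's query string.
    $query = (string) parse_url($link->getAttribute('href'), PHP_URL_QUERY);
    parse_str($query, $params);

    $cells = [];
    foreach ($xpath->query('./td', $row) as $cell) {
        // &nbsp; is decoded to U+00A0 by the parser; strip it with other whitespace.
        $cells[] = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $cell->textContent);
    }

    $employees[] = ['id' => $params['employee_id'] ?? null, 'cells' => $cells];
}

print_r($employees);
```

The XPath expressions only depend on the row containing an employee link, so extra attributes or whitespace in the cells should not break the extraction the way they can with a long regular expression.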

shylent 2010-07-03 11:50:08

Congratulations, you have convinced me to try it this way instead. Thanks for being nice about it instead of telling me 'LOL UR DOIN IT RONG'.

Paragone 2010-07-03 12:12:22

Answer 2

A:

It was a bit easier to rewrite it than to debug it, so here's my approach:

// % is the pattern delimiter; s lets . match newlines, x ignores whitespace inside the pattern
preg_match_all(
    '%<tr>[^<]*
      <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*><a[^>]*>(.*?)</a>.*?&nbsp;</td>[^<]*
      <td[^>]*>(\d{3}\.\d{3}\.\d{4}).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
    </tr>%sx', 
    $subject, $result, PREG_SET_ORDER);

It works for your example, and you can tweak it if you want more or less validation.
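To show what `PREG_SET_ORDER` does to `$result`, here is a cut-down sketch that keeps only the first two cells of the pattern above. The one-row `$subject` string is a made-up stand-in for the real page.

```php
<?php
// Hypothetical two-cell row, much shorter than the real table.
$subject = '<tr> <td class="a"><a href="x.php?employee_id=42">jdoe</a>&nbsp;</td> '
         . '<td>Doe, John&nbsp;</td> </tr>';

// Same structure as the full pattern, reduced to the first two <td> cells.
$pattern = '%<tr>[^<]*
      <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
    </tr>%sx';

$count = preg_match_all($pattern, $subject, $result, PREG_SET_ORDER);

// With PREG_SET_ORDER, each element of $result is one complete match:
// index 0 is the whole matched row, indexes 1..n are the capture groups.
foreach ($result as $match) {
    printf("id=%s login=%s last=%s first=%s\n",
        $match[1], $match[2], $match[3], $match[4]);
}
```

Without the flag, `preg_match_all` groups results by capture group instead (`$result[1]` would hold every id), which is less convenient when each row should be handled as a unit.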

Diadistis 2010-07-03 12:07:52

Thank you for actually attempting to answer my question as opposed to telling me I'm doing it wrong. If I don't succeed in rewriting it to parse the whole DOM tree (which I suspect I might not, due to the poor quality of the HTML that I have to parse), I will definitely come back to this.

Paragone 2010-07-03 12:11:25

Answer 3

A:

Kuchen 2010-07-03 12:32:51

The reason I originally wanted to use regex is that I only needed to pull one matching HTML group from a page, and it seemed a more efficient use of my time to brew up a regex string than to write a whole DOM parser. I also wasn't sure that would even be feasible, because the HTML I'm parsing is *very* poorly formed. Supposedly DOMDocument works even for malformed HTML documents, though, which I didn't know. In any event, regex is being too much of a pain, so I've officially given up on it.
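On the malformed-HTML point, a small sketch of how DOMDocument can be tried against sloppy markup. The input string is invented for illustration, and how well recovery works on a real page depends on the underlying libxml2 HTML parser, so treat this as an experiment to adapt rather than a guarantee.

```php
<?php
// Deliberately sloppy, hypothetical markup: no closing </td>, </tr> or </table>.
$html = '<table><tr><td>Doe<td>John<tr><td>Roe<td>Jane';

// Keep libxml from emitting PHP warnings for every problem it finds.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors();   // any recorded problems, available for logging
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

The parser closes the dangling cells and rows itself, so the loop sees ordinary `tr` and `td` elements even though the source never closed them.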

Paragone 2010-07-03 12:55:40
