ansaurus

Question

Problem extracting text out of html file using python regex

Answer 1

+4 A:

Is there something I'm missing out with regards to regex and html?

Yes. You're missing the fact that some HTML cannot be parsed with a simple regex.

S.Lott 2010-07-31 13:22:54

Ouch. I was thinking that the above would simply match, since the only thing I was searching for was the word "binary". I understand that it isn't a good idea to use regex to process HTML, but in this scenario I don't understand why the regex does not match, because I'm not dealing with the tags at all.

M Rubern C 2010-07-31 14:25:42

@M Rubern C: You can't ignore the tags. What if your "binary" is `<b>b</b>inary` to make the "b" bold?

S.Lott 2010-07-31 15:51:18
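
To illustrate the point about tags splitting a word, here is a minimal sketch using only the standard library. The markup string and the names TextExtractor and visible_text are made up for this example and do not come from the asker's file. It shows a plain search for "binary" missing the word when one letter is wrapped in a tag, and finding it once the tags are dropped with html.parser.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Assumed sample input: the "b" of "binary" is bold.
markup = "<td>Target <b>b</b>inary file name:</td>"

raw_hit = re.search("binary", markup) is not None
text_hit = re.search("binary", visible_text(markup)) is not None

print("search on raw markup:", raw_hit)
print("search on extracted text:", text_hit)
```

Searching the extracted text rather than the raw markup is one way to stop caring about where the tags fall; a dedicated HTML library would do the same job with less code.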

Answer 2

A:

HTML as understood by browsers is waaaay too flexible for regular expressions. Attributes can pop up in any tag, in any order, in upper or lower case, and with or without quotation marks around the value. Special emphasis tags can show up anywhere. Whitespace is significant in a regex, but not so much in HTML, so your regex has to be littered with \s* everywhere. There is no requirement that opening tags be matched with closing tags. Some opening tags include a trailing '/', meaning that they are empty tags (no body, no closing tag). Lastly, HTML is often nested, and arbitrarily deep nesting is pretty much off the chart as far as regex is concerned.

Paul McGuire 2010-07-31 14:26:49
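
As a rough sketch of the attribute problem described above, the snippet below writes the same cell four ways and compares a literal regex with the standard library's html.parser, which lowercases tag and attribute names and strips the quotes. The variants list, the class value "name", and the TdClassFinder helper are all invented for illustration.

```python
import re
from html.parser import HTMLParser

# Four spellings of the same cell (assumed examples, not from the asker's file).
variants = [
    '<td class="name">x</td>',         # the form the regex expects
    "<TD CLASS='name'>x</TD>",         # upper case, single quotes
    "<td class=name>x</td>",           # no quotes at all
    '<td id="a" class="name">x</td>',  # another attribute comes first
]

# A literal pattern only covers the first spelling.
naive = re.compile(r'<td class="name">')


class TdClassFinder(HTMLParser):
    """Sets found=True on a <td> whose class attribute equals "name"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, values arrive unquoted.
        if tag == "td" and dict(attrs).get("class") == "name":
            self.found = True


def parser_finds(html):
    finder = TdClassFinder()
    finder.feed(html)
    finder.close()
    return finder.found


naive_hits = [naive.search(v) is not None for v in variants]
parser_hits = [parser_finds(v) for v in variants]

print("regex: ", naive_hits)
print("parser:", parser_hits)
```

The regex could be patched to cover each variant in turn, but every patch adds another optional group, which is the "littered with \s*" effect the answer warns about.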

Answer 3

A:

Is this actually what you're trying to do, or just a simple example for a more complicated regex later? If the latter, listen to everyone else. If the former:

for line in file:  # file: an already-open file object
    if "binary" in line:
        pass  # do stuff

If that doesn't work, are you sure "binary" is in the file? Not, I don't know, "<i>b</i>inary"?

katrielalex 2010-07-31 14:29:43

I was planning to use regex to parse and tried to write a simple example to test, but I've been convinced otherwise. I'm sure it appears as <td>Target binary file name:</td>. I'm just puzzled why the regex doesn't pick it up.

M Rubern C 2010-07-31 14:40:36
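
The asker's code is not shown in this thread, so the following is only a guess at one common cause rather than a diagnosis. In Python, re.match only tries to match at the start of the string, while re.search scans the whole string. A minimal sketch on the line quoted above:

```python
import re

# The line as quoted in the comment above.
line = "<td>Target binary file name:</td>"

# re.match is anchored at position 0, where the string starts with "<td>".
match_hit = re.match("binary", line)

# re.search looks for the pattern anywhere in the string.
search_hit = re.search("binary", line)

# For a fixed word, a plain substring test needs no regex at all.
in_hit = "binary" in line

print("re.match: ", match_hit is not None)
print("re.search:", search_hit is not None)
print("in:       ", in_hit)
```

If the original code used re.match, or a pattern anchored with ^, that alone would explain the miss even though the tags were never part of the pattern.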
