Hey guys!

I've just begun learning Python and I've run into a small problem. I need to parse a text file, more specifically an HTML file (but its syntax is so weird: divs after divs after divs. It's the result of Google's 'View as HTML' for a certain PDF that I can't seem to extract the text from, because it has a messy table done in MS Word).

Anyway, I chose a rather low-level approach because I just need the data asap and, since I'm beginning to learn Python, I figured learning the basics would do me some good too.

I've got everything done except for a small part in which I need to retrieve a set of integers from a set of divs. Here's an example:

<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>

Now, the numbers I want to retrieve are all the ones inside <nobr></nobr> (in this case, '588') and, since it's quite a messy file, I have to make sure that what I am getting is correct. To do so, the number inside <nobr></nobr> must be preceded by "left:1020", "left:1024" or "left:1028". This is because of the automatic conversion; in my opinion, the best choice would be to get all the numbers preceded by left:102[0-9].

To do so, I was trying to use:

for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
    out = o.group(1)

But so far, no such luck... How can I get those numbers?

Thanks in advance, J.

+1  A: 

Don't use regular expressions to parse HTML. BeautifulSoup will make light work of this.

As for your specific problem, it might be that you are missing a colon at the end of the first line:

for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index]):
    out = o.group(1)

If this isn't the problem, please post the error you are getting, and what you expect the output to be.
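For reference, here is a minimal, self-contained sketch of the regex approach with the colon in place. The sample string is hypothetical, modeled on the div from the question, and the capture group uses `\d+` instead of the original `(.*?)` so that only digits are picked up; that swap is an assumption about the data, not something the question confirms.

```python
import re

# Hypothetical sample data modeled on the div shown in the question.
html = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>'
    '<div style="position:absolute;top:522;left:300"><nobr>17</nobr></div>'
    '<div style="position:absolute;top:540;left:1024"><nobr>42</nobr></div>'
)

# Match only divs whose style ends in left:1020 through left:1029,
# and capture the digits inside <nobr>...</nobr>.
pattern = re.compile(r'left:102[0-9]"><nobr>(\d+)</nobr></div>')

# finditer yields one match object per hit; group(1) is the captured number.
numbers = [int(m.group(1)) for m in pattern.finditer(html)]
print(numbers)  # [588, 42]
```

The div with `left:300` is skipped because it does not match the `left:102[0-9]` prefix.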

Mark Byers
Yeah, I've heard about it, but I wasn't sure it would manage to handle all those weird divs, hence the low-level approach.
Hal
@Hal: BeautifulSoup can find tags based on attributes, and it can even accept regex as arguments for the search if you need that.
Mark Byers
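A rough sketch of what that attribute search could look like, assuming the third-party `beautifulsoup4` package is installed and that the divs look like the one in the question. The sample string and variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample data modeled on the div shown in the question.
html = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>'
    '<div style="position:absolute;top:522;left:300"><nobr>17</nobr></div>'
    '<div style="position:absolute;top:540;left:1024"><nobr>42</nobr></div>'
)

soup = BeautifulSoup(html, "html.parser")

# A compiled regex passed as an attribute filter is searched against the
# attribute value; (?!\d) rejects longer numbers such as left:10200.
divs = soup.find_all("div", style=re.compile(r"left:102[0-9](?!\d)"))

numbers = []
for div in divs:
    nobr = div.find("nobr")
    if nobr is None:
        continue  # div has no <nobr> child, nothing to extract
    text = nobr.get_text(strip=True)
    if text.isdigit():
        numbers.append(int(text))

print(numbers)
```

Compared with a single regex over the raw text, this keeps the attribute check and the text extraction as separate steps, which may be easier to adjust if the generated markup varies.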
Cool, didn't know it was so powerful. Anyway, I've practically finished the script, all that's missing is getting those integers. I guess I could simply make 10 searches, but that would be plain dumb and I'd like to learn how one could use regex on that string.
Hal
You did it. I wasn't getting any error at all; for some reason the damn thing would just output a blank space. Thanks for putting up with this noob crap, it's guys like you that make StackOverflow so awesome.
Hal