Hey guys!
I've just began learning Python and I've ran into a small problem. I need to parse a text file, more specifically an HTML file (but it's syntax is so weird - divs after divs after divs, the result of a Google's 'View as HTML' for a certain PDF i can't seem to extract the text because it has a messy table done in m$ word).
Anyway, I chose a rather low-level approach because i just need the data asap and since I'm beginning to learn Python, I figured learning the basics would do me some good too.
I've got everything done except for a small part in which i need to retrieve a set of integers from a set of divs. Here's an example:
<div style="position:absolute;top:522;left:1020"><nobr>*88</nobr></div>
Now the numbers i want to retrieve all the ones inside <nobr></nobr>
(in that case, '588') and, since it's quite a messy file, i have to make sure that what I am getting is correct. To do so, that number inside <nobr></nobr>
must be preceded by "left:1020"
, "left:1024"
or "left:1028"
. This is because of the automatic conversion and the best choice would be to get all the number preceded by left:102[0-]
in my opinion.
To do so, I was trying to use:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
out = o.group(1)
But so far, no such luck... How can I get those numbers?
Thanks in advance, J.