Hi all, I'm trying to filter certain data from an HTML file. For example, the HTML file is as follows:

<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]">software_0.1-0.log</td><td align="right">17-Nov-2009 13:46  </td><td align="right">186K</td></tr>

I need to extract the software_0.1-0 part as well as the 17-Nov-2009 part. How can I do this?

Thanks a lot.

+2  A: 

You can extract the strings of interest (and some surrounding text) using, for example, the popular BeautifulSoup package. Then you'll need some string manipulation (or maybe regular expressions) to isolate the exact parts of interest, but that depends on exactly which rules you want to apply: is it always the .log suffix you want to drop from the filename, is it always a space that separates the date from the time, and so forth. If you specify the rules precisely, they will not be hard to implement (without a precise specification, however, it would all be a big mess of guesses ;-).
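As a minimal sketch of what such rules might look like, assume (these are guesses, not something the question states) that the suffix to drop is always .log and that whitespace always separates the date from the time. The helper names strip_log_suffix and date_part are made up for illustration, and the code targets Python 3:

```python
def strip_log_suffix(filename):
    """Drop a trailing ".log" if present (assumed rule)."""
    suffix = ".log"
    if filename.endswith(suffix):
        return filename[:-len(suffix)]
    return filename


def date_part(timestamp):
    """Return the text before the first run of whitespace (assumed rule)."""
    fields = timestamp.split()
    return fields[0] if fields else ""


# The two strings as they appear in the sample row.
name = strip_log_suffix("software_0.1-0.log")
date = date_part("17-Nov-2009 13:46  ")
print(name)
print(date)
```

If the real files follow different rules (other suffixes, a different separator), only these two small functions would need to change.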

Alex Martelli
A: 

Try Beautiful Soup, a parser for HTML. You'll get a structured document out of it and can select the contents of the first and second td.

It may be overkill in this instance, but especially if your HTML comes from an outside source and can change, the maintenance guy will thank you for choosing a readable solution.
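To illustrate the "structured document, then pick the first and second td" idea without installing anything, here is a rough sketch that swaps Beautiful Soup for the standard library's html.parser module. The class name CellCollector is hypothetical, and the code assumes Python 3 and that the cells are not nested:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a new, empty cell and begin recording text.
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._in_cell:
            self.cells[-1] += data


html = '''<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]">software_0.1-0.log</td><td align="right">17-Nov-2009 13:46  </td><td align="right">186K</td></tr>'''

collector = CellCollector()
collector.feed(html)
collector.close()

cells = [cell.strip() for cell in collector.cells]
print(cells[0])  # first td: the file name
print(cells[1])  # second td: the date and time
```

Selecting cells by position keeps the intent readable: if the markup around the cells changes, the indexing still says which column is wanted.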

extraneon
+5  A: 

It's quite easy with BeautifulSoup:

html = '''<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]">software_0.1-0.log</td><td align="right">17-Nov-2009 13:46  </td><td align="right">186K</td></tr>'''

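# Uses the bs4 package (BeautifulSoup 4) on Python 3.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# First <td>: step past the <img> tag to the text node that follows it.
print(soup.td.next_element.next_element)
# Second <td>: its first child is the date/time text node.
print(soup.td.next_sibling.next_element)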

Output:

software_0.1-0.log
17-Nov-2009 13:46
Mark Byers
A: 

Your requirement seems simple, so here's a non-BeautifulSoup way, using plain string manipulation:

s="""<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]">software_0.1-0.log</td><td align="right">17-Nov-2009 13:46  </td><td align="right">186K</td></tr>"""

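# Split on ">" so each piece holds the text that precedes the next tag.
parts = s.split(">")
for i in parts:
    try:
        e = i.index("<")
    except ValueError:
        # No "<" in this piece (the empty remainder after the last tag).
        pass
    else:
        # Text before the next tag; empty when one tag directly follows another.
        print(i[:e])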

Each non-empty i[:e] is a piece of text that sat between two tags, so you can pick the "software" name and the date part out of those.
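A possible way to keep those pieces instead of printing them, sketched for Python 3. The list name pieces is made up for illustration, and the approach assumes the text itself never contains "<" or ">":

```python
s = """<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]">software_0.1-0.log</td><td align="right">17-Nov-2009 13:46  </td><td align="right">186K</td></tr>"""

pieces = []
for chunk in s.split(">"):
    end = chunk.find("<")
    # end == 0: the chunk is just a tag; end == -1: no tag follows.
    # Either way there is no text to keep.
    if end > 0:
        pieces.append(chunk[:end].strip())

print(pieces[0])  # the file name, still with its suffix
print(pieces[1])  # the date, still with the time attached
```

From here the suffix and the time can be trimmed with ordinary string methods, once the exact rules are settled.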

While this is technically true, it is still better to use Beautiful Soup because that will pay you dividends in the future when you have to do more complex HTML manipulations.
Michael Dillon
Until things actually become more complex, there's no need to bring in BeautifulSoup just for this case.