ansaurus

Question

How do I extract the data I need from an HTML file?

Answer 1

A:

You want to find the strings that are preceded by > and followed by <, ignoring leading and trailing whitespace. You can do this with a loop that looks at each character in the string, or with a regular expression, something like >[ \t]*[^<]+[ \t]*<.
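A minimal sketch of the regex idea, written for Python 3. The sample string is a shortened, made-up fragment modelled on the markup in the question, and the pattern is a simplified variant of the one above: it uses a capture group and leaves the whitespace trimming to strip().

```python
import re

# Hypothetical sample, modelled on the markup in the question.
html = (
    '<p><font class="test-proof">Full name</font> Foobar<br />'
    '<font class="test-proof">Born</font> July 7, 1923<br /></p>'
)

# Capture whatever sits between a '>' and the next '<'.
TEXT_BETWEEN_TAGS = re.compile(r">([^<]+)<")

# strip() trims the whitespace; runs that were only whitespace are dropped.
texts = [m.strip() for m in TEXT_BETWEEN_TAGS.findall(html) if m.strip()]
print(texts)
```

This only sees text that has a tag on both sides, and it would be confused by a > inside an attribute value, so treat it as a quick hack rather than a parser.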

You could also use re.split with a regex that matches the tags themselves, something like <[^>]*>, as the splitter. You will get some empty entries in the resulting list, but these are easily deleted.
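The re.split variant could look roughly like this in Python 3. Again the sample string is a made-up, shortened fragment, and pairing the pieces up as label/value assumes that labels and values strictly alternate, which is not true for every row in the question's data (the row with <span> tags would break it).

```python
import re

# Hypothetical sample, modelled on the markup in the question.
html = (
    '<p><font class="test-proof">Full name</font> Foobar<br />'
    '<font class="test-proof">Born</font> July 7, 1923<br /></p>'
)

# Any tag, opening or closing, acts as a separator.
TAG = re.compile(r"<[^>]*>")

# Splitting leaves empty strings wherever two tags touch; filter them out.
pieces = [p.strip() for p in TAG.split(html)]
texts = [p for p in pieces if p]

# Assumption: the remaining pieces alternate label, value, label, value...
pairs = dict(zip(texts[0::2], texts[1::2]))
print(pairs)
```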

jheriko 2009-02-18 13:03:57

Answer 2

+3 A:

The issue is that your HTML is not very well thought out -- you have a "mixed content model" where your labels and your data are interleaved. Your labels are wrapped in <font> tags, but your data is in NavigableString nodes.

You need to iterate over the contents of p_tag. There will be two kinds of nodes: Tag nodes (your <font> tags) and NavigableString nodes, which hold the other bits of text.

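from BeautifulSoup import Tag, NavigableString

label_value_pairs = []
label = None
for n in p_tag.contents:
    if isinstance(n, Tag) and n.name == "font":
        # A <font> tag holds the label for the text that follows it.
        label = n.string
    elif isinstance(n, NavigableString):
        # Bare text is the value; skip whitespace-only strings.
        value = n.strip()
        if label is not None and value:
            label_value_pairs.append((label, value))
    else:
        # Generally n.name == "br"
        pass
print(dict(label_value_pairs))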

Something approximately like that.
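If you want to see the mixed-content idea without BeautifulSoup installed, here is a rough stand-in using the standard library's xml.etree.ElementTree (Python 3). This is a different tool, not the BeautifulSoup API: ElementTree exposes the text after a tag as that tag's .tail instead of as separate string nodes, and it only works here because the shortened, made-up fragment happens to be well-formed XML. Real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment; must be well-formed XML for ElementTree.
fragment = (
    '<p class="foo-body">'
    '<font class="test-proof">Full name</font> Foobar<br />'
    '<font class="test-proof">Major teams</font> '
    '<span style="white-space: nowrap">Japan,</span> '
    '<span style="white-space: nowrap">Jakarta</span><br />'
    '</p>'
)

p_tag = ET.fromstring(fragment)
pairs = {}
label = None
parts = []
for child in p_tag:
    if child.tag == "font":
        # The label is inside <font>; the value starts in the text after it.
        label = child.text.strip()
        parts = [child.tail or ""]
    elif child.tag == "br":
        # <br /> ends the record: join the pieces and normalise whitespace.
        if label is not None:
            pairs[label] = " ".join("".join(parts).split())
        label = None
    else:
        # e.g. <span>: keep its own text plus whatever text follows it.
        parts.append(child.text or "")
        parts.append(child.tail or "")
print(pairs)
```

The loop has the same shape as the one above: element nodes carry the labels, and the loose text between them carries the values.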

S.Lott 2009-02-18 13:30:15

In "if isinstance(n, Tag)", what is Tag?

aatifh 2009-02-19 06:59:09

@neo, Tag and NavigableString are types from the BeautifulSoup module.

Constantin 2009-02-19 08:13:06

Answer 3

+3 A:

Sorry for the unnecessarily complex code, I badly need a big dose of caffeine ;)

import re

str = """<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>"""

# Each record is "<font ...>label</font> value<br />"; capture label and value.
R_EXTRACT_DATA = re.compile(r"<font\s[^>]*>\s*(.*?)\s*</font>\s*(.*?)\s*<br />", re.IGNORECASE)
# Matches opening and closing <span> tags so they can be removed from values.
R_STRIP_TAGS = re.compile(r"<span\s[^>]*>|</span>", re.IGNORECASE)

def strip_tags(text):
    """Strip unnecessary <span> tags."""
    return R_STRIP_TAGS.sub("", text)

def get_info(text):
    """Extract label/value pairs from the given string."""
    return dict((label, strip_tags(value))
                for label, value in R_EXTRACT_DATA.findall(text))

print(get_info(str))

Baishampayan Ghose 2009-02-18 13:41:10

Answer 4

+3 A:

I started answering this before I realised you were using BeautifulSoup, but here's a parser, written with the HTMLParser library, that I think works with your example string.

from HTMLParser import HTMLParser

results = {}
class myParse(HTMLParser):

   def __init__(self):
      self.state = ""
      HTMLParser.__init__(self)

   def handle_starttag(self, tag, attrs):
      attrs = dict(attrs)
      if tag == "font" and attrs.get("class") == "test-proof":
         self.state = "getKey"

   def handle_endtag(self, tag):
      if self.state == "getKey" and tag == "font":
         self.state = "getValue"

   def handle_data(self, data):
      data = data.strip()
      if not data:
         return
      if self.state == "getKey":
         self.resultsKey = data
      elif self.state == "getValue":
         if self.resultsKey in results:
            results[self.resultsKey] += " " + data
         else:
            results[self.resultsKey] = data


if __name__ == "__main__":
   p_tags = """<p class="foo-body">  <font class="test-proof">Full name</font> Foobar<br />  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />  <font class="test-proof">Current age</font> 27 years 226 days<br />  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />  <font class="test-proof">Also</font> bar<br />  <font class="test-proof">foo style</font> hand <br />  <font class="test-proof">bar style</font> ball<br />  <font class="test-proof">foo position</font> bak<br />  <br class="bar" /></p>"""
   parser = myParse()
   parser.feed(p_tags)
   print(results)

Gives the result:

{'foo position': 'bak',
'Major teams': 'Japan, Jakarta, bazz, foo, foobazz',
'Also': 'bar',
'Current age': '27 years 226 days',
'Born': 'July 7, 1923, foo, bar',
'foo style': 'hand',
'bar style': 'ball',
'Full name': 'Foobar'}
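On Python 3 the module is html.parser rather than HTMLParser. A sketch of the same state machine for Python 3 follows; the class name LabelValueParser, the shortened sample string, and keeping the results on the parser instance instead of in a module-level dict are my own choices for illustration, not part of the code above.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect <font class="test-proof"> labels and the text that follows."""

    def __init__(self):
        super().__init__()
        self.state = ""      # "", "getKey" or "getValue"
        self.key = None      # label currently being filled
        self.results = {}    # label -> value

    def handle_starttag(self, tag, attrs):
        # A labelled <font> tag means the next text is a key.
        if tag == "font" and dict(attrs).get("class") == "test-proof":
            self.state = "getKey"

    def handle_endtag(self, tag):
        # After </font>, text belongs to the value until the next label.
        if self.state == "getKey" and tag == "font":
            self.state = "getValue"

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.state == "getKey":
            self.key = data
        elif self.state == "getValue":
            # Values split across <span> tags are joined with a space.
            if self.key in self.results:
                self.results[self.key] += " " + data
            else:
                self.results[self.key] = data


# Hypothetical, shortened sample modelled on the question's markup.
sample = (
    '<p class="foo-body">'
    '<font class="test-proof">Full name</font> Foobar<br />'
    '<font class="test-proof">Major teams</font> '
    '<span style="white-space: nowrap">Japan,</span> '
    '<span style="white-space: nowrap">Jakarta</span><br />'
    '<font class="test-proof">foo style</font> hand <br />'
    '</p>'
)
parser = LabelValueParser()
parser.feed(sample)
parser.close()
print(parser.results)
```

Storing the results on the instance means two parsers can run without sharing state, which the module-level dict does not allow.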

sparklewhiskers 2009-02-18 13:50:26
