ansaurus

Question

Answer 1

+3 A:

Your job will be easier if you use BeautifulSoup.

Perhaps something like this:

from BeautifulSoup import BeautifulSoup
def count_images(htmltext):
    # count the <div class="detail-indent"> elements in the page
    soup=BeautifulSoup(htmltext)
    return len(soup.findAll('div',{'class':'detail-indent'}))

Or using lxml:

from lxml.html.soupparser import fromstring
def count_images(htmltext):
    # './/div' searches the whole tree; a bare 'div' only matches direct children of the root
    return len([e for e in fromstring(htmltext).findall('.//div')
                  if e.get('class')=='detail-indent'])

gnibbler 2010-02-18 23:23:44

Development on BeautifulSoup has come to a halt. It makes more sense to recommend lxml (http://codespeak.net/lxml/), which is still current, supported, improving software.

Mike Graham 2010-02-18 23:33:28

As a beginner I know a little BeautifulSoup, but I don't know how to combine it with HTMLParser (and use "len()"). And that is necessary: the existing code (200 lines) is written with HTMLParser (POST method, etc.).

ditus 2010-02-19 00:11:00
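
For the HTMLParser route mentioned in the comment above, a minimal sketch that stays with the standard library might look like this. It assumes Python 3 (where the module is html.parser rather than HTMLParser) and assumes that "images" means img tags; the class name ImageCounter and the sample lines are hypothetical stand-ins, not part of the original code.

```python
from html.parser import HTMLParser  # Python 3 name; the module is HTMLParser in Python 2


class ImageCounter(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.images.append(src)


# Hypothetical sample; in the real script these would be the lines in self.srcData.
src_data = [
    '<div class="detail-indent">\n',
    '<a href="/imgcache/big.jpg"><img src="/imgcache/small.jpg"></a>\n',
    '<img src="/imgcache/other.jpg" />\n',
    '</div>\n',
]

counter = ImageCounter()
for line in src_data:
    counter.feed(line)
counter.close()

# len() on the collected list gives the number of images
image_count = len(counter.images)
print(image_count)
```

Because the parser keeps the list itself, the same object gives both the count and the image paths, so it can sit next to an existing HTMLParser subclass without pulling in another library.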

Answer 2

+1 A:

Just for a lark, I tried a pyparsing approach. Pyparsing includes some methods to help construct matching patterns for HTML tags, which include matches for attributes, unexpected whitespace, single or double quotes, and other hard-to-predict HTML tag gotchas. Here is a pyparsing solution (assumes your HTML source has been read into a string variable 'html'):

from pyparsing import makeHTMLTags

# makeHTMLTags returns patterns for both opening and closing 
# tags, we just want the opening ones
aTag = makeHTMLTags("A")[0]
imgTag = makeHTMLTags("IMG")[0]

# find the matching tags
tagMatches = (aTag|imgTag).searchString(html)

# yes, use len() to see how many there are
print(len(tagMatches))

# get the actual image names
for t in tagMatches:
    if t.startA:
        print(t.href)
    if t.startImg:
        print(t.src)

Prints:

2
/imgcache/cache231/3186-000393~8621457~640x480.jpg
/imgcache/cache231/3186-000393~8621457~120x120.jpg

Paul McGuire 2010-02-19 01:11:09

Nice solution. BeautifulSoup, lxml, pyparsing. But again: what is html in my case? I don't have a URL of the form "http:// www...... com/..... html". And how do I add to srcData (which is then exported to the txt/csv file)? I like these methods, but I haven't used them in difficult cases.

ditus 2010-02-19 01:24:04

Edited to explain that 'html' is a string variable containing the HTML source you are searching for image refs. I don't understand your other questions.

Paul McGuire 2010-02-19 02:21:54
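
To illustrate the point that 'html' only needs to be a string holding the page source: here is a small sketch of two ways such a string could be produced, either from a list of already-decoded lines (like srcData) or from a file-like response object. It assumes Python 3 and UTF-8 input; the helper names and the io.BytesIO stand-in for a real response are hypothetical.

```python
import io


def lines_to_html(src_data):
    """Join a list of already-decoded text lines into one HTML string."""
    return "".join(src_data)


def response_to_html(response, encoding="utf8"):
    """Read a file-like response (bytes) and decode it into one HTML string."""
    return response.read().decode(encoding)


# Hypothetical stand-in for the object returned by urlopen().
fake_response = io.BytesIO(b'<a href="/big.jpg"><img src="/small.jpg"></a>\n')
html_from_response = response_to_html(fake_response)

# Hypothetical stand-in for self.srcData.
html_from_lines = lines_to_html(['<a href="/big.jpg">', '<img src="/small.jpg"></a>\n'])

print(html_from_response == html_from_lines)
```

Either string could then be handed to whichever parser is in use (BeautifulSoup, lxml, or pyparsing's searchString).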

Answer 3

A:

import urllib
import urllib2
import HTMLParser
import codecs
import time
from BeautifulSoup import BeautifulSoup

# decode string
def decode(istr):
  ostr = u''
  idx = 0
  while idx < len(istr):
    add = True
    if istr[idx] == '&' and len(istr) > idx + 1 and istr[idx + 1] == '#':
      iend = istr.find(';', idx)
      if iend > idx:
        ostr += unichr(int(istr[idx + 2:iend]))
        idx = iend
        add = False
    if add:
      ostr += istr[idx]
    idx += 1
  return ostr

# parser 1
class FlatDetailParser (HTMLParser.HTMLParser):
  def __init__ (self):
    HTMLParser.HTMLParser.__init__(self)

  def loadDetails(self, link):
    self.record = (len(self.characts) + 1) * ['']
    self.status = 0
    self.index = -1
    self.reset()
    request = urllib2.Request(link)
    data = urllib2.urlopen(request)  # URL obtained from the next class
    self.srcData = []
    for line in data:
      line = line.decode('utf8')
      self.srcData.append(line)
    for line in self.srcData:
      self.feed(line)
    self.close()
    return self.record


  def handle_starttag(self, tag, attrs):
    if tag == 'div' and len(attrs) > 1 and attrs[1][0] == 'class' and attrs[1][1] == 'detail-headline' \
      and self.srcData[self.getpos()[0]].strip() == u'Realitn&#225; kancel&#225;ria':
      self.status = 2

    if self.status == 2 and tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class' \
      and attrs[0][1] == 'name':
      self.record[-1] = decode(self.srcData[self.getpos()[0]].strip())
      self.status = 0

...followed by the next parser class, and then the code that writes the data to the txt file.

When I use BeautifulSoup, what goes in soup=BeautifulSoup(???)? How can I add to srcData? Can the two be combined? How?

ditus 2010-02-19 01:16:17

See the documentation for BeautifulSoup. Someone else mentioned it's not being actively developed, but at least the existing code is pretty well-documented with lots of examples. http://www.crummy.com/software/BeautifulSoup/documentation.html

MatrixFrog 2010-02-19 01:21:16
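
A side note on the hand-written decode() function in the code above: it converts decimal character references such as &#225; only. As a hedged sketch, assuming the script were ported to Python 3, the standard library's html.unescape handles the same decimal references and also covers hexadecimal and named ones. The sample strings below are illustrative.

```python
import html

# Decimal references, as in the string the parser above compares against
decoded = html.unescape("Realitn&#225; kancel&#225;ria")
print(decoded)

# Hexadecimal and named references resolve to the same character
same_char = html.unescape("&#xe1;") == html.unescape("&aacute;")
print(same_char)
```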


the number of images, using "len()"
