I need to count the number of images in the HTML below (in this case, one image). Presumably by using "len()"?

Here is the HTML:

<div class="detail-headline">
    Fotogal&#233;ria
        </div>
<div class="detail-indent">
    <table id="ctl00_ctl00_ctl00_containerHolder_mainContentHolder_innnerContentHolder_ZakazkaControl_ZakazkaObrazky1_ObrazkyDataList" cellspacing="0" border="0" style="width:100%;border-collapse:collapse;">
    <tr>
        <td align="center" style="width:25%;">
            <div id="ctl00_ctl00_ctl00_containerHolder_mainContentHolder_innnerContentHolder_ZakazkaControl_ZakazkaObrazky1_ObrazkyDataList_ctl02_PictureContainer">
                <a title="1-izb. Kaspická" class="highslide detail-img-link" onclick="return hs.expand(this);" href="/imgcache/cache231/3186-000393~8621457~640x480.jpg"><img src="/imgcache/cache231/3186-000393~8621457~120x120.jpg" class="detail-img" width="89" height="120" alt="1-izb. Kaspická" /></a>
            </div>
        </td><td></td>
    </tr>
</table>
</div>

So far I have been using HTMLParser, and the number of images must be added to "self.srcData". My previous code:

def handle_starttag(self, tag, attrs):  
    if tag == 'div' and len(attrs) > 1 and attrs[1][0] == 'class' and attrs[1][1] == 'detail-headline' \
      and self.srcData[self.getpos()[0]].strip() == u'Realitn&#225; kancel&#225;ria':
      self.status = 2

    if self.status == 2 and tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class' and attrs[0][1] == 'name':
      self.record[-1] = decode(self.srcData[self.getpos()[0]].strip())
      self.status = 0

Then, to check the start tag, something like this?

if tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class' and attrs[0][1] == 'detail-headline' \
      and self.srcData[self.getpos()[0]].strip() == 'Fotogal&#233;ria':
      self.status = 3

Is that OK? And what comes next? Thanks.

+3  A: 

Your job will be easier if you use BeautifulSoup.

Perhaps something like this:

from BeautifulSoup import BeautifulSoup
def count_images(htmltext):
    soup = BeautifulSoup(htmltext)
    # count the thumbnail <img> tags, not the surrounding div
    return len(soup.findAll('img', {'class': 'detail-img'}))

Or using lxml:

from lxml.html.soupparser import fromstring
def count_images(htmltext):
    # iter() walks the whole tree; findall('div') only sees direct children
    return len([e for e in fromstring(htmltext).iter('img')
                if e.attrib.get('class') == 'detail-img'])
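
If the existing HTMLParser code has to stay, the count can probably be done with the standard library alone. A minimal sketch, assuming the gallery thumbnails are the img tags with class "detail-img" as in the HTML from the question; ImageCounter and image_count are made-up names, not part of the asker's code:

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ImageCounter(HTMLParser):
    """Counts <img class="detail-img"> tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        # dict(attrs) avoids relying on the position of the class attribute.
        # Self-closing tags (<img ... />) also end up here, because the
        # default handle_startendtag calls handle_starttag.
        if tag == 'img' and dict(attrs).get('class') == 'detail-img':
            self.image_count += 1


def count_images(htmltext):
    parser = ImageCounter()
    parser.feed(htmltext)
    parser.close()
    return parser.image_count
```

The same test on tag and class could presumably go into the handle_starttag of the existing parser, incrementing a counter instead of setting a status.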
gnibbler
Development on BeautifulSoup has come to a halt. It makes more sense to recommend lxml (http://codespeak.net/lxml/), which is still current, supported, improving software.
Mike Graham
I know a little BeautifulSoup as a beginner, but I don't know how to combine it with HTMLParser (and with "len()"), and I need to. The existing code (200 lines) is written with HTMLParser (POST method, etc.).
ditus
+1  A: 

Just for a lark, I tried a pyparsing approach. Pyparsing includes some methods to help construct matching patterns for HTML tags, which include matches for attributes, unexpected whitespace, single or double quotes, and other hard-to-predict HTML tag gotchas. Here is a pyparsing solution (assumes your HTML source has been read into a string variable 'html'):

from pyparsing import makeHTMLTags

# makeHTMLTags returns patterns for both opening and closing 
# tags, we just want the opening ones
aTag = makeHTMLTags("A")[0]
imgTag = makeHTMLTags("IMG")[0]

# find the matching tags
tagMatches = (aTag|imgTag).searchString(html)

# yes, use len() to see how many there are
print len(tagMatches)

# get the actual image names
for t in tagMatches:
    if t.startA:
        print t.href
    if t.startImg:
        print t.src

Prints:

2
/imgcache/cache231/3186-000393~8621457~640x480.jpg
/imgcache/cache231/3186-000393~8621457~120x120.jpg
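
If the page is already held as a list of decoded lines, as self.srcData is in the question, the 'html' string can presumably be built by joining that list. A rough sketch; lines_to_html and store_image_count are hypothetical helpers, and appending the count as an extra field of the record is only an assumption about how the output is laid out:

```python
def lines_to_html(src_data):
    """Join a list of already-decoded lines into one HTML string."""
    return u''.join(src_data)


def store_image_count(record, count):
    """Return a copy of the record with the count appended as text."""
    return list(record) + [str(count)]


# Example with made-up data standing in for self.srcData and self.record
src_data = [u'<div>\n', u'<img src="a.jpg" />\n', u'</div>\n']
html = lines_to_html(src_data)
record = store_image_count(['', ''], 1)
```

The resulting html string is what searchString(html) expects, and the same string could be passed to BeautifulSoup.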
Paul McGuire
Nice solution. BeautifulSoup, lxml, pyparsing. But again: what is html in my case? I don't have a URL of the form "http:// www...... com/..... html". And how do I add the result to srcData (which is later written to a txt/csv file)? I like these methods, but I haven't used them in difficult cases.
ditus
Edited to explain that 'html' is a string variable containing the HTML source you are searching for image refs. I don't understand your other questions.
Paul McGuire
A: 
import urllib
import urllib2
import HTMLParser
import codecs
import time
from BeautifulSoup import BeautifulSoup

# decode string
def decode(istr):
  ostr = u''
  idx = 0
  while idx < len(istr):
    add = True
    if istr[idx] == '&' and len(istr) > idx + 1 and istr[idx + 1] == '#':
      iend = istr.find(';', idx)
      if iend > idx:
        ostr += unichr(int(istr[idx + 2:iend]))
        idx = iend
        add = False
    if add:
      ostr += istr[idx]
    idx += 1
  return ostr

# parser 1
class FlatDetailParser (HTMLParser.HTMLParser):
  def __init__ (self):
    HTMLParser.HTMLParser.__init__(self)

  def loadDetails(self, link):
    self.record = (len(self.characts) + 1) * ['']
    self.status = 0
    self.index = -1
    self.reset()
    request = urllib2.Request(link)
    data = urllib2.urlopen(request)  # URL obtained from the next class
    self.srcData = []
    for line in data:
      line = line.decode('utf8')
      self.srcData.append(line)
    for line in self.srcData:
      self.feed(line)
    self.close()
    return self.record


  def handle_starttag(self, tag, attrs):
    if tag == 'div' and len(attrs) > 1 and attrs[1][0] == 'class' and attrs[1][1] == 'detail-headline' \
      and self.srcData[self.getpos()[0]].strip() == u'Realitn&#225; kancel&#225;ria':
      self.status = 2

    if self.status == 2 and tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class' \
      and attrs[0][1] == 'name':
      self.record[-1] = decode(self.srcData[self.getpos()[0]].strip())
      self.status = 0

...followed by the next parser class and the code that writes the data to the txt file.

When I use BeautifulSoup, what goes into soup=BeautifulSoup(???)? How can I add the result to srcData? Can the two be combined? How?

ditus
See the documentation for BeautifulSoup. Someone else mentioned it's not being actively developed, but at least the existing code is pretty well-documented with lots of examples. http://www.crummy.com/software/BeautifulSoup/documentation.html
MatrixFrog