In general, to solve such problems you must first download the page of interest as text (use urllib.urlopen, or anything else, even external utilities such as curl or wget -- but not a browser, since you want to see the page as it looks before any JavaScript has had a chance to run) and study it to understand its structure. In this case, after some study, you'll find the relevant parts are (snipping some irrelevant parts in head and breaking lines up for readability):
<body onload=nx_init();>
  <dl>
    <dt>
      <a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=&oid=091&aid=0002497340"
         [[snipping other attributes of this tag]]>
        JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
    </dt>
    <dd class="txt_inline">
      EPA연합뉴스 세계 <span class="bar">|</span>
      2009.10.25 (일) 오후 7:21</dd>
    <dd class="sh_news_passage">
      Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the
      film 'Eight <b>Times</b> Up' directed by French filmmaker Xabi Molia.
      EPA/DAI KUROKAWA</dd>
and so forth. So, you want as "subject" the content of an <a> tag within a <dt>, and as "content" the content of the <dd> tags following it (in the same <dl>).
The headers you get contain:
Content-Type: text/html; charset=ks_c_5601-1987
so you must also find a way to interpret that encoding into Unicode -- I believe that encoding is also known as 'euc_kr'
and my Python installation appears to come with a codec for it, but you should check yours, too.
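As a minimal sketch of that first step (the names charset, rawbytes and unitext are mine, purely for illustration, and the fallback to 'euc_kr' is my assumption, not something the page guarantees), fetching the page and decoding it to Unicode might look like:

import urllib

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
page = urllib.urlopen(theurl)
# the charset declared in the Content-Type header, e.g. ks_c_5601-1987
charset = page.info().getparam('charset') or 'euc_kr'
rawbytes = page.read()
try:
    unitext = rawbytes.decode(charset)
except LookupError:
    # not every install aliases ks_c_5601-1987; euc_kr is the usual fallback
    unitext = rawbytes.decode('euc_kr', 'replace')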
Once you've determined all of these aspects, you can try lxml.etree.parse on the URL -- and, like so many other web pages, it doesn't parse: it isn't really well-formed HTML (try the W3C validators on it to see some of the ways in which it's broken).
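If you want to see that failure for yourself, something along these lines (the exact exception text will vary) should reproduce it, with theurl as defined in the sketch above:

import lxml.etree

try:
    tree = lxml.etree.parse(theurl)
    print tree.getroot().tag
except lxml.etree.XMLSyntaxError, e:
    # malformed markup makes the strict parser give up
    print 'not well-formed:', e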
Because badly-formed HTML is so common on the web, there exist "tolerant parsers" that try to compensate for common errors. The most popular in Python is BeautifulSoup, and indeed lxml comes with it -- with lxml 2.0.3 or later, you can use BeautifulSoup as the underlying parser, then proceed "just as if" the document had parsed correctly -- but I find it simpler to use BeautifulSoup directly.
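For completeness, the "BeautifulSoup underneath lxml" route (with lxml 2.0.3 or later) looks roughly like this -- a sketch from memory, reusing the unitext decoded above:

from lxml.html import soupparser   # needs lxml >= 2.0.3 and BeautifulSoup

root = soupparser.fromstring(unitext)
print root.tag   # from here on, navigate it like any other lxml tree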
Using BeautifulSoup directly, then, here's a script to emit the first few subject/content pairs at that URL (the items have changed by now; originally they were the same ones you gave ;-). You need a terminal that supports Unicode output (for example, I run this without problems in a Mac's Terminal.App set to utf-8) -- of course, instead of the prints you can collect the Unicode fragments (e.g. append them to a list and ''.join them when you have all the required pieces), encode them however you wish, and so on.
from BeautifulSoup import BeautifulSoup
import urllib
def getit(pagetext, howmany=0):
    soup = BeautifulSoup(pagetext)
    dls = soup.findAll('dl')
    for adl in dls:
        # each <dt> carries the subject in its <a> tag
        thedt = adl.dt
        while thedt:
            thea = thedt.a
            if thea:
                print 'SUBJECT:', thea.string
            # the content is in the <dd> sibling(s) following the <dt>
            thedd = thedt.findNextSibling('dd')
            if thedd:
                print 'CONTENT:',
                while thedd:
                    for x in thedd.findAll(text=True):
                        print x,
                    thedd = thedd.findNextSibling('dd')
                print
                howmany -= 1
                if not howmany: return
            print
            thedt = thedt.findNextSibling('dt')
theurl = ('http://news.search.naver.com/search.naver?'
'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
thepage = urllib.urlopen(theurl).read()
getit(thepage, 3)
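If you'd rather collect the pieces than print them as you go, a variant along these lines (the names collectit, pairs and fragments are mine, just for illustration; it reuses the import and thepage from the script above) returns a list of (subject, content) Unicode pairs:

def collectit(pagetext, howmany=0):
    # same navigation as getit, but accumulate instead of printing
    soup = BeautifulSoup(pagetext)
    pairs = []
    for adl in soup.findAll('dl'):
        thedt = adl.dt
        while thedt:
            thea = thedt.a
            subject = thea.string if thea else u''
            fragments = []
            thedd = thedt.findNextSibling('dd')
            while thedd:
                fragments.extend(thedd.findAll(text=True))
                thedd = thedd.findNextSibling('dd')
            pairs.append((subject, u' '.join(fragments)))
            howmany -= 1
            if not howmany:
                return pairs
            thedt = thedt.findNextSibling('dt')
    return pairs

for subj, cont in collectit(thepage, 3):
    print subj, '=>', cont[:60]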
The logic with lxml, or "BeautifulSoup in lxml clothing", is not very different; only the spelling and capitalization of the various navigational operations change a bit.
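For instance, the same walk expressed with lxml's tree API might look roughly like this -- a sketch from memory, not tested against the live page, with getit_lxml being my own name for it:

from lxml.html import soupparser

def getit_lxml(pagetext, howmany=0):
    root = soupparser.fromstring(pagetext)
    for adl in root.findall('.//dl'):
        for thedt in adl.findall('dt'):
            thea = thedt.find('a')
            if thea is not None:
                print 'SUBJECT:', thea.text
            # gather the text of the <dd> siblings that follow this <dt>
            sib = thedt.getnext()
            parts = []
            while sib is not None and sib.tag == 'dd':
                parts.extend(sib.itertext())
                sib = sib.getnext()
            if parts:
                print 'CONTENT:', u' '.join(parts)
                howmany -= 1
                if not howmany:
                    return
            print

getit_lxml(thepage, 3)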