Hello. I want to extract some text from a certain website. Here is the web address I want to scrape: http://news.search.naver.com/search.naver?sm=tab%5Fhty&where=news&query=times&x=0&y=0 On this page, I want to extract text into separate subject and content fields. For example, if you open that page, you can see text like this:

JAPAN TOKYO INTERNATIONAL FILM FESTIVAL EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:21 Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film 'Eight Times Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA

JAPAN TOKYO INTERNATIONAL FILM FESTIVAL EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:18 she learns that she won the Best Actress Award for her role in the film 'Eight Times Up' by French film director Xabi Molia during the award ceremony of the 22nd Tokyo ...

and so on.

Finally, I want the extracted text in a format like this:

SUBJECT:JAPAN TOKYO INTERNATIONAL FILM FESTIVAL CONTENT:EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:21 Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film 'Eight Times Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA

SUBJECT: ... CONTENT: ...

and so on. If anyone can help, I'd really appreciate it. Thanks in advance.

+2  A: 

In general, to solve such problems you must first download the page of interest as text (use urllib.urlopen or anything else, even external utilities such as curl or wget, but not a browser, since you want to see how the page looks before any Javascript has had a chance to run) and study it to understand its structure. In this case, after some study, you'll find the relevant parts are (snipping some irrelevant parts in the head and breaking lines up for readability):

<body onload=nx_init();>
  <dl>
    <dt>
      <a href="http://news.naver.com/main/read.nhn?mode=LSD&amp;mid=sec&amp;sid1=&amp;oid=091&amp;aid=0002497340"
         [[snipping other attributes of this tag]]>
        JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
    </dt>
    <dd class="txt_inline">
      EPA연합뉴스 세계 <span class="bar">|</span> 2009.10.25 (일) 오후 7:21</dd>
    <dd class="sh_news_passage">
      Japan, 25 October 2009. Gayet won the Best Actress Award for her role
      in the film 'Eight <b>Times</b> Up' directed by French filmmaker
      Xabi Molia. EPA/DAI KUROKAWA</dd>

and so forth. So, you want as "subject" the content of an <a> tag within a <dt>, and as "content" the content of <dd> tags following it (in the same <dl>).
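Incidentally, if you want to study the raw page offline at leisure, a minimal fetch-and-save sketch (urllib is the standard-library module mentioned above; the local filename is just an arbitrary example of mine):

import urllib

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
# save the bytes exactly as served, so you study the pre-Javascript structure
open('naver_page.html', 'wb').write(urllib.urlopen(theurl).read())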

The headers you get contain:

Content-Type: text/html; charset=ks_c_5601-1987

so you must also find a way to interpret that encoding into Unicode -- I believe that encoding is also known as 'euc_kr' and my Python installation appears to come with a codec for it, but you should check yours, too.
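For instance, the decoding step on its own is just this (a sketch assuming the 'euc_kr' codec is present in your installation, as it is in mine):

import urllib

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
raw = urllib.urlopen(theurl).read()      # bytes, declared as ks_c_5601-1987
text = raw.decode('euc_kr', 'replace')   # unicode; 'replace' tolerates stray bytes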

Once you've determined all of these aspects, you can try to lxml.etree.parse the URL -- and, just like so many other web pages, it doesn't parse: the page doesn't really present well-formed HTML (try the W3C validators on it to find out about some of the ways it's broken).
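You can see the failure concretely with a couple of lines (a sketch: lxml.etree.parse accepts a URL directly and raises XMLSyntaxError on markup it can't handle):

from lxml import etree

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
try:
  tree = etree.parse(theurl)   # strict parsing: demands well-formed markup
except etree.XMLSyntaxError, e:
  print 'not well-formed:', e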

Because badly-formed HTML is so common on the web, there exist "tolerant parsers" that try to compensate for common errors. The most popular in Python is BeautifulSoup, and lxml can cooperate with it: with lxml 2.0.3 or later, you can use BeautifulSoup as the underlying parser and then proceed "just as if" the document had parsed correctly -- but I find it simpler to use BeautifulSoup directly.
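If you do want the lxml route, the combination looks roughly like this (a sketch assuming the lxml.html.soupparser bridge module that recent lxml versions provide, with BeautifulSoup installed alongside):

import urllib
from lxml.html import soupparser

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
# BeautifulSoup does the tolerant parsing; you get back an lxml tree
root = soupparser.fromstring(urllib.urlopen(theurl).read())
print root.xpath('count(//dl/dt)')   # then navigate with lxml's usual API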

For example, here's a script to emit the first few subject/content pairs at that URL (the results have changed by now; originally they were the same as the ones you quote ;-). You need a terminal that supports Unicode output (for example, I run this without problems on a Mac's Terminal.App set to utf-8) -- of course, instead of the prints you can collect the Unicode fragments (e.g. append them to a list and ''.join them when you have all the required pieces), encode them however you wish, etc, etc.

from BeautifulSoup import BeautifulSoup
import urllib

def getit(pagetext, howmany=0):
  # howmany=0 means "no limit": the counter goes negative and the early
  # return below never triggers
  soup = BeautifulSoup(pagetext)
  for adl in soup.findAll('dl'):
    # each <dt> carries a subject (in its <a>); the <dd> siblings that
    # follow carry the matching content
    thedt = adl.dt
    while thedt:
      thea = thedt.a
      if thea:
        print 'SUBJECT:', thea.string
      thedd = thedt.findNextSibling('dd')
      if thedd:
        print 'CONTENT:',
        while thedd:
          # emit every text node in this <dd>, flattening tags like <b>
          for x in thedd.findAll(text=True):
            print x,
          thedd = thedd.findNextSibling('dd')
        print
      howmany -= 1
      if not howmany: return
      print
      thedt = thedt.findNextSibling('dt')

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
thepage = urllib.urlopen(theurl).read()
getit(thepage, 3)

The logic in lxml, or "BeautifulSoup in lxml clothing", is not very different; just the spelling and capitalization of the various navigational operations change a bit.
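For concreteness, a rough lxml-flavored sketch of the same navigation (assumptions of mine: soupparser as above, getnext() standing in for findNextSibling, itertext() standing in for findAll(text=True); I haven't run this against the live page):

from lxml.html import soupparser
import urllib

def getit_lxml(pagetext, howmany=0):
  root = soupparser.fromstring(pagetext)
  for thedt in root.xpath('//dl/dt'):
    thea = thedt.find('a')
    if thea is not None:
      print 'SUBJECT:', thea.text
    sib = thedt.getnext()
    if sib is not None and sib.tag == 'dd':
      print 'CONTENT:',
      # walk the run of <dd> siblings, flattening all their text
      while sib is not None and sib.tag == 'dd':
        print u' '.join(sib.itertext()),
        sib = sib.getnext()
      print
    howmany -= 1
    if not howmany: return
    print

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
getit_lxml(urllib.urlopen(theurl).read(), 3)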

Alex Martelli
Hello, I really appreciate your hard work! This is almost 100% what I want. In addition, is it possible to use the PAMIE module with my script source? I'm afraid I may have to open another new thread. Thanks
paul
Hi, I forgot: here is my current scraper script source: http://elca.pastebin.com/m52e7d8e0. Thanks again
paul
@Paul, I do believe closing this question (accepting the answer that helped most) and posing another one about your other issue is proper SO etiquette: mixing issues in a question just because they are near each other in your code is not helpful!
Alex Martelli
Hi, thanks for your advice :) and also for your good support. I'm closing this question.
paul