In general, to solve such problems you must first download the page of interest as text (use urllib.urlopen, or anything else, even external utilities such as curl or wget -- but not a browser, since you want to see the page as it looks before any JavaScript has had a chance to run) and study it to understand its structure. In this case, after some study, you'll find the relevant parts are (snipping some irrelevant parts in head and breaking lines up for readability):
<body onload=nx_init();>
  <dl>
    <dt>
      <a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=&oid=091&aid=0002497340"
         [[snipping other attributes of this tag]]>
        JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
    </dt>
    <dd class="txt_inline">
      EPA연합뉴스 세계 <span class="bar">|</span>
      2009.10.25 (일) 오후 7:21</dd>
    <dd class="sh_news_passage">
      Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the
      film 'Eight <b>Times</b> Up' directed by French filmmaker Xabi Molia.
      EPA/DAI KUROKAWA</dd>
and so forth. So, you want as "subject" the content of an <a> tag within a <dt>, and as "content" the content of the <dd> tags following it (in the same <dl>).
The headers you get contain:
Content-Type: text/html; charset=ks_c_5601-1987
so you must also find a way to interpret that encoding into Unicode -- I believe that encoding is also known as 'euc_kr'
and my Python installation appears to come with a codec for it, but you should check yours, too.
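As a minimal sketch of that first step (the names charset, rawbytes and unitext are mine, purely for illustration, and the fallback to 'euc_kr' is my assumption, not something the page guarantees), fetching the page and decoding it to Unicode might look like:

import urllib

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
page = urllib.urlopen(theurl)
# the charset declared in the Content-Type header, e.g. ks_c_5601-1987
charset = page.info().getparam('charset') or 'euc_kr'
rawbytes = page.read()
try:
    unitext = rawbytes.decode(charset)
except LookupError:
    # not every install aliases ks_c_5601-1987; euc_kr is the usual fallback
    unitext = rawbytes.decode('euc_kr', 'replace')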
Once you've determined all of these aspects, you can try lxml.etree.parse on the URL -- and, like so many other web pages, it doesn't parse: it isn't really well-formed HTML (try the W3C validators on it to see some of the ways in which it's broken).
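If you want to see that failure for yourself, something along these lines (the exact exception text will vary) should reproduce it, with theurl as defined in the sketch above:

import lxml.etree

try:
    tree = lxml.etree.parse(theurl)
    print tree.getroot().tag
except lxml.etree.XMLSyntaxError, e:
    # malformed markup makes the strict parser give up
    print 'not well-formed:', e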
Because badly-formed HTML is so common on the web, there exist "tolerant parsers" that try to compensate for common errors. The most popular in Python is BeautifulSoup, and indeed lxml comes with it -- with lxml 2.0.3 or later, you can use BeautifulSoup as the underlying parser, then proceed "just as if" the document had parsed correctly -- but I find it simpler to use BeautifulSoup directly.
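For completeness, the "BeautifulSoup underneath lxml" route (with lxml 2.0.3 or later) looks roughly like this -- a sketch from memory, reusing the unitext decoded above:

from lxml.html import soupparser   # needs lxml >= 2.0.3 and BeautifulSoup

root = soupparser.fromstring(unitext)
print root.tag   # from here on, navigate it like any other lxml tree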
Using BeautifulSoup directly, then, here's a script to emit the first few subject/content pairs at that URL (the items have changed by now; originally they were the same ones you gave ;-). You need a terminal that supports Unicode output (for example, I run this without problems in a Mac's Terminal.App set to utf-8) -- of course, instead of the prints you can collect the Unicode fragments (e.g. append them to a list and ''.join them when you have all the required pieces), encode them however you wish, and so on.
from BeautifulSoup import BeautifulSoup
import urllib
def getit(pagetext, howmany=0):
    soup = BeautifulSoup(pagetext)
    dls = soup.findAll('dl')
    for adl in dls:
        # each <dt> carries the subject in its <a> tag
        thedt = adl.dt
        while thedt:
            thea = thedt.a
            if thea:
                print 'SUBJECT:', thea.string
            # the content is in the <dd> sibling(s) following the <dt>
            thedd = thedt.findNextSibling('dd')
            if thedd:
                print 'CONTENT:',
                while thedd:
                    for x in thedd.findAll(text=True):
                        print x,
                    thedd = thedd.findNextSibling('dd')
                print
                howmany -= 1
                if not howmany: return
            print
            thedt = thedt.findNextSibling('dt')
theurl = ('http://news.search.naver.com/search.naver?'
'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
thepage = urllib.urlopen(theurl).read()
getit(thepage, 3)
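If you'd rather collect the pieces than print them as you go, a variant along these lines (the names collectit, pairs and fragments are mine, just for illustration; it reuses the import and thepage from the script above) returns a list of (subject, content) Unicode pairs:

def collectit(pagetext, howmany=0):
    # same navigation as getit, but accumulate instead of printing
    soup = BeautifulSoup(pagetext)
    pairs = []
    for adl in soup.findAll('dl'):
        thedt = adl.dt
        while thedt:
            thea = thedt.a
            subject = thea.string if thea else u''
            fragments = []
            thedd = thedt.findNextSibling('dd')
            while thedd:
                fragments.extend(thedd.findAll(text=True))
                thedd = thedd.findNextSibling('dd')
            pairs.append((subject, u' '.join(fragments)))
            howmany -= 1
            if not howmany:
                return pairs
            thedt = thedt.findNextSibling('dt')
    return pairs

for subj, cont in collectit(thepage, 3):
    print subj, '=>', cont[:60]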
The logic with lxml, or "BeautifulSoup in lxml clothing", is not very different; only the spelling and capitalization of the various navigational operations change a bit.
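For instance, the same walk expressed with lxml's tree API might look roughly like this -- a sketch from memory, not tested against the live page, with getit_lxml being my own name for it:

from lxml.html import soupparser

def getit_lxml(pagetext, howmany=0):
    root = soupparser.fromstring(pagetext)
    for adl in root.findall('.//dl'):
        for thedt in adl.findall('dt'):
            thea = thedt.find('a')
            if thea is not None:
                print 'SUBJECT:', thea.text
            # gather the text of the <dd> siblings that follow this <dt>
            sib = thedt.getnext()
            parts = []
            while sib is not None and sib.tag == 'dd':
                parts.extend(sib.itertext())
                sib = sib.getnext()
            if parts:
                print 'CONTENT:', u' '.join(parts)
                howmany -= 1
                if not howmany:
                    return
            print

getit_lxml(thepage, 3)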