Hello all,

I'm having trouble extracting some data from HTML source.

Below is a snippet of my HTML source code. I want to extract the string value from each of the following td elements:

<td class="gamedate">10/12 00:59</b></td>

<td class="gametype">오버언더</b></td>

<td class="legue"><nobr style="width:100%;overflow:hidden;letter-spacing:-1;font-size:11px;"><nobr style='display:block; overflow:hidden;'><img src='../data/banner/25' border='0' width='20' height='13' alt='' align='absmiddle'></a> 그리스 D2</nobr>

<td class="bet" id="team1_27771" class="homeTeam1">Pas Giannina (↑오버)</td>

<td class="bet" id="bet1_27771" class="homeTeam2" align="right">1.65</td>

<td class="pointer muSelect" id="chk_27771_3" num='27771' bet='2.5' sp='오버언더'  bgcolor="f0f0f0"  class="handy handy1" ><span id="bet3_27771">2.5</span></td>

<td class="bet" id="bet2_27771" class="awayTeam2" align="left">1.95</td>

<td class="bet" id="team2_27771" class="awayTeam1">Pierikos (↓언더)</td>

So the final values I want to extract are:

10/12 00:59

오버언더

그리스 D2

Pas Giannina (↑오버)

1.65

2.5

1.95

Pierikos (↓언더)

Please help me! Thanks in advance!

The full HTML source is fairly large, so I uploaded it to pastebin.com:

http://pastebin.com/Gdun0jhf

A: 

Why not just do a replace on the string

html.replace("AAAAAA", "Put what you want for AAAAAA here")

and do this for all of the things you want to replace?

Ignore this, I misread the question completely. My brain must not be switched on today.

Zimm3r
Er, the OP isn't trying to replace things, they're trying to get the values located in certain places. They manually put in the letter strings in their HTML code as *examples* of what they want to pull out.
Amber
I guess that is what the OP wanted, cool
Zimm3r
A: 

Hi, you could use HTMLParser.
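For what it's worth, here is a rough sketch of that idea using Python 3's html.parser module (it was called HTMLParser in Python 2). The names CellTextParser, extract_values and WANTED_CLASSES are made up for this example, and the class list is simply read off the snippet in the question. It has not been run against the full pastebin source, so treat it as a starting point rather than a finished solution.

```python
from html.parser import HTMLParser

# Assumption: these class names are taken from the snippet in the question.
WANTED_CLASSES = {"gamedate", "gametype", "legue", "bet", "muSelect"}


class CellTextParser(HTMLParser):
    """Collects the text of every <td> whose class is in WANTED_CLASSES."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.values = []     # finished cell texts, in document order
        self._buffer = None  # text chunks of the open cell, or None

    def _flush(self):
        # Store the buffered text of the current cell, if there is one.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.values.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        # The previous <td> may never have been closed, so flush it first.
        self._flush()
        # attrs is a list of (name, value) pairs, so a repeated class
        # attribute shows up more than once; gather all of them.
        classes = set()
        for name, value in attrs:
            if name == "class" and value:
                classes.update(value.split())
        if classes & WANTED_CLASSES:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # pick up a cell left open at the end of the input


def extract_values(html):
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.values
```

Calling extract_values(html) on the page source should give back the cell texts in document order. Because the parser only tokenizes the input, the stray </b> and </a> tags and the unclosed td in the snippet are simply skipped over.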

A: 

Something like this works on a basic table:

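import BeautifulSoup  # the BeautifulSoup 3 module

soup = BeautifulSoup.BeautifulSoup(YOUR_HTML)
# find() takes a tag name first; the id goes in the attrs dict
table = soup.find('table', {'id': 'TABLE_ID'})
for td in table.findAll('td'):
    print(td.string)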

But it looks like the HTML you are dealing with is a bit messier, so maybe it would be best to go after each of the tds by class name, e.g.:

import BeautifulSoup  # the BeautifulSoup 3 module

soup = BeautifulSoup.BeautifulSoup(YOUR_HTML)

# game date ('class' must be quoted: it is a Python keyword)
game_dates = soup.findAll('td', {'class': 'gamedate'})
for game_date in game_dates:
    print(game_date)

# bets
bets = soup.findAll('td', {'class': 'bet'})
for bet in bets:
    print(bet)
ScraperWiki