tags:

views:

455

answers:

5

Hi, what is the best way to extract data from this HTML file and put it into MySQL database with company phone number, company name and email with a primary key as phone number?

</tr><tr class="tableRowOdd">
          <td>"JSC company inc. 00" &lt;[email protected]&gt;</td>
          <td>1231231234</td>
         </tr><tr class="tableRowEven">
          <td>"JSC company inc. 01" &lt;[email protected]&gt;</td>
          <td>234234234234234</td>
         </tr><tr class="tableRowOdd">
          <td>"JSC company inc. 02" &lt;[email protected]&gt;</td>
          <td>32423234234</td>
         </tr><tr class="tableRowEven">
          <td>"JSC company inc. 03" &lt;[email protected]&gt;</td>
          <td>23423424324</td>
         </tr><tr class="tableRowOdd">
          <td>"JSC company inc. 04" &lt;[email protected]&gt;</td>
          <td>234234232324244</td>
         </tr> <tr>
+5  A: 

For extracting and general HTML munging look at

http://www.crummy.com/software/BeautifulSoup/

For the MySQL I suggest googling on: MySQL tutorial python

Peter Rowell
+1  A: 

Here is how you get the td contents into a python list using BeautifulSoup:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup, SoupStrainer

def find_rows(data):
    table_rows = SoupStrainer('tr')
    rows = [tag for tag in BeautifulSoup(data, parseOnlyThese=table_rows)]
    return rows

def cell_data(row):
    cells = [tag.string for tag in row.contents]
    return cells

if __name__ == "__main__":
    f = open("testdata.html", "r")
    data = f.read()
    rows = find_rows(data)
    for row in rows:
     print cell_data(row)

Save your html file as testdata.html, and run this script from the same directory. With the data you posted here, the output is

[u'\n', u'"JSC company inc. 00" &lt;[email protected]&gt;', u'\n', u'1231231234', u'\n']
[u'\n', u'"JSC company inc. 01" &lt;[email protected]&gt;', u'\n', u'234234234234234', u'\n']
[u'\n', u'"JSC company inc. 02" &lt;[email protected]&gt;', u'\n', u'32423234234', u'\n']
[u'\n', u'"JSC company inc. 03" &lt;[email protected]&gt;', u'\n', u'23423424324', u'\n']
[u'\n', u'"JSC company inc. 04" &lt;[email protected]&gt;', u'\n', u'234234232324244', u'\n']
gnud
+1  A: 

For the parsing, I definitely also recommend Beautiful Soup.

To put the text in a database, I recommend a good Python ORM. My top suggestion is to use the ORM from Django, if you can. With Django, you not only get an ORM, you also get a web interface that lets you browse through your database with a web browser; you can even enter data into the database using the web browser.

If you can't use Django, I recommend SQLAlchemy.

Good luck.

steveha
+1  A: 

With lxml you can do it almost as easily as you could do it with jQuery.

from lxml import html

doc = html.parse('test.html').getroot()
for row in doc.cssselect('tr'):
    name, phone_number = row.cssselect('td')[:2]
    print name.text_content()
    print phone_number.text_content()
Imran
A: 

+1 for BeautifulSoup

Now that you've got the data, you need to put it into MySQL. If you want a pure python solution, you'll also need the MySQL-Python binding.

Otherwise, the SQL you'll need to generate is relatively painless. We'll hijack gnuds example. Add to the top of the file:

 import re

Then at the bottom:

exp = r'\"(.*)\" &lt;(.*)&gt;'
for row in rows:
 matcher = re.match(exp, row[1])
 name, email = matcher.groups()
 phone = row[3]

 sql = "INSERT INTO company (email, name, phone) VALUES ('%s','%s','%s')" % (email, name, phone)
 print sql

Which gives you output like:

INSERT INTO company (email, name, phone) VALUES ('[email protected]','JSC company inc. 00','1231231234');
INSERT INTO company (email, name, phone) VALUES ('[email protected]','JSC company inc. 01','234234234234234');
INSERT INTO company (email, name, phone) VALUES ('[email protected]','JSC company inc. 02','32423234234');
INSERT INTO company (email, name, phone) VALUES ('[email protected]','JSC company inc. 03','23423424324');
INSERT INTO company (email, name, phone) VALUES ('[email protected]','JSC company inc. 04','234234232324244');
fitzgeraldsteele
You need to escape the data. If any of the fields have a `'` in them the sql will have a syntax error. Worse - it could used for SQL injection
gnibbler
Ahh, yes, of course. Although it looks like we're scraping from a set HTML page, so that level of rigor may not be necessary. Anyway, the python cgi.escape() method will do some of this, although it won't get the quotes.
fitzgeraldsteele