views:

404

answers:

9

There is about 2000 lines of this, so manually would probably take more work than to figure out a way to do ths programatically. It only needs to work once so I'm not concerned with performance or anything.

<tr><td>Canada (CA)</td><td>Alberta (AB)</td></tr>
<tr><td>Canada (CA)</td><td>British Columbia (BC)</td></tr>
<tr><td>Canada (CA)</td><td>Manitoba (MB)</td></tr>

Basically its formatted like this, and I need to divide it into 4 parts, Country Name, Country Abbreviation, Division Name and Division Abbreviation.

In keeping with my complete lack of efficiency I was planning just to do a string.Replace on the HTML tags after I broke them up and then just finding the index of the opening brackets and grabbing the space delimited strings that are remaining. Then I realized I have no way of keeping track of which is the country and which is the division, as well as figuring out how to group them by country.

So is there a better way to do this? Or better yet, an easier way to populate a database with Country and Provinces/States? I looked around SO and the only readily available databases I can find dont provide the full name of the countries or the provinces/states or use IPs instead of geographic names.

+4  A: 
  1. Paste it into a spreadsheet. Some spreadsheets will parse the HTML table for you.

  2. Save it as a .CSV file and process it that way. Or. Add a column to the spreadsheet that says something like the following:

    ="INSERT INTO COUNTRY(CODE,NAME) VALUES=('" & A1 & "','" & B1 & "');"

Then you have a column of INSERT statements that you can cut, paste and execute.


Edit

Be sure to include the <table> tag when pasting into a spreadsheet.

<table><tr><th>country</th><th>name></th></tr>
<tr><td>Canada (CA)</td><td>Alberta (AB)</td></tr>
<tr><td>Canada (CA)</td><td>British Columbia (BC)</td></tr>
<tr><td>Canada (CA)</td><td>Manitoba (MB)</td></tr>
</table>

Processing a CSV file requires almost no parsing. It's got quotes and commas. Much easier to live with than XML/HTML.

S.Lott
I tried that, it just clumps everything into the A column. How would I process it as a .csv file? Won't I have the same issues I have with a text file?
Brandon
Ah! The wonders of sql-generation in Excel. :-)
Rune Grimstad
Spreadsheet as input syntax... http://homepage.mac.com/s_lott/iblog/architecture/C465799452/E20070125111155/index.html
S.Lott
Thanks everyone. It worked great.
Brandon
A: 

do you have to do this programatically? If not, may i suggest just copying and pasting the table (from the browser) onto MS Excel and then clearing all formats? This way tou get a nice table that can then be imported into your database without problem.

just a suggestion... hth

Andres
+1  A: 

Sounds like a problem easily solved by a Regex.

Chris Ballance
I'm not really familiar with writing RegExs, which is why I was using string.Replace :P Would you be able to provide an example that would let me strip the tags and replace them with pieces of SQL statements?
Brandon
A: 

An assembly exists for .Net called System.Xml; you can just reference the assembly and convert your HTML document to a System.Xml.XmlDocument, you can easily pinpoint the HTML node that contains your required data, and use the use the children nodes to add into your data. This requires little string parsing on your part.

+1  A: 

I recently learned that if you open a url from Excel it will try and parse out the table data.

jms
+1  A: 

If you are able to see this table in the browser (Internet explorer), you can select the entire table, right click & "Export to Microsoft Excel"

That should help you get data into separate columns, I guess.

shahkalpesh
Aha, genius. That's pretty much gotten me halfway there.
Brandon
A: 

Load the HTML data as XElements, use LINQ to grab the values you need, and then INSERT.

Chris
+2  A: 
/<tr><td>([^\s]+)\s\(([^\)])\)<\/td><td>([^\s]+)\s\(([^\)])\)<\/td><\/tr>/

Then you should have 4 captures with the 4 pieces of data from any PCRE engine :)

Alternatively, something like http://jacksleight.com/assets/blog/really-shiny/scripts/table-extractor.txt provides more completeness.

singpolyma
A: 

Blowing my own trumpet here but my FOSS tool CSVfix will do it with a combination of the read_xml and sql_insert commands.

anon