I'm trying to write a program that reads company names from a text file and searches for each one on a search engine website (the SEC's EDGAR search). Each search usually returns 1-10 unique result links, and I want to use curl to follow the link with the relevant company name. The linked page has a brief summary containing the term "state of incorporation:" followed by the state name, and I'm hoping to parse out that state name. I'm having trouble understanding how to use HTML parsing and curl and their classes. I would appreciate any help, such as a brief outline of the steps or any general advice. Thanks.

+1  A: 

Assuming that the HTML is fairly basic, use something like the Mozilla Java HTML Parser. Its getting started guide gives more details on creating the DOM. Java has built-in APIs for downloading content from the web, and these will likely be sufficient for you (rather than using "curl").
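As a rough sketch of the "built-in APIs" route, something like the following could download a page as a string using only `java.net`. The class name `PageFetcher`, the timeout values, and the `User-Agent` string are all placeholders I made up for illustration, and the URL you pass in is whatever search or result URL you have already worked out; none of this is specific to EDGAR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    // Reads an entire stream into a String, assuming UTF-8 content.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Downloads the body of the given URL using the standard library only.
    static String fetch(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        // Placeholder identification string; replace with your own details.
        conn.setRequestProperty("User-Agent", "example-app contact@example.com");
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);    // milliseconds, arbitrary choice
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        }
    }
}
```

You would call `PageFetcher.fetch(someUrl)` once for the search page and once more for each result link you decide to follow, then hand the returned text to the parser.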

Once you have a DOM, you can use the standard DOM APIs to navigate to the links and items that you want.
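To give an idea of what that navigation might look like, here is a minimal sketch using the standard `org.w3c.dom` interfaces. To keep it self-contained it builds the DOM with the JDK's own `DocumentBuilder`, which only accepts well-formed markup; for real pages you would get the `Document` from an HTML parser instead. The class name `DomNavigator`, the page layout it expects, and the assumption that the state appears as a single word right after "state of incorporation:" are all guesses on my part, so adjust them to what the actual pages contain.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomNavigator {

    // Assumes the state is one alphabetic token after the label (e.g. "DE").
    private static final Pattern STATE = Pattern.compile(
            "state of incorporation:\\s*([A-Za-z]+)", Pattern.CASE_INSENSITIVE);

    // Stand-in for an HTML parser: works only on well-formed markup.
    static Document parse(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Returns the href of every <a> whose text contains the company name
    // (case-insensitive).
    static List<String> linksMatching(Document doc, String companyName) {
        List<String> hrefs = new ArrayList<>();
        String wanted = companyName.toLowerCase();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.getTextContent().toLowerCase().contains(wanted)) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    // Returns the token following "state of incorporation:", or null if absent.
    static String stateOfIncorporation(Document doc) {
        Matcher m = STATE.matcher(doc.getDocumentElement().getTextContent());
        return m.find() ? m.group(1) : null;
    }
}
```

The idea would be to run `linksMatching` on the search results page, fetch the link it returns, and then run `stateOfIncorporation` on that second page. Multi-word state names would need a looser pattern than the one shown.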

jsight