Hello everyone, summer is here! And it is time for me to find a summer programming project, as I am very fond of taking learning into my own hands. My question to you is this: I am looking for methods to extract various data from various websites. I know there are programs out there you can buy, but since I am trying to learn, I want to do it myself. Does anyone have suggestions on a general structure, and if so, what language would you write it in? My first thought was Java, but I am more than willing and grateful to hear anyone else's opinion. Thanks in advance, Eric
I'm tempted to say C#, but that's because I really enjoy working with it! I think it would be more sensible for you to find a language that you like working with and go for that; if you're not personally interested in and motivated by your language, it will be difficult to keep at it.
And as far as structure goes, you will essentially be recursively crawling the web and parsing HTML. Perhaps you could employ Google or another search provider to increase accuracy and results? Treat the internet like a big old folder structure full of juicy information and parse out what you want!
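The recursive crawl described above can be sketched in plain Java. This is only a minimal skeleton: the regex-based link extraction is a stand-in for a real HTML parser, and `fetch` is a hypothetical stub where actual HTTP retrieval (e.g. with `java.net.http.HttpClient`) would go.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {
    // Naive href extraction with a regex; good enough to show the shape,
    // but a real crawler should use a proper HTML parser.
    private static final Pattern LINK =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first traversal over the "folder structure": visit a page,
    // pull out its links, enqueue the unseen ones.
    static void crawl(String seed, int maxPages) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && seen.size() < maxPages) {
            String url = queue.poll();
            if (!seen.add(url)) continue;      // already visited
            String html = fetch(url);          // hypothetical fetch stub
            for (String link : extractLinks(html)) {
                if (!seen.contains(link)) queue.add(link);
            }
        }
    }

    static String fetch(String url) {
        return "";  // stub: plug in real HTTP retrieval here
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> "
                    + "<a href=\"http://example.com/b\">B</a>";
        System.out.println(extractLinks(html));
        // → [http://example.com/a, http://example.com/b]
    }
}
```

The visited set is the important part: without it, a recursive crawl loops forever on pages that link back to each other.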
On this note, though, you will want a language capable of doing web stuff, so perhaps do a little research on that first?
What you use this for is up to you! You could gather information on just about anything :) Sounds like a fun project, have fun!
What kind of data are you trying to extract from websites? Which websites? A little more detail on your idea/project would be helpful.
I recently had the need to look into and try a few HTML parsers to get some data I needed into a more consolidated format.
I tried JTidy (http://jtidy.sourceforge.net/) and looked into Web-Harvest (http://web-harvest.sourceforge.net/). JTidy wouldn't quite do what I wanted and Web-Harvest was overkill.
I ultimately settled on using Java + htmlparser (http://htmlparser.sourceforge.net/).
It took very little development time to get what I needed, and htmlparser lets you define 'filters' that search for specific things in the DOM.
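This isn't htmlparser's own API, but the filter idea can be sketched with only the JDK: parse a (well-formed XHTML) snippet into a DOM and run a small filter interface over it to pick out matching nodes. The `NodeFilter` interface and `select` walk below are illustrative names, not htmlparser's classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FilterSketch {
    // Minimal filter interface, in the spirit of an HTML-parser node filter.
    interface NodeFilter {
        boolean accept(Node node);
    }

    // Recursively walk the DOM and collect every node the filter accepts.
    static List<Node> select(Node root, NodeFilter filter) {
        List<Node> hits = new ArrayList<>();
        if (filter.accept(root)) hits.add(root);
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            hits.addAll(select(kids.item(i), filter));
        }
        return hits;
    }

    static Document parse(String xhtml) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return b.parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<html><body><a href=\"/x\">x</a><p>hi</p><a href=\"/y\">y</a></body></html>");
        // Filter: element nodes named "a" (analogous to a tag-name filter).
        NodeFilter anchors = n -> n.getNodeType() == Node.ELEMENT_NODE
                && ((Element) n).getTagName().equals("a");
        for (Node n : select(doc.getDocumentElement(), anchors)) {
            System.out.println(((Element) n).getAttribute("href"));
        }
        // → /x
        // → /y
    }
}
```

Note the caveat: the JDK's XML parser needs well-formed markup, which is exactly why a forgiving HTML parser (or a tidier like JTidy first) is worth using on real-world pages.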
Look at Hadoop (grids) and Solr (crawlers and indexers). They support heavy processing and efficient indexing (for efficient searching), respectively.