screen-scraping

What is the best way to screen scrape poorly formed XHTML pages for a Java app?

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain. Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text...

Screenscraping the ugliest HTML you've ever seen in your life

I'm using PHP and libtidy to attempt to screen scrape what might possibly be the most horrendous and malformed use of HTML tables in history. The site closes few table, tr, td, font, or bold tags and consistently nests many different layers of tables within tables. Example snippet: <center> <table border="1" bordercolor="#000000" cells...

How can I get IE credentials to use in my code?

I'm currently developing an IE plugin using SpicIE. This plugin does some web scraping similar to the example posted on MSDN: WebRequest request = WebRequest.Create ("http://www.contoso.com/default.html"); request.Credentials = CredentialCache.DefaultCredentials; HttpWebResponse response = (HttpWebResponse)request.GetResponse (); S...

How can I screen-scrape a webmail page?

I am working on a project in which I need to log in to a site and scrape the web page contents. I tried the following code: protected void Page_Load(object sender, EventArgs e) { WebClient webClient = new WebClient(); string strUrl = "http://www.mail.yahoo.com?username=sakthivel123&password=operator&login=1"; byte[] reqH...

What's the best screen scraping language?

Hi, I want to create a desktop app (probably in C#) that scrapes or manipulates a form on a 3rd party web page. Basically I enter my data in the form in the desktop app, it goes away to the 3rd party website and, using the script or whatever in the background, enters my data there (incl. my login) and clicks the submit button for me. I just want t...

Curl function to select options from a select box and auto submit

Hi all, I am a newbie who tries different things every day, and I always come here when I am stuck on something. I want to write a script using cURL and PHP that goes to this link: http://tools.cisco.com/WWChannels/LOCATR/openBasicSearch.do and then goes through each page for each country, capturing a list of every partner in every country a...

Is there any sort of API that'll give me real-time(ish) MLB stats?

I think it'd be fun to build a little mini-fantasy baseball game, but after a bit of Googling, I'm getting the impression that there's no easy and reasonably-priced (or free!) way to get that data. Have any of you done something like this? Should I be thinking about screen-scraping? ...

Python: getting data from an ASP.NET AJAX application

Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine, but am having difficulty changing the value of the year combo box to view data from other years. How can I read the data for years other than the default of 2002? The page appears to be doing an HTTP POST once the ...

Is it possible to get upcoming event / show information from a Myspace page without scraping?

I want to get show information from Myspace artists. One way I could do this is to ask an artist to input their Myspace URL, and I could try to scrape the page. What I would really like to do is ask the artist for their Myspace credentials and use the Myspace API to get their show data. I cannot find how to do this on the myspace develop...

How best to screen scrape a password protected site on behalf of a 3rd party?

I want to write a program that analyzes your fantasy baseball team and notifies you of recommended actions, possibly multiple times per day. The problem is, you aren't playing fantasy baseball on my site; you're playing on Yahoo, or CBS, or ESPN, etc. On the majority of these sites, fantasy teams and leagues are not public, so you must...

How can I log in to YouTube using Perl?

I am trying to write a Perl script to connect to my YouTube account, but it doesn't seem to work. I don't even have an idea of how I could debug this! Maybe it is something related to the HTTPS protocol? Please enlighten me! Thanks in advance. use HTTP::Request:...

Can .NET WebRequest/WebResponse translate accent marks, diacritical marks, and entities correctly?

I am "screen scraping" my own pages as a temporary hack, using .NET's WebRequest. This works well, but accented characters and diacritical characters do not translate correctly. I am wondering if there is a way to make them translate correctly using .NET's many built-in properties and methods. Here is the code I am using to gra...

Find all IPs on an HTML Page

I want to get an HTML page with Python and then print out all the IPs from it. I will define an IP as the following: x.x.x.x:y Where: x = a number between 0 and 256. y = a number with < 7 digits. Thanks. ...
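The format described in this question can be sketched with Python's standard `re` module. This is a minimal illustration, not an answer taken from the thread: the names `IP_PORT` and `find_ips` are hypothetical, the function works on an HTML string that has already been downloaded, and it assumes each `x` should be limited to the valid IPv4 octet range 0-255 rather than the "0 and 256" stated above.

```python
import re

# Assumed pattern: four dot-separated groups of 1-3 digits, a colon,
# then a port of 1-6 digits (the question's "y" with fewer than 7 digits).
# The lookbehind/lookahead stop matches from starting or ending mid-number.
IP_PORT = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,6})(?!\d)"
)

def find_ips(html):
    """Return every 'x.x.x.x:y' string whose four octets are all <= 255."""
    results = []
    for match in IP_PORT.finditer(html):
        octets = [int(group) for group in match.groups()[:4]]
        if all(octet <= 255 for octet in octets):
            results.append(match.group(0))
    return results
```

Checking the range in Python after matching keeps the regular expression short; encoding 0-255 directly in the pattern is possible but much harder to read.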

How do I extract data from a web page with regexes?

I am writing a cURL script for collecting information about some sex offenders. I have developed a script that picks up links like the one given below: http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL) Now when we go to this link, I want to get the information under all the fields on this page, like Offender Id:, last nam...

Python: is BeautifulSoup misreporting my HTML?

I have two machines, each, to the best of my knowledge, running Python 2.5 and BeautifulSoup 3.1.0.1. I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using: from BeautifulSoup import BeautifulSoup import urllib2 base_url = "http://www.utahcritseries.com/RawResults.aspx" data=urllib2.urlopen(base_url) soup=BeautifulSo...

curl not working for getting a web page content, why?

Hi all, I am using a cURL script to go to a link and get its content for further manipulation. The link and the cURL script follow: <?php $url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543'; //cur...

Python lxml screen scraping?

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do. This is why I am here. I need to scrape a page for all of its viewable text, stripping out all tags and JavaScript. I need it to leave me with what text is vi...

How legal is screen scraping?

I'm trying to build an iPhone application that gathers content from real estate websites to display it in a mashed-up and structured manner (mapping, price averages, etc.). I've stumbled upon many sites whose "Terms and Conditions" only allow downloading/re-using the data for personal purposes but not commercial ones. My intent is to hav...

What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

I want to be able to log into a website programmatically and periodically obtain some information from the site. What are the best tools to make this as simple as possible? I'd prefer a Python library of some type because I want to become more proficient in Python, but I'm open to any suggestions. ...

Options for web scraping - C++ version only

I'm looking for a good C++ library for web scraping. It has to be C/C++ and nothing else so please do not direct me to Options for HTML scraping or other SO questions/answers where C++ is not even mentioned. ...