screen-scraping

Getting HTTP request from IE

Is there a way to access the HTTP request IE made once the page has already loaded? For instance, I have an application that opens a browser window. I want to scrape the page, but would like to get the entire HTTP request for that page (not just the URL). I have downloaded the developer tools, but don't see anything in there for the...

Following a Javascript link when scraping from a remote site using PHP

Given a remote page, http://example.com/paged_list.aspx, which uses a JavaScript function call to display several pages of tabular data: javascript: show_page(1) javascript: show_page(2) and so on. Users click on the page links to display each page, which triggers a reload but with no query string, i.e., the URI remains the same. In scrap...

Finding content between two words without RegEx, BeautifulSoup, lxml, etc.

How can I find the content between two words or two sets of random characters? The scraped page is not guaranteed to be HTML only, and the important data can be inside a JavaScript block, so I can't remove the JavaScript. Consider this: <html> <body> <div>StartYYYY "Extract HTML", ENDYYYY </body> Some Java Scripts code STARTXXXX "E...
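
A minimal sketch of marker-based extraction using only plain string searching, with no regex and no HTML parser. The helper name `between` is hypothetical, and the sample text is adapted from the excerpt above. The approach treats the page as opaque text, so it works the same inside markup or a script block.

```python
def between(text, start, end):
    """Return the text between the first `start` marker and the
    next `end` marker after it, or None if either is missing."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)              # skip past the start marker itself
    j = text.find(end, i)        # only look for `end` after `start`
    if j == -1:
        return None
    return text[i:j]


page = '<div>StartYYYY "Extract HTML", ENDYYYY </body>'
print(between(page, "StartYYYY", "ENDYYYY"))
```

Note that `str.find` is case-sensitive, so `StartYYYY` and `STARTXXXX` must be matched with the exact casing used in the page.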

How do I grab an instance of a dynamic php script output?

The following link outputs a different image every time you visit it: http://www.biglickmedia.com/art/random/index.php From a web browser, you can obviously right click it and save what you see. But if I were to visit this link from a command line (like through python+mechanize), how would I save the image that would output? So basical...
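
One way to approach this, sketched with Python's standard library rather than mechanize: whatever the script outputs for a single request is just the response body, so it can be streamed to a file in binary mode. The function name `save_url` and the output path are hypothetical.

```python
import shutil
import urllib.request


def save_url(url, path):
    """Fetch `url` once and write the raw response bytes to `path`.

    The file is opened in binary mode so image data is not altered.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Each call makes a new request, so a script that returns a random image will generally produce a different file on each call.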

How do I data mine various news sources?

I'm working on a free web application that will analyze top news stories throughout the day and provide stats. Most news websites offer RSS feeds, which work fine for knowing which stories to retrieve. However, problems arise when attempting to get the full news story from the news website itself. At the moment, I have separate News...

Price Comparison Script for Products

How would you build a price comparison script? I know Amazon offers a public API, but I saw two sites, goodreads and bookdope, which compare book prices and retrieve prices from Walmart and other websites that do not offer APIs. How do you get prices from sites that do not have an API? I'm using C# and ASP.NET MVC. ...

How can I Programmatically perform a search without using an API?

I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? ...
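
As a rough illustration of the form-submission step: a form submitted with GET is equivalent to requesting the form's action URL with the field values encoded in the query string, so no browser is needed to "type" into the text box. The URL and the field name `q` below are placeholders, not taken from any particular site.

```python
from urllib.parse import urlencode


def build_search_url(action_url, fields):
    """Build the URL a browser would request when a GET form is submitted.

    `action_url` is the form's action attribute and `fields` maps each
    input name to the value that would have been typed in.
    """
    return action_url + "?" + urlencode(fields)


url = build_search_url("http://example.com/search", {"q": "screen scraping"})
print(url)  # http://example.com/search?q=screen+scraping
```

For a POST form, the same `urlencode` output would be sent as the request body instead of being appended to the URL. The resulting page can then be fetched and scraped like any other.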

Scrape current request and zip it up.

I have an ASP.NET website which contains a few pages whose generated content I'd like to export and send to another service for archiving. The best way that I can fathom doing this is to grab the stream and dump it to a file, which is easy enough to do. My main challenge would be to follow the external resources and include them in th...

How to parse html in a client-side script?

What's the best way to create scripts for a browser? I need to parse some HTML pages on different domains. I am on Windows and mostly use Firefox. ...

Scrape FULL image src with PHP

I am trying to scrape img src values with PHP. I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the full path of the image using PHP (browsers can get it if you use the right-click menu)? I.e., how do I get a FULL path including the domain in one of the following tw...
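
The underlying technique is resolving each src against the URL of the page it was scraped from, which is what a browser does. The sketch below shows it in Python with the standard `urljoin`; the page URL and image paths are made-up examples, and a PHP scraper would need an equivalent resolution step.

```python
from urllib.parse import urljoin

# URL of the page the <img> tags were scraped from (hypothetical).
page_url = "http://example.com/gallery/index.html"

# Root-relative, document-relative, and already-absolute src values.
srcs = ["/images/a.jpg", "thumbs/b.jpg", "http://cdn.example.com/c.jpg"]

for src in srcs:
    print(urljoin(page_url, src))
# http://example.com/images/a.jpg
# http://example.com/gallery/thumbs/b.jpg
# http://cdn.example.com/c.jpg
```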

How do you install simplehtmldom in ubuntu

Hi, I am trying to write a screen scraper in PHP. I am having a nightmare trying to figure out how to do regular expressions. However, I have found a library that is supposed to remove the need to use regular expressions when screen scraping. It is called simplehtmldom. However, I can't even figure out how to install it. I have downlo...

web scraping to fill out (and retrieve) search forms?

Hi, I was wondering if it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers); to do this manually I would go to the journal's article search page (e.g., http://pubs.acs...

ASP.NET Screen Scrape Post Simulate

I'm trying to download and parse the HTML of a web page. Recently, the source website moved from having all of their information on one page to hiding part of it behind JavaScript. There's a "Show All" check box that needs to be activated in order to view the whole page. Here's the website: Source Website Essentially I'm looking to automate ...

XPath to Parse "SRC" from IMG tag?

Right now I successfully grabbed the full element from an HTML page with this: //img[@class='photo-large'] for example it would return this: <img src="http://example.com/img.jpg" class='photo-large' /> But I only need the SRC url (http://example.com/img.jpg). Any help? ...
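
In XPath 1.0, appending an attribute step selects the attribute itself, so `//img[@class='photo-large']/@src` targets the src rather than the whole element. The sketch below uses Python's `xml.etree.ElementTree`, whose limited XPath subset cannot return attribute nodes, so it selects the elements and reads the attribute with `.get`. It assumes well-formed markup; the wrapping `<div>` is added only for illustration.

```python
import xml.etree.ElementTree as ET

snippet = """<div><img src="http://example.com/img.jpg" class='photo-large' /></div>"""

root = ET.fromstring(snippet)

# Select the matching <img> elements, then read the attribute from each.
srcs = [img.get("src") for img in root.findall(".//img[@class='photo-large']")]
print(srcs)  # ['http://example.com/img.jpg']
```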

simplehtmldom php: How do you search for one thing or another

I want to scrape some HTML with simplehtmldom in PHP. I have a bunch of tags containing tags. The tags I want alternate between bgcolor=#ffffff and bgcolor=#cccccc. There are some tags that have other bgcolors. I want to get all the code in each tag that has either bgcolor=#ffffff or bgcolor=#cccccc. I can't just use $html->find(...
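
The "one value or another" filter can be illustrated independently of simplehtmldom. This is a sketch in Python's standard `html.parser`, assuming the elements in question are `tr` rows (the excerpt does not say which tags are involved). The class name `RowCollector` and the sample table are hypothetical.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <tr> whose bgcolor is in `colors`."""

    def __init__(self, colors):
        super().__init__()
        self.colors = {c.lower() for c in colors}
        self.rows = []
        self._current = None  # text pieces of the row being collected

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            bg = (dict(attrs).get("bgcolor") or "").lower()
            # Start collecting only when the colour is one we want.
            self._current = [] if bg in self.colors else None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append("".join(self._current).strip())
            self._current = None


html = (
    '<table>'
    '<tr bgcolor="#ffffff"><td>A</td></tr>'
    '<tr bgcolor="#eeeeee"><td>B</td></tr>'
    '<tr bgcolor="#cccccc"><td>C</td></tr>'
    '</table>'
)
collector = RowCollector(["#ffffff", "#cccccc"])
collector.feed(html)
print(collector.rows)  # ['A', 'C']
```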

Wrong results using Simple_HTML_Dom

Hi, I am trying to scrape this web page: http://www.acttab.com.au/interbet/venues?day=today Here is my code: function FindRaceRows($html) { foreach ($rows = $html->find( 'tr[bgcolor="#ffffff"], tr[bgcolor="#cccccc"]') as $row); { echo $row->plaintext . "END ROW<br />\n"; foreach ($row->find...

How do you search by the contents of a tag in simplehtmldom?

Hi, I am trying to write a web scraper using simplehtmldom. I want to get a tag by searching the contents of the tag. This is the plaintext inside it, not the type of tag. Then once I have the tag by searching for the contents of its plain text I want to get the next tag after that. How do I find a tag based on its contents? And on...

Can't separate cells properly with simplehtmldom

I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain text value. I can successfully get this row. But I can't figure out how to get the next row's children which are the cells or <td> tags. if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS",...

Screen scraping gotchas

When screen-scraping, what are the "gotcha"s to look out for? The inspiration for this is: my spouse's co-worker asked me to scrape all the pages from a Blogger-hosted blog that her friend with cancer kept in her final months and this lady wanted to keep all of the posts in case the blog were ever deleted. I eventually found a free tool...

How do I guarantee that UTF-8 characters are scraped accurately using cURL in PHP?

Hello, I am scraping webpages (using PHP's cURL) that have accented characters (like "é"). In the source of those webpages, those characters are written using UTF-8 (they are not HTML-encoded). However, when the result is produced using the following code, I get question marks instead of the accented characters. $ch = curl_init(); $ti...
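
Symptoms like this usually mean the bytes are correct but are being interpreted with the wrong character set somewhere between download and output. The sketch below, in Python rather than PHP, shows the effect and one way to pick the charset from a Content-Type header value. The helper name `decode_body` and the header strings are illustrative assumptions.

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset declared in a
    Content-Type header value, falling back to `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return raw.decode(charset)


raw = "café".encode("utf-8")          # bytes as they arrive over the wire

print(decode_body(raw, "text/html; charset=UTF-8"))  # café
print(raw.decode("latin-1"))                          # cafÃ©
```

The second line shows what happens when UTF-8 bytes are read as Latin-1: the two bytes of "é" become two separate characters. Question marks typically appear at a later step, when the output encoding cannot represent the character at all.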