Is there a way to access the HTTP request IE made when the page is already loaded? For instance, I have an application that opens a browser window. I want to scrape the page, but would like to get the entire HTTP request for that page (not just the URL).
I have downloaded the developer tools, but don't see anything in there for the...
Given remote page:
http://example.com/paged_list.aspx
which uses a JavaScript function call to display several pages of tabular data:
javascript: show_page(1)
javascript: show_page(2)
and so on. Users click on the page links to display each page, which triggers a reload but with no query string, i.e., the URI remains the same.
In scrap...
How to find out the content between two words or two sets of random characters?
The scraped page is not guaranteed to be HTML only, and the important data can be inside a JavaScript block, so I can't remove the JavaScript.
Consider this:
<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY
</body>
Some Java Scripts code STARTXXXX "E...
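One way to pull out the text between two known markers, whether it sits in markup or inside a script block, is a non-greedy regular expression. This is a minimal Python sketch, not the asker's code; the helper name `between` and the sample string are assumptions made for illustration, and the markers are the ones shown in the excerpt above.

```python
import re

def between(text, start, end):
    """Return every substring found between the `start` and `end` markers."""
    # re.escape lets the markers contain regex metacharacters safely.
    # The non-greedy (.*?) stops at the first `end` after each `start`.
    # re.DOTALL lets a match span line breaks.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)

page = '<div>StartYYYY "Extract HTML", ENDYYYY</div>'
print(between(page, "StartYYYY", "ENDYYYY"))  # [' "Extract HTML", ']
```

Because the markers are treated as plain text, this works the same on HTML and on JavaScript source; it does not try to parse either.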
The following link outputs a different image every time you visit it:
http://www.biglickmedia.com/art/random/index.php
From a web browser, you can obviously right click it and save what you see. But if I were to visit this link from a command line (like through python+mechanize), how would I save the image that would output? So basical...
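Since the URL returns a different image on each request, the bytes have to be saved from the same response that was fetched, rather than requesting the URL a second time. A minimal sketch using only Python's standard library (not mechanize); the function name `save_response` is an assumption:

```python
import shutil
import urllib.request

def save_response(url, path):
    """Fetch `url` once and write the raw response bytes to `path`."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # Stream the body to disk without decoding it, so the image
        # is stored byte for byte as the server sent it.
        shutil.copyfileobj(response, out)
    return path
```

Calling `save_response("http://www.biglickmedia.com/art/random/index.php", "random.jpg")` would store whatever image that one request produced; the `.jpg` extension is a guess, since the excerpt does not say which image format the script returns.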
I'm working on a free web application that will analyze top news stories throughout the day and provide stats. Most news websites offer RSS feeds, which work fine for knowing which stories to retrieve. However, the problems arise when attempting to get the full news story from the news website itself. At the moment, I have separate News...
How would you build a price comparison script? I know Amazon offers a public API, but I have seen two sites, goodreads and bookdope, that compare book prices and retrieve prices from Walmart and other websites that do not offer APIs. How do you get prices from sites that do not have an API?
I'm using C# and ASP.NET MVC.
...
I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? ...
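Filling in a text box and submitting the form amounts to sending the form's fields to its action URL, either in the query string (GET) or in the request body (POST). A rough Python sketch of building such a request; the URL `http://example.com/search` and the field name `q` are placeholders, not taken from any real site, and `build_form_request` is a hypothetical helper:

```python
import urllib.parse
import urllib.request

def build_form_request(action_url, fields, method="GET"):
    """Build the request a browser would send when the form is submitted."""
    encoded = urllib.parse.urlencode(fields)
    if method.upper() == "GET":
        # GET forms append the encoded fields to the action URL.
        separator = "&" if "?" in action_url else "?"
        return urllib.request.Request(action_url + separator + encoded)
    # POST forms send the encoded fields in the request body.
    return urllib.request.Request(
        action_url, data=encoded.encode("ascii"), method="POST"
    )

request = build_form_request("http://example.com/search", {"q": "web scraping"})
print(request.full_url)  # http://example.com/search?q=web+scraping
```

Passing the request to `urllib.request.urlopen(request)` returns the results page, which can then be scraped like any other HTML. The real field names and action URL have to be read from the form's markup.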
I have an ASP.NET website with a few pages whose generated content I'd like to export and send to another service for archiving.
The best way that I can fathom doing this is to grab the stream and dump it to a file, which is easy enough to do. My main challenge would be to follow the external resources and include them in th...
What's the best way to create scripts for a browser?
I need to parse some HTML pages on different domains.
I am on Windows and mostly use Firefox.
...
I am trying to scrape img src values with PHP. I can get the src fine, but if the src does not include the full path, I can't really reuse it. Is there a way to grab the full path of the image using PHP? (Browsers can get it if you use the right-click menu.)
i.e., how do I get a FULL path including the domain in one of the following tw...
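A browser builds the full address by resolving the src against the URL of the page it came from (or against a `<base href>` if the page declares one). The sketch below shows that resolution in Python with `urllib.parse.urljoin`, rather than in PHP; the page URL and the sample src values are made up for illustration, and `absolute_src` is a hypothetical helper name.

```python
from urllib.parse import urljoin

def absolute_src(page_url, src):
    """Resolve an img src against the URL of the page that contains it."""
    return urljoin(page_url, src)

page_url = "http://example.com/gallery/index.html"  # hypothetical page address
for src in ["img/photo.jpg", "/img/photo.jpg", "../photo.jpg",
            "http://cdn.example.com/photo.jpg"]:
    print(absolute_src(page_url, src))
# http://example.com/gallery/img/photo.jpg
# http://example.com/img/photo.jpg
# http://example.com/photo.jpg
# http://cdn.example.com/photo.jpg
```

The key point is that the scraper has to keep the page URL alongside each src it extracts, because the src alone does not carry enough information.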
Hi,
I am trying to write a screen scraper in PHP. I am having a nightmare trying to figure out how to do regular expressions. However, I have found a library that is supposed to remove the need to use regular expressions when screen scraping. It is called simplehtmldom.
However I can't even figure out how to install it. I have downlo...
Hi, I was wondering if it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers); to do this manually, I would go to the journal articles search page (e.g., http://pubs.acs...
I'm trying to download and parse the HTML of a web page. Recently, the source website moved from having all of their information on one page to hiding part of it behind JavaScript. There's a "Show All" check box that needs to be activated in order to view the whole page.
Here's the website: Source Website
Essentially I'm looking to automate ...
Right now I successfully grabbed the full element from an HTML page with this:
//img[@class='photo-large']
for example it would return this:
<img src="http://example.com/img.jpg" class='photo-large' />
But I only need the SRC url (http://example.com/img.jpg). Any help?
...
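In full XPath 1.0, appending an attribute step selects the attribute itself: `//img[@class='photo-large']/@src`. Whether that works depends on the XPath engine in use, which the excerpt does not name. As a hedged illustration, Python's standard `xml.etree.ElementTree` supports only a subset of XPath without attribute steps, so the sketch below selects the element and then reads the attribute from it; the wrapping `<div>` is added only to make the sample a complete document.

```python
import xml.etree.ElementTree as ET

snippet = """<div><img src="http://example.com/img.jpg" class='photo-large' /></div>"""
root = ET.fromstring(snippet)

# Select the matching elements, then read the attribute value from each.
srcs = [img.get("src") for img in root.findall(".//img[@class='photo-large']")]
print(srcs)  # ['http://example.com/img.jpg']
```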
I want to scrape some HTML with simplehtmldom in PHP. I have a bunch of tags containing tags. The tags I want alternate between bgcolor=#ffffff and bgcolor=#cccccc. There are some tags that have other bgcolors.
I want to get all the code in each tag that has either bgcolor=#ffffff or bgcolor=#cccccc. I can't just use $html->find(...
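The selection described here, keeping only rows whose bgcolor is one of two values, can be sketched independently of simplehtmldom. The Python example below uses the standard `html.parser` module and assumes the tags in question are `<tr>` rows; the class name `RowCollector` and the sample table are invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <tr> whose bgcolor is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {color.lower() for color in wanted}
        self.rows = []        # finished rows, each a list of text fragments
        self._current = None  # fragments of the row being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            color = (dict(attrs).get("bgcolor") or "").lower()
            # Start collecting only when the row has a wanted colour.
            self._current = [] if color in self.wanted else None

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current.append(data.strip())

sample = """
<table>
<tr bgcolor="#ffffff"><td>a</td><td>b</td></tr>
<tr bgcolor="#eeeeee"><td>skip</td></tr>
<tr bgcolor="#cccccc"><td>c</td><td>d</td></tr>
</table>
"""
collector = RowCollector(["#ffffff", "#cccccc"])
collector.feed(sample)
print(collector.rows)  # [['a', 'b'], ['c', 'd']]
```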
Hi,
I am trying to scrape this web page: http://www.acttab.com.au/interbet/venues?day=today
Here is my code:
function FindRaceRows($html) {
    foreach ($rows = $html->find(
        'tr[bgcolor="#ffffff"], tr[bgcolor="#cccccc"]') as
        $row);
    {
        echo $row->plaintext . "END ROW<br />\n";
        foreach ($row->find...
Hi,
I am trying to write a web scraper using simplehtmldom. I want to get a tag by searching the contents of the tag. This is the plaintext inside it, not the type of tag. Then once I have the tag by searching for the contents of its plain text I want to get the next tag after that.
How do I find a tag based on its contents? And on...
I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain text value. I can successfully get this row. But I can't figure out how to get the next row's children which are the cells or <td> tags.
if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS",...
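The general idea, finding a row by its text and then taking the cells of the row that follows it, can be sketched without simplehtmldom. This Python example uses the standard `html.parser` module; `TableRows`, `cells_after`, and the venue names in the sample are assumptions for illustration and are not related to the asker's `FindTagByText` helper.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect each <tr> as a list of its cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(" ".join(self._cell))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

def cells_after(rows, heading):
    """Return the cells of the row that follows the row whose text is `heading`."""
    for index, row in enumerate(rows[:-1]):
        if " ".join(row).strip() == heading:
            return rows[index + 1]
    return None

sample = """
<table>
<tr><td>THOROUGHBRED MEETINGS</td></tr>
<tr><td>Venue A</td><td>Venue B</td></tr>
<tr><td>OTHER</td></tr>
</table>
"""
parser = TableRows()
parser.feed(sample)
print(cells_after(parser.rows, "THOROUGHBRED MEETINGS"))  # ['Venue A', 'Venue B']
```

Collecting the rows in document order first makes "the next row" a simple index lookup, which avoids having to walk sibling nodes in the tree.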
When screen-scraping, what are the "gotcha"s to look out for?
The inspiration for this is: my spouse's co-worker asked me to scrape all the pages from a Blogger-hosted blog that her friend with cancer kept in her final months and this lady wanted to keep all of the posts in case the blog were ever deleted. I eventually found a free tool...
Hello,
I am scraping webpages (using PHP's cURL) that have accented characters (like "é").
In the source of those webpages, those characters are written in UTF-8 (they are not HTML-encoded).
However, when the result is produced using the following code, I get question marks instead of the accented characters.
$ch = curl_init();
$ti...