screen-scraping

urllib2 returns a different page than the browser does?

I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone seen this before? How can I get around it? This is the code I'm using: >>> from BeautifulSoup import BeautifulSoup >>> import urllib2 >>> page = urllib2.urlopen("http://192.168.1.254/index.cgi...
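
One common cause of this symptom is that the device inspects the User-Agent header, which urllib2 sets to a Python-specific value by default. The sketch below is written for Python 3's `urllib.request` (the successor of urllib2; `urllib2.Request` accepts the same `headers` argument). The URL and the User-Agent string are placeholder assumptions, and the router may also vary its response on cookies or authentication, which this does not cover.

```python
import urllib.request

# Placeholder address (assumption): substitute the router's real admin URL.
ROUTER_URL = "http://192.168.1.254/"

# A browser-like User-Agent string; an illustrative value, not a required one.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch the page body as bytes; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Calling `fetch(ROUTER_URL)` and comparing the result with the browser's "view source" output is a quick way to check whether the header was the difference.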

Scraping dynamic content

Hi, I am working on a web scraping project. Does anybody have an idea how to scrape dynamic content? Dynamic content driven by the query string is similar to static content, but dynamic content triggered by an event on a control within the same page is where I am stuck, because in that case the page URL remains the same. I am using C#. Thanks in ...

How can I convert an XML document/file to a DOM in Python?

I need to do screen scraping, and for that I need to convert XML to a DOM in Python. Can anybody tell me how to do this? ...
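
A minimal sketch of one way to do this with the standard library's `xml.dom.minidom`. The sample XML and the names `xml_text`, `ids`, and `texts` are made up for illustration; for a file on disk, `minidom.parse(path)` takes the place of `parseString`.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
xml_text = "<items><item id='1'>first</item><item id='2'>second</item></items>"

# Parse the string into a DOM Document (use minidom.parse(path) for a file).
dom = minidom.parseString(xml_text)

# Standard DOM calls are then available on the document.
items = dom.getElementsByTagName("item")
ids = [node.getAttribute("id") for node in items]
texts = [node.firstChild.data for node in items]
```

Note that `minidom` expects well-formed XML; real-world HTML pages usually need an HTML-tolerant parser instead.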

Improving a regular expression to neatly match all "http"-only URLs

I have tried the expressions given below. (http:\/\/.*?)['\"\< \>] (http:\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;\"]*[-a-zA-Z0-9+&@#\/%=~_|\"]) The first one works well but always returns an extra trailing character with the matched URLs. E.g.: http://domain.com/path.html" http://domain.com/path.html< Notice the " and <. I don't want them with ...
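
One possible alternative, sketched in Python: rather than matching lazily up to a delimiter (which puts the delimiter inside the overall match), match only characters that are not delimiters. The pattern and the helper name `find_http_urls` are illustrative assumptions, not the asker's code, and trailing sentence punctuation such as a final period would still be captured.

```python
import re

# "http://" followed by one or more characters that are not whitespace,
# quotes, or angle brackets. The delimiter is never part of the match.
URL_RE = re.compile(r"""http://[^\s'"<>]+""")


def find_http_urls(text):
    """Return every http:// URL in text, without surrounding delimiters."""
    return URL_RE.findall(text)


sample = (
    'a href="http://domain.com/path.html" and '
    "<http://domain.com/other.html> but not https://secure.example/x"
)
urls = find_http_urls(sample)
```

Because the pattern requires the literal `http://`, `https://` links are skipped, which matches the "http only" requirement in the title.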

How to capture part of a screen

I am using the win32 PrintWindow function to capture a screen to a BitMap object. If I only want to capture a region of the window, how can I crop the image in memory? Here is the code I'm using to capture the entire window: [System.Runtime.InteropServices.DllImport(strUSER32DLL, CharSet = CharSet.Auto, SetLastError = true)] public ...

Android/Java: Simulate a click on this webpage.

Hello all. Last year I made an Android application that scraped information about my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/). This application was really cool and allowed users to talk with other people on the train (I run a messaging server), and the conversations wre...

scrubyt -> Check for tag existence?

I'm trying to use scrubyt to scrape a page and have everything working except for a decent way of advancing to the next page of the results. The next_page approach isn't working due to the url being relative. I figured out a simple way to do it but it all hinges on being able to use something like: if node_exists("//div[@class='pagina...

How to click a JavaScript-driven radio button and fetch the next page using Python (urllib2)? [WebScrape, selenium]

I am struggling with this. I have a fully tested Python script. I have to make a small change in which I first click a radio button, which in turn automatically executes a JavaScript function that forwards the page to a search form. My working platform: Linux. Language: Python. Radio button code: <input type="radio" language...

Impossible site for HtmlUnit?

I cannot, for the life of me, rig HtmlUnit up to grab this site: http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=...

Help with a strange Python scraping error: HTTPError on one machine while it works on others.

I am using a proxy, and the following is the code.
    req = urllib2.Request(url)
    # run the request for each proxy
    # now set the proxy
    req.set_proxy(proxy, "http")
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
    req.add_hea...

Problem with MSHTML COM clicking on submit button

I'm having a problem screen-scraping some data from this website using the MSHTML COM component. I have a WebBrowser control on my WPF form. The code where I retrieve the HTML elements is in the WebBrowser LoadCompleted event. After I set the values of the data to the HTMLInputElement and call the click method on the HTMLInputButtonEleme...

Screen Scraping of Image Links in PHP

I have a website that contains many different pages of products, and each page has a certain number of images in the same format across all pages. I want to be able to screen-scrape each page's URL so I can retrieve the URL of each image from each page. The idea is to make a gallery for each page made up of hotlinked images. I know this c...

Echo image URL with tags in PHP

I have previously asked a question on how to echo the URLs of images from an HTML page. I can do this successfully, but how can I narrow this down further so that only image URLs beginning with a certain phrase are shown? Furthermore, how can I add an image tag around them so the images are displayed as images and not just text? E.g. I only wan...

Edit HTML page and redisplay in PHP

So I have been using a method to retrieve images from a website, but I thought it may be easier to simply show the page without some details I don't want displayed. The website in particular knows we are doing this, so there shouldn't be any legal complications. So would it be possible to open the HTML page within PHP, search for a specific ...

Remove section of HTML with PHP

I have an HTML page that I want to edit. I want to remove a certain section like the following: <ul class="agentDetail"> ........ ....... ........ </ul> I want to be able to remove the tags and all the content between them. The idea is to edit a page and redisplay it minus some data that I don't want to be seen (hence the removal of s...

PHP DOMElement getElementsByTagName specific selector

$content = file_get_contents("http://www.domain.com/page.html"); $dom = new DOMDocument(); if (!@$dom->loadHTML($content)) die ("Couldn't load file?"); $title = $dom->getElementById("cssid"); $data['heading'] = $title->nodeValue; // this works fine I would like to be able to select all p tags that are within a certain id. With jQuery ...

Detecting Flash on a web page

Hi, I want to programmatically detect Flash on a web page. From my search, I understand I need to parse the code and look for embed tags that have the attribute "application/x-shockwave-flash". Is that all? Or are there other ways to embed Flash into a web page? Thank you. ...
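
A rough sketch of the static check described above, using Python's standard `html.parser`. Besides `embed` tags with the Flash MIME type, it also looks at `object` tags, which can carry the same `type`, the Flash ActiveX `classid`, or a `.swf` URL in `src`/`data`. The class and function names are illustrative assumptions, and Flash injected by JavaScript at runtime (for example through a helper library such as SWFObject) would not be caught by parsing the static HTML.

```python
from html.parser import HTMLParser

FLASH_MIME = "application/x-shockwave-flash"
# classid used by the Flash ActiveX control in <object> tags (compared lowercased).
FLASH_CLSID = "clsid:d27cdb6e-ae6d-11cf-96b8-444553540000"


class FlashDetector(HTMLParser):
    """Sets self.found when an <embed> or <object> tag looks like Flash."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("embed", "object"):
            return
        # Attribute names arrive lowercased; lowercase the values as well.
        attrs = {name: (value or "").lower() for name, value in attrs}
        if attrs.get("type") == FLASH_MIME or attrs.get("classid") == FLASH_CLSID:
            self.found = True
        # Fall back to the file extension of the referenced movie.
        movie = attrs.get("src") or attrs.get("data") or ""
        if movie.split("?")[0].endswith(".swf"):
            self.found = True


def has_flash(html):
    """Return True if the static HTML contains a Flash embed or object tag."""
    detector = FlashDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

For example, `has_flash('<embed src="movie.swf" type="application/x-shockwave-flash">')` evaluates to `True`, while a page with no `embed` or `object` tags yields `False`.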

Can I use Hpricot to find the main article text of any/most websites?

I need a way of extracting the main text from any webpage that displays an article. Similar to the way that Readability can find the main text on any website that it's run on. I'm using Ruby on Rails, so I think Hpricot is my best bet. Is what I'm looking for possible in Hpricot? Is there an example somewhere? Thanks for reading. ...

IronRuby download file using the WebClient "Not enough storage is available to process this command"

When I enter the following two lines into the IronRuby interactive console: wc = System::Net::WebClient.new doc = wc.DownloadString("http://yahoo.com") I get the following error: => mscorlib:0:in `WinIOError': Not enough storage is available to process this command.\r\n (IOError) from mscorlib:0:in `Write' fr...

What is the best paid proxy service for a Google search ranking tool?

Hi all, I'm struggling to find a good and reliable paid proxy service to run a script that reports on organic search results for a set of keywords. Does anyone have any recommendations? We'll be analyzing 60 keywords against 30 URLs per day, and our setup is LAMP-based, using cURL for the script. Any advice would be welcome. Thanks Jon...