I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone run into this before? How can I get around it?
This is the code I'm using:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi...
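One possibility worth checking: the router may vary its response on the User-Agent header, or it may expect the HTTP Basic authentication that the browser has already supplied. A minimal sketch of both ideas, assuming Python 2 with urllib2 and BeautifulSoup as in the snippet above; the URL path and credentials are placeholders, not values from the question:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://192.168.1.254/"  # placeholder; use the actual admin URL

# Browser-like headers, in case the embedded web server varies output by User-Agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Optional: HTTP Basic auth, which many router admin pages require (placeholder credentials).
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "admin", "password")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))

request = urllib2.Request(url, headers=headers)
page = opener.open(request)
soup = BeautifulSoup(page.read())
print soup.prettify()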
Hi,
I am working on a web scraping project. Does anybody have any ideas about scraping dynamic content?
Dynamic content driven by the query string is similar to static content, but dynamic content triggered by an event on a control within the same page is the point where I am stuck, because in that case the page URL stays the same.
I am using C#.
Thanks in ...
I need to do screen scraping, and for that I need to convert XML to a DOM in Python. Please can anybody tell me how I can do this?...
...
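For the XML-to-DOM part, the standard library's xml.dom.minidom can parse an XML string or file into a DOM tree. A minimal sketch; the sample XML is invented purely for illustration:

from xml.dom import minidom

xml_text = "<page><title>Example</title><body>Hello</body></page>"

dom = minidom.parseString(xml_text)          # or minidom.parse("file.xml") for a file
for node in dom.getElementsByTagName("title"):
    print node.firstChild.data               # -> Example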
I have tried the expressions given below.
(http:\/\/.*?)['\"\< \>]
(http:\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;\"]*[-a-zA-Z0-9+&@#\/%=~_|\"])
The first one works well, but it always includes one extra trailing character with the matched URLs.
Eg:
http://domain.com/path.html"
http://domain.com/path.html<
Notice the trailing characters:
" <
I don't want them with ...
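Note that the first pattern already captures the URL in group 1 and matches the delimiter outside the group, so the trailing character suggests the full match is being used rather than the captured group. A small Python sketch of that idea; the sample text is invented:

import re

# Sample text is invented; group 1 excludes the delimiter that ended the match.
text = '<a href="http://domain.com/path.html">link</a> see <http://domain.com/other.html>'

pattern = re.compile(r"(http://.*?)['\"<> ]")
for match in pattern.finditer(text):
    print match.group(1)   # URLs without the trailing ", <, > or space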
I am using the win32 PrintWindow function to capture a screen to a BitMap object.
If I only want to capture a region of the window, how can I crop the image in memory?
Here is the code I'm using to capture the entire window:
[System.Runtime.InteropServices.DllImport(strUSER32DLL, CharSet = CharSet.Auto, SetLastError = true)]
public ...
Hello all
Last year I made an Android application that scraped the information from my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/).
This application was really cool and allowed users to talk with other people on the train (a messaging server is run by me), and the conversations were...
I'm trying to use scrubyt to scrape a page and have everything working except for a decent way of advancing to the next page of the results. The next_page approach isn't working because the URL is relative.
I figured out a simple way to do it but it all hinges on being able to use something like:
if node_exists("//div[@class='pagina...
I am struggling with this. I have a fully tested Python script. I have to make a small change: I first have to click a radio button, which in turn automatically executes a JavaScript function that forwards the page to a search form.
My working platform : Linux
Language : Python
Radio button code :
<input type="radio" language...
I cannot, for the life of me, rig HtmlUnit up to grab this site:
http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=...
I am using a proxy and following is the code.
req = urllib2.Request(url)
# run the request for each proxy
# now set the proxy
req.set_proxy(proxy, "http")
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
req.add_hea...
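For reference, a complete, minimal version of that pattern might look like the following sketch; the URL and proxy address are placeholders, not values from the snippet above:

import urllib2

url = "http://example.com/"      # placeholder target
proxy = "127.0.0.1:8080"         # placeholder host:port of the proxy to test

req = urllib2.Request(url)
req.set_proxy(proxy, "http")     # route this single request through the proxy
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) '
               'Gecko/2008092417 Firefox/3.0.3')

response = urllib2.urlopen(req)
print response.getcode(), len(response.read())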
I'm having a problem screen scraping some data from this website using the MSHTML COM component. I have a WebBrowser control on my WPF form.
The code where I retrieve the HTML elements is in the WebBrowser LoadCompleted event. After I set the values of the data on the HTMLInputElement and call the click method on the HTMLInputButtonEleme...
I have a website that contains many different pages of products, and each page has a certain number of images in the same format across all pages. I want to be able to screen scrape each page's URL so I can retrieve the URL of each image from each page. The idea is to make a gallery for each page made up of hotlinked images.
I know this c...
I have previously asked a question about how to echo the URLs of images from an HTML page. I can do this successfully, but how can I narrow it down further so that only image URLs beginning with a certain phrase are shown? Furthermore, how can I add an image tag around them so the images are displayed as images and not just text?
e.g I only wan...
So I have been using a method to retrieve images from a website, but I thought it might be easier to simply show the page without some details I don't want displayed. The website in particular knows we are doing this, so there shouldn't be any legal complications. So would it be possible to open the HTML page within PHP, search for a specific ...
I have an html page that I want to edit. I want to remove a certain section like the following:
<ul class="agentDetail">
........
.......
........
</ul>
I want to be able to remove the tags and all the content between them. The idea is to edit a page and redisplay it minus some data that I don't want to be seen (hence the removal of s...
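The question doesn't name a language, so purely as an illustration, here is a minimal sketch in Python with BeautifulSoup that drops every <ul class="agentDetail"> block, tags and contents included, and re-serializes the page; the sample HTML is invented:

from BeautifulSoup import BeautifulSoup

html = """<html><body>
  <h1>Agent</h1>
  <ul class="agentDetail"><li>phone</li><li>email</li></ul>
  <p>Keep this part.</p>
</body></html>"""

soup = BeautifulSoup(html)
for ul in soup.findAll("ul", {"class": "agentDetail"}):
    ul.extract()             # detaches the tag and everything inside it

print str(soup)              # the page without the removed section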
$content = file_get_contents('http://www.domain.com/page.html');
$dom = new DOMDocument();
if (!@$dom->loadHTML($content)) die ("Couldn't load file?");
$title = $dom->getElementById("cssid");
$data['heading'] = $title->nodeValue; // this works fine
I would like to be able to select all p tags that are within a certain id. With jQuery ...
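One common way to express "all p tags inside a given id" is an XPath query such as //*[@id="cssid"]//p; PHP's DOMXPath::query() accepts the same expression. Purely as an illustration, a minimal sketch of the idea in Python's lxml (the id and sample HTML are invented):

from lxml import html

doc = html.fromstring("""<html><body>
  <div id="cssid"><p>first</p><div><p>second</p></div></div>
  <p>outside</p>
</body></html>""")

for p in doc.xpath('//*[@id="cssid"]//p'):
    print p.text_content()   # -> first, second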
Hi,
I want to programmatically detect Flash on a web page.
From my search, I understand I need to parse the markup and look for embed tags whose type attribute is "application/x-shockwave-flash".
Is that all? Or are there other ways to embed Flash in a web page?
Thank you.
...
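Besides <embed>, Flash is also commonly placed with an <object> tag (type "application/x-shockwave-flash" or a data/src attribute pointing at a .swf file) and sometimes injected by JavaScript such as SWFObject, which a static HTML scan won't see. A rough sketch of the static check in Python with BeautifulSoup; the sample HTML is invented:

from BeautifulSoup import BeautifulSoup

def has_flash(markup):
    soup = BeautifulSoup(markup)
    for tag in soup.findAll(["embed", "object"]):
        if tag.get("type", "").lower() == "application/x-shockwave-flash":
            return True
        src = tag.get("src", "") or tag.get("data", "")
        if src.lower().endswith(".swf"):
            return True
    return False

sample = '<object data="banner.swf" type="application/x-shockwave-flash"></object>'
print has_flash(sample)   # -> True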
I need a way of extracting the main text from any webpage that displays an article. Similar to the way that Readability can find the main text on any website that it's run on.
I'm using Ruby on Rails, so I think Hpricot is my best bet. Is what I'm looking for possible in Hpricot? Is there an example somewhere? Thanks for reading.
...
Entering the following two lines into the IronRuby interactive console:
wc = System::Net::WebClient.new
doc = wc.DownloadString("http://yahoo.com")
I get the following error.
=> mscorlib:0:in `WinIOError': Not enough storage is available to process this command.\r\n (IOError)
from mscorlib:0:in `Write'
fr...
Hi all
I'm struggling to find a good and reliable paid proxy service to run a script that reports on organic search results for a set of keywords.
Does anyone have any recommendations? We'll be analyzing 60 keywords against 30 URLs per day, and our setup is LAMP-based, using cURL for the script.
Any advice would be welcome.
Thanks
Jon...