screen-scraping

Mechanize not recognizing anchor tags via CSS selector methods

(Hope this isn't a breach of etiquette: I posted this on RailsForum, but I haven't been getting much response from there recently.) Has anyone else had problems with Mechanize not recognizing anchor tags via CSS selectors? The HTML looks like this (snippet with white space removed for clarity): <td class='calendarCell' align='left'> <...

How to make C# HttpWebRequest wait for query results

Hi guys, I'm trying to scrape a quote engine, but HttpWebResponse returns only the "please wait" screen, not the final result. Is there a way to make the request wait for the response? Status codes won't work because the "please wait" screen returns 200. I guess it redirects after it has retrieved the results from the database, but that red...

Scraping and using data using PHP from a website that must be logged on to (Reddit)?

I would like to create a webpage that, given two reddit usernames and their passwords, subscribes user2 to all of the subreddits that user1 is subscribed to. So I need to: (1) get the subreddits that user1 is subscribed to; (2) subscribe user2 to those reddits. I have experience using PHP, but I have no experience with scraping (especially wh...

How to parse just the text from a Word Doc using Python?

When you try opening an MS Word document, or for that matter most Windows file formats, you see gibberish like that given below, broken intermittently by the actual text. I need to extract the actual text and ignore the gibberish, which looks something like the sample below. How do I extract only the text that matters, and ignore rest...

Web scraping in VBA and Excel

I'm looking to download data tables from websites into Excel and pull specific pieces of data into a separate worksheet to create a database. I'm having trouble parsing numbers that are imported into one cell in Excel. For example, the numbers "-7 -110" written exactly like that on the website are inputted into one cell in Excel as "=-...

How to use HTML Parser to get complete information about all tags in the HTML page

I am using HTML Parser to develop an application. The code below does not get the entire set of tags in the page. Some tags are missed, along with their attributes and text bodies. Please help explain why this is happening, or suggest another way. URL url = new URL("..."); PrintWr...

Maintaining session in an Eventlet page scraper?

Hello, I'm trying to do some scraping of a site that requires authentication (not http auth). The script I'm using is based on this eventlet example. Basically, urls = ["https://mysecuresite.com/data.aspx?itemid=blah1", "https://mysecuresite.com/data.aspx?itemid=blah2", "https://mysecuresite.com/data.aspx?itemid=blah3"] impo...

safariwatir: how to select anonymous button

I'm using Watir for Safari with Ruby 1.8.7 on OS X Snow Leopard. I want to click a button, the only one on the page, that has neither an id nor a name. It only has an onclick property and the text within the tag. How do I do that? Is there a way to list all buttons on the page and get the first (and only) one? Thanks ...

Scraping a phpbb forum

I want to know if it's possible to copy all the publicly available posts and data from one phpbb3 forum to a remote one without the database passwords and such, and if so, the simplest way to go about it. Details of the situation: We need to move the forums to a new and better place, but the guy who owns and operates the server where we...

A good web data extraction/screen scraper program?

I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program? I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. Just need a decent program that's reliable and doesn't require...

Is there a Python equivalent for the Perl module Term::VT102?

In Perl there is a very handy module, Term::VT102, which allows you to create a screen in memory. This is very handy for scraping purposes since you can keep track of all the changes to portions of the screen and then export the screen as plain-text for processing. Is there an equivalent module in Python? Followup Question: There are mo...

Scraping for a "preview" of a webpage - Python

Hi folks, I'm indexing a list of links. These links update quite often, so I'm automating thumbnails for the sites. For most sites it's easy, as I just grab the biggest image on the page, hoping it describes the content. But other times there are videos as the main content of the page. Does somebody have tips for dealing with this? That...

Display filter C#

Hello all. It's a little hard to explain what I need, but I'll try: I need to write an application (WinForms) that acts as a kind of filter over the image or other forms behind it. Everything behind the form should look as is, with one exception: a given color (red, for example) has to be replaced with another specified color (white, for example). ...

Is there a way to specify a fixed (or variable) number of elements for lxml in Python

There must be an easier way to do this. I need some text from a large number of HTML documents. In my tests, the most reliable way to find it is to look for a specific word in the text_content of the div elements. If I want to inspect a specific element above the one that has my text, I have been enumerating my list of div elements and us...

Auto-detecting product data feeds for an arbitrary E-Commerce site?

Hey all! My web app needs to access an arbitrary E-Commerce store and determine whether or not it has a product data feed (i.e. a Google Base feed; an RSS/ATOM feed of all products in the store). Also, I need to extract the location of this feed. The best solution I can think of so far is to maintain a comprehensive list of known loca...

What is the best method or tool to scrape web sites?

Hello all, I need to scrape web sites (with approval). Before I start to write my own, what is the best tool/way to scrape web sites that is both fast (multithreaded) and easy to learn? ...

Setting up a Python screen scraper that could work on Google App Engine

I am looking to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape the site and put the specified results into an Entity in App Engine. I am looking for some directions on what to use. I have seen BeautifulSoup but wonder if people could recommend anything else that could run on Google Ap...

I want to scrape a site using GAE and post the results into a Google Entity

I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&s...

Impose access limits from Apache to prevent scraping?

Hello, the problem is a content website that is being scraped so badly that it breaks the server. Is there an easy method of limiting access for IPs to a fixed number of requests at a time or per day (10 pages / day, or 10 pages every 2 minutes)? Ideally, I would keep a wildcard list for search engines and disallow everybod...

How many iMacros can run at the same time?

We're using iMacros to fill web forms. Does anyone know how many instances of iMacros can be run at the same time on a PC? If I need to automatically fill web forms for screen scraping, is there a better tool if I need "tons" of instances to run simultaneously? Thanks. ...