questions about webscraping | ansaurus

webscraping

find name of company at URL

Hello, given the URL of a well-known company (e.g. http://mcdonalds.com/), how would you automatically and reliably find the company name (in this case "McDonald's")? Thanks. Edit: someone voted to close this question, so maybe I need to explain the motivation. I have a large list of company URLs and I want to find data about each compan...

How to webscrape scholar.google.com in Java?

I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory". Does anyone have suggestions for libraries that will make my life easy? Thanks! ...

Error in using Python/mechanize select_form()?

Hello, I am trying to scrape some data from a website. The script I am trying to write should get the content of the page http://www.atpworldtour.com/Rankings/Singles.aspx. It should simulate the user going through every option for Additional Standings and the dates and simulate clicking on Go, then after fetching the data should use the...

webpagescraping

Automate Post Of Login Details User & Password Into Safari For Scraping

Hi all, I want to automate the input of POST variables on a login page for the purpose of web scraping. It would improve the process no end if I could get past the login page. Then I could schedule some functions to run automatically on a cycle. (I had a go with some cURL commands but could not get the result.) Thanks for any help,...

Logic for Implementing a Dynamic Web Scraper in C#

I am looking to develop a web scraper (in C# Windows Forms). The whole idea I am trying to accomplish is as follows. Get the URL from the user. Load the web page in the IE UI control (embedded browser) in WinForms. Allow the user to select a text (contiguous, small (not exceeding 50 chars)) from the loaded web page. When ...

Python Dynamic module loading based on input

I wrote a program that takes in a partial RSS feed and outputs a full one, but it works on a case-by-case basis. The recipe for one site is not the same as the recipe for another. So what I do is look at the domain basename (for instance nyt or wsj) and choose a module based on that. Though I need to load each and every module before h...
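One possible approach to the question above, as a minimal sketch rather than the asker's actual code: import the recipe module only when a feed for that domain is requested, using the standard library's `importlib.import_module`. The helper names `domain_basename` and `load_recipe`, and the optional package name, are hypothetical and not from the question.

```python
import importlib
from urllib.parse import urlparse


def domain_basename(url):
    """Return the label just before the top-level domain, e.g. 'nyt' for www.nyt.com.

    Assumption: a single-label TLD. Hosts such as example.co.uk would need
    extra handling that this sketch does not attempt.
    """
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host


def load_recipe(url, package=None):
    """Import and return the recipe module named after the URL's domain basename.

    `package` is a hypothetical package holding one module per site
    (for example recipes/nyt.py); with package=None a top-level module is imported.
    """
    name = domain_basename(url)
    module_name = f"{package}.{name}" if package else name
    # import_module raises ModuleNotFoundError if no recipe exists for this site.
    return importlib.import_module(module_name)
```

Because Python caches imported modules in `sys.modules`, calling `load_recipe` repeatedly for the same site does not re-import the module, so nothing has to be loaded up front.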

Getting HTML tags using C#

OK, I've got this code: public static string ScreenScrape(string url) { System.Net.WebRequest request = System.Net.WebRequest.Create(url); // set properties of the request using (System.Net.WebResponse response = request.GetResponse()) { using (System.IO.StreamReader reader = new System.IO.S...

How to get a feed of all new products from Amazon

Amazon exposes RSS feeds for new products with a certain tag, such as http://www.amazon.com/rss/tag/blu-ray/new. They also expose new popular products with http://www.amazon.com/gp/new-releases/books. Is there a way to get a feed of all new products, regardless of tag and popularity? ...

How to find which ISBNs are in use

I am trying to find a list of which ISBNs are in use. I guess I could scrape a website like Amazon, but that would waste a lot of bandwidth. Is there a better (free) way? ...

Web scraping with Python

I'm currently trying to scrape a website that has fairly poorly formatted HTML (often missing closing tags, no use of classes or ids so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely), I run into a page where ...

How do you Screen Scrape?

When there is no web service API available, your only option might be to screen scrape, but how do you do it in C#? How would you go about it? ...

How to display html formatted text in text area of java application?

I am scraping data from a website using my Java application and want to display the result, after parsing the HTML of the page, in a text area made in Swing. Text like: hello <b>every</b>one should be displayed as: 'hello everyone' in the text area. Thanks!! ...

How can I use R (RCurl/XML packages?!) to scrape this webpage?

Hi all, I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes: I would like to go through all the "species pages" present in this link: http://gtrnadb.ucsc.edu/ So for each of them I will go to: The species page link (for ex...

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want Scrapy to crawl pages where the link to the next page looks like this: Next. Will Scrapy be able to interpret the JavaScript code of that? With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this: encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n ...

HTML Agility Pack Screen Scraping XPath isn't returning data

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath that I'm seeing in Chrome DevTools and in Firebug on Firefox, and what my C# program is seeing. ...

screen-scraping

htmlagilitypack

Is there a jQuery webscraper out there?

I'm trying to pull out some info from an external site using jQuery and Adobe AIR. Right now I'm using a hidden div and jQuery's load function to load fragments of the external site; once the info is loaded I parse some info with selectors. This is fine but it's kind of dirty and I need to perform this several times (don't want to need many...

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex; I'm using PHP's explode() function to find particular HTML tags instead. If the structure of the website changes (CSS, HTML), the scraper may collect the wrong data. So the question is - how do I know if the HTML struct...

screen-scraping
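For the question above about detecting when a scraped site has changed, one option (a sketch under stated assumptions, not a method taken from the question) is to hash the page's tag skeleton and compare it with the hash saved from the last good run. The sketch is in Python rather than the asker's PHP, and `structure_fingerprint` is a hypothetical helper name. It assumes that "structure" means the sequence of opening tags and their attribute names, ignoring text and attribute values.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records each opening tag with its sorted attribute names, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attr_names = sorted(name for name, _ in attrs)
        self.tags.append(tag + "[" + ",".join(attr_names) + "]")


def structure_fingerprint(html):
    """Return a SHA-256 hex digest of the page's tag skeleton."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    skeleton = "/".join(collector.tags)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

If the fingerprint differs from the stored one, the scraper can stop and flag the page for review instead of collecting possibly wrong data. Pages whose markup legitimately varies between requests (for example, a variable number of table rows) would need a coarser skeleton than this one.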

Webscraping Google tasks via Google Calendar

As Gmail and the Tasks API are not available everywhere (e.g. some companies block Gmail but not Calendar), is there a way to scrape Google Tasks through the Calendar web interface? I wrote a userscript like the one below, but I find it too brittle: // List of div to hide idlist = [ 'gbar', 'logo-container', ... ]; // Hiding b...

google-calendar

Yahoo Web Scrapes: What are the limits?

We are using a web scraper that sleeps for a random interval between requests (so that the time between scrapes isn't always the same), but we are still getting blocked by Yahoo after 20-30 requests. Does anyone know if there is a limit (e.g. 20 requests per minute, 200 an hour)? Right now our average between ...

visual-studio-2008

screen-scraping
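The randomized sleep described in the Yahoo question above can be sketched as follows. This is only an illustration of the technique: the function name `fetch_all`, its parameters, and the default delays are hypothetical, and the sketch says nothing about what limits Yahoo actually enforces.

```python
import random
import time


def fetch_all(urls, fetch, base_seconds=5.0, jitter_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomized interval between calls.

    The pause is base_seconds plus a uniform random amount in
    [0, jitter_seconds], so requests are not evenly spaced.
    `fetch` and `sleep` are injected so the loop can be tested without
    network access or real waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the first request
            sleep(base_seconds + random.uniform(0, jitter_seconds))
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter makes it easy to log the actual delays used, which helps when comparing the request rate against the point at which blocking starts.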

Source for Names to use in web scraping

Can anyone suggest a good source of names that I can use to help analyze some tables on web pages? The first column of the tables I am scraping has names alone, names and titles, or just titles. The names can be as varied as John Smith to Vikram Saksena. I have been poking around for a compiled list of words that can be found in proper...
