screen-scraping

XPath: "Exclude" tag in "InnerHtml" (<a href="">InnerHtml<span>excludeme</span></a>)

Hi, I am using XPath to query HTML sites, which has worked pretty well so far, but now I've hit a (brick) wall and can't find a solution :-) The HTML looks like this: <ul> <li><a href="">Text1<span>AnotherText1</span></a></li> <li><a href="">Text2<span>AnotherText2</span></a></li> <li><a href="">Text3<span>AnotherText3</span></a></li> </ul> ...
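A minimal sketch of one way to read only the link's own text and skip the nested <span>, written in Python with the standard library rather than the asker's (unstated) toolchain. In a full XPath 1.0 engine the equivalent query would be `//ul/li/a/text()`; ElementTree supports only a subset of XPath, so the sketch reads each element's `.text` attribute instead, which holds the text before the first child element. The helper name `own_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question. It happens to be well-formed XML, so the
# standard-library parser can read it (real-world HTML often is not).
html = """
<ul>
  <li><a href="">Text1<span>AnotherText1</span></a></li>
  <li><a href="">Text2<span>AnotherText2</span></a></li>
  <li><a href="">Text3<span>AnotherText3</span></a></li>
</ul>
"""

def own_text(markup):
    """Return, for every <a>, the text that precedes its first child element."""
    root = ET.fromstring(markup)
    # a.text excludes anything inside <span>; it is None when the link
    # starts with a child element, hence the `or ""`.
    return [(a.text or "").strip() for a in root.iter("a")]

link_texts = own_text(html)
print(link_texts)
```

Text that follows the <span> inside the same link would live in the span's `.tail`, which this sketch deliberately ignores.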

Google's robots.txt: Is scraping your positions = ignoring it?

I have found a post http://stackoverflow.com/questions/999056/ethics-of-robots-txt/999088#999088 discussing robots.txt on web sites. Generally, I agree with the principles. However, there are commercial tools that check Google positions by, very likely, scraping Google for results, due to the lack of an API (in case someone doesn't ...

How to prevent someone from hacking API feed?

I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I...

How to grab data on website?

I often check my accounts for different numbers. For example, I check my affiliate accounts for cash increases. I want to write a script that can log in to all these websites, grab the money value for me, and display it on one page. How can I program this? ...

C# WebClient - View source question

Hello, I'm using a C# WebClient to post login details to a page and read all the results. The page I am trying to load includes Flash (which, in the browser, translates into HTML). I'm guessing it's Flash to avoid being picked up by search engines? The Flash I am interested in is just text (not an image, video, etc.) and when I "V...

Prototype js get element with certain value

Hello, I am scraping some data and I want to get the value of an element after a specific tag with a given value. It's a bold tag with the value 'Types:'. <b>Types:</b> Once I get to that element I can use Prototype's Element.next() to get the data I want. How exactly do I do this? I have been fiddling with $$ but can't seem to get it righ...

Mozilla Parser for screen scraping

I'm writing an app that takes in the HTML code of a page, extracts certain elements (such as tables) from the page, and returns the HTML code for those elements. I'm attempting to do this in Java using the Mozilla parser to simplify navigation through the page, but I'm having trouble extracting the HTML code needed. Maybe my whole appr...

How to download any(!) webpage with correct charset in python?

Problem: When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up. People usually use some rudimentary technique to detect the encoding. They either use the charset from the header or the charset defined in the meta tag or t...
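As an illustration of the two rudimentary techniques the question names (charset from the header, then charset from the meta tag), here is a minimal sketch using only the standard library. The function names `detect_charset` and `decode_page`, the 2048-byte scan window, and the UTF-8 fallback are assumptions made for the example; it is not a complete answer to detecting the encoding of any page.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def detect_charset(content_type, body, default="utf-8"):
    """Guess the encoding: header charset first, then a <meta> tag, then default."""
    if content_type:
        match = re.search(r'charset=["\']?([\w-]+)', content_type, re.IGNORECASE)
        if match:
            return match.group(1).lower()
    # Assumption: the declaration, if present, sits near the top of the page.
    match = META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

def decode_page(content_type, body):
    """Decode raw bytes, falling back to lossy UTF-8 if the guess is unusable."""
    encoding = detect_charset(content_type, body)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return body.decode("utf-8", errors="replace")
```

Here `content_type` is the value of the response's Content-Type header (or None) and `body` is the undecoded response bytes. Pages that declare no charset, or declare the wrong one, still need a statistical detector or manual handling.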

Screen scraping C application without using OCR or DOM?

We have a legacy system that is essentially a glorified telnet interface. We cannot use an alternative telnet client program to connect to the system, since there are special features built into the client software they have provided us. I want to be able to screen scrape from this program; however, that's proving very difficult. I have ...

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far. Firefox and Chrome have no trouble displaying the pages, altho...

Trouble with scraping

I'm trying to scrape some pages from a domain, using a list in a text file, and save them onto my server. I have the following code (with the domain obscured), which reads the file directories from the text file list and then copies the file names, but with .html appended. For some reason, it's creating the files without actually success...

HTML Parsing/Scraping Algorithm Help..Java

I am writing an HTML scraper that grabs the HTML from a page and returns it. I want to grab the words that are in all capital letters and store them in a database. My problem right now is that I cannot write the algorithm to parse each line of the HTML I got back in order to store the words. This is...
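The extraction step described above could be sketched as follows. This is written in Python rather than the asker's Java, purely to illustrate the idea: collect the text between tags, then pick out the all-capital words with a regular expression. The names `TextCollector` and `uppercase_words`, and the choice to require at least two letters, are assumptions for the example; storing the result in a database is left out.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text between tags; tag names and attributes are skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def uppercase_words(html):
    """Return words made only of capital letters (two or more) from the page text."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # \b keeps the match from starting or ending inside a longer word.
    return re.findall(r"\b[A-Z]{2,}\b", text)

words = uppercase_words("<p>NASA and the FBI met Bob at IBM.</p>")
print(words)
```

Note that the parser also reports the contents of <script> and <style> elements as text, so a real scraper would probably want to filter those out first.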

Scrubyt fetch metadata

How do I fetch the contents of meta name="description" content="....." with Scrubyt? require 'rubygems' require 'scrubyt' data = Scrubyt::Extractor.define do fetch 'http://www.allegro.pl/' head '//head' do description '//meta[@name="description"]' end end puts data.to_xml What is the correct way? ...

Better way to handle screen scrape object

In my applications I always end up implementing a Model-View-Presenter pattern and usually end up scraping my View object from the screen with a get property. For example Person IBasicRegistration.Person { get { if (ViewState["View.Person"] == null) ViewState["View.Person"] = new Person(); var Person = (Person) ViewState["Vi...

PHP Screen Scraping and Sessions

OK, I'm still new to the screen scraping thing. I've managed to log into the site I need, but now how do I redirect to another page? After I log in, I'm trying to do another GET request on the page that I need, but it has a redirect on it that takes me back to the login page. So I'm thinking the SESSION variables are not being passed, how can ...

How to capture screenshot of specified website?

I want to know a technique for capturing screenshots of websites from a list of URLs, as Google Fast Flip does. What technology or techniques are required for this kind of task? If this technique is available in Rails, that would be great. Thanks ...

How to scrape the contents of an axd resource?

Essentially I have an img tag with a src attribute of /ChartImg.axd?i=chart_0_0.png&g=06469eea67ea452b977f8e73cad70691. Do I need to create another WebRequest to get the content of this resource or is there a simpler way? I am scraping the output of the current request. Below is what I've got so far... Essentially my additionaAssets ...

How to use the WebClient.DownloadDataAsync() method in this context?

My plan is to have a user type a movie title into my program, and my program will pull the appropriate information asynchronously so the UI doesn't freeze up. Here's the code: public class IMDB { WebClient WebClientX = new WebClient(); byte[] Buffer = null; public string[] SearchForMovie(string SearchPar...

Looking for OO gurus, need some help in the design of my programming logic. Nothing fancy, just new to it.

I'll post my entire class and maybe someone with MUCH more experience can help me design something better. I'm really new to doing things asynchronously, so I'm really lost here. Hopefully my design isn't TOO bad. :P IMDB Class: public class IMDB { WebClient WebClientX = new WebClient(); byte[] Buffer = null; public string...

.NET, scrape dynamic (Java App?) webpage for information?

I am attempting to get some information from a website; the info that I need is located on the missouri.edu site (so it's publicly available). Here is the process that I need to accomplish: - Navigate to https://webapps.missouri.edu/ODDSearchEngine/oddsearch - Search for a department name like "business" - Click any of the department nam...