screen-scraping

Regex flash url

Hi, I'm trying to develop a C# program to scrape the URLs of Flash movies on a website. This is the markup I'm trying to parse: flashvars="file=http://cache01-videos02.myspacecdn.com/24/vid_878ccd5444874681845df39eb3f00628.flv"/> The closest I got using regex was this expression: file=http://[^/]+/(.*)flv However it outputs with the fil...
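The excerpt is cut off, so the exact complaint is not visible. Assuming the goal is the bare URL without the `file=` prefix, one possible approach is to put only the URL inside a capture group. The sketch below uses Python for illustration rather than the asker's C#; the function name `extract_flv_urls` is made up for this example.

```python
import re

# Sample markup copied from the question above.
SAMPLE = ('flashvars="file=http://cache01-videos02.myspacecdn.com/24/'
          'vid_878ccd5444874681845df39eb3f00628.flv"/>')

# "file=" stays outside the group, so it is matched but not captured.
# [^"&]+? stops at the closing quote or at the next flashvars parameter.
FLV_URL = re.compile(r'file=(http://[^"&]+?\.flv)')


def extract_flv_urls(html):
    """Return every .flv URL found in flashvars-style markup."""
    return FLV_URL.findall(html)


print(extract_flv_urls(SAMPLE))
```

With a single capture group, `findall` returns just the captured URLs. The same grouping idea carries over to other regex engines that support capture groups.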

How to programmatically log in to a website to screen-scrape?

I need some information from a website that's not mine. To get it, I need to log in to the website, which happens through an HTML form. How can I do this authenticated screen scraping in C#? Extra information: Cookie-based authentication. POST action needed. ...
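A rough sketch of the general pattern (POST the form fields, keep the session cookie, reuse it on later requests), written in Python for illustration rather than C#. The URL and the field names `username` and `password` are hypothetical; a real form's field names have to be read from its HTML.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Encode the form fields; supplying data makes urllib issue a POST."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical URL and field names, for illustration only.
request = build_login_request(
    "http://example.com/login",
    {"username": "alice", "password": "secret"},
)
opener, jar = make_session()
# opener.open(request) would submit the form. Any cookie the server sets is
# kept in `jar`, so later opener.open(...) calls are sent as the logged-in user.
```

Nothing is sent over the network until `opener.open(request)` is called, which keeps the sketch safe to run as-is.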

Automating WebTrends analysis

Every week I access server logs processed by WebTrends (for about 7 profiles) and copy ad clickthrough and visitor information into Excel spreadsheets. A lot of it is just accessing certain sections and finding the right title and then copying the unique visitor information. I tried using WebTrends' built-in query tool but that is real...

Finding the right caching and compression strategy for ASP.NET

I'm trying to figure out the best way to do caching for a website I'm building. It relies heavily on screen scraping the Wikipedia website. Here is the process I currently follow: User requests a topic from Wikipedia via my site (i.e. http://www.wikipedia.org/wiki/Kevin_Bacon would be http://www.wikipediamaze.com/wiki?topic?=Kevi...

Screen scraping business contact details, legal?

I was wondering whether it is legal, in the UK, to do this. Basically there are hundreds of websites that just display contact details of businesses, like online directories. If I were to scrape these kinds of pages for the details to put on a different directory site, would I be committing a crime? I was thinking of using HtmlAgilityPack ...

Scraping flash websites

I want to create a script that takes information from a website built in Flash. I was about to start coding an application that does something like: move the mouse to position x,y; do a mouse click; wait x msec; get data. My question is: Is there a better way to do this? Any lib? Thanks for reading! ...

What are the techniques for implementing a visual web scraper?

I'm going to build a visual web scraper. The most important feature the software requires is being "visual", like http://mozenda.com/. The software provides a browser-like tool that not only lets the user browse a webpage and perform tasks such as authenticating, clicking links, searching, ... but can also track all these tasks. Does anyone know the ...

How can I create directories during a copy in Zsh?

I often get the following messages, for instance when copying dev files to a master branch: cp: /Users/Masi/gitHub/shells/zsh/dvorak: No such file or directory cp: /Users/Masi/gitHub/shells/zsh/dvorak2: No such file or directory I would like to be asked about the creation of the given folders so that my initial command will be run if ...

Advice needed: Screen scraping a web page using .NET

Hello everybody, I need advice for a project I am about to begin. In a few words, my application has to go to a certain soccer website, download the HTML and extract the necessary data. This is what I have done so far: 1) Go to a certain soccer website (e.g. http://www.livescore.com/default.dll?page=england) and download the HTML ...

Programmatically detecting "most important content" on a page...

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body. How would you determine what information on a new...

How does Indeed.com gather results from multiple job sites?

Do they use partnerships and APIs, scrape the data, or use public APIs from all the job sites? I'm especially interested in how they obtain data from other job sites like monster.com and hotjobs. I'm implementing a program to do similar things; all ideas welcome. ...

Trying to Scrape YouTube Pages for ShockWave/Flash URLs

Over at SpokenWord.org we’re trying to figure out how to scrape YouTube pages (or pages with embedded YouTube players), then hack a video or ShockWave URL that we can include in the <enclosure> element of RSS feeds. We’ve been able to do this for programs in YouTube EDU such as this page, which we convert to this media-file URL. The latt...

How to get list of URLs for a domain

Hello, I would like to generate a list of URLs for a domain, but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data? One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However, to get all the records I woul...

Python Scrapy: how to define a pipeline for an item?

I am using Scrapy to crawl different sites, and for each site I have an Item (different information is extracted). For example, I have a generic pipeline (most of the information is the same), but now I am crawling some Google search responses and the pipeline must be different. For example, GenericItem uses GenericPipeline but the GoogleItem...
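One common way to get per-item behaviour, sketched here under assumptions, is to let every pipeline see every item and have each pipeline pass through the types it does not handle. The classes below are plain-Python stand-ins that mirror the `process_item(item, spider)` method Scrapy item pipelines conventionally define; no Scrapy import is used, and `GooglePipeline`, `run_pipelines`, and the `handled_by` key are hypothetical names for this illustration.

```python
class GenericItem(dict):
    """Stand-in for the asker's GenericItem (a scrapy.Item in real code)."""


class GoogleItem(dict):
    """Stand-in for the asker's GoogleItem; deliberately not a GenericItem subclass."""


class GenericPipeline:
    def process_item(self, item, spider):
        # Pass through anything this pipeline is not responsible for.
        if not isinstance(item, GenericItem):
            return item
        item["handled_by"] = "generic"
        return item


class GooglePipeline:
    def process_item(self, item, spider):
        if not isinstance(item, GoogleItem):
            return item
        item["handled_by"] = "google"
        return item


def run_pipelines(item, pipelines, spider=None):
    """Imitate the framework calling each enabled pipeline in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


pipelines = [GenericPipeline(), GooglePipeline()]
print(run_pipelines(GoogleItem(title="result"), pipelines))
print(run_pipelines(GenericItem(title="page"), pipelines))
```

The type check at the top of each `process_item` is what routes an item to "its" pipeline; both pipelines would still be enabled in the project settings.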

Detecting the URL POST parameters to pass

The problem is to screen-scrape the latitudes/longitudes for entities (restaurant names, etc.) from wikimapia.org AND restrict the results based on the latitude/longitude. Here is how I tried: Installed the Live HTTP Headers add-on in Firefox. Filled in the form on the main page of wikimapia.org with "pizza corner". Saw that the main site woul...

HTML Agility Pack or HTML Screen Scraping libraries for Java, Ruby, Python?

I found the HTML Agility Pack useful and easy to use for screen scraping web sites. What's the equivalent library for HTML screen scraping in Java, Ruby, Python? ...

Spidering through ams for associate emails

So I have a client who wants to spider through sites that he is a member of and collect participating members' emails. Is there commercial software that does that, or am I better off writing a screen scraping script? This is all assuming that this is permitted at the sites in question, of course. ...

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess that for this I need a crawler that crawls several (a few hundred) sites (in a specific business category) and extracts content and URLs of products and services. Other types of pages may be i...

Parsing HTML rows into CSV

First off, the HTML row looks like this: <tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr> I would show the real HTML, but I am sorry to say I don't know how to block it. (feels shame) Using BeautifulSoup (Python) or any other recommended screen scraping/parsing method, I would like to output about 1200 .htm files i...
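As a rough illustration of turning such rows into CSV, the sketch below uses Python's standard-library `html.parser` instead of the BeautifulSoup the asker mentions, so that it runs without extra packages. The real markup is not shown in the question, so the sample input and the assumption that the row's text sits in `<td>` cells are guesses; `RowCollector` and `rows_to_csv` are names invented for this example.

```python
import csv
import io
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr> whose class matches row_class."""

    def __init__(self, row_class="evenColor"):
        super().__init__()
        self.row_class = row_class
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("class") == self.row_class:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by broken markup


def rows_to_csv(html):
    """Return the matching rows of an HTML string as CSV text."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Assumed sample input; the real row layout is not shown in the question.
SAMPLE = (
    '<table><tr class="oddColor"><td>skip</td></tr>'
    '<tr class="evenColor"><td>blahblah</td><td>TheTextIneed</td>'
    '<td>blahblah</td></tr></table>'
)
print(rows_to_csv(SAMPLE))
```

For many files, the same function could be called once per file and the results appended to a single CSV.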

Should I use Yahoo-Pipes to scrape the contents of a div?

Given: URL - http://www.contoso.com/search.php?q={param} returns: -html- --body- {...} ---div id='foo'- ----div id='page1'/- ----div id='page2'/- ----div id='page3'/- ----div id='pageN'/- ---/div- {...} --/body- -/html- Wanted: The innerHtml of div id='foo' must be fetched by the client (i.e. JavaScript). It will be split into di...