Hi, I'm trying to develop a C# program to scrape the URLs of Flash movies on a website. This is the code I'm trying to parse:
flashvars="file=http://cache01-videos02.myspacecdn.com/24/vid_878ccd5444874681845df39eb3f00628.flv"/>
The closest I got using regex was this expression:
file=http://[^/]+/(.*)flv
However it outputs with the fil...
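One way to keep the "file=" prefix out of the result is to match it but put only the URL inside a capture group. The sketch below is in Python rather than C#; the same pattern should carry over to .NET, where the group is read through `Match.Groups[1].Value`. The names `FLV_URL` and `find_flv_urls` are made up for illustration, and the pattern assumes the URL ends at the first ".flv" before a quote or "&".

```python
import re

# Sample markup copied from the question above.
SAMPLE = 'flashvars="file=http://cache01-videos02.myspacecdn.com/24/vid_878ccd5444874681845df39eb3f00628.flv"/>'

# "file=" is matched but sits outside the parentheses, so the captured
# group holds only the URL. [^"&]+? stops at the closing quote or at the
# next flashvars field.
FLV_URL = re.compile(r'file=(http://[^"&]+?\.flv)')


def find_flv_urls(html):
    """Return every .flv URL that follows a 'file=' key in the markup."""
    # With exactly one group, findall returns the group text, not the whole match.
    return FLV_URL.findall(html)


print(find_flv_urls(SAMPLE))
```

A regex is fine for a single known attribute like this one; for anything that depends on the page structure, an HTML parser is usually less fragile.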
I need some information from a website that's not mine. To get it I have to log in to the website, which happens through an HTML form. How can I do this authenticated screen scraping in C#?
Extra information:
Cookie based authentication.
POST action needed.
...
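The general shape of a cookie-based login is: POST the form fields once, keep the cookies the server sends back, and reuse them on later requests. Below is a rough sketch in Python using only the standard library; in C# the same idea maps onto `HttpWebRequest` with a shared `CookieContainer`. The function names, URLs, and form field names are all hypothetical, and a real site may also expect hidden fields (for example a CSRF token) copied from the login page.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build an opener that stores cookies and sends them back automatically."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Encode the form fields as a POST body, as a browser does on submit."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_protected_page(login_url, fields, target_url):
    """Log in once, then fetch a page that requires the session cookie."""
    opener, _jar = make_session()
    opener.open(build_login_request(login_url, fields))  # server sets the cookie here
    with opener.open(target_url) as response:            # cookie is sent back here
        return response.read()
```

Usage would look something like `fetch_protected_page("https://example.com/login", {"username": "alice", "password": "secret"}, "https://example.com/members")`, with the real action URL and field names read from the form's HTML.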
Every week I access server logs processed by WebTrends (for about 7 profiles) and copy ad clickthrough and visitor information into Excel spreadsheets. A lot of it is just accessing certain sections and finding the right title and then copying the unique visitor information.
I tried using WebTrends' built-in query tool but that is real...
I'm trying to figure out the best way to do caching for a website I'm building. It relies heavily on screen scraping the Wikipedia website. Here is the process I currently follow:
User requests a topic from Wikipedia via my site (i.e. http://www.wikipedia.org/wiki/Kevin_Bacon would be http://www.wikipediamaze.com/wiki?topic?=Kevi...
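Since the rest of the process is cut off here, the following is only a generic sketch of one common approach: keep each scraped page in memory with a timestamp and scrape it again only once it is older than some limit. The class name `PageCache`, the one-hour default, and the `fetch` callable (whatever function actually downloads and parses the topic) are assumptions, not part of the question.

```python
import time


class PageCache:
    """In-memory cache that re-fetches a topic only after its entry expires."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.time):
        self._fetch = fetch        # callable: topic -> page content
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # topic -> (stored_at, content)

    def get(self, topic):
        now = self._clock()
        entry = self._entries.get(topic)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                  # fresh hit: no request goes out
        content = self._fetch(topic)         # miss or stale: scrape again
        self._entries[topic] = (now, content)
        return content
```

For example, `PageCache(scrape_topic).get("Kevin_Bacon")` would hit the remote site on the first call and serve the stored copy for the next hour, where `scrape_topic` stands in for the existing scraping function. A shared store (database, memcached, files on disk) would follow the same get-or-fetch pattern if the cache has to survive restarts.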
I was wondering if it was legal, in the UK, to do this.
Basically there are hundreds of websites that just display contact details of businesses, like online directories.
If I were to scrape these kinds of pages for the details to put on a different directory site, would I be committing a crime?
I was thinking of using HtmlAgilityPack ...
I want to create a script that takes information from a website that is built in Flash.
I was about to start coding an application doing something like:
Move the mouse to position x,y.
Do a mouse click.
Wait x msec.
Get the data.
My question is: is there a better way to do this? Is there a library for it?
Thanks for reading!
...
I'm going to build a visual web scraper. The most important feature the software requires is being "visual", like http://mozenda.com/.
The software should provide a browser-like tool that not only lets the user browse a webpage and perform tasks such as authenticating, clicking links, and searching, but can also track all of these tasks.
Does anyone know the ...
I often get the following messages, for instance when copying dev files to a master branch:
cp: /Users/Masi/gitHub/shells/zsh/dvorak: No such file or directory
cp: /Users/Masi/gitHub/shells/zsh/dvorak2: No such file or directory
I would like to be asked about the creation of the given folders such that my initial command will be run if ...
Hello everybody,
I need some advice for a project I am about to begin.
In a few words, my application has to go to a certain soccer website, download the HTML and extract the necessary data.
This is what I have done so far:
:: 1) Go to a certain soccer website (ex. http://www.livescore.com/default.dll?page=england) and download the HTML ...
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a new...
Do they use partnerships and APIs, scrape the data, or use public APIs from all the job sites? I'm especially interested in how they obtain data from other job sites like monster.com and hotjobs.
I'm implementing a program to do something similar; all ideas welcome.
...
Over at SpokenWord.org we’re trying to figure out how to scrape YouTube pages (or pages with embedded YouTube players), then hack a video or ShockWave URL that we can include in the <enclosure> element of RSS feeds. We’ve been able to do this for programs in YouTube EDU such as this page, which we convert to this media-file URL. The latt...
Hello,
I would like to generate a list of URLs for a domain but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data?
One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However to get all the records I woul...
I am using Scrapy to crawl different sites; for each site I have an Item (different information is extracted).
For example, I have a generic pipeline (most of the information is the same), but now I am crawling some Google search responses and the pipeline must be different.
For example:
GenericItem uses GenericPipeline
but the GoogleItem...
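Scrapy passes every item through every enabled pipeline, so one common way to get per-item behaviour is to have each pipeline check the item's type in `process_item` and hand anything else through untouched. The sketch below assumes that is what is wanted here. `GenericItem`, `GoogleItem` and `GenericPipeline` are the names from the question; `GooglePipeline`, the fields, and the processing steps are hypothetical. Plain dataclasses are used so the example runs without Scrapy installed (recent Scrapy versions accept dataclass items).

```python
from dataclasses import dataclass


@dataclass
class GenericItem:
    title: str = ""          # hypothetical field


@dataclass
class GoogleItem:
    query: str = ""          # hypothetical field
    rank: int = 0            # hypothetical field


class GenericPipeline:
    def process_item(self, item, spider):
        if not isinstance(item, GenericItem):
            return item                      # not ours: pass it along unchanged
        item.title = item.title.strip()      # placeholder processing step
        return item


class GooglePipeline:
    def process_item(self, item, spider):
        if not isinstance(item, GoogleItem):
            return item                      # not ours: pass it along unchanged
        item.query = item.query.lower()      # placeholder processing step
        return item
```

Both classes would then be listed in the project's `ITEM_PIPELINES` setting; each one only acts on the item type it recognises.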
The problem is to screen-scrape the latitudes/longitudes of entities (restaurant names, etc.) from wikimapia.org AND restrict the results based on latitude/longitude.
Here is what I tried:
Installed the Live HTTP Headers add-on in Firefox.
Filled in the form on the main page of wikimapia.org with "pizza corner".
Saw that the main site woul...
I found the HTML Agility Pack useful and easy to use for screen scraping web sites. What's the equivalent library for HTML screen scraping in Java, Ruby, Python?
...
So I have a client who wants to spider through sites that he is a member of and collect participating members' email addresses. Is there commercial software that does that, or am I better off writing a screen scraping script? This is all assuming that this is permitted at the sites in question, of course.
...
I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites (in a specific business category) and extracts the content and URLs of products and services. Other types of pages may be i...
First off, the HTML row looks like this:
<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>
I would show the real HTML, but I am sorry to say I don't know how to put it in a code block. (Feels shame.)
Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files i...
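As a rough illustration of pulling the text out of rows like the one above, here is a sketch that uses Python's built-in `html.parser` instead of BeautifulSoup, so it needs nothing installed. The names `EvenRowText` and `even_row_text` are made up, and since the real markup and the desired output format are not shown, it simply joins whatever text sits inside each `<tr class="evenColor">`.

```python
from html.parser import HTMLParser


class EvenRowText(HTMLParser):
    """Collect the text inside each <tr class="evenColor"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one joined string per matching row
        self._parts = None   # text pieces of the row being read, else None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "evenColor" in classes:
            self._parts = []                 # start collecting text

    def handle_endtag(self, tag):
        if tag == "tr" and self._parts is not None:
            self.rows.append(" ".join(self._parts))
            self._parts = None               # stop collecting text

    def handle_data(self, data):
        if self._parts is not None and data.strip():
            self._parts.append(data.strip())


def even_row_text(html):
    """Return the text of every evenColor row found in the given HTML."""
    parser = EvenRowText()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To cover a folder of .htm files, each file's contents can be read and passed to `even_row_text` in a loop. With BeautifulSoup installed, the equivalent lookup would be along the lines of `soup.find_all("tr", class_="evenColor")`.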
Given:
Url - http://www.contoso.com/search.php?q={param} returns:
-html-
--body-
{...}
---div id='foo'-
----div id='page1'/-
----div id='page2'/-
----div id='page3'/-
----div id='pageN'/-
---/div-
{...}
--/body-
-/html-
Wanted:
The innerHTML of div id='foo' must be fetched by the client (i.e. JavaScript).
It will be split into di...