What is the best solution to programmatically take a snapshot of a webpage?
The situation is this: I would like to crawl a bunch of webpages and take thumbnail snapshots of them periodically, say once every few months, without having to manually go to each one. I would also like to be able to take jpg/png snapshots of websites that mi...
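One common approach is to render each page in a real browser engine and scale the screenshot down. A minimal sketch, assuming Selenium with a local Firefox driver and Pillow for the thumbnail; the URL and file names are placeholders:

from selenium import webdriver
from PIL import Image

# Sketch: render a page in a real browser engine and shrink the
# screenshot to a thumbnail. Assumes Selenium and Pillow are installed
# and a Firefox driver is available on PATH.
driver = webdriver.Firefox()
driver.set_window_size(1280, 1024)
driver.get("http://example.com")          # placeholder target URL
driver.save_screenshot("page.png")
driver.quit()

thumb = Image.open("page.png")
thumb.thumbnail((200, 150))               # preserves aspect ratio
thumb.save("page_thumb.png")

Run the same loop from cron (or any scheduler) to get the periodic snapshots.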
I'm trying to crawl about a thousand web sites; I'm interested in the HTML content only.
Then I transform the HTML into XML to be parsed with XPath to extract the specific content I'm interested in.
I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory, and stability problems (Herit...
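For the extraction step on its own, a lightweight sketch (assuming Python with requests and lxml, whose HTML parser tolerates real-world, non-well-formed markup; the XPath expression is a placeholder):

import requests
from lxml import html

# Sketch: fetch a page and run XPath directly over the parsed HTML
# tree, skipping a separate HTML-to-XML conversion step.
doc = html.fromstring(requests.get("http://example.com").text)
for title in doc.xpath("//h1/text()"):    # placeholder expression
    print(title)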
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to ...
I want to extract all the links from a page. I am using HTML::LinkExtor. How do I extract only the links that point to HTML content pages?
I also cannot extract these kinds of links:
javascript:openpopup('http://www.admissions.college.harvard.edu/financial_aid/index.html'),
EDIT: HTML Pages - text/html. I am not indexing pictures ...
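One way to keep only links that actually resolve to text/html is to issue a HEAD request and check the Content-Type. A sketch assuming Python with requests and BeautifulSoup rather than HTML::LinkExtor; the openpopup regex matches the javascript: pattern quoted above:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "http://example.com"           # placeholder start page
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Pull the real URL out of javascript:openpopup('...') handlers.
    m = re.search(r"openpopup\('([^']+)'\)", href)
    if m:
        href = m.group(1)
    if not href.startswith("javascript:"):
        links.add(urljoin(page_url, href))

# Keep only links whose server reports an HTML content type.
html_links = [u for u in links
              if requests.head(u, allow_redirects=True)
                         .headers.get("Content-Type", "")
                         .startswith("text/html")]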
I am writing a crawler. Once the crawler logs into a website, I want it to stay logged in permanently. How can I do that? Can a client (a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.
...
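The server, not the client, decides how long a session lives, so a crawler cannot force a server to keep it logged in. The usual workaround is to persist the session cookies between runs and re-authenticate only when they expire. A sketch assuming Python requests; the login endpoint and form field names are placeholders:

import pickle
import requests

session = requests.Session()
# Placeholder login endpoint and field names; adjust to the real form.
session.post("http://example.com/login",
             data={"username": "me", "password": "secret"})

# Save the cookie jar so the next run reuses this login instead of
# consuming one of the limited daily logins.
with open("cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)

# Later run: restore the cookies and carry on crawling.
session = requests.Session()
with open("cookies.pkl", "rb") as f:
    session.cookies.update(pickle.load(f))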
Hello, I'm writing a spider in Python to crawl a site. The trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed.
What I need to do is examine the pages for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort thro...
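Most of the time in such a spider is spent waiting on the network, so concurrent fetching is the usual first optimization. A sketch assuming Python's standard-library ThreadPoolExecutor with requests; the URL list and target number are placeholders:

import concurrent.futures
import requests

urls = [...]                # placeholder: the 2.5M-page URL list
NEEDLE = "12345"            # placeholder: the number being searched for

def check(url):
    # Fetch one page and report the URL if it contains the number.
    try:
        text = requests.get(url, timeout=10).text
    except requests.RequestException:
        return None
    return url if NEEDLE in text else None

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for hit in pool.map(check, urls):
        if hit:
            print(hit)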
What are the best practice and library I can use to type into the search textbox on an external website and collect the search results?
How do I tackle websites with different search boxes and checkboxes and collect the results?
Can Selenium be used to automate this?
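Selenium can automate this, since it drives a real browser and so handles whatever form a site presents. A minimal sketch, assuming Selenium 4 and a hypothetical search field named q:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://example.com")               # placeholder site
box = driver.find_element(By.NAME, "q")        # hypothetical field name
box.send_keys("search term")
box.send_keys(Keys.RETURN)                     # submit the form
results = driver.page_source                   # then parse as usual
driver.quit()

Sites with different search boxes or checkboxes reduce to locating the right elements per site, e.g. driver.find_element(By.ID, ...).click() for a checkbox.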
Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...
Hi all,
I'm a graduate student whose research is on complex networks. I am working on a project that involves analyzing connections between Facebook users. Is it possible to write a crawler for Facebook based on friendship information?
I looked around but couldn't find anything useful so far. It seems Facebook isn't fond of such activit...
Hello guys,
How is it possible to generate a list of all the pages of a given website programmatically using PHP?
What I'm basically trying to achieve is to generate something like a sitemap: a nested unordered list with links to all the pages contained in a website.
Thank you in advance for your answers,
Constantin TOVISI
...
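The usual shape here is a breadth-first crawl that stays on one host and deduplicates what it finds. A sketch below, in Python (requests + BeautifulSoup) for brevity; the same steps translate directly to PHP with cURL and DOMDocument:

from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def site_pages(start_url, limit=500):
    # Breadth-first crawl: fetch a page, queue every same-host link
    # not seen before, and stop at a safety limit.
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text,
                                 "html.parser")
        except requests.RequestException:
            continue
        for a in soup.find_all("a", href=True):
            target = urljoin(url, a["href"]).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)

Grouping the returned URLs by path segment then gives the nested unordered list.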
Hi, as an exercise in RSS, I would like to be able to search through pretty much all the Unix discussions in this group:
comp.unix.shell
I know enough Python and understand basic RSS, but I am stuck on ... how do I grab all messages between particular dates, or at least all messages between the Nth most recent and the Mth most recent?
High level description...
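Assuming the group is exposed as an RSS or Atom feed, Python's feedparser can filter entries by date. A sketch; the feed URL and date range are placeholders:

import time
import feedparser

FEED_URL = "http://example.com/comp.unix.shell/feed"  # placeholder feed URL

start = time.strptime("2009-01-01", "%Y-%m-%d")
end   = time.strptime("2009-06-30", "%Y-%m-%d")

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    published = entry.get("published_parsed")   # a time.struct_time
    if published and start <= published <= end:
        print(entry.title, entry.link)

For the Nth-to-Mth most recent, slice feed.entries[N:M] instead. Note that a feed only exposes however many items the server chooses to include, so deep history may require paging through an archive.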
I am writing a website crawler in PHP, and I already have code that can extract all links from a site.
A problem: sites use a combination of absolute and relative URLs.
Examples (http replaced with hxxp as I can't post hyperlinks):
hxxp://site.com/
site.com
site.com/index.php
hxxp://site.com/hello/index.php
/hello/index.php
hxxp://...
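The resolution rules are standardized (RFC 3986), so the simplest fix is to resolve every extracted link against the URL of the page it came from. Python's urljoin demonstrates the rules; PHP has no built-in equivalent, but the same logic applies:

from urllib.parse import urljoin

base = "http://site.com/hello/index.php"   # the page the link came from

print(urljoin(base, "/hello/index.php"))   # -> http://site.com/hello/index.php
print(urljoin(base, "other.php"))          # -> http://site.com/hello/other.php
print(urljoin(base, "http://site.com/"))   # absolute URLs pass through

# Caveat: a bare "site.com" has no scheme, so it is treated as a
# relative path (-> http://site.com/hello/site.com); such links need
# a separate heuristic or should be treated as invalid.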
I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, JavaScript, etc. For some sites, the links are generated in PHP, and when my web crawler tries to navigate to these, it fails. One examp...
I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.
Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
...
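The general shape, sketched in Python (the same steps map to PHP's file_get_contents or cURL); the page URL is a placeholder. The main gotchas: resolve relative URLs, expect dead links, and throttle your requests.

import os
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page = "http://example.com/links.html"     # placeholder page of links
soup = BeautifulSoup(requests.get(page).text, "html.parser")
os.makedirs("dump", exist_ok=True)

for i, a in enumerate(soup.find_all("a", href=True)):
    url = urljoin(page, a["href"])         # gotcha: links may be relative
    try:
        body = requests.get(url, timeout=10).content
    except requests.RequestException:
        continue                           # gotcha: dead links are common
    with open(os.path.join("dump", "%04d.html" % i), "wb") as f:
        f.write(body)
    time.sleep(1)                          # be polite to the target server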
The main page of my site is /home.php.
This page has pagination, with anchor tags that link to many other query variants of the same page,
for example
/home.php?start=4
/home.php?start=8
and so on...
My question is: when I include the home.php page in a sitemap, will crawlers crawl whatever pages home.php links to (e.g. /home.php?start=4)? Or d...
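Crawlers do generally follow links on pages they fetch, but a sitemap only guarantees discovery of the URLs it explicitly lists, so the safer option is to list the paginated URLs as well. A minimal example in the standard sitemaps.org format; example.com stands in for the real domain:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/home.php</loc></url>
  <url><loc>http://example.com/home.php?start=4</loc></url>
  <url><loc>http://example.com/home.php?start=8</loc></url>
</urlset>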
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Crude regex-based link extraction; capture the URL inside href="...".
Regex hrefs = new Regex("<a\\s+href\\s*=\\s*\"(http[^\"]*)\"", RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source)) {
    string url = m.Groups[1].Value;    // the captured absolute URL
    sb.AppendLine(url);
    Console.WriteLine(url);
}
I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:
Run forever
Meaning it will periodically re-visit some portal pages to get updates.
Schedule priorities
Give different priorities to different types of URLs.
Multi-threaded fetching
I've read the Scrapy docum...
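A minimal sketch of the priority part, assuming Scrapy's built-in Request priority (higher numbers are scheduled first); the URL patterns are placeholders. A single Scrapy crawl terminates when its queue empties, so "run forever" is usually handled by re-scheduling the crawl (e.g. from cron) or feeding start URLs from an external queue; fetching is already concurrent via Twisted (see the CONCURRENT_REQUESTS setting) rather than multi-threaded.

import scrapy

class NewsImageSpider(scrapy.Spider):
    name = "news_images"
    start_urls = ["http://example.com/news"]     # placeholder portal

    def parse(self, response):
        # Article pages get a higher scheduling priority than index pages.
        for href in response.css("a::attr(href)").getall():
            priority = 10 if "/article/" in href else 0
            yield response.follow(href, callback=self.parse,
                                  priority=priority)
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}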
I just want to let Google, Bing, and Yahoo crawl my website to build their indexes. But I do not want rival websites to use a crawling service to steal my website's content. What should I do?
...
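The standard (though purely advisory) tool is robots.txt: allow the named search bots and disallow everyone else. An empty Disallow means "allow everything":

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /

robots.txt only stops well-behaved bots. Scrapers that ignore it have to be handled server-side, e.g. with rate limiting, and by verifying that a client claiming to be Googlebot really is one via reverse DNS lookup.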
I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation. I have to go through all the sections and get the people from each section.
How should I go about it? Should I use a crawler to get the pages and search through them using BeautifulSoup?
Or is there any other alternative to get t...
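There is an alternative: the MediaWiki API can list category members directly, which avoids HTML parsing entirely. A sketch with Python requests:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:People_by_occupation",
    "cmlimit": "500",
    "format": "json",
}
data = requests.get(API, params=params).json()
for member in data["query"]["categorymembers"]:
    print(member["title"])
# Page through with data["continue"]["cmcontinue"], and recurse into
# subcategories (entries with ns == 14) to reach the people listed
# under each occupation.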
Is there a way to crawl all Facebook fan pages and collect some information? For example, crawling the fan pages and saving their names, how many fans they have, etc.?
Or at least, do you have a hint of how this could possibly be done?
...
I want to crawl useful resources (like background pictures ...) from certain websites. It is not a hard job, especially with the help of some wonderful projects like Scrapy.
The problem is that I don't just want to crawl this site ONE TIME. I also want to keep my crawl long-running and pick up updated resources. So I want to know: is ther...
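Scrapy itself supports persistent crawl state via the JOBDIR setting (scrapy crawl myspider -s JOBDIR=crawls/run-1), which lets a stopped crawl resume where it left off. For periodic re-crawls that skip already-downloaded resources, a small sketch of keeping your own seen-set on disk; the file name and shape are assumptions:

import json
import os

SEEN_FILE = "seen_urls.json"       # hypothetical state file

def load_seen():
    # Restore the set of already-crawled resource URLs, if any.
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def save_seen(seen):
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)

# In the spider: skip any resource URL already in load_seen(), add new
# ones as they are fetched, and call save_seen() when the run finishes;
# then re-run the crawl periodically from cron.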