crawler

How to programmatically take snapshots of crawled webpages (in Ruby)?

What is the best solution to programmatically take a snapshot of a webpage? The situation is this: I would like to crawl a bunch of webpages and take thumbnail snapshots of them periodically, say once every few months, without having to manually go to each one. I would also like to be able to take jpg/png snapshots of websites that mi...
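
One way to approach this, sketched in Python rather than the Ruby the question asks for: drive a headless browser to render each page, screenshot it, and shrink the screenshot with an image library. The URL list and file names here are made up for illustration.

    # Minimal sketch: headless Chrome screenshots plus Pillow thumbnails.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from PIL import Image

    urls = ["http://example.com", "http://example.org"]  # hypothetical list

    opts = Options()
    opts.add_argument("--headless=new")    # render without a visible window
    driver = webdriver.Chrome(options=opts)
    driver.set_window_size(1280, 1024)

    for i, url in enumerate(urls):
        driver.get(url)
        shot = f"page_{i}.png"
        driver.save_screenshot(shot)
        img = Image.open(shot)
        img.thumbnail((200, 150))          # in-place, keeps aspect ratio
        img.save(f"thumb_{i}.png")

    driver.quit()

Run it from cron (or any scheduler) to get the every-few-months cadence without visiting each page by hand.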

Which web crawler for extracting and parsing data from about a thousand web sites?

I'm trying to crawl about a thousand web sites, from which I'm interested in the HTML content only. I then transform the HTML into XML to be parsed with XPath to extract the specific content I'm interested in. I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory and stability problems (Herit...
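
For the HTML-to-XPath step, one simplification worth noting: lxml can evaluate XPath directly on tolerantly parsed HTML, so the separate HTML-to-XML conversion can often be dropped. A minimal sketch, with a placeholder URL and XPath expression:

    import requests
    from lxml import html

    resp = requests.get("http://example.com/article", timeout=30)  # placeholder
    tree = html.fromstring(resp.content)     # forgiving parser for messy HTML
    for title in tree.xpath("//h1/text()"):  # placeholder expression
        print(title.strip())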

What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution as it requires a search to ...

How do I extract links, including JavaScript ones, that point to HTML pages in Perl?

I want to extract all the links from a page. I am using HTML::LinkExtor. How do I extract all the links that point to HTML content pages only? I also cannot extract these kinds of links: javascript:openpopup('http://www.admissions.college.harvard.edu/financial_aid/index.html'). EDIT: HTML pages - text/html. I am not indexing pictures ...
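
The question asks about Perl, but the idea is language-independent: unwrap the javascript:openpopup('...') pseudo-links with a regex, then keep only URLs whose server reports a text/html content type. A hedged Python sketch, with a hypothetical starting page:

    import re
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    base = "http://example.com/"   # hypothetical starting page
    soup = BeautifulSoup(requests.get(base, timeout=30).text, "html.parser")

    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Pull the real URL out of javascript:openpopup('http://...') links.
        m = re.search(r"javascript:\w+\('(https?://[^']+)'\)", href)
        if m:
            href = m.group(1)
        if href.startswith("javascript:"):
            continue                       # nothing recoverable
        url = urljoin(base, href)
        # A HEAD request reveals the content type without downloading the body.
        head = requests.head(url, allow_redirects=True, timeout=10)
        if head.headers.get("Content-Type", "").startswith("text/html"):
            print(url)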

Writing a crawler that stays logged in to any server

I am writing a crawler. Once the crawler logs into a website, I want to make it stay logged in permanently. How can I do that? Can a client (a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day. ...
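
A client cannot force a server to keep a session alive; what a crawler can do is persist the session cookies and transparently re-login when the server expires them. A minimal Python sketch, with a hypothetical login endpoint and form fields:

    import pickle
    import requests

    LOGIN_URL = "http://example.com/login"   # hypothetical endpoint and fields
    session = requests.Session()

    def login():
        session.post(LOGIN_URL, data={"user": "me", "password": "secret"})
        with open("cookies.pkl", "wb") as f:
            pickle.dump(session.cookies, f)  # survive crawler restarts

    def fetch(url):
        resp = session.get(url)
        # Re-login once if the server bounced us to the login page or sent 401.
        if resp.status_code == 401 or resp.url.startswith(LOGIN_URL):
            login()
            resp = session.get(url)
        return resp

If the site caps logins per day, this re-login-on-demand pattern also keeps the login count to the minimum needed.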

Writing a Faster Python Spider

Hello, I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed. What I need to do is examine the pages for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort thro...
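
At 2.5 million pages the bottleneck is almost certainly network latency, not parsing, so the biggest win is keeping many requests in flight at once. A sketch using a thread pool (the target number and the input file are assumptions):

    from concurrent.futures import ThreadPoolExecutor
    import requests

    NEEDLE = "42"   # hypothetical number being searched for
    urls = (line.strip() for line in open("urls.txt"))  # assumed one URL per line

    def check(url):
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            return None
        return url if NEEDLE in page else None

    # Dozens of requests in flight at once hide per-request latency.
    with ThreadPoolExecutor(max_workers=50) as pool:
        for hit in pool.map(check, urls):
            if hit:
                print(hit)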

Crawler to get external website search results

What is the best practice, and which library can I use, to key a query into the search textbox on an external website and collect the search results? How do I tackle websites with different search boxes and checkboxes and collect the results? Can Selenium be used to automate this? Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...
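
Selenium can indeed automate this; the part that differs per site is the selectors, which are best kept in per-site configuration. A sketch with placeholder selectors:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()
    driver.get("http://example.com")            # hypothetical target site

    box = driver.find_element(By.NAME, "q")     # placeholder field name
    box.send_keys("my query" + Keys.RETURN)

    for r in driver.find_elements(By.CSS_SELECTOR, ".result a"):  # placeholder
        print(r.text, r.get_attribute("href"))
    driver.quit()

Heritrix and Nutch are whole-crawl frameworks; for filling in forms on a handful of sites, a browser-automation tool like Selenium is usually the better fit.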

Facebook crawler?

Hi all, I'm a graduate student whose research is on complex networks. I am working on a project that involves analyzing connections between Facebook users. Is it possible to write a crawler for Facebook based on friendship information? I looked around but couldn't find anything useful so far. It seems Facebook isn't fond of such activit...

Generate a list of all the pages contained in a website programmatically, using PHP

Hello guys, how is it possible to generate a list of all the pages of a given website programmatically using PHP? What I'm basically trying to achieve is to generate something like a sitemap: a nested unordered list with links for all the pages contained in a website. Thank you in advance for your answers, Constantin TOVISI ...
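
The question asks for PHP, but the underlying algorithm is a breadth-first crawl restricted to one domain; a Python sketch of that loop (start URL hypothetical):

    from collections import deque
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    start = "http://example.com/"   # hypothetical site
    domain = urlparse(start).netloc
    seen, queue = {start}, deque([start])

    while queue:
        url = queue.popleft()
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        print(url)   # emit each discovered page; nest by path for the sitemap list
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve, drop fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)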

How to approach a Google Groups discussions crawler

Hi, as an exercise in RSS I would like to be able to search through pretty much all the Unix discussions in this group: comp.unix.shell. I know enough Python and understand basic RSS, but I am stuck on ... how do I grab all messages between particular dates, or at least all messages between the Nth most recent and the Mth most recent? High level description...
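
A sketch of the date filtering with feedparser; note the feed URL is an assumption (the exact Google Groups feed path may differ), and a group feed typically exposes only the most recent messages, so going back arbitrarily far may require walking the group's archive pages instead:

    import time
    import feedparser

    # Assumed feed URL for the group.
    FEED = "http://groups.google.com/group/comp.unix.shell/feed/rss_v2_0_msgs.xml"

    start = time.strptime("2009-01-01", "%Y-%m-%d")
    end = time.strptime("2009-06-30", "%Y-%m-%d")

    for entry in feedparser.parse(FEED).entries:
        when = entry.published_parsed      # struct_time; compares like a tuple
        if start <= when <= end:
            print(time.strftime("%Y-%m-%d", when), entry.title, entry.link)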

PHP: convert all links to absolute URLs

I am writing a website crawler in PHP and I already have code that can extract all links from a site. A problem: sites use a combination of absolute and relative URLs. Examples (http replaced with hxxp as I can't post hyperlinks): hxxp://site.com/, site.com, site.com/index.php, hxxp://site.com/hello/index.php, /hello/index.php, hxxp://...
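
The question is about PHP, but the rule being implemented is standard URL reference resolution: resolve each extracted link against the URL of the page it was found on. Python's standard library shows the expected outputs:

    from urllib.parse import urljoin

    base = "http://site.com/hello/index.php"   # page the link was found on
    for link in ["http://site.com/", "/hello/index.php", "world.php",
                 "../other/page.php", "//cdn.site.com/x.js"]:
        print(urljoin(base, link))

One caveat for the list in the question: a bare host like site.com looks like a relative path to a resolver, so scheme-less hostnames need to be detected and given an http:// prefix before resolution.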

Web crawler parsing PHP/JavaScript links?

I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, JavaScript, etc. On some sites the links are generated by PHP, and when my web crawler tries to navigate to these, it fails. One examp...

How do I make a simple crawler in PHP?

I have a web page with a bunch of links. I want to write a script which would dump all the data contained behind those links into a local file. Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer. ...
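
A sketch of the flow in Python (the question asks for PHP, and the page URL is hypothetical): fetch the page, resolve each link, and write each target to its own local file:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    page = "http://example.com/links.html"   # hypothetical page of links
    soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")

    for i, a in enumerate(soup.find_all("a", href=True)):
        url = urljoin(page, a["href"])       # gotcha: links are often relative
        try:
            body = requests.get(url, timeout=10).content
        except requests.RequestException:
            continue                         # gotcha: some links will be dead
        with open(f"dump_{i}.html", "wb") as f:
            f.write(body)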

Search Engines Crawling Question

The main page of my site is /home.php. This page has pagination with anchor tags that link to many other queries of the same page, for example /home.php?start=4, /home.php?start=8 and so on... My question is: when I include the home.php page in a sitemap, will crawlers crawl whatever pages home.php links to (e.g. /home.php?start=4)? Or d...

How to fix my crawler in C#?

Regex hrefs = new Regex("<a href.*?>");
Regex http = new Regex("http:.*?>");
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source)) {
    sb.Append(http.Match(m.ToString()));
    Console.WriteLine(http.Match(m.ToString()))...
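
The immediate problem is that both patterns match everything up to the next '>', so the appended text includes the closing quote and any attributes after the URL. Capturing only what sits inside the href quotes fixes that; a hypothetical sketch of the same loop in Python:

    import re
    import urllib.request

    source = urllib.request.urlopen("http://google.com").read().decode("utf-8", "ignore")

    # Capture just the quoted URL instead of matching up to the next '>'.
    for url in re.findall(r'<a\s+[^>]*href="(https?://[^"]+)"', source):
        print(url)

For anything beyond quick scripts, an HTML parser is more robust than regexes against unquoted or single-quoted attributes.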

How to build a web crawler based on Scrapy to run forever?

I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to: run forever, meaning it will periodically re-visit some portal pages to get updates; schedule priorities, giving different priorities to different types of URLs; and fetch with multiple threads. I've read the Scrapy docum...
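
Scrapy supports per-request priorities out of the box, and the never-ending behavior can be approximated by re-queuing the portal page from its own callback. A sketch under those assumptions (the site and selectors are placeholders); note Scrapy fetches concurrently via its async engine rather than threads, tuned with CONCURRENT_REQUESTS:

    import scrapy

    class PortalSpider(scrapy.Spider):
        name = "portal"
        start_urls = ["http://news.example.com/"]   # hypothetical portal

        def parse(self, response):
            # Higher priority values are scheduled first.
            for href in response.css("a.headline::attr(href)").getall():  # placeholder
                yield response.follow(href, callback=self.parse_article, priority=10)
            # Re-queue the portal page itself so the crawl never ends;
            # dont_filter bypasses the duplicate filter for the revisit.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True, priority=0)

        def parse_article(self, response):
            for src in response.css("img::attr(src)").getall():
                yield {"image_url": response.urljoin(src)}

Set DOWNLOAD_DELAY (or an external scheduler) so the self-re-queuing doesn't hammer the portal between updates.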

How to prevent all crawlers except good ones (Google, Bing, Yahoo) from accessing website content?

I just want to let Google, Bing and Yahoo crawl my website to build indexes. But I do not want rival websites to use crawling services to steal my content. What should I do? ...
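
robots.txt is the standard way to express this, with the caveat that it only stops crawlers that choose to obey it; content thieves will ignore it, so real enforcement needs server-side measures (user-agent filtering, rate limiting, and reverse-DNS verification of the big crawlers). A robots.txt along these lines:

    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow means "allow everything" for that agent (Slurp is Yahoo's crawler), while the final wildcard block shuts out everyone else who plays by the rules.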

Mining Groups of people from Wikipedia

I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to get the pages and search through them using BeautifulSoup? Or is there any other alternative to get t...
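
For Wikipedia specifically, scraping rendered pages is unnecessary: the MediaWiki API lists category members directly, with a continuation cursor for paging. A sketch (note Category:People_by_occupation mostly contains subcategories, so reaching actual people means recursing into them):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def members(category):
        """Yield the titles of one category's members via the MediaWiki API."""
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": category, "cmlimit": "500", "format": "json"}
        while True:
            data = requests.get(API, params=params).json()
            for m in data["query"]["categorymembers"]:
                yield m["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])   # follow the pagination cursor

    for title in members("Category:People_by_occupation"):
        print(title)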

Is there a way to crawl all Facebook fan pages?

Is there a way to crawl all Facebook fan pages and collect some information? For example, crawling fan pages and saving their names, how many fans they have, etc.? Or at least, do you have a hint of how this could possibly be done? ...

Web crawler update strategy

I want to crawl useful resources (like background pictures ..) from certain websites. It is not a hard job, especially with the help of some wonderful projects like Scrapy. The problem here is I don't just want to crawl this site ONE TIME. I also want to keep my crawl long running and crawl the updated resources. So I want to know: is ther...
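
One standard piece of the answer is HTTP conditional GET: store each resource's ETag / Last-Modified validators on the first crawl, send them back on re-visits, and a 304 response tells the crawler nothing changed without re-downloading. A sketch (the cache should be persisted between runs):

    import requests

    cache = {}   # url -> (etag, last_modified); persist between runs

    def fetch_if_changed(url):
        headers = {}
        etag, modified = cache.get(url, (None, None))
        if etag:
            headers["If-None-Match"] = etag
        if modified:
            headers["If-Modified-Since"] = modified
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:
            return None                      # unchanged since the last crawl
        cache[url] = (resp.headers.get("ETag"),
                      resp.headers.get("Last-Modified"))
        return resp.content                  # new or updated resource

For servers that don't send validators, fall back to hashing the body and adapting each URL's revisit interval to how often it actually changes.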