crawler

allow and disallow in a robots.txt file

Hi, I want to block search-engine bots from all files and folders on my site except one special folder and the files in it. Can I use these lines in my robots.txt file? User-agent: * Disallow: / Allow: /thatfolder Is that right? ...
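
That pattern does work for the major engines, with one caveat: Allow is an extension to the original robots.txt standard, so only crawlers that implement it (Googlebot and Bingbot do) will honor the exception. A commonly recommended form, with the allowed folder listed first for older parsers that take the first matching rule:

    User-agent: *
    Allow: /thatfolder/
    Disallow: /

Crawlers that do not support Allow will simply stay out of the whole site, including the excepted folder.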

looking for input/discussion on implementing a basic distributed client/server crawler architecture

Hi. I'm working on a project creating a distributed client/server web crawler. I've looked into Heritrix/Grub/BOINC/etc... I'm looking for a couple of people who are relatively skilled in this area that I can talk to and discuss some issues with. If you know of a good architecture that addresses both the client- and server-side issues, o...
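
A minimal sketch of the split those systems share: a coordinator process that owns the URL frontier, and workers on other machines that pull from it. This uses Python's remote-manager queue pattern purely as illustration; the names (FrontierManager, coordinator-host) are made up.

    # coordinator.py -- owns the shared URL frontier
    from multiprocessing.managers import BaseManager
    from queue import Queue

    frontier = Queue()                      # URLs waiting to be fetched
    frontier.put("http://example.com/")

    class FrontierManager(BaseManager):
        pass

    FrontierManager.register("frontier", callable=lambda: frontier)

    if __name__ == "__main__":
        mgr = FrontierManager(address=("", 50000), authkey=b"crawl")
        mgr.get_server().serve_forever()

    # worker.py -- run on any number of client machines
    from multiprocessing.managers import BaseManager
    from urllib.request import urlopen

    class FrontierManager(BaseManager):
        pass

    FrontierManager.register("frontier")

    mgr = FrontierManager(address=("coordinator-host", 50000), authkey=b"crawl")
    mgr.connect()
    frontier = mgr.frontier()

    while True:
        url = frontier.get()                # blocks until the coordinator has work
        body = urlopen(url, timeout=10).read()
        # parse links out of body and frontier.put() the unseen ones here

The hard parts Heritrix and friends solve on top of this split are per-host politeness, URL seen-tests too large for memory, and checkpointing.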

How to optimize this ugly code?

I asked a question here the other day, but in the end I decided to do it myself for reasons of time; now I have a little more time to fix it :D I liked jSoup, but I'm kind of old school and prefer doing it myself (thanks to @Bakkal anyway). I managed to write this code, and it works fine for now, but if a webpage is not well constructed ...
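
The excerpt cuts off at the usual pain point: hand-rolled parsing breaks on malformed markup, and the standard fix is a tolerant parser that repairs the tag soup before you walk it. A minimal sketch in Python with BeautifulSoup (assuming link extraction is the goal; the poster's code is Java, where jSoup plays the same role):

    # lenient parsing of badly constructed HTML (pip install beautifulsoup4)
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://example.com/").read()
    soup = BeautifulSoup(html, "html.parser")   # tolerates unclosed/misnested tags

    for a in soup.find_all("a", href=True):
        print(a["href"])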

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project I've begun. The project involves pulling down the full document content of all web pages in a single domain that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly: in 2-3 days I can only pull down 100,000 pa...
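
Before replacing Scrapy, its throttling settings are worth checking; the defaults are conservative, and for a single-domain crawl the per-domain cap is the binding one. Illustrative values in settings.py:

    # settings.py -- concurrency knobs that bound Scrapy's crawl rate
    CONCURRENT_REQUESTS = 64             # global cap (default 16)
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # default 8; decisive for a one-domain crawl
    DOWNLOAD_DELAY = 0                   # any per-request delay serializes the crawl
    ROBOTSTXT_OBEY = True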

Get a mirror of Wikipedia without actually storing HTML

Wikipedia stores all its information on its servers, and the pages are rendered by PHP. Is there a way to download and store the Wikipedia content without actually crawling the website? That would save me time, storage space, and later processing of the crawled data. P.S. I know the question's formulation is bad, but ...
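
There is: Wikipedia publishes complete database dumps at dumps.wikimedia.org precisely so mirrors don't have to crawl the live site. A sketch of streaming pages out of the pages-articles XML dump with the standard library (the filename is illustrative; the {*} wildcard needs Python 3.8+):

    # stream pages from a MediaWiki XML dump without loading it all into memory
    import xml.etree.ElementTree as ET

    def iter_pages(path):
        for _event, elem in ET.iterparse(path):
            if elem.tag.endswith("}page"):        # tags carry the export namespace
                title = elem.find(".//{*}title").text
                text = elem.find(".//{*}text")
                yield title, text.text if text is not None else ""
                elem.clear()                      # release the subtree as we go

    for title, _wikitext in iter_pages("enwiki-latest-pages-articles.xml"):
        print(title)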

Facebook fanpage crawler

I would like to write a Facebook fan-page crawler that collects the following information: 1) fan page name, 2) fan count, 3) feeds. I know I can use the Open Graph API to get this, but I want to write a script that runs once a day, gets all this data, and dumps it into my SQL DB. Is there a better way to do this? Any help is appreciated ...
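
The Graph API is the supported route; the once-a-day part is just a cron job around one HTTP call. A sketch (the page ID and token are placeholders, and the fan_count field name depends on the API version in use):

    # fetch name and fan count for a page via the Graph API
    import json
    from urllib.request import urlopen

    PAGE_ID = "cocacola"          # placeholder page
    ACCESS_TOKEN = "..."          # left elided

    url = ("https://graph.facebook.com/%s?fields=name,fan_count&access_token=%s"
           % (PAGE_ID, ACCESS_TOKEN))
    data = json.loads(urlopen(url).read().decode())
    print(data["name"], data["fan_count"])
    # the feed is a second call to /PAGE_ID/feed; the SQL INSERT would go here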

Crawlable AJAX - return PDF as a static and crawlable page

Hello, I am implementing a non-Flash, HTML-and-JavaScript-only epaper that is based on a PDF: www.patrickmueller.li/epaper . I am telling the Google crawler that this is a crawlable AJAX page, according to the Google spec "Making AJAX Applications Crawlable" (code.google.com/intl/sv-SE/web/ajaxcrawling/docs/getting-starte...
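
Under that spec, the crawler rewrites #! URLs into ?_escaped_fragment_= requests and expects the server to answer those with a static HTML snapshot. A sketch of the server side in Python/Flask (the site's own stack is presumably different; render_snapshot is hypothetical):

    # answer _escaped_fragment_ requests with a prerendered snapshot
    from flask import Flask, request

    app = Flask(__name__)

    def render_snapshot(state):
        # hypothetical: would extract the text layer of the PDF page named by state
        return "<html><body>snapshot for %s</body></html>" % state

    @app.route("/epaper")
    def epaper():
        fragment = request.args.get("_escaped_fragment_")
        if fragment is not None:
            return render_snapshot(fragment)         # the crawler gets static HTML
        return app.send_static_file("epaper.html")   # browsers get the JS app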

do crawlers decode html entities?

I was wondering whether crawlers and robots can decode HTML entities. For example, in my HTML I have something like: salari&eacute;s. Do they read it like that, or as: salariés? Which option is better for SEO? ...
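
Crawlers are generally understood to decode entities while indexing, so both forms end up as the same text and neither has an SEO edge; a correctly declared charset matters more. The equivalence is easy to check with Python's standard library:

    # the entity and the literal character decode to the same indexed text
    import html
    print(html.unescape("salari&eacute;s"))                 # salariés
    print(html.unescape("salari&eacute;s") == "salariés")   # True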

How do I make my AJAX content crawlable by Google?

Hi all. I've been working on a site that uses jQuery heavily and loads in content via AJAX, like so: $('#newPageWrapper').load(newPath + ' .pageWrapper', function() { //on load logic }); It has now come to my attention that Google won't index any content loaded dynamically via JavaScript, so I've been looking for a solution to th...

rails 404 errors in subfolders of public

Hi, it looks to me like crawlers try to request the index of every subfolder of the public folder, e.g. "/images/foo", which produces a 404 error. Should I do something about this, or is it normal? ...

critique this python code (crawler with thread pool)

Hi, how good is this Python code? I need criticism. There is an error in it: sometimes the script prints "ALL WAIT - CAN FINISH!" and freezes (no further actions happen), but I can't find the reason why. Site crawler with thread pool: import sys from urllib import urlopen from BeautifulSoup import BeautifulSoup, SoupStrainer...
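
The excerpt stops before the threading logic, but the symptom (every thread concluding that all the others are waiting, then freezing) is the classic hand-rolled-termination race. Delegating completion tracking to queue.Queue avoids it entirely; a minimal sketch of the same crawler shape:

    # thread-pool crawler whose termination is delegated to Queue.join()
    import threading
    from queue import Queue
    from urllib.request import urlopen

    tasks, seen = Queue(), {"http://example.com/"}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            try:
                body = urlopen(url, timeout=10).read()
                # parse links from body; for each unseen one:
                # with lock: seen.add(link); tasks.put(link)
            except Exception:
                pass                    # a real crawler would log the failure
            finally:
                tasks.task_done()       # always balances the get()

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    tasks.put("http://example.com/")
    tasks.join()                        # returns only when every URL is processed
    print("done")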

python asyncore or threadpool for web crawler?

It seems I can make a fast crawler in Python in two ways: a thread pool with blocking sockets, or non-blocking sockets (select, asyncore, etc.). I think there is no real need for threads here, and solution #2 is better. Which is better, and why? ...
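
For comparison with the thread-pool sketch above, the non-blocking route with asyncore looks roughly like this (asyncore dates from that era and was removed in Python 3.12):

    # single-threaded, non-blocking fetches multiplexed by asyncore
    import asyncore, socket

    class Fetcher(asyncore.dispatcher):
        def __init__(self, host, path="/"):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.buf = ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode()
            self.size = 0

        def handle_connect(self):
            pass

        def writable(self):
            return bool(self.buf)       # only ask for write events while sending

        def handle_write(self):
            sent = self.send(self.buf)
            self.buf = self.buf[sent:]

        def handle_read(self):
            self.size += len(self.recv(8192))

        def handle_close(self):
            self.close()
            print(self.addr, self.size, "bytes")

    for host in ("example.com", "example.org"):
        Fetcher(host)
    asyncore.loop()    # one thread drives every open socket

In practice the thread pool is easier to get right; the single-threaded design only starts to win when thousands of sockets are open at once.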

Which information is stored by the Google crawler?

.. and how does the web crawler infer the semantics of the information on a website? Please list the ranking signals in a separate answer. ...

Crawling a Page with dynamically generated content

Hello, I have been using the java.net classes for a custom-built crawler. The problem is with dynamically generated content, such as comments on a blog. Consider the following page: http://www.avc.com/a_vc/2010/09/contrarian-investing.html . If you crawl the page and get the source, you can't see the entire content of the pa...
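
Comments like those are injected by JavaScript after the page loads, so the raw source will never contain them. The two usual options are calling the comment widget's underlying JSON endpoints directly, or rendering the page in a real browser engine; a sketch of the latter with Selenium (shown in Python; Selenium also has Java bindings, closer to the poster's stack):

    # render the page in a browser so script-generated content reaches the DOM
    import time
    from selenium import webdriver

    driver = webdriver.Firefox()    # any WebDriver-backed browser
    driver.get("http://www.avc.com/a_vc/2010/09/contrarian-investing.html")
    time.sleep(5)                   # crude; an explicit wait for the comment div is better
    html = driver.page_source       # the DOM after JavaScript ran, comments included
    driver.quit()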

Extracting data from an ASPX page

I've been entrusted with an idiotic task by my boss. The task is: given a web application that returns a table with pagination, write software that "reads and parses it", since there is nothing like a web service that provides the raw data. It's like a "spider" or "crawler" application to scrape data that is not meant to be a...
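
ASPX pagination is typically a __doPostBack form submission: the page posts its hidden state fields back to itself. A sketch of walking the pages with requests and BeautifulSoup (both pip-installable; the event-target value is a placeholder you read out of the real pager's links):

    # page through an ASPX grid by replaying its postback
    import requests
    from bs4 import BeautifulSoup

    URL = "http://example.com/report.aspx"        # placeholder
    s = requests.Session()
    soup = BeautifulSoup(s.get(URL).text, "html.parser")

    for page in range(2, 6):
        form = {
            "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
            "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
            "__EVENTTARGET": "GridView1",         # placeholder control name
            "__EVENTARGUMENT": "Page$%d" % page,  # GridView paging convention
        }
        soup = BeautifulSoup(s.post(URL, data=form).text, "html.parser")
        # the current page's table rows are now in soup; extract the cells here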

Identifying a Search Engine Crawler

I am working on a website that loads its data via AJAX. I also want the whole website to be crawlable by search engines like Google and Yahoo, so I want to make 2 versions of the site... [1] When a user comes, the hyperlinks should work just like Gmail's (#'ed hyperlinks). [2] When a crawler comes, the hyperlinks should work normally (AJAX...
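
Since the User-Agent header is trivial to spoof, the verification Google documents is a reverse DNS lookup followed by a forward confirmation. A sketch:

    # confirm that a visitor claiming to be Googlebot really is one
    import socket

    def is_googlebot(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]                 # reverse lookup
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]      # forward-confirm
        except (socket.herror, socket.gaierror):
            return False

    print(is_googlebot("66.249.66.1"))   # an address from a published Googlebot range

Note that serving crawlers different markup than users risks being treated as cloaking; serving the same content in a crawlable form is the safer design.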

How to write this crawler in PHP?

I need to create a PHP script. The idea is very simple: when I send the link of a blog post to this PHP script, the webpage is crawled and the first image and the page title are saved on my server. Which PHP function should I use for this crawler? ...
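
In PHP the usual building blocks are file_get_contents (or cURL) for the fetch and DOMDocument for the parse. The flow itself, sketched in Python to keep this digest's examples in one language:

    # fetch a blog post, save its title and first image
    from urllib.request import urlopen, urlretrieve
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup      # pip install beautifulsoup4

    url = "http://example.com/blogpost"          # placeholder
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")

    title = soup.title.string if soup.title else ""
    img = soup.find("img", src=True)
    if img:
        urlretrieve(urljoin(url, img["src"]), "first_image.jpg")
    print(title)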

Problem with a web-based code generator

I want to write a web-based code generator for a Python crawler. Its aim is to generate code automatically so a developer doesn't need to write it, but I've run into this problem: on one of my project's webpages there are some checkboxes, buttons, etc. Each of them generates some Python code and writes it to a common textarea. However, ...

Scrapy - how to identify already-scraped URLs

Hi, I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples of SgmlLinkExtractor? -Avinash ...
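
Within one run Scrapy already drops duplicate requests via its dupefilter; across daily runs the common trick is persisting the seen set yourself. A sketch using the newer selector API (the filename and URLs are placeholders):

    # skip URLs that earlier runs already scraped
    import os
    import scrapy

    class NewsSpider(scrapy.Spider):
        name = "news"
        start_urls = ["http://example.com/news"]
        seen_file = "seen_urls.txt"

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.seen = set()
            if os.path.exists(self.seen_file):
                self.seen = set(open(self.seen_file).read().split())

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                url = response.urljoin(href)
                if url not in self.seen:
                    self.seen.add(url)
                    with open(self.seen_file, "a") as f:
                        f.write(url + "\n")
                    yield scrapy.Request(url, callback=self.parse_article)

        def parse_article(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

As for SgmlLinkExtractor: later Scrapy releases deprecated it in favor of the lxml-based LinkExtractor, which is where the current documentation lives.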

HTML Snapshot for crawler - Understanding how it works

Hi. I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET". I want to check whether I have understood :) I create that PHP script (gethtmlsnapshot.php) where I include the server-side AJAX page (getdata.php) and I escape (for secur...
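
For the other half of the scheme, the rewrite the crawler applies between the pretty #! URL and the request a snapshot script like gethtmlsnapshot.php actually receives is mechanical; a small sketch of it:

    # the #! -> ?_escaped_fragment_= rewrite Googlebot performs per the spec
    from urllib.parse import quote

    def to_crawler_url(pretty_url):
        base, sep, fragment = pretty_url.partition("#!")
        if not sep:
            return pretty_url                  # no hash-bang: fetched as-is
        joiner = "&" if "?" in base else "?"
        return base + joiner + "_escaped_fragment_=" + quote(fragment, safe="")

    print(to_crawler_url("http://example.com/page.php#!state=2"))
    # -> http://example.com/page.php?_escaped_fragment_=state%3D2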