Hi,
I want to disallow all files and folders on my site to search-engine bots, except one special folder and the files in it.
Can I use these lines in my robots.txt file?
User-agent: *
Disallow: /
Allow: /thatfolder
Is this right?
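For what it's worth, Google resolves Allow/Disallow by the most specific match, but some parsers (including Python's own robotparser) take the first matching rule, so listing Allow before Disallow is the safer order. A minimal sketch to sanity-check the rules, assuming Python 3 and a placeholder example.com:

# Sanity-check robots.txt rules with Python 3's urllib.robotparser.
# example.com and the sample paths are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /thatfolder",   # listed first: robotparser uses first-match-wins
    "Disallow: /",
])

print(rp.can_fetch("*", "http://example.com/thatfolder/page.html"))   # True
print(rp.can_fetch("*", "http://example.com/otherfolder/page.html"))  # False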
...
Hi.
I'm working on a project creating a distributed client/server web crawler. I've looked into Heritrix/Grub/BOINC/etc. I'm looking for a couple of people who are relatively skilled in this area that I can talk to and discuss some issues with.
If you know of a good architecture that discusses both the client/server-side issues, o...
I asked a question here the other day, but in the end I decided to do it myself for reasons of time; now I have a little more time to fix it :D I liked jSoup, but I'm kind of old school and prefer doing it myself (thanks to @Bakkal anyway).
I managed to write this code, and it works fine for now, but if a webpage is not well constructed ...
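The excerpt cuts off, but malformed markup is the classic pain point for hand-rolled parsers. For the do-it-yourself route, Python's standard html.parser is deliberately tolerant of unclosed and misnested tags; a minimal link-extractor sketch along those lines (the class and sample markup are illustrative, not from the original post):

# A tolerant hand-rolled link extractor built on Python 3's html.parser,
# which does not choke on badly constructed pages.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
# An unclosed <b> and a stray </div> are handled without raising.
extractor.feed('<p><b>hello <a href="/page1">one</a></div><a href="/page2">two')
print(extractor.links)  # ['/page1', '/page2']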
For the past month I've been using Scrapy for a web crawling project I've begun.
This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...
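Scrapy's defaults are deliberately polite, so throughput is often capped by settings rather than by the framework. A hedged sketch of the first knobs I would try in settings.py (values are illustrative; the right ones depend on how much load the target server tolerates):

# settings.py -- illustrative values; tune them for the target site.
CONCURRENT_REQUESTS = 64             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # default is 8; key for a single domain
DOWNLOAD_DELAY = 0                   # no artificial pause between requests
RETRY_ENABLED = False                # skip retries of failing pages
DOWNLOAD_TIMEOUT = 15                # default is 180; drop slow pages sooner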
Wikipedia stores all of its information on its servers, and the pages are rendered by PHP. Is there a way to download and store the Wikipedia content without actually crawling through the website? That would save me time and storage space, as well as later processing of the crawled data.
P.S. I know that the question formulation is bad but ...
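There is: Wikimedia publishes complete database dumps at dumps.wikimedia.org, including a single compressed XML file of all article text, so no crawling is needed. A minimal download sketch (the exact filename varies by dump date; 'latest' is a convenience alias, and the full English file is very large):

# Fetch the English Wikipedia article dump instead of crawling the site.
# Adjust the URL for other languages or for smaller partial dumps.
import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")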
I would like to write a Facebook fan page crawler which collects the following information: 1) fan page name, 2) fan count, 3) feeds.
I know I can use the Graph API to get this, but I want to write a script which will run once a day, fetch all this data, and dump it into my SQL DB.
Is there a better way to do this?
Any help is appreciated.
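Since the Graph API already exposes these fields, the daily script can be little more than one HTTP call plus an INSERT, run from cron. A rough sketch using requests and sqlite3 (PAGE_ID, ACCESS_TOKEN, and the API version are placeholders, and a field like fan_count should be checked against the API version you target):

# Daily fan-page snapshot: fetch name/fan count/feed from the Graph API
# and append them to a local SQLite DB. Schedule this once a day via cron.
import json
import sqlite3
import requests

PAGE_ID = "your_page_id"            # placeholder
ACCESS_TOKEN = "your_access_token"  # placeholder

resp = requests.get(
    "https://graph.facebook.com/v2.12/%s" % PAGE_ID,
    params={"fields": "name,fan_count,feed", "access_token": ACCESS_TOKEN},
)
data = resp.json()

conn = sqlite3.connect("fanpages.db")
conn.execute("""CREATE TABLE IF NOT EXISTS snapshots
                (page_name TEXT, fan_count INTEGER, feed_json TEXT)""")
conn.execute("INSERT INTO snapshots VALUES (?, ?, ?)",
             (data["name"], data["fan_count"], json.dumps(data.get("feed"))))
conn.commit()
conn.close()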
...
Hello
I am implementing a non-Flash, HTML-and-JavaScript-only epaper that is based on a PDF: www.patrickmueller.li/epaper . I am telling the Google crawler that this is an AJAX page that is crawlable according to the Google spec "Making AJAX Applications Crawlable" (code.google.com/intl/sv-SE/web/ajaxcrawling/docs/getting-starte...
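Under that scheme, the crawler rewrites a #!key=value URL into ?_escaped_fragment_=key=value and expects a static HTML snapshot in return. A minimal server-side sketch of that contract using Flask (render_snapshot and the routing are placeholders for however the epaper actually builds its pages):

# Serve an HTML snapshot when the AJAX-crawling scheme asks for one.
# The crawler turns http://site/#!page=3 into http://site/?_escaped_fragment_=page=3
from flask import Flask, request

app = Flask(__name__)

def render_snapshot(state):
    # Placeholder: return the full HTML the AJAX page would display
    # for this application state.
    return "<html><body>Snapshot for state: %s</body></html>" % state

@app.route("/")
def index():
    fragment = request.args.get("_escaped_fragment_")
    if fragment is not None:
        return render_snapshot(fragment)        # crawlers get static HTML
    return app.send_static_file("index.html")   # users get the AJAX app

if __name__ == "__main__":
    app.run()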
I was wondering if crawlers and robots can decode HTML entities. For example, in my HTML I have something like:
salari&eacute;s
Do they read it like that, or as something like:
salariés
Which option is better for SEO?
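Search engines normalize standard entities, so both forms should index as the same word; the entity form is simply safer when the page's charset declaration can't be trusted. For illustration, Python 3's standard library performs the same decoding a crawler does:

# Both the named and the numeric entity decode to the same accented word.
import html

print(html.unescape("salari&eacute;s"))  # named entity   -> salariés
print(html.unescape("salari&#233;s"))    # numeric entity -> salariés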
...
Hi all.
I've been working on a site that uses jQuery heavily and loads in content via AJAX like so:
$('#newPageWrapper').load(newPath + ' .pageWrapper', function() {
    // on-load logic
});
It has now come to my attention that Google won't index any content loaded dynamically via JavaScript, and so I've been looking for a solution to th...
Hi,
it looks to me like crawlers try to request the index of all public-folder subfolders, like "/images/foo", which produces a 404 error. Should I do something, or is this normal?
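This is normal: crawlers probe directory URLs they derive from asset paths, and a 404 there does no harm. If the log noise bothers you, one option is to keep them out of those folders via robots.txt (a sketch; adjust the folder names to your layout):

User-agent: *
Disallow: /images/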
...
Hi, how good is this Python code? I need criticism.
There is an error in this code: sometimes the script prints "ALL WAIT - CAN FINISH!"
and freezes (no more actions happen), but I can't find the reason why this happens.
It's a site crawler with a thread pool:
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup, SoupStrainer...
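The excerpt is cut off, but the classic cause of that freeze is workers blocking forever on an empty queue (or waiting on each other) once the frontier drains. A hedged Python 3 sketch (the original uses Python 2 imports) of a pool built on queue.Queue with task_done()/join(), so the main thread, not the workers, decides when the crawl is finished:

# Thread-pool crawler skeleton that cannot deadlock on an empty queue:
# workers block on q.get(), and the main thread uses q.join() to detect
# the moment every queued URL has been fully processed.
import threading
from queue import Queue
from urllib.request import urlopen

NUM_WORKERS = 8
q = Queue()
seen = set()
seen_lock = threading.Lock()

def extract_links(page, base_url):
    # Placeholder: parse 'page' and return the absolute URLs found in it.
    return []

def worker():
    while True:
        url = q.get()  # blocks until a URL is available
        try:
            page = urlopen(url, timeout=10).read()
            for link in extract_links(page, url):
                with seen_lock:
                    if link not in seen:
                        seen.add(link)
                        q.put(link)
        except Exception:
            pass  # even a failed page must reach task_done()
        finally:
            q.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

q.put("http://example.com/")
q.join()  # returns only when every put() has a matching task_done()
print("ALL DONE - CAN FINISH!")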
It seems that I can build a fast crawler with Python in two ways:
1. a thread pool with blocking sockets
2. non-blocking sockets (select, asyncore, etc.)
I think there is no real need for threads here, and solution #2 is better.
Which is better, and why?
...
.. and how does the web crawler infer the semantics of the information on the website?
Please list the ranking signals in a separate answer.
...
Hello,
I have been using java.net for a custom-built crawler. The problem is with dynamically generated content, like comments on a blog, for example. Consider the following page: http://www.avc.com/a_vc/2010/09/contrarian-investing.html . If you crawl the page and get the source, you can't view the entire content of the pa...
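Plain URL fetching will never see those comments, because they are injected by JavaScript after the page loads; the usual workaround is to let a real browser engine render the page first and crawl the resulting DOM. A sketch with Selenium as one such option (not part of the original java.net setup):

# Render the page in a real browser so JavaScript-injected content
# (e.g., the comment widget) is present in the source we crawl.
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.avc.com/a_vc/2010/09/contrarian-investing.html")
time.sleep(5)  # crude: give the comment widget time to load
rendered_html = driver.page_source  # the DOM after scripts have run
driver.quit()

print(len(rendered_html))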
I've been entrusted with an idiotic task by my boss.
The task is: given a web application that returns a table with pagination, write a program that "reads and parses it", since there is nothing like a web service that provides the raw data. It's like a "spider" or a "crawler" application to steal data that is not meant to be a...
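Distasteful or not, the mechanics are simple: walk the page parameter and parse each table in turn. A sketch with requests and BeautifulSoup, where the URL, the 'page' parameter, and the table id are all hypothetical stand-ins for the real application:

# Walk a paginated HTML table and collect its rows.
# URL, 'page' parameter, and table structure are hypothetical.
import requests
from bs4 import BeautifulSoup

rows = []
page = 1
while True:
    resp = requests.get("http://example.com/report", params={"page": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    table = soup.find("table", id="results")
    body_rows = table.find_all("tr")[1:] if table else []  # skip header row
    if not body_rows:
        break  # past the last page
    for tr in body_rows:
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    page += 1

print(len(rows), "rows scraped")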
I am working on a website which loads its data via AJAX. I also want the whole website to be crawlable by search engines like Google and Yahoo.
I want to make 2 versions of the site...
[1] When a user comes, the hyperlinks should work just like Gmail's (#'ed hyperlinks)
[2] When a crawler comes, the hyperlinks should work normally (AJAX...
I need to create a PHP script.
The idea is very simple:
when I send a link to a blog post to this PHP script, the webpage is crawled and the first image and the page title are saved on my server.
Which PHP functions do I have to use for this crawler?
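In PHP this is usually file_get_contents() (or cURL) plus a DOM parser such as DOMDocument; to keep the examples here in one language, the same flow sketched in Python (the blog URL is a placeholder):

# Fetch a blog post, grab its <title> and first <img>, and save the image
# locally -- the flow you would build in PHP with file_get_contents() and
# DOMDocument. The URL below is a placeholder.
import os
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

post_url = "http://example.com/blog/some-post"
html = urllib.request.urlopen(post_url).read()
soup = BeautifulSoup(html, "html.parser")

title = soup.title.get_text(strip=True) if soup.title else "untitled"

img = soup.find("img")
if img and img.get("src"):
    img_url = urllib.parse.urljoin(post_url, img["src"])
    filename = os.path.basename(urllib.parse.urlparse(img_url).path) or "image"
    urllib.request.urlretrieve(img_url, filename)
    print("saved %r for post %r" % (filename, title))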
...
I want to write a web-based code generator for a Python crawler. Its aim is to automatically generate code so a developer doesn't need to write it, but I've run into this problem: in one of my project's webpages, there are some checkboxes, buttons, etc. Each of them generates some Python code and writes it to a common textarea. However, ...
Hi,
I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples of SgmlLinkExtractor?
-Avinash
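Within a single run Scrapy already filters duplicate requests; across daily runs the seen URLs have to be persisted yourself. A sketch of one way, carrying a set loaded from disk into the spider (the file name and spider details are illustrative; SgmlLinkExtractor lives in scrapy.contrib.linkextractors.sgml in Scrapy versions of that era):

# Persist crawled URLs between daily runs so already-scraped pages are skipped.
# 'seen_urls.txt' and the spider details are illustrative.
import os
from scrapy.spider import BaseSpider
from scrapy.http import Request

SEEN_FILE = "seen_urls.txt"

class NewsSpider(BaseSpider):
    name = "news"
    start_urls = ["http://example.com/news"]  # placeholder

    def __init__(self, *args, **kwargs):
        super(NewsSpider, self).__init__(*args, **kwargs)
        self.seen = set()
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                self.seen = set(line.strip() for line in f)

    def parse(self, response):
        for url in self.extract_article_links(response):
            if url not in self.seen:
                self.seen.add(url)
                with open(SEEN_FILE, "a") as f:
                    f.write(url + "\n")
                yield Request(url, callback=self.parse_article)

    def extract_article_links(self, response):
        return []  # placeholder: plug in a link extractor here

    def parse_article(self, response):
        pass  # placeholder: scrape the article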
...
Hi, I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET".
I want to check whether I have understood it correctly :)
I create that PHP script (gethtmlsnapshot.php) where I include the server-side AJAX page (getdata.php) and I escape (for secur...