crawling

How to protect/monitor your site from crawling by a malicious user

Situation: a site with content protected by username/password (not all accounts are controlled, since they can be trial/test users). A normal search engine can't get at the content because of the username/password restrictions, but a malicious user can still log in and pass the session cookie to a "wget -r" or something else. The question would be what is the best solu...
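
A common server-side mitigation, since the content is behind a session anyway, is to rate-limit requests per session or IP and flag sessions that fetch pages faster than a human could read them. A minimal sketch in Python (the window and threshold are illustrative values, not recommendations):

    import time
    from collections import defaultdict, deque

    # Illustrative sliding-window limiter keyed by session ID (or client IP).
    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120   # a sustained ~2 pages/second looks more like wget than a reader

    _hits = defaultdict(deque)

    def allow_request(session_id):
        """Return True if this session is still under the rate limit."""
        now = time.time()
        hits = _hits[session_id]
        while hits and hits[0] < now - WINDOW_SECONDS:
            hits.popleft()           # drop timestamps that fell out of the window
        if len(hits) >= MAX_REQUESTS:
            return False             # throttle, show a CAPTCHA, or invalidate the session
        hits.append(now)
        return True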

How does Pricegrabber find and link to all the products?

I'm building a site where vendors will link their products to our product pages (for affiliate sales) and was wondering what algorithms people use to automate/facilitate this process. Right now they would have to manually enter the link for each product they own, which is quite tedious. I noticed that Pricegrabber finds any product you ...
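
Price-comparison engines generally normalize product titles and fuzzy-match them against a canonical catalog, falling back on identifiers such as UPC/EAN where vendors supply them. A toy sketch of the fuzzy-matching step using Python's difflib; the catalog entries and cutoff are invented for illustration:

    import difflib

    # Invented catalog of canonical product names -> internal product page IDs.
    catalog = {
        "Canon EOS 550D Digital SLR Camera": 101,
        "Nikon D90 Digital SLR Camera": 102,
    }

    def match_product(vendor_title, cutoff=0.6):
        """Fuzzy-match a vendor's product title against the catalog."""
        best = difflib.get_close_matches(vendor_title, list(catalog), n=1, cutoff=cutoff)
        return catalog[best[0]] if best else None

    print(match_product("Canon EOS550D SLR camera"))   # -> 101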

How to crowd-source my web crawling

Hello, I want to crawl a website anonymously without having to rely on an anonymous proxy server. So I was thinking of letting the users of my website help me by inserting an invisible IFrame in my template: the IFrame src would be set to a webpage URL I needed, and the result then uploaded to my server with AJAX. (I can't use AJAX for the downloadi...

Crawling not working on Windows 2008

Hi, we installed a new MOSS 2007 farm in a Windows 2008 SP2 environment, using SQL 2008 as well. The configuration is 1 index, 1 FE, and 1 server with 2008, all on ESX 4.0. Every service that needs one uses a dedicated user, so search has a dedicated user. Installation went well and we found no problems. We installed SP1 for MOSS from an ISO and aft...

Crawling a Facebook fan page ...

I want to crawl a Facebook fan page to get the details of all the members who are fans of that page. Is there any function in the Facebook API which will help me? Or is there any other way I can do this? ...

Legality, terms of service for performing a web crawl

I was going to crawl a site for some research I was collecting, but apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause? Restriction...

Do Google's crawlers have JavaScript? What if I load a page through AJAX?

When a user enters my page, I have to make another AJAX call to load data inside a div. That's just how my application works. The problem is that when I view the source of the page, it does not contain the output of that AJAX call. Of course, when I do wget URL it also does not show the AJAX HTML. Makes sense. But what about Google? W...
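
At the time of writing, Google's crawler generally does not execute JavaScript, so AJAX-loaded content is invisible to it. Google's proposed AJAX crawling scheme has the crawler translate #! fragment URLs into a ?_escaped_fragment_= query that your server can answer with a pre-rendered HTML snapshot. A minimal WSGI sketch, where render_snapshot() and render_shell() are hypothetical helpers:

    from urllib.parse import parse_qs

    def app(environ, start_response):
        qs = parse_qs(environ.get("QUERY_STRING", ""))
        if "_escaped_fragment_" in qs:
            # The crawler is asking for a snapshot of an AJAX state.
            body = render_snapshot(qs["_escaped_fragment_"][0])  # hypothetical helper
        else:
            body = render_shell()   # hypothetical helper: page that fires the AJAX call
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body.encode("utf-8")]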

What's the best way to map the link connections between blogs?

I wish to perform a social network analysis on a bunch of blogs, plotting who is linking to whom (not just via their blogrolls but also inside their posts). What software can perform such crawling/data collection/mapping? Thanks! ...
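
If no dedicated tool fits, a small crawler that extracts anchors and feeds inter-blog links into a graph library covers the data-collection and mapping steps. A rough sketch with requests, BeautifulSoup, and networkx; the blog list is made up, and politeness delays and scoping to post bodies are left out:

    import requests
    import networkx as nx
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    blogs = ["http://blog-a.example.com/", "http://blog-b.example.com/"]  # made up

    graph = nx.DiGraph()
    for blog in blogs:
        source = urlparse(blog).netloc
        soup = BeautifulSoup(requests.get(blog, timeout=10).text, "html.parser")
        for a in soup.find_all("a", href=True):
            target = urlparse(urljoin(blog, a["href"])).netloc
            if target and target != source:
                graph.add_edge(source, target)   # "source links to target"

    print(graph.edges())   # ready for plotting or centrality analysis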

If I want to get a Facebook user's "info" and "posts", do I need Facebook Connect or a Facebook Application?

Preferably, I want the least work possible! ...

Crawl open-source PHP forums?

Is there an easy way to crawl open-source PHP forums and put their posts into categories in my own forum, e.g. "Windows", "Mac" and so on? ...

Scrapy - Python question

Hi. Maybe this is not the correct place to post, but I'm going to try anyway! I've got a couple of test Python parsing scripts that I created. They work well enough for me to test what I'm working on. However, I recently came across the Python framework Scrapy, which is used for web scraping. My app runs in a distributed process, across a test...

How to write a script that logs into an application and checks a page

Is it possible to write a script that will log in to an application using a username/password? The username/password are not passed in through POST (they don't come in the URL). The basic steps I am looking for are: visit a URL, enter the username/password, click a button, click a link, and get the raw HTML to make sure it does not have a 500 error. Is that possible to do...
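
Yes, as long as the login is a plain HTML form rather than JavaScript-driven; an HTTP library with cookie support can drive the whole flow. A sketch using Python's requests, with made-up URLs and form field names:

    import requests

    BASE = "http://app.example.com"   # made-up URL

    session = requests.Session()      # carries cookies across requests
    session.get(BASE + "/login")      # visit the login page first
    resp = session.post(BASE + "/login", data={
        "username": "testuser",       # field names depend on the actual form
        "password": "secret",
    })
    resp.raise_for_status()

    page = session.get(BASE + "/reports/daily")   # "click" a link
    assert page.status_code != 500, "server error on target page"
    print(page.text[:200])            # raw HTML, available for further checks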

How to enable indexing of pages with dynamic data?

I have a site where certain URLs point to pages with permanent data and others point to dynamic web pages. Google indexes both regularly. By the time a user finds one of the dynamic-content URLs, the data on the page has already changed and the user does not find what he was looking for. Further, the dynamic URL pages ...

Need help with site classification

Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs' authors are from a specific school and are talking about the school's affairs. May I know the best approach to the crawling, and how I should go about the classification? ...
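
For the classification step, a bag-of-words classifier trained on a few hand-labeled posts is a common starting point before anything fancier. A sketch with scikit-learn; the training examples are invented placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Invented, hand-labeled examples: 1 = about the school, 0 = not.
    posts = [
        "our campus library opened a new study hall",
        "the dorm food this semester is terrible",
        "my trip to the mountains last weekend",
        "reviewing the new phone I bought",
    ]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()
    clf = MultinomialNB().fit(vectorizer.fit_transform(posts), labels)

    new_post = ["final exams schedule posted on the campus portal"]
    print(clf.predict(vectorizer.transform(new_post)))   # -> [1]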

Nutch crawling with seed URLs in a range

Some sites have URL patterns running from www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range? ...
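
Nutch seed lists are plain files of URLs, and I'm not aware of any range syntax, so the usual workaround is to generate the list up front and drop it into the seed file. A sketch (keeping the placeholder domain from the question):

    # Write seed URLs id=1..1000 into a file for Nutch's seed directory.
    with open("seed.txt", "w") as f:
        for i in range(1, 1001):
            f.write("http://www.___.com/id=%d\n" % i)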

Configure HTTP POST data input to Nutch before crawling a site

I have to crawl a site which lists items based on user input sent through an HTTP POST submission. How do I configure the POST submission details in Nutch? I found help on how to do HttpPostAuthentication, but nothing on how to submit POST data other than a username and password. ...

Retrieve method/code from many files

I have many .cs files and I want to retrieve the method behind the [AcceptVerbs(HttpVerbs.Post)] attribute from these files automatically. So the input is: [AcceptVerbs(HttpVerbs.Post)] public ActionResult sample(string msg) {......} and the output is: public ActionResult sample(string msg) {......} My idea is to use RegularExpressions a...
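
A regex can find the attribute, but matching the method body reliably needs brace counting, since bodies can contain nested blocks. A sketch in Python that scans a directory of .cs files and prints the method following each attribute; the "src" directory is a placeholder:

    import re
    from pathlib import Path

    ATTR = re.compile(r"\[AcceptVerbs\(HttpVerbs\.Post\)\]")

    def methods_after_attribute(source):
        """Yield the method text that follows each matching attribute."""
        for m in ATTR.finditer(source):
            start = m.end()
            i = source.index("{", start)   # opening brace of the method body
            depth = 0
            while i < len(source):         # walk to the matching close brace
                if source[i] == "{":
                    depth += 1
                elif source[i] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                i += 1
            yield source[start:i + 1].strip()

    for path in Path("src").rglob("*.cs"):
        for method in methods_after_attribute(path.read_text()):
            print(method)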

Getting Nutch to prioritize frequently updated pages?

Is there a way to get Nutch to crawl pages that get updated frequently, e.g. index pages and feeds, more often? It would also be valuable to refresh pages that contain comments more frequently during the first days after the page is created. Any tips are appreciated. ...

Are wildcards allowed in a sitemap.xml file?

Hi, I have a website with a directory that contains 100+ HTML files. I want crawlers to crawl all the HTML files in that directory. I have already added the following line to my robots.txt: Allow: /DirName/*.html$ Is there any way to include the files in the directory in the sitemap.xml file so that all HTML files in the directory will ...
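
Wildcards are not part of the sitemap protocol; each URL must be listed explicitly as a full <loc> entry. Since the files all sit in one directory, generating the entries is easy to script. A sketch with a placeholder domain:

    from pathlib import Path

    SITE = "http://www.example.com"   # placeholder domain

    entries = []
    for page in sorted(Path("DirName").glob("*.html")):
        entries.append("  <url><loc>%s/DirName/%s</loc></url>" % (SITE, page.name))

    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries) + "\n</urlset>\n"
    )
    Path("sitemap.xml").write_text(sitemap)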

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project I've begun. This project involves pulling down the full document content of all web pages in a single domain that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...
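
Before switching frameworks, it may be worth checking Scrapy's concurrency settings, since the defaults are conservative and 100,000 pages in 2-3 days works out to roughly one request every two seconds. A sketch of settings.py values to experiment with (names per current Scrapy docs; the numbers are illustrative, not recommendations):

    # settings.py -- tune against what the target server will tolerate.
    CONCURRENT_REQUESTS = 64             # default is 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # single-domain crawl, so raise this too
    DOWNLOAD_DELAY = 0                   # no artificial per-request delay
    DOWNLOAD_TIMEOUT = 15                # fail slow pages quickly instead of waiting
    RETRY_ENABLED = False                # skip retries if partial coverage is acceptable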