Situation:
A site with content protected by username/password (not all accounts are under our control, since some are trial/test users).
A normal search engine can't get at it because of the username/password restrictions.
A malicious user can still log in and pass the session cookie to a "wget -r" or something else.
The question would be what is the best solu...
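For illustration, the replay being described looks roughly like this: a logged-in session cookie copied out of a browser and handed to a script. A minimal Python sketch, where the cookie name, site, and member URLs are all hypothetical:

import requests

# Replay a session cookie captured from a normal browser login.
session = requests.Session()
session.cookies.set("session", "COOKIE_VALUE_COPIED_FROM_BROWSER")  # hypothetical cookie name

for path in ["/members/page1", "/members/page2"]:  # hypothetical protected pages
    response = session.get("https://example.com" + path)
    filename = path.strip("/").replace("/", "_") + ".html"
    with open(filename, "w") as out:
        out.write(response.text)

Any mitigation has to assume the attacker can do this, since the cookie alone is what authenticates the requests.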
I'm building a site where vendors will link their products to our product pages (for affiliate sales) and was wondering what algorithms people use to automate/facilitate this process. Right now they would have to manually enter the link for each product they own, which is quite tedious. I noticed that Pricegrabber finds any product you ...
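For context on what "automate/facilitate" might look like, here is a hedged sketch of one common first step (not necessarily what Pricegrabber does): fuzzy string matching between the vendor's product names and our own, using only the Python standard library. The product lists and threshold are made up:

from difflib import SequenceMatcher

our_products = ["Acme Widget 3000", "Acme Widget 3000 Deluxe", "Bolt Cutter XL"]  # hypothetical
vendor_products = ["ACME widget 3000 (deluxe)", "XL bolt cutter"]                 # hypothetical

def best_match(name, candidates, threshold=0.6):
    # Score every candidate by string similarity and keep the best one,
    # or return None if nothing clears the (arbitrary) threshold.
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

for vendor_name in vendor_products:
    print(vendor_name, "->", best_match(vendor_name, our_products))

Matches below the threshold would still need manual confirmation, which keeps the tedious part limited to the ambiguous cases.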
Hello,
I want to crawl a website anonymously without having to rely on an anonymous proxy server.
So I was thinking of letting users of my website help me by inserting an invisible IFrame in my template: the IFrame src would be set to a webpage URL I needed, and the result would then be uploaded to my server with AJAX.
(I can't use AJAX for the downloadi...
Hi,
We installed a new MOSS 2007 farm in a Windows 2008 SP2 environment. We used SQL 2008 too. The configuration is 1 index, 1 FE, and 1 server running 2008, all on ESX 4.0. All the services that need it use a dedicated user, so search has a dedicated user.
The installation went well and we found no problems. We installed MOSS SP1 from an ISO and aft...
I want to crawl a Facebook fan page to get the details of all the members who are fans of that page. Is there any function in the Facebook API which will help me? Or is there any other way I can do this?
...
I was going to crawl a site for some research I was collecting. But apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do?
Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause?
Restriction...
When a user enters my page, I have to make another AJAX call...to load data inside a div.
That's just how my application works.
The problem is that when I view the page source, it does not contain the HTML loaded by that AJAX call. Of course, when I do wget URL, it also does not show the AJAX HTML. Makes sense.
But what about Google? W...
I wish to perform a social network analysis on a bunch of blogs, plotting who is linking to whom (not just by their blogroll but also inside their posts). What software can perform such crawling/data-collecting/mapping?
Thanks!
...
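As a sketch of the data-collection half, the link graph can be built with a fetcher, an HTML parser, and a graph library; this assumes the requests, beautifulsoup4, and networkx packages, and a hypothetical list of blog URLs:

import requests
import networkx as nx
from bs4 import BeautifulSoup
from urllib.parse import urlparse

blogs = ["https://blog-a.example.com", "https://blog-b.example.com"]  # hypothetical
domains = {urlparse(url).netloc for url in blogs}

graph = nx.DiGraph()
for url in blogs:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    source = urlparse(url).netloc
    # Record an edge whenever one blog links to another blog in the set,
    # whether the link sits in a blogroll or inside a post.
    for a in soup.find_all("a", href=True):
        target = urlparse(a["href"]).netloc
        if target in domains and target != source:
            graph.add_edge(source, target)

print(graph.edges())
nx.write_gexf(graph, "blogs.gexf")  # the graph file can then be plotted in a tool such as Gephi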
Preferably, I want the least work possible!
...
Is there an easy way to crawl open-source PHP forums and put their content into categories in my own forum, e.g. "windows", "mac" and so on?
...
Hi..
Maybe not the correct place to post. But, I'm going to try anyway!
I've got a couple of test Python parsing scripts that I created. They work well enough for me to test what I'm working on.
However, I recently came across the Python framework Scrapy, which is used for web scraping. My app runs as a distributed process, across a test...
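For comparison with the hand-rolled parsing scripts, a minimal Scrapy spider looks like this; the domain, start URL, and CSS selectors are hypothetical:

import scrapy

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Extract whatever the existing scripts were parsing.
        for title in response.css("h2.post-title::text").getall():
            yield {"title": title}
        # Follow in-domain links and keep crawling.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

It can be run standalone with scrapy runspider spider.py -o items.json, though distributing it across processes is a separate concern.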
Is it possible to write a script that will log in to an application using uname/pwd?
The username/password are not passed in through a POST (and they don't come in the URL).
The basic steps I am looking for are:
Visit a URL
Enter uname/pwd
Click a button
Click a link
Get the raw HTML to make sure it does not have a 500 error
Is that possible to do...
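A sketch of those steps with Selenium (one tool that can drive a JavaScript login when the credentials are not sent by a plain POST); the URL, field names, button ID, and link text are all hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/login")                        # visit a URL
    driver.find_element(By.NAME, "username").send_keys("myuser")   # enter uname
    driver.find_element(By.NAME, "password").send_keys("mypwd")    # enter pwd
    driver.find_element(By.ID, "login-button").click()             # click a button
    driver.find_element(By.LINK_TEXT, "Reports").click()           # click a link
    html = driver.page_source                                      # get the raw HTML
    # Selenium does not expose the HTTP status code directly, so a crude check
    # is to look for the error text in the page itself.
    assert "500 Internal Server Error" not in html
finally:
    driver.quit()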
I have a site that has certain URLs that point to pages with permanent data and others that point to dynamic web pages. Google indexes both of these regularly. By the time a user finds one of the dynamic-content URLs, the data on the page has already changed and the user does not find what he was looking for. Further, the dynamic URL pages ...
Hi guys,
I have to crawl the contents of several blogs. The problem is that I need to classify whether the blog authors are from a specific school and are talking about the school's stuff. May I know the best approach to doing the crawling, and how should I go about the classification?
...
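Before any real classifier, the filtering part sometimes starts as plain keyword matching over the crawled text; a naive sketch, where the school terms and the post URL are hypothetical and the requests and beautifulsoup4 packages are assumed:

import requests
from bs4 import BeautifulSoup

SCHOOL_TERMS = ["springfield high", "shs alumni", "shs campus"]  # hypothetical keywords

def mentions_school(url):
    # Fetch one blog post and check whether its visible text mentions the school.
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return any(term in text for term in SCHOOL_TERMS)

print(mentions_school("https://someblog.example.com/post/1"))

Keyword hits like these can also double as rough labels for training a proper text classifier later.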
Some sites have a URL pattern like www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range?
...
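Since Nutch reads its seeds from a plain-text URL list, one way to cover an id=1..1000 range is simply to generate it into the seed file; the domain and directory layout here are placeholders:

# Write one seed URL per line for the whole id range.
with open("urls/seed.txt", "w") as seeds:
    for i in range(1, 1001):
        seeds.write(f"https://www.example.com/id={i}\n")

The file is then injected as usual, e.g. bin/nutch inject crawl/crawldb urls/ in a 1.x-style layout.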
I have to crawl a site which lists items based on user input through an HTTP POST submission. How do I configure the POST submission details in Nutch?
I got help on how to do HttpPostAuthentication, but I found no help on how to submit POST data other than a username and password.
...
I have many .cs files and I want to retrieve the methods behind the [AcceptVerbs(HttpVerbs.Post)] attribute from these files automatically.
So the input is:
[AcceptVerbs(HttpVerbs.Post)]
public ActionResult sample(string msg)
{......}
and the output is:
public ActionResult sample(string msg)
{......}
My idea is to use RegularExpressions a...
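As a hedged sketch of that idea in Python: a regular expression can pick out the attribute and the signature that follows it, though extracting the full brace-matched body takes more work. The Controllers folder is a placeholder:

import re
from pathlib import Path

# Match the [AcceptVerbs(HttpVerbs.Post)] attribute line followed by the
# method signature it decorates.
PATTERN = re.compile(
    r"\[AcceptVerbs\(HttpVerbs\.Post\)\]\s*\n"
    r"\s*(public\s+ActionResult\s+\w+\s*\([^)]*\))"
)

def post_actions(path):
    text = Path(path).read_text(encoding="utf-8")
    for match in PATTERN.finditer(text):
        yield match.group(1)

for cs_file in Path("Controllers").glob("**/*.cs"):
    for signature in post_actions(cs_file):
        print(cs_file, "->", signature)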
Is there a way to get Nutch to increase the crawling of pages that get updated frequently?
E.g. index pages and feeds.
It would also be valuable to refresh new pages that contain comments more frequently in the first days after the page was created. Any tips are appreciated.
...
Hi,
I have a website that has a directory that contains 100+ HTML files.
I want crawlers to crawl all the HTML files in that directory.
I have already added the following line to my robots.txt:
Allow: /DirName/*.html$
Is there any way to include the files in the directory in the sitemap.xml file so that all HTML files in the directory will ...
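Listing the directory and emitting the matching <url> entries is one way to build that part of sitemap.xml; a small sketch where the domain and folder name are placeholders:

from pathlib import Path

BASE = "https://www.example.com"
entries = []
for page in sorted(Path("DirName").glob("*.html")):
    entries.append(f"  <url><loc>{BASE}/DirName/{page.name}</loc></url>")

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + "\n".join(entries)
    + "\n</urlset>\n"
)
Path("sitemap.xml").write_text(sitemap)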
For the past month I've been using Scrapy for a web crawling project I've begun.
This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...
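One thing commonly checked before blaming the framework: Scrapy's defaults throttle per-domain concurrency, which matters most when the whole crawl lives under one domain. A sketch of the relevant settings.py knobs (the setting names are real Scrapy settings; the values are only illustrative):

# settings.py
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 32   # the whole crawl is a single domain
DOWNLOAD_DELAY = 0                    # no artificial delay between requests
REACTOR_THREADPOOL_MAXSIZE = 20       # more threads for DNS lookups on large crawls
LOG_LEVEL = "INFO"                    # DEBUG logging itself slows very large crawls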