crawl

Is there a good open source search engine, including an indexing bot, that can be used to build a special catalogue by feeding the bot certain properties?

Hello, our application (C#/.NET) needs to run a very large number of search queries. Google's limit of 50,000 queries per day is not enough. We need something that would crawl Internet websites by specific rules we set (for example, country domains), gather URLs, text, keywords, and website names, and create our own internal catalogue so we wouldn't be limited to any...
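
A rough sketch of the kind of rule-driven crawler being asked for, in Java with the jsoup library (jsoup, the seed URL, and the ".de" country-domain rule are illustrative assumptions, not part of the question):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.*;

    public class CatalogueCrawler {
        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>(List.of("http://example.de/")); // hypothetical seed
            Set<String> seen = new HashSet<>(frontier);
            int budget = 100; // fixed page budget for this sketch

            while (!frontier.isEmpty() && budget-- > 0) {
                String url = frontier.poll();
                Document doc;
                try {
                    doc = Jsoup.connect(url).get();
                } catch (Exception e) {
                    continue; // skip pages that fail to fetch or parse
                }
                // Gather the properties the question lists: URL, site name/title, text.
                System.out.printf("%s | %s | %d chars of text%n",
                        url, doc.title(), doc.text().length());
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    // Example rule: stay inside one country domain.
                    if (next.matches("https?://[^/]*\\.de(/.*)?") && seen.add(next)) {
                        frontier.add(next);
                    }
                }
            }
        }
    }

A production version would also need robots.txt handling, politeness delays, and persistent storage for the catalogue, which is roughly where ready-made crawlers such as Nutch or Heritrix (both mentioned elsewhere on this page) come in.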

Document Library Crawl

I set up a new scope and passed in the URL for a specific document library I created that holds 2 Word documents. For some reason, when I start a full crawl, it does not see the 2 Word documents. The Word documents have metadata, and I've created Managed Properties that map to the crawled properties. I am trying to utilize the Advanced...

Using buttons on web pages. Will Google index their links?

I want to use the look of standard buttons on my page, but I want web crawlers to follow them as if they were links. Will Google and other web crawlers index a web page that has links that look like this? <form method="get" action="/mylink.html"><input style="font-size:10pt" id="my-link" type="submit" value="Learn More..." /></form> ...
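
As a rule of thumb, crawlers reliably follow plain <a href> links, while form submissions (even GET forms) are at best hit-or-miss, so the safer pattern is a real link styled to look like a button; a sketch (the inline CSS is just one way to fake the button look):

    <a href="/mylink.html" id="my-link"
       style="display:inline-block; font-size:10pt; padding:2px 8px;
              border:2px outset #ddd; background:#eee;
              color:inherit; text-decoration:none;">Learn More...</a>

This keeps /mylink.html as an ordinary crawlable URL while the styling (better kept in a CSS class) reproduces the button appearance.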

How to make Nutch crawl a file system?

Not over HTTP (like http://localhost:81 and so on), but directly crawling a certain directory on the local file system. Is there any way to do this? ...
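
A commonly cited recipe for this (hedged; exact plugin lists vary by Nutch version) is to enable the protocol-file plugin in conf/nutch-site.xml and stop the URL filters from discarding file: URLs:

    <!-- conf/nutch-site.xml: use protocol-file instead of protocol-http -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

    # conf/crawl-urlfilter.txt: the default first rule drops file: URLs,
    # so replace  -^(file|ftp|mailto):  with something like
    -^(ftp|mailto):
    +^file://

The seed list then contains directory URLs such as file:///home/user/docs/ (a hypothetical path).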

ruby + save web page

Saving the HTML of a web page using Ruby is very easy. One way to do it is with rio: require 'rubygems' require 'rio' rio('http://www.google.com') > rio('google.html') It should be possible to do the same by parsing the HTML, requesting the various images, JS, and CSS files in turn, and then saving each of them. I think it is not very efficient...
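
The approach it describes, sketched in Java with jsoup instead of Ruby (jsoup, the flat file layout, and the asset selector are assumptions): save the page, then request every image/JS/CSS URL it references and save those too.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.io.IOException;
    import java.nio.file.*;

    public class SavePage {
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("http://www.google.com/").get();
            Files.writeString(Path.of("google.html"), doc.outerHtml());

            // img[src], link[href] (CSS) and script[src] cover the assets mentioned.
            for (Element el : doc.select("img[src], link[href], script[src]")) {
                String url = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
                if (url.isEmpty()) continue;
                byte[] body = Jsoup.connect(url)
                        .ignoreContentType(true) // assets are not HTML
                        .execute()
                        .bodyAsBytes();
                String name = url.substring(url.lastIndexOf('/') + 1);
                if (name.isEmpty()) name = "index";
                // Flat layout for brevity; a real mirror would recreate paths
                // and rewrite the references inside google.html.
                Files.write(Path.of(name), body);
            }
        }
    }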

How to get last crawl time of document in Sharepoint 2007?

How can I get the last crawl time of a document in SharePoint 2007? I want to know which table holds this information. ...
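
For what it's worth, crawl-level history in SharePoint 2007 lives in the Shared Services Provider's search database; a hedged sketch (table and column names from memory and worth verifying, and Microsoft does not support querying these databases directly):

    -- run against the SSP search database, not a content database
    SELECT CrawlId, StartTime, EndTime
    FROM MSSCrawlHistory
    ORDER BY StartTime DESC;

This gives per-crawl times; a per-document timestamp would have to come from the URL-level crawl log tables, or more safely from the crawl log in the SSP admin UI.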

How can I crawl PDF files served on the internet using Nutch 1.0 over the HTTP protocol?

Hi everyone, I want to know how I can crawl PDF files that are served on the internet using Nutch 1.0 over the HTTP protocol. I am able to do it on local file systems using the file:// protocol, but not over HTTP. ...
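
Two Nutch 1.0 settings commonly cause exactly this symptom (hedged, version-specific): the parse-pdf plugin is not in plugin.includes by default, and http.content.limit truncates large downloads so PDF parsing fails. In conf/nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <property>
      <name>http.content.limit</name>
      <!-- -1 = do not truncate fetched content, so whole PDFs are parsed -->
      <value>-1</value>
    </property>

It is also worth checking that crawl-urlfilter.txt / regex-urlfilter.txt does not have a suffix rule excluding .pdf.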

solr + Heritrix

How is it possible to integrate Solr with Heritrix? I want to archive a site using Heritrix and then index and search the archived file locally using Solr. Thanks ...
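
There is no turnkey bridge that I know of; the usual route is to read the ARC/WARC files Heritrix writes and post each record to Solr with SolrJ. A rough sketch (the ARC-reading classes ship with Heritrix, the "id"/"content" field names are assumptions about your Solr schema, and older SolrJ versions call the client CommonsHttpSolrServer):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.archive.io.ArchiveRecord;
    import org.archive.io.arc.ARCReader;
    import org.archive.io.arc.ARCReaderFactory;
    import java.io.File;

    public class ArcToSolr {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
            ARCReader reader = ARCReaderFactory.get(new File("crawl.arc.gz")); // hypothetical path

            for (ArchiveRecord rec : reader) {
                byte[] body = new byte[(int) rec.getHeader().getLength()];
                rec.read(body); // a robust version would loop until the record is drained
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rec.getHeader().getUrl());
                doc.addField("content", new String(body, "UTF-8"));
                solr.add(doc);
            }
            solr.commit();
            reader.close();
        }
    }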

configuring nutch regex-normalize.xml

I am using the Java-based Nutch web-search software. To prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (i.e. normalize away) the 'jsessionid' expressions from the URLs being indexed when running the Nutch crawler over my intranet. However, my modifications to $NUTCH_HOME/...
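
For comparison, a jsessionid rule in conf/regex-normalize.xml normally looks like the snippet below; it only takes effect if urlnormalizer-regex is listed in plugin.includes, which is worth double-checking before debugging the regex itself:

    <regex>
      <!-- strip ;jsessionid=... session tokens out of URL paths -->
      <pattern>(?i);jsessionid=[a-z0-9]+</pattern>
      <substitution></substitution>
    </regex>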

crawling folder content in SharePoint

Hi everyone, I have a SharePoint site that gives me trouble with search. It crawls everything in the site, but when I search for a document and select one of the results after the search completes, instead of bringing up the document it brings me back to the search page that I created. When I check the crawl lo...

Crawling and parsing JavaScript elements

Hello, I am trying to get info from a website which uses JavaScript to show the phone number of the items/companies on click. Crawling it with PHP curl or XPath, I could not find a way to trigger these events and then keep on crawling. Example: <a onclick="show(2423,'../entries.php?eid=2423',1); for info, here is the function too ...
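
Since a plain HTTP crawler never executes the onclick handler, the usual trick is to pull the target URL out of the onclick attribute and request it directly. A sketch in Java with jsoup (the entries.php pattern comes from the question; the page URL and everything else is assumed):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OnclickCrawler {
        // matches the URL argument in show(2423,'../entries.php?eid=2423',1)
        private static final Pattern ONCLICK_URL =
                Pattern.compile("'([^']*entries\\.php[^']*)'");

        public static void main(String[] args) throws Exception {
            String base = "http://example.com/dir/list.html"; // hypothetical listing page
            Document doc = Jsoup.connect(base).get();
            for (Element a : doc.select("a[onclick]")) {
                Matcher m = ONCLICK_URL.matcher(a.attr("onclick"));
                if (m.find()) {
                    // resolve ../entries.php?eid=2423 against the page URL, then fetch it
                    String target = new URL(new URL(base), m.group(1)).toString();
                    System.out.println(target + " -> " + Jsoup.connect(target).get().text());
                }
            }
        }
    }

The alternative, when the handler builds data dynamically instead of loading a URL, is a JavaScript-capable browser driver, but that is a much heavier setup.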

Getting SharePoint crawl history

I have an application that uses the Microsoft.Office.Server.Search.Administration.CrawlHistory class to read crawl history information once a day and save it to a database where we can generate reports and statistics. For some reason, though, this class will not return data for crawls that started on the current date; it will only retur...

wget mirror all subdomains

I am mirroring a website, starting my crawl from a particular subdomain (e.g. a.foo.com). How can I make wget also download content from other linked subdomains (e.g. b.foo.com) but not from external domains (e.g. google.com)? I assumed this would work: wget --mirror --domains="foo.com" a.foo.com However, links to b.foo.com were not followed. ...
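
If memory serves, --domains is only consulted once host-spanning is enabled, and --mirror alone never leaves the start host; adding -H/--span-hosts should make wget follow the b.foo.com links while --domains still fences the crawl into foo.com:

    wget --mirror --span-hosts --domains=foo.com a.foo.com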

scrape google codeSEARCH

Q: Advice on programming tools/scripts to automate the extraction of all project files from a Google Code Search result? NOTE: The question is specifically about Code Search, http://www.google.com/codesearch, and NOT "Google Code", which already has repository access. Motivation: An open source project's official site has long gone without...

Is there a way to crawl all Facebook fan pages?

Is there a way to crawl all Facebook fan pages and collect some information? Like, for example, crawling Facebook fan pages and saving their names, how many fans they have, etc.? Or at least, do you have a hint as to how this could possibly be done? ...
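
Rather than scraping the HTML, public page metadata is exposed as JSON by the Graph API; a minimal Java sketch (whether fields such as the name and fan count are returned without an access token depends on the API version, so treat the details as assumptions):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FanPageInfo {
        public static void main(String[] args) throws Exception {
            // e.g. https://graph.facebook.com/cocacola returns a JSON object
            // describing the page (name, fan count, ...).
            URL url = new URL("https://graph.facebook.com/" + args[0]);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line); // hand off to any JSON parser from here
                }
            }
        }
    }

Enumerating all fan pages is the harder part; absent a public directory endpoint, most approaches start from a seed list of page names and expand from links or search results.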

Crawl Oracle Portal from SharePoint

We have a customer who has both SharePoint 2007 SP2 and Oracle Portal 10.1.4.2.0. They would like to search the content of the Oracle Portal from SharePoint. Is this a supported configuration? We have tried the various authentication methods. The only one we have been able to get to work is cookie authentication, sending ...

crawl websites from a Java web application without using bin/nutch

Hi :) I am trying to use Nutch (1.1) without bin/nutch from my (Java) Mojarra 2.0.2 webapp... I have been searching Google for examples, but there are none showing how to do this :/ ... I get an exception and the job fails :/ (I think it's probably something to do with Hadoop)... Here is my code: public void run() throws Exception { ...
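
For what it's worth, bin/nutch crawl is a thin wrapper around org.apache.nutch.crawl.Crawl, so one way (hedged, Nutch 1.1-era API) is to invoke that class in-process, making sure the conf directory with nutch-default.xml/nutch-site.xml is on the webapp's classpath so Hadoop can assemble its configuration, which is the usual source of the exception described:

    import org.apache.nutch.crawl.Crawl;

    public class RunCrawl {
        public void run() throws Exception {
            // same arguments bin/nutch crawl takes: seed dir, output dir, depth, topN
            // ("urls" and "crawl" are hypothetical paths in the webapp's working dir)
            Crawl.main(new String[] { "urls", "-dir", "crawl", "-depth", "3", "-topN", "50" });
        }
    }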

Crawling MOSS 2007 from SharePoint 2010

Hi SharePoint experts, I got the following warning after I crawled a site. Info from the crawl log: "The content for this address was excluded by the crawler because this item was marked with a no-index meta-tag. To index this item, remove the meta-tag and recrawl." Sounds easy, but I don't know where I can do that. In crawl rules I have ...

Google crawling and indexing algorithms

I am looking for some documents on how Google crawls and indexes content. I have read many "light" papers and articles on what you need to do to improve your ranking and make sure your content is properly indexed, but I am looking for more advanced technical documents on how Google crawls and indexes content. The things I would like to know mo...

How do I exclude everything but text/html from a Heritrix crawl?

On the Heritrix use cases page there is a use case for "Only Store Successful HTML Pages". My problem: I don't know how to implement it in my cxml file. Specifically: adding the ContentTypeRegExpFilter to the ARCWriterProcessor => setting its regexp setting to text/html.*. ... There is no ContentTypeRegExpFilter in the sample cxml files. ...
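
The names in that use case are from Heritrix 1.x, which is why they do not appear in a Heritrix 3 cxml file. The rough H3 equivalent (hedged; verify the bean and class names against your Heritrix version) is to give the writer processor a shouldProcessRule that rejects everything and accepts back only text/html responses:

    <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
      <property name="shouldProcessRule">
        <bean class="org.archive.modules.deciderules.DecideRuleSequence">
          <property name="rules">
            <list>
              <!-- reject everything, then accept back only text/html -->
              <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
              <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="ACCEPT"/>
                <property name="regex" value="text/html.*"/>
              </bean>
            </list>
          </property>
        </bean>
      </property>
    </bean>

The "successful" half can be layered on the same way with a fetch-status decide rule accepting 2xx codes, if your build provides one.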