+2  Q: 

MOSS 2007 Crawl

I'm trying to get crawl to work on two separate farms, but I can't get it to work on either one. They both have two WFEs with an additional WFE configured as an Index server. There is one more server dedicated to Query and two clustered SQL 2005 back-end servers for the database. I have unsuccessfully tried solutions from at least 50 different websites that I found through a search engine.

I have configured (extended) my Web App to use http://servername:12345 as the default zone and http://abc.companyname.com as the custom and intranet zones. When I enter each of those into the content source and then try to run a crawl, I get a couple of errors in the crawl log:

http://servername:12345 returns:
"Could not connect to the server. Please make sure the site is accessible."

http://abc.companyname.com returns:
"Deleted by the gatherer. (The start address or content source that contained this item was deleted and hence this item was deleted.)"

However, I can click both URLs and the page is accessible.

Any ideas?

A: 

Need more details, as those error messages can be thrown by many different causes.
For instance, have a look to see whether all the servers are at the same build level.
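A quick way to compare them (a rough sketch, assuming the standard 12-hive install path) is to check the version column under Central Administration > Operations > Servers in Farm, or to run the following from a command prompt on each server and compare the output:

    cd /d "%CommonProgramFiles%\Microsoft Shared\web server extensions\12\BIN"
    rem Reports any components on the local server that are still awaiting upgrade,
    rem which is where mismatched build/patch levels usually show up.
    stsadm.exe -o localupgradestatus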

Aidenn
A: 

All four servers were built at the same time, first with MOSS 2007 Enterprise, then WSS SP1, MOSS SP1, then WSS KB941422 installed in that order on both farms. We have no other hotfixes or patches installed at this time.

RJ Russell
A: 

In the Services on Server section check the properties for the search crawl account to make sure it is set up, and that it has permissions to access those sites.
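One rough way to sanity-check that from the index server (the account name here is a placeholder for whatever your default content access account actually is):

    rem Opens a command prompt running as the crawl account (you will be prompted for its password).
    rem From that window, launch Internet Explorer and browse to the content source URLs
    rem to confirm the account can authenticate and reach them.
    runas /user:DOMAIN\svc_mossCrawl cmd.exe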

Daniel O
A: 

That sounded like a great idea, so I added our search service account, which is also our Default content access account, to the main portal and the subsite:

http://portal.com

and

http://portal.com/sites/subsite

I then separately tried to crawl each of the above and received the same error on both:

"The crawler could not communicate with the server. Check that the server is available and that the firewall is configured correctly."

Note that we do not go through a proxy or firewall to access our portals.

Another possibly helpful detail: I can successfully crawl a file share when using my own account as the default content access account.

Any ideas?

Thanks!

RJ Russell
A: 

Rebuilt the entire farm. Exact same error. :(

"The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly."

RJ Russell
+1  A: 

I'm a little confused about your farm topology. A machine installed as just a WFE cannot be an indexer. A machine installed as "complete" can be an indexer, a query server, and/or a WFE...

Also, instead of changing the default content access account, you may want to add a crawl rule (once everything is up and running).

Can you see if anything helpful is in %CommonProgramFiles%\Microsoft Shared\web server extensions\12\LOGS on your indexer?

The log file may be a bit verbose; searching for "started" or "full" will usually get you to the line in the log where your crawl started.
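For example (assuming the default log location), something like this will pull out the relevant lines instead of scrolling through the whole file:

    cd /d "%CommonProgramFiles%\Microsoft Shared\web server extensions\12\LOGS"
    rem Case-insensitive search of all ULS logs for lines containing "started" or "full"
    findstr /i /c:"started" /c:"full" *.log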

Also, on your SQL machine, you may be able to get more information from the MSScrawlurlhistory table.
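If you have the SQL client tools handy, a query along these lines would show recent crawl attempts; the server, database, and table names below are placeholders based on common defaults, so adjust them to your SSP's search database:

    rem -S server, -E Windows authentication, -d database, -Q query to run
    sqlcmd -S SQLCLUSTER -E -d SharedServices1_Search_DB -Q "SELECT TOP 20 * FROM dbo.MSSCrawlUrlHistory"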

RedDeckWins
A: 

More good info, thanks!

All five of the MOSS servers were installed as complete. Then there is a pair of clustered servers for the SQL back end. The first two servers are WFEs configured with NLB. The first three servers were all intended to host the Query role. The fourth server has Central Admin and the fifth has the Indexer role. All were installed as complete and I believe all have the Web App service started.

Crawl rules would work, but they shouldn't be necessary to get a successful crawl. Like you said, maybe after a successful crawl has been run we can look into that.

The log file is verbose and seems to make little sense. I see propagation errors. Other than that, a search for "started" only shows me Portal Analytics starting and a search for "full" only shows me "Successfully synchronized instance of Excel Services in SSP."

I couldn't find the MSScrawlurlhistory table, but the MSSCrawlHistory and the MSSCrawlURL tables are both empty in the Search SSP DB.

RJ Russell
A: 

More info:

I wiped the slate clean, so to speak, and ran another crawl to provide an updated sample.

My content sources are as such:

http://servername:33333
http://sharepoint.portal.fake.com
sps3://servername:33333

My current crawl log errors are:

sps3://servername:33333
Error in PortalCrawl Web Service.

http://servername:33333/mysites
Content for this URL is excluded by the server because a no-index attribute.

http://servername:33333/mysites
Crawled

sts3://servername:33333/contentdbid={62a647a...
Crawled

sts3://servername:33333
Crawled

http://servername:33333
Crawled

http://sharepoint.portal.fake.com
The Crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly.

I double checked for typos above and I don't see any so this should be an accurate reflection.

RJ Russell
A: 

I "Reset Crawled Content" and changed the content source from http://sharepoint.portal.fake.com to http://servername5:11111 and ran the crawl again. When I created the extended web app I used a URL of http://servername4:11111 but when I went to find it in IIS it seems to only be installed on servername5, which is the Indexer. So when I used it as a content source I used it based on its current location of http://servername5:11111 and ran the crawl again. (By the way, I am unable to browse this site in IIS on servername5.)

The results are below:

http://servername5:11111
Access is denied. Check that the Default Content Access Account has access to this content, or add a crawl rule to crawl this content. (The item was deleted because it was either not found or the crawler was denied access to it.)

http://servername:33333/mysites
Content for this URL is excluded by the server because a no-index attribute.

http://servername:33333/mysites
Crawled

sts3://servername:33333/contentdbid={62a647a...
Crawled

sts3://servername:33333
Crawled

http://servername:33333
Crawled

sps3://servername:33333
Error in PortalCrawl Web Service.

RJ Russell
A: 

Another note on topology: on servername2, which is part of the NLB cluster and has the Query server role, the Windows SharePoint Services Search service will not start in Central Admin; it is stuck on Starting. I'm not even sure that service needs to be started for the machine to act as a Query server, since the Office SharePoint Server Search service is started and functioning, but I thought I'd mention it.

On servername1, which is also part of the NLB cluster and designed to have the Query role, I cannot get either the Windows SharePoint Services Search service or the Office SharePoint Server Search service to start. It returns an error:

"An unhandled exception occurred in the user interface. Exception Information: Unable to connect to the remote server."

This occurred before and after the rebuild.

I don't know if these are related or not, but they all are part of the larger buggy SharePoint problem that has me pulling my hair out.

RJ Russell
A: 

I also found this, but it doesn't seem to apply:

You cannot crawl case-sensitive Web content in SharePoint Server 2007

http://support.microsoft.com/kb/932619

RJ Russell
+1  A: 

Can you create a content source for http://www.cnn.com and start a full crawl? Do you get the same error(s)?

Also, we may want to take this offline, let me know if you want to do that.

I'm not sure if there is a way to send private messages via Stack Overflow, though.

RedDeckWins
A: 

I tried a couple of sites and received the error:
"Could not connect to the server. Please make sure the site is accessible."

The sites are accessible without using any user credentials.

I'm willing to do whatever it takes to get this working. How do I contact you offline?

RJ Russell
dustinaconrad at gmail dot com
RedDeckWins
A: 

Just wanted to point out that I can successfully crawl a file share on \\server3\C$\sharename

This worked before I rebuilt the farm and it still works now after the rebuild.

RJ Russell
+2  A: 

One thing to remember is that crawling SharePoint sites is different from crawling file shares or non-SharePoint websites.

A few other quick pointers:

  • the sps3: protocol is for crawling user profiles for People Search. You can disregard anything the crawler says about it until you're ready for user profiles.
  • your crawl account is supposed to have access to your entire farm. If you see permissions errors, find the KB article that tells you how to reset your crawl account (it's a specific stsadm.exe command). If you're trying to crawl another farm's content, then you'll have to work something else out to grant your crawl account access. I think this is your biggest issue presently.
  • The crawler (running from the index server) will attempt to visit the public URL. I've had inter-server communication issues before; make sure all three servers can ping each other, and make sure the index server can reach the public URL (open IE on the index server and check it out). If you have problems, it's time to dirty up your index server's hosts file (see the sketch after this list). This is something SharePoint does for you anyway, so don't feel too bad doing it. If you've set up anything aside from Integrated Windows Authentication, you'll have to work harder to get your crawler working.
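For the hosts-file workaround in the last point, a minimal sketch (the IP address is a placeholder for a WFE that serves the public URL, and the host name is taken from the examples in this thread):

    rem On the index server, add a line like the following to the hosts file, then re-test:
    rem     10.0.0.11    sharepoint.portal.fake.com
    notepad %SystemRoot%\System32\drivers\etc\hosts
    rem Confirm the index server can now resolve and reach the public URL
    ping sharepoint.portal.fake.com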

Anyway, there's been a lot of back and forth in the responses, so I'm just shotgunning a bunch of suggestions out there; maybe one of them is on target.

A: 

Thanks for the new input!

So I came back from my weekend and wanted to go through your pointers, try every one, report back about how they didn't work, and then post the results I got. A funny thing happened, though.

I went to my Indexer (servername5) and I tried to connect to Central Admin and the main portal from Internet Explorer. Neither worked. So I went into IIS on the Indexer to try to browse to the main portal from within IIS. That didn't work either, and I received an error telling me that something else was using that port. So I saw my old website from the previous build and deleted it from IIS along with the corresponding Application Pool. Then I started the App Pool for the web site from the new build and browsed to the website. Success. Then I browsed to the website from the browser on my own PC. Success again. Then I ran a crawl by the full URL, not the servername, like so:

http://sharepoint.portal.fake.com

Success again. It crawled the entire portal including the subsites just like I wanted. The "Items in index" populated quickly and I could tell I was rolling.

I still cannot access the Central Admin site hosted on servername4 from servername5. I'm not sure why not but I don't know that it matters much at this point.

Where does this leave me? What was the fix?

I'm still not sure. Maybe it was the rebuild. Maybe as soon as I rebuilt the server farm I had everything I needed to get it to work, but it just wouldn't work because of the previous website still in IIS. (It's funny how sloppy a SharePoint un-install can be. Manual deletion of content databases, web sites, and application pools seems necessary, and that probably shouldn't be the case.)

In any event, it's working now on my "test" farm so the key is to get it working on the production farm. I'm hopeful that it won't be so difficult after this experience.

Thanks for the help from everyone!

RJ Russell
+1  A: 

It sounds like most of your issues are related to Kerberos. If you don't have the Infrastructure Update applied, SharePoint will not be able to use Kerberos auth for web sites on non-default ports (anything other than 80/443). That's also why, I would bet, you cannot access CA from server 5 when it's on server 4. If you don't have the SPNs set up correctly, then CA will only be accessible from the machine it is installed on. If you had installed SharePoint using port 80 as the default URL, you'd be able to do the local SharePoint sites crawl without any hitches; by design, the local SharePoint sites crawl uses the default URL to access the SharePoint sites. Check out http://codefrob.spaces.live.com/blog/cns!7C69E7B2271B08F6!363.entry for a little more detail on how to get Kerberos and SharePoint to work well together.
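For reference, SPNs are usually registered with setspn.exe (from the Windows Server 2003 Support Tools). This is only a hedged sketch: the account, host names, and port below are placeholders, and whether the port belongs in the SPN depends on your patch level and how clients build the SPN:

    rem Register HTTP SPNs for the web application's app pool account (placeholders throughout)
    setspn -A HTTP/servername4 DOMAIN\svc_mossAppPool
    setspn -A HTTP/servername4.fake.com DOMAIN\svc_mossAppPool
    rem Some configurations also need the non-default port included:
    setspn -A HTTP/servername4:12345 DOMAIN\svc_mossAppPool
    rem List what is already registered for the account
    setspn -L DOMAIN\svc_mossAppPool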

chris.w.mclean