I have 20,000-50,000 entries in an Excel file. One column contains the name of a company. Ideally, I would like to search for each company name and take the URL of the first result. I am aware that Google (which is my ideal choice) provides an AJAX Search API. However, it also has a limit of 1000 searches per registrant. Is there a way to run over 20,000 searches without creating 20 Google accounts, or is there an alternative search engine I could use?

Any alternative ways of approaching this problem are also welcome (i.e. WhoIs look-ups).

+2  A: 

The Google AJAX Search API has no 1000-search limit; Yahoo Search does. Google AJAX Search limits you to 64 results per search, but otherwise has no limit.

From Google AJAX Search API - Class Reference:

Note: The maximum number of results pages is based on the type of searcher. Local search supports 4 pages (or a maximum of 32 total results) and the other searchers (Blog, Book, Image, News, Patent, Video, and Web) support 8 pages (for a maximum total of 64 results).

cletus
Ah, I need to be more thorough! I was looking at the SOAP Search API FAQ, not the AJAX one. Sorry about that.
Brian
+1  A: 

Approaches that avoid using an external search service ...

Approach 1 - load the contents of the XML file into a database and search it using SQL/JDBC. Variations on this use Hibernate, etc.

Approach 2 - read the XML file into an in-memory Java collection and do the searching programmatically. This will use some memory, depending on how much information is in the XML file, but you only need to figure out how to parse / load the XML and access the collection.
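As a rough illustration of Approach 2, here is a minimal sketch of an in-memory lookup. It assumes the entries have already been parsed out of the file into name/URL pairs; the class name `CompanyIndex`, its methods, and the example URL are hypothetical, and the parsing step itself is not shown.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory index: normalized company name -> URL.
// Assumes the name/URL pairs were already parsed from the source file.
class CompanyIndex {
    private final Map<String, String> urlByName = new HashMap<>();

    // Normalize so lookups ignore case and surrounding whitespace.
    static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Store one entry; a later entry with the same normalized name overwrites it.
    void add(String name, String url) {
        urlByName.put(normalize(name), url);
    }

    // Look up a company by name; empty if the name is not in the index.
    Optional<String> find(String name) {
        return Optional.ofNullable(urlByName.get(normalize(name)));
    }

    public static void main(String[] args) {
        CompanyIndex index = new CompanyIndex();
        index.add("Example Corp", "http://www.example.com/");
        System.out.println(index.find("  example corp ").orElse("(not found)"));
    }
}
```

A `HashMap` keeps each lookup at roughly constant time, so tens of thousands of entries should fit comfortably in memory on a desktop machine.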

However, it would help if you explained the context in which you are trying to do this. Is it a browser plugin? The client side of a web app? The server side? A desktop application?

Stephen C
Well, I would prefer to do it as a one-time run in a Java desktop application. I could run it as a PHP script on a server, but I don't want to tie up that site while it's running (which will take quite a while).
Brian
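A one-time desktop run like the one described above could be sketched roughly as follows. This is only an outline under assumptions: `BatchLookup`, `run`, and the `firstResult` function are hypothetical names, and `firstResult` stands in for whichever search call ends up being used (no real search API is called here). The delay between lookups is an assumed courtesy throttle, not a documented requirement of any service.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class BatchLookup {
    // Looks up each distinct company name once and records the first-result URL.
    // firstResult is a placeholder for the actual search call (an assumption).
    static Map<String, String> run(List<String> names,
                                   Function<String, Optional<String>> firstResult,
                                   long delayMillis) {
        Map<String, String> urls = new LinkedHashMap<>(); // keeps input order
        for (String name : names) {
            if (urls.containsKey(name)) {
                continue; // skip duplicate names to save searches
            }
            // An empty string marks "no result found".
            urls.put(name, firstResult.apply(name).orElse(""));
            try {
                Thread.sleep(delayMillis); // pause between lookups
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop early, keeping what was collected so far
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stub search function used only for demonstration.
        Function<String, Optional<String>> stub = name ->
                name.isEmpty()
                        ? Optional.<String>empty()
                        : Optional.of("http://example.com/?q=" + name.replace(' ', '+'));
        run(Arrays.asList("Acme Widgets", "Example Corp"), stub, 100)
                .forEach((name, url) -> System.out.println(name + "\t" + url));
    }
}
```

Keeping the search call behind a `Function` means the loop stays the same if the search engine changes, and skipping duplicate names reduces the number of searches needed.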