ansaurus

Question

Data structure for search engine in JAVA?

Answer 1

A:

I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.

Roberto Aloi 2009-10-13 08:57:57

-1 Hashtable is obsolete since Java 1.2, and has absolutely nothing to do with trees.

Michael Borgwardt 2009-10-13 09:18:11

I didn't work with Java in the past year. I just checked the doc:http://java.sun.com/j2se/1.4.2/docs/api/java/util/Hashtable.htmlNo mention about deprecation or obsoleteness there. Doc was from 1.4.2. I saw an advice to use HashMap in Java 1.6. Trees were not directly related to hash tables, but represented an alternative. For example, look at: http://stackoverflow.com/questions/823744/ternary-tree-vs-hash-table

Roberto Aloi 2009-10-13 09:45:59

Answer 2

+1 A:

If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.

Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").

If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.

Aaron Digulla 2009-10-13 09:01:45

Lucene is an elephant, believe me. But indeed worth to try out.

Adeel Ansari 2009-10-13 09:07:00

Thanks ... but the description is can be quite large i.e. more than 1000 lines, so can store it in simple array?

Prajakta B. 2009-10-13 09:26:51

What's the problem with using a string?

Thomas Jung 2009-10-13 09:35:28

The array will need more memory and it will be slower to search, so why would you want to use one?

Aaron Digulla 2009-10-13 12:11:24

Answer 3

+1 A:

There is no simple, easy-to-use data structure that supports efficient fulltext search.

But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).

If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.

As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.

Michael Borgwardt 2009-10-13 09:05:15

This is a strange approach. Try brute force. If this fails use this gigantic lib (Lucene). There are one or two solutions in the middle.

Thomas Jung 2009-10-13 09:09:47

Cite some that aren't either a lot of work, or of limited usefulness (mapping words to text offsets fails on composite words).

Michael Borgwardt 2009-10-13 09:16:40

One simplistic solution can be: tokenize the description, for(token: tokenize(descr)) map.put(token, item). This will burn some memory but can be a valid solution. Depending on the constraints.

Thomas Jung 2009-10-13 09:20:43

As I wrote: fails on composite words. Most users won't be happy with that restriction.

Michael Borgwardt 2009-10-13 09:27:00

One can tokenize as needed. If you start to build your own stemming it's time to use Lucene. But the problem was about finding direct hits (simple tokenizing, no stemming).

Thomas Jung 2009-10-13 09:33:32

Answer 4

+2 A:

A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.

Building:

for(String token: tokenize(description)) map.get(token).add(item)

(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)

Use:

List<Item> result = map.get("Computer")

The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).

The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).

Thomas Jung 2009-10-13 09:29:53

You are saying nothing new. See my post.

Adeel Ansari 2009-10-13 09:39:40

I did not understand what you're trying to say: tag != token. If it's called a tag it should be a tag (http://en.wikipedia.org/wiki/Tag_%28metadata%29). Tagging sounds like a user making a decission which items and tags are linked.

Thomas Jung 2009-10-13 09:47:31

In your case a list of items/images will be mapped to a token/key. In my case, its the same, but I say it a tag instead of token. As we, over here at SO, tag our posts and a single tag can hold a list of posts. So, the term I used is clearly valid, I believe.

Adeel Ansari 2009-10-13 09:54:23

May be I didn't express my idea well. Or say I chose the wrong word. Whatever, but the fact is I said the same. So, +1 for thinking like me :).

Adeel Ansari 2009-10-13 09:58:00

As you can see with SO the set of tags is not the tokenized description but metadata assigned by the user. These are two totally different concepts.

Thomas Jung 2009-10-13 09:58:17

Yes, the description you might not received from the user, but you tokenize your description, and then to retrieve the image, you completely follow the tag idea.

Adeel Ansari 2009-10-13 10:00:50

And thats the reason I mentioned, something like Tags, not Tags in the exact meaning.

Adeel Ansari 2009-10-13 10:04:10

I don't thinking tag must be from the user, you can tag things yourself too, right?

Adeel Ansari 2009-10-13 10:06:30

Tags are a way to create something similar to ontologies. The automatic creation of ontologies is more of a research issue (http://www.aifb.uni-karlsruhe.de/WBS/ywa/publications/wang06KASO_AAAIFS.pdf). This is not something lightly done.

Thomas Jung 2009-10-13 10:21:45

Yeah I know, its just tagging lightly done. Okay, I used the wrong word, but the words are not important anymore, as long as we grabbed the meaning. And the wrong word might be the reason no one understood what I was trying to say. Thanks for enlightening me. Cheers.

Adeel Ansari 2009-10-13 10:38:04

ansaurus

tags:

views:

answers:

Data structure for search engine in JAVA?

related questions