views:

360

answers:

4

Hi,

I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....

Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........

FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".

So my question is...

How should i do this efficiently?
How should i maintain all those ArrayList since i can have 100 of such...? or should i use another data structure instead of ArrayList?

A: 

I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.

Roberto Aloi
-1 Hashtable is obsolete since Java 1.2, and has absolutely nothing to do with trees.
Michael Borgwardt
I didn't work with Java in the past year. I just checked the doc:http://java.sun.com/j2se/1.4.2/docs/api/java/util/Hashtable.htmlNo mention about deprecation or obsoleteness there. Doc was from 1.4.2. I saw an advice to use HashMap in Java 1.6. Trees were not directly related to hash tables, but represented an alternative. For example, look at: http://stackoverflow.com/questions/823744/ternary-tree-vs-hash-table
Roberto Aloi
+1  A: 

If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.

Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").

If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.

Aaron Digulla
Lucene is an elephant, believe me. But indeed worth to try out.
Adeel Ansari
Thanks ... but the description is can be quite large i.e. more than 1000 lines, so can store it in simple array?
Prajakta B.
What's the problem with using a string?
Thomas Jung
The array will need more memory and it will be slower to search, so why would you want to use one?
Aaron Digulla
+1  A: 

There is no simple, easy-to-use data structure that supports efficient fulltext search.

But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).

If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.

As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.

Michael Borgwardt
This is a strange approach. Try brute force. If this fails use this gigantic lib (Lucene). There are one or two solutions in the middle.
Thomas Jung
Cite some that aren't either a lot of work, or of limited usefulness (mapping words to text offsets fails on composite words).
Michael Borgwardt
One simplistic solution can be: tokenize the description, for(token: tokenize(descr)) map.put(token, item). This will burn some memory but can be a valid solution. Depending on the constraints.
Thomas Jung
As I wrote: fails on composite words. Most users won't be happy with that restriction.
Michael Borgwardt
One can tokenize as needed. If you start to build your own stemming it's time to use Lucene. But the problem was about finding direct hits (simple tokenizing, no stemming).
Thomas Jung
+2  A: 

A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.

Building:

for(String token: tokenize(description)) map.get(token).add(item)

(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)

Use:

List<Item> result = map.get("Computer")

The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).

The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).

Thomas Jung
You are saying nothing new. See my post.
Adeel Ansari
I did not understand what you're trying to say: tag != token. If it's called a tag it should be a tag (http://en.wikipedia.org/wiki/Tag_%28metadata%29). Tagging sounds like a user making a decission which items and tags are linked.
Thomas Jung
In your case a list of items/images will be mapped to a token/key. In my case, its the same, but I say it a tag instead of token. As we, over here at SO, tag our posts and a single tag can hold a list of posts. So, the term I used is clearly valid, I believe.
Adeel Ansari
May be I didn't express my idea well. Or say I chose the wrong word. Whatever, but the fact is I said the same. So, +1 for thinking like me :).
Adeel Ansari
As you can see with SO the set of tags is not the tokenized description but metadata assigned by the user. These are two totally different concepts.
Thomas Jung
Yes, the description you might not received from the user, but you tokenize your description, and then to retrieve the image, you completely follow the tag idea.
Adeel Ansari
And thats the reason I mentioned, something like Tags, not Tags in the exact meaning.
Adeel Ansari
I don't thinking tag must be from the user, you can tag things yourself too, right?
Adeel Ansari
Tags are a way to create something similar to ontologies. The automatic creation of ontologies is more of a research issue (http://www.aifb.uni-karlsruhe.de/WBS/ywa/publications/wang06KASO_AAAIFS.pdf). This is not something lightly done.
Thomas Jung
Yeah I know, its just tagging lightly done. Okay, I used the wrong word, but the words are not important anymore, as long as we grabbed the meaning. And the wrong word might be the reason no one understood what I was trying to say. Thanks for enlightening me. Cheers.
Adeel Ansari