views:

526

answers:

9

Hello,

Let me explain the problem:

  1. Let's say I have a library, the library contains many books, each book contains chapters, each chapter contains string (and the string begins and ends with dot ".").
  2. The sequence again, library -> book -> chapter -> string.
  3. I extracted the strings from books, let's call them "books strings".
  4. I have a system, where a user can enter a string in the search form, and the system should return the exact match of the entered string from "books strings". If the entered string doesn't match any string from books strings nothing will be returned.

I think about it and found a solution, I'll MD5 all books strings and save the hashed books strings. When a user enters a string to search, I'll hash it too and search for a match in the hashed books strings. It's cheaper (32 or 64 char for each string), faster than plain searching and it only returns the exact match(es).

Do have any comments, ideas, better solutions?

P.S. What is the name of such an algorithm? searching or matching?

A: 

This is called hashing. Your method may work but it is not very flexible. Again, you will only retrieve exact matches. It is also possible that two preimages share the same image (two different strings hash to the same value) but it is extremely unlikely so that isn't a real issue. I would recommend against it because of the lack of flexibility but if that doesn't bother you then I guess this will work for you. This essentially the same technique people use for storing and validate passwords (except, you, obviously, are not using any "salt" values).

BobbyShaftoe
"but it is extremely unlikely so that isn't a real issue". I hope you are joking.
Akusete
It's entirely possible that 2 strings hash to 1 value. A 32-bit hash only has 4 billion distinct values. ALL strings must hash to one of those 4 billion values. There are more than 4 billion possible strings.
S.Lott
Yes, I understand how it works. And I wasn't joking. I guess I made the assumption that he doesn't have that many strings and that they are all relatively short. This seemed reasonable since a user isn't going to search for a whole paragraph at once.
BobbyShaftoe
Obviously, if that assumption is wrong then don't do it. In fact, probably you shouldn't do it anyway.
BobbyShaftoe
+5  A: 

That's not bad, but you should investigate Lucene. it's a public shareware Text Indexing and search tool implemented in numerous languages, one of which is .Net.. (what platfporm/language are you working in?) I used it for free-text searching of web site content on a public internet whose primary model was providing content in a paritcula market segment (numerous magazine articles, book excerpts etc.) Lucene worked very well for us.

Lucene

Charles Bretana
I've read about Lucene, but I want only exact matching, so should I use it or it doesn't worth it?I'm using PHP and there is Lucene port in Zend framework
Khaled Al Hourani
@Khaled, I don't know about PHP port, but I'm fairly sure you can do exact macthing in .Net Lucene. (It's been three years since I used it - so my memory is blurry)
Charles Bretana
+1  A: 

You might want to use a Trie or other tree-based data structure for storing your string data.

A trie can also be used to replace a hash table, over which it has the following advantages:

  • Looking up data in a trie is faster in the worst case, O(m) time, compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.

  • There are no collisions of different keys in a trie.

  • Buckets in a trie which are analogous to hash table buckets that store key collisions are only necessary if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

  • Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random access time is high compared to main memory.
  • It is not easy to represent all keys as strings, such as floating point numbers, which can have multiple string representations for the same floating point number, e.g. 1, 1.0, 1.00, +1.0, etc.
  • Tries are frequently less space-efficient than hash tables.

(see http://en.wikipedia.org/wiki/Trie)

0xA3
The data is so huge, it's a library so can Trie handle this or hashing is better here?
Khaled Al Hourani
How long is a typical string? With many short strings a trie might be more space-efficient than a hash because you don't need to store the hash separately.
0xA3
between 60 and 500 chars. as usual strings you find in books. I think it's not efficient to use Trie here?
Khaled Al Hourani
+1  A: 

You should instead convert every book chapter to a suffix tree. A suffix tree is a type of Trie (mentioned by divo).

Suffix tree's are specifically intended for use in fast text searches. One advantage of a Suffix tree is that searching for a string of length n is O(n) time. This is just as good (asymptotically) as your algorithm idea (since hashing a string takes O(n) time), but far more flexible, since it will work even for partial sentences. It reduces to sentence search if you start/end your searches with a period.

Clarification: More precisely, you will have one suffix tree for everything.

Brian
+1  A: 

There are many algorithms for searching in strings, ranging from simple methods like the Boyer-Moore algorithm to complex data structures like suffix trees. A full presentation of these can be found in:

  • Gusfield, Dan (1999), Algorithms on Strings, Sequences and Trees. Cambridge: University Press.

For your case, however, it probably makes more sense to split the book text into individual tokens (the words), and store these in an index (e.g. simply in a Map, or using a full framework for indexing and searching like Lucene).

Fabian Steeg
Thanks for mentioning this book, I'll take a look at it's TOC
Khaled Al Hourani
A: 

First, it really sounds like what you should be using is a database -- this is pretty much exactly what databases are for. (If you want this embedded within your own application, check out SQLite, a lightweight DBMS designed to be used as an embedded library.)

Second, it's not quite true that your hash solution will only return exact matches... Since an MD5 digest is 128 bits, that means that any given pair of strings has a 1-in-2^128 chance of producing the same hash value. Yeah, it's a small number, but if you've got a lot of books, you'll have a lot of pairs of strings. So, once you've compared hash values, you'll need to do a full-text comparison to eliminate false positives.

Jeff Shannon
you are right about MD5, but I can use 64 or 128 hash function to make it better. I'll do the hash and store them in DB too, so I'm ak=lready going to use DB
Khaled Al Hourani
A: 

A Trie is the best approach. This is what is also referred to as a suffix map. The advantage you have of using a Trie over your idea of hashing is that with a trie you can display an auto-complete type syntax very easily. Time for finding a word is O(n), where n is the length of the word. At each node in your Trie you will need to store a list of books that contain the particular word.

+1  A: 

It's called hashing, and can be thought of as either Searching or Matching.

You should verify that your MD5 hash is correct, by also comparing the string that was used to generate the hash, so you don't have any false positives.

Another thing to consider is that it may be beneficial to do support some sort of begins with search. Consider

Mary Queen of Scots
Mary Livingston
Mary Had a Little Lamb, and other silly stories

A begins with search for Mary, ought to return those three records and probably more. Although an MD5 sort of hash is fast, the techniques presented in the other answers should be considered as well to find the best benefit/cost balance for your circumstance.

EvilTeach
A: 

I agree with a Trie - with one addition, use a soundx algorithm to convert the string for the trie id/node - so misspellings are taken into consideration

meade