ansaurus

Question

String matching

Answer 1

A:

This is called hashing. Your method may work but it is not very flexible. Again, you will only retrieve exact matches. It is also possible that two preimages share the same image (two different strings hash to the same value) but it is extremely unlikely so that isn't a real issue. I would recommend against it because of the lack of flexibility but if that doesn't bother you then I guess this will work for you. This essentially the same technique people use for storing and validate passwords (except, you, obviously, are not using any "salt" values).

BobbyShaftoe 2009-01-14 00:46:58

"but it is extremely unlikely so that isn't a real issue". I hope you are joking.

Akusete 2009-01-14 00:49:17

It's entirely possible that 2 strings hash to 1 value. A 32-bit hash only has 4 billion distinct values. ALL strings must hash to one of those 4 billion values. There are more than 4 billion possible strings.

S.Lott 2009-01-14 01:00:30

Yes, I understand how it works. And I wasn't joking. I guess I made the assumption that he doesn't have that many strings and that they are all relatively short. This seemed reasonable since a user isn't going to search for a whole paragraph at once.

BobbyShaftoe 2009-01-14 01:02:04

Obviously, if that assumption is wrong then don't do it. In fact, probably you shouldn't do it anyway.

BobbyShaftoe 2009-01-14 01:02:58

Answer 2

+5 A:

That's not bad, but you should investigate Lucene. it's a public shareware Text Indexing and search tool implemented in numerous languages, one of which is .Net.. (what platfporm/language are you working in?) I used it for free-text searching of web site content on a public internet whose primary model was providing content in a paritcula market segment (numerous magazine articles, book excerpts etc.) Lucene worked very well for us.

Lucene

Charles Bretana 2009-01-14 00:47:10

I've read about Lucene, but I want only exact matching, so should I use it or it doesn't worth it?I'm using PHP and there is Lucene port in Zend framework

Khaled Al Hourani 2009-01-14 01:24:21

@Khaled, I don't know about PHP port, but I'm fairly sure you can do exact macthing in .Net Lucene. (It's been three years since I used it - so my memory is blurry)

Charles Bretana 2009-01-14 01:41:00

Answer 3

+1 A:

You might want to use a Trie or other tree-based data structure for storing your string data.

A trie can also be used to replace a hash table, over which it has the following advantages:

Looking up data in a trie is faster in the worst case, O(m) time, compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.

There are no collisions of different keys in a trie.

Buckets in a trie which are analogous to hash table buckets that store key collisions are only necessary if a single key is associated with more than one value.

There is no need to provide a hash function or to change hash functions as more keys are added to a trie.

A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random access time is high compared to main memory.

It is not easy to represent all keys as strings, such as floating point numbers, which can have multiple string representations for the same floating point number, e.g. 1, 1.0, 1.00, +1.0, etc.

Tries are frequently less space-efficient than hash tables.

(see http://en.wikipedia.org/wiki/Trie)

0xA3 2009-01-14 00:51:28

The data is so huge, it's a library so can Trie handle this or hashing is better here?

Khaled Al Hourani 2009-01-14 01:33:41

How long is a typical string? With many short strings a trie might be more space-efficient than a hash because you don't need to store the hash separately.

0xA3 2009-01-14 09:37:16

between 60 and 500 chars. as usual strings you find in books. I think it's not efficient to use Trie here?

Khaled Al Hourani 2009-01-14 13:28:30

Answer 4

+1 A:

You should instead convert every book chapter to a suffix tree. A suffix tree is a type of Trie (mentioned by divo).

Suffix tree's are specifically intended for use in fast text searches. One advantage of a Suffix tree is that searching for a string of length n is O(n) time. This is just as good (asymptotically) as your algorithm idea (since hashing a string takes O(n) time), but far more flexible, since it will work even for partial sentences. It reduces to sentence search if you start/end your searches with a period.

Clarification: More precisely, you will have one suffix tree for everything.

Brian 2009-01-14 01:04:32

Answer 5

+1 A:

There are many algorithms for searching in strings, ranging from simple methods like the Boyer-Moore algorithm to complex data structures like suffix trees. A full presentation of these can be found in:

Gusfield, Dan (1999), Algorithms on Strings, Sequences and Trees. Cambridge: University Press.

For your case, however, it probably makes more sense to split the book text into individual tokens (the words), and store these in an index (e.g. simply in a Map, or using a full framework for indexing and searching like Lucene).

Fabian Steeg 2009-01-14 01:21:03

Thanks for mentioning this book, I'll take a look at it's TOC

Khaled Al Hourani 2009-01-14 01:35:57

Answer 6

A:

First, it really sounds like what you should be using is a database -- this is pretty much exactly what databases are for. (If you want this embedded within your own application, check out SQLite, a lightweight DBMS designed to be used as an embedded library.)

Second, it's not quite true that your hash solution will only return exact matches... Since an MD5 digest is 128 bits, that means that any given pair of strings has a 1-in-2^128 chance of producing the same hash value. Yeah, it's a small number, but if you've got a lot of books, you'll have a lot of pairs of strings. So, once you've compared hash values, you'll need to do a full-text comparison to eliminate false positives.

Jeff Shannon 2009-01-14 02:14:27

you are right about MD5, but I can use 64 or 128 hash function to make it better. I'll do the hash and store them in DB too, so I'm ak=lready going to use DB

Khaled Al Hourani 2009-01-14 13:32:46

Answer 7

A:

A Trie is the best approach. This is what is also referred to as a suffix map. The advantage you have of using a Trie over your idea of hashing is that with a trie you can display an auto-complete type syntax very easily. Time for finding a word is O(n), where n is the length of the word. At each node in your Trie you will need to store a list of books that contain the particular word.

2009-01-14 02:44:08

Answer 8

+1 A:

It's called hashing, and can be thought of as either Searching or Matching.

You should verify that your MD5 hash is correct, by also comparing the string that was used to generate the hash, so you don't have any false positives.

Another thing to consider is that it may be beneficial to do support some sort of begins with search. Consider

Mary Queen of Scots
Mary Livingston
Mary Had a Little Lamb, and other silly stories

A begins with search for Mary, ought to return those three records and probably more. Although an MD5 sort of hash is fast, the techniques presented in the other answers should be considered as well to find the best benefit/cost balance for your circumstance.

EvilTeach 2009-01-14 04:49:52

Answer 9

A:

I agree with a Trie - with one addition, use a soundx algorithm to convert the string for the trie id/node - so misspellings are taken into consideration

meade 2009-01-14 19:58:27

ansaurus

tags:

views:

answers:

String matching

related questions