ansaurus

Question

how to search for a word in a book programmatically?

Answer 1

+2 A:

Years ago thee was a Bible already stored in an Access database that I used to make an application exactly like what you're talking about. The Access DB was a free download. A few years back, I ran across one in XML. I can't do it from work but I would recommend doing a search for Access Bible or XML Bible and see if you can find it. (I think the original Access one may have been called ASP Bible). At any rate, if you can find it, it should give you a good idea of how you can structure your database.

David Stratton 2009-07-29 13:09:37

Answer 2

+1 A:

Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.

Jeremy Cron 2009-07-29 13:09:59

Many MANY books are split up into subdivisions. Chapter and verse was a biblical example.

2009-07-29 13:19:19

(-1) This is not an answer, it is a comment.

2009-07-29 13:20:12

Answer 3

+2 A:

I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.

Craig Stuntz 2009-07-29 13:10:49

Answer 4

A:

def findWord(keyword):
    f = open("book.txt")
    for line in f:  # horribly bad performance for a large block of text
        if line.find(keyword) > -1:
            print line

Substitute each line for a block of text for your specific bible example. How you store the text is really irrelevant. All you're doing is searching some given text (most likely in a loop), for a keyword.

If you want to search line numbers, and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on any field that is relevannt.

FYI - the code above is Python.

Josh Smeaton 2009-07-29 13:10:59

(-1) If you are talking about a book as big as the bible, this algorithm would be practically unusable.

2009-07-29 13:21:03

Answer 5

+3 A:

It kind off depends on the environment you want to run it on, and how many queries you expect per second.

The fastest is to store every word in a hashtable into memory, and the values contain reference to the chapters/verses, or whatever you call it, you want to retrieve.

But this may not scale well if the book is very large, or the client is very thin.

You could store every verse in a database record, and search with full-text-search. But if you need to host the app on a Website, you need to ensure that the hosting costs of the database of your choice does not exceed your budget.

If your application load can handle it, you can also store every verse in a text file (plain text, XML, or any other format), and scan each file, preferably with XPATH or regular expression. A very cheap and easy solution, that you can make as advanced as you like, but probably slower. Then again if you need to service only 1 request per hour, why not?

I would use the database with full-text-search, since that scales the best.

GvS 2009-07-29 13:12:40

Answer 6

A:

Do you expect multiple queries for the same book? i.e. do you want to do per-book preprocessing that may take a lot of time, but has to be done only once per book? Otherwise, the boyer-moore is probably the best way to go. Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hashtable is probably fastest. If you want to look for parts of word, I'd suggest a suffix tree.

When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.

nikie 2009-07-29 13:23:18

Answer 7

A:

You could look into the Boyer-Moore (also, this contains a link to their original paper) algorithm

Unfortunately, the Boyer-Moore algorithm is much faster on longer strings than it is on short 'keyword' searches. So, for keyword searching you might want to implement some sort of crawler that could index likely search terms.

Another troubling consideration is that in most books chapters are contained on only certain pages, whereas with a bible, the chapters and verses could be split across multiple pages, and the pages could contain multiple verses and chapters.

This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).

A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.

I think the first and most important task is to hammer down and harden your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.

2009-07-29 13:49:24

ansaurus

tags:

views:

answers:

how to search for a word in a book programmatically?

related questions