views:

273

answers:

7

I need to develop an application that can search through a book and list out all the pages and lines that contain a given keyword.

For books that are split up in some other way, such as a bible which is split up by chapter and verse; they would be able to search for all verses that contain a certain keyword. Or alternatively, search within certain chapters and verses for a keyword.

What format should I store the book into? Should it be stored into a SQL database?

What format would be easiest for searching as opposed to easiest for storage?

+2  A: 

Years ago thee was a Bible already stored in an Access database that I used to make an application exactly like what you're talking about. The Access DB was a free download. A few years back, I ran across one in XML. I can't do it from work but I would recommend doing a search for Access Bible or XML Bible and see if you can find it. (I think the original Access one may have been called ASP Bible). At any rate, if you can find it, it should give you a good idea of how you can structure your database.

David Stratton
+1  A: 

Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.

Jeremy Cron
Many MANY books are split up into subdivisions. Chapter and verse was a biblical example.
(-1) This is not an answer, it is a comment.
+2  A: 

I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.

Craig Stuntz
A: 
def findWord(keyword):
    f = open("book.txt")
    for line in f:  # horribly bad performance for a large block of text
        if line.find(keyword) > -1:
            print line

Substitute each line for a block of text for your specific bible example. How you store the text is really irrelevant. All you're doing is searching some given text (most likely in a loop), for a keyword.

If you want to search line numbers, and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on any field that is relevannt.

FYI - the code above is Python.

Josh Smeaton
(-1) If you are talking about a book as big as the bible, this algorithm would be practically unusable.
+3  A: 

It kind off depends on the environment you want to run it on, and how many queries you expect per second.

The fastest is to store every word in a hashtable into memory, and the values contain reference to the chapters/verses, or whatever you call it, you want to retrieve.

But this may not scale well if the book is very large, or the client is very thin.

You could store every verse in a database record, and search with full-text-search. But if you need to host the app on a Website, you need to ensure that the hosting costs of the database of your choice does not exceed your budget.

If your application load can handle it, you can also store every verse in a text file (plain text, XML, or any other format), and scan each file, preferably with XPATH or regular expression. A very cheap and easy solution, that you can make as advanced as you like, but probably slower. Then again if you need to service only 1 request per hour, why not?

I would use the database with full-text-search, since that scales the best.

GvS
A: 

Do you expect multiple queries for the same book? i.e. do you want to do per-book preprocessing that may take a lot of time, but has to be done only once per book? Otherwise, the boyer-moore is probably the best way to go. Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hashtable is probably fastest. If you want to look for parts of word, I'd suggest a suffix tree.

When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.

nikie
A: 

You could look into the Boyer-Moore (also, this contains a link to their original paper) algorithm

Unfortunately, the Boyer-Moore algorithm is much faster on longer strings than it is on short 'keyword' searches. So, for keyword searching you might want to implement some sort of crawler that could index likely search terms.

Another troubling consideration is that in most books chapters are contained on only certain pages, whereas with a bible, the chapters and verses could be split across multiple pages, and the pages could contain multiple verses and chapters.

This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).

A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.

I think the first and most important task is to hammer down and harden your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.