views: 113
answers: 4

Hi,

Which data structure or algorithm do browsers use to search for a word in a page? Do they construct a trie or a suffix tree?

Thank you
Bala

+3  A: 

Web pages are usually not large enough to need sophisticated search algorithms, at least for the first scan: a simple linear search can probably find any word in just a few milliseconds. One optimization would be to build a trie during that first scan and use it for subsequent searches.
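
As a rough sketch (illustrative only, not what any particular browser actually does), collecting every match offset with a linear scan can be as simple as:

```cpp
#include <string>
#include <vector>

// Minimal linear-scan sketch: collect every offset at which `query`
// occurs in `text`. std::string::find is itself a linear search.
std::vector<size_t> find_all(const std::string& text, const std::string& query) {
    std::vector<size_t> hits;
    if (query.empty()) return hits;
    for (size_t pos = text.find(query); pos != std::string::npos;
         pos = text.find(query, pos + 1)) {
        hits.push_back(pos);
    }
    return hits;
}
```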

Overall, I don't think this is one of the big issues in browser algorithms.

Eli Bendersky
I don't agree that a linear scan will be used: most browsers highlight all occurrences of the word as you type, and I don't think a linear scan makes sense for that. Perhaps, depending on the size of the web page, either a linear scan or a trie is used.
Algorist
@Algorist: how would highlighting the words make a linear scan obsolete? To build a trie you still have to scan linearly at least once, so you might as well use that scan to find the first results.
Eli Bendersky
But there is a difference between doing a linear scan once and doing it for every search word.
Algorist
@Algorist: and this is why I wrote the last sentence in the first paragraph of my answer
Eli Bendersky
+2  A: 

Searching with a trie/suffix tree is fast -- but building the trie to start with is considerably slower. That means they only make sense when you expect to carry out a large number of searches over the same data, so you can amortize the time to build the trie over many searches.

The average number of searches inside a web page is probably fractional (i.e. you expect the user to load several pages before doing a search even once). Even when you do search a page, doing a lot of searches in the same page is probably pretty rare.

That means a linear search will almost always be substantially more efficient overall than a trie or suffix tree. My guess is that if they bother optimizing beyond a simple call to strstr(), they only go as far as something in the Boyer-Moore family of string-searching algorithms. Given the number of searches you expect in a web page, that will normally finish all of them before you could even finish building the trie, let alone start searching with it.

For interactive use, your primary concern is producing results fast enough to appear instantaneous. That generally means results within 100ms or so. With a good implementation of Boyer-Moore-Horspool, that's enough time to search an amount of text that would be insane to include in a single web page (on the order of hundreds of megabytes or gigabytes).
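
To make the idea concrete, here is a bare-bones Boyer-Moore-Horspool search (a sketch only, nothing like a tuned implementation):

```cpp
#include <string>

// Boyer-Moore-Horspool sketch: precompute, for every byte value, how far
// the window may shift when that byte ends the current window, then slide
// the pattern across the text using those shifts.
size_t bmh_find(const std::string& text, const std::string& pattern) {
    const size_t n = text.size(), m = pattern.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    size_t shift[256];
    for (size_t& s : shift) s = m;                        // default: shift by the whole pattern
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pattern[i]] = m - 1 - i;     // distance from the pattern's end

    for (size_t pos = 0; pos + m <= n; ) {
        size_t i = m;
        while (i > 0 && text[pos + i - 1] == pattern[i - 1]) --i;
        if (i == 0) return pos;                           // full match at pos
        pos += shift[(unsigned char)text[pos + m - 1]];   // skip by the last-byte rule
    }
    return std::string::npos;
}
```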

If you want to put it to the test, I'd recommend Ray Gardner's implementation of Boyer-Moore-Horspool (Bmhsrch.C, from Bob Stout's Snippets site). I'd truly hate to see a web page large enough to keep it occupied for even 20 ms, not to mention 100 (though I'm the first to admit this particular implementation is exceptionally fast).

Jerry Coffin
Funny, WebKit even contains a comment `// FIXME: Can we do Boyer-Moore or equivalent instead for speed?` http://trac.webkit.org/browser/trunk/WebCore/editing/TextIterator.cpp?rev=34822#L1378
jleedev
+2  A: 

To understand why a linear scan is fast enough, consider how much more complex page rendering is (which obviously requires at least a linear scan of the HTML) and how fast it is done. I think the browser will spend a lot more time highlighting occurrences than searching for them, anyway.

Also, the search may be done incrementally. Say I'm searching for "algorithm". When I type "a", the browser might find (or asynchronously start searching for) the occurrences of the letter "a", and each subsequent character only refines the current matches.
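
A minimal sketch of that refinement step (hypothetical helper, assuming the previous matches are kept as start offsets):

```cpp
#include <string>
#include <vector>

// Sketch: `hits` holds the start offsets that matched the previous prefix of
// length `prefix_len`; when one more character is typed, only those offsets
// need to be re-checked instead of rescanning the whole page.
std::vector<size_t> refine(const std::string& text,
                           const std::vector<size_t>& hits,
                           size_t prefix_len, char next_char) {
    std::vector<size_t> still_matching;
    for (size_t start : hits) {
        size_t check = start + prefix_len;
        if (check < text.size() && text[check] == next_char)
            still_matching.push_back(start);
    }
    return still_matching;
}
```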

jkff
A: 

A simple regular expression is more than enough for this. Take a look at the various online regex tools.
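
For instance (a minimal sketch, with a made-up page string and search term):

```cpp
#include <iostream>
#include <regex>
#include <string>

// Sketch: report every occurrence of a search term using std::regex.
int main() {
    std::string page = "the quick brown fox jumps over the lazy dog";
    std::regex term("the");                                  // search term as a literal pattern
    for (auto it = std::sregex_iterator(page.begin(), page.end(), term);
         it != std::sregex_iterator(); ++it) {
        std::cout << "match at offset " << it->position() << '\n';
    }
}
```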

pokrate