Hi,
Which data structure or algorithm is used in browsers to search for a word? Will the browsers construct a trie or suffix tree?
Thank you
Bala
Web pages are usually not large enough to need sophisticated search algorithms, at least on the first scan: a simple linear search can probably find any word in just a few milliseconds. An optimization could be to build a trie during that first scan and then reuse it for subsequent searches.
Overall, I don't think this is one of the big issues in browser algorithms.
Searching with a trie/suffix tree is fast -- but building the trie to start with is considerably slower. That means they only make sense when you expect to carry out a large number of searches over the same data, so you can amortize the time to build the trie over many searches.
The average number of searches inside a web page is probably fractional (i.e. you expect the user to load several pages before doing a search even once). Even when you do search a page, doing a lot of searches in the same page is probably pretty rare.
That means a linear search will almost always be substantially more efficient overall than a trie or suffix tree. My guess is that if they bother optimizing beyond a simple call to strstr(), they only go as far as something in the Boyer-Moore family of string-searching algorithms. Given the number of searches you expect in a web page, one of these will normally finish them all before you could even finish building the trie, let alone start searching with it.
For interactive use, your primary concern is producing results fast enough to appear instantaneous. That generally means results within 100ms or so. With a good implementation of Boyer-Moore-Horspool, that's enough time to search an amount of text that would be insane to include in a single web page (on the order of hundreds of megabytes or gigabytes).
If you want to put it to the test, I'd recommend Ray Gardner's implementation of Boyer-Moore-Horspool (Bmhsrch.C, from Bob Stout's Snippets site). I'd truly hate to see a web page large enough to keep it occupied for even 20 ms, not to mention 100 (though I'm the first to admit this particular implementation is exceptionally fast).
To understand why a linear scan is fast enough, consider how much more complex page rendering is (which obviously requires at least a linear scan of the HTML) and how fast that is done. I think the browser will spend far more time highlighting occurrences than searching for them, anyway.
Also, the search may be done incrementally. Say I'm searching for "algorithm". When I type "a", the browser might find (or asynchronously start searching for) the occurrences of the letter "a", and each subsequent character only refines the current findings.
Simple use of regular expressions is more than enough for this. Take a look at the various online regex tools to see how fast even straightforward matching is.