How to search phrase queries in inverted index structure? | ansaurus

tags:

views:

193

answers:

1

Q:

How to search phrase queries in inverted index structure?

If we want to search a query like this "t1 t2 t3" (t1,t2 ,t3 must be queued) in an inverted index structure , which ways should we do ?

1-First we search the "t1" term and find all documents that contains "t1" , then do this work for "t2" and then "t3" . Then find documents that positions of "t1" , "t2" and "t3" are next to each other .

2-First we search the "t1" term and find all documents that contains "t1" , then in all documents that we found , we search the "t2" and next , in the result of this , we find documents that contains "t3" .

I have a full inverted index . I want to know which ways above is optimized , (1) or (2) ?

thanks a lot.

+2 A:

As the wikipedia entry well explains,

There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created.

Since you don't tell us which variant you have, we can't really answer your question precisely, but thinking about each possibility will help.

To open and search documents is typically a costly operation, unless your documents are unusually small, so you want to minimize that -- and option (2) doesn't really minimize it. If you have an inverted list, with option (1) you won't even need to open any document; if you only have an inverted file, you'll inevitably need to open documents and scan them (since you otherwise lack information to confirm word adjacency) -- but at least with option (1) you minimize the number of documents you have to open and scan (only those in the intersection of the lists of documents containing each word).

So, in either case, option (1) is more promising (unless your documents are peculiarly small).

Alex Martelli 2010-04-17 17:16:06

related questions

Best way to search data stored as XML in Sql Server?

What are the alternative's to using the iThenticate service for content comparison?

Search by hash?

Free text search integrated with code coverage

How-to: Ranking Search Results

Find item in WPF ComboBox

Find in Files: Search all code in Team Foundation Server

Searching for phone numbers in mysql

How do I implement Search Functionality in a website?

Can you perform an AND search of keywords using FREETEXT() on SQL Server 2005?

How do I search content, within audio files/streams?

Search Plugin for Safari

Search strategies in ORMs

Using Lucene to search for email addresses

SQL Server Full Text Searching

How do you do a case insensitive search using a pattern modifier using less ?

WildcardQuery error in Solr

PowerShell FINDSTR eqivalent?

Parsing search queries in Java

Need Pattern for dynamic search of multiple sql tables

grep a file, but show several surrounding lines?

Eclipse : Class file name must end with .class exception in Java Search

MOSS SSP problem - Failed database logons from deleted SSP

Incomplete results with Turkish characters in Indexing Service

Lucene Score results