Hi all,

I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along these lines: how many sentences have "john" as the first word, "went" as the second word, and "hospital" as the fifth word, etc.? So I just need the count and don't need to actually retrieve the sentences.

The idea I had was to split the sentences into words and store them in a database, where the columns are the positions (word1, word2, word3, etc.) and the sentences are the rows. So it looks like:

Word1 Word2 Word3 Word4 Word5 ....

Congress approved a new bill

John went to school

.....

And my purpose would then be fulfilled by a query like SELECT COUNT(*) FROM sentences WHERE word1 = 'john' AND word4 = 'school'. But I am wondering: can this be better achieved using Lucene (or some other tool)?

The program I am writing (in Java) will be doing tens of thousands of such queries on that 100-million-sentence corpus, so speed of lookup is important.

Thanks for any advice,

Anas

A: 

Look at Apache Hadoop and MapReduce. They were developed for exactly this kind of task.

splix
MapReduce seems designed for cluster computing; I will be doing this on my personal notebook (the corpus is only a couple of GB in size).
+1  A: 

Assuming that the queries are as simple as you have indicated, a simple SQL db (Postgres, MySQL, possibly H2) would be perfect for this.
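For example, a minimal sketch with H2 over JDBC (the table, column, and index names here are illustrative, not anything prescribed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CorpusCount {
    public static void main(String[] args) throws Exception {
        // In-memory H2 for illustration; use a file-based URL for a real corpus.
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:corpus");
        conn.createStatement().execute(
            "CREATE TABLE sentences (word1 VARCHAR, word2 VARCHAR, " +
            "word3 VARCHAR, word4 VARCHAR, word5 VARCHAR)");
        // Indexes on the positional columns are what keep COUNT(*) from
        // scanning all 100 million rows.
        conn.createStatement().execute("CREATE INDEX idx_w1 ON sentences(word1)");
        conn.createStatement().execute("CREATE INDEX idx_w4 ON sentences(word4)");
        conn.createStatement().execute(
            "INSERT INTO sentences VALUES ('john', 'went', 'to', 'school', NULL)");

        PreparedStatement count = conn.prepareStatement(
            "SELECT COUNT(*) FROM sentences WHERE word1 = ? AND word4 = ?");
        count.setString(1, "john");
        count.setString(2, "school");
        ResultSet rs = count.executeQuery();
        rs.next();
        System.out.println(rs.getInt(1));  // prints 1
    }
}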

jsight
That was the original idea, but the worry I had (part of the reason I posted this question) is whether the counting would be slow with 100 million rows. I mean, if it takes 10 seconds to count the rows that satisfy the SELECT statement, that would be too slow.
A: 

Or you can do it by hand, using only Java:

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.StringUtils;

// Slide a window of three words over the input, emitting each triple.
List<String> triple = new ArrayList<String>(3);
for (String word : inputFileWords) {
    triple.add(word);
    if (triple.size() == 3) {
        resultFile.println(StringUtils.join(triple, " "));
        triple.remove(0);  // drop the oldest word to advance the window
    }
}

Then sort this file and sum up the duplicate lines (manually or with some command-line utility); it will be as fast as possible.

splix
I am afraid this won't work for my purpose: I don't just want the duplicate lines, I want the count of lines that satisfy certain properties (e.g. how many lines have "car" as the 2nd word and "crashed" as the 3rd word), so simply collapsing the lines won't do. Plus I need to be able to access the count reasonably fast, since my code will be doing tens of thousands of such queries.
Oh, sorry, I'd just misunderstood your case. Then using a DB would be an optimal choice. Postgres, for example, has some features for easily indexing and querying arrays, which could help you. But with 100 million rows it may not be as fast as you wish.
splix
A: 
  • I suggest you read Search Engine versus DBMS. From what I gather, you do need a database rather than a full-text search library.
  • In any case, I suggest you preprocess your text and replace every word/token with a number, using a dictionary. This turns every sentence into an array of word codes. I would then store each word position in a separate database column, simplifying counts and making them quicker (a sketch of this dictionary step follows the example below). For example:

A boy and a girl drank milk

translates into:

120 530 14 120 619 447 253

(I chose arbitrary word codes), leading to storing the row

120 530 14 120 619 447 253 0 0 0 0 0 0 0 ....

(padded with zeros until the number of words you allocate per sentence is exhausted).

This is a somewhat sparse matrix, so maybe this question will help.
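If it helps, here is a hypothetical Java sketch of the dictionary step (the class and method names are invented for illustration):

import java.util.HashMap;
import java.util.Map;

// Assign each distinct word a small integer code, so that every
// sentence becomes a fixed-width array of ints.
public class WordCodes {
    private final Map<String, Integer> codes = new HashMap<String, Integer>();

    // Returns the existing code for a word, or assigns the next free one.
    int codeFor(String word) {
        return codes.computeIfAbsent(word.toLowerCase(), w -> codes.size() + 1);
    }

    int[] encode(String sentence, int maxWords) {
        int[] row = new int[maxWords];  // unused positions stay 0
        String[] words = sentence.split("\\s+");
        for (int i = 0; i < words.length && i < maxWords; i++) {
            row[i] = codeFor(words[i]);
        }
        return row;
    }
}

The int rows can then go straight into numeric database columns.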

Yuval F
Thank you very much for the link, I found it quite helpful. And, actually, I took away from it that a search indexer would be faster than a DB, so I've decided to go with that, God willing. Thanks again.
A: 

I suppose you already have the infrastructure to create tokens from a given sentence. You can create a Lucene document with one field for each word in the sentence. You can name the fields field1, field2, and so on. Since Lucene doesn't have a schema like a DB, you can define as many fields as you wish, on the fly. You can add an additional identifier field if you want to identify which sentences matched a query.

While searching, your typical Lucene query will be

+field1:John +field4:school

Since you are not bothered about relevance scores, you can write a custom Collector that ignores them. (That will return results significantly faster as well.)

Since you don't plan to retrieve the matching sentences or words, you should only index these fields and not store them. That should push performance up a notch.
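A minimal sketch of both halves, assuming a Lucene 3.x-style API (TotalHitCountCollector is one stock collector that skips scoring; helper names like addSentence are mine):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

// Index one sentence as a document with fields field1, field2, ...
static void addSentence(IndexWriter writer, String sentence) throws Exception {
    Document doc = new Document();
    String[] words = sentence.toLowerCase().split("\\s+");
    for (int i = 0; i < words.length; i++) {
        // Index only, don't store: the sentences are never retrieved.
        doc.add(new Field("field" + (i + 1), words[i],
                          Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
    writer.addDocument(doc);
}

// Count sentences with "john" at position 1 and "school" at position 4.
static int countMatches(IndexSearcher searcher) throws Exception {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("field1", "john")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("field4", "school")), BooleanClause.Occur.MUST);
    TotalHitCountCollector collector = new TotalHitCountCollector();
    searcher.search(query, collector);  // no scoring, just counting
    return collector.getTotalHits();
}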

Shashikant Kore
Ok, thanks for the tip on the Collector.
+1  A: 

Lucene span queries can implement positional search. Use SpanFirst to find a word in the first N positions of a document, and combine it with SpanNot to rule out the first N-1.

Your example query would look like this:

<BooleanQuery: +(+spanFirst(john, 1) +spanFirst(went, 2)) +spanNot(spanFirst(hospital, 5), spanFirst(hospital, 4))>

Lucene also, of course, lets you get the total hit count of a search result without iterating over all the docs.
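A sketch of building that query programmatically, assuming the sentences are indexed in a single field (I call it "sentence" here):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// spanFirst(term, n) matches the term within the first n positions.
SpanQuery john1 = new SpanFirstQuery(
        new SpanTermQuery(new Term("sentence", "john")), 1);
SpanQuery went2 = new SpanFirstQuery(
        new SpanTermQuery(new Term("sentence", "went")), 2);
// "hospital" within the first 5 positions but not the first 4,
// i.e. exactly at position 5.
SpanQuery hospital5 = new SpanNotQuery(
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 5),
        new SpanFirstQuery(new SpanTermQuery(new Term("sentence", "hospital")), 4));

BooleanQuery query = new BooleanQuery();
query.add(john1, BooleanClause.Occur.MUST);
query.add(went2, BooleanClause.Occur.MUST);
query.add(hospital5, BooleanClause.Occur.MUST);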

Coady