ansaurus

Question

Counting sentences: Database (like h2) vs. Lucene vs. ?

Answer 1

A:

Look at Apache Hadoop and Map Reduce. It's developed for things like this.

splix 2010-02-15 20:49:20

MapReduce seems designed for cluster computing, I will be doing this thing on my personal notebook (the corpus is only a couple of GBs in size).

2010-02-15 21:01:38

Answer 2

+1 A:

Assuming that the queries are as simple as you have indicated, a simple SQL db (Postgres, MySQL, possibly H2) would be perfect for this.

jsight 2010-02-15 21:09:30

That was the original idea, but the worry I had (part of the reason why I posted this question) is whether the counting would be a bit slow with 100 million rows. I mean, if it will take 10 seconds to count the rows that satisfy the select statement, that would be too slow.

2010-02-15 21:21:37

Answer 3

A:

Or you can done it by the hand, using only only java by

List triple = new ArrayList(3);    
for (String word: inputFileWords) {
  if (triple.size == 3) {
      resultFile.println(StringUtils.join(" ", triple));
      triple.remove(0);
  }
  triple.add(line);
}

then sort this file and sum all duplicate lines (manually or from some command line utility), it will be fast as possible.

splix 2010-02-15 21:20:00

I am afraid this won't work for my purpose: I don't just want the duplicate lines, I want the count of lines that satisfy certain properties (e.g. how many lines have "car" as 2nd word and "crashed" as 3rd word). So simply collapsing the lines won't do. Plus I need to be able to access the account in a reasonably fast manner since my code will be doing tens of thousands of such queries.

2010-02-15 22:18:42

oh, sorry, i'd just misunderstood your case.In this case using an DB will be an optimal choice. And for example Postgres have some features for easy indexing and quering arrays, it can help you. But for 100 million rows it may be no so fast as you wish.

splix 2010-02-15 22:35:04

Answer 4

A:

I suggest you read Search Engine versus DBMS. From what I gather, you do need a database rather than a full text search library.
In any case, I suggest you preprocess your text and replace every word/token with a number using a dictionary. This replaces every sentence with an array of word codes. I would then store every word place in a separate database column, simplifying counts and making them quicker. For example:

A boy and a girl drank milk

translates into:

120 530 14 120 619 447 253

(I chose arbitrary word codes) leading to store a row

120 530 14 120 619 447 253 0 0 0 0 0 0 0 ....

(until the number of words you allocate per a sentence is exhausted).

This is a somewhat sparse matrix, so maybe this question will help.

Yuval F 2010-02-16 07:25:50

Thank you very much for the link, I found it quite helpful. And, actually, I took away from it that search indexer would be faster than DB so I've decided to go with that, God willing. Thanks again.

2010-02-18 14:17:38

Answer 5

A:

I suppose you already have infrastructure to create tokens from a given sentence. You can create a lucene document with one field for each word in the sentence. You can name the fields as field1, field2, and so on. Since, lucene doesn't have a schema like DB, you can define as many fields, on the fly, as you wish. You can add an additional identifier field if you want to identify which sentences matched a query.

While searching, your typical lucene query will be

+field1:John +field4:school

Since you are not bothered about the speed of retrieval, you can write a custom Collector which will ignore scores. (That will return results significantly faster as well.)

Since you don't plan to retrive back the matching sentences or words, you should only index these fields and not store. That should push performance up by a notch.

Shashikant Kore 2010-02-16 08:51:18

Ok, thanks for the tip on the Collector.

2010-02-18 14:18:52

Answer 6

+1 A:

Lucene span queries can implement positional search. Use SpanFirst to find a word in the first N positions of a document, and combine it with SpanNot to rule out the first N-1.

Your example query would like this:

<BooleanQuery: +(+spanFirst(john, 1) +spanFirst(went, 2)) +spanNot(spanFirst(hospital, 5), spanFirst(hospital, 4))>

Lucene also of course allows getting the total hit count of a search result without iterating all the docs.

Coady 2010-02-17 16:00:09

ansaurus

tags:

views:

answers:

Counting sentences: Database (like h2) vs. Lucene vs. ?

related questions