I have a database with a listing of documents and the words within them. Each row represents a term. What I'm looking to do is to count how many documents a word occurs in.

So, given the following:

+-------+--------+
|  doc  |  word  |
+-------+--------+
|   a   |  foo   |
|   a   |  foo   |
|   a   |  bar   |
|   b   |  bar   |
+-------+--------+

I'd get a result of

+--------+---------+
|  word  |  count  |
+--------+---------+
|  foo   |    1    |
|  bar   |    2    |
+--------+---------+

Because foo occurs in only one document (even if it occurs twice within that doc) and bar occurs in two documents.

Essentially, what (I think) I should be doing is a COUNT, per word, of the rows that the following query spits out,

SELECT DISTINCT word, doc FROM table

...but I can't quite figure it out. Any hints?

+4  A: 

You can actually use distinct inside count, like:

select  word
,       count(distinct doc) as doc_count  -- number of different docs containing the word
from    YourTable
group by
        word
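
The same result can also be reached the way the question suggests, by counting the rows of the SELECT DISTINCT query in a derived table. This is only a sketch: it assumes the table is called YourTable as in the query above, and the names d and doc_count are made up for illustration.

```sql
-- Step 1 (inner query): collapse repeated (word, doc) pairs to one row each.
-- Step 2 (outer query): count the remaining rows per word.
select  d.word
,       count(d.doc) as doc_count   -- count(d.doc) skips NULL docs, like count(distinct doc)
from    (
        select distinct word, doc
        from   YourTable
        ) d
group by
        d.word
```

For the sample data in the question, both forms should give 1 for foo and 2 for bar. The count(distinct doc) version is shorter, so the derived table is mostly useful for seeing what the count is actually doing.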
Andomar
A: 

This may be an aside, but I'm guessing this is not the best way to do it. Why are you tracking every word in every document? Take a look at Oracle Intermedia. It was built for this sort of thing (specifically text search).

erbsock
I'm practicing text mining and am in fact using another Oracle product, the Data Miner. What I was doing here is trimming the uninteresting words (those that occur in more than 98% of the documents or in less than 1% of them) to make the data set smaller.
Peter O
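
The trimming described in the comment above could be sketched by comparing each word's document count against the total number of documents. This is not necessarily how the commenter did it: the table name YourTable is borrowed from the first answer, doc_count is an illustrative alias, and the 98% and 1% thresholds are the ones quoted in the comment.

```sql
-- Lists the "uninteresting" words: those found in more than 98%
-- or in less than 1% of all documents.
select  word
,       count(distinct doc) as doc_count
from    YourTable
group by
        word
having  count(distinct doc) > 0.98 * (select count(distinct doc) from YourTable)
or      count(distinct doc) < 0.01 * (select count(distinct doc) from YourTable)
```

The words returned here are the candidates to drop. Flipping the two conditions (`<= 0.98 * ...` and `>= 0.01 * ...`, joined with `and`) would instead list the words to keep.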