I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, if we find "General Motors" mentioned in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or the class of articles that did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
CREATE TABLE `each_entity_word` (
`word` varchar(20) NOT NULL,
`entity_id` int(10) unsigned NOT NULL,
`word_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`word`, `entity_id`)
);
CREATE TABLE `each_entity_sum` (
`entity_id` int(10) unsigned NOT NULL DEFAULT '0',
`word_count_sum` int(10) unsigned DEFAULT NULL,
`doc_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`entity_id`)
);
CREATE TABLE `total_entity_word` (
`word` varchar(20) NOT NULL,
`word_count` int(10) unsigned NOT NULL,
PRIMARY KEY (`word`)
);
CREATE TABLE `total_entity_sum` (
`word_count_sum` bigint(20) unsigned NOT NULL,
`doc_count` int(10) unsigned NOT NULL,
`pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
PRIMARY KEY (`pkey`)
);
Each article in the marked data is split into distinct words. For each article, for each entity, every word is added to each_entity_word and/or its word_count is incremented, and doc_count is incremented in each_entity_sum, both with respect to that entity_id. This is repeated for each entity known to be mentioned in that article.
For each article, regardless of the entities contained within, word_count in total_entity_word and the counts in total_entity_sum are similarly incremented for each word.
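For what it's worth, here is a minimal sketch of that per-article update for one entity, assuming a hypothetical temporary table article_words that holds the distinct words of the article being loaded:
-- article_words is a hypothetical staging table of the article's distinct words;
-- @entity_id is one entity verified to be mentioned in this article.
set @n_words = (select count(*) from article_words);

insert into each_entity_word (word, entity_id, word_count)
  select word, @entity_id, 1 from article_words
  on duplicate key update word_count = word_count + 1;

insert into each_entity_sum (entity_id, word_count_sum, doc_count)
  values (@entity_id, @n_words, 1)
  on duplicate key update
    word_count_sum = word_count_sum + @n_words,
    doc_count = doc_count + 1;
The analogous statements, minus the entity_id, maintain total_entity_word and total_entity_sum.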
- P(word|any document) should equal the word_count in total_entity_word for that word over doc_count in total_entity_sum.
- P(word|document mentions entity x) should equal the word_count in each_entity_word for that word for entity_id x over doc_count in each_entity_sum for entity_id x (see the example query after this list).
- P(word|document does not mention entity x) should equal (the word_count in total_entity_word minus its word_count in each_entity_word for that word for that entity) over (the doc_count in total_entity_sum minus the doc_count for that entity in each_entity_sum).
- P(document mentions entity x) should equal doc_count in each_entity_sum for that entity id over doc_count in total_entity_sum.
- P(document does not mention entity x) should equal 1 minus (doc_count in each_entity_sum for x's entity id over doc_count in total_entity_sum).
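To make that concrete, P(word|document mentions entity x) for a single word can be read straight off the tables; a minimal sketch, where the literal word and entity id are just placeholders:
-- P('motors' | document mentions entity 1): the word's count for that entity
-- over the number of documents known to mention that entity.
select ew.word_count / es.doc_count as p_word_given_entity
from each_entity_word ew
join each_entity_sum es on es.entity_id = ew.entity_id
where ew.word = 'motors'
  and ew.entity_id = 1;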
For a new article that comes in, split it into words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the db platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so you can just do sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).
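For instance (a toy illustration, not tied to the schema above):
-- exp(sum(log(p))) multiplies the rows' p values together: 0.5 * 0.25 * 0.1 = 0.0125
select exp(sum(log(p))) as product_of_p
from (select 0.5 as p union all select 0.25 union all select 0.1) t;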
So, if I get a new article in, split it up into distinct words, and put those words into a big IN() clause along with a potential entity id to test, how can I get the naive Bayesian probability that the article falls into that entity id's class in SQL?
EDIT:
Try #1:
set @entity_id = 1;

-- load the per-entity and total document counts into user variables
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;
select @total_doc_count := doc_count from total_entity_sum;

select
  exp(
    -- log prior: P(document mentions entity x)
    log(@entity_doc_count / @total_doc_count) +
    (
      -- sum of log P(word | mentions x), add-one smoothed
      sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) /
      -- sum of log P(word | does not mention x)
      sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
    )
  ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);