I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.

As an example, suppose I have the requirement:

The system shall apply deposits to a customer's specified account.

And categories/keywords:

  1. Customer Transactions: deposits, deposit, customer, account, accounts
  2. Balance Accounts: account, accounts, debits, credits
  3. Other Category: foo, bar

I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but it needs to convey how much more likely category 1 is to apply than category 2.

I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence).

+2  A: 

You might want to look into the category of "similarity measures" or "distance measures" (which is different, in data mining lingo, from "classification").

Basically, a similarity measure is a mathematical way to:

  1. Take two sets of data (in your case, words)
  2. Run some computation/equation/algorithm
  3. Get back a number that tells you how "similar" the two sets are.

With similarity measures, the result is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical".

So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.

And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data).

What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
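
For instance, here is a rough Python sketch of the Jaccard version (the names and the tokenization are just for illustration, and the possessive in "customer's" is normalized by hand):

    def jaccard(sentence_words, keywords):
        # |intersection| / |union| of the two word sets
        s, k = set(sentence_words), set(keywords)
        if not (s or k):
            return 0.0
        return len(s & k) / len(s | k)

    categories = {
        'Customer Transactions': ['deposits', 'deposit', 'customer', 'account', 'accounts'],
        'Balance Accounts':      ['account', 'accounts', 'debits', 'credits'],
        'Other Category':        ['foo', 'bar'],
    }

    words = "the system shall apply deposits to a customer specified account".split()
    for name in sorted(categories, key=lambda c: -jaccard(words, categories[c])):
        print(name, round(jaccard(words, categories[name]), 3))
    # Customer Transactions scores highest, Balance Accounts lower, Other Category 0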

As an example:

From your question:

Customer Transactions: deposits, deposit, customer, account, accounts

So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that the "Customer Transactions" category has 5 keywords, and (this will sound obvious, but) each of those words is present in its own keyword list. Bear with me.

So now you take your sentence:

The system shall apply deposits to a customer's specified account.

This has 3 words from the "Customer Transactions" set: {deposits, account, customer}

(Actually, this illustrates another nuance: you have "customer's". Is this equivalent to "customer"?)

The vector for your sentence might be (1, 0, 1, 1, 0)

The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.

So we could ask: how many times do these vectors differ? Let's compare:

(1, 1, 1, 1, 1)
(1, 0, 1, 1, 0)

Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They differ by only 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
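
Here is that whole worked example in a few lines of Python (the crude handling of "customer's" and the trailing period is just for this example):

    keywords = ['deposits', 'deposit', 'customer', 'account', 'accounts']
    text = "The system shall apply deposits to a customer's specified account."
    words = set(text.lower().replace("'s", "").rstrip(".").split())

    category_vec = [1] * len(keywords)                         # every keyword is in its own list
    sentence_vec = [1 if w in words else 0 for w in keywords]  # [1, 0, 1, 1, 0]

    hamming = sum(a != b for a, b in zip(category_vec, sentence_vec))
    print(hamming)   # 2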

(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)

Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well, at least as a starting point.

(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)

good luck!

rascher
A word of caution: the techniques described here are better applied in clustering-type tasks. Here, the pre-defined lists of words associated with each category are not at all prototypic items, and the traditional distance functions between these and the actual items are not representative of how well the items belong to the corresponding categories. For example, a particular category may have dozens of keywords (even though we only expect to find a few in a given item); such a category will likely be under-represented because of its poor Hamming-distance score.
mjv
Hm, you're right about Hamming being a poor measure - as you say in your answer, it'd be good for the results to be normalized, to get a ratio of "hits" to "misses" to see how closely the sets are related. Maybe using that method as an example was a suboptimal choice!
rascher
You're both right, and what I'd ideally like to do is normalize tense and plurality in both keywords and sentences. That way, I only list "customer" and not "customers", "deposit" and not "deposits" or "deposited". I think Hamming still runs the risk of under-representation, but I think it's a good first stab at what I'm trying to do.
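
For what it's worth, a stemmer would probably get most of the way there; a minimal sketch with NLTK's Porter stemmer (assuming NLTK is available):

    from nltk.stem.porter import PorterStemmer

    stem = PorterStemmer().stem
    print([stem(w) for w in ['deposit', 'deposits', 'deposited', 'customer', 'customers']])
    # Note that Porter over-stems a little ('customer' and 'customers' both become
    # 'custom'), which is fine as long as keywords and sentences are stemmed the same way.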
technomalogical
+1  A: 

The main characteristics of the problem are:

  • Externally defined categorization criteria (keyword list)
  • Items to be classified (lines from the requirement document) are made of a relatively small number of attribute values, effectively along a single dimension: "keyword".
  • As defined, no feedback/calibration (although it may be appropriate to suggest some of that)

These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy in the categorization process may be hard to achieve. Also, the small sizes of the various quantities involved (number of possible categories, max/average number of words in an item, etc.) give us room to select solutions that may be CPU- and/or space-intensive, if need be.

Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.

Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)

   Parameters = 
     CatKWs = an array/hash of lists of strings.  The list contains the possible
              keywords, for a given category.
         usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
     NbCats = integer number of pre-defined categories
   Variables:
      CatAccu = an array/hash of numeric values with one entry per each of the
                possible categories.  usage:  CatAccu[3] = 4 (if array) or 
                 CatAccu['CustTx'] += 1  (hash)
      TotalKwOccurrences = counts the total number of keyword matches (counted
       multiple times when a word is found in several pre-defined categories)

    Pseudo code:  (for categorizing one input item)
       1. TotalKwOccurrences = 0
          for x in 1 to NbCats
            CatAccu[x] = 0    // reset the accumulators
       2. for each word W in Item
             for each x in 1 to NbCats
                 if W found in CatKWs[x]
                      TotalKwOccurrences++
                      CatAccu[x]++
       3. if TotalKwOccurrences > 0     // guard against items with no matches at all
            for each x in 1 to NbCats
              CatAccu[x] = CatAccu[x] / TotalKwOccurrences  // calculate rating
       4. Sort CatAccu by value
       5. Return the ordered list of (CategoryID, rating)
              for all corresponding CatAccu[x] values above a given threshold.

Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the categories for a given item, but it may be significant when comparing the ratings of different items.
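
For concreteness, here is the same basic algorithm as a small Python sketch (identifiers loosely follow the pseudocode above; the tokenization is deliberately naive):

    cat_kws = {
        'CustTx':  {'deposits', 'deposit', 'customer', 'account', 'accounts'},
        'BalAcct': {'account', 'accounts', 'debits', 'credits'},
        'Other':   {'foo', 'bar'},
    }

    def categorize(item, cat_kws, threshold=0.0):
        cat_accu = dict.fromkeys(cat_kws, 0)          # step 1: reset accumulators
        total = 0
        for word in item.lower().split():             # step 2: count keyword matches
            for cat, kws in cat_kws.items():
                if word in kws:
                    cat_accu[cat] += 1
                    total += 1
        if total == 0:                                # nothing matched at all
            return []
        ratings = [(cat, n / total) for cat, n in cat_accu.items()
                   if n / total > threshold]          # step 3: normalize to ratings
        return sorted(ratings, key=lambda cr: cr[1], reverse=True)   # steps 4-5

    print(categorize("the system shall apply deposits to a customer specified account",
                     cat_kws))
    # [('CustTx', 0.75), ('BalAcct', 0.25)]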

Now, several simple improvements come to mind: (I'd seriously consider the first two, and give thought to the others; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...)

  • We should normalize the words read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
  • We should give more importance to words found less frequently in CatKWs. For example, the word 'Account' should count less than the word 'foo' or 'credit' (a rough sketch of these first two ideas follows this list).
  • We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
  • We could also include consideration based on bigrams (two consecutive words), for with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
  • We could add a tiny bit of importance to the category assigned to the preceding (or even the following, in a look-ahead logic) item. Items will likely come in related series, and we can benefit from this regularity.
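
Here is a rough sketch of the first two ideas: stem both the keywords and the item's words (which normalizes plural/tense), and weight each keyword by the inverse of the number of categories that list it (NLTK is assumed for the stemmer):

    from nltk.stem.porter import PorterStemmer

    stem = PorterStemmer().stem

    def build_weights(cat_kws):
        # Stem each category's keywords, then weight each stem by
        # 1 / (number of categories listing it), so ubiquitous words
        # like 'account' count less than distinctive ones like 'foo'.
        stemmed = {cat: {stem(w) for w in kws} for cat, kws in cat_kws.items()}
        counts = {}
        for kws in stemmed.values():
            for w in kws:
                counts[w] = counts.get(w, 0) + 1
        return stemmed, {w: 1.0 / n for w, n in counts.items()}

    def categorize_weighted(item, cat_kws):
        stemmed, weights = build_weights(cat_kws)
        scores = dict.fromkeys(cat_kws, 0.0)
        for word in map(stem, item.lower().split()):
            for cat, kws in stemmed.items():
                if word in kws:
                    scores[cat] += weights[word]
        return sorted(scores.items(), key=lambda cs: cs[1], reverse=True)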

Aside from the calculation of the rating per se, we should also consider:

  • some metrics that would be used to rate the algorithm outcome itself (tbd)
  • some logic to collect the list of words associated with an assigned category and to eventually run statistics on them. This may allow the identification of words representative of a category that were not initially listed in CatKWs.

The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keyword dictionary (typically, training sets are used to determine this very list of category-keywords, along with weight factors). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
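
As a strawman for such a metric, assuming a small hand-labeled reference set and reusing the categorize() sketch above, one could measure how often the top-ranked category matches the human label:

    labeled = [
        ("the system shall apply deposits to a customer specified account", 'CustTx'),
        # ... more (item, expected_category) pairs from the reference set
    ]

    def top1_accuracy(labeled, cat_kws):
        hits = 0
        for item, expected in labeled:
            ranked = categorize(item, cat_kws)   # from the earlier sketch
            if ranked and ranked[0][0] == expected:
                hits += 1
        return hits / len(labeled)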

To summarize: stick to simple approaches; in any case, the context doesn't leave much room for fancy ones. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of other items (overfitting).

mjv