As part of a larger project, I need to read in text and represent each word as a number. For example, if the program reads in "Every good boy deserves fruit", then I would get a table that converts 'every' to '1742', 'good' to '977513', etc.

Now, obviously I can just use a hashing algorithm to get these numbers. However, it would be more useful if words with similar meanings had numerical values close to each other, so that 'good' becomes '6827' and 'great' becomes '6835', etc.

As another option, instead of a simple integer representing each word, it would be even better to have a vector made up of multiple numbers, eg (lexical_category, tense, classification, specific_word) where lexical_category is noun/verb/adjective/etc, tense is future/past/present, classification defines a wide set of general topics and specific_word is much the same as described in the previous paragraph.

Does any such algorithm exist? If not, can you give me any tips on how to get started on developing one myself? I code in C++.

A: 

Natural Language Processing is a broad and complex field. There are some tools out there (see Software Tools section of linked article), with the predominant one probably being NLTK.

I don't know of an easy answer, but that's a place to start.

Eric J.
+1  A: 

To map a word to a number, you should probably just use an index. Using hashcodes is just asking for trouble, since completely unrelated words could end up using the same value.

There are a number of ways to get a numerical measure of how semantically related words are, such as latent semantic analysis (LSA) or using some measure of relatedness within a lexical resource like WordNet (e.g. Lin, Resnik, or Jiang-Conrath).

To get what you're calling lexical categories, you'll need to use a part-of-speech (POS) tagger. The POS tags will also give you tense information (e.g., VBD means the word is a past-tense verb).

To assign words to topics, you could make use of hypernym information from WordNet. This will give you stuff like 'red' is a 'color'. Or, you could make use of Latent Dirichlet allocation (LDA), if you would like to have a softer assignment of words to topics such that each word can be assigned to numerous topics to varying degrees.

dmcer
+1  A: 

Your idea is interesting, if a bit naive (but no worries, naive questions are useful in the area of NLP).

Leaving other practical questions aside (e.g. parsing, POS tagging, stemming, and of course the very issue of identifying/mapping a given word; I touch on these, very briefly, below), there are several difficulties with the very principle of your suggestion [a numeric scale where semantically close words are coded in proximity]:

  • Polysemy (a fancy word for the fact that some words have multiple, unrelated meanings)
  • Semantics are multi-dimensional. For example the noun "gumption" conveys both an idea of "energy" and an idea of "enthusiasm"
  • Some concepts are completely unrelated to others. For example, 'tea' and 'carpet' belong to two entirely different sets of words; trying to place both on a single linear scale would implicitly void the idea that distance on that scale (other than perhaps very small distances) has any connection to semantics.
  • Expressions: within a sentence, a particular concept is sometimes carried by an expression rather than by the individual words. For example "Renaissance man" or "Table of Contents".
  • Semantics sometimes (often) come from context. For example, "boss" usually refers to somebody's supervisor, but it is also Bruce Springsteen's nickname.

In a nutshell
  a) meaning (or "definition", as called in the question, or "semantics" as called by linguists) is a tricky thing which doesn't lend itself to being mapped onto a line, or even a tree. Other graphs such as networks can be used, but even then things can get a bit tricky when applied beyond relatively restricted domains.
and
  b) associating words with meanings is also tricky because of polysemy, expressions etc.

Nevertheless, if you'd like to attempt the kind of mapping suggested in the question, perhaps within a specific domain (say, sports commentary or mechanical repairs) and/or with the understanding that some words will simply have to be mapped arbitrarily, then before diving in you may want to get familiar with the NLP (Natural Language Processing) disciplines mentioned above (parsing, POS tagging, stemming) and the resources built around them.

With regard to your interest in tools written in C++, you'll probably find several of these, for various purposes (and of various quality!). You may also find that, although they sometimes bind to primitives written in C/C++ for performance reasons, many modern NLP frameworks and tools tend to use Java or even scripting languages like Python. I have no direct experience with C++-based NLP software. If you do not find what you need in C++, I vehemently discourage you from trying to implement something yourself, at least until you have extensively reviewed prior art and have a good understanding of the underlying difficulties.

mjv
Thanks for the detailed answer. My application is too broad a concept to explain here, but suffice it to say that the input data would always be within an individual topic and with a small range of differentiation, so semantics and context won't be such an issue. I've already considered polysemy and it could definitely cause a problem, but I'll deal with that at the time. Otherwise, thanks for the info.
thornate
A: 

This is part of a more general problem called "Meaning Representation". I am interested in this problem, but the fact is that words are often too ambiguous to be represented as numbers. I think sentences might be a better candidate, because at least some context is present. Even then, representing text as numbers is more a research issue than a coding issue.

For words, as dmcer pointed out, LSA/PLSA/LDA will be your best bet if you really want to map words to numbers. In this case though, you will get real numbers, not integers. There is a large body of work on topic models and how semantically related words can be grouped together under a single topic (topic models are nothing but probabilistic clustering of words). Notably, LSA representation has been used in the past to model semantic memory (please google-scholar "Lemaire and Denhiere" for reference). However, as mjv indicated, the domain has to be restricted/specialized so that you can make sure the problem size does not get out of hand.

Finally, I personally think that there might be an underlying structure to words that you can use for representing them as numbers. Explicit representations of sentences, e.g. predicates, have their own problems related to the ordering of POS, clauses, etc. But words do not necessarily have to deal with these issues, so there might still be some hope. You might be interested in the following pointers:

  • Representation Theory
  • Universal Networking Language (language as a hypergraph of words, where sentences are hyperedges)
  • Kolmogorov Complexity and Representational Distortion
  • Group Theory and Graph Theory (there are many interesting representations that might be used)
  • A review of Number Theory (to see if particular categories of numbers can be associated with particular categories of words)

Risi Kondor's thesis is also interesting.

Shibamouli Lahiri
A: 

Yet another pointer is Machine Reading (especially the KnowItAll group at U Washington, and Hoifung Poon's homepage).

Shibamouli Lahiri