ansaurus

Question

How to make concept representation with the help of bag of words

Answer 1

A:

I think you might be thinking of Generative Grammars, but I'm not too sure.

msw 2010-03-04 03:26:54

"Generative grammar" doesn't refer to "generating sentences"; it refers to the idea that language is hierarchical and can be divided into constituents based on a set of rules. Those are rules about syntactic categories, not individual words.

rascher 2010-03-18 19:58:51

Answer 2

+2 A:

Most successful linguistic parsers today are statistically based, and this is (for example) how Google Translate works. What you do is get a large semantically marked-up corpus and start walking the word chart. The set of linguistically valid English sentences is larger than what a generative grammar (an older approach) covers, but a large corpus will get you a huge number of viable sentence templates. You can make sentences from your bag with any data traversal technique, from a random walk to genetic algorithms. Let us know what you do!

Here's a great set of resources to start: Stanford statistical natural language processing and corpus-based computational linguistics resources

In response to OP comment below: To generate a sentence you must have an abstract representation of valid sentences. A simple example is SUBJECT VERB OBJECT in generative grammar. You might also get SUBJECT VERB ADJECTIVE OBJECT. The problem is that you can fill it out with grammatically correct nonsense, such as "I ate hungry apple." What statistical analysis will tell you is that "hungry apple" is a combination you almost never see--it's very unlikely to appear in real English (your corpus), and so without even having to know the meaning I can eliminate that as a possible sentence. If you were writing a grammar checker you might underline that word pair as being questionable.
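
A minimal sketch of that word-pair check, assuming a tiny corpus invented here for illustration (a real corpus would be far larger, and `bigram_counts` and `questionable_pairs` are hypothetical helper names). It counts adjacent word pairs and flags any pair in a candidate sentence that the corpus has never seen:

```python
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "i ate a red apple",
    "the hungry man ate a red apple",
    "a hungry person will eat an apple",
    "the red apple fell",
]

def bigram_counts(sentences):
    """Count how often each adjacent word pair occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def questionable_pairs(sentence, counts, min_count=1):
    """Return the adjacent pairs seen fewer than min_count times."""
    words = sentence.lower().split()
    return [pair for pair in zip(words, words[1:]) if counts[pair] < min_count]

COUNTS = bigram_counts(CORPUS)
print(questionable_pairs("I ate hungry apple", COUNTS))
```

With a corpus this small almost any unseen pair gets flagged, so the cutoff `min_count` only becomes meaningful once the counts come from a large corpus.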

Since you are writing a sentence generator, you would just need to reverse that process. One simple possibility is to generate a large set of random combinations of the words and then check them against your database to see if the word chains all meet a certain threshold of likelihood, such as 80%. Another option is to treat individual word chains as genes in a genetic algorithm; after a few generations, chains like "hungry apple" will die out in favor of more successful genes like "red apple." With a small "word bag" like the one you mentioned you don't need to get that fancy; you can probably test every possible sentence with numwords < n with no problem. You only need to get fancy in your sentence search algorithm when your word bag is too large to search exhaustively.
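
The exhaustive version for a small word bag might look like the sketch below. The scores in `PAIR_SCORE` are made-up numbers standing in for whatever likelihood your corpus statistics give you, the word bag is hypothetical, and `max_words` is an inclusive length limit chosen for this example:

```python
from itertools import permutations

# Hypothetical pair scores in [0, 1]; in practice these come from a corpus.
PAIR_SCORE = {
    ("person", "eats"): 0.9,
    ("eats", "red"): 0.85,
    ("red", "apple"): 0.95,
    ("eats", "apple"): 0.8,
    ("hungry", "person"): 0.9,
    ("hungry", "apple"): 0.01,
}

def plausible_sentences(word_bag, max_words, threshold=0.8):
    """Try every ordering of 2..max_words words from the bag and keep the
    ones whose adjacent pairs all score at or above the threshold."""
    results = []
    for length in range(2, max_words + 1):
        for words in permutations(word_bag, length):
            pairs = zip(words, words[1:])
            # Pairs missing from the table count as never seen (score 0).
            if all(PAIR_SCORE.get(pair, 0.0) >= threshold for pair in pairs):
                results.append(" ".join(words))
    return results

bag = ["person", "eats", "red", "hungry", "apple"]
for sentence in plausible_sentences(bag, max_words=4):
    print(sentence)
```

The number of orderings grows factorially with the size of the bag, which is the point where a random or genetic search starts to pay off.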

The link above does have several marked-up corpora you can download and use, as well as plenty of sample programs for marking up corpora of your own. But you do want to keep it simple if this is just a project of idle curiosity. Let me make another suggestion--one of the largest corpora available is Google's index of the web. Any sentence or phrase you put in quotes in a Google search will return a number of hits. "red apple" returns over a million hits, for example, whereas "hungry apple" returns a mere 11,000. You can use these hit counts to build a small statistical markup for the validity of your sentences with a small word bag. If the statistical process turns out to be too complicated for you to implement, consider instead marking up your word bag with parts of speech (research part-of-speech markup) and providing your program with a variety of abstract sentence templates--you will still get sentences like "A person will eat a hungry apple" but depending on your needs that may be enough. :)
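
For the template fallback, here is a rough sketch. The tag names, the hand-tagged word bag and the two templates are all assumptions made up for this example, not taken from any particular tag set:

```python
from itertools import product

# Hypothetical word bag, tagged by hand with parts of speech.
TAGGED_BAG = {
    "DET": ["a"],
    "ADJ": ["hungry", "red"],
    "NOUN": ["person", "apple"],
    "VERB": ["eats"],
}

# Abstract sentence templates expressed as sequences of tags.
TEMPLATES = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]

def fill_templates(templates, tagged_bag):
    """Fill each template slot with every word carrying the matching tag."""
    sentences = []
    for template in templates:
        choices = [tagged_bag[tag] for tag in template]
        for words in product(*choices):
            sentences.append(" ".join(words))
    return sentences

for sentence in fill_templates(TEMPLATES, TAGGED_BAG):
    print(sentence)
```

As expected, this happily produces "a person eats a hungry apple" alongside the sensible sentences, since the templates know nothing about which word pairs actually occur together.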

P.S. Without the word "an" in your word bag you look limited to Tarzan grammar and the world of man-eating apples :)

Plynx 2010-03-05 02:55:06

Thanks for your answer. I have some knowledge of linguistic techniques (LSI, VSM, etc.) and of genetic algorithms too, but I couldn't understand your answer very well. Is there a sample marked-up corpus that I can look at? As for Google Translate, it just sees the words and translates them with the help of a dictionary. If you put my string into Google, it will not generate a sentence. Can you please explain it further with an example?

2010-03-06 20:09:49
