ansaurus

Question

Answer 1

A:

Hash tables.

jeffamaphone 2009-06-23 16:23:51

Your absolutely right, but you should elaborate, so others who are unaware of the technique can learn something from you.

Matthew Vines 2009-06-23 16:34:20

Answer 2

+1 A:

Most parsing like program text parsing is done with formal grammar parsers. English and most spoken languages are not formal grammars and you're going to have a very hard time parsing them. This problem has tied up PHDs for decades without a lot of success.

2009-06-23 16:35:19

I don't think that's the point of this question. He's already come up with a simplified version of English grammar which is formal enough to parse. His question is more about how to create a parsing program in general.

Sean Nyman 2009-06-23 16:48:30

Yes the question was. How do I parse this grammer.

Martin York 2009-06-23 17:15:24

Answer 3

A:

Before you get too far writing a parser, might I suggest learning the pair of lex and yacc or flex and bison? These tools were specifically made for creating parsers and lexers.

They will automatically generated your c/c++ (maybe other) code, so you will not have to worry about all sorts of boundary cases for user arguments. You could whip up the grammar you have above in under 30 minutes.

As for your questions:

For the word functions (noun, verb, etc.) how should I go about checking if they are true? (as in checking if the user's input has birds, fish, fly, swim, etc.)

A generous use of strcasecmp() is called for here, with all sorts of error checking as well.

How should I handle the conjunction call and the output?

I don't really understand too much what you mean here. I'd just return some sort of sentinel value if its valid or not.

Should I handle the output from the main function or the call functions?

Mostly from the call functions, since these have the individual functionality you're concerned with. main() is just glue to hold it together.

None of the above questions matter if my psuedo code is completely wrong. Is there anything wrong with the basics?

It looks doable the way you have, but you will save yourself a huge headache by switching to lex/yacc or flex/bison

samoz 2009-06-23 16:38:55

Why was this double downvoted?

samoz 2010-05-20 13:00:00

Answer 4

A:

Use Flex and Bison:

Grammer Rule in Bison:

%%

English            :   SentenceList

SentenceList       :   Sentence
                   |   Article  Sentence
                   |   Sentence Conjunction Sentence

Sentence           :   Noun Verb


Conjunction        :   TOKEN_WordAnd
                   |   TOKEN_WordOr
                   |   TOKEN_WordBut


Noun               :   TOKEN_WORD_BIRDS
                   |   TOKEN_WORD_FISH
                   |   TOKEN_WORD_CPP

Verb:              :   TOKEN_WORD_RULES
                   |   TOKEN_WORD_FLY
                   |   TOKEN_WRD_SWIM

Article            :   TOKEN_WORD_THE
%%

Martin York 2009-06-23 16:40:33

Why the down vote. I can think of a better way to parse a well defined grammer! Use the corect tool for the job.

Martin York 2009-06-23 17:17:40

Maybe because the OP wasn't looking for a parser - he was looking for help in writing one...

DJ 2009-06-23 17:27:38

Maybe. But this is writting one.

Martin York 2009-06-23 20:05:45

Having a tool take your grammar and generating a parser isn't quite the same at all as writing a parser.

McPherrinM 2009-06-24 00:02:48

I would argue it is: The only difference is that you do not need to set with paper and pencil for 10 hours generating a state table. The hard part is defining a valid grammer and that is done by you not the tools.

Martin York 2009-06-24 01:00:04

Answer 5

+2 A:

Okay so... I won't have answers to your specific questions, but I want to point you to some general ideas to think about when working on this. First of all, parsing is hard. You have a simple grammar, but it can STILL be hard. Compiler front-ends are responsible for parsing... just to give some context.

There are two basic types of parsing... top down parsing and bottom up parsing. They are named by how they traverse a syntax tree (Think about what sort of syntax tree is going to be created for possible constructs). Top down parsing is easy, and will probably work for what you want to do. The most common top-down parsing method is Recursive Descent parsing: http://en.wikipedia.org/wiki/Recursive_descent_parser

However, to use recursive descent parsing, your grammar must be in a certain format... for some grammars, it is impossible to recursive descent parse them. You should be able to reform your grammar to fit this though.

Top-down parsers are easy to write... as you will only need a few functions for a small language.

The second way to parse is bottom-up parsing. This is commonly used in compilers as it does not have the restrictions top-down has. It is also easier to do error-reporting if the given string does not fit the language.

Bottom-up parsers are HARD to write... most people use a parser generator to do the work. I've worked with YACC quite a bit. You basically input the grammar (and actions to take when that rule is matched) and it parses the grammer.

Bottom-up parsers use something called shift-reduce parsing. This is the method of processing the input and matching the rules.

Looking at your code again, I'd say it is possible for you to use a top-down parser. Sorry I can't give specific code, but a google of top-down parser examples (or recursive descent parser examples) would probably turn up the code you need.

Polaris878 2009-06-23 16:52:02

The example from the chapter (a calculator example) used bottom-up parsing for an order of operations grammar. However, this seems different since it isn't checking for nouns and verbs in a strict ascending or descending order, but is rather checking to see "if x is true, make sure y is also true." It pretty much works top-down, but has a few twists. I'm going to post some unfinished code in a second to show what I mean.

Alex 2009-06-23 17:27:08

Answer 6

A:

You could take a loot to Ubiquity, it's a plugin for Firefox that aims to use natural language to do common web tasks (it's written in JavaScript but maybe you can get a general algorithm that would be usefull)

Eliseo Ocampos 2009-06-23 17:11:53

Answer 7

+4 A:

Ok. If you really want to do it by hand :-(

There are two parts to this problem:

Lexical analysis
Syntactic Analysis.
We can ignore the Symantic analysis as this is why up ther.

First you have tokenize the input stream into resonable tokens. Words would be an obvios choice but that would leave a lot of work for the syntactic phase. So I would group up your words into the following types (Conjunction,Noun,Verb,Article) and then write a lexer to return the correct Lexems.

Lexer.cpp
enum Lexeme { END,Conjunction,Noun,Verb,Article };
Lexem getNextLexme(std::istream in)
{
    std::string word;
    in >> word;

    if (!in) {return END;}

         if (word == "and")   return Conjunction;
    else if (word == "birds") return Noun;
    else if (word == "fly")   return Verb;
    else if (word == "the")   return Article;

    ... etc
}

So now you can wrtie your syntactic parser in terms of the simplified token stream.

bool ParseSentence(std::istream in)
{
    Lexeme token = getNextLexme(in);
    switch(token)
    {
        case Noun:    if (!parseVerb(in))
                      {    return false;
                      }
                      return parseConjunctionOrEnd(in);
        case Article: return ParseSentence();
        case END:     return true;
    }
}
bool parseVerb(std::istream in)
{
    Lexeme token = getNextLexeme(in);
    if (token != Verb) { /*ERROR*/ return false;}
    return true;
 }
 // etc

Your other option with syntactic analyasis is to build a state table. But that involves hand analysing the grammer and determing the states. This should only be attepted with the most trivial of grammers, anything bigger than you have here should be left to tools that can generate the state table auto-magically.

So Assuming the grammer I defined in my original post below:
And hoping I get it correct as I am not an inflable tool :-)

State 1:   Start <Nothing Happened>
               Article -> State 2
               Noun -> State 3
               Otherwise Error
State 2:   Seen Article.
               Noun -> State 3
               Otherwise Error
State 3:   Seen Noun  in Sentence.
               Verb -> State 4
               Otherwise Error
State 4:   Seen Noun Verb
               End -> State 5
               Conjunction -> State 1
State 5:   Finished:

State 0:   Error State.


int stateTable[][]    // CurrentState,CurrentObject
           = {/*State 0: Error State:*/{},
                                       // END,Conjunction,Noun,Verb,Article 
              /*State 1: Start*/       {  0,  0,          3,   0,   2},
              /*State 2: Article*/     {  0,  0,          3,   0,   0},
              /*State 3: Noun*/        {  0,  0,          0,   4,   0},
              /*State 4: Noun Verb*/   {  5,  1,          0,   0,   0},
              /*State 5: End*/         {}
             };

bool parseSentence(std::iostream& in)
{
    int currentState = 1;
    while((currentState != 0) && (currentState != 5))
    {
        int token = getNextLexme(in);
        currentState = stateTable[currentState][token];
    }
    return currentState == 5;
}

Martin York 2009-06-23 23:30:42

The exercise said not to bother with tokens and just to read into a string. However, this has definitely put me on the right track. I'll comment and edit the post if I get it.

Alex 2009-06-23 23:46:58

@Alex... This is about as good of an answer you will get without writing the whole thing for you. Combine this answer with Barry Brown's answer and you have a fairly complete solution. Martin's answer is giving you the code, and pointing you in the right direction to top-down recurse the grammar. It is definitely better to tokenize the input stream as Martin has done. HOwever, to parse like this, you need to re-form your grammar as Barry has suggested. Make sure that it fits the requirements for top-down parsing.

Polaris878 2009-06-24 00:22:50

The exercise specifically says not to use tokens, so I'm going to try to pull out a few more tricks before I resort to that. Also I'm definitely going to reform the grammar, however creating classes seems unnecessary (unless I'm using tokens.)

Alex 2009-06-24 00:31:16

@Alex... you can pull this exercise off without tokens, but it does not complicate it much and it is definitely a cleaner/easier route to go.

Polaris878 2009-06-24 00:36:10

I know I've got the code down except for one major flaw. The sentence function() needs to be able to read every word input rather than just one at a time. Is tokenization the only way to accomplish this? If so, damn you Stroustrup!

Alex 2009-06-24 00:41:11

I'm using a getline function to save all of the words input by the user. It's now only outputting 1 result, but it always outputs "Not Ok." Not sure why....

Alex 2009-06-24 01:31:02

Answer 8

A:

Parts of Speech

To get parts of speech you will need a dictionary list with parts of speech. Besides a hashtable mapping words to lists of parts of speech, another possible way to check for part(s) of speech is to load each set of words for each part of speech into its own Bloom filter (sort of a compressed hashed map from strings to booleans).

Jared Updike 2009-06-24 00:10:03

Answer 9

+2 A:

I'm intrigued by this question. I'm going to help the OP, Alex, cook something up, but since I don't know C++ very well, it'll be in pseudo-C++. It won't use lex/yacc, either, because Alex wants to learn how they're made. Tools such as lex and yacc become "black boxes" if you use them. I don't have time to put it all together right now, but I can work on it piece by piece over a few hours. I just wanted to get this started now.

The first thing we need to do is clean up the grammar. You have a sentence defined this way:

sentence : noun verb
         | article sentence
         | sentence conjunction sentence

This grammar is flawed. A sentence such as "a the fish swim" is valid because sentence is defined in terms of itself. Recursion is a normal part of grammars, but it needs to be handled correctly. I'm going to hazard a guess that you don't want two or more articles to appear in a row.

A better grammar for sentence could be:

sentence : clause conjunction clause
         | clause

clause : nounphrase verbphrase

nounphrase : noun
           | article noun

This removes the unconstrained recursion which could cause infinite loops.

Now we're ready to tackle the parser itself. Since this is C++, we might as well make it object-oriented. I gotta scoot for now, but I'll give you a hint: there's going to be a class for each one of the production rules.

Barry Brown 2009-06-24 00:13:21

"Since this is C++, we might as well make it object-oriented. [...] there's going to be a class for each one of the production rules." Somehow, this says a lot about what's wrong with the state of the world today. ;-)

ShreevatsaR 2009-06-24 00:21:32

Thanks for the reformed grammar. I'm going to have to use classes if I tokenize it, but there's got to be a solution not involving tokens. I'm going to try a few more tricks (maybe make a vector of strings?) and if I can't figure it out I'll resort to tokenization.

Alex 2009-06-24 00:34:50

Yup, I'm going to tokenize. But it's late now so I'll work on it tomorrow morning. However, don't I just need a token class to get the inputs? I don't see any particular reason to make a class for every rule.

Alex 2009-06-24 05:22:10

Answer 10

A:

One aspect of grammars for natural languages is that they are often ambigious. For example, the english sentance:

I once shot an elephant in my pajamas. How he got in my pajams I'll never know
-- Groucho Marx

The phrase 'in my pajamas' ambigiously describes The subject 'I' or the object 'elephant'. Without semantic context it would be impossible to build an AST correctly.

If you wish to avoid this, you will probably need some way of treating ambiguity in a useful way. One strategy is to produce all possible derivations of ambiguous phrases. One tool that makes this possible is an Earley Parser. Unlike other types of parser, such as recursive descent parsers, Earley parsers generate all derivations in the form of parser state transitions, rather than a simple tree. In practice this isn't any harder to work with.

TokenMacGuy 2009-06-24 00:14:59

It's hard to wreck a nice beach.

Barry Brown 2009-06-24 04:31:58

Another ambiguity in natural languages is words that can be one of two or more types. For instance, these nouns/verbs: fly, call, stack, wreck. And this noun/verb/adjective: light. I think you have your work cut out for you. :-)

RobH 2009-06-25 19:04:02

ansaurus

tags:

views:

answers:

Very basic English grammar parser

related questions