EDITED according to WayneH's grammar

Here's what I have in my grammar file:

grammar pfinder;

options {
  language = Java;
}
sentence
    : ((words | pronoun) SPACE)* ((words | pronoun) ('.' | '?'))
    ;

words 
    :   WORDS {System.out.println($text);};

pronoun returns [String value] 
    : sfirst {$value = $sfirst.value; System.out.println($sfirst.text + '(' + $sfirst.value + ')');}
    | ssecond {$value = $ssecond.value; System.out.println($ssecond.text + '(' + $ssecond.value + ')');}
    | sthird {$value = $sthird.value; System.out.println($sthird.text + '(' + $sthird.value + ')');}
    | pfirst {$value = $pfirst.value; System.out.println($pfirst.text + '(' + $pfirst.value + ')');}
    | psecond {$value = $psecond.value; System.out.println($psecond.text + '(' + $psecond.value + ')');}
    | pthird{$value = $pthird.value; System.out.println($pthird.text + '(' + $pthird.value + ')');};

sfirst returns [String value] :  ('i'   | 'me'  | 'my'   | 'mine') {$value = "s1";};
ssecond returns [String value] : ('you' | 'your'| 'yours'| 'yourself') {$value = "s2";};
sthird returns [String value] :  ('he'  | 'she' | 'it'   | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself') {$value = "s3";};
pfirst returns [String value] :  ('we'  | 'us'  | 'our'  | 'ours') {$value = "p1";};
psecond returns [String value] : ('yourselves') {$value = "p2";};
pthird returns [String value] :  ('they'| 'them'| 'their'| 'theirs' | 'themselves') {$value = "p3";};

WORDS : LETTER*;// {$channel=HIDDEN;}; 
SPACE : (' ')?;
fragment LETTER :  ('a'..'z' | 'A'..'Z');

And here's what I have in a Java test class:

import java.util.Scanner;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import java.util.List;

public class test2 {
    public static void main(String[] args) throws RecognitionException {
        String s;
        Scanner input = new Scanner(System.in);
        System.out.println("Enter a sentence: ");
        s=input.nextLine().toLowerCase();
        ANTLRStringStream in = new ANTLRStringStream(s);
        pfinderLexer lexer = new pfinderLexer(in);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        pfinderParser parser = new pfinderParser(tokenStream); 
        parser.pronoun(); 
    }
}

What do I need to put in the test file so that it will display all the pronouns in a sentence and their respective values (s1, s2, ...)?

+1  A: 

Fragments don't create tokens, and placing them in parser rules will not give the desired results.

On my test box, this produced (I think!) the desired result:

program :
        PRONOUN+
    ;

PRONOUN :
        'i'   | 'me'  | 'my'   | 'mine'
    |   'you' | 'your'| 'yours'| 'yourself'
    |   'he'  | 'she' | 'it'   | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself'
    |   'we'  | 'us'  | 'our'  | 'ours'
    |   'yourselves'
    |   'they'| 'them'| 'their'| 'theirs' | 'themselves'
    ;

WS  :   ' ' { $channel = HIDDEN; };

WORD    :   ('A'..'Z'|'a'..'z')+ { $channel = HIDDEN; };

In ANTLRWorks, the sample input "i kicked you" returned the tree structure: program -> [i, you].

I feel compelled to point out that ANTLR is overkill for stripping the pronouns out of a sentence; consider using a regular expression. Note also that this grammar is case-sensitive, and that expanding WORD to consume everything outside your dictionary of PRONOUNs (punctuation, etc.) may be a bit tedious, so the input will need to be sanitized first.
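For comparison, here is a minimal sketch of the regular-expression route in plain Java. It is only an illustration: the class name PronounRegex and the method findPronouns are made up for this example, and the word list is simply copied from the PRONOUN rule above. Matching is made case-insensitive with Pattern.CASE_INSENSITIVE, and \b word boundaries keep short pronouns such as "it" from matching inside longer words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original grammar or test class.
class PronounRegex {
    // Same dictionary as the PRONOUN lexer rule. The \b anchors force whole-word
    // matches, so "it" is not found inside "kitten" and "its" wins over "it".
    private static final Pattern PRONOUN = Pattern.compile(
        "\\b(i|me|my|mine|you|your|yours|yourself"
        + "|he|she|it|his|hers|its|him|her|himself|herself"
        + "|we|us|our|ours|yourselves"
        + "|they|them|their|theirs|themselves)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every pronoun in the sentence, lowercased, in order of appearance.
    static List<String> findPronouns(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = PRONOUN.matcher(sentence);
        while (m.find()) {
            found.add(m.group(1).toLowerCase());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [i, you, them]
        System.out.println(findPronouns("I kicked you and them."));
    }
}
```

Because the regex skips anything that is not a listed pronoun, punctuation and other words need no special handling here, which is the tedious part of the grammar approach.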

--- Edit: In response to the second OP:

  • I have altered the original grammar to make parsing easier. The new grammar is:

    grammar pfinder;
    
    
    options {
        backtrack=true;
        output = AST;
    }
    
    
    tokens {
        PROGRAM;
    }
    
    
    program :
            (WORD* p+=PRONOUN+ WORD*)*
            -> ^(PROGRAM $p*)
        ;
    
    
    PRONOUN :
            'i'   | 'me'  | 'my'   | 'mine'
        |   'you' | 'your'| 'yours'| 'yourself'
        |   'he'  | 'she' | 'it'   | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself'
        |   'we'  | 'us'  | 'our'  | 'ours' | 'yourselves'
        |   'they'| 'them'| 'their'| 'theirs' | 'themselves'
    ;
    
    
    WS  :   ' ' { $channel = HIDDEN; };
    
    
    WORD    :   ('A'..'Z'|'a'..'z')+;
    

I'll explain the changes:

  • Backtracking is now required to resolve the parser rule program. Perhaps there's a better way to write it that doesn't require backtracking, but this is the first thing that popped into my mind.
  • An imaginary token PROGRAM has been defined to group our pronouns.
  • Each matched PRONOUN is added to the ANTLR list variable $p and is rewritten into the AST under the imaginary token.
  • The interpreter code may now use a CommonTree to collect the matched pronouns.
  • The following is written in C# (I don't know Java), but I wrote it with the intent that you'll be able to read and understand it.

    static object[] ReadTokens( string text )
    {
        ArrayList results = new ArrayList();
        pfinderLexer Lexer = new pfinderLexer(new Antlr.Runtime.ANTLRStringStream(text));
        pfinderParser Parser = new pfinderParser(new CommonTokenStream(Lexer));
        // syntaxTree is imaginary token {PROGRAM},
        // its children are the pronouns collected by $p in grammar.
        CommonTree syntaxTree = Parser.program().Tree as CommonTree;
        if ( syntaxTree == null ) return null;
        foreach ( object pronoun in syntaxTree.Children )
        {
            results.Add(pronoun.ToString());
        }
        return results.ToArray();
    }
    
  • Calling ReadTokens("i kicked you and them") returns array ["i", "you", "them"]
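The list returned above contains only the pronoun text, while the question also asks for the values (s1, s2, ...). One possible way to attach them after parsing is a plain lookup table; the following is a sketch under that assumption, written in Java. The class PronounTagger and its methods are hypothetical names, and the groupings are copied from the sfirst ... pthird rules in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing step; it does not use the ANTLR runtime at all.
class PronounTagger {
    private static final Map<String, String> CODES = new HashMap<>();

    // Associates every pronoun in the group with the same code.
    private static void register(String code, String... pronouns) {
        for (String p : pronouns) {
            CODES.put(p, code);
        }
    }

    static {
        register("s1", "i", "me", "my", "mine");
        register("s2", "you", "your", "yours", "yourself");
        register("s3", "he", "she", "it", "his", "hers", "its",
                 "him", "her", "himself", "herself");
        register("p1", "we", "us", "our", "ours");
        register("p2", "yourselves");
        register("p3", "they", "them", "their", "theirs", "themselves");
    }

    // Appends "(code)" to each known pronoun; unknown words pass through unchanged.
    static List<String> tag(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            String code = CODES.get(w.toLowerCase());
            out.add(code == null ? w : w + "(" + code + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [i(s1), you(s2), them(p3)]
        System.out.println(tag(Arrays.asList("i", "you", "them")));
    }
}
```

Passing the whole sentence split into words instead of only the pronouns would give output along the lines of "i(s1) kicked you(s2)", since non-pronouns are returned untouched.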

Kivin
+1  A: 

In case you are trying to do some sort of high-level analysis of spoken/written language, you might consider using some sort of natural language processing tool. For example, TagHelper Tools will tell you which elements are pronouns (and verbs, and nouns, and adverbs, and other esoteric grammatical constructs). (THT is the only tool of that sort that I'm familiar with, so don't take that as a particular endorsement of awesomeness).

Gabe Johnson
+1  A: 

I think you need to learn more about lexer rules in ANTLR. Lexer rules start with an uppercase letter and generate tokens for the stream the parser will look at. Lexer fragment rules do not generate a token for the stream but help other lexer rules generate tokens. Look at the lexer rules WORDS and LETTER: LETTER is not a token, but it helps WORDS create a token.

Now, when a text literal is put into a parser rule (a rule whose name starts with a lowercase letter), that text literal is also a valid token that the lexer will identify and pass on (at least when you use ANTLR; I have not used any similar tools, so I can't answer for them).

The next thing I noticed is that your 's' and 'pronoun' rules appear to be the same thing. I commented out the 's' rule and put everything into the 'pronoun' rule.

The last thing is to learn how to put actions into the grammar; you have some in the 's' rule setting the return value. I made the pronoun rule return a string value so that, if you wanted the actions in your 'sentence' rule, you could easily accomplish your "-i pronoun" comment/answer.

Since I do not know exactly what results you want, I played with your grammar, made some slight modifications, reorganized it (moving what I thought were parser rules to the top and keeping all lexer rules at the bottom), and put in some actions that I think will show you what you need. There could be several different ways to accomplish this, and I don't think my solution is perfect for any of the results you might want, but here is a grammar I was able to get working in ANTLRWorks:

grammar pfinder;

options {
  language = Java;
}
sentence
    : ((words | pronoun) SPACE)* ((words | pronoun) ('.' | '?'))
    ;

words 
    :   WORDS {System.out.println($text);};

pronoun returns [String value] 
    : sfirst {$value = $sfirst.value; System.out.println($sfirst.text + '(' + $sfirst.value + ')');}
    | ssecond {$value = $ssecond.value; System.out.println($ssecond.text + '(' + $ssecond.value + ')');}
    | sthird {$value = $sthird.value; System.out.println($sthird.text + '(' + $sthird.value + ')');}
    | pfirst {$value = $pfirst.value; System.out.println($pfirst.text + '(' + $pfirst.value + ')');}
    | psecond {$value = $psecond.value; System.out.println($psecond.text + '(' + $psecond.value + ')');}
    | pthird{$value = $pthird.value; System.out.println($pthird.text + '(' + $pthird.value + ')');};

//s returns [String value]
//    :  exp=sfirst  {$value = "s1";}
//    |  exp=ssecond {$value = "s2";}
//    |  exp=sthird  {$value = "s3";}
//    |  exp=pfirst  {$value = "p1";}
//    |  exp=psecond {$value = "p2";}
//    |  exp=pthird  {$value = "p3";}
//    ;

sfirst returns [String value] :  ('i'   | 'me'  | 'my'   | 'mine') {$value = "s1";};
ssecond returns [String value] : ('you' | 'your'| 'yours'| 'yourself') {$value = "s2";};
sthird returns [String value] :  ('he'  | 'she' | 'it'   | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself') {$value = "s3";};
pfirst returns [String value] :  ('we'  | 'us'  | 'our'  | 'ours') {$value = "p1";};
psecond returns [String value] : ('yourselves') {$value = "p2";};
pthird returns [String value] :  ('they'| 'them'| 'their'| 'theirs' | 'themselves') {$value = "p3";};

WORDS : LETTER*;// {$channel=HIDDEN;}; 
SPACE : (' ')?;
fragment LETTER :  ('a'..'z' | 'A'..'Z');

I think this grammar will show you how to accomplish what you are trying to do, though it will require modification no matter what your end result is.

Good luck.

I think you only have to change one line in your test class, parser.pronoun(); to: parser.sentence();

You might want to change a few other things in the grammar as well:

SPACE : ' ';
sentence : (words | pronoun) (SPACE (words | pronoun))* ('.' | '?');

Then you might want to put a rule between sentence and words/pronoun.

WayneH
Thank you. I didn't know that Java code can be used in ANTLR. Another question: if I want to display everything, will I be able to do that? In the above code, when you input "i kicked you" it displays "i(s1)", so the other parts of the sentence are ignored. Is there a loop or something so that it displays the whole sentence, e.g. "i(s1) kicked you(s2)", or just "i(s1) you(s2)"?
XIII
If you are only getting the first word, the problem is probably in your control program. When I enter "i kicked you" in ANTLRWorks, it displays: i(s1)kickedyou(s2). In debug mode I did have to step all the way through. Can you maybe show the line in the pronoun rule for ssecond?
WayneH
I've edited my test program. I don't know why it only displays the first pronoun it finds.
XIII
In your Java test class, change the line parser.pronoun(); to parser.sentence(); and you should get all words printed, with the "s#" text printed after your pronouns. I did not do any formatting, so feel free to add code for that (printing each word on a separate line, that sort of thing). Good luck.
WayneH