ansaurus

Question

How to build a sentence parser using only the c++ standared library?

Answer 1

+3 A:

The strtok function is from the C standard library, and it has a few problems. For example, it modifies the string in place and could cause security problems due to buffer overflows. You should instead look into using the IOStream classes within the C++ Standard Library as well as the Standard Template Library (STL) containers and algorithms.

Example:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>

using namespace std;

int
main()
{
    string line;

    // grab a line from standard input
    while (getline(cin, line)) {

        // break the input in to tokens using a space as the delimeter
        istringstream stream(line);
        string token;
        while (getline(stream, token, ' ')) {

            // convert string to all caps
            transform(token.begin(), token.end(), token.begin(), (int(*)(int)) toupper);

            // print each token on a separate line
            cout << token << endl;
        }
    }
}

Judge Maygarden 2010-04-12 21:05:36

In fact stringstream is not, and never has been, part of the STL.

anon 2010-04-12 21:17:29

Oh yeah, it's part of the iostream library, right?

Judge Maygarden 2010-04-12 21:25:38

Historically, yes. As of now it is part of the C++ Standard Library - differentiating between the historical origins of the the Standard Library's components is not particularly useful.

anon 2010-04-12 21:34:13

Thank you very much for all your help, I will look into the STL more deeply before I post an inaccurate question next time. Clearly you all know your c++!

CiM 2010-04-12 22:22:00

@Neil Indeed, I have update the verbiage accordingly.

Judge Maygarden 2010-04-16 19:33:57

Answer 2

+2 A:

Depending on how complicated this language is to parse, you may be able to use C++ Technical Report 1's regular expression libraries.

If that's not powerful enough, then stringstreams may get you somewhere, but after a point you'll likely decide that a parser generator like Flex/Bison is the most concise way to express your grammar.

You'll need to pick your tool based on the complexity of the sentences you're parsing.

Ken Bloom 2010-04-13 02:46:35

Answer 3

A:

If you do want to code the parsing yourself, I would strongly recommend you to use "something like Lex/Yacc". In fact, I strongly recommend you to use Antlr. See my previously accepted answer to a similar question at http://stackoverflow.com/questions/2519715/what-language-should-i-use-to-write-a-text-parser-and-display-the-results-in-a-us/2520476#2520476

However, the best approach is probably to forget C++ all together - unless you have a burning desire to learn C++, but, even then, there are probably better projects on which to cut your teeth.

If what you want is to program a text adventure, then I recommend that you use one of the programming languages specifically designed for that purpose. There are many, see

http://www.brasslantern.org/writers/howto/chooselang.html
http://www.brasslantern.org/editorials/easyif.html
http://tim.oreilly.com/pub/a/onlamp/2004/11/24/interactive_fiction.html
or google for "i-f prohtamming language" (Interactive Fiction")

You will probably decide on TADS, Inform or Hugo (my personal vote goes to TADS).

You might get good advice if you post to rec.arts.int-fiction explaining what you hope to achieve and giving your level or programming ability.

Have fun!

Mawg 2010-04-13 03:07:18

huh? any reason for -1?

Mawg 2010-04-23 01:30:43

Answer 4

A:

For a naive implementation using std::string, the std::set container and this tokenization function (Alavoor Vasudevan) you can do this :

#include <iostream>
#include <set>
#include <string>

int main()
{
 /*You match the substring find in the while loop (tokenization) to 
 the ones contained in the dic(tionnary) set. If there's a match, 
 the substring is printed to the console.
 */

    std::set<std::string> dic;
    dic.insert("sword");
    dic.insert("kill");
    dic.insert("monster");

    std::string str = "take sword and kill monster";
    std::string delimiters = " ";    

    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        if(dic.find(str.substr(lastPos, pos - lastPos)) != dic.end())
            std::cout << str.substr(lastPos, pos - lastPos) 
                    << " is part of the dic.\n";
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }

    return 0;
}

This will output :

sword is part of the dic.
kill is part of the dic.
monster is part of the dic.

Remarks :

The tokenization delimiter (white space) is very (too) simple for natural languages.
You could use some utilities in boost (split,tokenizer).
If your dictionnary (word list) was really big using the hash version of set could be useful (unordered_set).

With boost tokenizer, it could look like this (this may not be very efficient):

boost::tokenizer<> tok(str);
BOOST_FOREACH(const std::string& word,tok)
{
    if(dic.find(word) != dic.end())
        std::cout << word << " is part of the dic.\n";
}

anno 2010-04-13 03:59:02

if you're using boost, look into Boost.Spirit

KitsuneYMG 2010-04-13 04:46:12

Boost.spirit can be a major pain in the butt, because of all of the templates involved.

Ken Bloom 2010-04-13 14:01:17

Answer 5

A:

Unless your language is extremely simply you want to follow the steps of writing a parser.

Write a formal grammar. By formal I don't mean to scare you: write it on a piece of napkin if it sounds less worrisome. I only mean get your grammar right, and don't advance to the next step before you do. For example:

action := ('caress' | 'kill') creature

creature := 'monster' | 'pony' | 'girlfriend'
Write a lexer. The lexer will, given a stream, take one character at a time until it can figure out which token is next, and return that token. It will discard the characters that constitute that token and leave all other characters in the stream intact. For example, it can get the character d, then r, then o and p, figure the next token is a DROP token and return that.
Write a parser. I personally find recursive descent parsers fairly easy to write, because all you have to do is write exactly one function for each of your rules, which does exactly what the rule defines. The parser will take one token at a time (by calling the lexer). It knows exactly what token it is about to receive from the lexer (or else knows that the next token is one of a limited set of possible tokens), because it follows the grammar. If it receives an unexpected token then it reports a syntax error.

Read the Dragon Book for details. The book talks about writing entire compiler systems, but you can skip the optimization phase and the code generation phase. These don't apply to you here, because you just want to interpret code and run it once, not write an executable which can then be executed to repeatedly run these instructions.

wilhelmtell 2010-04-13 04:44:01

Answer 6

A:

lexertl (http://www.benhanson.net/lexertl.html) is tailor made for this kind of problem. It is a header only library, very lightweight, completely self contained and you don't have to generate source code to use it! :-) As well as simply scanning for keywords you could validate token ordering using multiple lexer states if you wished.

Email me if you need help (benhanson2 at icqmail dot com).

Regards,

Ben

Ben Hanson 2010-10-26 15:58:35

ansaurus

tags:

views:

answers:

How to build a sentence parser using only the c++ standared library?

related questions