tags:

views:

220

answers:

6

Hello everyone,

I am designing a text based game similar to Zork, and I would like it to able to parse a sentance and draw out keywords such TAKE, DROP ect. The thing is, I would like to do this all through the standard c++ library... I have heard of external libraries (such as flex/bison) that effectively accomplish this; however I don't want to mess with those just yet.

What I am thinking of implementing is a token based system that has a list of words that the parser can recognize even if they are in a sentence such as "take sword and kill monster" and know that according to the parsers grammar rules, TAKE, SWORD, KILL and MONSTER are all recognized as tokens and would produce the output "Monster killed" or something to that effect. I have heard there is a function in the c++ standard library called strtok that does this, however I have also heard it's "unsafe". So if anyone here could lend a helping hand, I would greatly appreciate it.

+3  A: 

The strtok function is from the C standard library, and it has a few problems. For example, it modifies the string in place and could cause security problems due to buffer overflows. You should instead look into using the IOStream classes within the C++ Standard Library as well as the Standard Template Library (STL) containers and algorithms.

Example:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>

using namespace std;

int
main()
{
    string line;

    // grab a line from standard input
    while (getline(cin, line)) {

        // break the input in to tokens using a space as the delimeter
        istringstream stream(line);
        string token;
        while (getline(stream, token, ' ')) {

            // convert string to all caps
            transform(token.begin(), token.end(), token.begin(), (int(*)(int)) toupper);

            // print each token on a separate line
            cout << token << endl;
        }
    }
}
Judge Maygarden
In fact stringstream is not, and never has been, part of the STL.
anon
Oh yeah, it's part of the iostream library, right?
Judge Maygarden
Historically, yes. As of now it is part of the C++ Standard Library - differentiating between the historical origins of the the Standard Library's components is not particularly useful.
anon
Thank you very much for all your help, I will look into the STL more deeply before I post an inaccurate question next time. Clearly you all know your c++!
CiM
@Neil Indeed, I have update the verbiage accordingly.
Judge Maygarden
+2  A: 

Depending on how complicated this language is to parse, you may be able to use C++ Technical Report 1's regular expression libraries.

If that's not powerful enough, then stringstreams may get you somewhere, but after a point you'll likely decide that a parser generator like Flex/Bison is the most concise way to express your grammar.

You'll need to pick your tool based on the complexity of the sentences you're parsing.

Ken Bloom
A: 

If you do want to code the parsing yourself, I would strongly recommend you to use "something like Lex/Yacc". In fact, I strongly recommend you to use Antlr. See my previously accepted answer to a similar question at http://stackoverflow.com/questions/2519715/what-language-should-i-use-to-write-a-text-parser-and-display-the-results-in-a-us/2520476#2520476


However, the best approach is probably to forget C++ all together - unless you have a burning desire to learn C++, but, even then, there are probably better projects on which to cut your teeth.

If what you want is to program a text adventure, then I recommend that you use one of the programming languages specifically designed for that purpose. There are many, see

You will probably decide on TADS, Inform or Hugo (my personal vote goes to TADS).

You might get good advice if you post to rec.arts.int-fiction explaining what you hope to achieve and giving your level or programming ability.

Have fun!

Mawg
huh? any reason for -1?
Mawg
A: 

For a naive implementation using std::string, the std::set container and this tokenization function (Alavoor Vasudevan) you can do this :

#include <iostream>
#include <set>
#include <string>

int main()
{
 /*You match the substring find in the while loop (tokenization) to 
 the ones contained in the dic(tionnary) set. If there's a match, 
 the substring is printed to the console.
 */

    std::set<std::string> dic;
    dic.insert("sword");
    dic.insert("kill");
    dic.insert("monster");

    std::string str = "take sword and kill monster";
    std::string delimiters = " ";    

    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        if(dic.find(str.substr(lastPos, pos - lastPos)) != dic.end())
            std::cout << str.substr(lastPos, pos - lastPos) 
                    << " is part of the dic.\n";
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }

    return 0;
}

This will output :

sword is part of the dic.
kill is part of the dic.
monster is part of the dic.

Remarks :

  • The tokenization delimiter (white space) is very (too) simple for natural languages.
  • You could use some utilities in boost (split,tokenizer).
  • If your dictionnary (word list) was really big using the hash version of set could be useful (unordered_set).

With boost tokenizer, it could look like this (this may not be very efficient):

boost::tokenizer<> tok(str);
BOOST_FOREACH(const std::string& word,tok)
{
    if(dic.find(word) != dic.end())
        std::cout << word << " is part of the dic.\n";
}   
anno
if you're using boost, look into Boost.Spirit
KitsuneYMG
Boost.spirit can be a major pain in the butt, because of all of the templates involved.
Ken Bloom
A: 

Unless your language is extremely simply you want to follow the steps of writing a parser.

  1. Write a formal grammar. By formal I don't mean to scare you: write it on a piece of napkin if it sounds less worrisome. I only mean get your grammar right, and don't advance to the next step before you do. For example:

    action := ('caress' | 'kill') creature

    creature := 'monster' | 'pony' | 'girlfriend'

  2. Write a lexer. The lexer will, given a stream, take one character at a time until it can figure out which token is next, and return that token. It will discard the characters that constitute that token and leave all other characters in the stream intact. For example, it can get the character d, then r, then o and p, figure the next token is a DROP token and return that.

  3. Write a parser. I personally find recursive descent parsers fairly easy to write, because all you have to do is write exactly one function for each of your rules, which does exactly what the rule defines. The parser will take one token at a time (by calling the lexer). It knows exactly what token it is about to receive from the lexer (or else knows that the next token is one of a limited set of possible tokens), because it follows the grammar. If it receives an unexpected token then it reports a syntax error.

Read the Dragon Book for details. The book talks about writing entire compiler systems, but you can skip the optimization phase and the code generation phase. These don't apply to you here, because you just want to interpret code and run it once, not write an executable which can then be executed to repeatedly run these instructions.

wilhelmtell
A: 

lexertl (http://www.benhanson.net/lexertl.html) is tailor made for this kind of problem. It is a header only library, very lightweight, completely self contained and you don't have to generate source code to use it! :-) As well as simply scanning for keywords you could validate token ordering using multiple lexer states if you wished.

Email me if you need help (benhanson2 at icqmail dot com).

Regards,

Ben

Ben Hanson