I am porting code from Java to C#, and part of the Java code uses StringTokenizer. It is my understanding that the tokens StringTokenizer returns also include the separators themselves (in this case +, -, /, *, (, )). I have tried the C# Split() function, but it discards the separators. In the end, this will parse a string and evaluate it as a calculation. I have done a lot of research and have not found any references on the topic.

Does anyone know how to get the actual separators, in the order they were encountered, to be in the split array?

Code for tokenizing:

public CalcLexer(String s)
{
    char[] seps = {'\t','\n','\r','+','-','*','/','(',')'};
    tokens = s.Split(seps);
    advance();
}

Testing:

static void Main(string[] args)
{
    CalcLexer myCalc = new CalcLexer("24+3");
    Console.ReadLine();
}

The input "24+3" results in the tokens "24", "3". I am looking for "24", "+", "3".

In the nature of full disclosure, this project is part of a class assignment, and uses the following complete source code:

http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcParser.java.txt
http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcLexer.java.txt

A: 

Not easily, no. You may have to parse the string manually or look for a third party tokenizer library.
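For reference, the "parse the string manually" route might look something like the sketch below. It is only a minimal illustration: the class name ManualTokenizer and its Tokenize method are made up for this example and are not part of the original CalcLexer, and dropping whitespace instead of returning it as a token is an assumption about what the lexer wants.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the original CalcLexer.
public static class ManualTokenizer
{
    // Characters that should come back as tokens of their own.
    private const string Operators = "+-*/()";

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in input)
        {
            bool isOperator = Operators.IndexOf(c) >= 0;
            if (isOperator || char.IsWhiteSpace(c))
            {
                // A separator ends whatever token was being accumulated.
                if (current.Length > 0)
                {
                    tokens.Add(current.ToString());
                    current.Length = 0;
                }
                // Operators are kept as tokens; whitespace is dropped (assumption).
                if (isOperator)
                    tokens.Add(c.ToString());
            }
            else
            {
                current.Append(c);
            }
        }

        // Flush the final token, if any.
        if (current.Length > 0)
            tokens.Add(current.ToString());

        return tokens;
    }

    public static void Main()
    {
        // Prints 24, +, 3 on separate lines.
        foreach (string token in Tokenize("24+3"))
            Console.WriteLine(token);
    }
}
```

A single pass like this keeps the separators in the order they were encountered, which is what Split() throws away.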

Edit: I found an interesting article on tokenizing using Regex.Split. Perhaps that will help?

Odd that the link isn't working. It appears to be the underscores. Here's the full URL:

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes—A_Better_Tokenizer

Edit2: Got the link working; it was the long-dash in the title. Had to manually encode it. W00t

Randolpho
A: 

If you want a very flexible, powerful, reliable, and expandable solution, you can use the C# port of ANTLR. There is some initial setup overhead (the link covers setup for VS2008), which likely makes it too heavy for such a tiny project. Here's a calculator example with support for variables.

Probably overkill for your class, but if you're interested in learning about "real" solutions to this type of real-world problem, have a look-see. I even have a Visual Studio package for working with the grammars, or you can use ANTLRWorks separately.

280Z28
+1  A: 

You can use Regex.Split with zero-width assertions. For example, the following will split on +-*/:

Regex.Split(str, @"(?=[-+*/])|(?<=[-+*/])");

Effectively this says, "split at this point if it is followed by, or preceded by, any of -+*/". The matched string itself is zero-length, so you won't lose any part of the input string.
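As a rough sketch of how that could be wrapped up for the calculator case: the class name LookaroundTokenizer is invented for this example, and adding ( and ) to the character class is an assumption based on the separators listed in the question. A zero-width match at the very start of the input (for example before a leading parenthesis) yields an empty string in the result, so the sketch trims each piece and filters out the empty ones.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper for illustration; the name is not from any library.
public static class LookaroundTokenizer
{
    // Zero-width boundary before or after any operator or parenthesis.
    private static readonly Regex Boundary =
        new Regex(@"(?=[-+*/()])|(?<=[-+*/()])");

    public static string[] Tokenize(string input)
    {
        return Boundary.Split(input)
            .Select(t => t.Trim())      // strip whitespace around operands
            .Where(t => t.Length > 0)   // drop empty pieces from boundary matches
            .ToArray();
    }

    public static void Main()
    {
        // prints: ( 24 + 3 ) * 2
        Console.WriteLine(string.Join(" ", Tokenize("(24+3)*2")));
    }
}
```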

Pavel Minaev
+1  A: 

This produces your output:

string s = "24+3";
string seps = @"(\t)|(\n)|(\+)|(-)|(\*)|(/)|(\()|(\))";
string[] tokens = System.Text.RegularExpressions.Regex.Split(s, seps);

foreach (string token in tokens)
    Console.WriteLine(token);
Shane Cusson