You should look up "abstract syntax trees" and "expression trees" as well as "lexical analysis", "syntax", "parse", and "compiler theory". Reading text input and getting meaning from it is quite difficult for most things (though we often try to make sure we have simple input).
The first step in generating a parser is to write down the grammar for your input language. In this case your input language is some Mathematical expressions, so you would do something like:
expr => <function_identifier> ( stmt )
( stmt )
<variable_identifier>
<numerical_constant>
stmt => expr <operator> stmt
(I haven't written a grammar like this {look up BNF
and EBNF
} in a few years so I've probably made some glaring errors that someone else will kindly point out)
This can get a lot more complicated depending on how you handle operator precedence (multiply and device before add and subtract type stuff), but the point of the grammar in this case is to help you to write a parser.
There are tools that will help you do this (yacc
, bison
, antlr
, and others) but you can do it by hand as well. There are many many ways to go about doing this, but they all have one thing in common -- a stack. Processing a language such as this requires something called a push down automaton, which is just a fancy way of saying something that can make decisions based on new input, a current state, and the top item of the stack. The decisions that it can make include pushing, popping, changing state, and combining (turning 2+3
into 5
is a form of combining). Combining is usually referred to as a production because it produces a result.
Of the various common types of parsers you will almost certainly start out with a recursive decent parser. They are usually written directly in a general purpose programming language, such as C. This type of parser is made up of several (often many) functions that call each other, and they end up using the system stack as the push down automaton stack.
Another thing you will need to do is to write down the different types of words and operators that make up your language. These words and operators are called lexemes and represent the tokens of your language. I represented these tokens in the grammar <like_this>
, except for the parenthesis which represented themselves.
You will most likely want to describe your lexemes with a set of regular expressions. You should be familiar with these if you use grep
, sed
, awk
, or perl
. They are a way of describing what is known as a regular language which can be processed by something known as a Finite State Automaton. That is just a fancy way of saying that it is a program that can make a decision about changing state by considering only its current state and the next input (the next character of input). For example part of your lexical description might be:
[A-Z] variable-identifier
sqrt function-identifier
log function-identifier
[0-9]+ unsigned-literal
+ operator
- operator
There are also tools which can generate code for this. lex
which is one of these is highly integrated with the parser generating program yacc
, but since you are trying to learn you can also write your own tokenizer/lexical analysis code in C.
After you have done all of this (it will probably take you quite a while) you will need to have your parser build a tree to represent the expressions and grammar of the input. In the simple case of expression evaluation (like writing a simple command line calculator program) you could have your parser evaluate the formula as it processed the input, but for your case, as I understand it, you will need to make a tree (or Reverse Polish representation, but trees are easier in my opinion).
Then after you have read the values for the variables you can traverse the tree and calculate an actual number.