Are there any good free parsing programs out there in python or java? I have been using a lot of textfiles recently and they are all different. I have been spending a lot of time writing code to parse these textfiles. I was wondering if there is some program that could get all the names of a person out of a textfile or parse the file based on a keyword.
Take a look at JavaCC.
From the JavaCC FAQ:
JavaCC stands for "the Java Compiler Compiler"; it is a parser generator and lexical analyzer generator. JavaCC will read a description of a language and generate code, written in Java, that will read and analyze that language. JavaCC is particularly useful when you have to write code to deal with an input language has a complex structure
ANTLR is pretty popular and even has an IDE to help you develop / test your grammars.
i think you are looking for something like apache lucene.
check this : http://lucene.apache.org/java/docs/index.html
Pyparsing is a good Python add-on module for plain text. Easy to get something going quickly, but has enough supporting components to do some pretty elaborate parsing work. See http://pyparsing.wikispaces.com, and check out the Examples page. (Plus it is very liberally licensed, so there are no restrictions or runtime encumberances.)
If the text has a known format, a grammar parser might be your best bet.
Gold Parser is open source and has both java and python support, among others. http://www.devincook.com/goldparser/
Lepl - http://www.acooke.org/lepl - is a general-purpose, recursive descent parser for Python that I maintain.
It's similar to pyparsing, in that both are parsers that you write directly in Python. Here's an example that parses and evaluates an arithmetic expression:
>>> from operator import add, sub, mul, truediv
>>> # ast nodes
... class Op(List):
... def __float__(self):
... return self._op(float(self[0]), float(self[1]))
...
>>> class Add(Op): _op = add
...
>>> class Sub(Op): _op = sub
...
>>> class Mul(Op): _op = mul
...
>>> class Div(Op): _op = truediv
...
>>> # tokens
>>> value = Token(UnsignedFloat())
>>> symbol = Token('[^0-9a-zA-Z \t\r\n]')
>>> number = Optional(symbol('-')) + value >> float
>>> group2, group3 = Delayed(), Delayed()
>>> # first layer, most tightly grouped, is parens and numbers
... parens = ~symbol('(') & group3 & ~symbol(')')
>>> group1 = parens | number
>>> # second layer, next most tightly grouped, is multiplication
... mul_ = group1 & ~symbol('*') & group2 > Mul
>>> div_ = group1 & ~symbol('/') & group2 > Div
>>> group2 += mul_ | div_ | group1
>>> # third layer, least tightly grouped, is addition
... add_ = group2 & ~symbol('+') & group3 > Add
>>> sub_ = group2 & ~symbol('-') & group3 > Sub
>>> group3 += add_ | sub_ | group2
... ast = group3.parse('1+2*(3-4)+5/6+7')[0]
>>> print(ast)
Add
+- 1.0
`- Add
+- Mul
| +- 2.0
| `- Sub
| +- 3.0
| `- 4.0
`- Add
+- Div
| +- 5.0
| `- 6.0
`- 7.0
>>> float(ast)
6.833333333333333
>>> 1+2*(3-4)+5/6+7
6.833333333333333
The main advantages of Lepl over pyparsing are that it's slightly more powerful (it can compile itself to regular expressions in places for speed, handle left recursive grammars, uses trampolining to avoid running out of stack space). The main disadvantages are that it's younger than pyparsing, so doesn't have the same number of users or as large and supportive a community.