views:

267

answers:

6

Looking for parsers (in C#) for a bunch of formats (PHP, ASP, some XML-based formats, HTML, ... pretty much anything I can get my hands on).

So far we have:

HTML:

* Majestic-12
* Html Agility Pack
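
Of the two, Html Agility Pack is closest to the "ready to use" style asked for here. A minimal sketch of extracting the visible text from an HTML file with it (the class and method names are HAP's actual public API; the file name is just an example):

```csharp
// Pull the visible text out of an HTML file with Html Agility Pack,
// leaving the markup untouched. HAP is tolerant of malformed HTML.
using HtmlAgilityPack;

class TextExtractor
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("page.html");

        // Every text node that is not inside <script> or <style>.
        var nodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script or ancestor::style)]");
        if (nodes == null) return;   // SelectNodes returns null on no match

        foreach (HtmlNode node in nodes)
        {
            string text = node.InnerText.Trim();
            if (text.Length > 0)
                System.Console.WriteLine(text);
        }
    }
}
```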

I am having a hard time believing that these are the only free parsers for C# in existence, so I am adding a bounty to the question.

For my own needs (see below for details), it looks like I will have to roll my own, but I would still like to get a list of free parsers, if there are any.

Note that by parser, I mean parser. Not parser generator. Something ready to use, where you can just call .loadFile(FileName) and .next(item) without having to study the format's RFC, define the grammar, the terminal and non-terminal tokens, and whatnot.
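
As a sketch of the kind of interface meant here (the names are hypothetical; no real library is implied):

```csharp
// Hypothetical "ready to use" parser interface: nothing to configure,
// no grammar to define. Load a file, then iterate over its items.
public enum ItemKind { Code, Text, Comment }

public interface IReadyParser
{
    void LoadFile(string fileName);

    // Advances to the next item; returns false at end of file.
    bool Next(out ItemKind kind, out string content);
}
```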


Original question: The purpose is to separate the text from the code and do some edits without messing up the code.

I had a look at ANTLR, but while it seems like the "right tool", there is just too much prior knowledge assumed. I have an easier time writing a parser from scratch than understanding how to "easily" generate parsers from ANTLR. (I wrote a small parser for a specific type of RTF files within a couple days, so the task is probably within my reach, but as I have no formal knowledge of parsing/lexing, I am at loss with ANTLR)

Then it occurred to me that there must be existing parsers for many formats, so before I start writing yet another brand-new and potentially buggy version of the wheel, I figured I would check what parsers already exist and can be reused in a commercial product.

I could use parsers for just about every format in existence, so this question would be a good place to make a list of all existing free parsers written in C#, if there are any.

Thanks in advance for your suggestions.

=====

Edit: To clarify, I just need to identify strings that could potentially require translation and protect the rest. Not a full parser (although full parsers can be used in this context).

It is impossible to identify strings to be translated automatically, but looking at the problem backward, it is possible to identify the parts of a file which should never be translated. The idea here is to do as much preparation as possible automatically, and allow the user to run regexes on the result. Ideally, bring it to the point that the user can fix it manually with little effort. I am not going for an absolute solution, but for a practical one.
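
As a minimal sketch of this "backward" approach (the pattern and the protection markers are illustrative only, not PrepTags' actual rules): wrap every PHP code block in markers so that later passes, and the user's regexes, only ever touch the text outside them.

```csharp
using System.Text.RegularExpressions;

class Protector
{
    // Wrap each PHP block in <protect>...</protect> markers.
    // The markers and the pattern are illustrative, not PrepTags' format.
    static string ProtectPhp(string source)
    {
        return Regex.Replace(
            source,
            @"<\?(?:php)?[\s\S]*?(?:\?>|$)",   // a PHP block, possibly unterminated
            m => "<protect>" + m.Value + "</protect>");
    }

    static void Main()
    {
        string input = "<p>Hello</p><?php echo $x; ?><p>World</p>";
        System.Console.WriteLine(ProtectPhp(input));
        // <p>Hello</p><protect><?php echo $x; ?></protect><p>World</p>
    }
}
```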

For a better understanding of what I am doing and how, have a look at the video tutorials on www.preptags.com.

+2  A: 

HTML:

Andrew Lewis
+1  A: 

The Gardens Point Parser Generator generates a C# parser given a YACC-like language syntax.

Dour High Arch
+1  A: 

If what you want to do is to harness a large set of pre-existing language definitions to identify the text strings in those languages (and to have a foundation for building efficient text-string extractors for arbitrary other documents), you might want to look at SD Source Code Search Engine (SCSE).

The SCSE uses compiler technology (essentially big, mean versions of FLEX) to break source code files apart into their constituent tokens (keywords, operators, numbers, comments ... and text strings). The individual tokens are then indexed by file name, line, and column. The resulting index is used to enable lightning-fast searches over very large sets of source files, accommodating multiple languages. The SCSE has extractors for PHP, VB6, C#, COBOL and some 20 other computer languages including HTML and XML. Being built on top of the DMS Software Reengineering Toolkit, it is possible to add other document types easily using DMS's lexer generators.

[The DMS machinery is like ANTLR in basic capability, but goes far beyond ANTLR if your goal is to actually analyze and transform source code, but that's not relevant to your specifically proposed solution. The language definitions used for SCSE are used for many other purposes with DMS, and so they are tested by fire and extremely robust. And DMS is exactly one of those parser frameworks supporting a family of parsers I said would be difficult to find.]

The relevance to your task is that the extracted tokens fed to the SCSE indexer identify the token type (especially "string literal") and the precise location (start line/column, end line/column). This information appears to be precisely what you want.

The SCSE's output isn't documented and wasn't intended for this purpose, but that's a curable problem. Nor is it BSD-licensed, but you said you were interested in a commercial solution. It does run under Windows, and a C# based tool could easily read the results.

You can arguably do this with ANTLR's technology, too, but the existing ANTLR parsers don't produce the tokens directly ready for consumption for your purpose in the way the SCSE does. I'm unsure if ANTLR handles Unicode; the SCSE absolutely does. Similarly, you can do this with FLEX or any very strong regular expression compiler, but you won't get the large stable of robust language processors as a starting point.

I'm the architect behind DMS and the SCSE. If you have further interest contact me directly; see my SO bio.

Ira Baxter
@Ira: Thanks for your suggestions and interest. It would probably work, but I can't afford it for PrepTags. I sell my software licenses for €39 apiece, and the market is fairly small. I asked about BSD-licensed parsers because I can't use GNU or GPLed code (PrepTags is not open-source) and can't afford commercial products, as you can probably understand. The purpose of this question is to get a list of ready-to-use parsers for C# under a BSD or MIT license. If I can't find any, I will just write some.
Sylverdrag
A: 

Two previous entries on Stack Overflow that might be interesting:

http://stackoverflow.com/questions/1257268/good-parser-generator-think-lex-yacc-or-antlr-for-net-build-time-only

and here is a very simple one, check the second answer: http://stackoverflow.com/questions/673113/poor-mans-lexer-for-c

jgauffin
@jgauffin: this doesn't answer the question, but I give +1 because playing with Irony (from the first entry) made me realize that trying to use a formally built parser, following all the rules closely, will not work out well for me. Using the built-in SQL grammar from Irony on a "simple" SQL dump turned up a whole bunch of syntax errors. The parser was probably made for a different brand of SQL, but it made me realize that my purposes are somewhat the opposite of a compiler's. A compiler validates everything and fails loudly if there is the slightest problem. The user must fix his code.
Sylverdrag
On the other hand, for my purposes, the file is always right, even if it violates every official rule of the format. The file cannot and must not be changed; the parser must ignore any irregularity, do its best to prepare the file anyway, and keep quiet if it can't. While the general idea is the same, the purposes and requirements are virtually opposite. Looks like I will have to write my own parsers.
Sylverdrag
+1  A: 
jgauffin
Thanks for your suggestions. The question has not changed, actually. It has been reworded, but the bottom line was to get ready-to-use parsers from the beginning.
Sylverdrag
ok. I kept the old answer as a reference. Others might be interested in those links.
jgauffin
+1  A: 

There's also:

Coco/R:
http://en.wikipedia.org/wiki/Coco/R

GOLD:
http://en.wikipedia.org/wiki/GOLD_(parser)

code4life
@code4life: If I am not mistaken, these are parser generators, right? I am looking for parsers. Not parser generators.
Sylverdrag