I'm trying to come up with a way to go through about a million formal documents (for argument's sake, say they are thesis documents). They are not all standardized, but close enough: they consist of titles, sections, paragraphs, etc. Subtle differences do crop up; for example, in English we call a title "Title", but in French it is "Titre".
Thus, in my mind, the best way to do this would be to create an EBNF grammar that covers all the possible variants, e.g. Title := "Title" | "Titre".
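To make this concrete, here is a rough sketch of how that alternation might be expressed with Irony (one of the tools I mention below), which lets you define the grammar directly in C#. The class and rule names are just placeholders, and the exact API may differ between Irony versions:

```csharp
using Irony.Parsing;

// Rough sketch only: the grammar is a placeholder, and the exact Irony API
// may differ slightly between versions.
public class ThesisGrammar : Grammar
{
    public ThesisGrammar() : base(caseSensitive: false)
    {
        // "Title" in English, "Titre" in French -- both accepted for the same rule,
        // and matched case-insensitively thanks to the base constructor argument.
        var titleKeyword = new NonTerminal("titleKeyword");
        titleKeyword.Rule = ToTerm("Title") | "Titre";

        // The rest of the document structure (sections, paragraphs, ...) would be
        // built up the same way; this only shows the keyword alternation.
        Root = titleKeyword;
    }
}
```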
I'm not too concerned with coming up with the EBNF; my main concern is how to achieve the parsing. I've looked at ANTLR, OSLO, Irony and a slew of others, but I don't have the expertise in them to judge whether they would be right for my task.
So, my questions to the learned among you are:
- Which DSL tool would you recommend for parsing documents on this scale?
- Which DSL tool is the most accurate at parsing yet forgiving in its matching? (i.e. do we have to define explicit rules for uppercase vs. lowercase, Arabic vs. Roman numerals, and foreign languages such as French? See the rough sketch after this list for the kind of leniency I mean.)
- Is there a process/algorithm that I have not considered that you would recommend as an alternative to a DSL? (Rewriting a parser from scratch is an option, but I would like to get something working quickly.)
- Has anyone attempted to add learning and intelligence to the algorithms used for parsing with these DSLs (think genetic algorithms and neural networks)?
- Would you use these DSL tools in a production environment?
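To illustrate what I mean by "forgiving" matching in the bullets above, this is the kind of lenient, hand-rolled matching (case-insensitive keywords, Arabic or Roman section numbers, English or French labels) that I would rather not have to reinvent; the pattern and group names are made up purely for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration of "forgiving" matching: the heading keyword may be
// English or French, in any case, and the section number may be Arabic or Roman.
public static class ForgivingMatcher
{
    private static readonly Regex SectionHeading = new Regex(
        @"^\s*(?<keyword>title|titre|chapter|chapitre|section)\s*" +
        @"(?<number>\d+|[IVXLCDM]+)?\s*[:.\-]?\s*(?<text>.*)$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static void Main()
    {
        foreach (var line in new[] { "TITRE : Ma thèse", "Chapter IV - Results", "section 2: Method" })
        {
            Match m = SectionHeading.Match(line);
            if (m.Success)
            {
                Console.WriteLine(m.Groups["keyword"].Value + " | " +
                                  m.Groups["number"].Value + " | " +
                                  m.Groups["text"].Value);
            }
        }
    }
}
```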
My development platform of choice is C#. I mention this because, ideally, I would like to integrate the DSL tool into our code so that we can work with it from existing apps.
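For example (again assuming Irony and the hypothetical ThesisGrammar sketched above), calling the parser from an existing application would look roughly like this:

```csharp
using System;
using Irony.Parsing;

public static class ExistingApp
{
    public static void Main()
    {
        // Build the parser once and reuse it across all documents.
        var parser = new Parser(new ThesisGrammar());

        ParseTree tree = parser.Parse("Titre");   // trivial input, for illustration only
        Console.WriteLine(tree.HasErrors() ? "parse failed" : "parse ok");
    }
}
```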