views:

101

answers:

3

Hi,

In the lack of any good free XPath 2.0 implementations for .Net build upon Linq to XML I have thought about implementing my own (also for the experience). But just to be clear (and not building something that exists) these are the XPath 2.0 implementations I have found:

  • Saxon .Net
  • Query Machine - I had problems with this - exceptions with the examples
  • XQSharp - may be good, but is commercial (single developer ~300 $)

Now, I want some thoughts on how difficult it is to implementing some language such as XPath 2.0 expressions. I have found this link which have a EBNF for XPath 2.0 expression: http://www.w3.org/TR/2007/REC-xpath20-20070123/#id-grammar and I'm thinking of making it in F# with the fslex/fsyacc combination.

My background (subjective): I have played with these tools before, but only for some simple expressions and a very simple programming language. Furthermore, I have read most of the Dragon book and Appel´s Modern compiler implementation in ML - but unfortunately, I have not put the theory in practice while reading. I've studied computer science in a year now where I have completed courses with theory about ex finite automaton, CFL and algorithms but I have been a developer for years before university (a few years with professional jobs - back-end of websites mainly).

Now, the steps of parsing and what I tend to cover:

  1. Lex - Parsing - Reductions: FsLex/FsYacc. I will properly not cover ALL of Xpath 2.0 at first but at least all of what XPath 1.0 can do + a little more.
  2. Sematic analysis - I'm not sure about how much there is to this
  3. Optimization - I do not tend to cover this (at least not at first)
  4. Actual traversing etc.
  5. ...?

Now, the concrete questions in addition to the above:

  1. How difficult is it to make a parser of this size? based on my background, would I could to it?
  2. Is there any crucial steps I have missed in regards to XPath 2.0 in particular?
  3. Is there any technology I have missed; Do I have to cover more than just XPath 2.0 and XDocument etc. to be able to make the parser?

To be clear: I want to make a XPath 2.0 expression parser and traverse XDocument etc. with this parsed expression. Which I guess combined is a query engine.

Update: I found this: http://www.w3.org/2007/01/applets/xpathApplet.html which contains code to parsing and traversing. I think it would be a nice start or reference :-)

Your answers will be appreciated.

+2  A: 

To address your third concrete question, the Dragon Book makes no mention of Parsing Expression Grammars (PEGs)/Packrat parsers/parser combinator libraries, which are quite the rage now, especially when it comes to functional languages. See FParsec, for example.

ig2r
+1 I have never encountered PEGs (in class CFL and reg.) so I really appreciate your answer and will look into the tool :)
lasseespeholt
+1, FParsec is great
Joel Mueller
+2  A: 

I implemented an XPath 2.0 parser entirely in XSLT 2.0 three years ago.

I used my LR Parsing Framework in FXSL and this was not so difficult. The grammar is quite big -- 209 rules, if I remember well. I used a modificationn of YACC (done by me) which I call Yaccx to generate the parsing tables as XML. These are the input for the general LR Parser, written in XSLT.

For such kind of project you need to allocate at least 6 months full time, maybe 1 year. The difficulty is in implementing the enormous function library (F & O).

Also, XPath is not a standalone language -- it must be hosted by another language. Due to this reason I didn't use this parser for anything meaningful, as I didn't have the access, influence and possibility to alter an existing hosting language.

So, be prepared for all these difficulties.

Dimitre Novatchev
+1 The work you have done sounds very interesting. May I ask why you used your own yacc and parseing framework and not just other implementations? I don't have 6 months full time :/ I guess I have a few hours each day but I'm studying currently. Also, the last point seems very rational, my initial usage of this was to make a online xpath tester but if it can't be used to anything besides that and others ain't requesting it it might be waste of time.
lasseespeholt
@lasseespeholt: This is not my own YACC. This is Berkley YACC, only slightly modified to output the parsing tables in XML format. Normally it outputs the parsing tables as C arrays. As for an XPath 2.0 Visualizer, I developed such four years ago and am considering publishing it.
Dimitre Novatchev
+1  A: 

I am one of the developers of XQSharp, so I have experience in this area. XQSharp actually began its life as an XPath implementation before we expanded it to support XQuery.

Our initial implementation took us about 6 months, although this was not the only thing we were working on at the time.

After this time we had an implementation that was feature complete. There were many areas in which this was not fully conformant, where the standard .NET methods did not behave quite the same as the specification required. Some examples of this are with converting values to and from strings, regular expressions, a lot of unicode stuff, problems with the .NET representations of XML (eg handling of xml:base) and so on.

There were several areas that needed to be done to implement this:

Parsing: The parser itself was straightforward, and mostly generated from the EBNF in the spec. I would estimate that this initially represented a few weeks work.

Data Model: How the data is represented. In order to have a full XPath implementation there are a lot of new data types (like xs:gDay) that need to be implemented. In our case we have all our items derive from a base type and all our expressions would return enumerators of these. You also need to be able to identify whether the type of an item matches a particular XPath type. We supported static typing and schema-awareness from the outset, without these features this section probably becomes trivial, but you are still looking at several weeks worth of work.

Expressions/Abstract Syntax Tree This is the model of the expression itself. We used the XQuery Formal Semantics document to produce a mapping from the various XPath constructs (for example axes and predicates) to a simpler core grammer (which ends up with huge amounts of let, for if and typeswitch expressions!). In our initial implementation all these expressions had evaluate methods and so were the final representation of the expression. In our case the expressions all had type check methods too, but this can be skipped initially (The main purpose of these is for optimization). Creating all these expressions again took several weeks.

Functions As a previous commenter pointed out the function library for XPath is rather large. The entire XPath library took us several months to implement.

Static Analysis A small amount of static analysis is required. Variable references and function calls must be bound to the correct variables and functions. Most XPath implementations are stack based, and so a stack allocation phase is required to assign pointers (or indexes) to all the variables. This static analysis took a week or two. The Dragon Book should set you up very nicely to solve most of these problems.

You are probably looking at another month's worth of work for all the extra bits of work that do not fall directly into these categories.

After all this work we were left with a mostly functional implementation of XPath; but it was far to slow for real world use (maybe 100x slower than XPath 1 in .NET). So after this comes the fun work - Optimization.

Bringing the engine up to 100% conformance and adding optimizations probably took another 12-18 man months (although we probably went a little overboard with optimization!), but by that point we had already made the transition to being an XQuery implementation.

My advice would be to start by tackling a subset of XPath (maybe forward axes only and a very limited function library) and you might be able to knock together an implementation in a month or two, but a serious XPath2 implementation will be a big investment in time.

Make sure that you use XPathNavigator for your node representation, as it has methods like SelectChildren which can take advantages of indexes in the underlying representations (for example XPathDocument).

Oliver Hallam
+1 I really appreciate you took the time to write about it :) It seems like a long journey. I did think it was a smaller project than that but I often do. Right now, I'll return to the studies and use XQuery for non-commercial stuff (at least for now). Thanks...
lasseespeholt
If I may add a little suggestion to XQuery, then I really think you guys should make your LINQ equivalent methods like `XPathEvaluate`, `XPathSelect` etc. behave just like the .Net XPath 1.0 version.
lasseespeholt
@lasseespeholt: I don't think we realised how big a journey this was when we started either! What differences are you referring to with the behaviour of the extension methods? If you could post this to our forum (http://www.xqsharp.com/forum) this would be greatly appreciated.
Oliver Hallam