I need to parse DTDs using PHP and am hoping there's a simple library to help out. Each DTD has numerous <!ENTITY ... and <!-- Comment ... declarations, which I need to act upon.

Note that I do not need to validate anything against these DTDs; I only need to parse them as data files.

A few options I've looked at:

James Clark's SP, which is an option of last resort, but I'd like to avoid the complexity of building, installing, and configuring code external to PHP. I'm not sure it's even possible in my situation.

PEAR has an XML_DTD_Parser, which requires installing and configuring PEAR and a number of PEAR modules. I'm not sure that is possible either, and would rather avoid it. Has anyone used it with success? EDIT: I've since learned that XML_DTD_Parser discards comments, so it is not a viable option for my needs.

PHP XML Classes has the class_path_parser, which another site suggested, but it fails to read ENTITY declarations. It appears to use PHP's built-in XML parsing capabilities, which are based on Expat.

PHP's DOMDocument can validate against a DTD, so it must be able to read one, though at first glance I don't see how to get at the DTD parser directly.

+1  A: 

I don't know how useful this will be...

If I understand what you're looking for, you want a means to extract the <!ENTITY and <!-- "nodes" from a DTD in order to act on them. Very interesting. Here's where my brain went:

  • Use the DOMDocument class directly. It looks as if there's no distinct way of getting at the DTD data if you treat the DTD itself as the source.
  • Use SimpleXML in the same way. Ditto.
  • Use the XML parser in, again, the same way, but use some of the entity declaration handler functions to get information out. I think this requires more foresight and is probably not what you need. (Although I could be wrong.)
  • Use preg_match_all, or the like, to grab your values based on their patterns. Not too dissimilar to what others have suggested elsewhere.
  • Use XSLT to nix everything but what you need. The .xsl to remove all non-comments would be pretty easy to manage. It's quite possible you could output them in a format that's easier to parse (say, in a better XML structure). Entities may require handling via PHP's XSL processor. I'm a little rusty on entities.
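
For what it's worth, the preg_match_all idea can be sketched roughly as below. This is only a minimal sketch under some loud assumptions: `extractDtdParts()` is a name made up for illustration, it only recognises entities with a quoted literal value (not external `SYSTEM`/`PUBLIC` entities), and it does not notice entity declarations that happen to sit inside a comment.

```php
<?php
// Hypothetical helper (not part of any library): pull comments and
// literal-valued entity declarations out of raw DTD text with regexes.
// Assumptions: no "<!ENTITY" text inside comments, and no external
// (SYSTEM/PUBLIC) entities that need to be captured.
function extractDtdParts(string $dtd): array
{
    // Comments: everything between "<!--" and the first following "-->".
    preg_match_all('/<!--(.*?)-->/s', $dtd, $c);

    // Entities: optional "%" (parameter entity), a name, then a value in
    // double or single quotes.
    preg_match_all(
        '/<!ENTITY\s+(%\s+)?([^\s%]+)\s+(?:"([^"]*)"|\'([^\']*)\')\s*>/',
        $dtd,
        $e,
        PREG_SET_ORDER
    );

    $entities = [];
    foreach ($e as $m) {
        $entities[] = [
            'name'      => $m[2],
            'parameter' => $m[1] !== '',
            // Group 3 holds a double-quoted value, group 4 a single-quoted one.
            'value'     => ($m[3] ?? '') !== '' ? $m[3] : ($m[4] ?? ''),
        ];
    }

    return [
        'comments' => array_map('trim', $c[1]),
        'entities' => $entities,
    ];
}
```

Something like `extractDtdParts(file_get_contents($path))` would then hand back two arrays to loop over. Note that the two lists lose their relative order in the file, which matters if a comment is meant to describe the entity that follows it.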

Regardless, I hope some of this helps.

Inkspeak
A: 

None of the standard XML parsers for PHP give access to general entities*, and few give access to comments. PHP's built-in XML Parser uses Expat, but does not expose the full Expat API; in particular, a handler for entity declarations cannot be set. There is a PHP bug filed to add this.

AFAICT, the only way to handle comments and general entities in a DTD is to write your own parser, either by hand or using one of the lexer and parser generators available for PHP (e.g. PHP_LexerGenerator and PHP_ParserGenerator, among others).
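
To give a feel for the by-hand route, here is a minimal sketch of a scanner that walks the DTD once and returns comments and entity declarations in document order. It is an illustration under stated assumptions rather than a complete DTD parser: `scanDtd()` and the token array shape are hypothetical, entity declarations are returned as raw text rather than split into name and value, and parameter-entity references are not expanded.

```php
<?php
// Hypothetical hand-written scanner: returns comment and entity tokens
// in the order they appear. Other declarations (ELEMENT, ATTLIST, ...)
// are skipped. Not a validating or complete DTD parser.
function scanDtd(string $dtd): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($dtd);

    while (($start = strpos($dtd, '<!', $pos)) !== false) {
        if (substr($dtd, $start, 4) === '<!--') {
            // Comment: runs to the next "-->".
            $end = strpos($dtd, '-->', $start + 4);
            if ($end === false) {
                break; // unterminated comment, stop scanning
            }
            $text = trim(substr($dtd, $start + 4, $end - $start - 4));
            $tokens[] = ['type' => 'comment', 'text' => $text];
            $pos = $end + 3;
            continue;
        }
        if (substr($dtd, $start, 3) === '<![') {
            // Conditional section opener: step past it so that the
            // declarations nested inside are still scanned.
            $pos = $start + 3;
            continue;
        }

        // Any other declaration: find the closing ">" while skipping
        // quoted literals, since an entity value may contain ">".
        $quote = null;
        for ($i = $start + 2; $i < $len; $i++) {
            $ch = $dtd[$i];
            if ($quote !== null) {
                if ($ch === $quote) {
                    $quote = null;
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;
            } elseif ($ch === '>') {
                break;
            }
        }
        if ($i >= $len) {
            break; // unterminated declaration, stop scanning
        }

        $body = substr($dtd, $start + 2, $i - $start - 2);
        if (strncmp($body, 'ENTITY', 6) === 0) {
            $tokens[] = ['type' => 'entity', 'text' => trim(substr($body, 6))];
        }
        $pos = $i + 1;
    }

    return $tokens;
}
```

Keeping the tokens in one ordered list makes it possible to associate a comment with the declaration that follows it, which is harder to do with separate regex passes.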

* PHP's Expat wrapper (XML Parser) does give access to notation declarations, which are similar to, but not the same as, general entities.

Chadwick