ansaurus

Question

Answer 1

+2 A:

Here's a random snippet of code I wrote years and years ago to do some work with DTDs from Python, which might give you an idea of what it's like to work with this library:

from xml.parsers.xmlproc import dtdparser

# Separators used when building the output names
attr_separator = '_'
child_separator = '_'

# Load and parse the DTD file
dtd = dtdparser.load_dtd('schedule.dtd')

# For every declared element, print one line per attribute
# and one line per element that is valid as its first child
for name, element in dtd.elems.items():
    for attr in element.attrlist:
        print('%s%s%s = ' % (name, attr_separator, attr))
    for child in element.get_valid_elements(element.get_start_state()):
        print('%s%s%s = ' % (name, child_separator, child))

(FYI, this was the first result when searching for "python dtd parser")

Will McCutchen 2010-01-27 15:53:47

It seems pretty good, but the DTD information that has the version number is a part of the XML file.

prosseek 2010-01-27 16:06:03

What information do you actually need out of this file? Is it just the version information from the embedded DTD? If so, why don't you just pull it out with a regular expression?

Will McCutchen 2010-01-27 16:42:59
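
A minimal sketch of the regular-expression approach suggested above, assuming the DOCTYPE uses a PUBLIC identifier with double-quoted literals and that the version number appears inside the public ID. The sample document, the `read_doctype` helper and the "1.2" version are hypothetical; the thread does not show the actual file.

```python
import re

# Matches a DOCTYPE declaration with a PUBLIC identifier and captures the
# root element name, the public ID, and the system ID (usually a URL).
# Assumption: both literals are double-quoted.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>\[]+)'
    r'\s+PUBLIC\s+"(?P<public_id>[^"]*)"'
    r'\s+"(?P<system_id>[^"]*)"'
)


def read_doctype(xml_text):
    """Return a dict with root, public_id and system_id, or None if no match."""
    match = DOCTYPE_RE.search(xml_text)
    return match.groupdict() if match else None


# Hypothetical sample document; the DTD name and version are made up.
sample = '''<?xml version="1.0"?>
<!DOCTYPE schedule PUBLIC "-//Example//DTD Schedule 1.2//EN" "http://example.com/schedule-1.2.dtd">
<schedule/>
'''

info = read_doctype(sample)

# Assumption: the version is the first dotted number in the public ID.
version_match = re.search(r'\d+(?:\.\d+)+', info['public_id'])
version = version_match.group(0) if version_match else None

print(info['root'], version)
```

A regular expression like this ignores comments, SYSTEM-only declarations and single-quoted literals, so it is only suitable when the input format is known in advance.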

And I guess I should point out that the `xmlproc` parsers provide a `get_dtd` method that gives you access to the DTD of a parsed XML file, which may or may not be what you're looking for. This is all explained in the docs that I linked to.

Will McCutchen 2010-01-27 17:15:14

> why don't you just pull it out with a regular expression?

That's actually what I did to get the job done, but I wanted to know whether there are functions for doing it. Thanks, it was a great help.

prosseek 2010-01-27 18:16:17

Answer 2

A:

Because both of the standard library XML libraries (xml.dom.minidom and xml.etree) use the same parser (xml.parsers.expat), you are limited in the "quality" of XML data you are able to parse successfully.

You're better off using tried-and-true third-party modules like lxml or BeautifulSoup, which are not only more resilient to errors but will also give you exactly what you are looking for with little trouble.

jathanism 2010-01-28 14:10:46
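
For comparison, here is a rough sketch of reading the DOCTYPE of a well-formed document with only the standard library, using expat's `StartDoctypeDeclHandler` callback and the `doctype` node that xml.dom.minidom builds. The sample document and the `doctype_via_expat` helper are hypothetical, and this does not help with malformed input, which is the limitation described above.

```python
import xml.parsers.expat
from xml.dom import minidom

# Hypothetical sample document; the DTD name and version are made up.
sample = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE schedule PUBLIC "-//Example//DTD Schedule 1.2//EN" '
    '"http://example.com/schedule-1.2.dtd">\n'
    '<schedule/>\n'
)


def doctype_via_expat(xml_text):
    """Parse xml_text with expat and return the DOCTYPE fields it reports."""
    found = {}

    # expat calls this handler when it reaches the DOCTYPE declaration.
    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, system_id=system_id, public_id=public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return found


expat_info = doctype_via_expat(sample)

# minidom exposes the same information as a DocumentType node.
doc = minidom.parseString(sample)
dom_info = (doc.doctype.name, doc.doctype.publicId, doc.doctype.systemId)

print(expat_info)
print(dom_info)
```

Neither call fetches the external DTD, so both only report what is written in the declaration itself.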


Reading XML DOCTYPE info with Python
