ansaurus

Question

How to implement a generic DOM data structure in C++?

Answer 1

A:

If you look at my past answers, you will see that I am a proponent of templates, but if you have no other requirements, they will only get in the way. Parsers don't like lots of different types to deal with. (Although you say you don't need a parser — huh?)

The point of XML and DOM is to make it easy to translate to and from any internal structure. Not only shouldn't you need to define an XML node template, you shouldn't need any kind of custom data structure. Any structure is already in DOM-like style. DAGs are troublesome, because they're kinda-trees and kinda-graphs, but you don't hint that you're hitting that kind of roadblock.

You say (in comment to deleted answer) that you don't want to use an existing library. Why? What is it you're really trying to do?

Potatoswatter 2010-10-17 01:59:09

"Although you say you don't need a parser — huh?" Yes! That's right, I don't need a parser! My parsing mathod is a very implicit procedure. I can show you some examples when have it done.What I am doing is a template engine for html -- so I want to have an abstract html data representation to use in DOM tree manipulation.

Rizo 2010-10-18 11:33:06

@Rizo: In that case, you are writing a parser and you're making your own HTML (not XML) library. And like I said, unless there's a reason to be using templates, they will just get in the way.

Potatoswatter 2010-10-18 13:25:07

@Potatoswatter: I just thought xml was syntactically and logically equivalent to html (isn't it?).

Rizo 2010-10-18 13:50:55

@Rizo: No, they are two different standards with different structures and encodings. Look them up. I sort of don't believe you're really writing an HTML engine without having read the HTML standard in some depth… edit: I see your comment to Matthieu; either you've read this or you really need to: http://en.wikipedia.org/wiki/XHTML

Potatoswatter 2010-10-18 14:38:06

I did not say they were equal. Just equivalent. and as far as I know xHTML is a direct application of XML and can be parsed using XML parsers. I should have said "xhtml" instead of "html".

Rizo 2010-10-18 14:58:00

Answer 2

+1 A:

Instead of trying to strongly type each node based on how many parents it has, organize your code as a tree structure:

class Element
{
public:
  std::string Name;
  std::map<std::string, std::string, std::less<std::string> > Attributes;
  std::list<Element> Children;
};

Your public interface will probably look much different from this. I'm just trying to show the general type layout.

You don't really need the Node or Attribute features, unless you need to iterate over them in a collection along with Elements. It is a useful feature for XML DOM libraries, but if you're just trying to create a data structure, you don't have to follow the DOM design to the letter.

In fact, if you're just going for a generic data structure, you might just want a property bag:

#include<map>
#include<string>
#include<iostream>

class PropertyBag;
typedef std::map<std::string, PropertyBag> PropertyMap;

class PropertyBag : public PropertyMap
{
public:
  PropertyBag(const std::string& value)
    : value(value)
  {
  }

  PropertyBag& operator=(const std::string& value)
  {
    this->value = value;
    return *this;
  }

  operator std::string& () { return value; }

private:
  std::string value;

  friend PropertyMap::mapped_type& PropertyMap::operator[](const PropertyMap::key_type&);
  PropertyBag() { }
};

void SomeFunction(const std::string& value)
{
  std::cout << value << "\n";
}

int main(int argc, char* argv[])
{
  PropertyBag config("configuration root");
  config["child1"] = "value1";
  config["child1"]["subchild1"] = "value2";

  SomeFunction(config["child1"]);
  SomeFunction(config["child1"]["subchild1"]);
  return 0;
}

Just talking about the syntax, you can also try to get tricky with overloading operator(), and/or chaining methods:

PropertyBag& SomeMethod(const std::string& someParam)
{
  // do something here...
  return *this;
}

PropertyBag& operator()(const std::string& p1, const std::string& p2)
{
  // ...
  return *this;
}

// ...

Configuration config1("root")
  .SomeMethod("p1")
  .SomeMethod("p2");
Configuration config2("root")
  ("Something", "blah")
  ("sizzle", "V2");

I imagine the less text/code duplication, the better. The closer you can get your code to have a syntax like JSON or YAML, the better.

Once c++0x comes out, you may have much simpler options available to you. You can also look into the boost::assign library for an easy initialization syntax to use on your data structure.

You can also look into the boost::any library for a datatype that you can use as the value, instead of strings (supports a type-safe method of inserting any value, as long as you extract it as the same type).

Merlyn Morgan-Graham 2010-10-17 02:04:13

@Merlyn: list have awful runtime performance, are you sure you don't want a vector here ?

Matthieu M. 2010-10-17 10:06:33

@Matthieu: A list better represents the operations you'd do on a structure like this. You don't know the size to start with, and you would never do index based access on children. In this scenario, a list can perform better for construction, and just as fast for traversal. But in real life, this sort of perf is rarely going to be a bottleneck, and if it is, you'd come up with a custom storage solution for that portion of the app, rather than a generic data structure.

Merlyn Morgan-Graham 2010-10-17 21:00:01

@Merlyn: going for a custom storage solution is a last resort solution, reserved to experts. As for the list choice, I strongly disagree. The first choice of a container should be either `vector` or `deque`, because they are the simpler. Only if you have specific requirements should you change: `unordered_set` for unicity, `set` for ordering, `stack` / `queue` for specific access, etc... the only advantage of using a `list`, in general, is the iterator invalidation guarantee.

Matthieu M. 2010-10-18 07:15:15

@Merlyn: That's exactly what I want -- organize data (exempli gratia, through syntactical changes provided by C++ operators overloading feature) to get its representation code closer to JSON/YAML/XML logic. Though I don't want to use it directly as you do in your example source. I think it's more convenient to define a background data architecture (such as DOM for this case) to store data. And, independently have an interface to convert a user-friendly code into that abstract form. Since I plan to use it mainly for xHTML, DOM would be most adequate. Am I wrong?

Rizo 2010-10-18 14:18:55

@Matthieu: "the only advantage of using a list, in general, is the iterator invalidation guarantee" - not true. If you push_back one node for a vector, how long is it going to take to do the insert? What exactly will happen to the objects already in the vector? You don't know, because it is dependent on what is already in the vector, and the implementation of the copying implementation. With a list, you modify the list, not its contents. You insert an object, and it does one or zero copies, and (potentially) fewer allocations. That sounds like desirable behavior to me.

Merlyn Morgan-Graham 2010-10-18 17:34:10

@Matthieu: But sure, a vector is a great default container. But so is a list. And if a programmer can even hope to make decisions like this, they are going to be intimately familiar with how both work, and their characteristics.

Merlyn Morgan-Graham 2010-10-18 17:35:14

@Matthieu: Also, I didn't necessarily mean to invent the structure yourself - just to implement it. For example, a custom binary tree, binary heap, hash table w/ custom storage, etc. Those don't necessarily require experts, though they will be a lot more bug prone to start with.

Merlyn Morgan-Graham 2010-10-18 17:42:48

@Rizo: I guess so, though the DOM libraries I have seen have more data structure overhead than I'd personally want to use in my code. The data structure I presented here can directly be converted to a DOM if it needs to go to XHTML, etc.

Merlyn Morgan-Graham 2010-10-18 17:48:40

@Merlyn: are we talking about amortized cost ? If so, insertion into a vector is `O(1)` and generally faster than in a `list` because there is no systematic call to `new`. Of course, a `deque` gets the best of both worlds here: bulk allocation combined with long lived iterators (as long as you only operate on the head / tail). I know that you can choose between various containers, but truly, the `list` is the slowest of all.

Matthieu M. 2010-10-18 19:01:33

@Matthieu: While any performance discussions we have here are completely moot without a solid benchmark vs a solid scenario (vs the code I wrote here :), I can imagine scenarios where the copy operator could dwarf the allocation. Can you show me data or math that shows this to be a rare case? If so, I'll concede the point. Otherwise I call "blanket statement" on your assertion.

Merlyn Morgan-Graham 2010-10-18 19:51:16

ansaurus

tags:

views:

answers:

How to implement a generic DOM data structure in C++?

related questions