After exploring several different ways to pass web data to a database for update purposes, I'm wondering if XML might be a good strategy. The database is currently SQL 2000. In a few months it will move to SQL 2005 and I will be able to change things if needed, but I need a SQL 2000 solution now.

First of all, the database in question uses the EAV model. I know that this kind of database is generally highly frowned on, so for the purposes of this question, please just accept that this is not going to change.

The current update method has the web server inserting values (first converted to their correct underlying types, then to sql_variant) into a temp table. A stored procedure, which expects the temp table to exist, is then run; it takes care of updating, inserting, or deleting things as needed.
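
To illustrate, the current handoff looks roughly like this (a sketch only; the real table and procedure names are different):

CREATE TABLE #ElementUpdate (
   ElementID int NOT NULL,
   AttrID    int NOT NULL,
   Value     sql_variant NULL  -- already converted to the correct underlying type
)

-- the web server inserts the converted values, then:
EXEC dbo.ElementPost  -- expects #ElementUpdate to exist; updates/inserts/deletes as needed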

So far, only a single element has needed to be updated at a time. But now, there is a requirement to be able to edit multiple elements at once, and also to support hierarchical elements, each of which can have its own list of attributes. Here's some example XML I hand-typed to demonstrate what I'm thinking of.

Note that in this database the Entity is Element and an ID of 0 signifies "create" aka an insert of a new item.

<Elements>
  <Element ID="1234">
    <Attr ID="221">Value</Attr>
    <Attr ID="225">287</Attr>
    <Attr ID="234">
      <Element ID="99825">
        <Attr ID="7">Value1</Attr>
        <Attr ID="8">Value2</Attr>
        <Attr ID="9" Action="delete" />
      </Element>
      <Element ID="99826" Action="delete" />
      <Element ID="0" Type="24">
        <Attr ID="7">Value4</Attr>
        <Attr ID="8">Value5</Attr>
        <Attr ID="9">Value6</Attr>
      </Element>
      <Element ID="0" Type="24">
        <Attr ID="7">Value7</Attr>
        <Attr ID="8">Value8</Attr>
        <Attr ID="9">Value9</Attr>
      </Element>
    </Attr>
    <Rel ID="3827" Action="delete" />
    <Rel ID="2284" Role="parent">
      <Element ID="3827" />
      <Element ID="3829" />
      <Attr ID="665">1</Attr>
    </Rel>
    <Rel ID="0" Type="23" Role="child">
      <Element ID="3830" />
      <Attr ID="67"
    </Rel>
  </Element>
  <Element ID="0" Type="87">
    <Attr ID="221">Value</Attr>
    <Attr ID="225">569</Attr>
    <Attr ID="234">
      <Element ID="0" Type="24">
        <Attr ID="7">Value10</Attr>
        <Attr ID="8">Value11</Attr>
        <Attr ID="9">Value12</Attr>
      </Element>
    </Attr>
  </Element>
  <Element ID="1235" Action="delete" />
</Elements>

Some Attributes are straight value types, such as AttrID 221. But AttrID 234 is a special "multi-value" type that can have a list of elements underneath it, and each one can have one or more values. Types only need to be provided when a new item is created, since the ElementID fully implies the type of an existing item. I'll probably support passing in only changed items (as detected by JavaScript). And there may be an Action="delete" on Attr elements as well, since NULLs are treated as "unselected"--sometimes it's very important to know whether a Yes/No question has intentionally been answered No or whether no one's bothered to say Yes yet.
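
Since I need a SQL 2000 solution first, the shredding on the database side would go through OPENXML rather than the XML data type. Here is a minimal sketch against the example above (the column patterns are just my first guess; it only reads top-level Attr values, and the nested multi-value elements would need extra passes):

DECLARE @hdoc int
EXEC sp_xml_preparedocument @hdoc OUTPUT, @xml  -- @xml holds the posted document, e.g. an ntext parameter

SELECT ElementID, AttrID, [Action], Value
FROM
   OPENXML(@hdoc, '/Elements/Element/Attr', 1)
   WITH (
      ElementID int            '../@ID',
      AttrID    int            '@ID',
      [Action]  varchar(10)    '@Action',
      Value     nvarchar(4000) 'text()'  -- the 10k+ strings would need ntext here
   )

EXEC sp_xml_removedocument @hdoc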

There is also a different kind of data: a Relationship. At this time, those are updated through individual AJAX calls as things are edited in the UI, but I'd like to include them here so that changes to relationships can be canceled (right now, once you change one, it's done). So those are really elements too, but they are called Rel instead of Element. Relationships are implemented as ElementID1 and ElementID2, so RelID 2284 in the XML above is stored in the database as:

ElementID = 2284, ElementID1 = 1234, ElementID2 = 3827
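
So the implied storage shape is roughly this (my sketch; the actual table and column names may differ):

-- a Rel is itself an element row, plus two columns linking its endpoints
CREATE TABLE Relationship (
   ElementID  int NOT NULL,  -- the Rel's own element ID (2284 above)
   ElementID1 int NOT NULL,  -- one end of the relationship (1234 above)
   ElementID2 int NOT NULL   -- the other end (3827 above)
)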

Having multiple children in one relationship isn't currently supported, but it would be nice later.

Does this strategy and the example XML make sense? Is there a more sensible way? I'm just looking for some broad critique to help save me from going down a bad path. Any aspect that you'd like to comment on would be helpful.

The web language happens to be Classic ASP, but that could change to ASP.NET at some point. A persistence engine like LINQ or NHibernate is probably not acceptable right now--I just want to get this already-working application enhanced without a huge amount of development time.

I'll choose the answer that shows experience and has a balance of good warnings about what not to do, confirmations of what I'm planning to do, and recommendations about something else to do. I'll make it as objective as possible.

P.S. I'd like to handle Unicode characters as well as very long strings (10k+).

UPDATE

I have had this working for some time and I used the ADO Recordset Save-To-Stream trick to make creating the XML really easy. The result seems to be fairly fast, though if speed ever becomes a problem I may revisit this.

In the meantime, my code works to handle any number of elements and attributes on the page at once, including updating, deleting, and creating new items all in one go.

I settled on a scheme like so for all my elements:

  • Existing data elements

    Example: input name e12345_a678 (element 12345, attribute 678); the input value is the value of the attribute.

  • New elements

    JavaScript copies a hidden template of the set of HTML elements needed for the type into the correct location on the page, increments a counter to get a new ID for this item, and prepends the number to the names of the form items.

    var newid = 0;  // running counter for new (not yet saved) items on the page
    
    
    // Append a hidden input carrying one piece of metadata for a new item.
    function metadataAdd(reference, nameid, value) {
       var t = document.createElement('input');
       t.setAttribute('name', nameid);
       t.setAttribute('id', nameid);
       t.setAttribute('type', 'hidden');
       t.setAttribute('value', value);
       reference.appendChild(t);
    }
    
    
    // Copy the hidden template for this attribute into place, giving every
    // form item in it a unique n{id}_ prefix, plus the hidden metadata inputs.
    function multiAdd(target, parentelementid, attrid, elementtypeid) {
       var proto = document.getElementById('a' + attrid + '_proto');
       var instance = document.createElement('p');
       target.parentNode.parentNode.insertBefore(instance, target.parentNode);
       var thisid = ++newid;
       instance.innerHTML = proto.innerHTML.replace(/{prefix}/g, 'n' + thisid + '_');
       instance.id = 'n' + thisid;
       instance.className += ' new';
       metadataAdd(instance, 'n' + thisid + '_p', parentelementid);
       metadataAdd(instance, 'n' + thisid + '_c', attrid);
       metadataAdd(instance, 'n' + thisid + '_t', elementtypeid);
       return false;
    }
    

    Example: Template input name _a678 becomes n1_a678 (a new element, the first one on the page, attribute 678). All attributes of this new element are tagged with the same n1 prefix. The next new item will be n2, and so on. Some hidden form inputs are also created:

    n1_t, value is the element type of the element to be created
    n1_p, value is the parent id of the element (if it is a relationship)
    n1_c, value is the child id of the element (if it is a relationship)

  • Deleting elements

    A hidden input named e12345_t is created with its value set to 0. The existing controls displaying that attribute's values are disabled so they are not included in the form post. So "set type to 0" is treated as delete.

With this scheme, every item on the page has a unique name and can be distinguished properly, and every action can be represented properly.

When the form is posted, here's a sample of building one of the two recordsets used (Classic ASP code):

Set Data = Server.CreateObject("ADODB.Recordset")
Data.Fields.Append "ElementID", adInteger, 4, adFldKeyColumn
Data.Fields.Append "AttrID", adInteger, 4, adFldKeyColumn
Data.Fields.Append "Value", adLongVarWChar, 2147483647, adFldIsNullable Or adFldMayBeNull ' wide character type covers Unicode and very long strings
Data.CursorLocation = adUseClient
Data.CursorType = adOpenDynamic
Data.Open ' opened with no connection: a disconnected, client-side recordset

This is the recordset for the values; the other, for the elements themselves, has ElementID, ElementTypeID, ElementID1, and ElementID2 fields (matching the OPENXML shredding in the stored procedure below).

I step through the posted form. For the element recordset, I use a Scripting.Dictionary populated with instances of a custom Class that has the properties I need, so that I can add the values piecemeal, since they don't always come in order. New elements are added with negative IDs to distinguish them from existing elements (rather than requiring a separate column to indicate whether a row is new or addresses an existing element). I use a regular expression to tear apart the form keys: "^(e|n)([0-9]{1,10})_(a|p|t|c)([0-9]{0,10})$"

Then, adding an attribute looks like this.

Data.AddNew
ElementID.Value = DataID
AttrID.Value = Integerize(Matches(0).SubMatches(3))
AttrValue.Value = Request.Form(Key)
Data.Update

ElementID, AttrID, and AttrValue are references to the fields of the recordset, obtained once before the loop (e.g. Set ElementID = Data.Fields("ElementID")). This method is hugely faster than evaluating Data.Fields("ElementID").Value each time.

I loop through the Dictionary of element updates and ignore any that don't have all the proper information, adding the good ones to the recordset.

Then I call my data-updating stored procedure like so:

Set Cmd = Server.CreateObject("ADODB.Command")
With Cmd
   Set .ActiveConnection = MyDBConn
   .CommandType = adCmdStoredProc
   .CommandText = "DataPost"
   .Prepared = False
   .Parameters.Append .CreateParameter("@ElementMetadata", adLongVarWChar, adParamInput, 2147483647, XMLFromRecordset(Element))
   .Parameters.Append .CreateParameter("@ElementData", adLongVarWChar, adParamInput, 2147483647, XMLFromRecordset(Data))
End With
Result.Open Cmd ' previously created recordset object with options set

Here's the function that does the XML conversion:

Private Function XMLFromRecordset(Recordset)
   Dim Stream
   Set Stream = Server.CreateObject("ADODB.Stream")
   Stream.Open
   Recordset.Save Stream, adPersistXML ' serialize the recordset in ADO's XML rowset format
   Stream.Position = 0
   XMLFromRecordset = Stream.ReadText
   Stream.Close
End Function

Just in case the web page needs to know, the SP returns a recordset of any new elements, showing their page value and their created value (so I can see that n1 is now e12346, for example).

Here are some key snippets from the stored procedure. Note this is SQL 2000 for now, though I'll be able to switch to 2005 soon:

CREATE PROCEDURE [dbo].[DataPost]
   @ElementMetaData ntext,
   @ElementData ntext
AS
DECLARE @hdoc int

--- snip ---

EXEC sp_xml_preparedocument @hdoc OUTPUT, @ElementMetaData, '<xml xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" />'
INSERT #ElementMetadata (ElementID, ElementTypeID, ElementID1, ElementID2)
SELECT *
FROM
   OPENXML(@hdoc, '/xml/rs:data/rs:insert/z:row', 0)
   WITH (
      ElementID int,
      ElementTypeID int,
      ElementID1 int,
      ElementID2 int
   )
ORDER BY ElementID -- orders negative items (new elements) first so they begin counting at 1 for later ID calculation

EXEC sp_xml_removedocument @hdoc

--- snip ---

UPDATE E
SET E.ElementTypeID = M.ElementTypeID
FROM
   Element E
   INNER JOIN #ElementMetadata M ON E.ElementID = M.ElementID
WHERE
   E.ElementID >= 1
   AND M.ElementTypeID >= 1

The following query correlates the negative new element IDs with the newly inserted identity values:

UPDATE #ElementMetadata -- Correlate the new ElementIDs with the input rows
SET NewElementID = Scope_Identity() - @@RowCount + DataID
WHERE ElementID < 0
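
To make the arithmetic concrete (my own worked example): the new rows were just inserted into Element in DataID order, so if 3 new elements were inserted and Scope_Identity() came back as 12348, then @@RowCount is 3 and:

-- DataID 1 (the first new row inserted) gets 12348 - 3 + 1 = 12346
-- DataID 2 gets 12348 - 3 + 2 = 12347
-- DataID 3 gets 12348 - 3 + 3 = 12348
-- (this relies on the batch receiving contiguous identity values; concurrent
-- inserts into Element would break the correlation)

Returning the page-to-database ID map mentioned earlier is then a simple select (a sketch; my real column names differ slightly):

SELECT ElementID AS PageID, NewElementID  -- e.g. shows that n1 is now e12346
FROM #ElementMetadata
WHERE ElementID < 0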

Other set-based queries do the rest of the work: validating that the attributes are allowed and of the correct data type, and inserting, updating, and deleting elements and attributes.
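
For instance, one of the validation passes looks conceptually like this (a sketch only; #ElementData, the values temp table, and ElementTypeAttr, an "allowed attributes per type" table, are names I'm making up here):

-- discard posted attribute rows whose AttrID is not legal for the element's type
-- (shown for existing elements; new, negative-ID elements would be checked
-- against #ElementMetadata instead)
DELETE D
FROM
   #ElementData D
   INNER JOIN Element E ON E.ElementID = D.ElementID
   LEFT JOIN ElementTypeAttr A
      ON A.ElementTypeID = E.ElementTypeID
      AND A.AttrID = D.AttrID
WHERE A.AttrID IS NULL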

I hope this brief run-down is useful to others some day! Converting ADO Recordsets to an XML stream was a huge winner for me, as it saved all sorts of time and came with a namespace and schema already defined, which made the results come out correctly.

Using a flatter XML format with two inputs was also much easier than sticking to some ideal of having everything in a single XML stream.

A: 

I don't see any reason not to use XML columns in SQL Server 2005, and to do all your work via stored procedures.

You probably don't have time to abstract your data access to hide the ugliness of the data model, so why not just access it as-is, using XML? You can use XQuery in SQL Server to do updates, queries, etc.

Now that I think of it, you might still put one layer of abstraction between the ASP pages and the database. That would allow you in the future to use XSLT to transform the structure of your XML into a format that will perform better in the database.

John Saunders
What do you mean by "XML columns"? Some brief examples of what you're talking about would be good. I don't mean working code; I'm just having trouble connecting what you're saying to the real world. For example, what format would perform better in the database? Note that the question here isn't about reading the database--that works fine. It's about writing back to the database. Though I suppose if multiple requests work for reading, then multiple requests for writing could be okay. It's just not so simple to submit entire recordsets as parameters to SPs.
Emtucifor
@Emtucifor: SQL Server 2005 has a new column data type, XML. See http://technet.microsoft.com/en-us/library/ms190936%28SQL.90%29.aspx. Sorry, I thought you mentioned SQL Server 2005 because you knew about XML columns.
John Saunders
I do know about XML columns, but I don't think actually storing the data as XML is the right route here. Were you suggesting changing the database structure to use XML columns? Or were you instead suggesting the web application insert into an XML column and then do its updates from there? Or am I missing that an input parameter can be of the XML data type?
Emtucifor
@John Saunders: Any thoughts on my last comment?
Emtucifor
@Emtucifor: I had been suggesting XML columns. But since you're stuck on SQL Server 2000 for a while, given that you're using Classic ASP, and given what @Remus said about severity 16, I'm not so sure anymore.
John Saunders
+3  A: 

If I understand correctly, you are interested in the pros and cons of using XML as the data format between the database and the application (in this case, a web app).

If you happen to have the entire data to be inserted/updated/deleted as a handy bag of data in your client, then actually sending it as XML makes sense. One simple reason is that this allows a single database round-trip to the server, and reducing round-trips is always a good thing. But the most important advantage is that you can employ the holy grail of database performance: set-oriented processing. Using the XML methods, especially nodes() and value(), combined with some moderate XPath-fu, you can shred the entire XML parameter received from the application into relational sets and use set-oriented operations to do the database writes.

Take for instance the XML in your post; let's say it was passed as a parameter named @x of type XML. You can shred it into the attributes to be merged into existing elements:

select x.value(N'@ID', N'int') as ID,
  x.value(N'.', N'varchar(max)') as [Value]
from  @x.nodes('//Element[not(@Action="delete") and not (@ID=0)]/Attr') t(x)

You can shred the attributes that go into new elements:

select x.value(N'@ID', N'int') as ID,
  x.value(N'.', N'varchar(max)') as [Value]
from  @x.nodes('//Element[@ID=0]/Attr') t(x);

And you can shred the elements to be deleted:

select x.value(N'@ID', N'int') as ID
from  @x.nodes('//Element[@Action="delete"]') t(x);

These sets can be manipulated via normal SQL DML: inserted, deleted, updated, or merged into the EAV tables, in one single pass. Note that the XML shreds I show here are trivial and probably incorrect for your case; they are just to show the way to do it.
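
For example, merging the first shred into a hypothetical EAV value table could look like this (the table and column names are made up; note the (../@ID)[1] step to pick up the owning element):

update EAV
set EAV.[Value] = S.[Value]
from dbo.ElementAttrValue as EAV  -- hypothetical EAV storage table
join (
  select x.value(N'(../@ID)[1]', N'int') as ElementID,
    x.value(N'@ID', N'int') as AttrID,
    x.value(N'.', N'varchar(max)') as [Value]
  from @x.nodes(N'//Element[not(@Action="delete") and not (@ID=0)]/Attr') t(x)
) as S
  on S.ElementID = EAV.ElementID
  and S.AttrID = EAV.AttrID;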

Now whether this is the best path to go, I don't know. There are way too many variables and moving pieces, and they lie mostly in your dev team's skill set and existing code base. For sure, XML is a good format for calling into the database to update sets, but XML has its shortcomings too: it is verbose and fat, it is slower to parse than binary formats, and it is actually quite difficult for programmers to fully grok: once you get past the sugar coat of '<' and '>', there is a deep (and sometimes messy) layer of XPath, XQuery, namespaces, encodings, CDATA, and the rest.

I'd say go ahead, prototype, let us know how it goes...

Remus Rusanu
Remus, all the reasons you mention are the exact ones I was thinking about! I do have all the data at once and can stuff it into a property bag, as you say. And exactly so, I can query the XML repeatedly to extract different data sets and then do my joins (row-by-row is truly evil). In the past I've not been impressed with XML because of how verbose and slow it is. And I also have my reservations about all the extra gunk you mentioned like namespaces, encodings, and CDATA (I want to be able to handle any Unicode character properly).
Emtucifor
So... given that you jumped right in and gave sample queries, it seems you sort of approve of the XML route? Of course, for now I am stuck with SQL 2000. Do you have any other thoughts on the format of my XML (e.g., are Action="delete" and the mix of element-based and attribute-based data sensible) or on any other good ways to hand in an entire property bag? Last night I realized that all the things I'm updating share a fairly standard representation, and if I add ElementID1 and ElementID2 to the existing update process, I could do without XML.
Emtucifor
Also, I have concerns about creating the XML in the first place. I've found the MSXML libraries to be painfully slow, even just to build XML, and I've done the (gasp! horror!) workaround of concatenating the XML together manually in the past. But that only works when it's fairly simple, and this looks like it won't be so simple.
Emtucifor
I do not disapprove of XML, but I can't recommend it either since, as you are clearly aware, there are just too many moving pieces. But I will tell you a short story from my professional history. I deal a lot with Service Broker based systems, and in SSB the norm is to get the data as XML: dequeue a message from a queue, then start processing the XML. You can see on my blog at http://rusanu.com/2006/10/16/writing-service-broker-procedures/ how various styles of processing the payload impact performance. Set-oriented XML processing is pretty much the fastest way.
Remus Rusanu
However, I am finding myself *removing* set-oriented processing and replacing it with cursor-based, row-by-row processing ('row' extracted from the XML). The reason for this is complexity. Maintaining the set-oriented procedures is just hard to do; projects take longer than needed and have more bugs, simply because it is hard to wrap one's head around the complex XPath queries involved. And a major problem is error handling: it is hard to detect whether *any* XML element was simply not processed (i.e., skipped by all the XPath queries).
Remus Rusanu
@Remus: are you using XML Schema with the XML data? SQL Server is capable of validating the XML against a schema.
John Saunders
@John: No, I don't use an XML schema. Schema validation wouldn't help me, though. The problem is more one of defensive code style: the code works fine with all the known cases and passes the functional test cases, but in production you want to be able to detect *new* cases (they always happen eventually) and trigger the alarm, yet allow the processing to continue. I'd rather rely on a code path (i.e., a switch default case) than on XML schemas for this. A second reason is the severity 16 that XML errors are raised with; see http://rusanu.com/2007/10/31/error-handling-in-service-broker-procedures/
Remus Rusanu
I may go with XML, as I've figured out a nifty way to get my data into an XML object: build an ADO Recordset and turn it into a Stream! This will be much simpler for me than using MSXML and won't require installing MSXML either.
Emtucifor
@Remus: I didn't know about severity 16. Thanks for that.
John Saunders