views:

541

answers:

3

I've got a very large xml data set that is structured like the following:

<root>
    <person>
        <personid>HH3269732</personid>
        <firstname>John</firstname>
        <lastname>Smith</lastname>
        <entertime>01/02/2008 10:15</entertime>
        <leavetime>01/02/2008 11:45</leavetime>
        <entertime>03/01/2008 08:00</entertime>
        <leavetime>03/01/2008 10:00</leavetime>
        ... 
        // number of enter times and leave times vary from person to person
        // there may not be a final leave time (ie, they haven't left yet)
    </person>
    ...
</root>

The structure of the data is not under my control. This data is currently residing in a single xml column in a single row in MS SQL Server 2005. I am trying to construct a query which results in the following output:

HH3269732   John   Smith   01/02/2008 10:15   01/02/2008 11:45
HH3269732   John   Smith   03/01/2008 08:00   01/02/2008 10:00
HH3269735   Mark   Pines   02/01/2008 09:00   NULL
HH3263562   James  Frank   NULL               NULL
HH3264237   Harold White   04/18/2008 03:00   04/18/2008 05:00
...

My query currently looks like the following:

DECLARE @xml xml
SELECT @xml = XmlCol FROM Data

SELECT
    [PersonId] = Persons.PersonCollection.value('(personid)[1]', 'NVARCHAR(50)')
    ,[First Name] = Persons.PersonCollection.value('(firstname)[1]', 'NVARCHAR(50)')
    ,[Last Name] = Persons.PersonCollection.value('(lastname)[1]', 'NVARCHAR(50)')
    ??????
FROM @xml.nodes('root\person') Persons(PersonCollection)

That query may not be 100% right as I'm pulling it from memory, but the problem I'm having is that I don't know how to include the entertime leavetime sequence elements in such a way as to get the desired rowset that I indicated above.

Thanks.

UPDATE: I wanted to add that a given person record may have no entertime/leavetime sequence elements at all, but still needs to be returned in the rowset. I have updated the example of the desired output to reflect this.

+1  A: 
with cte_entertime as (
SELECT
    [PersonId] = t.c.value('(../personid)[1]', 'NVARCHAR(50)')
    ,[First Name] = t.c.value('(../firstname)[1]', 'NVARCHAR(50)')
    ,[Last Name] = t.c.value('(../lastname)[1]', 'NVARCHAR(50)')
    ,[Entertime] = t.c.value('.', 'NVARCHAR(50)')
    ,[entry_number] = ROW_NUMBER() OVER (ORDER BY t.c)
FROM @x.nodes('root/person/entertime') t(c))
, cte_leavetime as (
    SELECT
    [Leavetime] = t.c.value('.', 'NVARCHAR(50)')
    ,[entry_number] = ROW_NUMBER() OVER (ORDER BY t.c)
FROM @x.nodes('root/person/leavetime') t(c))
SELECT PersonID
    , [First Name]
    , [Last Name] 
    , [Entertime]
    , [Leavetime]
    FROM cte_entertime e 
    LEFT OUTER JOIN cte_leavetime l on e.entry_number = l.entry_number
Remus Rusanu
I have noticed that it seems by changing the root element of the nodes functions to the sequence and then stepping up the hierarchy to get the other values that the performance of the query is drastically affected. I was attempting to validate the number of records returned by updating my query to use this format, but I cancelled the query after it had been running for 30 minutes with no results. I have the equivalent of 215,000 person records in the data set I'm working with.
Dan Rigby
I should add though, that while this answer doesn't get me 100% where I'm trying to be, it was greatly helpful in showing me how to join over the xml data.
Dan Rigby
glad it help. The step up to the parent element can easily be eliminated, simply use three tables joined (one for name/id, one for enters, one for leaves).
Remus Rusanu
I've been doing some testing and that traverse to parent is causing something really terrible behind the scenes. The following query by itself seems to get slower and slower as more results are returned, and ultimately chokes around the 3,000th row (out of about 215,000):SELECT [MemberNumber] = t.c.value('(../PersonId)[1]','NVARCHAR(20)') ,[Entertime] = t.c.value('.', 'DATETIME')FROM @x.nodes('root/person/entertime') t(c)Its very interesting. I'm not sure whats going on.
Dan Rigby
Did you try using 3 tables and join, instead of 2? Extract the ID/FirstName/LastName in one pass then @x.nodes('root/person') join with @x.nodes(/root/person/entertime) and @x.nodes(/root/person/leavetime). This should be faster as it each 'table' only scans the XML forward.
Remus Rusanu
The problem is what do you join on. The ROW_NUMBER() solution wasn't yielding correct results because the rowcount returned from each of the 3 queries was different (id/name),(entertime),(leavetime). The only way I can see that yields correct results is to join on personid and in order to do that, the entertime and leavetime queries have to traverse upward to figure out what their personid was (see query in my answer below). Maybe theres another way?
Dan Rigby
A: 

I have accepted Remus's answer as it got me 95% to the solution. For informational purposes, here is the final query structure:

with cte_maindata as (
SELECT
    [PersonId] = t.c.value('(personid)[1]', 'NVARCHAR(50)')
    ,[First Name] = t.c.value('(firstname)[1]', 'NVARCHAR(50)')
    ,[Last Name] = t.c.value('(lastname)[1]', 'NVARCHAR(50)')
FROM @x.nodes('root/person') t(c))
, cte_entertime as (
    SELECT
    [PersonId] = t.c.value('(../personid)[1]', 'NVARCHAR(50)')
    ,[Entertime] = t.c.value('.', 'NVARCHAR(50)')
FROM @x.nodes('root/person/entertime') t(c))
, cte_leavetime as (
    SELECT
    [PersonId] = t.c.value('(../personid)[1]', 'NVARCHAR(50)')
    ,[Leavetime] = t.c.value('.', 'NVARCHAR(50)')
FROM @x.nodes('root/person/leavetime') t(c))
SELECT 
    m.PersonID
    ,[First Name]
    ,[Last Name] 
    ,[Entertime]
    ,[Leavetime]
FROM cte_maindata m
    LEFT OUTER JOIN cte_entertime e on m.PersonId = e.PersonId
    LEFT OUTER JOIN cte_leavetime l on m.PersonId = l.PersonId
Dan Rigby
A: 

Haven't realized you may have multiple persons in the document. My query would be incorrect in that case anyway. I thought maybe if you first shred out each person into its own XML fragment and ten extract the enter/leave times might perform better. I don't have 215k person XML to try, but here is an idea:

declare @x xml;
select @x = N'<root>
    <person>
        <personid>HH3269732</personid>
        <firstname>John</firstname>
        <lastname>Smith</lastname>
        <entertime>01/02/2008 10:15</entertime>
        <leavetime>01/02/2008 11:45</leavetime>
        <entertime>03/01/2008 08:00</entertime>
        <leavetime>03/01/2008 10:00</leavetime>
        <entertime>04/01/2008 08:00</entertime>
    </person>
    <person>
        <personid>HH3269733</personid>
        <firstname>Jane</firstname>
        <lastname>Doe</lastname>
        <entertime>01/03/2008 10:15</entertime>
        <leavetime>01/03/2008 11:45</leavetime>
        <entertime>03/04/2008 08:00</entertime>
        <leavetime>03/04/2008 10:00</leavetime>
        <entertime>04/04/2008 08:00</entertime>
    </person>
</root>';


with cte_person as (
    select
     t.c.value('(personid)[1]', 'NVARCHAR(50)') as personid
     , t.c.value('(firstname)[1]', 'NVARCHAR(50)') as firstname
     , t.c.value('(lastname)[1]', 'NVARCHAR(50)') as lastname
     , t.c.query('entertime') as entertime
     , t.c.query('leavetime') as leavetime
    FROM @x.nodes('root/person') t(c))
, cte_cross_enter as (
    select
     p.personid
     , p.firstname
     , p.lastname
     , x.c.value('.', 'datetime') as entertime
     , row_number() over (partition by personid order by x.c) as row_enter
     from cte_person p
     cross apply p.entertime.nodes('/entertime') x(c))
, cte_cross_leave as (
    select
     p.personid 
     , x.c.value('.', 'datetime') as leavetime
     , row_number() over (partition by personid order by x.c) as row_leave
     from cte_person p
     cross apply p.leavetime.nodes('/leavetime') x(c))
select e.personid
    , e.firstname
    , e.lastname
    , e.entertime
    , l.leavetime
    from cte_cross_enter e
    left outer join cte_cross_leave l 
      on e.personid = l.personid and 
      e.row_enter = l.row_leave
Remus Rusanu
whoops, sry, didn't notice you already posted a solution
Remus Rusanu