Hi there!

I use XSLT keys in many contexts. Usually the key values are more or less unique, with only infrequent duplicates. Now I have defined a key for which some key values have a very large number of instances. To be precise: I'm processing a 1.7 GB file with 420,000 entries that have a @STEREOTYPE attribute. Some of the stereotypes occur up to 90,000 times. Those are not the ones I'm interested in, though. The ones I would like to select usually have maybe 10 to 20 instances.

The key definition is

<xsl:key 
     name="entityByStereotype" 
     match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
     use="@STEREOTYPE"/>

Building the index takes forever; I usually kill the process after 5 or 6 hours.

An alternate key definition is

<xsl:key 
     name="entityByStereotype" 
     match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
     use="concat(@STEREOTYPE, @OBJECT_ID)"/>

which forces the key values to be unique; its build finishes after 14 seconds. My assumption is that the sort algorithm does not cope well with many instances of the same key, resulting in O(n**2) complexity for each subset with identical keys. This is pretty bad for subsets of 90,000 entries. :-(

However, I cannot use the alternate index definition, since I do not know the OBJECT_ID part of the instance beforehand.

Any ideas? Thanks a lot!

Saxon used: Version 9.1.0.5

A: 

Have you tried just using <xsl:for-each-group>?
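A minimal sketch of what I mean, assuming an XSLT 2.0 processor and the element names from your key definition. It only prints each stereotype with its number of entities; whether it avoids the slowdown you see with the key is something you would have to measure:

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One group per distinct @STEREOTYPE value -->
    <xsl:for-each-group
         select="REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
         group-by="@STEREOTYPE">
      <!-- current-grouping-key() is the stereotype,
           current-group() holds all ENTITY elements that share it -->
      <xsl:value-of
           select="concat(current-grouping-key(), ': ',
                          count(current-group()), '&#10;')"/>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Inside the group you can process current-group() directly instead of calling key().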

If you provide a suitable source XML document, I would be interested in helping to find a better solution.

Update: A few other tricks I'd recommend:

1) If you know in advance the @STEREOTYPE values you are interested in, use:

<xsl:key  
     name="entityByStereotype"  
     match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY[@STEREOTYPE = ($val1, $val2,...,$val-n)]" 
     use="@STEREOTYPE"/>

If they occur, as you say, just 10-20 times, chances are the hash table (yes, sorting isn't meaningful for implementing keys) will be built much faster.

2) Split the XML document into several smaller documents (say, 10) and process them separately.
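To make trick 1 concrete, here is a sketch with literal values instead of variables. The stereotype names 'Actor' and 'UseCase' are made up for illustration, so substitute the ones you actually need; the @OBJECT_ID attribute is taken from your alternate key definition:

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Only ENTITY elements with one of the listed stereotypes are indexed.
       'Actor' and 'UseCase' are hypothetical placeholder values. -->
  <xsl:key
       name="entityByStereotype"
       match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY[@STEREOTYPE = ('Actor', 'UseCase')]"
       use="@STEREOTYPE"/>

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Look up one of the indexed stereotypes and print each OBJECT_ID -->
    <xsl:for-each select="key('entityByStereotype', 'Actor')">
      <xsl:value-of select="concat(@OBJECT_ID, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Comparing @STEREOTYPE against a parenthesized sequence with = is XSLT 2.0 syntax, so this assumes you run the stylesheet as version 2.0.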

Dimitre Novatchev
Selecting the relevant stereotypes by name did the job. Thank you very much! However, I think this behaviour is a bug. Maybe I should report it to Saxon.
Marcus Rickert