Hi there!
I'm using XSLT keys in many contexts. Usually the key values are more or less unique, with only very infrequent duplicates. Now I have defined a key that has A LOT of instances for some key values. To be precise: I'm processing a 1.7 gigabyte file with 420,000 entries carrying a @STEREOTYPE attribute. Some of the stereotypes occur up to 90,000 times. Those are not the ones I'm interested in, though; the ones I would like to select usually have maybe 10 to 20 instances.
The key definition is
<xsl:key
name="entityByStereotype"
match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
use="@STEREOTYPE"/>
Building this index takes forever; I usually kill the process after 5 or 6 hours.
An alternate key definition is
<xsl:key
name="entityByStereotype"
match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
use="concat(@STEREOTYPE, @OBJECT_ID)"/>
which forces the key values to be unique, and its build completes after 14 seconds. My assumption is that the sort algorithm does not work very well with multiple instances of the same key, resulting in O(n**2) complexity for each subset with identical keys. That is pretty bad for subsets of 90,000 entries. :-(
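Rough back-of-the-envelope: if the build really is quadratic per key value, a single stereotype with 90,000 instances alone accounts for on the order of 90,000² = 8.1 billion operations, compared to roughly 420,000 operations for a linear build over the whole file.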
However, I cannot use the alternate key definition, since I do not know the OBJECT_ID part of the key value beforehand.
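In other words, with the composite key every lookup would have to look something like the following, and $objectId is exactly the value I don't have at that point ('MyStereotype' and $objectId are placeholders):
<xsl:for-each
  select="key('entityByStereotype', concat('MyStereotype', $objectId))">
  <!-- would only ever return the single matching ENTITY -->
</xsl:for-each>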
Any ideas? Thanks a lot!
Saxon used: Version 9.1.0.5