For performance reasons I need to split a set of objects, each identified by a string, into groups. An object is identified either by a number or by a string in prefixed (qualified) form, with dots separating the parts of the identifier:
12
323
12343
2345233
123123131
ns1:my.label.one
ns1:my.label.two
ns1:my.label.three
ns1:system.text.one
ns2:edit.box.grey
ns2:edit.box.black
ns2:edit.box.mixed
Numeric identifiers range from 1 to several million. Many of the text identifiers are likely to share the same namespace prefix (ns1:) and the same path prefix (edit.box.).
What is the best hash function for this purpose? It would be good if I could somehow predict the size of each bucket from statistics about the object identifiers. Are there any good articles on constructing a good hash function from such statistical information?
There are several million such identifiers, and the goal is to use the hash function to split them into groups of 1-2 thousand each.
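To make the goal concrete, here is a minimal sketch in Python of the kind of split I mean. FNV-1a is used only as a placeholder hash, not as a claim that it is the best choice for these identifiers. The function names, the figure of 3 million identifiers, and the target of 1,500 per group are all assumptions for illustration.

```python
import math
from collections import Counter

# 64-bit FNV-1a parameters (standard published constants).
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(identifier: str) -> int:
    """Hash the UTF-8 bytes of an identifier with 64-bit FNV-1a."""
    h = FNV_OFFSET
    for byte in identifier.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def bucket_count(total_ids: int, target_size: int) -> int:
    """Number of groups needed so the average group holds about target_size ids."""
    return max(1, math.ceil(total_ids / target_size))


def bucket_of(identifier: str, buckets: int) -> int:
    """Map an identifier (numeric or qualified) to a group index in [0, buckets)."""
    return fnv1a_64(identifier) % buckets


# Hypothetical figures: 3 million identifiers, about 1,500 per group.
buckets = bucket_count(3_000_000, 1_500)

sample = ["12", "323", "ns1:my.label.one", "ns1:my.label.two", "ns2:edit.box.grey"]
sizes = Counter(bucket_of(s, buckets) for s in sample)
```

Running the same `Counter` over the full identifier set would give the actual size of each group. If the hash behaved like a uniform random function, group sizes would scatter around the mean of total / buckets with a standard deviation of roughly the square root of that mean. What I cannot tell is how far the shared prefixes push a real hash function away from that ideal.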