This was tweeted to me by @leoniedu today, and since I don't have an answer for him, I thought I would post it here.
I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern, x, max.distance = 18)
agrep(pattern, x, max.distance = 19)
That behaves exactly as I would expect. The two strings differ by 18 characters (the pattern is 34 characters long and x is 16), so I would expect 18 to be the threshold for a match. Here's what's confusing me:
agrep(pattern, x, max.distance = 30)
agrep(pattern, x, max.distance = 31)
agrep(pattern, x, max.distance = 32)
agrep(pattern, x, max.distance = 33)
Why do 30 and 33 produce matches, but not 31 and 32? To save you some counting:
nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
nchar("Bundeskanzleramt")
[1] 16
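For anyone who wants to reproduce this, here is a minimal sketch that does two things: it checks the 18-character figure using adist() from the utils package, and it sweeps max.distance over a range of integers to tabulate which values produce a match. The names d, distances and matched are only illustrative and are not part of the original example. I'm deliberately not showing the output of the sweep, since I'd assume it can vary with the R version.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Generalized Levenshtein distance between the two strings.
# adist() returns a matrix, so take its single entry.
d <- adist(pattern, x)[1, 1]
d  # 18, the same as nchar(pattern) - nchar(x)

# Try every integer max.distance in a range (the range itself is an
# arbitrary choice) and record whether agrep() reports a match.
distances <- 15:34
matched <- vapply(
  distances,
  function(k) length(agrep(pattern, x, max.distance = k)) > 0,
  logical(1)
)

# One row per max.distance value, with TRUE where agrep() matched.
data.frame(max.distance = distances, matched = matched)
```

Putting the whole range in one table makes it easier to see whether the odd results at 31 and 32 are isolated or part of a pattern.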