How's "most related" defined in your scenario?
If you just use Contains, the outcome would miss a lot of "related" results that don't match in the form of exact substring.
What kind of input are you expecting?
If the input is just a single word, then this algorithm might work for you:
Fitness should be calculated as the
length of the longest common substring
of the target string found in the
input string, divided by the length of
the target string. For example, HPPLE
has a score of .8 compared to APPLE,
since the longest substring of APPLE
found in HPPLE is 4 letters long,
which is .8 of APPLE's length.
Additionally, penalize the score by
0.1 for each extraneous letter the input string has over the length of
the target string. For example. HAPPLE
has a score of .9 compared to APPLE,
because it is 1 letter longer, and
otherwise has the complete substring
APPLE. Note that this makes it
possible for a candidate word to have
a negative score.
Of course there are a lot of other, better algorithms to calculate distance.
If input may contain multiple words, then you're probably better off with some other algorithm to calculate the "distance" between strings.
And umm...your original query didn't sort the result, so you're not taking the top 7 most related ones, you're just taking the first 7 results in a sequence.
To wrap things up, this might work for you:
using System;
using System.Collections.Generic;
using System.Linq;
public static class StringDistanceUtil {
/// <summary>
/// Returns the longest common substring of the given two arguments.
/// </summary>
/// <param name="first">
/// the first string
/// </param>
/// <param name="second">
/// the second string
/// </param>
/// <returns>
/// the longest common substring of the given two arguments
/// </returns>
public static string LongestCommonSubstringWith(this string first, string second) {
// could have used dynamic programming, or generalized suffix tree
// to solve the LCS problem, but here we'll just stick to simplicity
var start = 0; // The start in a of the longest found so far
var len = 0; // The length of the longest found so far
for (var i = 0; i < first.Length - len; ++i) {
for (var j = first.Length - i; j > len; --j) {
if (second.Contains(first.Substring(i, j))) {
start = i;
len = j;
break; // Exit the inner loop
}
}
}
return first.Substring(start, len);
}
/// <summary>
/// Returns the distance of two strings.
/// </summary>
/// <param name="str">
/// a string
/// </param>
/// <param name="target">
/// the target string
/// </param>
/// <returns>
/// the distance from a string to the target string
/// </returns>
public static double DistanceFrom(this string str, string target) {
var strLen = str.Length;
var targetLen = target.Length;
var ratio = str.LongestCommonSubstringWith(target).Length
/ (double) targetLen;
var penalty =
(strLen > targetLen) ?
(0.1 * (strLen - targetLen))
: 0;
return ratio - penalty;
}
static void Main(string[] args) {
var list = new List<string> {
"zero",
"range",
"shot",
"shoot",
"hop",
"rage",
"fang",
"age"
};
var target = "zero_range_shot";
var top5mostRelated = list
.OrderByDescending(str => str.ToUpper().DistanceFrom(target.ToUpper()))
.Take(5).ToList();
foreach (var str in top5mostRelated) Console.WriteLine(str);
}
}
and the output is:
range
zero
shot
shoot
fang