I have a List of words I want to ignore, like this one:

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

+1  A: 

Something like this should work:

string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
  return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}
George Mauer
I suspect this is correct, and yet I don't actually know.
Steven Sudit
I might have swapped around the parameters to the second lambda, but this will definitely work. Aggregate is an incredibly powerful method; it's lame that people don't use it very often.
George Mauer
It should be noted that calling Replace multiple times is probably not the most performant way of doing this. Building the contents of the list into a static RegEx and using that to replace would probably be faster, but I suspect the difference won't matter in this case.
George Mauer
This is not correct because it uses `string.Replace` which can't match only on a word boundary. If you're going to use a RegEx, though, it should use a single compiled one.
Gabe
Good point, @Gabe. The example is more about the usage of Aggregate than of Replace.
George Mauer
A: 
public static string Trim(string text)
{
   var rv = text;
   foreach (var ignore in ignoreList)
      rv = rv.Replace(ignore, "");
   return rv;
}

Updated For Gabe


public static string Trim(string text)
{
   var rv = "";
   var words = text.Split(" ");
   foreach (var word in words)
   {
      var present = false;
      foreach (var ignore in ignoreList)
         if (word == ignore)
            present = true;
      if (!present)
         rv += word;
   }
   return rv;
}
Umair Ashraf
No LINQ, no RegExp, yet it's correct. Only thing I'd change is the use of an empty string literal.
Steven Sudit
No, not correct. This will turn "123 Northampton" into "123 ampton".
Gabe
Close...now you need to make sure that you put back the space between words.
Gabe
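For reference, here is a minimal sketch of the word-by-word loop with the spaces put back, as the comment above asks. It is not the answerer's code: the class name `WordFilterLoop` and the method name `RemoveIgnoredWords` are made up for illustration, and it assumes words are separated by plain spaces and that matching is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class WordFilterLoop
{
    // Same sample list as in the question.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Hypothetical helper: keeps every word that is not in the ignore list
    // and puts a single space back between the kept words.
    public static string RemoveIgnoredWords(string text)
    {
        var kept = new List<string>();
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Whole-word comparison, so "Northampton" is left alone.
            if (!IgnoreList.Contains(word))
                kept.Add(word);
        }
        return string.Join(" ", kept);
    }
}
```

With this, `WordFilterLoop.RemoveIgnoredWords("14th Avenue North")` gives `"14th Avenue"`, and `"123 Northampton"` comes back unchanged.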
+2  A: 

What's wrong with a simple for loop?

string street = "14th Avenue North";
foreach (string word in ignoreList)
{
    street = street.Replace(word, string.Empty);
}
Albin Sunnanbo
Nothing wrong with the loop, I just thought there was another way of doing it.
Hugo Migneron
A: 

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.

Here's a start:

(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)

If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.

Brad
I guess we could. Do we really want to, though?
Steven Sudit
This is a good start. Now make it so that it only matches whole words.
Gabe
We used this approach to flag a huge list of customers as business or residential based on RegEx keywords generated from looking at the data.
Brad
+6  A: 
Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);
Bob
If there are special characters, you should escape the stuff in ignoreList: string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray())
Frank Schwieterman
Since odds are the list will contain words like `"St."`, escaping is advised. And you have to look only for whole words.
Gabe
@Frank Correct . . . though it isn't really specified where the list comes from. It would probably be easiest to just write the correct regular expression in the first place rather than to convert it from a list, unless the list is really necessary.
Bob
Yeah, building a Regex dynamically is only really worthwhile if the list contents might change. Using a Regex in general is only useful if this function is used a lot, as it's potentially faster than N string replacements.
Frank Schwieterman
A: 

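Why not just Keep It Simple?

public static string Trim(string text)
{
   var rv = text.Trim();
   foreach (var ignore in ignoreList)
   {
      // Only strip when the address ends with an ignored word.
      // Note that Replace removes every occurrence of it, not just the last.
      if (rv.EndsWith(ignore))
         rv = rv.Replace(ignore, string.Empty);
   }
   return rv;
}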
Vash
+1  A: 

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:

address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));

If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

Guffa
Thanks a lot for that. My ignore list will obviously be much longer than what I posted here, but not sure if it will be long enough to use your method. I will profile it and see though.
Hugo Migneron
+6  A: 

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or for .NET 3.5:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.

Gabe
*sip* - tastes like perl!
George Mauer
This is a great solution, both shorter and clearer than the regex versions.
AHM
You might as well split by the words - `text.Split(ignoreList.ToArray(), StringSplitOptions.None)`. That said, it is easier to adapt your approach to ignore case.
Kobi
What about punctuation before or after words?
Mark Byers
Kobi: `text.Split(ignoreList.ToArray())` doesn't work for the same reason all the `string.Replace` methods don't work.
Gabe
Mark: Presumably he would want to consider punctuation to be word-breakers. It's up to him, but I'd guess he'd want `text.Split(new[]{' ','.',',','-'})` but he can tweak it to support whatever algorithm he has.
Gabe
@Gabe: Then it won't match words containing punctuation, such as `St.`.
Mark Byers
Of course, not sure how I've missed that.
Kobi
Mark: I would expect that if he wants to ignore `St.` and he wants `.` to be a word-breaker, he would just put `St` in his ignore list.
Gabe
Thanks a lot, this is a great solution. Very clean and readable.
Hugo Migneron
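Pulling together the points raised in this thread (ignoring case, and punctuation stuck to a word), the split-and-filter idea could be sketched roughly as below. This is only an illustration under assumptions, not Gabe's code: `WordFilterLinq` and `RemoveIgnoredWords` are hypothetical names, `St` is added to the sample list to show the `St.` case, and only `.` and `,` are treated as strippable punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilterLinq
{
    // A HashSet with a case-insensitive comparer gives fast lookups
    // and makes "NORTH" and "north" match "North".
    private static readonly HashSet<string> IgnoreSet =
        new HashSet<string>(new[] { "North", "South", "East", "West", "St" },
                            StringComparer.OrdinalIgnoreCase);

    // Hypothetical helper: splits on whitespace, drops ignored words,
    // and rejoins the rest with single spaces.
    public static string RemoveIgnoredWords(string text)
    {
        // A null separator array means "split on any whitespace".
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Compare each word with leading/trailing '.' and ',' stripped,
        // so "St." and "NORTH," are recognized; kept words are not modified.
        return string.Join(" ", words.Where(w => !IgnoreSet.Contains(w.Trim('.', ','))));
    }
}
```

For example, `WordFilterLinq.RemoveIgnoredWords("14th avenue NORTH,")` gives `"14th avenue"`, while `"Northampton Way #123"` is returned as is. The `string.Join` overload used here needs .NET 4 or later; on older versions add `.ToArray()` as in the answer.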
+2  A: 

If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:

string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
                        string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");

Result:

14th Avenue 

If there are special characters you will need to fix two things:

  • Use Regex.Escape on each element of ignore list.
  • The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.

Here's how to fix these two problems:

Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
Mark Byers
It's a pretty good bet that his words *will* need escaping, because they'll be like `"St.", "Blvd.", "Rd."`
Gabe
That's a great way to handle the space problem raised in another comment.
Hugo Migneron
This is very clever and it seems like it would work on all the words. I will write some tests for it and try it out properly.
Hugo Migneron
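For anyone who wants to try the lookaround version end to end, here is a small sketch. It makes a few assumptions that are not in the answer: the names `WordFilterRegex` and `RemoveIgnoredWords` are invented, `St.` is added to the sample list to exercise `Regex.Escape`, `\s` is used in place of a literal space, matching is case-insensitive, and leftover whitespace is collapsed afterwards.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilterRegex
{
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West", "St." };

    // Built once and reused. Each word is escaped so that the '.' in "St."
    // is matched literally; the lookarounds require whitespace or the
    // start/end of the string on both sides of the word.
    private static readonly Regex IgnoreRegex = new Regex(
        string.Format(@"(?<=\s|^)({0})(?=\s|$)",
            string.Join("|", IgnoreList.Select(w => Regex.Escape(w)).ToArray())),
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: removes ignored words, then collapses the
    // whitespace that the removed words leave behind.
    public static string RemoveIgnoredWords(string text)
    {
        var stripped = IgnoreRegex.Replace(text, string.Empty);
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

Here `WordFilterRegex.RemoveIgnoredWords("123 Northampton St.")` gives `"123 Northampton"`. Whether to collapse whitespace depends on what the later similarity comparison expects.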
A: 

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:

string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "
Øyvind Bråthen
+1  A: 

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.

List<string> ignoreList = new List<string>()
{
    "North",
    "South",
    "East",
    "West"
};    

string s = "123 West 5th St"
        .Split(' ')  // Separate the words to an array
        .ToList()    // Convert array to List<>
        .Except(ignoreList) // Remove ignored keywords
        .Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string
Phil Gilmore
The `.ToList()` is unnecessary.
Gabe
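One more thing to watch with `Except`: it is a set operation, so besides removing the ignored words it also drops repeated words from the address. The sketch below compares it with a `Where` filter; the class and method names are made up for illustration, and it uses the `string.Join` overload from .NET 4 instead of `Aggregate`.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ExceptVersusWhere
{
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Except yields the distinct words that are not in IgnoreList,
    // so a word repeated in the input appears only once in the result.
    public static string WithExcept(string text)
    {
        return string.Join(" ", text.Split(' ').Except(IgnoreList));
    }

    // Where keeps every word that is not ignored, duplicates included.
    public static string WithWhere(string text)
    {
        return string.Join(" ", text.Split(' ').Where(w => !IgnoreList.Contains(w)));
    }
}
```

For the input `"12 West 12 St"`, `WithExcept` gives `"12 St"` while `WithWhere` gives `"12 12 St"`. For address matching the `Where` form is probably the safer choice.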