Apparently Java's regex flavor treats umlauts and other non-ASCII letters as non-"word characters". For example,

        "TESTÜTEST".replaceAll( "\\W", "" )

returns "TESTTEST" for me. What I want is to remove only the characters that are truly non-"word characters". Is there any way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize I forgot ô?

A: 

Well, here is one solution I ended up with, but I hope there's a more elegant one...

                StringBuilder result = new StringBuilder();
                for(int i=0; i<name.length(); i++) {
                    char tmpChar = name.charAt( i );
                    if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
                        result.append( tmpChar );
                    }
                }

`result` ends up holding the desired string...
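For what it's worth, the loop above walks the string one `char` at a time, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) would be dropped. A minimal sketch of the same filter iterating by code point instead, assuming Java 8 or later; the class and method names here are made up for illustration and are not from the original post:

```java
public class WordCharFilter {

    // Keeps letters, digits and '_' from the input, iterating by code point
    // so that characters stored as surrogate pairs are handled as well.
    public static String keepWordChars(String input) {
        StringBuilder result = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
             .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // \u00DC is the U-umlaut from the question.
        System.out.println(keepWordChars("TEST-\u00DC_TEST 42!")); // prints TEST, U-umlaut, _TEST42
    }
}
```

For ordinary European text the two versions should behave the same; the difference only shows up with supplementary characters.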

Epaga
The fact that your String variable is named `name` suggests it won't be a large String. But in case it does get large (a couple of thousand characters), I'd go with the for loop as you have it now.
Bart Kiers
+5  A: 

Use [^\p{L}\p{N}] - this matches all (Unicode) characters that are neither letters nor numbers.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{N}]", "");
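One thing to watch: unlike `\W`, the pattern shown above also removes the underscore. Here is a small sketch comparing it with two alternatives, assuming Java 7 or later for the `(?U)` flag (the embedded form of `Pattern.UNICODE_CHARACTER_CLASS`). The class name and the `strip` helper are illustrative only:

```java
import java.util.regex.Pattern;

public class UnicodeWordDemo {

    // Letters and digits only, as in the pattern shown above; '_' is removed too.
    static final Pattern NOT_LETTER_OR_DIGIT = Pattern.compile("[^\\p{L}\\p{N}]");

    // Same idea, but '_' is kept, which is closer to what \w covers.
    static final Pattern NOT_WORD = Pattern.compile("[^\\p{L}\\p{N}_]");

    // Java 7+: (?U) enables UNICODE_CHARACTER_CLASS, making \W Unicode-aware.
    static final Pattern UNICODE_NON_WORD = Pattern.compile("(?U)\\W");

    // Removes every match of the given pattern from the string.
    static String strip(Pattern p, String s) {
        return p.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "TEST\u00DC_TEST!"; // \u00DC is the U-umlaut
        System.out.println(strip(NOT_LETTER_OR_DIGIT, s)); // umlaut kept, '_' and '!' removed
        System.out.println(strip(NOT_WORD, s));            // umlaut and '_' kept, '!' removed
        System.out.println(strip(UNICODE_NON_WORD, s));    // umlaut and '_' kept, '!' removed
    }
}
```

Which variant fits depends on whether the underscore should count as a word character in your case.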
Tim Pietzcker
Why the `\\[` inside your character class?
Bart Kiers
Oops. Typo. Corrected.
Tim Pietzcker
works like a charm! thanks!
Epaga
A: 

At times you do not want to remove the characters entirely, but only strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer; 
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {

  private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
  private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

  /**
   * Normalizes a String by reducing all accented characters to the 128
   * US-ASCII characters. This method handles German umlauts and "sharp-s" correctly.
   * 
   * @param s
   *          The String to normalize
   * @return The normalized String
   */
  public static String normalize(String s) {

    if (s == null) return null;

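    // Transliterate German umlauts and sharp-s first; NFD alone would
    // reduce "ä" to "a" and drop "ß" entirely.
    String n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
    // Decompose the remaining accented characters, then drop everything
    // outside US-ASCII (including the detached combining marks).
    n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");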

    return n;

  }

  /**
   * Returns a clean representation of a String that can be used safely
   * within a URL. Slugs are a more human-friendly alternative to URL-encoding
   * a String.
   * <p>
   * The method first normalizes a String, then converts it to lowercase and
   * removes ASCII characters that might be problematic in URLs:
   * <ul>
   * <li>all whitespace
   * <li>dots ('.')
   * <li>slashes ('/')
   * </ul>
   * 
   * @param s
   *          The String to slugify
   * @return The slugified String
   * @see #normalize(String)
   */
  public static String slugify(String s) {

    if (s == null) return null;

    String n = normalize(s);
    n = StringUtils.lowerCase(n);
    n = n.replaceAll("[\\s./]", "");

    return n;

  }

}

Being a German speaker, I've included proper handling of German umlauts as well. The list should be easy to extend for other languages.
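If you only need the accent-stripping step and would rather not depend on Commons Lang, a JDK-only sketch could look like the following. It is a variation and not the class above: the names are made up for illustration, and it removes only the combining marks (`\p{M}`) after decomposition, so characters with no decomposition, such as "ß", are left untouched instead of being deleted.

```java
import java.text.Normalizer;

public class AccentStripper {

    // Decomposes accented characters (NFD) and removes the combining marks,
    // leaving the base letters in place.
    public static String stripAccents(String s) {
        if (s == null) return null;
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00E9 and \u00E8 are e-acute and e-grave.
        System.out.println(stripAccents("caf\u00E9 cr\u00E8me")); // prints "cafe creme"
    }
}
```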

HTH

Stefan Haberl