ansaurus

Question

Remove all non-"word characters" from a String in Java, leaving accented characters?

Answer 1

A:

Well, here is one solution I ended up with, but I hope there's a more elegant one...

    StringBuilder result = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
        char tmpChar = name.charAt(i);
        if (Character.isLetterOrDigit(tmpChar) || tmpChar == '_') {
            result.append(tmpChar);
        }
    }

After the loop, `result` holds only the letters, digits and underscores from `name`, accented letters included.
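One caveat with the loop above: it inspects one `char` at a time, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) are dropped. A minimal sketch of a code-point-aware variant follows; the class name `CodePointFilter` and method name `keepWordChars` are hypothetical and not part of the original answer.

```java
final class CodePointFilter {

    // Keeps Unicode letters, digits and '_' and drops everything else,
    // iterating by code point so surrogate pairs are treated as one character.
    static String keepWordChars(String name) {
        StringBuilder result = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || cp == '_') {
                result.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point's width.
            i += Character.charCount(cp);
        }
        return result.toString();
    }
}
```

For ordinary accented text, the char-based loop and this variant should behave the same; the difference only shows up for supplementary characters.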

Epaga 2009-10-23 08:05:43

The fact that your String variable is named `name` suggests it won't be a large String. But if it does get large (a couple of thousand characters), I'd stick with the for-loop you have now.

Bart Kiers 2009-10-23 09:34:35

Answer 2

+5 A:

Use [^\p{L}\p{N}] - this matches all (Unicode) characters that are neither letters nor numbers.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{N}]", "");
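A minimal sketch of the same idea as a reusable helper, assuming (as in the question) that the underscore should survive too, which is why `_` is added to the negated class. The class name `WordCharFilter` and method name `keepWordChars` are made up for illustration, and the pattern is compiled once rather than on every call.

```java
import java.util.regex.Pattern;

final class WordCharFilter {

    // Matches any character that is not a Unicode letter, a Unicode number, or '_'.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}_]");

    static String keepWordChars(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on source file encoding.
        System.out.println(keepWordChars("Cr\u00e8me br\u00fbl\u00e9e, snake_case #42!"));
    }
}
```

Note that accents stored as separate combining marks (decomposed text) belong to `\p{M}`, not `\p{L}`, so this pattern would strip them. Normalizing to NFC first, or adding `\\p{M}` to the class, would be one way around that.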

Tim Pietzcker 2009-10-23 08:11:54

Why the `\\[` inside your character class?

Bart Kiers 2009-10-23 08:14:29

Oops. Typo. Corrected.

Tim Pietzcker 2009-10-23 08:19:44

works like a charm! thanks!

Epaga 2009-10-23 08:33:35

Answer 3

A:

Sometimes you don't want to remove the characters entirely, just strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer; 
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {

  private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
  private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

  /**
   * Normalizes a String by stripping accents and removing any remaining
   * characters outside the 128-character US-ASCII range. German umlauts and
   * the "sharp s" are transliterated first (e.g. "ä" becomes "ae").
   * 
   * @param s
   *          The String to normalize
   * @return The normalized String
   */
  public static String normalize(String s) {

    if (s == null) return null;

    String n = null;

    n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
    n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

    return n;

  }

  /**
   * Returns a clean representation of a String which can be used safely
   * within a URL. Slugs are a more human-friendly alternative to URL-encoding
   * a String.
   * <p>
   * The method first normalizes a String, then converts it to lowercase and
   * removes ASCII characters, which might be problematic in URLs:
   * <ul>
   * <li>all whitespace
   * <li>dots ('.')
   * <li>slashes ('/')
   * </ul>
   * 
   * @param s
   *          The String to slugify
   * @return The slugified String
   * @see #normalize(String)
   */
  public static String slugify(String s) {

    if (s == null) return null;

    String n = normalize(s);
    n = StringUtils.lowerCase(n);
    n = n.replaceAll("[\\s./]", "");

    return n;

  }

}

Being a German speaker, I've included proper handling of German umlauts as well; the list should be easy to extend for other languages.
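For readers without Commons Lang on the classpath, here is a rough JDK-only sketch of the same two steps. It swaps `StringUtils.replaceEachRepeatedly` for a plain loop over `String.replace` and `StringUtils.lowerCase` for `toLowerCase(Locale.ROOT)`; the class name `SlugSketch` is hypothetical, and the behaviour is only assumed to match the class above for typical input.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SlugSketch {

    // German umlauts and sharp s, written as Unicode escapes to avoid encoding issues.
    private static final String[] SEARCH =
        { "\u00c4", "\u00e4", "\u00d6", "\u00f6", "\u00dc", "\u00fc", "\u00df" };
    private static final String[] REPLACE =
        { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

    static String normalize(String s) {
        if (s == null) return null;
        String n = s;
        // Transliterate the German characters before accents are stripped.
        for (int i = 0; i < SEARCH.length; i++) {
            n = n.replace(SEARCH[i], REPLACE[i]);
        }
        // NFD splits accented letters into base letter + combining mark;
        // the regex then removes everything that is not ASCII.
        return Normalizer.normalize(n, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    static String slugify(String s) {
        if (s == null) return null;
        // Lowercase, then drop whitespace, dots and slashes.
        return normalize(s).toLowerCase(Locale.ROOT).replaceAll("[\\s./]", "");
    }
}
```

With this sketch, `slugify("Caf\u00e9 M\u00fcller / Stra\u00dfe 5.")` yields `cafemuellerstrasze5`. Input that already stores umlauts in decomposed form (base letter plus combining diaeresis) would not match the search list and would simply lose the diaeresis.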

HTH

Stefan Haberl 2010-07-19 10:38:21
