ansaurus

Question

Easiest way of checking if a string consists of unique characters?

Answer 1

+2 A:

I like the HashSet idea. It's conceptually simple, and only does one pass through the string. For a simple performance improvement, check the add return value. One thing you should be aware of is that this works by case-folding. in one direction. You could create a wrapper class around Character with different equals semantics to really be case-insensitive.

Interestingly, Apache Commons has a CaseInsensitiveMap (src) which works by upper-casing then lower-casing the key. As you probably know, Java's HashSet is backed by a HashMap.

public static boolean allUnique(String s)
{
  HashSet<Character> hs = new HashSet<Character>(s.length()); // This can be tuned, obviously
  for(int i = 0; i < s.length(); i++)
  {
    if(!hs.add(s.charAt(i).toUpperCase())
      return false;
  }
  return true;
}

Matthew Flaschen 2010-03-16 04:52:49

Answer 2

+1 A:

An improvement on option 2 is to check the boolean flag that the HashSet add method returns. It's true if the object wasn't already there. Though, for this method to be at all useful you'd first have to set the string to all caps or lowercase.

Mimisbrunnr 2010-03-16 04:53:57

Answer 3

+8 A:

I don't like 1. -- it's an O(N²) algorithm. Your 2. is roughly linear, but always traverses the entire string. Your 3. is O(N lg₂ N), with (probably) a relatively high constant -- probably almost always slower than 2.

My preference, however, would be when you try to insert a letter into the set, check whether it was already present, and if it was, you can stop immediately. Given random distribution of letters, this should require scanning only half the string on average.

Edit: both comments are correct that exactly what portion of the string you expect to scan will depend on the distribution and the length -- at some point the string is long enough that a repeat is inevitable, and (for example) one character short of that, the chance is still pretty darned high. In fact, given a flat random distribution (i.e., all characters in the set are equally likely), this should fit closely with the birthday paradox, meaning the chance of a collision is related to the square root of the number of possible characters in the character set. Just for example, if we assumed basic US-ASCII (128 characters) with equal probability, we'd reach a 50% chance of a collision at around 14 characters. Of course, in real strings we could probably expect it sooner than that, since the ASCII characters aren't used with anywhere close to equal frequency in most strings.

Jerry Coffin 2010-03-16 04:56:34

+1 - but your last sentence is wrong. The average number of characters scanned (assuming randomly ordered/chosen characters) is a function of the probability distribution of the letters and the average string length.

Stephen C 2010-03-16 05:20:51

off-topic, but how do you get superscripts and subscripts?

Kache4 2010-03-16 05:30:18

Alconja 2010-03-16 05:37:26

Answer 4

+1 A:

What about using an int to store the bits corresponding to the index of the letter of the alpabhet? or maybe a long to be able to reach 64 distinct symbols.

long mask;
// already lower case
string = string.toLowerCase();
for (int i = 0; i < string.length(); ++i)
{
  int index = 1 << string.charAt(i) - 'a';
  if (mask & index == index)
    return false;

  mask |= index;
}
return true;

This should be < O(n) on average case, O(n) on worst. But I'm not sure how much performant bitwise operations are in Java..

Jack 2010-03-16 05:07:12

Answer 5

+2 A:

By "unique letters" do you mean merely the standard English set of 26, or are you allowing interesting Unicode? What result do you expect if the string contains a non-letter?

If you're only considering 26 possible letters, and you want to either ignore any non-letter or consider it an automatic fail, the best algorithm is likely this pseudocode:

create present[26] as an array of booleans.
set all elements of present[] to false.
loop over characters of your string
  if character is a letter
    if corresponding element of present[] is true
      return false.
    else
      set corresponding element of present[] to true.
    end if
  else
    handle non-letters
  end if
end loop

The only remaining question is whether your array should actually be an array (requiring 26 operations to zero), or a bitfield (possibly requiring more work to check/set, but can be zeroed in a single operation). I think that the bitfield access will be pretty much comparable to the array lookup, if not faster, so I expect a bitfield is the right answer.

Brooks Moses 2010-03-16 05:09:44

Answer 6

+1 A:

public boolean hasUniqChars(String s){
  Hashset h = new Hashset();
  HashSet<Character> h = new HashSet<Character>();
  for (char c : s.toCharArray()) {
  if (!h.add(Character.toUpperCase(c))) // break if already present
    return false;
  }
  return true;
}

You should use hashset technique if you are performing char sets like utf-8 and for internationalization's sake.

Javadoc on Character.toUpperCase for cases of utf: This method (toUpperCase(char) ) cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the toUpperCase(int) method.

Blessed Geek 2010-03-16 05:17:14

Answer 7

+1 A:

I would suggest a variant of (2) - use an array of "character already seen" flags instead of a hashset. As you loop through the string, exit immediately if the current character was already seen.

If you have a bitvector class available (I forget whether Java provides one), you could use that, though the memory saving will not necessarily result in any speed improvement and could easily slow things down.

It's O(n) worst case, though, and could have far better average performance depending on your strings - you may well find that most have a repeat near the start. In fact, strictly speaking, it's O(1) worst case anyway since a string longer than the size of the character set must have repeated characters so you have a constant bound to the number of characters you need to check in each string.

Steve314 2010-03-16 05:19:02

Answer 8

+3 A:

Option 2 is the best of the three - Hashing is faster than searching.

However, there's an even faster method, if you have enough memory for it.

Take advantage of the fact that a character set is limited and already enumerated, and keep track of what's appeared and what hasn't as you check each character.

For example, if you're using one-byte chars, there are only 256 possibilities. You would only need 256 bits to keep track as you read through the string. If the character 0x00 occurs, flip the first bit. If the character 0x05 occurs, flip the sixth bit, and so on. When an already-flipped bit is encountered, the string isn't unique.

It's worst case O(min(n, m)) where n is the length of the string, and m is the size of the character set.

And of course, as I saw in another person's comment, if n > m (i.e. length of string > size of character set), then by pigeon-hole principle, there is a repeated character, determinable in O(1) time.

Kache4 2010-03-16 05:37:19

+1 I think you'd need 2^8 bits for one-byte characters, but wouldn't you need 2^16 for two-byte? `BitSet` would work either way. http://java.sun.com/javase/6/docs/api/java/util/BitSet.html

trashgod 2010-03-16 09:18:34

oh whoops, I got myself confused there.

Kache4 2010-03-16 11:54:02

ansaurus

tags:

views:

answers:

Easiest way of checking if a string consists of unique characters?

related questions