ansaurus

Question

Testing for repeated characters in a string

Answer 1

+5 A:

If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.

EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:

public static bool IsSortedNoRepeats(string text)
{
    if (text.Length == 0)
    {
        return true;
    }
    char current = text[0];
    for (int i=1; i < text.Length; i++)
    {
        char next = text[i];
        if (next <= current)
        {
            return false;
        }
        current = next;
    }
    return true;
}

A shorter alternative if you don't mind repeating the indexer use:

public static bool IsSortedNoRepeats(string text)
{
    for (int i=1; i < text.Length; i++)
    {
        if (text[i] <= text[i-1])
        {
            return false;
        }
    }
    return true;
}

EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.

First a string-specific version:

public static int LongestRun(string text)
{
    if (text.Length == 0)
    {
        return 0;
    }
    char current = text[0];
    int currentRun = 1;
    int bestRun = 0;

    for (int i=1; i < text.Length; i++)
    {
        if (current != text[i])
        {
            bestRun = Math.Max(currentRun, bestRun);
            currentRun = 0;
            current = text[i];
        }
        currentRun++;
    }
    // It's possible that the final run is the best one
    return Math.Max(currentRun, bestRun);
}

Now we can also do this as a general extension method on IEnumerable<T>:

public static int LongestRun(this IEnumerable<T> source)
{
    bool first = true;
    T current = default(T);
    int currentRun = 0;
    int bestRun = 0;

    foreach (T element in source)
    {
        if (first || !EqualityComparer<T>.Default(element, current))
        {
            first = false;
            bestRun = Math.Max(currentRun, bestRun);
            currentRun = 0;
            current = element;
        }
    }
    // It's possible that the final run is the best one
    return Math.Max(currentRun, bestRun);
}

Then you can call "AABCD".LongestRun() for example.

Jon Skeet 2009-05-06 13:27:28

This is exactly how I would do it. +1

bruno conde 2009-05-06 13:31:09

And I thought you were a LINQ evangelist :P

BobTheBuilder 2009-05-06 14:09:34

I'm a fan of LINQ where it's appropriate. In this case, I don't think it is.

Jon Skeet 2009-05-06 15:29:41

I haven't delved into the MSIL to check, but my assumption was that the compiler would optimize the LINQ to similar as the loop. Can you elaborate on why you think this is more appropriate? I think your opinion on that would be very useful.

BenAlabaster 2009-05-06 15:39:58

Hadn't spotted the frequency bit. Will edit when I have time.

Jon Skeet 2009-05-06 15:40:35

@balabaster: LINQ will create a hash set (or something similar) behind the scenes, create iterators and all kinds of things which are appropriate for large data sets, but have a relatively high "constant cost" associated with them. I had assumed that the OP wanted an efficient solution.

Jon Skeet 2009-05-06 16:50:29

@JS Thanks. What you're saying, if I understand you correctly is that my assumption that the compiler would optimize to a similar loop was incorrect, which is fair enough. I hadn't had time to delve into the MSIL to figure that out.

BenAlabaster 2009-05-06 17:55:30

@balabaster: The C# *compiler* doesn't really have much to do with LINQ unless you're using a query expression. It's the implementation of Distinct() which is important here.

Jon Skeet 2009-05-06 18:34:52

@JS Gotcha, I'll have a dig into the MSIL to figure out the underlying differences when I get a moment.

BenAlabaster 2009-05-06 18:52:58

@JS Thanks for the code, you da man.

DrG 2009-05-08 09:43:55

Answer 2

+2 A:

Update Now, you'd need an array of counters to maintain a count.

Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. A mapping of the bit array index and the character set is upto you to decide. Break if you see that a particular bit is on already.

dirkgently 2009-05-06 13:27:43

+1. HashSet is also valid, but since this problem is limited to 26 items a bit / bool array is going to be way faster.

Steve Haigh 2009-05-06 13:39:37

If it's not too much to ask, could someone please provide an implementation of this?

bruno conde 2009-05-06 13:51:11

The question has now been edited and this answer no longer works, as it is not possible to get the frequency of duplicates this way.

Steve Haigh 2009-05-06 14:04:47

Answer 3

+11 A:

If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.

Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)

Bonus: here's a slick LINQ solution to implement Jon's functionality:

int longestRun =
    s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();

So, OK, it's not very fast! You got a problem with that?!

:-)

mquander 2009-05-06 13:27:58

Not very elegant though... a nice little LINQ statement would do it very succinctly.

BobTheBuilder 2009-05-06 14:09:21

That's true, but if he's even asking this question, I assume performance is important.

mquander 2009-05-06 14:09:42

Answer 4

+3 A:

I think the easiest way to achieve that is to use this simple regex

bool foundMatch = false;
foundMatch = Regex.IsMatch(yourString, @"(\w)\1");

If you need more information about the match (start, length etc)

     Match match = null;
 string testString = "ABCDE AABCD";
 match = Regex.Match(testString, @"(\w)\1+?");
 if (match.Success)
 {
  string matchText = match.Value; // AA
  int matchIndnex = match.Index;  // 6
  int matchLength = match.Length; // 2
 }

2009-05-06 13:37:15

Gah, beat me to it.

Steve Jessop 2009-05-06 13:39:59

Answer 5

+1 A:

/(.).*\1/

(or whatever the equivalent is in your regex library's syntax)

Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...

Steve Jessop 2009-05-06 13:38:17

Answer 6

+5 A:

Since you're using 3.5, you could do this in one LINQ query:

var results = stringInput
  .ToCharArray() // not actually needed, I've left it here to show what's actually happening
  .GroupBy(c=>c)
  .Where(g=>g.Count()>1)
  .Select(g=>new {Letter=g.First(),Count=g.Count()})
;

For each character that appears more than once in the input, this will give you the character and the count of occurances.

Winston Smith 2009-05-06 13:44:26

You could condense this a lot more by just checking distincts...if there's a different number of distincts than actual, then you've got a duplicate.

BobTheBuilder 2009-05-06 14:10:39

The OP wanted to know which letters were repeated, as well as the number of occurrences, hence my solution above.

Winston Smith 2009-05-06 14:16:59

@Bob as noted in the OPs edit, this takes care of the frequency which a more condensed solution probably wouldn't.

BenAlabaster 2009-05-06 14:17:27

+1 nice solution, as I noted in mine, so long as you don't need frequencies, it becomes even simpler.

BenAlabaster 2009-05-06 14:18:44

although it works for this solution, you should be selecting g.Key, not g.First()

Lucas 2009-05-06 17:09:59

Good point, thanks.

Winston Smith 2009-05-06 17:11:34

Answer 7

+6 A:

This will tell you very quickly if a string contains duplicates:

bool containsDups = "ABCDEA".Length != s.Distinct().Count();

It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...

Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.

BenAlabaster 2009-05-06 13:51:36

You can remove the .ToCharArray(), it'll work just fine with only s.Distinct().Count()...

BobTheBuilder 2009-05-06 14:06:07

Thanks, I've updated my code accordingly

BenAlabaster 2009-05-06 14:07:00

Answer 8

+2 A:

How about something like:

string strString = "AA BRA KA DABRA";

var grp = from c in strString.ToCharArray() 
        group c by c into m
        select new { Key = m.Key, Count = m.Count() };

foreach (var item in grp)
{
    Console.WriteLine(
        string.Format("Character:{0} Appears {1} times", 
        item.Key.ToString(), item.Count));
}

CasperT 2009-05-06 13:52:56

same as Joe's, but +1 showing different syntax. btw String implements IEnumerable<char>, no need for ToCharArray()

Lucas 2009-05-06 17:13:56

Answer 9

A:

When there is no order to work on you could use a dictionary to keep the counts:

String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
    if (!result.ContainsKey(c))
    {
        result[c] = 0; // initialize the counter in the result
    }
    result[c]++;
}

foreach (var charCombo in result)
{
    Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);   
}

Davy Landman 2009-05-06 14:08:41

Answer 10

A:

The hash solution Jon was describing is probably the best. You could use a HybridDictionary since that works well with small and large data sets. Where the letter is the key and the value is the frequency. (Update the frequency every time the add fails or the HybridDictionary returns true for .Contains(key))

Paul U 2009-05-06 17:03:58

ansaurus

tags:

views:

answers:

Testing for repeated characters in a string

related questions