views:

282

answers:

11

I have a string, and another text file which contains a list of strings.

We call 2 strings "brotherhood strings" when they're exactly the same after sorting alphabetically.

For example, "abc" and "cba" will be sorted into "abc" and "abc", so the original two are brotherhood. But "abc" and "aaa" are not.

So, is there an efficient way to pick out all brotherhood strings from the text file, according to the one string provided?

For example, we have "abc" and a text file which writes like this:

abc
cba
acb
lalala

then "abc", "cba", "acb" are the answers.

Of course, "sort & compare" is a nice try, but by "efficient", i mean if there is a way, we can determine a candidate string is or not brotherhood of the original one after one pass processing.

This is the most efficient way, i think. After all, you can not tell out the answer without even reading candidate strings. For sorting, most of the time, we need to do more than 1 pass to the candidate string. So, hash table might be a good solution, but i've no idea what hash function to choose.

+2  A: 

iterate, sort, compare. that shouldn't be too hard, right?

SilentGhost
Unfortunately, you'd have to sort every string that way, which seems like more work than you should have to do to solve this...
Mark Bessey
well you could push already selected values into a hash table and check there first
SilentGhost
or create all possible permutations first and then check each value for existence.
SilentGhost
sort and compare is too time-consuming, so is checking all permutations. hash table is a good point, so the key is, where's the hash function?
ibread
a 10 letter word would have over three million permutations. if there were only 100 words in the file, this might be overkill :)
Peter Recore
of course Peter, that's why it's a third in the list of suggestion ;)
SilentGhost
+1  A: 
  • Sort your query string
  • Iterate through the Collection, doing the following:
    • Sort current string
    • Compare against query string
    • If it matches, this is a "brotherhood" match, save it/index/whatever you want

That's pretty much it. If you're doing lots of searching, presorting all of your collection will make the routine a lot faster (at the cost of extra memory). If you are doing this even more, you could pre-sort and save a dictionary (or some hashed collection) based off the first character, etc, to find matches much faster.

Reed Copsey
+3  A: 

Most efficient algorithm I can think of:

  • Set up a hash table for the original string. Let each letter be the key, and the number of times the letter appears in the string be the value. Call this hash table inputStringTable
  • Parse the input string, and each time you see a character, increment the value of the hash entry by one
  • for each string in the file
  • create a new hash table. Call this one brotherStringTable.
  • for each character in the string, add one to a new hash table. If brotherStringTable[character] > inputStringTable[character], this string is not a brother (one character shows up too many times)
  • once string is parsed, compare each inputStringTable value with the corresponding brotherStringTable value. If one is different, then this string is not a brother string. If all match, then the string is a brother string.

This will be O(n*k), where n is the length of the input string (any strings longer than the input string can be discarded immediately) and k is the number of strings in the file. Any sort based algorithm will be O(n*k lg n), so in certain cases, this algorithm is faster than a sort based algorithm.

ristonj
It should be O(N) - you have N O(1) increments on the hash table, and then 2N O(1) accesses to compare the hashes. The constant factor however will be way higher that my answer though.
Pete Kirkham
Sorry, I should have said O(n*k), where n is the length of the original string, and k is the number of elements in the file. Edited.
ristonj
+3  A: 

Sorting each string, then comparing it, works out to something like O(N*(k+log S)), where N is the number of strings, k is the search key length, and S is the average string length.

It seems like counting the occurrences of each character might be a possible way to go here (assuming the strings are of a reasonable length). That gives you O(k+N*S). Whether that's actually faster than the sort & compare is obviously going to depend on the values of k, N, and S.

I think that in practice, the cache-thrashing effect of re-writing all the strings in the sorting case will kill performance, compared to any algorithm that doesn't modify the strings...

Mark Bessey
how does O(N*(k+log S)) come out?Why not O(N*(SlogS+k))?
ibread
Yeah, looks like I dropped an S there. thanks.
Mark Bessey
+2  A: 

Let's assume your alphabet is from 'a' to 'z' and you can index an array based on the characters. Then, for each element in a 26 element array, you store the number of times that letter appears in the input string.

Then you go through the set of strings you're searching, and iterate through the characters in each string. You can decrement the count associated with each letter in (a copy of) the array of counts from the key string.

If you finish your loop through the candidate string without having to stop, and you have seen the same number of characters as there were in the input string, it's a match.

This allows you to skip the sorts in favor of a constant-time array copy and a single iteration through each string.

EDIT: Upon further reflection, this is effectively sorting the characters of the first string using a bucket sort.

Pillsy
A: 

The most efficient answer will depend on the contents of the file. Any algorithm we come up with will have complexity proportional to N (number of words in file) and L (average length of the strings) and possibly V (variety in the length of strings)

If this were a real world situation, I would start with KISS and not try to overcomplicate it. Checking the length of the target string is simple but could help avoid lots of nlogn sort operations.

target = sort_characters("target string")
count = 0
foreach (word in inputfile){
   if target.len == word.len && target == sort_characters(word){
      count++
    }
 }
Peter Recore
+2  A: 

I think what will help you is the test if two strings are anagrams. Here is how you can do it. I am assuming the string can contain 256 ascii characters for now.

#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];

bool isAnagram(char *src, char *dest) {
    len1 = strlen(src);
    len2 = strlen(dest);
    if (len1 != len2)
        return false;

    memset(alphabets, 0, sizeof(alphabets));
    for (i = 0; i < len1; i++)
        alphabets[src[i]]++;
    for (i = 0; i < len2; i++) {
        alphabets[dest[i]]--;
        if (alphabets[dest[i]] < 0)
            return false;
    }

   return true;
}

This will run in O(mn) if you have 'm' strings in the file of average length 'n'

Ashwin
+1  A: 

It's fairly obvious that each brotherhood string will have the same histogram of letters as the original. It is trivial to construct such a histogram, and fairly efficient to test whether the input string has the same histogram as the test string ( you have to increment or decrement counters for twice the length of the input string ).

The steps would be:

  • construct histogram of test string ( zero an array int histogram[128] and increment position for each character in test string )
  • for each input string

    • for each character in input string c, test whether histogram[c] is zero. If it is, it is a non-match and restore the histogram.

      • decrement histogram[c]
    • to restore the histogram, traverse the input string back to its start incrementing rather than decrementing

At most, it requires two increments/decrements of an array for each character in the input.

Pete Kirkham
A: 

I would recommend:
for each string in text file :

  1. compare size with "source string" (size of brotherhood strings should be equal)
  2. compare hashes (CRC or default framework hash should be good)
  3. in case of equity, do a finer compare with string sorted.

It's not the fastest algorithm but it will work for any alphabet/encoding.

Tox'N
A: 

Here's another method, which works if you have a relatively small set of possible "letters" in the strings, or good support for large integers. Basically consists of writing a position-independent hash function...

Assign a different prime number for each letter:

prime['a']=2;
prime['b']=3;
prime['c']=5;

Write a function that runs through a string, repeatedly multiplying the prime associated with each letter into a running product

long long key(char *string)
{
  long long product=1;
  while (*string++) {
    product *= prime[*string];
  }
  return product;
}

This function will return a guaranteed-unique integer for any set of letters, independent of the order that they appear in the string. Once you've got the value for the "key", you can go through the list of strings to match, and perform the same operation.

Time complexity of this is O(N), of course. You can even re-generate the (sorted) search string by factoring the key. The disadvantage, of course, is that the keys do get large pretty quickly if you have a large alphabet.

Mark Bessey
A: 

Here's an implementation. It creates a dict of the letters of the master, and a string version of the same as string comparisons will be done at C++ speed. When creating a dict of the letters in a trial string, it checks against the master dict in order to fail at the first possible moment - if it finds a letter not in the original, or more of that letter than the original, it will fail. You could replace the strings with integer-based hashes (as per one answer regarding base 26) if that proves quicker. Currently the hash for comparison looks like a3c2b1 for abacca.

This should work out O(N log( min(M,K) )) for N strings of length M and a reference string of length K, and requires the minimum number of lookups of the trial string.

master = "abc"
wordset = "def cba accb aepojpaohge abd bac ajghe aegage abc".split()

def dictmaster(str):
    charmap = {}
    for char in str:
        if char not in charmap:
            charmap[char]=1
        else:
            charmap[char] += 1
    return charmap

def dicttrial(str,mastermap):
    trialmap = {}
    for char in str:
        if char in mastermap:
            # check if this means there are more incidences
            # than in the master
            if char not in trialmap:
                trialmap[char]=1
            else:
                trialmap[char] += 1
        else:
            return False
    return trialmap

def dicttostring(hash):
    if hash==False:
        return False
    str = ""
    for char in hash:
        str += char + `hash[char]`
    return str

def testtrial(str,master,mastermap,masterhashstring):
    if len(master) != len(str):
        return False
    trialhashstring=dicttostring(dicttrial(str,mastermap))
    if (trialhashstring==False) or (trialhashstring != masterhashstring):
        return False
    else:
        return True

mastermap = dictmaster(master)
masterhashstring = dicttostring(mastermap)
for word in wordset:
    if testtrial(word,master,mastermap,masterhashstring):
        print word+"\n"
Phil H