views:

352

answers:

3

Hello,

I spent days working on a function to get common chars in an array of strings, in the right order, to create a wildcard.

Here is an example to explain my problem. I made about 3 functions, but I always have a bug when the absolute position of each letter is different.

Let's assume "+" is the "wildcard char":

Array(
0 => '48ca135e0$5',
1 => 'b8ca136a0$5',
2 => 'c48ca13730$5',
3 => '48ca137a0$5');

Should return :

$wildcard='+8ca13+0$5';

In this example, the tricky thing is that $array[2] as 1 char more than others.

Other example :

Array(
0 => "case1b25.occHH&FmM",
1 => "case11b25.occHH&FmM",
2 => "case12b25.occHH&FmM",
3 => "case20b25.occHH&FmM1");

Should return :

$wildcard='case+b25.occHH&FmM+';

In this example, the tricky parts are :
- Repeating chars, such as 1 -> 11 in the "to delete" part, and c -> cc in the common part
- The "2" char in $array[2] & [3] in the "to delete" part is not in the same position
- The "1" char at the end of the last string

I really need help because I can't find a solution to this function and it is a main part of my application.

Thanks in advance, don't hesitate to ask questions, I will answer as fast as possible.

Mykeul

+3  A: 

Seems you want to create something like regular expression out of set of example strings. This might be quite tricki in general. Found this link, not sure if it's relevant: http://scholar.google.com/scholar?hl=en&rlz=1B3GGGL_enEE351EE351&q=%22regular%20expression%20by%20example%22&oq=&um=1&ie=UTF-8&sa=N&tab=ws

On the other hand, if you need only one specific wildcard meaning "0 or more characters", then it should be much easier. Levenshtein distance algorithm computes similarity between 2 strings. Normally only result is needed, but in your case the places of differences are important. You also need to adapt this for N strings.

So I recommend to study this algorithm and hopefully you'll get some ideas how to solve your problem (at least you'll get some practice with text algorithms and dynamic programming).

Heres algorithm in PHP: _http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#PHP

You might want also to search for PHP implementations of "diff". http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/

Aivar
Hello,Thanks for your awnser, however :- I dont think I can use a regex because I cant know the type of chars in string, neither the type od chars I will have to delete- I read about the Levenshtein distance but it tell the "number of differences", not what is the difference ... and there is no order.- The last link shows a PHP file download link which doesnt work anymore and the comments only provide parts of script.I really need to keep the common chars and the order.
Mykeul
A: 

Main code:
Step 1: Sort strings by length, shortest to longest, into array[]
Step 2: Compare string in array[0] and array[1] to get $temp_wildcard
Step 3: Compare string in array[2] with $temp_wildcard to create new $temp_wildcard
Step 4: Continue comparing each string with $temp_wildcard - the last $wildcard is your $temp_wildcard

OK, so now we're down to the problem of how to compare two strings to return your wildcard string.

Subroutine code: Compare strings character-by-character, substituting wildcards into your return value when the comparison doesn't match.

To handle the problem of different lengths, run this comparison an extra time for each character that the second string is longer with an offset. (Compare string1[x] to string2[x+offset].) For each returned string, count the number of wildcard characters. The subroutine should return the answer with the fewest number of wildcard characters.

Good luck!

Summer
Hello and thanks for your awnser ! That is a good algorythm and I will try it this evening I think. There is still 1 case I can't match : Array("abcd", "bcde"); => same length but I should get "+bcd+".It is still a very good algorythm, better than mine.Mykeul
Mykeul
A: 

Hello,

Your algorythm is G.R.E.AT !
After a few days tuning here and there, it works nearly perfectly.

I am now able to save way more diskspace than before. The only case that doesnt work is Array("abcd", "bcde"); (backward search char with no offset), but the script doesnt encounter this type often, so it is sufficient.

I thank you very much for this help, I had a really hard time figuring this before you helped me.

Regards,

Mykeul

Mykeul
I forgot the name of the helper : Summer ! :)
Mykeul
It isn't to hard to add a "backwards" alignment to the standard string alignment algorithm (Levenshtein distance), it is sometimes used for nucleotide sequence alignment to find genes which have been inserted in the reverse direction.
Nixuz