views:

112

answers:

5

Alright guys, I really hurt my brain over this one and I'm curious if you guys can give me any pointers towards the right direction I should be taking.

The situation is this:

Lets say, I have a collection of strings (let it be clear that the pattern of this strings is unknown. For a fact, I can say that the string contain only signs from the ASCII table and therefore, I don't have to worry about weird Chinese signs).

For this example, I take the following collection of strings (note that the strings don't have to make any human sense so don't try figuring them out :)):

"[001].[FOO].[TEST] - 'foofoo.test'",  
"[002].[FOO].[TEST] - 'foofoo.test'",  
"[003].[FOO].[TEST] - 'foofoo.test'",  
"[001].[FOO].[TEST] - 'foofoo.test.sample'",  
"[002].[FOO].[TEST] - 'foofoo.test.sample'",    
"-001- BAR.[TEST] - 'bartest.xx1",  
"-002- BAR.[TEST] - 'bartest.xx1"  

Now, what I need to have is a way of finding logical groups (and subgroups) of these set of strings, so in the above example, just by rational thinking, you can combine the first 3, the 2 after that and the last 2. Also the resulting groups from the first 5 can be combined in one main group with 2 subgroups, this should give you something like this:

{
    {
        "[001].[FOO].[TEST] - 'foofoo.test'",  
        "[002].[FOO].[TEST] - 'foofoo.test'",  
        "[003].[FOO].[TEST] - 'foofoo.test'",  
    }
    {
        "[001].[FOO].[TEST] - 'foofoo.test.sample'",  
        "[002].[FOO].[TEST] - 'foofoo.test.sample'",    
    }
}
{
    {
        "-001- BAR.[TEST] - 'bartest.xx1",  
        "-002- BAR.[TEST] - 'bartest.xx1"  
    }
}

Sorry for the layout above but indenting with 4 spaces doesn't seem to work correctly (or I'm frakk'n it up).

Anyway, I'm not sure how to approach this problem (how to get the result desired as indicated above).

First of, I thought of creating a huge set of regexes which would parse most known patterns but the amount of different patterns is just to huge that this isn't realistic.

Another think I thought of was parsing each individual word within a string (so strip all non alphabetic or numeric characters and split by those), and if X% matches, I can assume the strings belong to the same group. (where X will probably be around 80/90). However, I find the area of speculation kinda big. For example, when matching strings with each 20 words, the change of hitting above 80% is kinda big (that means that 4 words can differ), however when matching only 8 words, 2 words at most can differ.

My question to you is, what would be a logical approach in the above situation?

As for a reallife example:

Thanks in advance!

+3  A: 

Basically I would consider each string as a bag of characters. I would define a kind of distance between two strings which would be sth like "number of characters belonging to both strings" divided by "total number of characters in string 1 + total number of characters in string 2". (well, it's not a distance mathematically speaking...) and then I would try to apply some algorithms to cluster your set of strings.

Well, this is just a basic idea but I think it would be a good start to try some experiments...

PierrOz
A: 

Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way given any group of strings. Take these strings for instance:

[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]

Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group those together. Given a more "grouping" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee for meaningful results.

May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.

Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.

eBusiness
[19720]-[FULL]-[#a.b.teevee@EFNet]-[ Cricket.Highlights.P DTV.XviD-C4TV ]-[23/28] - "cricket.highlights. pdtv.xvid-c4tv.vol00+01.par2" yEnc (1/3) [19720]-[FULL]-[#a.b.teevee@EFNet]-[ Cricket.Highlights.P DTV.XviD-C4TV ]-[18/28] - "cricket.highlights. pdtv.xvid-c4tv.r12" yEnc (1/53) [17537]-[FULL]-[#a.b.teevee@EFNet]-[ The.Worlds.C4TV ]-[01/52] - "sample-the.worlds.c4tv" yEnc (1/15)The first 2 strings belong to the same main group but both belong to their own subgroup.
Polity
Updated the result in the original post because something went wrong there, hope it helps!
Polity
A: 

Hi,

I would recommend using this: http://en.wikipedia.org/wiki/Hamming_distance as the distance.

Also, For files a good heuristic would be to remove checksum in the end from the filename before calculating the distance:

[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv

A check is simple - it's always 10 characters, the first being [, the last -- ], and the rest ALPHA-numeric :)

With the heuristic and the distance max of 4, your stuff will work in the vast majority of the cases.

Good luck!

glebm
Hamming distance assumes that the inputs are of equal length, i cant guaranty this.
Polity
Oh, well, different length simply adds abs(length_2 - length_1) :)
glebm
A: 

I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.

High Performance Mark
+1  A: 

Building on @PierrOz' answer, you might want to experiment with multiple measures, and do a statistical cluster analysis on those measures.

For example, you could use four measures:

  1. How many letters (upper/lowercase)
  2. How many digits
  3. How many of ([,],.)
  4. How many other characters (probably) not included above

You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.

R has a number of functions for cluster analysis. This might be a good starting point.


Afterthought: the measures can be almost anything you invent. Some more examples:

  • Binary: does the string contain a given character (0 or 1)?
  • Binary: does the string contain a given substring?
  • Count: how many times does the given substring appear?
  • Binary: does the string include all these characters?

Enough for a least a weekend's tinkering...

Brent.Longborough
Cheers to you all, these answers are a good way to go. i'll start building on these concepts right away, Thanks!
Polity
Please come back later to let us know how you got on!
Brent.Longborough