ansaurus

Question

Matching unmatched strings based on an unknown pattern

Answer 1

+3 A:

Basically, I would treat each string as a bag of characters. I would define a kind of distance between two strings, something like "number of characters belonging to both strings" divided by "total number of characters in string 1 + total number of characters in string 2". (Strictly speaking it is not a distance in the mathematical sense; it is a similarity score that grows as the strings become more alike.) Then I would try applying a clustering algorithm to your set of strings.

Well, this is just a basic idea, but I think it would be a good starting point for some experiments.
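The bag-of-characters measure described above could be sketched in Python roughly as follows. The function name bag_similarity is made up for illustration, and counting shared characters with multiplicity is an assumption; the description leaves that detail open.

```python
from collections import Counter


def bag_similarity(s1: str, s2: str) -> float:
    """Shared characters divided by the combined length of both strings.

    Assumption: "characters belonging to both strings" is read as a
    multiset intersection, so a character occurring twice in each string
    counts twice.
    """
    total = len(s1) + len(s2)
    if total == 0:
        return 0.0  # two empty strings: avoid dividing by zero
    shared = sum((Counter(s1) & Counter(s2)).values())
    return shared / total
```

With this literal reading of the formula, identical strings score 0.5 and strings with no characters in common score 0.0, so a clustering step would need to treat higher values as "closer".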

PierrOz 2010-04-03 12:38:58

Answer 2

A:

Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way for an arbitrary group of strings. Take these strings for instance:

[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]

Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group those together. Given a more "grouping" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee for meaningful results.
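The chaining effect in the six strings above can be checked with a small Python sketch. The helper differing_positions is a hypothetical name, and it simply counts positions where two equal-length strings differ.

```python
strings = [
    "[1].[2].[3].[4].[5]",
    "[a].[2].[3].[4].[5]",
    "[a].[b].[3].[4].[5]",
    "[a].[b].[c].[4].[5]",
    "[a].[b].[c].[d].[5]",
    "[a].[b].[c].[d].[e]",
]


def differing_positions(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Each string compared with the one listed directly after it.
neighbour_diffs = [
    differing_positions(strings[i], strings[i + 1])
    for i in range(len(strings) - 1)
]
# The first string compared with the last one.
end_to_end = differing_positions(strings[0], strings[-1])

print(neighbour_diffs)  # [1, 1, 1, 1, 1]
print(end_to_end)       # 5
```

Every neighbouring pair differs in one position, yet the two ends differ in every variable position, so any threshold that links neighbours ends up linking the whole chain.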

May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.

Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.

eBusiness 2010-04-03 13:06:37

[19720]-[FULL]-[#a.b.teevee@EFNet]-[ Cricket.Highlights.P DTV.XviD-C4TV ]-[23/28] - "cricket.highlights. pdtv.xvid-c4tv.vol00+01.par2" yEnc (1/3)
[19720]-[FULL]-[#a.b.teevee@EFNet]-[ Cricket.Highlights.P DTV.XviD-C4TV ]-[18/28] - "cricket.highlights. pdtv.xvid-c4tv.r12" yEnc (1/53)
[17537]-[FULL]-[#a.b.teevee@EFNet]-[ The.Worlds.C4TV ]-[01/52] - "sample-the.worlds.c4tv" yEnc (1/15)

The first two strings belong to the same main group, but each belongs to its own subgroup.

Polity 2010-04-03 13:21:11

Updated the result in the original post because something went wrong there, hope it helps!

Polity 2010-04-03 13:24:18

Answer 3

A:

Hi,

I would recommend using the Hamming distance (http://en.wikipedia.org/wiki/Hamming_distance) as the distance.

Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:

[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv

The check is simple: the checksum is always 10 characters, the first being [, the last being ], and the rest alphanumeric :)

With this heuristic and a maximum distance of 4, your stuff will work in the vast majority of cases.
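The checksum-stripping heuristic could look something like this in Python. The name strip_checksum and the exact regular expression are assumptions: the pattern matches a 10-character bracketed alphanumeric block only when it sits at the very end of the name or directly before the extension.

```python
import re

# "[" + 8 alphanumeric characters + "]", located at the end of the name
# or immediately before a final ".ext" (assumed reading of "at the end").
CHECKSUM_AT_END = re.compile(r"\[[0-9A-Za-z]{8}\](?=(\.[^.\[\]]*)?$)")


def strip_checksum(filename: str) -> str:
    """Remove a trailing bracketed checksum, leaving the rest untouched."""
    return CHECKSUM_AT_END.sub("", filename)


name = "[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv"
print(strip_checksum(name))
# [BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
```

The leading [BSS] tag is left alone because it is neither 10 characters long nor at the end of the name.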

Good luck!

glebm 2010-04-03 13:39:57

Hamming distance assumes that the inputs are of equal length; I can't guarantee this.

Polity 2010-04-03 13:56:46

Oh, well, a difference in length simply adds abs(length_2 - length_1) :)
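One way to read that suggestion, as a rough Python sketch: compare the strings position by position over their common length, then add the length difference. The function name padded_hamming is invented here.

```python
def padded_hamming(a: str, b: str) -> int:
    """Hamming-style distance that tolerates unequal lengths.

    Mismatches are counted over the overlapping prefix (zip stops at the
    shorter string), and the length difference is added on top.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


print(padded_hamming("karolin", "kathrin"))  # 3
print(padded_hamming("abc", "abcde"))        # 2
```

Because the comparison is positional, a single character inserted near the start shifts everything after it and inflates the count, which is worth keeping in mind when choosing a threshold.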

glebm 2010-04-03 18:19:28

Answer 4

A:

I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.

High Performance Mark 2010-04-03 13:53:09

Answer 5

+1 A:

Building on @PierrOz's answer, you might want to experiment with multiple measures, and do a statistical cluster analysis on those measures.

For example, you could use four measures:

How many letters (upper/lowercase)
How many digits
How many of ([,],.)
How many other characters (probably) not included above

You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
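As an illustration, the four measures could be computed in Python like this before handing the vectors to whatever clustering tool you prefer. The name four_measures and the default weights are assumptions, and "([,],.)" is read here as the three characters [, ] and the dot.

```python
def four_measures(s: str, weights=(1.0, 1.0, 1.0, 1.0)) -> tuple:
    """Return weighted counts: (letters, digits, brackets/dots, other)."""
    letters = sum(c.isalpha() for c in s)
    digits = sum(c.isdigit() for c in s)
    punct = sum(c in "[]." for c in s)  # assumed reading of "([,],.)"
    other = len(s) - letters - digits - punct
    counts = (letters, digits, punct, other)
    # Apply one weight per measure.
    return tuple(w * n for w, n in zip(weights, counts))


print(four_measures("[a].[2]"))  # (1.0, 1.0, 5.0, 0.0)
```

Each string becomes a point in four dimensions, and the weights let one measure matter more than another when distances between points are computed.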

R has a number of functions for cluster analysis. This might be a good starting point.

Afterthought: the measures can be almost anything you invent. Some more examples:

Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
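The extra measures listed above are just as easy to sketch in Python. The function name extra_measures and its parameters are hypothetical; you would pick the character, substring and character set that suit your data.

```python
def extra_measures(s: str, char: str, substring: str, required: str) -> tuple:
    """Return (has_char, has_substring, substring_count, has_all_required)."""
    return (
        int(char in s),                       # binary: given character present?
        int(substring in s),                  # binary: given substring present?
        s.count(substring),                   # count of non-overlapping occurrences
        int(all(c in s for c in required)),   # binary: all listed characters present?
    )


print(extra_measures("cricket.highlights.r12", "[", "high", "ckt"))
# (0, 1, 1, 1)
```

These values can be appended to any other per-string measures to form a longer feature vector.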

Enough for at least a weekend's tinkering...

Brent.Longborough 2010-04-03 13:56:55

Cheers to you all, these answers are a good way to go. I'll start building on these concepts right away. Thanks!

Polity 2010-04-03 15:13:14

Please come back later to let us know how you got on!

Brent.Longborough 2010-04-18 20:56:56
