views:

44

answers:

2

I have a collection of binary strings of given size encoding effective solutions to a given problem.

By looking at them, I can spot obvious similarities and intuitively see patterns of symmetry and periodicity.

Are there mathematical/algorithmic tools I can "feed" this set of strings to and get results that might give me an idea of what this set of strings have in common?

By doing so I would be able to impose a structure (or at least favor some features over others) on candidate solutions in order to greatly reduce the search space, maximizing chances to find optimal solutions for my problem (I am using genetic algorithms as the search tool - but this is not pivotal to the question).

Any pointers/approaches appreciated.

A: 

Pattern recognition is a broad subject. If your strings have a small alphabet then perhaps some of the DNA matching algorithms would be applicable.

Of Interest?:

http://bio.informatics.indiana.edu/sunkim/c/s06/I690/aomtp.pdf

http://etd.ohiolink.edu/send-pdf.cgi/Frey%20Jeffrey%20Daniel.pdf?kent1208961242

http://www.cs.ucr.edu/~stelo/cpm/cpm09/04_fl09.pdf

http://www.iaeng.org/IJCS/issues_v34/issue_2/IJCS_34_2_03.pdf

Mitch Wheat
I wouldn't describe this as a pattern recognition activity as it implies I know what I am looking for. It's more of a pattern-discovery exercise in my view. The links are much appreciated!
JohnIdol
+1  A: 

You need to take a look in unsupervised clustering algorithms such as:

  • Hierarchical Clustering
  • k-Nearest Neighbours
  • k-Means

Since they're binary strings, something simple such as Huffman Coding might work (only if you have a bit number of strings so you can estimate bit frequencies).

All of this methods are described in any Pattern Recognition / Machine Learning text worthy of that tag. The classic textbook is Duda, Stork & Hart: http://rii.ricoh.com/~stork/DHS.html

To use all of this methods you'll need some similarity measure between strings. If they are binary, you can start with the simplest thing: counting the numer of different bits in each string.

miquelramirez
Thanks for the pointers - what do you mean by 'different bits'? I was thinking as a first attempt to run a diff between the first and the second half of the string (they're showed one on top of the other to outline similarities in the link provided --> http://bit.ly/brYoHJ) and see if there was a correlation between that and the fitness, but I'd also like to try clustering methods that do not rely on human intuition, as the ones you suggest.
JohnIdol
miquelramirez