A: 

I would maybe define some delta between two strings. I don't know what you define as the difference (or "unequality") of two strings, but the most obvious thing I could think about would be string length and the number of occurences of particular letters (and their index in the string). It should not be tricky to implement it such that it returns the same color code in equal strings (if you do an equal first, and return before further comparison).

When it comes to the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use the RGB. I don't know if every string would fit into them (as that may be language specific?).

Yngve Sneen Lindal
+1  A: 

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings

 "I am going to the store" and 
"We are going to the store"

considered similar? What about the strings

 "I am going to the store" and 
"J bn hpjoh up uif tupsf"

(all of the letters in the original +1), or

 "I am going to the store" and 
"I am going to the store today"

? Based on what you mean by "similar", you might consider different functions.

If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighing characters by their position in the string.

If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonents, or words to syllables.

There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

Welbog
+1  A: 

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:

char1 + char2 + char3 + ... + charN % MAX_COLOUR_VALUE

would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.

workmad3
Nope - MD5 intentionally breaks locality. If S1 and s2 are similar, color(s1) and color(s2) should be similar to. MD5(s1) and MD5(s2) are not similar.
MSalters
Fair enough... but I didn't suggest MD5, this just makes it clearer as to why :)
workmad3
+1  A: 

First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".

Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.

moonshadow
A: 

Sorry, but you can't do what you're looking for with levenshtein distance or similar. RGB and HSV are 3-dimensional geometric spaces, but levenshtein distance describes a metric space - a much looser set of contstraints with no fixed number of dimensions. There's no way to map a metric space into a fixed number of dimensions while always preserving locality.

As far as approximations go, though, for single terms you could use a modification of an algorithm like soundex or metaphone to pick a color; for multiple terms, you could, for example, apply soundex or metaphone to each word individually, then sum them up (with overflow).

Nick Johnson
+1  A: 

How important is it that you never end up with two dissimilar strings having the same colour?

If it's not that important then maybe this could work?

You could pick a 1 dimensional color space that is "homotopic" to the circle: Say the color function c(x) is defined for x between 0 and 1. Then you'd want c(0) == c(1).

Now you take the sum of all character values modulo some scaling factor and wrap this back to the color space:

c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )

This might work even better if you defined a "wrapping" color space of higher dimensions and for each dimension pick different SumOfCharValues function; someone suggested alternating sum and length.

Just a thought... HTH

wires
Oh I see other people suggested this as well, but in other words.. Srry!
wires
+1  A: 

Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):

You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:

Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}

You can do this:

  1. First assign each string a random color (r, g, b)
  2. Repeat until you see fit (eg. error is adjusted less than \epsilon = 0.0001):
    1. Pick a random node
    2. Adjust it's position (r, g, b) such that the error is minimized
  3. Scale the coordinate system such that each nodes coordinates are in the range [0., 1.) or [0, 256]
Alexandru