views:

904

answers:

17

I have wondered for some time, what a nice, clean solution for joining an array of strings might look like. Example: I have ["Alpha, "Beta", "Gamma"] and want to join the strings into one, separated by commas – "Alpha, Beta, Gamma".

Now I know that most programming languages offer some kind of join method for this. I just wonder how these might be implemented. When I took introductory courses, I often tried to go it alone, but never found a satisfactory algorithm. Everything seemed rather messy, the problem being that you can not just loop through the array, concatenating the strings, as you would add one too many commas (either before or after the last string). I don’t want to check conditions in the loop. I don’t really want to add the first or the last string before/after the loop (I guess this is maybe the best way?).

Can someone show me an elegant solution? Or tell me exactly why there can’t be anything more elegant?

+10  A: 

The most elegant solution i found for problems like this is something like this (in pseudocode)

separator = ""
foreach(item in stringCollection)
{
    concatenatedString += separator + item
    separator = ","
}

You just run the loop and only after the second time around the separator is set. So the first time it won't get added. It's not as clean as I'd like it to be so I'd still add comments but it's better than an if statement or adding the first or last item outside the loop.

Mendelt
What's wrong with added an item outside of the loop?
Jon Ericson
In many languages this will have poor performance unless you use a Stringbuilder.
Corey
@Jon: If you ad an item outside the loop and inside the loop you've got duplicate code, in this case that's not a problem but in other cases you might have several lines of code repeated, this is a bit cleaner.
Mendelt
@Corey: You're right. I think this is more clear for pseudo code though. If you implement this in C# you want to use a StringBuilder for sure.
Mendelt
Not so elegant if stringColelction contains lots of strings; therefore the solution is not useful as a general use library function. For more, read my answer below.
blabla999
@blabla999 Just using + here to show I'm doing string concatenation. What mechanism you use (stringbuilder etc.) is a language specific implementation detail. Just wanted to show an elegant way of solving the fence-post problem of adding n-1 separators to n items.
Mendelt
+3  A: 

I usually go with something like...

list = ["Alpha", "Beta", "Gamma"];
output = "";
separator = "";
for (int i = 0; i < list.length ; i++) {
  output = output + separator;
  output = output + list[i];
  separator = ", ";
}

This works because on the first pass, separator is empty (so you don't get a comma at the start, but on every subsequent pass, you add a comma before adding the next element.

You could certainly unroll this a little to make it a bit faster (assigning to the separator over and over isn't ideal), though I suspect that's something the compiler could do for you automatically.

In the end though, I suspect pretty this is what most language level join functions come down to. Nothing more than syntax sugar, but it sure is sweet.

Matt Sheppard
I've missed something or the result is : "Alpha Beta, Gamma,"
Nicolas
You're missing something. It works just fine.
Patrick
Forget it : i've read it too quickly. But I still don't like the in-loop setting of the separator.
Nicolas
Nicolas: I doubt you will be able to get something "nicer" in the imperative setting as you will typically have to separately handle the first or last item in the list. I can only imagine doing it nicer if you have some loop operator with an "in between" part: while C do S and in between T od.
mweerden
It's wasteful to set _seperator_ each time through the loop. Maybe a good optimizer would factor it out, though.
Jon Ericson
See my comment about the ternary operator below, and simply checking that the string you're building is non-empty
Bobby Jack
Please also read my answer below w.r.t. temporary garbage. Better use a StringBuilder.
blabla999
+3  A: 

For pure elegance, a typical recursive functional-language solution is quite nice. This isn't in an actual language syntax but you get the idea (it's also hardcoded to use comma separator):

join([]) = ""

join([x]) = "x"

join([x, rest]) = "x," + join(rest)

In reality you would write this in a more generic way, to reuse the same algorithm but abstract away the data type (doesn't have to be strings) and the operation (doesn't have to be concatenation with a comma in the middle). Then it usually gets called 'reduce', and many functional languages have this built in, e.g. multiplying all numbers in a list, in Lisp:

(reduce #'* '(1 2 3 4 5)) => 120

Luke Halliwell
+6  A: 

All of these solutions are decent ones, but for an underlying library, both independence of separator and decent speed are important. Here is a function that fits the requirement assuming the language has some form of string builder.

public static string join(String[] strings, String sep) {
  if(strings.length == 0) return "";
  if(strings.length == 1) return strings[0];
  StringBuilder sb = new StringBuilder();
  sb.append(strings[0]);
  for(int i = 1; i < strings.length; i++) {
    sb.append(sep);
    sb.append(strings[i]);
  }
  return sb.toString();
}

EDIT: I suppose I should mention why this would be speedier. The main reason would be because any time you call c = a + b; the underlying construct is usually c = (new StringBuilder()).append(a).append(b).toString();. By reusing the same string builder object, we can reduce the amount of allocations and garbage we produce.

And before someone chimes in with optimization is evil, we're talking about implementing a common library function. Acceptable, scalable performance is one of the requirements them. A join that takes a long time is one that's going to be not oft used.

Patrick
Can we chime in with the "language agnostic" label ?
Nicolas
I believe the StringBuilder (or a similar construct) was implied in the other responses. So, where's your second function? You promised two.
Konrad Rudolph
Not really. No matter how you cut it, every language deals with strings in a set number of ways. All of the other solutions assume that you have an append string method, which is certainly not language agnostic, as c doesn't have it. You can't really deal with strings without making some assumptions
Patrick
Re other function: I considered it, then didn't. Really. I could demonstrate the c way of doing it, I suppose. Then we'd at least have all the major language families covered, but I'll leave that for someone else to do.
Patrick
If you're really worried about allocation speed, you can initialise the StringBuilder to the needed capacity (strings.Sum(s => s.Length) + (strings.length - 1) * separator.length) and do away with reallocating space at all.
David Schmitt
C most certainly does have an append string function: strcat.
Jon Ericson
Jon: not in the way used in the previous examples. strcat doesn't return a new string and has far different semantics than strA + strB.
Patrick
David: then you've iterated over the collection twice. I'm not certain that in a managed environment where reallocs are relatively cheap that counting ahead would put you ahead. Then again, there's only one way to find out. :)
Patrick
+1  A: 

' Pseudo code Assume zero based

ResultString = InputArray[0]
n = 1
while n (is less than)  Number_Of_Strings
    ResultString (concatenate) ", "
    ResultString (concatenate) InputArray[n]
    n = n + 1
loop
David L Morris
I would use += instead of = to show that there is a concat taking place.
Ralph Rickenbach
I saw that before the comment and added (concatenate) to be really clear (they might not know 'c').
David L Morris
Don't you want to change `=` for `(assign?)` to be more language-agnostic? The poor Pascal people don't know what you mean. ;-)Seriously, though.
Konrad Rudolph
Yes, so true. But I think I have done my 4.
David L Morris
A: 

The following is no longer language-agnostic (but that doesn't matter for the discussion because the implementation is easily portable to other languages). I tried to implement Luke's (theretically best) solution in an imperative programming language. Take your pick; mine's C#. Not very elegant at all. However, (without any testing whatsoever) I could imagine that its performance is quite decent because the recursion is in fact tail recursive.

My challenge: give a better recursive implementation (in an imperative language). You say what “better” means: less code, faster, I'm open for suggestions.

private static StringBuilder RecJoin(IEnumerator<string> xs, string sep, StringBuilder result) {
    result.Append(xs.Current);
    if (xs.MoveNext()) {
        result.Append(sep);
        return RecJoin(xs, sep, result);
    } else
        return result;
}

public static string Join(this IEnumerable<string> xs, string separator) {
    var i = xs.GetEnumerator();
    if (!i.MoveNext())
        return string.Empty;
    else
        return RecJoin(i, separator, new StringBuilder()).ToString();
}
Konrad Rudolph
Of course, if we're dropping language independence then you could just use string.Join(separator, xs) :)
Simon Steele
<Insert snide comment about picking a terser language here.> ;-)
Jon Ericson
+1  A: 

In Perl, I just use the join command:

$ echo "Alpha
Beta
Gamma" | perl -e 'print(join(", ", map {chomp; $_} <> ))'
Alpha, Beta, Gamma

(The map stuff is mostly there to create a list.)

In languages that don't have a built in, like C, I use simple iteration (untested):

for (i = 0; i < N-1; i++){
    strcat(s, a[i]);
    strcat(s, ", ");
}
strcat(s, a[N]);

Of course, you'd need to check the size of s before you add more bytes to it.

You either have to special case the first entry or the last.

Jon Ericson
+5  A: 

Most languages nowadays - e.g. perl (mention by Jon Ericson), php, javascript - have a join() function or method, and this is by far the most elegant solution. Less code is better code.

In response to Mendelt Siebenga, if you do require a hand-rolled solution, I'd go with the ternary operator for something like:

separator = ","
foreach (item in stringCollection)
{
    concatenatedString += concatenatedString ? separator + item : item
}
Bobby Jack
using "+" to concatenate many strings is always a possible performance eater.
blabla999
A: 

We're just throwing up implementations in different languages now? Okay. Here's a hopefully (it's 5 in the morning) decent C implementation that adds the first item beforehand. I don't see an efficient way to avoid doing so. =\

char* joinStrings(char** strings, int numStrings, char* seperator) {
 // Handle empty case which would cause problems.
 if (numStrings == 0)
  return "";

 // Determine the length of the seperator string.
 int seperatorLength = 0;
 while (seperator[seperatorLength])
  ++seperatorLength;

 // Start with 1 char for the null terminator...
 int finalSize = 1;

 // ...add the length of the first string...
 int characterIndex;
 for (characterIndex = 0; strings[0][characterIndex]; ++characterIndex)
  ++finalSize;

 int stringIndex;

 // ...add the lengths of subsequent strings and the seperator.
 for (stringIndex = 1; stringIndex < numStrings; ++stringIndex) {
  finalSize += seperatorLength;

  for (characterIndex = 0; strings[stringIndex][characterIndex]; ++characterIndex)
   ++finalSize;
 }

 // Allocate memory our new string...
 char* newString = (char*) malloc(sizeof(char) * finalSize);
 int finalIndex = 0;

 // ...copy the first string in....
 for (characterIndex = 0; strings[0][characterIndex]; ++characterIndex)
  newString[finalIndex++] = strings[0][characterIndex];

 // ...copy the subsequent strings and seperators.
 for (stringIndex = 1; stringIndex < numStrings; ++stringIndex) {
  for (characterIndex = 0; seperator[characterIndex]; ++characterIndex)
   newString[finalIndex++] = seperator[characterIndex];

  for (characterIndex = 0; strings[stringIndex][characterIndex]; ++characterIndex)
   newString[finalIndex++] = strings[stringIndex][characterIndex];
 }

 // Terminate our new string...
 newString[finalIndex] = (char) 0;

 // ...and return it.
 return newString;
}

Here's it with a simple test, whose output seems to be as it should.

Jeremy Banks
Any reason you don't use the string library?
Jon Ericson
I felt that just preforming the simple operations inline would yield better performance, but perhaps that is a choice best left to the compiler.
Jeremy Banks
+2  A: 

@Mendelt Siebenga

Strings are corner-stone objects in programming languages. Different languages implement strings differently. An implementation of join() strongly depends on underlying implementation of strings. Pseudocode doesn't reflect underlying implementation.

Consider join() in Python. It can be easily used:

print ", ".join(["Alpha", "Beta", "Gamma"])
# Alpha, Beta, Gamma

It could be easily implemented as follow:

def join(seq, sep=" "):
    if not seq:         return ""
    elif len(seq) == 1: return seq[0]
    return reduce(lambda x, y: x + sep + y, seq)

print join(["Alpha", "Beta", "Gamma"], ", ")
# Alpha, Beta, Gamma

And here how join() method is implemented in C (taken from trunk):

PyDoc_STRVAR(join__doc__,
"S.join(sequence) -> string\n\
\n\
Return a string which is the concatenation of the strings in the\n\
sequence.  The separator between elements is S.");

static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
    char *sep = PyString_AS_STRING(self);
    const Py_ssize_t seplen = PyString_GET_SIZE(self);
    PyObject *res = NULL;
    char *p;
    Py_ssize_t seqlen = 0;
    size_t sz = 0;
    Py_ssize_t i;
    PyObject *seq, *item;

    seq = PySequence_Fast(orig, "");
    if (seq == NULL) {
     return NULL;
    }

    seqlen = PySequence_Size(seq);
    if (seqlen == 0) {
     Py_DECREF(seq);
     return PyString_FromString("");
    }
    if (seqlen == 1) {
     item = PySequence_Fast_GET_ITEM(seq, 0);
     if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
      Py_INCREF(item);
      Py_DECREF(seq);
      return item;
     }
    }

    /* There are at least two things to join, or else we have a subclass
     * of the builtin types in the sequence.
     * Do a pre-pass to figure out the total amount of space we'll
     * need (sz), see whether any argument is absurd, and defer to
     * the Unicode join if appropriate.
     */
    for (i = 0; i < seqlen; i++) {
     const size_t old_sz = sz;
     item = PySequence_Fast_GET_ITEM(seq, i);
     if (!PyString_Check(item)){
#ifdef Py_USING_UNICODE
      if (PyUnicode_Check(item)) {
       /* Defer to Unicode join.
        * CAUTION:  There's no gurantee that the
        * original sequence can be iterated over
        * again, so we must pass seq here.
        */
       PyObject *result;
       result = PyUnicode_Join((PyObject *)self, seq);
       Py_DECREF(seq);
       return result;
      }
#endif
      PyErr_Format(PyExc_TypeError,
            "sequence item %zd: expected string,"
            " %.80s found",
            i, Py_TYPE(item)->tp_name);
      Py_DECREF(seq);
      return NULL;
     }
     sz += PyString_GET_SIZE(item);
     if (i != 0)
      sz += seplen;
     if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
      PyErr_SetString(PyExc_OverflowError,
       "join() result is too long for a Python string");
      Py_DECREF(seq);
      return NULL;
     }
    }

    /* Allocate result space. */
    res = PyString_FromStringAndSize((char*)NULL, sz);
    if (res == NULL) {
     Py_DECREF(seq);
     return NULL;
    }

    /* Catenate everything. */
    p = PyString_AS_STRING(res);
    for (i = 0; i < seqlen; ++i) {
     size_t n;
     item = PySequence_Fast_GET_ITEM(seq, i);
     n = PyString_GET_SIZE(item);
     Py_MEMCPY(p, PyString_AS_STRING(item), n);
     p += n;
     if (i < seqlen - 1) {
      Py_MEMCPY(p, sep, seplen);
      p += seplen;
     }
    }

    Py_DECREF(seq);
    return res;
}


Note that the above Catenate everything. code is a small part of the whole function.

In pseudocode:

/* Catenate everything. */
for each item in sequence
    copy-assign item
    if not last item
        copy-assign separator
J.F. Sebastian
A: 

join() function in Ruby:

def join(seq, sep) 
  seq.inject { |total, item| total << sep << item } or "" 
end

join(["a", "b", "c"], ", ")
# => "a, b, c"
J.F. Sebastian
A: 

join() in Perl:

use List::Util qw(reduce);

sub mjoin($@) {$sep = shift; reduce {$a.$sep.$b} @_ or ''}

say mjoin(', ', qw(Alpha Beta Gamma));
# Alpha, Beta, Gamma

Or without reduce:

 sub mjoin($@) 
 {
   my ($sep, $sum) = (shift, shift); 
   $sum .= $sep.$_ for (@_); 
   $sum or ''
 }
J.F. Sebastian
A: 

Perl 6

sub join( $separator, @strings ){
  my $return = shift @strings;
  for @strings -> ( $string ){
    $return ~= $separator ~ $string;
  }
  return $return;
}

Yes I know it is pointless because Perl 6 already has a join function.

Brad Gilbert
A: 

In Java 5, with unit test:

import junit.framework.Assert;
import org.junit.Test;

public class StringUtil
{
    public static String join(String delim, String... strings)
    {
        StringBuilder builder = new StringBuilder();

        if (strings != null)
        {
            for (String str : strings)
            {
                if (builder.length() > 0)
                {
                    builder.append(delim);
                }

                builder.append(str);
            }
        }           

        return builder.toString();
    }

    @Test
    public void joinTest()
    {
        Assert.assertEquals("", StringUtil.join(", ", null));
        Assert.assertEquals("", StringUtil.join(", ", ""));
        Assert.assertEquals("", StringUtil.join(", ", new String[0]));
        Assert.assertEquals("test", StringUtil.join(", ", "test"));
        Assert.assertEquals("foo, bar", StringUtil.join(", ", "foo", "bar"));
        Assert.assertEquals("foo, bar, baz", StringUtil.join(", ", "foo", "bar", "baz"));
    }
}
mbaird
Having a delimiter parameter, and then assuming everyone also want a space appended, is a little silly.
Jon
Yeah I didn't think that through. Updated my code above to correct the hardcoded space.
mbaird
A: 

I wrote a recursive version of the solution in lisp. If the length of the list is greater that 2 it splits the list in half as best as it can and then tries merging the sublists

    (defun concatenate-string(list)
       (cond ((= (length list) 1) (car list))
             ((= (length list) 2) (concatenate 'string (first list) "," (second list)))
             (t (let ((mid-point (floor (/ (- (length list) 1) 2))))
                   (concatenate 'string 
                          (concatenate-string (subseq list 0 mid-point))
                          ","
                          (concatenate-string (subseq list mid-point (length list))))))))



    (concatenate-string '("a" "b"))

I tried applying the divide and conquer strategy to the problem, but I guess that does not give a better result than plain iteration. Please let me know if this could have been done better.

I have also performed an analysis of the recursion obtained by the algorithm, it is available here.

+1  A: 

collecting different language implementations ?
Here is, for your amusement, a Smalltalk version:

join:collectionOfStrings separatedBy:sep

  |buffer|

  buffer := WriteStream on:''.
  collectionOfStrings 
      do:[:each | buffer nextPutAll:each ]
      separatedBy:[ buffer nextPutAll:sep ].
  ^ buffer contents.

Of course, the above code is already in the standard library found as:

Collection >> asStringWith:

so, using that, you'd write:

#('A' 'B' 'C') asStringWith:','

But here's my main point:

I would like to put more emphasis on the fact that using a StringBuilder (or what is called "WriteStream" in Smalltalk) is highly recommended. Do not concatenate strings using "+" in a loop - the result will be many many intermediate throw-away strings. If you have a good Garbage Collector, thats fine. But some are not and a lot of memory needs to be reclaimed. StringBuilder (and WriteStream, which is its grand-grand-father) use a buffer-doubling or even adaptive growing algorithm, which needs MUCH less scratch memory.

However, if its only a few small strings you are concatenating, dont care, and "+" them; the extra work using a StringBuilder might be actually counter-productive, up to an implementation- and language-dependent number of strings.

blabla999
Depending on how a language implements the '+' operator on strings it can be good or bad. Not all languages create intermediate strings. I can't read smalltalk but I'm curious how you prevent the last ',' from being printed. You only need n-1 separators for n items.
Mendelt
A: 

Use the String.join method in C#

http://msdn.microsoft.com/en-us/library/57a79xd0.aspx