ansaurus

Question

Algorithm for joining e.g. an array of strings

Answer 1

+10 A:

The most elegant solution i found for problems like this is something like this (in pseudocode)

separator = ""
foreach(item in stringCollection)
{
    concatenatedString += separator + item
    separator = ","
}

You just run the loop and only after the second time around the separator is set. So the first time it won't get added. It's not as clean as I'd like it to be so I'd still add comments but it's better than an if statement or adding the first or last item outside the loop.

Mendelt 2008-09-12 07:15:44

What's wrong with added an item outside of the loop?

Jon Ericson 2008-09-12 08:47:31

In many languages this will have poor performance unless you use a Stringbuilder.

Corey 2008-09-15 16:19:31

@Jon: If you ad an item outside the loop and inside the loop you've got duplicate code, in this case that's not a problem but in other cases you might have several lines of code repeated, this is a bit cleaner.

Mendelt 2008-09-15 18:40:17

@Corey: You're right. I think this is more clear for pseudo code though. If you implement this in C# you want to use a StringBuilder for sure.

Mendelt 2008-09-15 18:41:45

Not so elegant if stringColelction contains lots of strings; therefore the solution is not useful as a general use library function. For more, read my answer below.

blabla999 2009-01-12 22:27:23

@blabla999 Just using + here to show I'm doing string concatenation. What mechanism you use (stringbuilder etc.) is a language specific implementation detail. Just wanted to show an elegant way of solving the fence-post problem of adding n-1 separators to n items.

Mendelt 2009-01-14 09:38:13

Answer 2

+3 A:

I usually go with something like...

list = ["Alpha", "Beta", "Gamma"];
output = "";
separator = "";
for (int i = 0; i < list.length ; i++) {
  output = output + separator;
  output = output + list[i];
  separator = ", ";
}

This works because on the first pass, separator is empty (so you don't get a comma at the start, but on every subsequent pass, you add a comma before adding the next element.

You could certainly unroll this a little to make it a bit faster (assigning to the separator over and over isn't ideal), though I suspect that's something the compiler could do for you automatically.

In the end though, I suspect pretty this is what most language level join functions come down to. Nothing more than syntax sugar, but it sure is sweet.

Matt Sheppard 2008-09-12 07:18:48

I've missed something or the result is : "Alpha Beta, Gamma,"

Nicolas 2008-09-12 07:31:34

You're missing something. It works just fine.

Patrick 2008-09-12 07:35:28

Forget it : i've read it too quickly. But I still don't like the in-loop setting of the separator.

Nicolas 2008-09-12 07:40:04

Nicolas: I doubt you will be able to get something "nicer" in the imperative setting as you will typically have to separately handle the first or last item in the list. I can only imagine doing it nicer if you have some loop operator with an "in between" part: while C do S and in between T od.

mweerden 2008-09-12 07:59:07

It's wasteful to set _seperator_ each time through the loop. Maybe a good optimizer would factor it out, though.

Jon Ericson 2008-09-12 08:33:20

See my comment about the ternary operator below, and simply checking that the string you're building is non-empty

Bobby Jack 2008-09-12 10:09:37

Please also read my answer below w.r.t. temporary garbage. Better use a StringBuilder.

blabla999 2009-01-12 22:28:42

Answer 3

+3 A:

For pure elegance, a typical recursive functional-language solution is quite nice. This isn't in an actual language syntax but you get the idea (it's also hardcoded to use comma separator):

join([]) = ""

join([x]) = "x"

join([x, rest]) = "x," + join(rest)

In reality you would write this in a more generic way, to reuse the same algorithm but abstract away the data type (doesn't have to be strings) and the operation (doesn't have to be concatenation with a comma in the middle). Then it usually gets called 'reduce', and many functional languages have this built in, e.g. multiplying all numbers in a list, in Lisp:

(reduce #'* '(1 2 3 4 5)) => 120

Luke Halliwell 2008-09-12 07:21:29

Answer 4

+6 A:

All of these solutions are decent ones, but for an underlying library, both independence of separator and decent speed are important. Here is a function that fits the requirement assuming the language has some form of string builder.

public static string join(String[] strings, String sep) {
  if(strings.length == 0) return "";
  if(strings.length == 1) return strings[0];
  StringBuilder sb = new StringBuilder();
  sb.append(strings[0]);
  for(int i = 1; i < strings.length; i++) {
    sb.append(sep);
    sb.append(strings[i]);
  }
  return sb.toString();
}

EDIT: I suppose I should mention why this would be speedier. The main reason would be because any time you call c = a + b; the underlying construct is usually c = (new StringBuilder()).append(a).append(b).toString();. By reusing the same string builder object, we can reduce the amount of allocations and garbage we produce.

And before someone chimes in with optimization is evil, we're talking about implementing a common library function. Acceptable, scalable performance is one of the requirements them. A join that takes a long time is one that's going to be not oft used.

Patrick 2008-09-12 07:33:35

Can we chime in with the "language agnostic" label ?

Nicolas 2008-09-12 07:42:04

I believe the StringBuilder (or a similar construct) was implied in the other responses. So, where's your second function? You promised two.

Konrad Rudolph 2008-09-12 07:42:43

Not really. No matter how you cut it, every language deals with strings in a set number of ways. All of the other solutions assume that you have an append string method, which is certainly not language agnostic, as c doesn't have it. You can't really deal with strings without making some assumptions

Patrick 2008-09-12 08:00:32

Re other function: I considered it, then didn't. Really. I could demonstrate the c way of doing it, I suppose. Then we'd at least have all the major language families covered, but I'll leave that for someone else to do.

Patrick 2008-09-12 08:02:06

If you're really worried about allocation speed, you can initialise the StringBuilder to the needed capacity (strings.Sum(s => s.Length) + (strings.length - 1) * separator.length) and do away with reallocating space at all.

David Schmitt 2008-09-12 08:35:51

C most certainly does have an append string function: strcat.

Jon Ericson 2008-09-12 08:40:10

Jon: not in the way used in the previous examples. strcat doesn't return a new string and has far different semantics than strA + strB.

Patrick 2008-09-12 08:51:39

David: then you've iterated over the collection twice. I'm not certain that in a managed environment where reallocs are relatively cheap that counting ahead would put you ahead. Then again, there's only one way to find out. :)

Patrick 2008-09-12 08:52:59

Answer 5

+1 A:

' Pseudo code Assume zero based

ResultString = InputArray[0]
n = 1
while n (is less than)  Number_Of_Strings
    ResultString (concatenate) ", "
    ResultString (concatenate) InputArray[n]
    n = n + 1
loop

David L Morris 2008-09-12 07:38:29

I would use += instead of = to show that there is a concat taking place.

Ralph Rickenbach 2008-09-12 07:44:56

I saw that before the comment and added (concatenate) to be really clear (they might not know 'c').

David L Morris 2008-09-12 07:47:26

Don't you want to change `=` for `(assign?)` to be more language-agnostic? The poor Pascal people don't know what you mean. ;-)Seriously, though.

Konrad Rudolph 2008-09-12 07:50:27

Yes, so true. But I think I have done my 4.

David L Morris 2008-09-12 08:21:50

Answer 6

A:

The following is no longer language-agnostic (but that doesn't matter for the discussion because the implementation is easily portable to other languages). I tried to implement Luke's (theretically best) solution in an imperative programming language. Take your pick; mine's C#. Not very elegant at all. However, (without any testing whatsoever) I could imagine that its performance is quite decent because the recursion is in fact tail recursive.

My challenge: give a better recursive implementation (in an imperative language). You say what “better” means: less code, faster, I'm open for suggestions.

private static StringBuilder RecJoin(IEnumerator<string> xs, string sep, StringBuilder result) {
    result.Append(xs.Current);
    if (xs.MoveNext()) {
        result.Append(sep);
        return RecJoin(xs, sep, result);
    } else
        return result;
}

public static string Join(this IEnumerable<string> xs, string separator) {
    var i = xs.GetEnumerator();
    if (!i.MoveNext())
        return string.Empty;
    else
        return RecJoin(i, separator, new StringBuilder()).ToString();
}

Konrad Rudolph 2008-09-12 07:40:44

Of course, if we're dropping language independence then you could just use string.Join(separator, xs) :)

Simon Steele 2008-09-12 08:35:05

<Insert snide comment about picking a terser language here.> ;-)

Jon Ericson 2008-09-12 08:43:30

Answer 7

+1 A:

In Perl, I just use the join command:

$ echo "Alpha
Beta
Gamma" | perl -e 'print(join(", ", map {chomp; $_} <> ))'
Alpha, Beta, Gamma

(The map stuff is mostly there to create a list.)

In languages that don't have a built in, like C, I use simple iteration (untested):

for (i = 0; i < N-1; i++){
    strcat(s, a[i]);
    strcat(s, ", ");
}
strcat(s, a[N]);

Of course, you'd need to check the size of s before you add more bytes to it.

You either have to special case the first entry or the last.

Jon Ericson 2008-09-12 08:21:42

Answer 8

+5 A:

Most languages nowadays - e.g. perl (mention by Jon Ericson), php, javascript - have a join() function or method, and this is by far the most elegant solution. Less code is better code.

In response to Mendelt Siebenga, if you do require a hand-rolled solution, I'd go with the ternary operator for something like:

separator = ","
foreach (item in stringCollection)
{
    concatenatedString += concatenatedString ? separator + item : item
}

Bobby Jack 2008-09-12 08:39:50

using "+" to concatenate many strings is always a possible performance eater.

blabla999 2009-01-12 22:29:39

Answer 9

A:

We're just throwing up implementations in different languages now? Okay. Here's a hopefully (it's 5 in the morning) decent C implementation that adds the first item beforehand. I don't see an efficient way to avoid doing so. =\

char* joinStrings(char** strings, int numStrings, char* seperator) {
 // Handle empty case which would cause problems.
 if (numStrings == 0)
  return "";

 // Determine the length of the seperator string.
 int seperatorLength = 0;
 while (seperator[seperatorLength])
  ++seperatorLength;

 // Start with 1 char for the null terminator...
 int finalSize = 1;

 // ...add the length of the first string...
 int characterIndex;
 for (characterIndex = 0; strings[0][characterIndex]; ++characterIndex)
  ++finalSize;

 int stringIndex;

 // ...add the lengths of subsequent strings and the seperator.
 for (stringIndex = 1; stringIndex < numStrings; ++stringIndex) {
  finalSize += seperatorLength;

  for (characterIndex = 0; strings[stringIndex][characterIndex]; ++characterIndex)
   ++finalSize;
 }

 // Allocate memory our new string...
 char* newString = (char*) malloc(sizeof(char) * finalSize);
 int finalIndex = 0;

 // ...copy the first string in....
 for (characterIndex = 0; strings[0][characterIndex]; ++characterIndex)
  newString[finalIndex++] = strings[0][characterIndex];

 // ...copy the subsequent strings and seperators.
 for (stringIndex = 1; stringIndex < numStrings; ++stringIndex) {
  for (characterIndex = 0; seperator[characterIndex]; ++characterIndex)
   newString[finalIndex++] = seperator[characterIndex];

  for (characterIndex = 0; strings[stringIndex][characterIndex]; ++characterIndex)
   newString[finalIndex++] = strings[stringIndex][characterIndex];
 }

 // Terminate our new string...
 newString[finalIndex] = (char) 0;

 // ...and return it.
 return newString;
}

Here's it with a simple test, whose output seems to be as it should.

Jeremy Banks 2008-09-12 09:45:22

Any reason you don't use the string library?

Jon Ericson 2008-09-12 09:48:23

I felt that just preforming the simple operations inline would yield better performance, but perhaps that is a choice best left to the compiler.

Jeremy Banks 2008-09-12 16:05:02

Answer 10

+2 A:

@Mendelt Siebenga

Strings are corner-stone objects in programming languages. Different languages implement strings differently. An implementation of join() strongly depends on underlying implementation of strings. Pseudocode doesn't reflect underlying implementation.

Consider join() in Python. It can be easily used:

print ", ".join(["Alpha", "Beta", "Gamma"])
# Alpha, Beta, Gamma

It could be easily implemented as follow:

def join(seq, sep=" "):
    if not seq:         return ""
    elif len(seq) == 1: return seq[0]
    return reduce(lambda x, y: x + sep + y, seq)

print join(["Alpha", "Beta", "Gamma"], ", ")
# Alpha, Beta, Gamma

And here how join() method is implemented in C (taken from trunk):

PyDoc_STRVAR(join__doc__,
"S.join(sequence) -> string\n\
\n\
Return a string which is the concatenation of the strings in the\n\
sequence.  The separator between elements is S.");

static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
    char *sep = PyString_AS_STRING(self);
    const Py_ssize_t seplen = PyString_GET_SIZE(self);
    PyObject *res = NULL;
    char *p;
    Py_ssize_t seqlen = 0;
    size_t sz = 0;
    Py_ssize_t i;
    PyObject *seq, *item;

    seq = PySequence_Fast(orig, "");
    if (seq == NULL) {
     return NULL;
    }

    seqlen = PySequence_Size(seq);
    if (seqlen == 0) {
     Py_DECREF(seq);
     return PyString_FromString("");
    }
    if (seqlen == 1) {
     item = PySequence_Fast_GET_ITEM(seq, 0);
     if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
      Py_INCREF(item);
      Py_DECREF(seq);
      return item;
     }
    }

    /* There are at least two things to join, or else we have a subclass
     * of the builtin types in the sequence.
     * Do a pre-pass to figure out the total amount of space we'll
     * need (sz), see whether any argument is absurd, and defer to
     * the Unicode join if appropriate.
     */
    for (i = 0; i < seqlen; i++) {
     const size_t old_sz = sz;
     item = PySequence_Fast_GET_ITEM(seq, i);
     if (!PyString_Check(item)){
#ifdef Py_USING_UNICODE
      if (PyUnicode_Check(item)) {
       /* Defer to Unicode join.
        * CAUTION:  There's no gurantee that the
        * original sequence can be iterated over
        * again, so we must pass seq here.
        */
       PyObject *result;
       result = PyUnicode_Join((PyObject *)self, seq);
       Py_DECREF(seq);
       return result;
      }
#endif
      PyErr_Format(PyExc_TypeError,
            "sequence item %zd: expected string,"
            " %.80s found",
            i, Py_TYPE(item)->tp_name);
      Py_DECREF(seq);
      return NULL;
     }
     sz += PyString_GET_SIZE(item);
     if (i != 0)
      sz += seplen;
     if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
      PyErr_SetString(PyExc_OverflowError,
       "join() result is too long for a Python string");
      Py_DECREF(seq);
      return NULL;
     }
    }

    /* Allocate result space. */
    res = PyString_FromStringAndSize((char*)NULL, sz);
    if (res == NULL) {
     Py_DECREF(seq);
     return NULL;
    }

    /* Catenate everything. */
    p = PyString_AS_STRING(res);
    for (i = 0; i < seqlen; ++i) {
     size_t n;
     item = PySequence_Fast_GET_ITEM(seq, i);
     n = PyString_GET_SIZE(item);
     Py_MEMCPY(p, PyString_AS_STRING(item), n);
     p += n;
     if (i < seqlen - 1) {
      Py_MEMCPY(p, sep, seplen);
      p += seplen;
     }
    }

    Py_DECREF(seq);
    return res;
}

Note that the above Catenate everything. code is a small part of the whole function.

In pseudocode:

/* Catenate everything. */
for each item in sequence
    copy-assign item
    if not last item
        copy-assign separator

J.F. Sebastian 2008-09-14 13:14:51

Answer 11

A:

join() function in Ruby:

def join(seq, sep) 
  seq.inject { |total, item| total << sep << item } or "" 
end

join(["a", "b", "c"], ", ")
# => "a, b, c"

J.F. Sebastian 2008-09-14 19:21:54

Answer 12

A:

join() in Perl:

use List::Util qw(reduce);

sub mjoin($@) {$sep = shift; reduce {$a.$sep.$b} @_ or ''}

say mjoin(', ', qw(Alpha Beta Gamma));
# Alpha, Beta, Gamma

Or without reduce:

 sub mjoin($@) 
 {
   my ($sep, $sum) = (shift, shift); 
   $sum .= $sep.$_ for (@_); 
   $sum or ''
 }

J.F. Sebastian 2008-09-14 20:37:40

Answer 13

A:

Perl 6

sub join( $separator, @strings ){
  my $return = shift @strings;
  for @strings -> ( $string ){
    $return ~= $separator ~ $string;
  }
  return $return;
}

Yes I know it is pointless because Perl 6 already has a join function.

Brad Gilbert 2008-09-15 16:10:26

Answer 14

A:

In Java 5, with unit test:

import junit.framework.Assert;
import org.junit.Test;

public class StringUtil
{
    public static String join(String delim, String... strings)
    {
        StringBuilder builder = new StringBuilder();

        if (strings != null)
        {
            for (String str : strings)
            {
                if (builder.length() > 0)
                {
                    builder.append(delim);
                }

                builder.append(str);
            }
        }           

        return builder.toString();
    }

    @Test
    public void joinTest()
    {
        Assert.assertEquals("", StringUtil.join(", ", null));
        Assert.assertEquals("", StringUtil.join(", ", ""));
        Assert.assertEquals("", StringUtil.join(", ", new String[0]));
        Assert.assertEquals("test", StringUtil.join(", ", "test"));
        Assert.assertEquals("foo, bar", StringUtil.join(", ", "foo", "bar"));
        Assert.assertEquals("foo, bar, baz", StringUtil.join(", ", "foo", "bar", "baz"));
    }
}

mbaird 2008-09-16 20:36:36

Having a delimiter parameter, and then assuming everyone also want a space appended, is a little silly.

Jon 2009-09-24 10:16:49

Yeah I didn't think that through. Updated my code above to correct the hardcoded space.

mbaird 2009-09-25 22:07:30

Answer 15

A:

I wrote a recursive version of the solution in lisp. If the length of the list is greater that 2 it splits the list in half as best as it can and then tries merging the sublists

    (defun concatenate-string(list)
       (cond ((= (length list) 1) (car list))
             ((= (length list) 2) (concatenate 'string (first list) "," (second list)))
             (t (let ((mid-point (floor (/ (- (length list) 1) 2))))
                   (concatenate 'string 
                          (concatenate-string (subseq list 0 mid-point))
                          ","
                          (concatenate-string (subseq list mid-point (length list))))))))



    (concatenate-string '("a" "b"))

I tried applying the divide and conquer strategy to the problem, but I guess that does not give a better result than plain iteration. Please let me know if this could have been done better.

I have also performed an analysis of the recursion obtained by the algorithm, it is available here.

2008-09-17 06:14:03

Answer 16

+1 A:

collecting different language implementations ?
Here is, for your amusement, a Smalltalk version:

join:collectionOfStrings separatedBy:sep

  |buffer|

  buffer := WriteStream on:''.
  collectionOfStrings 
      do:[:each | buffer nextPutAll:each ]
      separatedBy:[ buffer nextPutAll:sep ].
  ^ buffer contents.

Of course, the above code is already in the standard library found as:

Collection >> asStringWith:

so, using that, you'd write:

#('A' 'B' 'C') asStringWith:','

But here's my main point:

I would like to put more emphasis on the fact that using a StringBuilder (or what is called "WriteStream" in Smalltalk) is highly recommended. Do not concatenate strings using "+" in a loop - the result will be many many intermediate throw-away strings. If you have a good Garbage Collector, thats fine. But some are not and a lot of memory needs to be reclaimed. StringBuilder (and WriteStream, which is its grand-grand-father) use a buffer-doubling or even adaptive growing algorithm, which needs MUCH less scratch memory.

However, if its only a few small strings you are concatenating, dont care, and "+" them; the extra work using a StringBuilder might be actually counter-productive, up to an implementation- and language-dependent number of strings.

blabla999 2009-01-12 22:25:35

Depending on how a language implements the '+' operator on strings it can be good or bad. Not all languages create intermediate strings. I can't read smalltalk but I'm curious how you prevent the last ',' from being printed. You only need n-1 separators for n items.

Mendelt 2009-01-14 09:40:51

Answer 17

A:

Use the String.join method in C#

http://msdn.microsoft.com/en-us/library/57a79xd0.aspx

2009-04-05 17:39:59

ansaurus

tags:

views:

answers:

Algorithm for joining e.g. an array of strings

Perl 6

related questions