I am looking for an algorithm that takes a string and splits it into a given number of parts. Each part should contain only complete words (so the string is split at whitespace), and the parts should be of nearly the same length, or be as long as possible.

I know it is not that hard to code a function that can do what I want but I wonder whether there is a well-proven and fast algorithm for that purpose?

Edit: To clarify my question, I'll describe the problem I am trying to solve.

I generate images with a fixed width. Into these images I write user names using GD and Freetype in PHP. Since I have a fixed width I want to split the names into 2 or 3 lines if they don't fit into one.

In order to fill as much space as possible, I want to split the names so that each line holds as many words as necessary to keep its length close to the average line length of the whole text block. So if there is one long word and two short words, the two short words should share a line if that makes all lines about equally long.

(Then I compute the text block width using 1, 2 or 3 lines, and if it fits into my image I render it. Only if there are 3 lines and it still doesn't fit do I decrease the font size until it does.)

Example: "This is a long text" should be displayed something like this:

This is a
long text

or:

This is
a long
text

but not:

This
is a long
text

and also not:

This is a long
text

I hope this explains more clearly what I am looking for.

+5  A: 

If you're talking about line-breaking, take a look at Dynamic Line Breaking, which gives a Dynamic Programming solution to divide words into lines.
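To make the dynamic programming idea concrete for the fixed line count in the question, here is a minimal Python sketch. It is not taken from the linked article: the function name `split_balanced` and the cost function (sum of squared line lengths, counted in characters) are assumptions chosen for illustration.

```python
from functools import lru_cache


def split_balanced(text, parts):
    """Split text at whitespace into exactly `parts` lines, minimising
    the sum of squared line lengths (an assumed measure of evenness)."""
    words = text.split()
    n = len(words)
    if not 1 <= parts <= n:
        raise ValueError("parts must be between 1 and the number of words")

    # prefix[i] holds the total number of characters in words[:i].
    prefix = [0]
    for word in words:
        prefix.append(prefix[-1] + len(word))

    def line_len(i, j):
        # Length of words[i:j] joined by single spaces.
        return prefix[j] - prefix[i] + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Cheapest way to lay out words[i:] on exactly k lines,
        # returned as (cost, tuple of line end indices).
        if k == 1:
            return line_len(i, n) ** 2, (n,)
        candidates = []
        # Leave at least k - 1 words for the remaining k - 1 lines.
        for j in range(i + 1, n - k + 2):
            cost, breaks = best(j, k - 1)
            candidates.append((line_len(i, j) ** 2 + cost, (j,) + breaks))
        return min(candidates)

    _, breaks = best(0, parts)
    lines, start = [], 0
    for end in breaks:
        lines.append(" ".join(words[start:end]))
        start = end
    return lines
```

For the example in the question, `split_balanced("This is a long text", 2)` gives `['This is a', 'long text']` and `split_balanced("This is a long text", 3)` gives `['This is', 'a long', 'text']`. For rendering, the character counts would have to be replaced by measured pixel widths.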

Larry
Maybe have a look at how LaTeX does it?
jk
A: 

Partitioning into equal sizes is NP-Complete

Steve B.
How would you reduce this problem to the Partitioning problem? I don't think this problem is NPC.
Jacob
-1 I agree with Jacob. There are at most three lines in the problem as stated, so if the length of the string is N, there are O(N^2) possible ways to split the string into three substrings, regardless of the amount of whitespace. You can iterate through all of them in polynomial time. There is nothing NP-complete in this problem. Even in the general case, you can first choose a split point (O(N) possibilities) and then recursively split the two parts, yielding a worst-case quadratic algorithm and in practice O(N log N).
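A rough sketch of the exhaustive approach this comment describes, in Python. The name `split_exhaustive` and the choice to minimise the longest line (in characters) are assumptions for illustration; it only tries split points at word boundaries, which is enough for the problem as stated.

```python
from itertools import combinations


def split_exhaustive(text, parts):
    """Try every way of cutting the word list into `parts` lines and
    keep the one whose longest line is shortest."""
    words = text.split()
    if not 1 <= parts <= len(words):
        raise ValueError("parts must be between 1 and the number of words")

    best_lines = None
    # Each combination picks parts - 1 cut positions between words.
    for cuts in combinations(range(1, len(words)), parts - 1):
        bounds = (0, *cuts, len(words))
        lines = [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
        # Strict comparison keeps the first candidate on ties.
        if best_lines is None or max(map(len, lines)) < max(map(len, best_lines)):
            best_lines = lines
    return best_lines
```

With two or three lines and a handful of words per name, the number of candidates is tiny, so this brute-force search is likely fast enough for the image-rendering case.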
antti.huima
+1  A: 

I don't know about proven, but it seems like the simplest and most efficient solution would be to divide the length of the string by N then find the closest white space to the split locations (you'll want to search both forward and back).

The code below seems to work, though there are plenty of error conditions that it doesn't handle. Each split-point lookup scans outward from its target index, so for k parts and a string of length L the total work is at most O(k·L).

using System;

class Program
{
    static void Main(string[] args)
    {
        var s = "This is a string for testing purposes. It will be split into 3 parts";
        var p = s.Length / 3;
        var w1 = 0;
        var w2 = FindClosestWordIndex(s, p);
        var w3 = FindClosestWordIndex(s, p * 2);
        Console.WriteLine(string.Format("1: {0}", s.Substring(w1, w2 - w1).Trim()));
        Console.WriteLine(string.Format("2: {0}", s.Substring(w2, w3 - w2).Trim()));
        Console.WriteLine(string.Format("3: {0}", s.Substring(w3).Trim()));
        Console.ReadKey();
    }

    public static int FindClosestWordIndex(string s, int startIndex)
    {
        int wordAfterIndex = -1;
        int wordBeforeIndex = -1;
        for (int i = startIndex; i < s.Length; i++)
        {
            if (s[i] == ' ')
            {
                wordAfterIndex = i;
                break;
            }
        }
        for (int i = startIndex; i >= 0; i--)
        {
            if (s[i] == ' ')
            {
                wordBeforeIndex = i;
                break;
            }
        }

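        // -1 means no space was found on that side; use the other side.
        // If the string has no spaces at all, -1 is returned.
        if (wordAfterIndex < 0)
            return wordBeforeIndex;
        if (wordBeforeIndex < 0)
            return wordAfterIndex;

        // Otherwise pick the space closest to the ideal split position.
        if (wordAfterIndex - startIndex <= startIndex - wordBeforeIndex)
            return wordAfterIndex;
        else
            return wordBeforeIndex;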
    }
}

The output for this is:

1: This is a string for
2: testing purposes. It will
3: be split into 3 parts
Brian
A: 

The way word-wrap is usually implemented is to place as many words as possible onto one line, and break to the next when there is no more room. This assumes, of course, that you have a maximum width in mind.

Regardless of what algorithm you use, keep in mind that unless you are working with a fixed-width font, you want to work with the physical width of the word, not the number of letters.
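As a rough illustration of both points, here is a small Python sketch of greedy wrapping with a pluggable width function. The names `greedy_wrap`, `measure` and `approx_pixel_width` are made up for this example, and the per-character widths are invented numbers; in the asker's PHP/GD setting the measuring would presumably be done with something like `imagettfbbox`.

```python
def greedy_wrap(text, max_width, measure=len):
    """Put as many words as fit on each line, as judged by `measure`.
    `measure` maps a string to its width (character count by default)."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            # The word does not fit: close the current line, start a new one.
            lines.append(current)
            current = word
        else:
            # Either it fits, or the line is empty and the word must go here.
            current = candidate
    if current:
        lines.append(current)
    return lines


def approx_pixel_width(s):
    # Hypothetical proportional font: narrow glyphs 4 px, all others 8 px.
    return sum(4 if ch in "iljt. " else 8 for ch in s)
```

`greedy_wrap("This is a long text", 9)` gives `['This is a', 'long text']`, and passing `measure=approx_pixel_width` with a pixel limit switches the same loop to physical widths. Note that greedy wrapping fills lines from the top and does not try to balance them.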

BlueRaja - Danny Pflughoeft