I have an arbitrarily large string of text from the user that needs to be split into 10k chunks (a potentially adjustable value) and sent off to another system for processing.

  • Chunks cannot be longer than 10k (or other arbitrary value)
  • Text should be broken with natural language context in mind
    • split on punctuation when possible
    • split on spaces if no punctuation exists
    • break a word as a last resort

I'm trying not to reinvent the wheel here. Any suggestions before I roll this from scratch?

Using C#.

+1  A: 

This will probably end up being more difficult than you expect (as most natural language problems do), but check out Sharp Natural Language Parser.

I'm currently using SharpNLP. It works pretty well, but there are always gotchas.

Let me know if this isn't what you're looking for.

Mark

MStodd
Thanks Mark, I'll check that library out.
Chris Ballance
+1  A: 

This may not handle every case you need, but it should get you on your way.

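    // Requires: using System; using System.Collections.Generic;
    public IList<string> ChunkifyText(string bigString, int maxSize, char[] punctuation)
    {
        if (bigString == null)
            throw new ArgumentNullException("bigString");
        if (maxSize <= 0)
            throw new ArgumentOutOfRangeException("maxSize"); // would otherwise loop forever

        List<string> results = new List<string>();
        int startIndex = 0;

        while (startIndex < bigString.Length)
        {
            // The remainder fits in one chunk: take it whole instead of splitting it again.
            if (bigString.Length - startIndex <= maxSize)
            {
                results.Add(bigString.Substring(startIndex));
                break;
            }

            string chunk = bigString.Substring(startIndex, maxSize);

            // Prefer to break after the last punctuation mark in the window...
            int endIndex = chunk.LastIndexOfAny(punctuation);

            // ...then after the last space...
            if (endIndex < 0)
                endIndex = chunk.LastIndexOf(' ');

            // ...and break a word only as a last resort.
            if (endIndex < 0)
                endIndex = chunk.Length - 1;

            results.Add(chunk.Substring(0, endIndex + 1));

            startIndex += endIndex + 1;
        }

        return results;
    }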
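
Since the chunks get sent off to another system one at a time, a lazy variant of the same idea may be worth considering: it yields each chunk as it is found instead of building the whole list first. This is only a minimal sketch. The names `TextChunker`, `Split` and `SplitIterator` are made up for illustration, and it searches the original string in place with the `LastIndexOfAny(char[], int, int)` and `LastIndexOf(char, int, int)` overloads rather than copying each window into a substring.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker // hypothetical name, for illustration only
{
    // Lazily yields chunks of at most maxSize characters, preferring to break
    // after punctuation, then after a space, then in the middle of a word.
    public static IEnumerable<string> Split(string text, int maxSize, char[] punctuation)
    {
        // Validate eagerly; the iterator body below only runs on enumeration.
        if (text == null) throw new ArgumentNullException("text");
        if (punctuation == null) throw new ArgumentNullException("punctuation");
        if (maxSize <= 0) throw new ArgumentOutOfRangeException("maxSize");
        return SplitIterator(text, maxSize, punctuation);
    }

    private static IEnumerable<string> SplitIterator(string text, int maxSize, char[] punctuation)
    {
        int start = 0;

        // Keep cutting while the remainder is too long for a single chunk.
        while (text.Length - start > maxSize)
        {
            // Search backwards over the window [start, start + maxSize).
            int last = start + maxSize - 1;
            int end = text.LastIndexOfAny(punctuation, last, maxSize);
            if (end < 0) end = text.LastIndexOf(' ', last, maxSize);
            if (end < 0) end = last; // no break point found: split mid-word

            yield return text.Substring(start, end - start + 1);
            start = end + 1;
        }

        // Whatever is left fits in one chunk.
        if (start < text.Length) yield return text.Substring(start);
    }
}

public static class Program
{
    public static void Main()
    {
        char[] marks = { '.', ',', ';', '!', '?' };

        // A tiny maxSize makes the break points easy to see.
        foreach (string chunk in TextChunker.Split("Hello, world. This is a test.", 10, marks))
            Console.WriteLine("[" + chunk + "]");
    }
}
```

With a limit of 10 this should print `[Hello,]`, `[ world.]`, `[ This is ]` and `[a test.]`. Whitespace stays attached to the chunks, so concatenating them gives back the original text; trim each chunk if the receiving system doesn't want the leading spaces.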
Scott J
+1 I ended up implementing something similar to this. Thanks for the code example!
Chris Ballance