Assume I have the following string:

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.

This string is a sequence of characters not separated by spaces, and it also contains an embedded HTML image tag. I want to split it into words of 10 characters each, so the output should be:

1)Hellotoevr
2)yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog
3)ladtoseeal
4)l.

So the idea is to treat any HTML tag as having zero length.

I wrote the following method, but it does not take HTML tags into account:

public static string EnsureWordLength(this string target, int length)
{
    string[] words = target.Split(' ');
    for (int i = 0; i < words.Length; i++)
        if (words[i].Length > length)
        {
            var possible = true;
            var ord = 1;
            do
            {
                var lengthTmp = length*ord+ord-1;
                if (lengthTmp < words[i].Length) words[i] = words[i].Insert(lengthTmp, " ");
                else possible = false;
                ord++;
            } while (possible); 

        }

    return string.Join(" ", words);
}

I would like to see code that performs the splitting as I described. Thanks.

A: 

The following code handles the case you provided, but will break for anything more complex. Also, since you did not specify how long-form tags with inner text or HTML should be handled, it treats all tags as short-form ones (run the code to see what I mean).

Works with this input:

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.
Hellotoevryone<img src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsoglad<img src="baz.jpeg" />toseeall.
Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeallTheQuickBrown<img src="bar.jpeg" />FoxJumpsOverTheLazyDog.
Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeall.
Loremipsumdolorsitamet,consecteturadipiscingelit.Nullamacnibhelit,quisvolutpatnunc.Donecultrices,ipsumquisaccumsanconvallis,tortortortorgravidaante,etsollicitudinipsumnequeeulorem.

Breaks with this input (note the incomplete tag):

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" /Iamsogladtoseeall.

using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Collections.Generic;

public static class CustomSplit {
  public static void Main(String[] args) {
    if (args.Length > 0 && File.Exists(args[0])) {
      // File.ReadAllLines opens, reads and closes the file, so nothing is left undisposed.
      String[] lines = File.ReadAllLines(args[0]);

      int counter = 0;
      foreach (String line in lines) {
        Console.WriteLine("########### Line {0} ###########", ++counter);
        Console.WriteLine(line);
        Console.WriteLine(line.EnsureWordLength(10));
      }
    }
  }

}

public static class EnsureWordLengthExtension {
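  // Inserts a space after every `length` visible characters.
  // Characters inside <...> tags are kept but do not count toward the length.
  public static String EnsureWordLength(this String target, int length) {
    List<List<Char>> words = new List<List<Char>>();

    words.Add(new List<Char>());

    for (int i = 0; i < target.Length; i++) {
      words[words.Count - 1].Add(target[i]);

      // On '<', copy the rest of the tag into the current word in one go.
      // An unclosed tag runs past the end of the string and throws IndexOutOfRangeException.
      if (target[i] == '<') {
        do {
          i++;
          words[words.Count - 1].Add(target[i]);
        } while(target[i] != '>');
      }

      // Once the current word holds `length` visible characters, start a new one.
      if ((new String(words[words.Count - 1].ToArray())).CountCharsWithoutTags() == length) {
        words.Add(new List<Char>());
      }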
    }

    String[] result = new String[words.Count];
    for (int j = 0; j < words.Count; j++) {
      result[j] = new String(words[j].ToArray());
    }

    return String.Join(" ", result);
  }

  private static int CountCharsWithoutTags(this String target) {
    return Regex.Replace(target, "<.*?>", "").Length;
  }
}
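
If the unclosed-tag input matters to you, a bounds-checked single-pass variant could look roughly like the sketch below. SafeWordLengthExtension and EnsureWordLengthSafe are names I made up for illustration. It assumes an unclosed '<' should simply be counted as ordinary text, and it differs from the code above in one detail: a tag that follows a full word stays attached to that word rather than starting the next one.

```csharp
using System;
using System.Text;

public static class SafeWordLengthExtension {
  // Hypothetical name: a bounds-checked variant of EnsureWordLength.
  public static String EnsureWordLengthSafe(this String target, int length) {
    if (length <= 0) throw new ArgumentOutOfRangeException("length");

    StringBuilder result = new StringBuilder();
    int visible = 0; // non-tag characters in the current word

    for (int i = 0; i < target.Length; i++) {
      if (target[i] == '<') {
        int close = target.IndexOf('>', i);
        if (close >= 0) {
          // Complete tag: copy it whole and count it as zero length.
          result.Append(target, i, close - i + 1);
          i = close;
          continue;
        }
        // Assumption: with no closing '>', fall through and treat '<' as text.
      }

      // Start a new word only when another visible character follows,
      // so the result never ends with a trailing space.
      if (visible == length) {
        result.Append(' ');
        visible = 0;
      }
      result.Append(target[i]);
      visible++;
    }

    return result.ToString();
  }
}
```

Calling "abc<def".EnsureWordLengthSafe(10) returns the input unchanged instead of throwing, because the stray '<' is counted like any other character.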
brianpeiris
+3  A: 

Here's a regular expression solution that matches your requirements. Bear in mind that it will probably stop working if you alter your requirements even slightly, which is in keeping with the well-known quote about regular expressions.

using System;
using System.Text.RegularExpressions;

string[] samples = {
    @"Hellotoevryone<img height=""115"" width=""150"" alt="""" src=""/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg"" />Iamsogladtoseeall.",
    "Testing123Hello.World",
    @"Test<a href=""http://stackoverflow.com"">StackOverflow</a>",
    @"Blah<a href=""http://stackoverflow.com"">StackOverflow</a>Blah<a href=""http://serverfault.com"">ServerFault</a>",
    @"Test<a href=""http://serverfault.com"">Server Fault</a>", // has a space, not matched
    "Stack Overflow" // has a space, not matched
};

// use these 2 lines if you don't want to use regex comments
//string pattern = @"^((?:\S(?:\<[^>]+\>)?){1,10})+$";
//Regex rx = new Regex(pattern);

// regex comments spanning multiple lines require RegexOptions.IgnorePatternWhitespace
string pattern = @"^(               # match line/string start, begin group
                    (?:\S           # match (but don't capture) non-whitespace chars
                    (?:\<[^>]+\>)?  # optionally match (doesn't capture) an html <...> tag
                                    # to match img tags only change to (?:\<img[^>]+\>)?
                    ){1,10}         # match up to 10 chars (tags don't count per your example)
                    )+$             # match at least once, and match end of line/string
                    ";
Regex rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

foreach (string sample in samples)
{
    if (rx.IsMatch(sample))
    {
        foreach (Match m in rx.Matches(sample))
        {
            // using group index 1, group 0 is the entire match which I'm not interested in
            foreach (Capture c in m.Groups[1].Captures)
            {
                Console.WriteLine("Capture: {0} -- ({1})", c.Value, c.Value.Length);
            }
        }
    }
    else
    {
        Console.WriteLine("Not a match: {0}", sample);
    }

    Console.WriteLine();
}

Using the samples above, here's the output (numbers in parentheses = string length):

Capture: Hellotoevr -- (10)
Capture: yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog -- (116)
Capture: ladtoseeal -- (10)
Capture: l. -- (2)

Capture: Testing123 -- (10)
Capture: Hello.Worl -- (10)
Capture: d -- (1)

Capture: Test<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a> -- (11)

Capture: Blah<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a>Bla -- (14)
Capture: h<a href="http://serverfault.com">ServerFau -- (43)
Capture: lt</a> -- (6)

Not a match: Test<a href="http://serverfault.com">Server Fault</a>

Not a match: Stack Overflow
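
If you'd rather get back the space-separated string from the question than a list of captures, something along these lines might work. Treat it as a sketch, not a drop-in: TagAwareChunker and JoinChunks are made-up names, it assumes every '<' opens a well-formed tag (a stray '<' is silently dropped), and unlike the anchored pattern above it treats whitespace outside tags as a separator instead of rejecting the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAwareChunker {
  // Hypothetical helper, not part of the solution above.
  public static String JoinChunks(String input, int length) {
    if (length <= 0) throw new ArgumentOutOfRangeException("length");

    // One chunk is either:
    //   optional leading tags, then 1..length visible chars, each optionally followed by tags
    //   or a run of tags on its own.
    String tag = "<[^>]+>";
    String pattern = "(?:" + tag + ")*(?:[^<\\s](?:" + tag + ")*){1," + length + "}"
                   + "|(?:" + tag + ")+";

    List<String> chunks = new List<String>();
    foreach (Match m in Regex.Matches(input, pattern)) {
      chunks.Add(m.Value);
    }

    return String.Join(" ", chunks.ToArray());
  }
}
```

For example, TagAwareChunker.JoinChunks("Testing123Hello.World", 10) gives "Testing123 Hello.Worl d", and an input like "Stack Overflow" comes back unchanged because neither word exceeds 10 characters.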
Ahmad Mageed