Hello,

I am currently looking into splitting a very long string that could contain HTML markup.

One example is:

Thiiiissssaaaveryyyylonnngggstringgg

For this I have used this function in the past:

function split($sString, $iCount = 75)
{
    $text = $sString;
    $new_text = '';
    $text_1 = explode('>', $text);
    $sizeof = sizeof($text_1);
    for ($i = 0; $i < $sizeof; ++$i) {
        $text_2 = explode('<', $text_1[$i]);
        if (!empty($text_2[0])) {
            $new_text .= preg_replace('#([^\n\r .]{' . $iCount . '})#iu', '\\1  ', $text_2[0]);
        }
        if (!empty($text_2[1])) {
            $new_text .= '<' . $text_2[1] . '>';
        }
    }
    return $new_text;
}

The function finds long unbroken runs of characters and splits them after X characters. The problem comes when HTML tags or character entities are mixed in, like this:

Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;

I have been trying to figure out how to split the string above without counting the characters inside HTML tags, while counting each character entity (such as &#228;) as 1 character.

Any help would be great.

Thank you

+2  A: 

Consider using the built-in wordwrap() instead?
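
For reference, a minimal sketch of what that looks like on the example string from the question. The width of 10 is only an illustrative value, and the fourth argument has to be true, otherwise wordwrap() leaves a single over-long word untouched. Note that wordwrap() counts bytes and knows nothing about tags or entities.

<?php
$long = 'Thiiiissssaaaveryyyylonnngggstringgg';

// Arguments: string, width, break string, cut long words.
// With the last argument set to true, words longer than the width are cut.
$wrapped = wordwrap($long, 10, "\n", true);

echo $wrapped, "\n";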

Amber
The problem with wordwrap is that it can break the line in the middle of a multibyte UTF-8 character (rendering the string invalid UTF-8) or in the middle of an HTML element, messing it up.
Omry
@omry, see my answer
Dominic Rodger
A: 

I use this function to split strings in FireStats.

You can probably take it out of context and use it pretty easily. Note that it calls some other functions. You can skip the UTF-8 check if you like.

Omry
+1  A: 

Get rid of that complexity: use a DOM parser to extract the plain text.

// Dump contents (without tags) from HTML.
// file_get_html() comes from the PHP Simple HTML DOM Parser library.
$pageText = file_get_html('http://www.google.com/')->plaintext;
echo "Length is: " . strlen($pageText);
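
If pulling in a parser library is more than you need, a rougher alternative (swapped in here, not the DOM approach above) is PHP's built-in strip_tags(). A small sketch on the string from the question; it assumes simple, well-formed tags, and it leaves entities such as &#228; in place, so each one still counts as six bytes in strlen().

<?php
$html = 'Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;';

// Remove the tags but keep the text between them.
$plain = strip_tags($html);

echo $plain, "\n";
// strlen() counts bytes, and the entities are still spelled out here.
echo 'Length is: ', strlen($plain), "\n";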
karim79
+3  A: 

If you're worried about UTF-8 support for wordwrap, then you want this:

// wordwrap() with UTF-8 support
function utf8_wordwrap($str, $width = 75, $break = "\n") {
    $str = preg_split('#[\s\n\r]+#', $str);
    $len = 0;
    $return = ''; // initialise so the first .= does not hit an undefined variable
    foreach ($str as $val) {
        $val .= ' ';
        $tmp = mb_strlen($val, 'utf-8');
        $len += $tmp;
        if ($len >= $width) {
            $return .= $break . $val;
            $len = $tmp;
        }
        else {
            $return .= $val;
        }
    }
    return $return;
}

Source: PHP Manual Comment
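
To see why the function above measures with mb_strlen() rather than strlen(), here is a small sketch comparing byte count and character count for a UTF-8 string. It counts characters with a /u regex so it runs even without the mbstring extension; that substitution is my choice for the example, not something from the manual comment.

<?php
// Three times the letter a-umlaut (U+00E4), two bytes each in UTF-8.
$s = "\u{00E4}\u{00E4}\u{00E4}";

// strlen() counts bytes.
$bytes = strlen($s);

// With the /u modifier PCRE treats the subject as UTF-8, so '.' matches
// one code point; preg_match_all() returns the number of matches.
$chars = preg_match_all('/./su', $s);

echo "bytes: $bytes, characters: $chars\n";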

As to your issue with codepoints, you might want to look at html_entity_decode, which converts numeric character references (e.g. &#223;) to the characters they represent. You'll need to give it a charset so it knows which encoding to produce the decoded character in (the number itself is always a Unicode code point, whatever the charset).
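
html_entity_decode turns the entities into real characters. If you would rather leave the entities untouched in the output, another option is to treat each entity as a single token while counting. Below is a rough sketch of that idea. The function name split_long_words and its parameters are made up for this example; it assumes valid UTF-8 input and simple tags with no '>' inside attribute values.

<?php
// Hypothetical helper (not from the thread): insert $break into runs of more
// than $width visible non-space characters. Tags are copied through without
// being counted, and each entity such as &#228; counts as one character.
function split_long_words(string $html, int $width = 75, string $break = ' '): string
{
    // With the capture group, odd-indexed pieces are tags and
    // even-indexed pieces are the text between them.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    $run = 0; // length of the current unbroken run of visible characters

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part; // a tag: copy it, do not count it
            continue;
        }

        // A token is either one whole entity or one single character.
        $pattern = '/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);|./su';
        if (!preg_match_all($pattern, $part, $m)) {
            $out .= $part; // empty piece or invalid UTF-8: pass through as-is
            continue;
        }

        foreach ($m[0] as $token) {
            if (preg_match('/^\s$/u', $token)) {
                $run = 0; // whitespace ends the run
                $out .= $token;
                continue;
            }
            if ($run === $width) {
                $out .= $break; // run is full: break before the next token
                $run = 0;
            }
            $out .= $token;
            $run++;
        }
    }

    return $out;
}

$sample = 'Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;';
echo split_long_words($sample, 10), "\n";

The run counter is deliberately not reset at a tag, so text on both sides of an inline tag is treated as one visual word.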

Dominic Rodger
Thanks for the tip on "html_entity_decode". I used that function together with what I was working on and it seems to be working perfectly. Thanks again!
Patrik Johansson
@Patrik Johansson - glad it worked for you :)
Dominic Rodger