I wrote a script that sends chunks of text to Google to translate, but sometimes the text (which is HTML source code) gets split in the middle of an HTML tag, and Google returns the code incorrectly.

I already know how to split the string into an array, but is there a better way to do this while ensuring the output string does not exceed 5000 characters and does not split on a tag?

UPDATE: Thanks to the accepted answer, this is the code I ended up using in my project, and it works great:

function handleTextHtmlSplit($text, $maxSize) {
    // Collection of output chunks, starting with one empty chunk
    $niceHtml = [''];

    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    //the current position of the index
    $currentPiece = 0;

    //start assembling a group until it gets to max size

    foreach ($pieces as $piece) {
        //make sure string length of this piece will not exceed max size when inserted
        if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
            //advance current piece
            //will put overflow into next group
            $currentPiece += 1;
            //create empty string as value for next piece in the index
            $niceHtml[$currentPiece] = '';
        }
        //insert piece into our master array
        $niceHtml[$currentPiece] .= $piece;
    }

    //return array of nicely handled html
    return $niceHtml;
}
A: 

Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.

Mark Baker
Because I need to keep the HTML intact; it will end up being rendered on the page.
james
Doesn't Google Translate strip out the HTML itself anyway?
Mark Baker
No. As far as my tests show, it ignores HTML tags and attributes other than 'alt' and returns them untouched.
james
A: 

preg_split with a good regex would do it for you.
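
A minimal sketch of that idea, assuming a pattern that treats anything between `<` and `>` as a tag (it would misfire on a literal `>` inside an attribute value, for instance). The sample input is made up. The capturing group together with `PREG_SPLIT_DELIM_CAPTURE` keeps the tags in the result, so the pieces can be regrouped and joined back together later:

```php
<?php
// Made-up sample input, for illustration only
$html = '<p class="intro">Hello <b>world</b></p>';

// The capturing group makes preg_split() return each tag as its own item;
// PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent tags.
$pieces = preg_split(
    '/(<[^>]*>)/',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($pieces);

// Joining the pieces reproduces the input unchanged
var_dump(implode('', $pieces) === $html);
```

Because no piece ever contains part of a tag, grouping whole pieces into chunks cannot cut a tag in half.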

Scott Saunders
+2  A: 

Note: I haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:

function get_groups_of_5000_or_less($input_string) {

    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $input_string,
        -1, PREG_SPLIT_DELIM_CAPTURE);

    $groups[] = '';
    $current_group = 0;

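    // Compare against null explicitly: an empty-string or "0" piece is falsy
    // and would otherwise end the loop early
    while (($cur_piece = array_shift($pieces)) !== null) {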
        $piecelen = strlen($cur_piece);

        if(strlen($groups[$current_group]) + $piecelen > 5000) {
            // Adding the next piece whole would go over the limit,
            // figure out what to do.
            if($cur_piece[0] == '<') {
                // Tag goes over the limit, just put it into a new group
                $groups[++$current_group] = $cur_piece;
            } else {
                // Non-tag goes over the limit, split it and put the
                // remainder back on the list of un-grabbed pieces
                $grab_amount = 5000 - strlen($groups[$current_group]);
                $groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
                $groups[++$current_group] = '';
                array_unshift($pieces, substr($cur_piece, $grab_amount));
            }
        } else {
            // Adding this piece doesn't go over the limit, so just add it
            $groups[$current_group] .= $cur_piece;
        }
    }
    return $groups;
}

Also note that this can split in the middle of regular words - if you don't want that, then modify the part that begins with // Non-tag goes over the limit to choose a better value for $grab_amount. I didn't bother coding that in since this is just supposed to be an example of how to get around splitting tags, not a drop-in solution.
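
One possible way to choose that value, as a rough sketch only: `choose_grab_amount` is a hypothetical helper, not part of the function above. It assumes words are separated by plain spaces and that lengths are counted in bytes, as `strlen()` does. It cuts just after the last space that fits and falls back to a hard cut when there is no space to cut at:

```php
<?php
// Hypothetical helper: decide how many bytes of $text to take so that the
// cut lands just after a space whenever one is available within $limit.
function choose_grab_amount($text, $limit) {
    if ($limit <= 0) {
        return 0;
    }
    if (strlen($text) <= $limit) {
        return strlen($text);
    }
    // Look one byte past the limit so a space sitting exactly at the cut counts
    $window = substr($text, 0, $limit + 1);
    $last_space = strrpos($window, ' ');
    // No usable space: fall back to a hard cut (this may split a word)
    if ($last_space === false || $last_space === 0) {
        return $limit;
    }
    // Cut just after the space, but never beyond the limit
    return min($last_space + 1, $limit);
}

$text = 'hello world foo';
$n = choose_grab_amount($text, 8);
echo substr($text, 0, $n), '|', substr($text, $n), "\n"; // prints "hello |world foo"
```

In the function above, this would replace the `$grab_amount` calculation with something like `choose_grab_amount($cur_piece, 5000 - strlen($groups[$current_group]))`. The hard-cut fallback keeps the loop moving when a single word is longer than the remaining room, but it can still split a multibyte character, so UTF-8 text would need the `mb_*` string functions instead.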

Amber
Wow Amber, thanks for that. It should really get my wheels spinning. I will give it a go.
james