A 'truncate words' function takes a string of words and returns only the first N of them, say the first 10.

Dojo (a JavaScript library) has such a function; its code is:

truncatewords: function(value, arg){
    // summary: Truncates a string after a certain number of words
    // arg: Integer
    //      Number of words to truncate after
    arg = parseInt(arg);
    if(!arg){
        return value;
    }

    for(var i = 0, j = value.length, count = 0, current, last; i < value.length; i++){
        current = value.charAt(i);
        if(dojox.dtl.filter.strings._truncatewords.test(last)){
            if(!dojox.dtl.filter.strings._truncatewords.test(current)){
                ++count;
                if(count == arg){
                    return value.substring(0, j + 1);
                }
            }
        }else if(!dojox.dtl.filter.strings._truncatewords.test(current)){
            j = i;
        }
        last = current;
    }
    return value;
}

where dojox.dtl.filter.strings._truncatewords is /(&.*?;|<.*?>|(\w[\w-]*))/g

Why isn't this written like so:

function truncate(value, arg) {
    var value_arr = value.split(' ');
    if (arg < value_arr.length) {
        value = value_arr.slice(0, arg).join(' ');
    }
    return value;
}

and what are the differences?

A: 

The code you're looking at is from the dtl library, which supports the Django templating language (http://www.dojotoolkit.org/book/dojo-book-0-9/part-5-dojox/dojox-dtl). I'm sure the code in there is not meant to do just a straight string split, but rather to parse the templates they're using.

Also, looking at that regex, they're handling a lot more scenarios than just spaces. For example, the <.*?> alternative treats everything from an opening < to the next > (that is, a whole tag) as a single "word".

jvenema
Yeah, I'm also working on a port of Django templates to JavaScript, and I figured Dojo's dtl would be a good place to get some ideas and perhaps some code. I'm surprised (puzzled?) that HTML/XML tags would be considered words. Usually when I truncate a string, it's because I want to show a summary with a "more..." link, no?
snz3
I can't speak to how they were using the code in there. For your purposes, sure, that makes sense. But since the regex includes them, I guess it's valid. Maybe it's just to show the first X words of a template in some sort of template preview? Without spending more time in there, I'm not sure. If you post to the Dojo mailing list, I'm sure they could help you out.
jvenema
+2  A: 

Your split should take into account that any sequence of blank characters is a word separator. You should split on a regexp like \s+.

But other than that, it seems Dojo's code takes entities and XML tags as words as well. If you know you don't have such things in your string, your implementation might do the trick. Be careful, though, that your slice does not go beyond the number of words found; this might need a little check.
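
As a minimal sketch of that suggestion, the question's function could split on runs of whitespace instead of a single space. The name truncateWords and the handling of a missing or non-numeric count are assumptions made for illustration, and this version does not treat entities or tags specially the way the Dojo regex tries to.

```javascript
// Sketch only: truncateWords is a hypothetical name, not part of Dojo.
function truncateWords(value, count) {
    var limit = parseInt(count, 10);
    // Like the Dojo filter, return the input unchanged when the count
    // is missing, zero, or not a number (negative counts are ignored too).
    if (!limit || limit < 0) {
        return value;
    }
    // Trim first so leading whitespace does not yield an empty first "word",
    // then treat any run of whitespace as a single separator.
    var words = value.trim().split(/\s+/);
    if (words.length <= limit) {
        return value;
    }
    return words.slice(0, limit).join(" ");
}
```

Note that the truncated result joins words with single spaces, so the original spacing is not preserved in the part that is kept.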

subtenante
A: 
  1. Function declaration: this is probably a method on a JavaScript object, and the function_name: function(params) {... form keeps it out of the global scope.
  2. By parsing and then checking arg, they ensure a usable number was passed. Using parseInt() allows both 10 and "10" to be accepted.
  3. Because it uses a regex, this method can handle more delimiters than just spaces.
  4. This code is safe when value has fewer words than requested. It can't count to 10 if there are only 8 words in value, so it simply returns value unchanged. (In JavaScript the split/slice version would not raise an out-of-bounds error either, because slice clamps to the array length.)
Jarrett Meyer
Of course, they should use parseInt(arg, 10) ...
Greg
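
The radix point and the out-of-bounds concern can both be checked in a few lines. This is only an illustrative sketch; parseCount is a hypothetical helper and is not part of Dojo.

```javascript
// Hypothetical helper: always parse the word count as base 10.
function parseCount(arg) {
    return parseInt(arg, 10);
}

// Without a radix, parseInt infers the base from a "0x" prefix.
var withoutRadix = parseInt("0x1f");  // 31
var withRadix = parseCount("0x1f");   // 0, parsing stops at the "x"
var fromString = parseCount("10");    // 10
var notANumber = parseCount("ten");   // NaN, which the !arg check rejects

// slice clamps its end index, so asking for more words than exist is safe.
var words = "only three words".split(" ");
var firstTen = words.slice(0, 10);    // all three words, no error
```

So the split/slice approach does not need an explicit bounds check to avoid an exception, although checking the length first still avoids rebuilding a string that needs no truncation.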
A: 

The regex has three parts:

  1. &.*?; will match character entities (like &amp;)
  2. <.*?> will match anything in angle brackets
  3. (\w[\w-]*) will match strings that start with a word character ([a-zA-Z0-9_]) and continue with word characters or dashes

It's not just splitting on spaces. It's looking for things it thinks could be part of a word, and once it finds something that is not, it increments the word count.

It should work with a comma- or pipe-separated list as well as a space-separated list.
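
To see what each alternative matches, the pattern from the question can be run over a sample string with String.prototype.match. The sample text and the helper matchesSingleChar are made up for illustration. One caveat worth hedging: the loop quoted in the question tests one character at a time (value.charAt(i)), and a lone character can only ever satisfy the \w branch, so the entity and tag branches may not come into play there.

```javascript
// The pattern quoted in the question.
var pattern = /(&.*?;|<.*?>|(\w[\w-]*))/g;

// Over a whole string, all three alternatives can match.
var sample = "Tom &amp; Jerry <b>run-fast</b>";
var tokens = sample.match(pattern);
// tokens: ["Tom", "&amp;", "Jerry", "<b>", "run-fast", "</b>"]

// Hypothetical helper: test a single character against a non-global copy
// of the same pattern (non-global avoids lastIndex carrying over between calls).
function matchesSingleChar(ch) {
    return /(&.*?;|<.*?>|(\w[\w-]*))/.test(ch);
}

var letter = matchesSingleChar("a");     // true
var bracket = matchesSingleChar("<");    // false, "<" alone is not a tag
var ampersand = matchesSingleChar("&");  // false, "&" alone is not an entity
var dash = matchesSingleChar("-");       // false, a word cannot start with "-"
```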

Charlie
Having read your comment and the comments above, I tried using Dojo's regexp for a better solution. The problem is that Dojo's version can't truncate a string written in non-Latin characters (as you said, \w only matches [a-zA-Z0-9_]). So my new method would be: ...var value_arr = value.match(/(.+?([^\-](?=\s|,)))/g); if(value_arr }return value;
snz3
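
The non-Latin limitation mentioned in the last comment can be illustrated with a short sketch. It assumes an engine that supports Unicode property escapes (ES2018, the u flag), which postdates this thread; firstWords is a hypothetical name and the pattern is one possible choice, not Dojo's and not the commenter's.

```javascript
// \w is ASCII-only, so it does not see Cyrillic letters as word characters.
var asciiOnly = /\w/.test("привет");  // false

// Hypothetical Unicode-aware variant: \p{L} is any letter, \p{N} any number.
function firstWords(value, count) {
    // A word starts with a letter, number, or underscore and may continue
    // with those characters or dashes, mirroring the shape of \w[\w-]*.
    var words = value.match(/[\p{L}\p{N}_][\p{L}\p{N}_-]*/gu) || [];
    return words.slice(0, count).join(" ");
}

var russian = firstWords("привет мир, как дела", 2);  // "привет мир"
var latin = firstWords("one-two three", 5);            // "one-two three"
```

Because the result is rebuilt by joining the matched words with spaces, punctuation and original spacing between words are dropped; whether that is acceptable depends on how the summary is displayed.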