For a poor man's implementation of near-collation-correct sorting on the client side I need a JavaScript function that does efficient single character replacement in a string.

Here is what I mean (note that this applies to German text, other languages sort differently):

native sorting gets it wrong: a b c o u z ä ö ü
collation-correct would be:   a ä b c o ö u ü z

Basically, I need all occurrences of "ä" in a given string replaced with "a" (and so on). This way the result of native sorting would be very close to what a user would expect (or what a database would return).
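
To make the problem concrete, here is a minimal sketch of sorting by a translated key. The names `toSortKey` and `compareByKey` are made up for this illustration, and the table covers only the six German umlauts:

```javascript
// Illustration only: toSortKey and compareByKey are hypothetical names.
var umlautMap = { "ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U" };

function toSortKey(s) {
  return s.replace(/[äöüÄÖÜ]/g, function (ch) { return umlautMap[ch]; });
}

function compareByKey(a, b) {
  var ka = toSortKey(a), kb = toSortKey(b);
  if (ka < kb) return -1;
  if (ka > kb) return 1;
  // Equal keys ("a" vs. "ä"): fall back to the raw strings
  // so that the resulting order is deterministic.
  return a < b ? -1 : (a > b ? 1 : 0);
}

var letters = ["z", "ü", "u", "ö", "o", "c", "b", "ä", "a"];
var nativeOrder = letters.slice().sort();             // plain code unit order
var keyedOrder = letters.slice().sort(compareByKey);  // order by translated key
```

With this input, `nativeOrder` puts the umlauts after "z", while `keyedOrder` places each umlaut directly after its base letter.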

Other languages have facilities to do just that: Python supplies str.translate(), in Perl there is tr/…/…/, XPath has a function translate(), ColdFusion has ReplaceList(). But what about JavaScript?

Here is what I have right now.

// s would be a rather short string (something like 
// 200 characters at max, most of the time much less)
function makeSortString(s) {
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"   // probably more to come
  };
  var translate_re = /[öäüÖÄÜ]/g;
  return ( s.replace(translate_re, function(match) { 
    return translate[match]; 
  }) );
}

For starters, I don't like the fact that the regex is rebuilt every time I call the function. I guess a closure can help in this regard, but I don't seem to get the hang of it for some reason.

Can someone think of something more efficient?

+3  A: 

I can't speak to what you are trying to do specifically with the function itself, but if you don't like the regex being built every time, here are two solutions and some caveats about each.

Here is one way to do this:

function makeSortString(s) {
  if(!makeSortString.translate_re) makeSortString.translate_re = /[öäüÖÄÜ]/g;
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"   // probably more to come
  };
  return ( s.replace(makeSortString.translate_re, function(match) { 
    return translate[match]; 
  }) );
}

This will obviously make the regex a property of the function itself. The only thing you may not like about this (or you may, I guess it depends) is that the regex can now be modified outside of the function's body. So, someone could do this to modify the internally used regex:

makeSortString.translate_re = /[a-z]/g;

So, there is that option.

One way to get a closure, and thus prevent someone from modifying the regex, would be to define this as an anonymous function assignment like this:

var makeSortString = (function() {
  var translate_re = /[öäüÖÄÜ]/g;
  return function(s) {
    var translate = {
      "ä": "a", "ö": "o", "ü": "u",
      "Ä": "A", "Ö": "O", "Ü": "U"   // probably more to come
    };
    return ( s.replace(translate_re, function(match) { 
      return translate[match]; 
    }) );
  }
})();

Hopefully this is useful to you.


UPDATE: It's early and I don't know why I didn't see the obvious before, but it might also be useful to put your translate object in a closure as well:

var makeSortString = (function() {
  var translate_re = /[öäüÖÄÜ]/g;
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"   // probably more to come
  };
  return function(s) {
    return ( s.replace(translate_re, function(match) { 
      return translate[match]; 
    }) );
  }
})();
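
If the table keeps growing, the character class and the lookup object have to be kept in sync by hand. One possible variation is to derive the regex from the table once. This is only a sketch: `makeTranslator` is a made-up name, and it assumes every key is a single character with no special meaning inside a regex character class.

```javascript
// Sketch: makeTranslator is a hypothetical helper name.
// Assumes each key is one character that is not special inside [...]
// (so no "]", "\", "^" or "-").
function makeTranslator(table) {
  var chars = Object.keys(table).join("");
  var pattern = new RegExp("[" + chars + "]", "g"); // built once per translator
  return function (s) {
    return s.replace(pattern, function (match) {
      return table[match];
    });
  };
}

var makeSortString = makeTranslator({
  "ä": "a", "ö": "o", "ü": "u",
  "Ä": "A", "Ö": "O", "Ü": "U"
});
```

Adding a letter then means adding one entry to the object passed in; the closure keeps both the table and the regex private, as in the version above.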
Jason Bunting
What I'm trying to do is make the sorting of the jQuery tablesorter plugin work correctly for table data in German. The plugin can take a user-defined function to extract the string to sort on, which is what I have to do or the resulting sort order will be wrong.
Tomalak
Is this function really that inefficient? What have you done as far as testing?
Jason Bunting
I did not mean to say my implementation was inefficient. It's close to the most efficient way of doing it that I can think of. But I can't think of everything, so I hoped there was some really clever way of string manipulation that I was unaware of.
Tomalak
I see - well, I think your solution is sufficient; because I could see a use for this function in the long term, I did some basic testing. I did 5000 iterations on a string of 200 characters that contained at least one of these characters once every 8 characters and it took around 500 ms.
Jason Bunting
BTW, that testing was done in FF. In Chrome, it ran about the same; since Chrome's JS engine (V8) is quicker, generally speaking, it might be worth noting this fact, FWIW.
Jason Bunting
Thanks for your time, greatly appreciated. I will do some testing of my own (but not today, it's 6 PM here), and post my findings here.
Tomalak
Actually I have never gotten to writing a test case to compare results. I've left this open as a reminder to do it someday, but it's not fair to not accept the answer I have used so I do it now. Sorry for delaying so long.
Tomalak
No worries - I hope I was helpful!
Jason Bunting
You were. ;-) I posted what I have working right now, maybe someone else finds it useful.
Tomalak
+2  A: 

Based on the solution by Jason Bunting, here is what I use now.

The whole thing is for the jQuery tablesorter plugin: for (nearly correct) sorting of non-English tables with the tablesorter plugin, it is necessary to use a custom textExtraction function.

This one:

  • translates the most common accented letters to unaccented ones (the list of supported letters is easily expandable)
  • changes dates in German format ('dd.mm.yyyy') to a recognized format ('yyyy-mm-dd')

Be careful to save the JavaScript file in UTF-8 encoding or it won't work.
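
As an aside, if controlling the file encoding is not an option, one way around it is to write the non-ASCII characters as `\uXXXX` escapes, so that the source file contains ASCII only. A minimal sketch for the six German umlauts; `asciiSafeSortString` is a made-up name:

```javascript
// Sketch: the umlaut table written with \uXXXX escapes, so this source
// is pure ASCII. asciiSafeSortString is a hypothetical name.
var asciiSafeSortString = (function () {
  var translate = {
    "\u00e4": "a", "\u00f6": "o", "\u00fc": "u",  // lowercase a-, o-, u-umlaut
    "\u00c4": "A", "\u00d6": "O", "\u00dc": "U"   // uppercase A-, O-, U-umlaut
  };
  var translateRe = /[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc]/g;
  return function (s) {
    return s.replace(translateRe, function (match) {
      return translate[match];
    });
  };
})();
```

The listing below keeps the literal characters and therefore does rely on the file being saved and served as UTF-8.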

// file encoding must be UTF-8!
function GetTextExtractor()
{
  return (function() {
    var patternLetters = /[öäüÖÄÜáàâéèêúùûóòôÁÀÂÉÈÊÚÙÛÓÒÔß]/g;
    var patternDateDmy = /^(?:\D+)?(\d{1,2})\.(\d{1,2})\.(\d{2,4})$/;
    var lookupLetters = {
      "ä": "a", "ö": "o", "ü": "u",
      "Ä": "A", "Ö": "O", "Ü": "U",
      "á": "a", "à": "a", "â": "a",
      "é": "e", "è": "e", "ê": "e",
      "ú": "u", "ù": "u", "û": "u",
      "ó": "o", "ò": "o", "ô": "o",
      "Á": "A", "À": "A", "Â": "A",
      "É": "E", "È": "E", "Ê": "E",
      "Ú": "U", "Ù": "U", "Û": "U",
      "Ó": "O", "Ò": "O", "Ô": "O",
      "ß": "s"
    };
    var TranslateCallback = function(match) { 
      if (lookupLetters[match])
        return lookupLetters[match]; 
      else
        return match;
    }

    return function(node) {
      var text = $.trim($(node).text());
      var matches;
      if (matches = text.match(patternDateDmy))
        return [matches[3], matches[2], matches[1]].join("-");
      else
        return text.replace(patternLetters, TranslateCallback);
    }
  })();
}

You can use it like this:

$("table.sortable").tablesorter({ 
  textExtraction: GetTextExtractor()
});
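
To try the two transformations without a page or jQuery, the same logic can be applied to a plain string. This is a cut-down sketch rather than the extractor above: `extractSortText` is a hypothetical name, the letter table is reduced to the German characters, and `String.prototype.trim` stands in for `$.trim`.

```javascript
// Sketch only: extractSortText is a hypothetical, DOM-free stand-in
// for the function returned by GetTextExtractor().
var extractSortText = (function () {
  var patternLetters = /[äöüÄÖÜß]/g;
  var patternDateDmy = /^(?:\D+)?(\d{1,2})\.(\d{1,2})\.(\d{2,4})$/;
  var lookupLetters = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ß": "s"
  };

  return function (rawText) {
    var text = rawText.trim();  // stands in for $.trim($(node).text())
    var matches = text.match(patternDateDmy);
    if (matches) {
      // day.month.year -> year-month-day, digits kept as written
      return [matches[3], matches[2], matches[1]].join("-");
    }
    return text.replace(patternLetters, function (match) {
      return lookupLetters[match];
    });
  };
})();
```

For example, `extractSortText("24.12.2009")` gives `"2009-12-24"` and `extractSortText("  Käse ")` gives `"Kase"`.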
Tomalak
Don't know if someone will see my comment, but I need the same function for some accented letters in Portuguese and I can't manage to make it work. Should the letters concerned in my PHP file be written as the 'html code' `&Iacute;` or by typing the 'Í' letter directly? I tried both, nothing works. And yeah, I changed the JS function to suit my needs with the Í and í letters, and my JS file is encoded as UTF-8.
kevin
@kevin: Of course someone notices the comment. ;-) The character in your HTML (which is produced by that PHP file, I presume) can be `&Iacute;` or the actual `Í`. It makes no difference as long as encoding settings are correct (actual PHP file encoding, PHP server perceived file encoding, HTTP Content-Type header, HTML meta tags). Using the HTML entity may be safest. If the .js file is UTF-8 encoded, it must be served as such (`text/javascript; Charset=UTF-8`), then all should be well.
Tomalak
Thanks for noticing ;-), I checked and tried what you said in many ways, it just doesn't work. Could this be due to other JS files being loaded in the same PHP page? If you want to have a look, it's here: http://schulz-al.tempsite.ws/br/?page_id=51 . Thanks for the help, appreciated.
kevin
@kevin: BTW check your references to `sitemap-up.gif` and `sitemap-down.gif`, I get 401 Access Denied for them.
Tomalak
@kevin: Next thing: Your scripts are being served as `Content-Type: text/html` without a `Charset` parameter. They should at least be `Content-Type: text/javascript;`. Also, your `GetTextExtractor()` method (the one in `jquery.tablesorter.min.js`) differs quite heavily from my function, no idea why you think yours could work. ;-) Tip: Put the text extractor into `scripts.js`, not into the tablesorter plugin code. You should not touch the plugin code to avoid future headaches.
Tomalak
Yeah, saw the images problem, solved. About the `Content-Type: text/javascript`, all my scripts are being included this way: `<script type="text/javascript" src="<?php bloginfo('template_url'); ?>/js/jquery.tablesorter.min.js" charset="utf-8"></script>`, so I don't understand what you mean. I just did what you suggested, copied your JS code back to a new scripts.js file and added my "Í" and "í" letters, still nothing. I'm going crazy and I feel dumb. Thanks a lot for the help anyway.
kevin
@kevin: I'm sorry to say that there is reason to feel stupid. ;-) You have copied my code `$("table.sortable").tablesorter(…);`, but *your table* is actually `$("table.tablesorter")`. Also, there is no need to call `tablesorter()` a second time. Once you make that change, it is going to work - I just tested via FireBug.
Tomalak
Omg, that was pretty dumb indeed... Thanks a lot for the help Tomalak, really appreciated, it's working great now. I also had to call the character extraction `{ textExtraction: GetTextExtractor() }` after the zebra widget call `$.tablesorter.defaults.widgets = ['zebra'];` to get it all together. Thanks again!
kevin
@kevin: Glad to hear it worked after all. :) P.S.: I would appreciate an up-vote in return. ;)
Tomalak
Sure, how can i do that ? (is it clicking on the flag icon "this is a great comment" ?)
kevin
@kevin: No, it's by clicking on the voting buttons on the top left side of the answer. ;) Comments can be voted on, too, but only votes for questions or answers generate reputation, which is the primary currency of this site. ;)
Tomalak
There it goes ;)
kevin