ansaurus

Question

How to convert characters to HTML entities using plain JavaScript

Answer 1

+1 A:

You need a JavaScript equivalent of PHP's htmlentities with the HTML translation table. That is basically a large table mapping special characters (their char codes, to be precise) to their entities, plus a replacer function.

n1313 2009-08-30 15:19:25

This is only needed if you want the entities escaped by name. Numeric character references, in either decimal or hex, are much faster to generate and less prone to issues with extended Unicode characters.

richardtallent 2009-09-01 02:52:42
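
A minimal sketch of the numeric approach mentioned in the comment above. The function name toNumericEntities and its useHex parameter are made up for illustration. It assumes an ES2015+ engine (for...of and codePointAt), and it only touches non-ASCII characters, so &, < and > would still need separate handling.

```javascript
// Hypothetical helper: turn every non-ASCII character into a numeric
// character reference, decimal by default or hex when useHex is true.
function toNumericEntities(str, useHex) {
  var out = '';
  // for...of walks the string by code point, so surrogate pairs stay together.
  for (var ch of str) {
    var cp = ch.codePointAt(0);
    if (cp > 127) {
      out += useHex
        ? '&#x' + cp.toString(16).toUpperCase() + ';'
        : '&#' + cp + ';';
    } else {
      out += ch;
    }
  }
  return out;
}
```

Both forms refer to the same character: toNumericEntities("Ü") gives &#220; and toNumericEntities("Ü", true) gives &#xDC;.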

Answer 2

+3 A:

Having a lookup table with a bazillion replace() calls is slow and not maintainable.

Fortunately, the built-in escape() function also encodes most of the same characters and puts them in a consistent format (%XX, where XX is the hex value of the character).

So you can let escape() do most of the work and just change its output to HTML entities instead of URL-escaped characters:

htmlescaped = escape(mystring).replace(/%(..)/g,"&#x$1;");

This uses the hex format for escaping values rather than the named entities, but for storing and displaying the values, it works just as well as named entities.

Of course, escape also escapes characters you don't need to escape in HTML (spaces, for instance), but you can unescape them with a few replace calls.

Edit: I like bucabay's answer better than my own... handles a larger range of characters, and requires no hacking afterward to get spaces, slashes, etc. unescaped.

richardtallent 2009-08-30 15:23:42
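
The clean-up step this answer mentions ("a few replace calls") might look like the sketch below. The name escapeToHexEntities and the choice of which characters to restore are assumptions, not part of the answer. It also assumes the input contains nothing above U+00FF, because escape() (a legacy function) switches to a %uXXXX format beyond that.

```javascript
// Hypothetical helper built on the answer's one-liner.
// Assumes every character is at or below U+00FF.
function escapeToHexEntities(str) {
  return escape(str)
    .replace(/%(..)/g, '&#x$1;')   // %XX -> &#xXX;
    .replace(/&#x20;/g, ' ')       // restore spaces
    .replace(/&#x2C;/g, ',')       // restore commas
    .replace(/&#x3A;/g, ':');      // restore colons
}
```

Each character that should stay readable needs its own replace call, which is the maintenance cost the answer's edit alludes to.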

Answer 3

A:

The following function (posted elsewhere) might answer your needs. If something is missing it will be easy to add to the data set.

dimus 2009-08-30 15:32:02

Answer 4

A:

I adapted one of the answers from the referenced question, but added the ability to define an explicit mapping for character names.

var char_names = {
  160: 'nbsp',
  161: 'iexcl',
  220: 'Uuml',
  223: 'szlig',
  196: 'Auml',
  252: 'uuml'
};

function HTMLEncode(str) {
  var aStr = str.split(''),
      i = aStr.length,
      aRet = [];

  // Walk the string backwards; the result is reversed again at the end.
  while (--i >= 0) {
    var iC = aStr[i].charCodeAt(0);
    // Encode everything except space, A-Z, a-z and the codes 123-127.
    if (iC < 32 || (iC > 32 && iC < 65) || iC > 127 || (iC > 90 && iC < 97)) {
      if (char_names[iC] != undefined) {
        aRet.push('&' + char_names[iC] + ';');
      } else {
        aRet.push('&#' + iC + ';');
      }
    } else {
      aRet.push(aStr[i]);
    }
  }
  return aRet.reverse().join('');
}

var text = "Übergroße Äpfel mit Würmer";

alert(HTMLEncode(text));

Jason R. Coombs 2009-08-30 15:43:46

Answer 5

+1 A:

You can use:

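function encodeHTML(str) {
  var aStr = str.split(''),
      i = aStr.length,
      aRet = [];

  // >= 0 so that the first character (index 0) is encoded too;
  // a bare while (--i) stops before it and fails on an empty string.
  while (--i >= 0) {
    var iC = aStr[i].charCodeAt(0);
    if (iC < 65 || iC > 127 || (iC > 90 && iC < 97)) {
      aRet.push('&#' + iC + ';');
    } else {
      aRet.push(aStr[i]);
    }
  }
  return aRet.reverse().join('');
}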

This function HTML-encodes every character outside A-Z and a-z, except for the codes 123 to 127 ({, |, }, ~ and DEL), which the range test lets through.

KooiInc 2009-08-30 18:10:13

This should be a lot faster than using text.replace(). I love how while(--i) is used instead of a for() loop. I'm assuming the theory is that for large texts/loops the fast condition test offsets the Array.reverse().join('') outside the loop. Otherwise you would have just used string concatenation?

bucabay 2009-09-01 19:57:06

Thanks. Decrementing loops are indeed faster than incrementing ones; it's an optimization I read about a long time ago and use most of the time (it's also less code). I'm not sure whether the reverse() call cancels out the speed gain. For shorter strings it may still be faster to use aRet[i] = [value] instead of aRet.push (as is very well explained by "olliej" in http://stackoverflow.com/questions/614126/why-is-array-push-sometimes-faster-than-arrayn-value).

KooiInc 2009-09-02 19:35:27
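
For comparison, here is a sketch of the indexed-assignment variant discussed in the comments above, written with a forward loop so that no reverse() is needed. The name encodeForward is made up, the encoding rule is copied from Answer 5, and no claim is made here about which version is faster; that would have to be measured in the target engine.

```javascript
// Hypothetical variant of Answer 5's function: forward loop,
// out[i] = value instead of push(), and no reverse() at the end.
function encodeForward(str) {
  var out = new Array(str.length);
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // Same range test as Answer 5.
    if (code < 65 || code > 127 || (code > 90 && code < 97)) {
      out[i] = '&#' + code + ';';
    } else {
      out[i] = str.charAt(i);
    }
  }
  return out.join('');
}
```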

Answer 6

+5 A:

Using escape() should work for character codes in the range 0x00 to 0xFF (the Latin-1 range).

If you go beyond 0xFF (255), for example to 0x100 (256), escape() switches to a %uXXXX format and the replacement above no longer works:

escape("\u0100"); // %u0100

and:

text = "\u0100"; // Ā
html = escape(text).replace(/%(..)/g,"&#x$1;"); // &#xu0;100

So, if you want to cover all the Unicode characters defined at http://www.w3.org/TR/html4/sgml/entities.html , you could use something like:

var html = text.replace(/[\u00A0-\u00FF]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

Note that the range here is \u00A0-\u00FF.

That's the first character code range defined in http://www.w3.org/TR/html4/sgml/entities.html , which is the same as what escape() covers.

You'll need to add the other ranges you want to cover as well, or all of them.

Example: Latin-1 range plus general punctuation (\u00A0-\u00FF and \u2022-\u2135)

var html = text.replace(/[\u00A0-\u00FF\u2022-\u2135]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

Edit:

BTW: \u00A0-\u2666 should blindly convert every character from U+00A0 up to U+2666 (the highest code point in the HTML 4 entity list) to a numeric entity:

var html = text.replace(/[\u00A0-\u2666]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

bucabay 2009-08-30 18:10:40

Very good point, bucabay... I was handling the simplest case of UTF8 with a quick hack, but this is definitely a more robust solution. Great use of a passed function for handling RegEx replacement, I forgot about being able to do that. Upvoted. Needs a quick fix, however, to add ampersand, less-than, and greater-than to the character range so it can completely replace my code.

richardtallent 2009-09-01 02:48:55
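
The fix requested in the comment above (adding ampersand, less-than and greater-than to the range) could be sketched as follows. The name encodeEntities is made up, and two things go beyond what the thread proposes: quotes are included, and the ES2015 u flag is used so that characters above U+FFFF are matched as one code point instead of two surrogate halves.

```javascript
// Hypothetical helper: numeric entities for markup characters, quotes
// and everything from U+0080 up to U+10FFFF.
function encodeEntities(text) {
  // The u flag makes the regex match whole code points.
  return text.replace(/[&<>"'\u0080-\u{10FFFF}]/gu, function (c) {
    return '&#' + c.codePointAt(0) + ';';
  });
}
```

For example, encodeEntities("<b>Ā</b>") returns &#60;b&#62;&#256;&#60;/b&#62;.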

Answer 7

+5 A:

With the help of bucabay and the advice to create my own function, I created this one, which works for me. What do you guys think, is there a better solution somewhere?

if(typeof escapeHtmlEntities == 'undefined') {
  escapeHtmlEntities = function (text) {
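    return text.replace(/[\u00A0-\u2666<>\&]/g, function(c) {
      // Parentheses matter: without them the || fallback never runs and named entities lose their ';'.
      return '&' + (escapeHtmlEntities.entityTable[c.charCodeAt(0)] || '#' + c.charCodeAt(0)) + ';';
    });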
  };

  // all HTML4 entities as defined here: http://www.w3.org/TR/html4/sgml/entities.html
  // added: amp, lt, gt, quot and apos
  escapeHtmlEntities.entityTable = { 34 : 'quot', 38 : 'amp', 39 : 'apos', 60 : 'lt', 62 : 'gt', 160 : 'nbsp', 161 : 'iexcl', 162 : 'cent', 163 : 'pound', 164 : 'curren', 165 : 'yen', 166 : 'brvbar', 167 : 'sect', 168 : 'uml', 169 : 'copy', 170 : 'ordf', 171 : 'laquo', 172 : 'not', 173 : 'shy', 174 : 'reg', 175 : 'macr', 176 : 'deg', 177 : 'plusmn', 178 : 'sup2', 179 : 'sup3', 180 : 'acute', 181 : 'micro', 182 : 'para', 183 : 'middot', 184 : 'cedil', 185 : 'sup1', 186 : 'ordm', 187 : 'raquo', 188 : 'frac14', 189 : 'frac12', 190 : 'frac34', 191 : 'iquest', 192 : 'Agrave', 193 : 'Aacute', 194 : 'Acirc', 195 : 'Atilde', 196 : 'Auml', 197 : 'Aring', 198 : 'AElig', 199 : 'Ccedil', 200 : 'Egrave', 201 : 'Eacute', 202 : 'Ecirc', 203 : 'Euml', 204 : 'Igrave', 205 : 'Iacute', 206 : 'Icirc', 207 : 'Iuml', 208 : 'ETH', 209 : 'Ntilde', 210 : 'Ograve', 211 : 'Oacute', 212 : 'Ocirc', 213 : 'Otilde', 214 : 'Ouml', 215 : 'times', 216 : 'Oslash', 217 : 'Ugrave', 218 : 'Uacute', 219 : 'Ucirc', 220 : 'Uuml', 221 : 'Yacute', 222 : 'THORN', 223 : 'szlig', 224 : 'agrave', 225 : 'aacute', 226 : 'acirc', 227 : 'atilde', 228 : 'auml', 229 : 'aring', 230 : 'aelig', 231 : 'ccedil', 232 : 'egrave', 233 : 'eacute', 234 : 'ecirc', 235 : 'euml', 236 : 'igrave', 237 : 'iacute', 238 : 'icirc', 239 : 'iuml', 240 : 'eth', 241 : 'ntilde', 242 : 'ograve', 243 : 'oacute', 244 : 'ocirc', 245 : 'otilde', 246 : 'ouml', 247 : 'divide', 248 : 'oslash', 249 : 'ugrave', 250 : 'uacute', 251 : 'ucirc', 252 : 'uuml', 253 : 'yacute', 254 : 'thorn', 255 : 'yuml', 402 : 'fnof', 913 : 'Alpha', 914 : 'Beta', 915 : 'Gamma', 916 : 'Delta', 917 : 'Epsilon', 918 : 'Zeta', 919 : 'Eta', 920 : 'Theta', 921 : 'Iota', 922 : 'Kappa', 923 : 'Lambda', 924 : 'Mu', 925 : 'Nu', 926 : 'Xi', 927 : 'Omicron', 928 : 'Pi', 929 : 'Rho', 931 : 'Sigma', 932 : 'Tau', 933 : 'Upsilon', 934 : 'Phi', 935 : 'Chi', 936 : 'Psi', 937 : 'Omega', 945 : 'alpha', 946 : 'beta', 947 : 'gamma', 948 : 'delta', 949 : 'epsilon', 950 : 'zeta', 951 : 'eta', 
952 : 'theta', 953 : 'iota', 954 : 'kappa', 955 : 'lambda', 956 : 'mu', 957 : 'nu', 958 : 'xi', 959 : 'omicron', 960 : 'pi', 961 : 'rho', 962 : 'sigmaf', 963 : 'sigma', 964 : 'tau', 965 : 'upsilon', 966 : 'phi', 967 : 'chi', 968 : 'psi', 969 : 'omega', 977 : 'thetasym', 978 : 'upsih', 982 : 'piv', 8226 : 'bull', 8230 : 'hellip', 8242 : 'prime', 8243 : 'Prime', 8254 : 'oline', 8260 : 'frasl', 8472 : 'weierp', 8465 : 'image', 8476 : 'real', 8482 : 'trade', 8501 : 'alefsym', 8592 : 'larr', 8593 : 'uarr', 8594 : 'rarr', 8595 : 'darr', 8596 : 'harr', 8629 : 'crarr', 8656 : 'lArr', 8657 : 'uArr', 8658 : 'rArr', 8659 : 'dArr', 8660 : 'hArr', 8704 : 'forall', 8706 : 'part', 8707 : 'exist', 8709 : 'empty', 8711 : 'nabla', 8712 : 'isin', 8713 : 'notin', 8715 : 'ni', 8719 : 'prod', 8721 : 'sum', 8722 : 'minus', 8727 : 'lowast', 8730 : 'radic', 8733 : 'prop', 8734 : 'infin', 8736 : 'ang', 8743 : 'and', 8744 : 'or', 8745 : 'cap', 8746 : 'cup', 8747 : 'int', 8756 : 'there4', 8764 : 'sim', 8773 : 'cong', 8776 : 'asymp', 8800 : 'ne', 8801 : 'equiv', 8804 : 'le', 8805 : 'ge', 8834 : 'sub', 8835 : 'sup', 8836 : 'nsub', 8838 : 'sube', 8839 : 'supe', 8853 : 'oplus', 8855 : 'otimes', 8869 : 'perp', 8901 : 'sdot', 8968 : 'lceil', 8969 : 'rceil', 8970 : 'lfloor', 8971 : 'rfloor', 9001 : 'lang', 9002 : 'rang', 9674 : 'loz', 9824 : 'spades', 9827 : 'clubs', 9829 : 'hearts', 9830 : 'diams', 34 : 'quot', 38 : 'amp', 60 : 'lt', 62 : 'gt', 338 : 'OElig', 339 : 'oelig', 352 : 'Scaron', 353 : 'scaron', 376 : 'Yuml', 710 : 'circ', 732 : 'tilde', 8194 : 'ensp', 8195 : 'emsp', 8201 : 'thinsp', 8204 : 'zwnj', 8205 : 'zwj', 8206 : 'lrm', 8207 : 'rlm', 8211 : 'ndash', 8212 : 'mdash', 8216 : 'lsquo', 8217 : 'rsquo', 8218 : 'sbquo', 8220 : 'ldquo', 8221 : 'rdquo', 8222 : 'bdquo', 8224 : 'dagger', 8225 : 'Dagger', 8240 : 'permil', 8249 : 'lsaquo', 8250 : 'rsaquo', 8364 : 'euro' };
}

usage example:

var text = "Übergroße Äpfel mit Würmern";
alert(escapeHtmlEntities (text));

result:

&Uuml;bergro&szlig;e &Auml;pfel mit W&uuml;rmern

Update1: Thanks bucabay again for the || - hint
Update2: Updated entity table with amp,lt,gt,apos,quot, thanks richardtallent for the hint

Chris 2009-08-30 19:31:26

Looks good. I'd go for: `escapeHtmlEntities.entityTable[c.charCodeAt(0)] || '#'+c.charCodeAt(0)` so you can catch those char codes not in entityTable.

bucabay 2009-08-30 21:24:23

This is a great solution, good balance of capturing all extended Unicode characters but still providing named entities for the most common ones. You should probably add amp, gt, and lt to the entityTable. One small caveat: some older browsers may not support all of the named entities you have in that dictionary.

richardtallent 2009-09-01 02:56:59
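
The || fallback suggested in the comments depends on how the expression is grouped, because + binds more tightly than ||. The sketch below shows both groupings side by side. The three-entry table and the function names are made up for the demonstration.

```javascript
// Hypothetical three-entry table, just for the demonstration.
var demoTable = { 38: 'amp', 60: 'lt', 62: 'gt' };

// Grouped: look up the name, fall back to '#code', then wrap in & and ;.
function entityFor(ch) {
  var code = ch.charCodeAt(0);
  return '&' + (demoTable[code] || '#' + code) + ';';
}

// Ungrouped: parsed as ('&' + demoTable[code]) || ('#' + code + ';').
// The left side is always a non-empty string, so the fallback never runs.
function entityForUngrouped(ch) {
  var code = ch.charCodeAt(0);
  return '&' + demoTable[code] || '#' + code + ';';
}
```

With this table, entityFor("<") gives &lt; and entityFor("Ā") gives &#256;, while entityForUngrouped returns &lt (no semicolon) and &undefined for the same inputs.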
