I am trying to split the following sentence into words and wrap each word in a span.

<p class="german_p big">Das ist ein schönes Armband</p>

I followed this: http://stackoverflow.com/questions/2444430/how-to-get-a-word-under-cursor-using-javascript

$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});

The only problem I am facing is that, after wrapping the words in spans, the resulting HTML looks like this:

<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>

So schönes is broken into three pieces: sch, ö and nes. Why is this happening? What would be the correct regex for this?

+6  A: 

\w only matches A-Z, a-z, 0-9, and _ (underscore).

You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.

Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
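A minimal sketch of the \S+ idea, written as a plain string function so it can be tried without jQuery or a DOM. The function name wrapNonSpaceRuns is made up for illustration. Note that trailing punctuation ends up inside the span, because a period is also a non-space character.

```javascript
// Hypothetical helper: wraps every run of non-whitespace characters in a span.
// "$&" in the replacement string stands for the whole match.
function wrapNonSpaceRuns(text) {
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(wrapNonSpaceRuns("Das ist ein schönes Armband."));
// The last span contains "Armband." including the period.
```

In the original snippet this would replace the regex inside $this.html(...).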

kijin
That would end up being `$this.text().replace(/\b(\S+)\b/g, "<span>$1</span>")`
TheAdamGaskins
Note: Unlike `\w+`, `\S+` will also match periods, commas, etc. at the end of words. So if you parsed this comment with this regex, the first match will be "Note:" not "Note". You'll need to tweak your regex or perform additional checks if this is not what you want.
kijin
+5  A: 

\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.
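If splitting on whitespace is acceptable, one way to do it is sketched below. It splits on /(\s+)/ so the whitespace itself is kept (a capture group in split() returns the separators too), wraps only the non-whitespace tokens, and joins everything back. The name wrapWordsBySplitting is just for this example.

```javascript
// Hypothetical helper: split on whitespace, keep the separators, wrap the rest.
function wrapWordsBySplitting(text) {
  return text
    .split(/(\s+)/) // the capture group keeps the whitespace runs in the result
    .map(function (token) {
      // Whitespace runs and empty strings are passed through untouched.
      return /\S/.test(token) ? "<span>" + token + "</span>" : token;
    })
    .join("");
}

console.log(wrapWordsBySplitting("Das ist ein schönes Armband"));
```

Because \s also matches non-ASCII spaces such as the no-break space (U+00A0), words separated by those are split as well.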

Wooble
+1  A: 

As others note, the \w shortcut is not very useful for non-ASCII characters. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.

* could be hex or octal or unicode; you'll often see these collectively referred to as hex notation.
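As a rough illustration of the range approach, the sketch below adds the accented letters of the Latin-1 block (U+00C0 to U+00FF, skipping the × and ÷ signs at U+00D7 and U+00F7) to \w using \u escapes. That range is my assumption of what is "appropriate" for German text; other languages need other ranges. The second function is an alternative that only works in engines supporting ES2018 Unicode property escapes (\p{L} with the u flag). Both function names are made up for the example.

```javascript
// Assumption: Western European text, so the Latin-1 letter ranges are enough.
const LATIN1_WORD = /[\w\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/g;

function wrapLatin1Words(text) {
  return text.replace(LATIN1_WORD, "<span>$&</span>");
}

// Alternative for ES2018+ engines: any Unicode letter or number, plus underscore.
const UNICODE_WORD = /[\p{L}\p{N}_]+/gu;

function wrapUnicodeWords(text) {
  return text.replace(UNICODE_WORD, "<span>$&</span>");
}

console.log(wrapLatin1Words("Das ist ein schönes Armband."));
// The trailing period stays outside the last span.
```

Neither pattern uses \b, so a word that starts or ends with a non-ASCII letter is still wrapped as a whole.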

annakata
+2  A: 

You can also use

/\b([äöüÄÖÜß\w]+)\b/g

instead of

/\b(\w+)\b/g

in order to handle the umlauts
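One caveat, sketched below: \b is still ASCII-only, so with the pattern above a word that starts or ends with an umlaut or ß (über, Fuß) is still cut at that letter. Dropping the \b anchors and relying on the greedy character class avoids that. The function names are hypothetical, for comparison only.

```javascript
// Pattern from the answer above: the \b anchors only see ASCII word characters.
function wrapWithBoundaries(text) {
  return text.replace(/\b([äöüÄÖÜß\w]+)\b/g, "<span>$1</span>");
}

// Same character class without \b; the greedy + already grabs whole words.
function wrapWithoutBoundaries(text) {
  return text.replace(/[äöüÄÖÜß\w]+/g, "<span>$&</span>");
}

console.log(wrapWithBoundaries("über"));    // ü<span>ber</span>
console.log(wrapWithoutBoundaries("über")); // <span>über</span>
```

For the sentence in the question both versions give the same result, since the ö in schönes sits in the middle of the word.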

XViD
+1  A: 
tchrist