My string contains a lot of HTML entities, like this:

&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;

And I want to split it on the HTML entities into this:

Hello
everybody
there

Can anybody suggest a way to do this, please? Maybe using a regex?

+1  A: 

It looks like you can just split on the regex &[^;]*;. That is, the delimiters are strings that start with &, end with ;, and in between can contain anything but ;.

If you can have multiple delimiters in a row, and you don't want the empty strings between them, just use (&[^;]*;)+ (or in general (delim)+ pattern).

If you can have delimiters at the beginning or end of the string, and you don't want the empty strings caused by them, then just trim them away before you split.


Example

Here's a snippet to demonstrate the above ideas (see also on ideone.com):

var s = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";

print (s.split(/&[^;]*;/));
// ,Hello,,everybody,,there,

print (s.split(/(?:&[^;]*;)+/));
// ,Hello,everybody,there,

print (
   s.replace(/^(?:&[^;]*;)+/, "")
    .replace(/(?:&[^;]*;)+$/, "")
    .split(/(?:&[^;]*;)+/)
);
// Hello,everybody,there
polygenelubricants
The snippet is nice. I can see that the split function has its own limitation: it doesn't return the delimiters along with the split strings. There is a workaround of adding a pair of parentheses () to the regex, but that doesn't work in IE. Do you have any idea about this?
@user: do you need something like this perhaps? http://ideone.com/lfVK4 I'm not sure what you want.
polygenelubricants
This is what I'm looking for. I tried it on IE 6, IE 8 and FF 3 and it works fine. Thank you and everyone else so much for your help :)
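For the delimiter-keeping variant raised in the comments above, one possible approach that does not depend on split() handling capture groups is to walk the matches with exec(). This is only a sketch, not the code behind the ideone link (its contents are not shown here); the helper name splitKeepingEntities is made up for illustration.

```javascript
// Hypothetical helper: split `input` on runs of entity-like tokens and keep
// the delimiter runs in the output, without using capture groups in split().
function splitKeepingEntities(input) {
  var pattern = /(?:&[^;]*;)+/g; // same delimiter pattern as in the answer
  var pieces = [];
  var last = 0;
  var match;
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > last) {
      pieces.push(input.slice(last, match.index)); // text before the delimiter
    }
    pieces.push(match[0]); // the delimiter run itself
    last = match.index + match[0].length;
  }
  if (last < input.length) {
    pieces.push(input.slice(last)); // trailing text, if any
  }
  return pieces;
}

var keptParts = splitKeepingEntities(
  "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;"
);
// ["&quot;", "Hello", "&nbsp;&lt;", "everybody", "&gt;&nbsp;", "there", "&quot;"]
```

Because every match is at least two characters long (& and ;), the exec() loop always advances and cannot spin on an empty match.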
A: 

var a = str.split(/\&[#a-z0-9]+\;/); should do it, although you'll end up with empty slots in the array when you have two entities next to each other.

T.J. Crowder
It's such a simple but good solution. However, a valid entity may contain at most 10 characters, including the ampersand and the semicolon (the longest entity I have ever seen is the named entity &thetasym;). Can you add an option to limit the number of characters, please?
@user: http://www.regular-expressions.info/repeat.html ; a pattern like ` /a{3,5}/ ` would match at least 3 and at most 5 occurrences of the letter `a`. So perhaps what you want, instead of `+` (one or more), is something like `{1,10}`.
polygenelubricants
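As a sketch of the bounded-quantifier idea from the comment above, assuming the 10-character limit is taken at face value: with & and ; counted, that leaves 1 to 8 characters for the entity body. The limit is the commenter's, not a rule of HTML (HTML5 defines longer named entities such as &CounterClockwiseContourIntegral;), and the i flag is an addition so that uppercase names also match.

```javascript
// Assumption: an entity body is 1 to 8 characters of #, letters, or digits,
// so the whole entity (with & and ;) is at most 10 characters long.
var boundedEntity = /&[#a-z0-9]{1,8};/i;

var boundedParts =
  "Hello&nbsp;&thetasym;there&averyveryverylongname;end".split(boundedEntity);
// ["Hello", "", "there&averyveryverylongname;end"]
```

The over-long token is left in the text rather than treated as a delimiter, and the two adjacent entities still produce an empty slot, as noted in the answer.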
A: 
split(/&.*?;(?=[^&]|$)/)

and cut the first and last results:

["", "Hello", "everybody", "there", ""]
Gabor de Mooij
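The "cut the first and last result" step could look something like the following. It is a sketch only: the helper name trimEmptyEnds is hypothetical, and it drops an end element only when that element is actually empty, so input that does not start or end with an entity keeps its first and last words.

```javascript
// Hypothetical helper: remove an empty string from each end of the array,
// but only if one is present there.
function trimEmptyEnds(pieces) {
  var start = 0;
  var end = pieces.length;
  if (end > start && pieces[start] === "") start++;
  if (end > start && pieces[end - 1] === "") end--;
  return pieces.slice(start, end);
}

var rawPieces = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;"
  .split(/&.*?;(?=[^&]|$)/);
var trimmedPieces = trimEmptyEnds(rawPieces);
// ["Hello", "everybody", "there"]
```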
A: 
>> "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;".split(/(?:&[^;]+;)+/)
['', 'Hello', 'everybody', 'there', '']

The regex is: /(?:&[^;]+;)+/

Matches entities as & followed by 1+ non-; characters, followed by a ;. Then matches one or more of those in a row as the split delimiter. The (?:expression) non-capturing syntax is used so that the matched delimiters don't get put into the result array (split() puts capture groups into the result array if they appear in the pattern).

Amber
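To make the capturing versus non-capturing difference concrete, here is a small comparison on a made-up sample string. The results shown are what a standards-compliant engine returns; as mentioned in an earlier comment, old IE versions did not include captured groups in the split() result.

```javascript
var sample = "Hello&nbsp;there";

// Non-capturing group: delimiters are dropped from the result.
var withoutGroup = sample.split(/(?:&[^;]+;)+/);
// ["Hello", "there"]

// Outer capturing group added: each delimiter run is spliced into the result.
var withGroup = sample.split(/((?:&[^;]+;)+)/);
// ["Hello", "&nbsp;", "there"]
```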