I need to build a tokenizer that can segment English words.
Currently, I'm stuck on characters that can be part of a URL.
For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't segment them.
My question is: can this be expressed in a regex? I have the regex
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
from here
but I don't know how to put it all together so that, when these characters appear inside a match of the above expression, no spaces are inserted between them.
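To illustrate what I'm after, here is a rough sketch of one possible approach, assuming Python's `re` module. The idea is to put the URL pattern first in an alternation so a URL is consumed as a single token before the punctuation rule gets a chance to split it. The names `URL`, `TOKEN` and `tokenize`, and the `\w+` / `[^\w\s]` fallbacks for words and punctuation, are my own placeholders and not part of the original regex.

```python
import re

# The URL regex from above, with its three lines concatenated into one pattern.
URL = (
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])"
)

# Alternation order matters: try a whole URL first, then a run of word
# characters, then a single punctuation character (assumed fallback rules).
TOKEN = re.compile(URL + r"|\w+|[^\w\s]", re.IGNORECASE)


def tokenize(text):
    """Return a list of tokens, keeping each URL as one token."""
    # All groups in the pattern are non-capturing, so findall returns
    # the full text of each match.
    return TOKEN.findall(text)


if __name__ == "__main__":
    sample = "See http://example.com/a?b=c: nice!"
    # Joining with spaces gives the segmented form of the input.
    print(" ".join(tokenize(sample)))
```

With this, ':' , '?' and '=' would only become separate tokens when they fall outside a URL match.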
Help!