I am trying to come up with a regex for removing all words that contain non-word characters.

So if a word contains a colon, comma, digit, bracket, etc., remove the whole word from the line, not just the offending character. I have this so far:

$wordline =~ s/\s.*\W.*?\s//g;

It does not have to be perfect, so also removing strings that contain a dash or an apostrophe is OK.

+1  A: 
s/\w*([^\w\s]|\d)+\w* ?//g;
Jacek Szymański
Why not \W instead of ^\w? Just curious if there was a specific reason.
Telemachus
Yes, \W also matches spaces; [^\w\s] does not.
Jacek Szymański
@Telemachus: because he also wants to exclude space characters. \W would include spaces.
runrig
This works, except that it still keeps strings containing numbers.
Brian G
Yeah, I overlooked numbers. Corrected.
Jacek Szymański
Telemachus
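For anyone who wants to try the substitution from this answer (with the digit fix from the comments), here is a minimal sketch. The helper name strip_marked_words and the sample line are made up for illustration and are not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the answer above.
sub strip_marked_words {
    my ($line) = @_;
    # Optional word characters, then one or more punctuation characters or
    # digits, then optional word characters and at most one trailing space.
    $line =~ s/\w*([^\w\s]|\d)+\w* ?//g;
    return $line;
}

print strip_marked_words("keep this, drop that1 and (those) too"), "\n";
# prints "keep drop and too"
```

Two things to keep in mind: since \w includes the underscore, a token like foo_bar is kept, and when the last token on the line is deleted, the space in front of it stays behind.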
+3  A: 
$wordline = join(" ", grep(/^\w+$/, split(/\s+/, $wordline)));
Glomek
This is what I would do. Note, however, that \w includes the underscore, _, too. If you don't want that, just specify your own character class.
brian d foy
OP also said below he doesn't want digits, so that leaves you with /^[A-Za-z]+$/ (or Unicode-aware equivalent).
Alan Moore
One more caveat: depending on how the tokens are split, Brian G might want to keep the separator characters as they are. Your solution changes every token separator to a single <Space>.
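Putting the comments above together, a rough sketch of two variants: one uses the letters-only class in the split/grep/join approach, the other keeps the original separators by capturing them in split. Both helper names are hypothetical, and the choice of ASCII letters only is an assumption taken from the comments.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only tokens made entirely of ASCII letters and
# rejoin them with single spaces.
sub keep_letter_words {
    my ($line) = @_;
    # split ' ' splits on runs of whitespace and skips leading whitespace.
    return join ' ', grep { /^[A-Za-z]+$/ } split ' ', $line;
}

# Hypothetical variant that leaves the original separators untouched.
sub blank_non_letter_words {
    my ($line) = @_;
    # The capturing group makes split return the separators as well.
    my @parts = split /(\s+)/, $line;
    for my $part (@parts) {
        next if $part =~ /^\s*$/;                   # separator or empty field
        $part = '' unless $part =~ /^[A-Za-z]+$/;   # blank out the token
    }
    return join '', @parts;
}

print keep_letter_words("alpha beta2 gam_ma, delta"), "\n";   # prints "alpha delta"
```

The second variant does not collapse anything, so the separators on both sides of a deleted token end up next to each other.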
+1  A: 
s/(?<!\S)(?![A-Za-z]+(?:\s|$))\S+(?!\S)//g

In regex-land, a "word character" is a letter, a digit, or an underscore ([A-Za-z0-9_]). It sounds like you're using it to mean just letters, so \w and \W won't do you any good. My regex matches:

  • a bunch of non-whitespace characters: \S+

  • that is neither preceded, (?<!\S), nor followed, (?!\S), by a non-whitespace character

  • unless all the characters are letters: (?![A-Za-z]+(?:\s|$))
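
The same pattern can be spread out with the /x modifier so that each piece carries its own comment. This is only a sketch to make the breakdown above easier to follow; the helper name strip_non_letter_words and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern from this answer, written with /x.
sub strip_non_letter_words {
    my ($line) = @_;
    my $non_letter_word = qr{
        (?<!\S)                    # not preceded by a non-whitespace character
        (?! [A-Za-z]+ (?:\s|$) )   # skip tokens that consist only of letters
        \S+                        # a run of non-whitespace characters
        (?!\S)                     # not followed by a non-whitespace character
    }x;
    $line =~ s/$non_letter_word//g;
    return $line;
}

print "[", strip_non_letter_words("one two2 three, four"), "]\n";
# prints "[one   four]"
```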

This will leave behind all the spaces surrounding the words that it deletes. Dealing with those correctly is a little trickier than you might expect; it's much easier to do in a separate step, e.g.:

s/^ +| +(?= |$)//g
Alan Moore
[A-Za-z] doesn't deal with Unicode, you probably want to use [[:alpha:]] instead.
Leon Timmermans
I claim teacher's license. :) With [A-Za-z] it's perfectly clear what you're matching (and what you aren't). BTW, [[:alpha:]] doesn't deal with Unicode either; it's POSIX-speak for "things classified as letters in the locale of the underlying platform."
Alan Moore
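To see the two steps from this answer working together, here is a small sketch that runs the deletion first and the space cleanup second. The helper name remove_non_letter_words and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper chaining the two substitutions from this answer.
sub remove_non_letter_words {
    my ($line) = @_;
    # Step 1: delete every whitespace-delimited token that is not all letters.
    $line =~ s/(?<!\S)(?![A-Za-z]+(?:\s|$))\S+(?!\S)//g;
    # Step 2: drop leading spaces and any space followed by a space or the end.
    $line =~ s/^ +| +(?= |$)//g;
    return $line;
}

print remove_non_letter_words("one two2 three, four 5"), "\n";
# prints "one four"
```

The cleanup step only handles literal spaces, so a line that uses tabs as separators would need the second pattern adjusted.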