I have a file like this:

my line - some words & text
oh lóok i've got some characters

I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:

mylinesomewordstext
ohlóokivegotsomecharacters

I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.

I tried this:

cat file | perl -pe 's/\W//g'

But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in the \W? Or is there some other way?

+5  A: 

This removes characters that don't match \w or \n:

cat file | perl -C -pe 's/[^\w\n]//g'
sth
This drops the accented o in the original text.
Mark Rushakoff
You'll have to add a flag to the command to make it Unicode-aware; -C should do it. http://perldoc.perl.org/perlrun.html#Command-Switches
Dominic Mitchell
Right, changed that.
sth
You don't need the cat either: perl -C -pe 's/[^\w\n]//g' file
ire_and_curses
+1  A: 

The previous answer doesn't output the "ó" character, at least in my case.

sed 's/\W//g' file
dcruz
useless use of cat
camh
true. I saw it too late =/
dcruz
feel free to edit your answer then.
Ether
+4  A: 

@sth's solution uses Perl, which (at least on my system) does not handle Unicode by default, so it loses the accented o character.

On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:

$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters
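
One caveat worth sketching: \W is a GNU sed extension, and whether "ó" counts as a word character depends on the locale sed runs under. The example below assumes GNU sed and forces the C locale to show the failure case; the variable names are made up for the example.

```shell
# Sample line taken from the question.
line="oh lóok i've got some characters"

# In the C locale, sed works byte by byte, so the bytes that encode "ó"
# are treated as non-word characters and deleted.
c_out=$(printf '%s\n' "$line" | LC_ALL=C sed 's/\W//g')
printf '%s\n' "$c_out"

# Under whatever locale the shell currently uses (typically a UTF-8 one),
# the accented o is expected to survive, as in the output above.
printf '%s\n' "$line" | sed 's/\W//g'
```

If the accented character disappears on your system, checking the output of `locale` is a reasonable first step.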
Mark Rushakoff
+1  A: 

Best practices for shell scripting dictate that you should use the tr program for replacing single characters instead of sed, because it's faster and more efficient. Obviously use sed if replacing longer strings.

tr -d '[:blank:][:punct:]' < file

When run with time I get:

real 0m0.003s
user 0m0.000s
sys 0m0.004s

When I run the sed answer (sed -e 's/\W//g' file) with time I get:

real 0m0.003s
user 0m0.004s
sys 0m0.004s

The difference is negligible here (only the user time differs), but you'll notice it when running against larger data sets. Also note that I didn't pipe cat's output into tr and used I/O redirection instead (one less process to spawn).
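
One thing to keep in mind is that the tr character classes are not an exact equivalent of \W: [:punct:] includes the underscore, which \w treats as a word character. A small sketch, using the question's sample text plus a hypothetical third line added to show the underscore case:

```shell
# First two lines are from the question; the third is a made-up example.
input='my line - some words & text
oh lóok i'\''ve got some characters
snake_case 42'

# Delete blanks (space, tab) and punctuation; newlines are left alone
# because they belong to neither class.
tr_out=$(printf '%s\n' "$input" | tr -d '[:blank:][:punct:]')
printf '%s\n' "$tr_out"
```

The underscore in the third line is removed here, whereas the sed and Perl versions based on \W would keep it.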

Sam Bisbee
+2  A: 

In Perl, I'd just add the -l switch, which strips the newline from each input line and re-adds it by appending it to the end of every print():

 perl -ple 's/\W//g' file

Notice that you don't need the cat.

brian d foy