views: 265

answers: 4

I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc.

What is the best way to do this? Should I learn Perl for this, or should I learn some kind of shell commands? The main concern is speed: if I need to write long code for such tasks, it defeats the purpose.

EDIT:

I started learning sed on @Mimisbrunnr's recommendation and could already do what I needed. But it seems people favor awk more, so I will try that. Thanks for all your replies.

+3  A: 

For doing simple stream editing, sed is a great utility that comes standard on most *nix boxes, but for anything much more complex than that I would suggest getting into Perl. The learning curve isn't that bad, and it's great for writing most forms of regular text parsing. A great reference can be found here.
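
For instance, the two tasks from the question might look like this in sed (a sketch; `input.txt` is a placeholder filename, and `-E` for extended regexes is supported by both GNU and BSD sed):

$ sed 's/ *[^ ]*$//' input.txt                    # remove the last token from every line
$ sed -E 's/^ *([^ ]+ +[^ ]+).*/\1/' input.txt    # keep only the first two tokens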

Mimisbrunnr
+6  A: 

Perl and awk come to mind, although Python will do, if you'd rather not learn a new language.

Perl's a general-purpose language; awk's more oriented to text processing of the type you've described.
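
The question's examples are one-liners in awk (a sketch; `input.txt` is a placeholder, and note that decrementing NF to drop a field is a widely supported gawk/mawk extension rather than strict POSIX):

$ awk '{print $1, $2}' input.txt    # print the first two tokens of each line
$ awk '{NF--; print}' input.txt     # drop the last token from each line (gawk/mawk)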

ronys
“Whenever faced with a problem, some people say 'Let's use AWK.' Now, they have two problems.” -- D. Tilbrook ;)
J.F. Sebastian
@J.F, that's just nonsense.
ghostdog74
@ronys, awk is not just for text processing; you can use it as a programming language as well.
ghostdog74
@ghostdog: The quote has survived 20 years (since 1988: http://regex.info/blog/2006-09-15/247). That tells you something. Also note the `;)` at the end :)
J.F. Sebastian
Don't you think it's irrelevant and dated? awk has come a long way since then.
ghostdog74
Can you suggest any good resources for awk?
euphoria83
+1  A: 
#!/usr/bin/env python
# process.py
import fileinput

for line in fileinput.input():  # pass inplace=True here to edit the files in place
    words = line.split()  # split each line on whitespace
    all_except_last = words[:-1]
    print(' '.join(all_except_last))
    # or, to keep only the first two tokens:
    first_two = words[:2]
    print(' '.join(first_two))

Examples:

$ echo a b c | python process.py
$ ./process.py input.txt another.txt
J.F. Sebastian
`perl -lane '$,=" ";pop@F;print@F'` or `perl -lane '$,=" ";print@F[0,1]'`
Hynek -Pichi- Vychodil
@Hynek -Pichi- Vychodil: Try a little experiment: show the Perl and Python versions to somebody who knows neither language and ask them what these scripts do. And I agree that nothing beats Perl one-liners for brevity, except J (for math stuff).
J.F. Sebastian
+1  A: 

*nix tools such as awk/grep/tail/head/sed etc. are good file-processing tools. If you want to search for patterns in files and process them, you can use awk. For big files, you can use a combination of grep + awk: grep for its speed in pattern searching, and awk for its ability to manipulate text. As for sed, awk can usually do whatever sed does, so I find it redundant to use sed for file processing.
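
A grep + awk combination of the kind described might look like this (a sketch; `big.log` and the pattern `ERROR` are placeholders):

$ grep 'ERROR' big.log | awk '{print $1, $2}'    # grep narrows the lines fast, awk reformats them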

In terms of speed of processing files, awk is often on par with, and sometimes better than, Perl or other languages.

Also, two very good tools for getting the front and back portions of a file fast are head and tail. So to get the last lines, you can use tail.
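
For example (file names are placeholders):

$ head -n 10 input.txt    # first 10 lines
$ tail -n 5 input.txt     # last 5 lines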

ghostdog74
I assume that by "tokens" the OP means items on a line, not lines of the file, so `tail` would not be applicable to that case. `cut`, on the other hand...
Dave Sherohman
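A sketch of the `cut` approach mentioned above (`input.txt` is a placeholder; cut takes a single-character delimiter, so this assumes tokens separated by single spaces):

$ cut -d' ' -f1,2 input.txt    # first two fields of each line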