I'd strongly recommend that you do most simple text parsing with a combination of sed, awk, and bash.
If your needs extend beyond the capabilities of these dedicated text-processing tools, pick a scripting language you are comfortable with. Ruby or Python suit most people, but don't be dismissive of Perl: it was originally designed to process text, and it does so quickly and powerfully; the CPAN library is (literally) awesome too.
A great deal of text processing can be done with a simple Bash script, e.g.:
cat file | while read a b c; do
    # process ... likely: echo "${a//search/replace} $b $c"  # etc...
done
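For instance, here's a minimal runnable instance of that loop (the sample input and the a-to-A replacement are just illustrative assumptions):

printf 'apple red fruit\nbanana yellow fruit\n' |
while read a b c; do
    echo "${a/a/A} $b $c"   # replace the first 'a' in the first field
done
# Apple red fruit
# bAnana yellow fruit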
However, you should probably post some examples of the types of text parsing problems you generally face, to get truly useful answers.
Update: for the CSV use case.
Assuming a bash shell (zsh, for example, will do this differently), from the command line, without creating a shell script.
Let's assume, for this example, that file.csv looks like this...
john doe, 2010-09-20, male, 090-555-1234
jane doe, 2010-09-30, female, 080-555-4321
so:
cat file.csv | while IFS=, read name date sex number; do echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n"; done
would produce:
name: john doe
date: 2010-09-20
sex: male
number: 090-555-1234
name: jane doe
date: 2010-09-30
sex: female
number: 080-555-4321
Let's break that single line up...
cat file.csv |
while IFS=, read name date sex number;  # use IFS to split the incoming stream into comma-separated sets of 4 parameters:
                                        # name, date, sex, number.
do
    # access the parameters (safely!) inside quotes with the ${param} syntax.
    echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n";
done;
Bash has a fairly rich parameter expansion syntax (@see http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion) that will let you do search/replace and a variety of other simple operations on the fields in your CSV records.
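For example, here are a few expansions that cover common per-field edits (a quick sketch; the values assigned in the comment are just the sample data from the loop above):

# with name="john doe" and date="2010-09-20" from the loop above:
echo "${name/doe/smith}"   # replace first match:  john smith
echo "${name//o/0}"        # replace all matches:  j0hn d0e
echo "${date:0:4}"         # substring (the year): 2010
echo "${name^^}"           # uppercase (bash 4+):  JOHN DOE
echo "${#name}"            # string length:        8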
Problems will need to be fairly complex before you need to use a scripting language. For example, grep can filter results (or, more often, the incoming file before processing), sort and uniq will do common sorting and de-duping, and tr can do things like remove or squeeze whitespace or replace specific chars.
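For example, a small sketch combining them on the same file.csv (filtering on the sex field is just an assumed goal here):

# keep records whose sex field is 'male', squeeze repeated spaces,
# then sort and drop adjacent duplicates:
grep ', male,' file.csv | tr -s ' ' | sort | uniq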
The power of the basic Unix commands to process text should be understood; they will save you many hours of time trying to do things which have been done many times before. Once you know how each tool works, you can quickly utilize Unix tools and their pipelines to solve most problems.
Additional note
I often use Bash to process text directly from the clipboard (I'm assuming Cygwin in your case), so cat file.csv would be replaced by cat /dev/clipboard in the example one-liner.
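That is, assuming Cygwin's /dev/clipboard device, the one-liner becomes:

cat /dev/clipboard | while IFS=, read name date sex number; do
    echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n"
done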
While and read
read in the example is not a specific parameter of the while command; it simply reads input from the incoming pipe and allows you to split it into arbitrary parameters. See more info here: http://ss64.com/bash/read.html
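A quick illustration (note that the last variable swallows the remainder of the line):

echo "one two three four five" | while read first second rest; do
    echo "first=${first} second=${second} rest=${rest}"
done
# prints: first=one second=two rest=three four five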
Cut
In cases where you have text delimited at specific columns, you can use the Unix cut command to split the incoming line at the required column numbers. @see http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?cut+1
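For example, a brief sketch against the same file.csv:

cut -d, -f2 file.csv   # print the 2nd comma-delimited field (the date)
cut -c1-8 file.csv     # print characters 1-8 of each line (fixed-width columns)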
Problems with CSV format.
Since the CSV format is a fairly loose specification, it's worth noting Toad's point about CSVs whose records include delimiter chars or embedded newlines.
In these cases, Bash is inadequate on its own, and a scripting language with a good CSV library is a better option. But don't forget that you can simplify your script to just process the input and produce a suitable output, which you can then sort/grep etc. The choice is yours, of course, but beware of reinventing the wheel; finding the right tool for a specific problem comes with experience, and also depends on your preferred runtime conditions.
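To see why, here's a hypothetical record with a quoted, comma-containing name; the naive IFS splitting from the earlier loop mangles it:

echo '"doe, john", 2010-09-20, male, 090-555-1234' |
while IFS=, read name date sex number; do
    # the comma inside the quotes is treated as a field separator,
    # so every field after it is shifted one place to the right:
    echo "name=${name} date=${date}"
done
# prints: name="doe date= john"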
Good CSV libraries for popular scripting languages.