I frequently find myself writing small programs to parse text files (typically CSV files) and find that I always fall back on writing these parsers in Java simply because that's the language I'm most familiar with.

But since this sort of task keeps coming up, what language would people recommend I use for it? I'm happy to invest the time in learning a new language if it would save me time in the long run whenever I write this sort of thing.

I'm not concerned with how quickly the language itself can run/parse files. What I want is a language that makes writing ad hoc parsers easy/quick. I suspect Java is not the most efficient language for me to be doing this kind of thing in.

Python? Ruby? Something else? (Please, don't let it be Perl)

A: 

I typically use Python for text parsing. It has a full set of useful string manipulation features, regular expressions, simple file I/O, and it's very quick to whip up a program for what I need. There's a pretty nice tutorial, too.
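For a sense of how little ceremony that takes, here is a minimal sketch of the kind of ad hoc parser in question (the file name, field layout, and date filter are all invented for the example):

    # Minimal sketch: read a comma-separated file, trim the fields,
    # and keep rows whose second field looks like an ISO date.
    # "data.csv" and its layout are hypothetical.
    import re

    with open("data.csv") as f:
        for line in f:
            fields = [s.strip() for s in line.split(",")]
            if len(fields) > 1 and re.match(r"\d{4}-\d{2}-\d{2}$", fields[1]):
                print(fields)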

I think everyone can agree that it is certainly not Malbolge.

JoshD
+1  A: 

I have worked with both Java and Python, and Python is by far the easier of the two for reading through files and writing output. Python also has PLY (Python Lex-Yacc), which makes it easy to parse even complicated files.
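To give a flavour of it, here is a minimal PLY lexer sketch; the token names and sample input are made up for illustration:

    # Minimal PLY lexer sketch (pip install ply). Token rules are illustrative.
    import ply.lex as lex

    tokens = ("NUMBER", "WORD", "COMMA")

    t_COMMA = r","
    t_WORD = r"[A-Za-z]+"
    t_ignore = " \t"            # characters to skip between tokens

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)  # convert the matched text to an int
        return t

    def t_error(t):
        t.lexer.skip(1)         # skip anything the rules don't cover

    lexer = lex.lex()
    lexer.input("john doe, 42")
    for tok in lexer:
        print(tok.type, tok.value)

PLY also provides a yacc module for building full grammars on top of a lexer like this.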

Kartech
+1  A: 

If you're willing to check out a new (to you!) language, you might want to give SNOBOL a try.

It might be overkill for simple CSV files and the like, but since its pattern matching paradigm is closer to context-free grammars than regular expressions, parsers tend to be simple and transparent. Contrast this to extended regular expressions, which can be coerced into parsing some context-free constructs, but can quickly explode into incomprehensible tangles of metacharacters and multiple levels of quoting and escaping, more closely resembling line noise than code.

Jim Lewis
Ok, I wouldn't recommend SNOBOL to anyone but... +1 for mentioning it. Haven't programmed in it for 15 years or so :)
RHSeeger
@RHSeeger: What can I say..I was exposed to it at an impressionable age. :-) Too bad it never really caught on...it was one of my more enjoyable language learning experiences, which is why it stuck with me, I guess.
Jim Lewis
+1  A: 

The best language is clearly HyperTalk. Example:

put newStr into character pos to (pos + the length of pattern) - 1 of inStr
TokenMacGuy
+1 clearly the best!
ergosys
+4  A: 

I'd strongly recommend that you do most simple text parsing with a combination of sed, awk, and bash.

If your needs extend beyond the capabilities of these dedicated text-processing tools, find the scripting language you are most comfortable with. Ruby or Python suit most people, but don't be dismissive of Perl: it was originally designed to process text, does it quickly and powerfully, and the CPAN library is (literally) awesome too.

A great deal of text processing can be done with a simple Bash script, e.g.:

cat file | while read a b c; do 
    #process ...likely: echo "${a//search/replace} $b $c"; # etc...
done

However, you should probably post some examples of the types of text parsing problems you generally face, to get truly useful answers.

Update: for the CSV use case.

Assuming a bash shell (zsh, for example, will handle this differently), run from the command line without creating a shell script.

Let's assume for this example that file.csv looks like this...

john doe, 2010-09-20, male, 090-555-1234
jane doe, 2010-09-30, female, 080-555-4321

so:

cat file.csv | while IFS=, read name date sex number; do echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n"; done

would produce:

name: john doe
date: 2010-09-20
sex: male
number: 090-555-1234

name: jane doe
date: 2010-09-30
sex: female
number: 080-555-4321

Let's break that single line up...

cat file.csv |
  while IFS=, read name date sex number;  # use IFS to split each incoming line into four comma-separated fields:
    # name, date, sex, number.
    do
      # access the fields (safely!) inside quotes with the ${param} syntax;
      # echo -e makes the \n escapes print as newlines.
      echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n";
    done;

Bash has a fairly rich parameter expansion syntax (see http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion ) that will let you do search/replace and a variety of other simple operations on the fields in your CSV records.

Problems need to be fairly complex before you need a scripting language. For example, grep can filter results (or, more often, the incoming file before processing), sort and uniq will do common sorting and de-duplication, and tr can do things like remove or squeeze whitespace or replace specific characters.

The power of the basic Unix text-processing commands should be understood; they will save you many hours of time trying to do things which have been done many times before.

Once you know how each tool works, you can quickly combine the Unix tools and their pipelines to solve most problems.

Additional note

I often use Bash to process text directly from the clipboard (I'm assuming Cygwin in your case),

so cat file.csv would be replaced by cat /dev/clipboard in the example one-liner.

While and read

read in the example is not a special parameter of the while command; it simply reads a line from the incoming pipe and lets you split it into arbitrary variables. See http://ss64.com/bash/read.html for more info.

Cut

In cases where you have text delimited at specific columns, you can use the Unix cut command to split the incoming line at the required column numbers (see http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?cut+1 ).

Problems with CSV format.

Since the CSV format is a fairly loose specification, it's worth noting Toad's point about CSV files whose records include delimiter characters or newlines.

In these cases, Bash is inadequate on its own, and a scripting language with a good CSV library is a better option. But don't forget that you can simplify your script to just process the input and produce a suitable output, which you can then sort / grep etc. The choice is yours, of course, but beware of reinventing the wheel; finding the right tools for a specific problem comes with experience, and also depends on your preferred runtime conditions.

Good CSV libraries for popular scripting languages.
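To illustrate the difference such a library makes, here is a minimal sketch using Python's standard csv module (the sample data is invented); note how it correctly handles a quoted field containing both a comma and a newline, exactly the case that breaks the bash approach:

    # Minimal sketch: Python's csv module parsing quoted fields.
    # The sample data is made up for the example.
    import csv, io

    data = 'name,quote\n"doe, john","said ""hi""\non two lines"\n'
    for row in csv.reader(io.StringIO(data)):
        print(row)
    # ['name', 'quote']
    # ['doe, john', 'said "hi"\non two lines']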

slomojo
useless cat example.
The point is you can do a great deal of things with a text file without even invoking a scripting language or a tool like sed, awk etc. ... Without knowing what type of problems the OP has to solve, most specific examples would be useless. This scriptlet does nothing of course, and assumes some knowledge on the OP's part. But it's a gimme for more info. I'm open, as are most, to requests for more specific responses. But thanks for your insightful comment.
slomojo
The most recent file I processed like this comes from a web-based event registration form. Users can select which day they want to attend (Monday through Friday). In the db that stores the registrations, the days were put in a single field. The user then wanted to count how many attendees had registered for a Monday, and asked that I separate the values into separate fields in the CSV dump of the db.
nedlud
(oops. hit enter too soon.) So the script just extracts the value for the days field and separates the days into individual fields in the CSV, creating 4 new columns (one for each day) in the process. If a registrant is not attending on a particular day, the field is left blank.
nedlud
Ok, I'm adding an example...
slomojo
Updated the example now, hope it helps.
slomojo
@slomojo Parsing CSV files with bash or regexes has problems with CSV files which contain double quotes (used for fields which contain the delimiter or newlines). In that case parsing will go completely wrong. Correctly parsing CSV files requires a state machine.
Toad
@Toad however, if you are in control of the incoming file, and you can assume that your delimiters have integrity, you're fine. But it is worth pointing out that arbitrary CSV parsing will often require a CSV library.
slomojo
@slomojo CSV files with double quotes or newlines are perfectly valid CSV files and have integrity (see http://en.wikipedia.org/wiki/Comma-separated_values ), yet they cannot be parsed using regexes or bash.
Toad
@Toad - I think you already made this point, and my response stands: if you have strict control over your delimiters, you are perfectly fine processing a character- or column-delimited file with bash. If you have something more complex or out of your control, fire up Ruby/Python/Perl and use a decent CSV library. Certainly don't feel you need to construct a state machine out of thin air when there are numerous tools already available. In short, using the best tool for the job requires knowing the power and limits of the tools available.
slomojo
+1  A: 

The BEST option is simply whatever you're most comfortable with. Each language differs from the others, but in general they can all handle this kind of parsing. I personally copy and paste from Excel into TextEdit and use regular expressions to mass-edit. If it's super complicated, I use PHP, because that's the programming language I'm best at.

If you really don't want to use Java though, I suggest that you use PHP. Some advantages are:

  • Not having to deal with variable types (that much)
  • Easy file management functions, e.g. $file_contents = file_get_contents('filename');
  • Great documentation

You could also try out Python, which I hear is great but haven't really used myself.

Kranu