I frequently find myself writing small programs to parse text files (typically CSV files) and find that I always fall back on writing these parsers in Java simply because that's the language I'm most familiar with.

But since this sort of task keeps coming up, what language would people recommend I use for it? I'm happy to invest the time in learning a new language if it would save me time in the long run whenever I write this sort of thing.

I'm not concerned with how quickly the language itself can run/parse files. What I want is a language that makes writing ad hoc parsers easy/quick. I suspect Java is not the most efficient language for me to be doing this kind of thing in.

Python? Ruby? Something else? (Please, don't let it be Perl)

A: 

I typically use Python for text parsing. It has a full set of useful string manipulation features, regular expressions, simple file I/O, and it's very quick to whip up a program for what I need. There's a pretty nice tutorial, too.
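For a sense of how little ceremony that takes, here is a minimal sketch of the kind of ad hoc parser in question (the file name, field layout, and date filter are all invented for the example):

    # Minimal sketch: read a comma-separated file, trim the fields,
    # and keep rows whose second field looks like an ISO date.
    # "data.csv" and its layout are hypothetical.
    import re

    with open("data.csv") as f:
        for line in f:
            fields = [s.strip() for s in line.split(",")]
            if len(fields) > 1 and re.match(r"\d{4}-\d{2}-\d{2}$", fields[1]):
                print(fields)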

I think everyone can agree that it is certainly not Malbolge.

JoshD
+1  A: 

I have worked with both Java and Python, and Python is by far the easier of the two for reading through files and writing output. Python also has PLY (Python Lex-Yacc), which makes it easy to parse even complicated files.
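To give a flavour of it, here is a minimal PLY lexer sketch; the token names and sample input are made up for illustration:

    # Minimal PLY lexer sketch (pip install ply). Token rules are illustrative.
    import ply.lex as lex

    tokens = ("NUMBER", "WORD", "COMMA")

    t_COMMA = r","
    t_WORD = r"[A-Za-z]+"
    t_ignore = " \t"            # characters to skip between tokens

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)  # convert the matched text to an int
        return t

    def t_error(t):
        t.lexer.skip(1)         # skip anything the rules don't cover

    lexer = lex.lex()
    lexer.input("john doe, 42")
    for tok in lexer:
        print(tok.type, tok.value)

PLY also provides a yacc module for building full grammars on top of a lexer like this.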

Kartech
+1  A: 

If you're willing to check out a new (to you!) language, you might want to give SNOBOL a try.

It might be overkill for simple CSV files and the like, but since its pattern matching paradigm is closer to context-free grammars than regular expressions, parsers tend to be simple and transparent. Contrast this to extended regular expressions, which can be coerced into parsing some context-free constructs, but can quickly explode into incomprehensible tangles of metacharacters and multiple levels of quoting and escaping, more closely resembling line noise than code.

Jim Lewis
Ok, I wouldn't recommend SNOBOL to anyone but... +1 for mentioning it. Haven't programmed in it for 15 years or so :)
RHSeeger
@RHSeeger: What can I say..I was exposed to it at an impressionable age. :-) Too bad it never really caught on...it was one of my more enjoyable language learning experiences, which is why it stuck with me, I guess.
Jim Lewis
+1  A: 

The best language is clearly HyperTalk. Example:

put newStr into character pos to (pos + the length of pattern) - 1 of inStr
TokenMacGuy
+1 clearly the best!
ergosys
+4  A: 

I'd strongly recommend that you do most simple text parsing with a combination of sed, awk, and bash.

If your needs extend beyond the capabilities of these dedicated text-processing tools, find the scripting language you are most comfortable with. Ruby or Python suit most people, but don't be dismissive of Perl: it was originally designed to process text, does it quickly and powerfully, and the CPAN library is (literally) awesome too.

A great deal of text processing can be done with a simple Bash script, e.g.:

cat file | while read a b c; do 
    #process ...likely: echo "${a//search/replace} $b $c"; # etc...
done

However, you should probably post some examples of the types of text parsing problems you generally face, to get truly useful answers.

Update: for the CSV use case.

Assuming a bash shell (zsh, for example, will handle this differently), run from the command line without creating a shell script.

Let's assume for this example that file.csv looks like this...

john doe, 2010-09-20, male, 090-555-1234
jane doe, 2010-09-30, female, 080-555-4321

so:

cat file.csv | while IFS=, read name date sex number; do echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n"; done

would produce:

name: john doe
date: 2010-09-20
sex: male
number: 090-555-1234

name: jane doe
date: 2010-09-30
sex: female
number: 080-555-4321

Let's break that single line up...

cat file.csv |
  while IFS=, read name date sex number;  # use IFS to split each incoming line into four comma-separated fields:
    # name, date, sex, number.
    do
      # access the fields (safely!) inside quotes with the ${param} syntax;
      # echo -e makes the \n escapes print as newlines.
      echo -e "name: ${name}\ndate: ${date}\nsex: ${sex}\nnumber: ${number}\n";
    done;

Bash has a fairly rich parameter expansion syntax (see http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion ) that will let you do search/replace and a variety of other simple operations on the fields in your CSV records.

Problems need to be fairly complex before you need a scripting language. For example, grep can filter results (or, more often, the incoming file before processing), sort and uniq will do common sorting and de-duplication, and tr can do things like remove or squeeze whitespace or replace specific characters.

The power of the basic Unix text-processing commands should be understood; they will save you many hours of time trying to do things which have been done many times before.

Once you know how each tool works, you can quickly combine the Unix tools and their pipelines to solve most problems.

Additional note

I often use Bash to process text directly from the clipboard (I'm assuming Cygwin in your case),

so cat file.csv would be replaced by cat /dev/clipboard in the example one-liner.

While and read

read in the example is not a special parameter of the while command; it simply reads a line from the incoming pipe and lets you split it into arbitrary variables. See http://ss64.com/bash/read.html for more info.

Cut

In cases where you have text delimited at specific columns, you can use the Unix cut command to split the incoming line at the required column numbers (see http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?cut+1 ).

Problems with CSV format.

Since the CSV format is a fairly loose specification, it's worth noting Toad's point about CSV files whose records include delimiter characters or newlines.

In these cases, Bash is inadequate on its own, and a scripting language with a good CSV library is a better option. But don't forget that you can simplify your script to just process the input and produce a suitable output, which you can then sort / grep etc. The choice is yours, of course, but beware of reinventing the wheel; finding the right tools for a specific problem comes with experience, and also depends on your preferred runtime conditions.

Good CSV libraries for popular scripting languages.
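To illustrate the difference such a library makes, here is a minimal sketch using Python's standard csv module (the sample data is invented); note how it correctly handles a quoted field containing both a comma and a newline, exactly the case that breaks the bash approach:

    # Minimal sketch: Python's csv module parsing quoted fields.
    # The sample data is made up for the example.
    import csv, io

    data = 'name,quote\n"doe, john","said ""hi""\non two lines"\n'
    for row in csv.reader(io.StringIO(data)):
        print(row)
    # ['name', 'quote']
    # ['doe, john', 'said "hi"\non two lines']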

slomojo
useless cat example.
The point is you can do a great deal of things with a text file without even invoking a scripting language or a tool like sed, awk etc. ... Without knowing what type of problems the OP has to solve, most specific examples would be useless. This scriptlet does nothing of course, and assumes some knowledge on the OP's part. But it's a gimme for more info. I'm open, as are most, to requests for more specific responses. But thanks for your insightful comment.
slomojo
The most recent file I processed like this comes from a web-based event registration form. Users can select which day they want to attend (Monday through Friday). In the db that stores the registrations, the days were put in a single field. The user then wanted to count how many attendees had registered for a Monday, and asked that I separate the values into separate fields in the CSV dump of the db.
nedlud
(oops. hit enter too soon.) So the script just extracts the value for the days field and separates the days into individual fields in the CSV, creating 4 new columns (one for each day) in the process. If a registrant is not attending on a particular day, the field is left blank.
nedlud
Ok, I'm adding an example...
slomojo
Updated the example now, hope it helps.
slomojo
@slomojo Parsing CSV files with bash or regexes has problems with CSV files which contain double quotes (used for fields which contain the delimiter or newlines). In that case parsing will go completely wrong. Correctly parsing CSV files requires a state machine.
Toad
@Toad however, if you are in control of the incoming file, and you can assume that your delimiters have integrity, you're fine. But it is worth pointing out that arbitrary CSV parsing will often require a CSV library.
slomojo
@slomojo CSV files with double quotes or newlines are perfectly valid CSV files and have integrity (see http://en.wikipedia.org/wiki/Comma-separated_values ), yet they cannot be parsed using regexes or bash.
Toad
@Toad - I think you already made this point, and my response stands: if you have strict control over your delimiters, you are perfectly fine processing a character- or column-delimited file with bash. If you have something more complex or out of your control, fire up Ruby/Python/Perl and use a decent CSV library. Certainly don't feel you need to construct a state machine out of thin air when there are numerous tools already available. In short, using the best tool for the job requires knowing the power and limits of the tools available.
slomojo
+1  A: 

The BEST option is simply whatever you're most comfortable with. Each language differs from the others, but in general they can all handle this kind of parsing. I personally copy and paste from Excel into TextEdit and use regular expressions to mass-edit. If it's super complicated, I use PHP, because that's the programming language I'm best at.

If you really don't want to use Java though, I suggest that you use PHP. Some advantages are:

  • Not having to deal with variable types (that much)
  • Easy file management functions, e.g. $file_contents = file_get_contents('filename');
  • Great documentation

You could also try out Python, which I hear is great but haven't really used myself.

Kranu