ansaurus

Question

How do I know if a file is tab or space delimited in Perl?

Answer 1

A:

You could just use a regular expression. That's what Perl is famous for ;-).

Simple example:

perl -ne 'if ($_=~/^(\d+\s+)+$/){print "yep\n";}'

This accepts only lines made up of groups of digits, each followed by whitespace. That should get you going.
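For readers who want the expression spelled out, here is a minimal sketch of the same pattern written with the /x modifier so each piece can carry a comment. The helper name looks_numeric and the sample lines are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Same pattern as the one-liner, spelled out with /x so it can carry comments.
my $digits_and_space = qr/
    ^          # start of the line
    (          # one group:
        \d+    #   one or more digits
        \s+    #   then one or more whitespace characters (space, tab, newline)
    )+         # the group repeats one or more times
    $          # end of the line
/x;

# Hypothetical helper: returns 1 if the line matches the pattern, else 0.
sub looks_numeric {
    my ($line) = @_;
    return $line =~ $digits_and_space ? 1 : 0;
}

for my $line ("12 34\t56\n", "12 abc 56\n") {
    (my $shown = $line) =~ s/\n/\\n/g;   # make the control characters visible
    $shown =~ s/\t/\\t/g;
    print "$shown: ", (looks_numeric($line) ? "yep" : "nope"), "\n";
}
```

The first sample line matches because every run of digits, including the last one, is followed by whitespace (the newline counts). The second does not, because "abc" is not a run of digits.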

sleske 2009-03-30 22:17:09

I'm not good at regexes. Could you please explain that expression?

2009-03-30 22:26:22

Also, how can this regex be modified so that it works on the last line of the file too? It never passes the last line of the file. Maybe because of the EOF character?

2009-03-30 22:44:47

I modified it to: ~/^(\d+\s*)+?$/ which seems to work, but does it look OK?

2009-03-30 23:10:20

It's hard to explain regexes in 300 chars, sorry. But you really, REALLY should learn the basics, it's essential for text processing. See e.g. "perlretut" in the Perl manpages/documentation. And for me the solution also works for the last line of a file. Strange...

sleske 2009-03-30 23:18:50

And your modified regex should do the same as mine, as far as I can tell. For what input does it produce a different result?

sleske 2009-03-30 23:22:05

@sleske, your regex requires each line to end with one or more spaces or tabs. The OP's version makes the trailing whitespace optional.
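The difference can be seen with a small sketch. Note that \s also matches a newline, so a line that still carries its trailing "\n" (as it does under perl -n) satisfies the original pattern, while a final line with no newline does not. That is one likely explanation for the last-line problem, assuming the file does not end in a newline. The helper name compare_patterns is hypothetical.

```perl
use strict;
use warnings;

my $needs_trailing_ws    = qr/^(\d+\s+)+$/;   # pattern from the answer
my $optional_trailing_ws = qr/^(\d+\s*)+?$/;  # the OP's modification

# Returns a two-element list:
# (did the answer's pattern match?, did the OP's pattern match?)
sub compare_patterns {
    my ($line) = @_;
    return ( ($line =~ $needs_trailing_ws)    ? 1 : 0,
             ($line =~ $optional_trailing_ws) ? 1 : 0 );
}

# "1 2 3\n" is an ordinary line; "1 2 3" is a last line with no newline.
for my $line ("1 2 3\n", "1 2 3") {
    my ($original, $modified) = compare_patterns($line);
    my $label = $line =~ /\n\z/ ? "with newline" : "without newline";
    print "$label: original=$original modified=$modified\n";
}
```

This prints original=1 modified=1 for the line with a newline, and original=0 modified=1 for the line without one.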

Alan Moore 2009-03-31 00:16:17

For the record, I think it's best to avoid a regex-specific and command-line-specific answer to a beginner question like this. The OP (no offense intended) appears to be a bit new at Perl, and using something like this is going to convert him/her to Python. :P

Chris Lutz 2009-03-31 02:07:06

@Alan M: My bad, you're right. Shouldn't answer when I'm tired. I overlooked the * vs. + :-(.

sleske 2009-03-31 08:10:25

@Chris Lutz: For the record, I think any programmer worth his/her salt should know the basics of regexes :-). It's just too important a tool to miss, esp. in Perl. But I think here we have the best of both: solutions with and w/o regexes, so OP can choose.

sleske 2009-03-31 08:11:37

Also note that the top voted answer by wisnij uses a regex (inside the split) ;-).

sleske 2009-03-31 08:12:42

Answer 2

+3 A:

It sounds like it doesn't matter whether it's delimited by spaces or tabs. At some point you will have to read all of the characters of the file, both to validate them and to parse them. Why make these two separate steps? Consume integers from the file until you run into something that isn't whitespace or a valid integer, then complain (and possibly roll back).
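A minimal sketch of that single-pass idea, assuming the input is already in a string and that an optional leading minus sign is allowed. The helper name consume_integers and its return convention are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the text once, collecting integers and stopping
# at the first token that is not an integer.
# Returns (\@ints, undef) on success, or (\@ints_so_far, $offending_token).
sub consume_integers {
    my ($text) = @_;
    my @ints;
    pos($text) = 0;
    while (1) {
        $text =~ /\G\s+/gc;                       # skip whitespace, if any
        last if pos($text) >= length $text;       # reached the end cleanly
        if ($text =~ /\G(-?\d+)(?=\s|\z)/gc) {    # a complete integer token
            push @ints, $1;
        }
        else {
            $text =~ /\G(\S+)/gc;                 # capture the bad token
            return (\@ints, $1);
        }
    }
    return (\@ints, undef);
}

my ($ints, $bad) = consume_integers("1 2\t-3\n4 x5 6\n");
print "read: @$ints\n";
print defined $bad ? "stopped at '$bad'\n" : "all input was valid\n";
```

The caller decides what "complain (and possibly roll back)" means: here the integers read so far are returned alongside the offending token, so they can be kept or discarded.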

TokenMacGuy 2009-03-30 22:17:22

Answer 3

+8 A:

It's easy enough to split on both spaces and tabs:

my @fields = split /[ \t]/, $line;

but if it has to be only one or the other, and you don't know which ahead of time, that's a little trickier. If you know how many columns there should be in the input, you can try counting the number of spaces and the number of tabs on each line and seeing if there are the right number of separators. E.g. if there are supposed to be 5 columns and you see 4 tabs on each line, it's a good bet that the user is using tabs as separators. If neither one matches up, return an error.
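The counting approach could look roughly like this. It is a sketch under the assumption that the expected column count is known up front. The helper name guess_delimiter is hypothetical, and it returns undef when neither separator fits or when both do.

```perl
use strict;
use warnings;

# Hypothetical helper: given the expected number of columns and the lines,
# returns "\t" or " " if exactly one of them appears the right number of
# times on every line, and undef otherwise (no fit, or ambiguous).
sub guess_delimiter {
    my ($expected_columns, @lines) = @_;
    my $wanted = $expected_columns - 1;     # N columns need N-1 separators
    my ($tabs_ok, $spaces_ok) = (1, 1);
    for my $line (@lines) {
        my $tabs   = ($line =~ tr/\t//);    # tr/// here counts without modifying
        my $spaces = ($line =~ tr/ //);
        $tabs_ok   = 0 if $tabs   != $wanted;
        $spaces_ok = 0 if $spaces != $wanted;
    }
    return "\t" if $tabs_ok   && !$spaces_ok;
    return " "  if $spaces_ok && !$tabs_ok;
    return;
}

my $delim = guess_delimiter(3, "1\t2\t3", "4\t5\t6");
print defined $delim ? ($delim eq "\t" ? "tabs\n" : "spaces\n")
                     : "could not tell\n";
```

The lines are expected to be chomped already. Once the delimiter is known, it can be used in place of the [ \t] character class when splitting.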

Checking for integer values is straightforward:

for my $val ( @fields ) {
    die "'$val' is not an integer!" if $val !~ /^-?\d+$/;
}

wisnij 2009-03-30 22:19:48

I don't think the OP should make it matter whether or not someone mixed spaces and tabs in a file. It seems like it would add a lot of headache with very little benefit.

Chris Lutz 2009-03-31 02:03:32

Answer 4

+1 A:

I am uploading a file to a Perl program from an HTML page. After the file has been uploaded, I want to determine whether the file is (space or tab) delimited and all the values are integers. If this is not the case, I want to output some message.

This condition means that your data should consist only of digits, space, and tab characters (basically it should be digits and spaces, or digits and tabs only).

For this, just load the data into a variable and check whether it matches:

$data =~ /\A[0-9 \t]+\z/;

If it matches, you have a set of integers delimited by spaces or tabs (it's not really relevant which character was used to delimit the integers).

If your next step is to extract these integers (which sounds logical), you can do it easily by:

@integers = split /[ \t]+/, $data;

or

@integers = $data =~ /(\d+)/g;

depesz 2009-03-30 23:04:19

You are only considering non-negative integers in your regular expressions. Additionally, if you add '-' to the character class, your first regex will no longer validate correctly: the input "12\t1-3\t4" will pass even though it is not valid.
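To illustrate the point, here is a small sketch comparing a character-class check that has '-' added with a token-by-token check. Both helper names (class_check and token_check) are hypothetical, and the input is assumed to be a single line with the newline already removed.

```perl
use strict;
use warnings;

# Character-class check with '-' added: accepts a minus sign anywhere.
sub class_check {
    my ($data) = @_;
    return $data =~ /\A[-0-9 \t]+\z/ ? 1 : 0;
}

# Token-by-token check: every field must be an optionally signed integer.
# Leading whitespace produces an empty first field, which is rejected.
sub token_check {
    my ($data) = @_;
    my @fields = split /[ \t]+/, $data;
    return 0 unless @fields;
    for my $field (@fields) {
        return 0 if $field !~ /\A-?\d+\z/;
    }
    return 1;
}

for my $input ("12\t-3\t4", "12\t1-3\t4") {
    (my $shown = $input) =~ s/\t/\\t/g;
    print "$shown: class=", class_check($input),
          " token=", token_check($input), "\n";
}
```

The second input passes the character-class check but fails the token check, because "1-3" is not an integer.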

2009-03-31 06:30:24

Answer 5

+1 A:

Your question isn't very clear. It sounds like you expect the data to be in this format:

123 456 789
234 567 890

In other words, each line contains one or more groups of digits, separated by whitespace. Assuming you're processing the file one line at a time as you said in the original question, I would use this regex:

/^\d+(\s+\d+)*$/

If there can be negative numbers, use this instead:

/^-?\d+(\s+-?\d+)*$/

Your regex won't match a blank line, and this one won't either. That's probably as it should be; I would expect blank lines (including lines containing nothing but whitespace) to be prohibited in a case like this. However, there could be one or more empty lines at the end of the file. That means, once you find a line that doesn't match the regex above, you should verify that each of the remaining lines has a length of zero.
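One way to read that rule as code: every line must match the regex until the first empty line, and from then on every line must be empty. This sketch assumes the lines are already in a list. The helper name validate_lines is hypothetical, and a non-matching line that is not empty is treated as an error.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the lines are whitespace-separated
# integers followed only by zero-length lines, and 0 otherwise.
sub validate_lines {
    my (@lines) = @_;
    my $data_ended = 0;
    for my $line (@lines) {
        chomp $line;
        if ($data_ended) {
            return 0 if length $line;    # after the data, only empty lines
        }
        elsif ($line !~ /^-?\d+(\s+-?\d+)*$/) {
            return 0 if length $line;    # a bad, non-empty line: reject
            $data_ended = 1;             # first empty line: the data is over
        }
    }
    return 1;
}

my @good = ("123 456 789\n", "234 567 890\n", "\n", "\n");
my @bad  = ("123 456 789\n", "\n", "234 567 890\n");
print "good: ", validate_lines(@good), "\n";
print "bad:  ", validate_lines(@bad),  "\n";
```

A line containing only spaces or tabs has a non-zero length, so it is rejected here, which matches the stated expectation that whitespace-only lines are prohibited.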

But I'm making a lot of assumptions here. If this isn't what you're trying to do, you'll need to give us more detailed requirements. Also, all this accomplishes is a rough validation of the format of the data. That's fine if you're merely storing the data, but if you also want to extract information, you probably should do the validation as part of that process.

Alan Moore 2009-03-31 00:27:30

Answer 6

A:

I assume several things about your format and desired results.

Consecutive delimiters collapse.
Numbers may not wrap around lines, i.e. newlines are effectively delimiters.
Tabs and spaces in one file are OK. Either delimiter is acceptable.
Files are small enough that processing a whole file at once will not be an issue.

Further, my code accepts any whitespace as a delimiter.

use strict;
use warnings;

# Slurp whole file into a scalar.
my $file_contents;
{   local $/;
    $/ = undef;
    $file_contents = <DATA>;
}

# Extract and validate numbers
my @ints = grep validate_integer($_), 
                split( /\s+/, $file_contents ); 
print "@ints\n";


sub validate_integer {
    my $value = shift;

    # is it an integer?
    # add additional validation here.
    if( $value =~ /^-?\d+$/ ) {
        return 1;
    }

    # die here if you want a fatal exception.
    warn "Illegal value '$value'\n";
    return;
}

__DATA__
1 -2 3 4
5 8.8
-6
    10a b c10 -99-
    8   9 98- 9-8
10 -11  12 13

This results in:

Illegal value '8.8'
Illegal value '10a'
Illegal value 'b'
Illegal value 'c10'
Illegal value '-99-'
Illegal value '98-'
Illegal value '9-8'
1 -2 3 4 5 -6 8 9 10 -11 12 13

Updates:

Fixed handling of negative numbers.
Replaced validation map with grep.
Switched to split instead of non-whitespace capture from re.

If you want to process the file line by line, you can wrap the grep in a loop that reads the file.
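A line-by-line variant might look like the sketch below. The names is_integer and ints_per_line are hypothetical, and an in-memory filehandle stands in for a real file. It uses split ' ' rather than split /\s+/ so that leading whitespace on a line does not produce an empty first field.

```perl
use strict;
use warnings;

# Same check as validate_integer in the answer, minus the warning.
sub is_integer {
    my ($value) = @_;
    return $value =~ /^-?\d+$/ ? 1 : 0;
}

# Hypothetical helper: reads a filehandle line by line and returns one
# array reference of valid integers per input line.
sub ints_per_line {
    my ($fh) = @_;
    my @rows;
    while ( my $line = <$fh> ) {
        # split ' ' splits on runs of whitespace and skips leading whitespace
        push @rows, [ grep { is_integer($_) } split ' ', $line ];
    }
    return @rows;
}

# An in-memory filehandle stands in for a real file here.
my $sample = "1 -2 3\n5 8.8\n    10a 7\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "@$_\n" for ints_per_line($fh);
close $fh;
```

Each row can then be processed on its own, which keeps memory use proportional to one line rather than the whole file.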

daotoad 2009-03-31 01:57:35

Answer 7

A:

To add to the answers, here is a clear and simple one. This version:

uses only the most basic Perl functions and constructs, so anyone who knows even a little Perl should get it quite quickly. Not to offend or anything, and there's no shame in being a newbie - I'm just trying to write something that you'll be able to understand no matter what your skill level is.
accepts either tabs or spaces as a delimiter, allowing them to be mixed freely. Commented-out code will detail a trivial way to enforce an either-or throughout the entire document.
prints nice error messages when it encounters bad values. Should show the illegal value and the line it appeared on.
allows you to process the data however you like. I'm not going to store it in an array or anything, just put a ... at one point, and there you will add in a bit of code to do whatever processing of the data on a given line you want to perform.

So here goes:

use strict;
use warnings;

open(my $data, "<", $filename) or die "Cannot open '$filename': $!";
# define $filename before this, or get it from the user

my $whitespace = "\t ";

chomp(my @data = <$data>);

# check first line for whitespace to enforce...
#if($data[0] =~ /\t/ and $data[0] !~ / /) {
#  $whitespace = "\t";
#} elsif($data[0] =~ / / and $data[0] !~ /\t/) {
#  $whitespace = " ";
#} else {
#  warn "Warning: mixed whitespace on line 1 - ignoring whitespace.\n";
#}

foreach my $n (0 .. $#data) {
  my @fields = split(/[$whitespace]+/, $data[$n]);
  foreach my $f (@fields) {
    if($f !~ /^-?\d+$/) { # \D will call "-12" invalid
      my $line = $n + 1; # $n is a 0-based array index; line numbers start at 1
      if($f =~ /\s/) {
        warn "Warning: invalid whitespace use at line $line - ignoring.\n";
      } else {
        warn "Warning: invalid value '$f' at line $line - ignoring.\n";
      }
    } else {
      ... # do something with $f, or...
    }
  }
  ... # do something with @fields if you want to process the whole list
}

There are better, faster, more compact, and perhaps even more readable (depending on who you ask) ways to do it, but this one uses the most basic constructs, and any Perl programmer should be able to read this, regardless of skill level (okay, if you're just starting with Perl as a first language, you may not know any of it, but then you shouldn't be trying to do something like this quite yet).

EDIT: fixed my regex for matching integers. It was lazy before, and allowed "12-4", which is obviously not an integer (though it evaluates to one - but that's much more complicated (well, not really, but it's not what the OP wants (or is it? It would be a fun feature (INSERT LISP JOKE HERE)))). Thanks wisnij - I'm glad I re-read your post, since you wrote a better regex than I did.

Chris Lutz 2009-03-31 02:41:30
