I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
@fields = split /\t/, $line;
if (@fields == 11) { # less than 11 fields is probably header/footer
$the_5th_column = $fields[4];
...
}
If you ever do write a regex this long, you should at least use the x
flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre>
tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() easier than you can using regex. You can use regex to parse out the data, but most of us who've been programming Perl a while also learned that regex is not the first tool to grab a lot of times. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE - ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- -------- 3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0 ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0 AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0 ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0 ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0 AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0 ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow wrap columns to the next line (like a spreadsheet would in a column that is not wide enough. It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//;
and s/\s+$//;
will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.