+1  A: 

I think a regular expression is overkill for this.

What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the separator character (sep_char).
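
A minimal sketch of that approach, assuming the input really is tab-separated (the filename is hypothetical; sep_char is the module's field-separator option):

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 })
        or die Text::CSV_XS->error_diag;

    open my $fh, '<', '13f.txt' or die "13f.txt: $!";  # hypothetical filename
    while (my $row = $csv->getline($fh)) {
        print join(', ', @$row), "\n";  # one parsed filing row
    }
    close $fh;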

szbalint
Unfortunately, there's very little structure to this data, so there isn't a comma or a tab separator. Although, that may mean the columns are fixed width...
+1  A: 

Like Ether said, another tool would be appropriate for this job.

    my @fields = split /\t/, $line;
    if (@fields == 11) {  # fewer than 11 fields is probably a header/footer
        my $the_5th_column = $fields[4];
        ...
    }
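
Wrapped in a read loop, that might look like the sketch below (the filename is hypothetical):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<', '13f.txt' or die "13f.txt: $!";  # hypothetical filename
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        next unless @fields == 11;      # skip header/footer lines
        print "$fields[4]\n";           # the 5th column
    }
    close $fh;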
mobrule
Well, technically speaking, you're still in the "and now you have two problems" territory - split uses regex as opposed to a proper state machine parser like Text::CSV_XS :) Which doesn't mean it's not a perfectly working solution, mind you :)
DVK
+1  A: 

If you ever do write a regex this long, you should at least use the /x flag, which ignores literal whitespace in the pattern and, more importantly, lets you break it across lines and add comments:

    m/
      whatever
      something else   # actually trying to do this
      blah             # for fringe case X
    /xi
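
As a concrete (and entirely hypothetical) illustration against one of the columnar 13F rows, capturing the first few fields:

    m/
      ^ (.{31})         # name of issuer, fixed width, needs trimming
        (.{17})         # title of class
        ([A-Z0-9]{9})   # CUSIP
        \s+ (\d+)       # value (x$1000)
        \s+ (\d+)       # shares or principal amount
        \s+ (SH|PRN)    # shares-or-principal flag
    /x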

If you find it hard to read your own regex, others will find it impossible.

Ben Humphreys
That's a fantastic suggestion. Thank you.
I've gotten the script working (and only get a few hundred documents that I can't parse out of several thousand). This tip was key because I had to keep tweaking the regex to handle the million or so edge cases. Now on to analyzing the next quarter :) Thanks Ben.
A: 

My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.

If you are dealing with columnar data, you can go after it using substr() or unpack() more easily than you can with a regex. You can use a regex to parse out the data, but most of us who've been programming Perl a while have also learned that a regex is often not the first tool to reach for. That's why you got the other comments. Regex is a powerful weapon, but it's also an easy way to shoot yourself in the foot.

http://perldoc.perl.org/functions/substr.html

http://perldoc.perl.org/functions/unpack.html

Update:

After a bit of nosing around on the SEC EDGAR site, I've found that the 13F files are nicely formatted, and you should have no problem figuring out how to process them using substr and/or unpack.

                                                     FORM 13F INFORMATION TABLE
                                                             VALUE  SHARES/ SH/ PUT/ INVSTMT  OTHER            VOTING AUTHORITY
NAME OF ISSUER                 TITLE OF CLASS   CUSIP     (x$1000)  PRN AMT PRN CALL DSCRETN MANAGERS         SOLE   SHARED     NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO                          COM              88579Y101      478     6051 SH       SOLE                     6051        0        0
ABBOTT LABS                    COM              002824100      402     8596 SH       SOLE                     8596        0        0
AFLAC INC                      COM              001055102      291     6815 SH       SOLE                     6815        0        0
ALCATEL-LUCENT                 SPONSORED ADR    013904305      172    67524 SH       SOLE                    67524        0        0

If you are seeing the 13F files unformatted, as in your example, then you are not viewing them correctly, because there are tabs between the columns in some of the files.

I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:

3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
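
That routine isn't reproduced here, but a sketch of its likely shape, with the unpack template widths measured from the sample rows above (illustrative only, and pulling just the first six columns):

    # Widths are guesses measured from the sample table; real files would
    # need measuring individually. 'A' fields strip trailing blanks only,
    # so right-aligned numbers still need a left-trim.
    my $line = '3M CO                          COM              88579Y101      478     6051 SH';
    my @cols = unpack 'A31 A17 A9 A9 A9 A3', $line;
    s/^\s+// for @cols;
    print join(', ', @cols), "\n";   # 3M CO, COM, 88579Y101, 478, 6051, SH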

Based on some of the other files, here are some thoughts on how to process them:

Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.

Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.

Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
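
The three cases above (true tab separators, tab runs used for alignment, and blank separator lines) can be handled in one pass; a minimal sketch, assuming one record per line:

    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;        # skip blank separator lines
        $line =~ tr/\t//s;               # squeeze runs of alignment tabs to one
        my @fields = split /\t/, $line;  # tabs now separate, not align
        # ... process @fields ...
    }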

Others wrap columns onto the next line (like a spreadsheet would in a column that is not wide enough). It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.

Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.

The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor and then saved it. Reading that is a two-step process: get rid of the wrapping carriage returns, then re-process the data to parse out the columns. As an added task, they also have column headings embedded in the data at page breaks.
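
The unwrapping step might look something like the sketch below; the test for a continuation line is a pure guess (here, hypothetically, leading whitespace), and the real condition would come from eyeballing the file:

    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if (@records && $line =~ /^\s+\S/) {   # hypothetical continuation test
            $records[-1] .= ' ' . $line;       # undo the word-wrap
        } else {
            push @records, $line;              # start of a new record
        }
    }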

You should be able to get 100% of the files parsed, but you'll probably want to do it with a couple of different parsing methods because of the mix of tabs, blank lines, and embedded column headers.

Ah, the fun of processing data from the wilderness.

Greg
"My first thought is that the sample data is horribly mangled in your example."Unfortunately, this is what the data looks like. To make matters worse, it's not just this one file; I'm trying to extract data from every 13F-HR filing made to the SEC over the last year and different firms will use different formatting.
How are you accessing/retrieving the information? Via HTML? The reason I ask is because a filing should look like http://moneywatch.bnet.com/money-library/sec-filings/c/2010/quarterly-reports/13f-hr/20100517/n53628664/?tag=content;col1 or http://www.secinfo.com/$/SEC/Filing.asp?T=vJcw.sc_1ut
Greg
I'm FTPing the filings from the SEC's EDGAR database. These are the unadulterated originals.