I have log files that look like this...

2009-12-18T08:25:22.983Z     1         174 dns:0-apr-credit-cards-uk.pedez.co.uk P http://0-apr-credit-cards-uk.pedez.co.uk/ text/dns #170 20091218082522021+89 sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF - -
2009-12-18T08:25:22.984Z     1           5 dns:0-60racing.co.uk P http://0-60racing.co.uk/ text/dns #116 20091218082522037+52 sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG - -
2009-12-18T08:25:23.066Z     1          79 dns:0-addiction.metapress.com.wam.leeds.ac.uk P http://0-addiction.metapress.com.wam.leeds.ac.uk/ text/dns #042 20091218082522076+20 sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R - -
...plus millions of other records

I need to convert these into csv files...

"2009-12-18T08:25:22.983Z","1","174","dns:0-apr-credit-cards-uk.pedez.co.uk","P","http://0-apr-credit-cards-uk.pedez.co.uk/","text/dns","#170","20091218082522021+89","sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF","-","-"
"2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P","http://0-60racing.co.uk/","text/dns","#116","20091218082522037+52","sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG","-","-"
"2009-12-18T08:25:23.066Z","1","79","dns:0-addiction.metapress.com.wam.leeds.ac.uk","P","http://0-addiction.metapress.com.wam.leeds.ac.uk/","text/dns","#042","20091218082522076+20","sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R","-","-"

The field delimiter is a run of one or more spaces, and the fields themselves are a mix of fixed width and variable width. This tends to confuse most CSV parsers I've found.
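
For example, a naive split on a single space turns every extra padding space into an empty field:

# quick illustration using the first three fields of a sample record:
# splitting on one space yields an empty string for every extra space
'2009-12-18T08:25:22.983Z     1         174' -split ' '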

Ultimately I want to bcp these files into SQL Server, but you can only specify a single character as the field delimiter (e.g. ' '), and that breaks on the fixed-width fields.
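
For reference, this is roughly the bcp call the CSV will eventually feed (database, table, and server names here are hypothetical):

# -c = character mode, -t = single-character field terminator,
# -T = trusted connection; all names below are placeholders
bcp CrawlDb.dbo.CrawlLog in .\crawl.csv -c -t, -S myserver -T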

So far, I'm using PowerShell:

gc -ReadCount 10 -TotalCount 200 .\crawl_sample.log | foreach { ([regex]'([\S]*)\s+').matches($_) } | foreach {$_.Groups[1].Value}

and this returns a stream of the fields:

2009-12-18T08:25:22.983Z
1
174
dns:0-apr-credit-cards-uk.pedez.co.uk
P
http://0-apr-credit-cards-uk.pedez.co.uk/
text/dns
#170
20091218082522021+89
sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF
-
-
2009-12-18T08:25:22.984Z
1
5
dns:0-60racing.co.uk
P
http://0-60racing.co.uk/
text/dns
#116
20091218082522037+52
sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG
-

but how do I convert that output into the CSV format?

A: 

Answering my own question again...

measure-command {
    $q = [regex]" +"
    # join every line back into one big string, then swap runs of spaces for commas
    $q.Replace( ([string]::Join([Environment]::NewLine, (Get-Content -ReadCount 1 .\crawl_sample2.log))), "," ) > crawl_sample2.csv
}

and it's quick!

Observations:

  • I was using \s+ as the regex separator, but \s also matches newlines, so it was eating the line breaks (see the sketch after this list)
  • Get-Content -ReadCount 1 streams single-row arrays to the regex
  • The output string is then piped to the new file
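
A minimal demonstration of the newline problem, using a throwaway two-line string:

# \s also matches the newline, so \s+ collapses both records into one;
# " +" replaces only the runs of spaces and keeps the record boundary
$two = "a b`nc d"
[regex]::Replace($two, '\s+', ',')   # -> a,b,c,d   (line break lost)
[regex]::Replace($two, ' +', ',')    # -> a,b / c,d (line break kept)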

UPDATE

This script works, but it uses a HUGE amount of RAM when working with large files. So, how can I do the same thing without burning 8 GB of RAM and swap?

I think this is caused by the join buffering up all the data again (a rough decomposition below shows why)... Any ideas?
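
Roughly what that one-liner does step by step; each stage holds a full copy of the file in memory, which would explain the footprint:

# 1. collect every line of the file into one array (first full copy)
$lines = Get-Content -ReadCount 1 .\crawl_sample2.log
# 2. the join materialises the whole file as a single string (second copy)
$blob = [string]::Join([Environment]::NewLine, $lines)
# 3. the replace builds the output string before anything is written (third copy)
[regex]::Replace($blob, ' +', ',') > crawl_sample2.csv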

UPDATE 2

OK - got a better solution...

# stream the file in 100-line chunks, unwrap each chunk into single lines,
# then replace each run of spaces with a comma, one line at a time
Get-Content -ReadCount 100 -TotalCount 100000 .\crawl.log |
    ForEach-Object { $_ } |
    ForEach-Object { $_ -replace " +", "," } > .\crawl.csv
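
One gap worth noting: the target format at the top quotes every field, and the version above doesn't. A sketch of a quoting variant on the same streaming pattern (it assumes fields never contain embedded spaces or quotes, which holds for these logs):

# wrap each line in quotes and turn runs of spaces into ","
# so every field comes out quoted, matching the target CSV
Get-Content -ReadCount 100 .\crawl.log |
    ForEach-Object { $_ } |
    ForEach-Object { '"' + ($_ -replace " +", '","') + '"' } > .\crawl.csv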

A VERY handy guide to PowerShell: Powershell regular expressions

Guy

...any better solutions or improvements to the script would be welcome!
Guy

You can simplify this a bit by getting rid of the middle ForEach-Object, since -replace operates on string arrays, e.g. `'a b','c d','e f' -replace ' +',','`. Try this: `gc crawl.log -read 100 -total 100000 | %{$_ -replace ' +',','} > crawl.csv`
Keith Hill

Considering `-replace`, it can be even simpler: `(gc crawl.log ...) -replace ' +',',' > crawl.csv` (my post *chain of operators*: http://www.leporelo.eu/blog.aspx?id=powershell-tips-and-tricks-3-chain-of-operators)
stej