I am using Powershell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.

If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I need to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET streamreader to read the compressed input files, and I'm wondering if I need to use a streamwriter to write the output files as well.

The naive version looks something like this:

while (!$reader.EndOfStream) {
  $line = $reader.ReadLine();
  switch ($line.Substring(0,3)) {
    "001" {Add-Content "output001.txt" $line}
    "002" {Add-Content "output002.txt" $line}
    "003" {Add-Content "output003.txt" $line}
  }
}

That just looks like bad news: finding, opening, writing and closing a file once per row. The input files are huge 500MB+ monsters.

Is there an idiomatic way to handle this efficiently w/ Powershell constructs, or should I turn to the .NET streamwriter?

Are there methods of a (New-Item "path" -type "file") object I could use for this?
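To make the idea concrete, one common pattern (just a sketch, with illustrative file names) is to open a StreamWriter per prefix once, keyed in a hashtable, write each line to the matching writer, and close them all at the end - so no file is reopened per row:

```powershell
# Sketch: one StreamWriter per prefix, opened once and closed once.
# Note: .NET resolves relative paths against the process working directory,
# not the PowerShell location, so absolute paths are safer in practice.
$writers = @{}
foreach ($prefix in "001","002","003") {
  $writers[$prefix] = New-Object System.IO.StreamWriter "output$prefix.txt"
}
while (!$reader.EndOfStream) {
  $line = $reader.ReadLine()
  $prefix = $line.Substring(0,3)
  if ($writers.ContainsKey($prefix)) {
    $writers[$prefix].WriteLine($line)
  }
}
foreach ($w in $writers.Values) { $w.Close() }
```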

EDIT for context:

I'm using the DotNetZip library to read ZIP files as streams; thus streamreader rather than Get-Content/gc. Sample code:

[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll") 
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")

foreach ($entry in $zipfile) {
  $reader = new-object system.io.streamreader $entry.OpenReader();
  while (!$reader.EndOfStream) {
    $line = $reader.ReadLine();
    #do something here
  }
}

I should probably Dispose() of both the $zipfile and $reader, but that is for another question!
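For completeness, the cleanup could be handled with try/finally so the objects are disposed even if the loop throws (a sketch; both `StreamReader` and DotNetZip's `ZipFile` implement `IDisposable`):

```powershell
# Sketch: dispose the reader per entry and the zip file at the end,
# even when an exception interrupts processing.
try {
  foreach ($entry in $zipfile) {
    $reader = New-Object System.IO.StreamReader $entry.OpenReader()
    try {
      while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        # do something here
      }
    }
    finally { $reader.Dispose() }
  }
}
finally { $zipfile.Dispose() }
```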

+2  A: 

Given the size of the input files, you definitely want to process a line at a time. I wouldn't think the re-opening/closing of the output files would be too huge a perf hit. It certainly makes a pipeline implementation possible, even as a one-liner - really not too different from your implementation. I wrapped it here to get rid of the horizontal scrollbar:

gc foo.log | %{switch ($_.Substring(0,3)) {
    '001'{$input | out-file output001.txt -enc ascii -append} `
    '002'{$input | out-file output002.txt -enc ascii -append} `
    '003'{$input | out-file output003.txt -enc ascii -append}}}
Keith Hill
Keith, in the `$_ >> output001.txt` statement the `$_` variable is not the one from `ForEach-Object` but the one from `switch` - it contains only the substring.
stej
I just need to hit the sack. It's getting late here and i'm just getting punchy. :-)
Keith Hill
You guys are awesome, thanks.
Peter
+6  A: 

Reading

As for reading the file and parsing, I would go with the `switch` statement:

switch -file c:\temp\stackoverflow.testfile2.txt -regex {
  "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
  "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
  "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
}

I think it is a better approach because

  • it supports regex, so you don't have to take a substring (which might be expensive), and
  • the parameter -file is quite handy ;)

Writing

As for writing the output, I'll test using a StreamWriter; however, if the performance of Add-Content is decent for you, I would stick with it.

Added: Keith proposed using the >> operator; however, it seems to be very slow. Besides that, it writes output in Unicode, which doubles the file size.

Look at my test:

[1]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923

The difference is huge.

Just for comparison:

[3]: (measure-command {
>>     $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
>>     while (!$reader.EndOfStream) {
>>         $line = $reader.ReadLine();
>>         switch ($line.substring(0,3)) {
>>             "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
>>             "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
>>             "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
>>             }
>>         }
>>     $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
>>         "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
>>         "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
>>     }
>> }).TotalSeconds
8,6755565

Added: I was curious about the writing performance... and I was a little bit surprised:

[8]: (measure-command {
>>     $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
>>     $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
>>     $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {$sw1.WriteLine($_)}
>>         "^002" {$sw2.WriteLine($_)}
>>         "^003" {$sw3.WriteLine($_)}
>>     }
>>     $sw1.Close()
>>     $sw2.Close()
>>     $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315

It is 80 times faster. Now you have to decide - if speed is important, use StreamWriter; if code clarity is important, use Add-Content.


Substring vs. Regex

According to Keith, Substring is ~20% faster. It depends, as always. However, in my case the results are like this:

[102]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch -regex ($_) {
>>             '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
>>             '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
>>             '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681

So the difference is not significant, and for me regexes are more readable.

stej
Actually, substring is ~20% faster.
Keith Hill
Good catch on speed of Add-Content vs >>. Using Out-File -enc ascii seems to be on par with Add-Content in my tests. Interesting that using streamwriter is that much faster.
Keith Hill
Yes, I was surprised as well. I added some measurements for substring/regex. If you would like to compare the speed of `StreamWriter`, here is my code that generates the test file: `1..5000 | % { $n = Get-Random -Min 1 -Max 4; $x=1..(Get-Random -Min 20 -Max 150) | % { ([char](Get-Random -Min 65 -Max 120)) }; $x = $x -join ""; '{0:000} {1}' -f $n,$x } | Add-Content C:\temp\stackoverflow.testfile.txt` (number of lines is now 5000)
stej
Looks like it's streamwriter for me. Thanks for this, I'm new to Powershell and all the examples specific to my task are very helpful. I can't use the `switch -file` construct, but it's good to know it's available when I'm working w/ uncompressed files.
Peter
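Putting the pieces together for Peter's case - zip entries read via StreamReader, so `switch -file` is out, with stej's StreamWriter approach for output - might look like this sketch (file names and prefixes are placeholders):

```powershell
# Sketch: dispatch lines from zip-entry StreamReaders to
# StreamWriters that stay open for the whole run.
$sw1 = New-Object System.IO.StreamWriter "output001.txt"
$sw2 = New-Object System.IO.StreamWriter "output002.txt"
$sw3 = New-Object System.IO.StreamWriter "output003.txt"
foreach ($entry in $zipfile) {
  $reader = New-Object System.IO.StreamReader $entry.OpenReader()
  while (!$reader.EndOfStream) {
    $line = $reader.ReadLine()
    switch ($line.Substring(0,3)) {
      "001" { $sw1.WriteLine($line) }
      "002" { $sw2.WriteLine($line) }
      "003" { $sw3.WriteLine($line) }
    }
  }
  $reader.Close()
}
$sw1.Close(); $sw2.Close(); $sw3.Close()
```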
I'm glad I could help.
stej