Hi guys,

Here's a little segment from a script I'm writing:

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $fields = [regex]::split($_,'@|\s+')
    Add-Content -Path $importSource2\$todaysLog -value ($($fields[0]) + "`t"  + $($fields[1]) + "`t" + $($fields[2]) + " " + $($fields[3])+ "`t" + "<*sender*@"+($($fields[5])) + "`t" + "<*recipient*@"+($($fields[7])))
    }

Sorry about the wrapping. Essentially it tokenises the elements of the file into an array, then writes out certain elements with some other text around them. The purpose is to replace sensitive sender/recipient information with something meaningless.

Here's a sample of the logfile I'm parsing:

10.197.71.28 SG 02012009 00:00:00 [email protected] <['[email protected]']>

Obviously I've replaced the address info in my sample. The above segment works just fine, although I'm conscious it's very expensive. Can anyone think of something that would be less expensive, perhaps a Select-String to replace the text rather than tokenising/rewriting it?

Cheers

A: 
$s1 = "10.197.71.28 SG  02012009 00:00:00 [email protected] <['[email protected]']>"
$s2 = $s1 -replace "\t[^@\t']+@", "`t*sender*@"
$s3 = $s2 -replace "\<\['.+@", "<['*recipient*@"
Write-Host $s3

I'm assuming that all the log entries look like the sample line. If they don't, then we may have to be a bit more sophisticated.

Note that if you copy-and-paste the above code, you may need to manually re-insert the tab character before "sender" in the first line.

dangph
+1  A: 
cat $tempDir\$todaysLog |
  %{ [regex]::Replace($_, "[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}\s<\[')[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}'\]>)", '*sender*$1*recipients*$2', "IgnoreCase") } > $importSource2\$todaysLog

The log entries must look like the sample line (especially the [email protected] <['[email protected]']> part).


Edit: I've done some benchmarking on a 1 mo file (~15,000 lines like the sample):

Andy Walker's solution (using split) -> 18.44s

Measure-Command {

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $fields = [regex]::split($_,'@|\s+')
    Add-Content -Path $importSource2\$todaysLog -value ($($fields[0]) + "`t"  + $($fields[1]) + "`t" + $($fields[2]) + " " + $($fields[3])+ "`t" + "<*sender*@"+($($fields[5])) + "`t" + "<*recipient*@"+($($fields[7])))
    }

}

Dangph's solution (using replace) -> 18.16s

Measure-Command {

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $s2 = $_ -replace "\t[^@\t']+@", "`t*sender*@"
    $s3 = $s2 -replace "\<\['.+@", "<['*recipient*@"
    Add-Content -Path $importSource2\$todaysLog -value $s3
    }

}

Madgnome's solution (using regex) -> 6.16s

Measure-Command {

cat $tempDir\$todaysLog |
  %{ [regex]::Replace($_, "[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}\s<\[')[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}'\]>)", '*sender*$1*recipients*$2', "IgnoreCase") } > $importSource2\$todaysLog

}
madgnome
Interesting results. What do you mean by "1 mo"? How many lines is that? I'd be curious to know what size files Andy Walker is dealing with.
dangph
To time commands, use Measure-Command like so: Measure-Command { 1..1000 }
Keith Hill
Thanks for the Measure-Command, I'll edit my answer.
madgnome
A: 

You should avoid using PowerShell as the engine for parsing large log files. I would use logparser.exe (you have a space-delimited entry that could be converted to a CSV) and then use Import-Csv in PowerShell to recreate PowerShell objects. From there you can strip and replace fields on a per-object basis. PowerShell is glue, not fuel. Using it to parse large logs of any size is not quite downright folly, but it will be expensive for you and the CPU. Although Lee Holmes has an excellent Convert-TextObject.ps1 among his book examples at http://examples.oreilly.com/9780596528492/ , you want a log-parsing engine of some type to handle the heavy lifting.
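For what it's worth, the per-object part of that pipeline might look something like the sketch below. This is untested and assumes you've already exported the log to CSV (e.g. via logparser.exe); the file names and the column headers IP, Code, Date, Time, Sender and Recipient are hypothetical, not anything from your actual log:

    # Hypothetical sketch: assumes the log has been converted to a CSV
    # with headers IP,Code,Date,Time,Sender,Recipient (names made up here)
    Import-Csv $tempDir\todayslog.csv |
        ForEach-Object {
            # Mask the local part before the '@' on a per-object basis
            $_.Sender    = $_.Sender    -replace '^[^@]+@', '*sender*@'
            $_.Recipient = $_.Recipient -replace "^<\['[^@]+@", "<['*recipient*@"
            $_
        } |
        Export-Csv $importSource2\scrubbed.csv -NoTypeInformation

Because Import-Csv hands you real objects, the field access is by name rather than by array index, which makes the scrubbing step much easier to read and maintain.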