I want to find a piece of text in a large XML file and replace it with some other text. The file is around 50 GB. I want to do this from the command line. I am looking at PowerShell and want to know whether it can handle a file this large. I would also like to know the syntax for escaping the key operators in PowerShell. I am a PowerShell newbie.

Currently I am trying something like this, but PowerShell does not like it:

    Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml

I want to replace the text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with the empty string "".

Questions

  1. Can PowerShell handle large files?
  2. How do I call the PowerShell script from the command line?
  3. What is the syntax for escaping key operators in PowerShell, and what is the list of key operators?
  4. I don't want the replacement to happen in memory and would prefer streaming, assuming that will not bring the server to its knees.
  5. Are there any other approaches I can take (different tools or strategies)?

Thanks

+1  A: 

It does not like it because you can't read from a file and write back to the same file at the same time using Get-Content/Set-Content. I recommend writing the output to a temp file and then, at the end, renaming file1.xml to file1.xml.bak and renaming the temp file to file1.xml.
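
As a rough illustration of the temp-file approach, here is a minimal sketch. The function name `Update-FileText` and the `.tmp`/`.bak` suffixes are made up for this example, and it assumes the text to replace never spans a line break. `[regex]::Escape` is used so that the search text is matched literally by `-replace`.

```powershell
# Hypothetical helper: streams $Path into a temp file with $Find replaced,
# then keeps the original as a .bak and moves the temp file into place.
function Update-FileText {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Find,
        [string] $ReplaceWith = ''
    )

    $temp   = "$Path.tmp"   # assumed temp-file name
    $backup = "$Path.bak"   # assumed backup name

    # Escape regex metacharacters so the search text is treated literally.
    $pattern = [regex]::Escape($Find)

    # -ReadCount 1000 sends batches of 1000 lines down the pipeline;
    # -replace is applied to every line in each batch.
    Get-Content -LiteralPath $Path -ReadCount 1000 |
        ForEach-Object { $_ -replace $pattern, $ReplaceWith } |
        Set-Content -LiteralPath $temp

    # Swap the files only after the rewrite has finished.
    Move-Item -LiteralPath $Path -Destination $backup
    Move-Item -LiteralPath $temp -Destination $Path
}

# Example call (path taken from the question):
# Update-FileText -Path 'C:\File1.xml' -Find 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
```

Note that Set-Content's default encoding may not match the encoding of the source file, so it is probably worth passing -Encoding explicitly and checking a small sample before running this on the full 50 GB.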

  1. Yes, as long as you don't try to load the whole file at once. Line-by-line will work but is going to be a bit slow. Use the -ReadCount parameter and set it to 1000 to improve performance.
  2. Which command line? PowerShell? If so, you can invoke your script like so: .\myscript.ps1. If it takes parameters: c:\users\joe\myscript.ps1 c:\temp\file1.xml. From cmd.exe, powershell.exe -File c:\users\joe\myscript.ps1 c:\temp\file1.xml does the same.
  3. In general, for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double quotes, the back-tick character is the escape char inside them, e.g. "`$ps1 is set to $ps1". In your example, single quoting simplifies your regex to (note: forward slashes aren't metacharacters in regex):

    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

  4. Absolutely, you want to stream this, since 50 GB won't fit into memory. However, this poses an issue if you process line-by-line: what if the text you want to replace is split across multiple lines?

  5. If you don't have the split line issue then I think PowerShell can handle this.
Keith Hill
@Keith, you really trust PowerShell ;) I would maybe worry about an OutOfMemoryException, because 50 GB is large enough for little memory leaks to add up. Just a guess. Personally, I would use `File.Open` directly, work with a stream, and compare manually (no regex).
stej
And shouldn't one use some sort of XML API to do this? Just a thought. Dunno if SAX or StAX are available in .NET; I work too rarely with XML, but doing a string replace sounds wrong for this.
Joey
.NET has a forward-only, cursor-style reader (XmlReader/XmlTextReader): a pull mechanism, which is a bit different from the SAX push approach. It's a bit tedious but a good way to go when the whole XML document won't fit into memory.
Keith Hill
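
To give a feel for the pull model mentioned above, here is a minimal sketch that only walks a file with XmlReader and counts elements. The function name `Get-XmlElementCount` is invented for this example. It does not do the replacement; writing a modified copy would also need an XmlWriter, which is left out here.

```powershell
# Hypothetical helper: forward-only pass over an XML file, counting elements.
# Only the current node is held in memory, so file size is not the limit.
function Get-XmlElementCount {
    param([Parameter(Mandatory)] [string] $Path)

    # .NET resolves relative paths against the process working directory,
    # not the PowerShell location, so pass a full path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $reader = [System.Xml.XmlReader]::Create($fullPath)
    try {
        $count = 0
        # Read() advances the cursor one node at a time and returns
        # $false at the end of the document.
        while ($reader.Read()) {
            if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element) {
                $count++
            }
        }
        $count
    }
    finally {
        $reader.Close()
    }
}
```
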
@stej, good point on the regex - it doesn't look like it is required and could be replaced by String.Replace().
Keith Hill
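
A minimal sketch of the stream-based idea from these comments, using StreamReader/StreamWriter and String.Replace() instead of a regex. The function name `Copy-FileWithReplacement` and its parameters are made up for illustration, both paths are assumed to be full paths, and the search text is assumed not to span a line break.

```powershell
# Hypothetical helper: copies $Source to $Destination line by line,
# replacing $Find with $ReplaceWith as plain text (no regex involved).
function Copy-FileWithReplacement {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,
        [Parameter(Mandatory)] [string] $Find,
        [string] $ReplaceWith = ''
    )

    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($Source)
        $writer = New-Object System.IO.StreamWriter($Destination)

        # ReadLine() returns $null at end of file; an empty line is ''.
        while ($null -ne ($line = $reader.ReadLine())) {
            # WriteLine appends the platform newline, so the original
            # line endings are not necessarily preserved.
            $writer.WriteLine($line.Replace($Find, $ReplaceWith))
        }
    }
    finally {
        if ($writer) { $writer.Close() }
        if ($reader) { $reader.Close() }
    }
}
```
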
A: 

The escape character in powershell strings is the backtick ( ` ), not backslash ( \ ). I'd give an example, but the backtick is also used by the wiki markup. :(

The only thing you should have to escape is the quotes - the periods and such should be fine without.
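
For what it's worth, here is a small sketch of the quoting rules in question. The variable names are just for illustration.

```powershell
# Inside double quotes, a backtick escapes the next character.
$doubleQuoted = "xmlns:xsi=`"http://www.w3.org/2001/XMLSchema-instance`""

# Inside single quotes nothing is expanded, so embedded double quotes
# need no escaping at all.
$singleQuoted = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

# Both literals produce the same string.
$doubleQuoted -eq $singleQuoted    # True

# A backtick before $ stops variable expansion in a double-quoted string.
$name = 'World'
"`$name is set to $name"           # $name is set to World
```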

M Aguilar