I'm concatenating data files, but the new file has some extra bytes at the points where the files are joined. I thought this might be an encoding problem.

Here are the methods I've tried for concatenating the files. With the first one I get extra 0xA0 0x00 bytes.

    Dim inputfiles() As String = Directory.GetFiles(sourcedir, pattern)

    Dim bufSize As Integer = 1024 * 64
    Dim buf As Byte() = New Byte(bufSize) {}

    For Each inputfile As String In inputfiles

        Using fs As New FileStream(inputfile, FileMode.Open, FileAccess.Read)
            Dim arrfile() As Byte = New Byte(fs.Length) {}
            fs.Read(arrfile, 0, arrfile.Length)
            fs.Close()

            Using fo As New FileStream(outfilename, FileMode.Append, FileAccess.Write)
                Using bw As New BinaryWriter(fo)
                    bw.Write(arrfile, 0, arrfile.Length)
                    bw.Close()
                    fo.Close()
                End Using
            End Using

        End Using
    Next

With the second one I get only the 0xA0 byte.

    For Each inputfile As String In inputfiles
        Using fs As New FileStream(inputfile, FileMode.Open, FileAccess.Read)
            Using sr As New StreamReader(fs, Encoding.ASCII)
                While Not sr.EndOfStream
                    Using fo As New FileStream(outfilename, FileMode.Append, FileAccess.Write)
                        Using sw As New StreamWriter(fo, Encoding.ASCII)
                            sw.Write(sr.ReadToEnd)
                            sw.Close()
                            fo.Close()
                        End Using
                    End Using
                End While
            End Using
        End Using
    Next

Thanks for the help in advance.

+2  A: 

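0x0A 0x00 is the UTF-16 (little-endian) encoding of a line feed character; 0xA0 0x00, as written in the question, would instead be a no-break space (U+00A0). .NET strings are UTF-16 internally, whereas the second snippet reads and writes ASCII.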

In your first code snippet, the BinaryWriter supports writing strings in a specific encoding.

BinaryWriter writer = new BinaryWriter(stream, Encoding.ASCII);
Mitch Wheat
A: 

Just a shot in the dark here but if those files are actually encoded as UTF-8/16/32 (rather than ASCII), you might be seeing the UTF BOM (Byte Order Mark) between them.

Try changing your encoding to UTF-8 and, if the files are text, specify an encoding when reading them.

Note: UTF-8 is a superset of ASCII, so it would be a better way to read them anyway.
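
One way to test the BOM theory is to look at the first few bytes of each input file. The following is a minimal sketch, not a definitive detector: `BomSketch`, `DetectBom`, `StartsWith` and `DetectFileBom` are hypothetical names introduced here, and it assumes a single `Read` call returns the first four bytes of a local file when they exist.

```vbnet
Imports System
Imports System.IO

Module BomSketch

    ' Hypothetical helper: names the BOM at the start of a byte sequence, or "none".
    ' The UTF-32 LE check comes first because its BOM begins with the UTF-16 LE one.
    Function DetectBom(ByVal head() As Byte) As String
        If StartsWith(head, New Byte() {&HFF, &HFE, 0, 0}) Then Return "UTF-32 LE"
        If StartsWith(head, New Byte() {0, 0, &HFE, &HFF}) Then Return "UTF-32 BE"
        If StartsWith(head, New Byte() {&HEF, &HBB, &HBF}) Then Return "UTF-8"
        If StartsWith(head, New Byte() {&HFF, &HFE}) Then Return "UTF-16 LE"
        If StartsWith(head, New Byte() {&HFE, &HFF}) Then Return "UTF-16 BE"
        Return "none"
    End Function

    ' True when data begins with every byte of prefix, in order.
    Private Function StartsWith(ByVal data() As Byte, ByVal prefix() As Byte) As Boolean
        If data.Length < prefix.Length Then Return False
        For i As Integer = 0 To prefix.Length - 1
            If data(i) <> prefix(i) Then Return False
        Next
        Return True
    End Function

    ' Reads up to the first four bytes of a file and reports its BOM.
    Function DetectFileBom(ByVal fileName As String) As String
        Using fs As New FileStream(fileName, FileMode.Open, FileAccess.Read)
            Dim head(3) As Byte                  ' four elements, indices 0..3
            Dim count As Integer = fs.Read(head, 0, head.Length)
            Array.Resize(head, count)            ' keep only the bytes actually read
            Return DetectBom(head)
        End Using
    End Function

End Module
```

If every input file reports "none", a BOM is probably not the source of the extra bytes, and the copying code itself is the next thing to check.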

McKAMEY
A: 

Why are you using BinaryWriter at all? You can just write directly to the stream.

A few general comments:

  • You don't need to explicitly close the streams etc if you're using a Using statement
  • If you're copying binary files you definitely don't want to treat them as text. Stay away from TextReader/TextWriters.
  • When you're copying a stream you should generally loop round reading a block at a time and writing it out, taking note of the result of Stream.Read. That means you don't end up relying on:
    • The file length staying the same
    • All the data being read in one go
    • Having enough memory to read it all in one go in the first place
  • Why are you reopening the output stream several times? Just open it once and keep writing to it.
  • How exactly are you determining the contents of the input and output file? Are you using a hex editor? I wonder whether the "extra" bytes are actually in the input file, but you just didn't notice them if you were looking at the files with a text editor.

Here's the VB version of a method I find useful:

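Public Shared Sub CopyStream(ByVal input As Stream, ByVal output As Stream)
    ' 8 KB buffer: the upper bound is &H2000 - 1, so the array has &H2000 elements.
    Dim buffer As Byte() = New Byte(&H2000 - 1) {}
    ' VB has no assignment expressions, so read once before the loop
    ' and again at the end of each pass.
    Dim num As Integer = input.Read(buffer, 0, buffer.Length)
    Do While num > 0
        ' Write only the bytes that were actually read.
        output.Write(buffer, 0, num)
        num = input.Read(buffer, 0, buffer.Length)
    Loop
End Sub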

Call that several times, one per input file, but with the same output stream each time. (Don't close it between calls, obviously.)
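
As a rough illustration of that calling pattern, here is a minimal sketch that opens the output once and copies each input into it. It repeats the block-copy loop so that it stands alone; the module name `ConcatSketch` is hypothetical, and `ConcatFiles` only borrows the parameter names from the question. It assumes the output file does not match `pattern` inside `sourcedir`, and it uses `FileMode.Create` rather than the question's `FileMode.Append`, so an existing output file is overwritten.

```vbnet
Imports System
Imports System.IO

Module ConcatSketch

    ' Copies everything from input to output one block at a time.
    Sub CopyStream(ByVal input As Stream, ByVal output As Stream)
        Dim buffer(&H2000 - 1) As Byte   ' 8192 elements (the upper bound is 8191)
        Dim count As Integer = input.Read(buffer, 0, buffer.Length)
        Do While count > 0
            output.Write(buffer, 0, count)
            count = input.Read(buffer, 0, buffer.Length)
        Loop
    End Sub

    ' Concatenates every file in sourcedir that matches pattern into outfilename.
    Sub ConcatFiles(ByVal sourcedir As String, ByVal outfilename As String, ByVal pattern As String)
        Dim inputfiles() As String = Directory.GetFiles(sourcedir, pattern)
        ' GetFiles does not guarantee an order, so sort for a predictable result.
        Array.Sort(inputfiles, StringComparer.Ordinal)

        ' Open the output once and keep it open for all inputs.
        Using fo As New FileStream(outfilename, FileMode.Create, FileAccess.Write)
            For Each inputfile As String In inputfiles
                Using fs As New FileStream(inputfile, FileMode.Open, FileAccess.Read)
                    CopyStream(fs, fo)
                End Using
            Next
        End Using
    End Sub

End Module
```

Because nothing is sized from the file length, this version never allocates a whole file in memory and does not depend on a single `Read` returning everything.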

Jon Skeet
I'm using BeyondCompare3 and V TheFileViewer to view the files. You are correct; I have had problems with these methods hitting the 2 GB file limit.
Tom Alderman
A: 

The bytes ended up being at the end of each file.

This might be a hack, but here is the solution I came up with.

Because I got two extra bytes every time I added a file, I subtracted 2 from the length of the new byte array.

Private Sub ConcatFiles(ByVal sourcedir As String, ByVal outfilename As String, ByVal pattern As String)

    Dim inputfiles() As String = Directory.GetFiles(sourcedir, pattern)
    Dim bufSize As Integer = 1024 * 64
    Dim buf As Byte() = New Byte(bufSize) {}

    Using fo As New FileStream(outfilename, FileMode.Append, FileAccess.Write)

        For Each inputfile As String In inputfiles

            Using fs As New FileStream(inputfile, FileMode.Open, FileAccess.Read)
                Dim arrfile() As Byte = New Byte(fs.Length - 2) {}
                fs.Read(arrfile, 0, arrfile.Length)
                fo.Write(arrfile, 0, arrfile.Length)
            End Using

        Next

    End Using

End Sub
Tom Alderman
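
A possible explanation for at least one of the extra bytes, offered as a sketch and not as a confirmed diagnosis of these particular files: in VB.NET the number in an array creation expression is the upper bound, not the element count. `New Byte(fs.Length) {}` therefore allocates `fs.Length + 1` elements, and writing the whole array appends one trailing zero byte per file. By the same rule, `New Byte(fs.Length - 2) {}` holds `fs.Length - 1` elements, so the workaround above would drop the last byte of each input file. The module and function names below are hypothetical.

```vbnet
Imports System

Module ArrayBoundSketch

    ' Returns how many elements VB allocates for a given upper bound.
    Function ElementCount(ByVal upperBound As Integer) As Integer
        Dim arr() As Byte = New Byte(upperBound) {}
        Return arr.Length
    End Function

    Sub Main()
        Dim fileLength As Integer = 4   ' stand-in for fs.Length

        ' Upper bound 4 means indices 0..4, which is five elements.
        Console.WriteLine(ElementCount(fileLength))       ' prints 5

        ' Upper bound 3 means indices 0..3, which matches the file length.
        Console.WriteLine(ElementCount(fileLength - 1))   ' prints 4

        ' Upper bound 2 means indices 0..2, one element short of the file.
        Console.WriteLine(ElementCount(fileLength - 2))   ' prints 3
    End Sub

End Module
```

Under that reading, an array meant to hold a whole file would be declared with an upper bound of the length minus 1, although copying in fixed-size blocks avoids the question entirely.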