UPDATED BELOW

I am reading a Binary file using BinaryReader in VB.NET. The structure of each row in the file is:

    "Category" = 1 byte
    "Code" = 1 byte
    "Text" = 60 Bytes

    Dim Category As Byte
    Dim Code As Byte
    Dim byText() As Byte
    Dim chText() As Char
    Dim br As New BinaryReader(fs)

    Category = br.ReadByte()
    Code = br.ReadByte()
    byText = br.ReadBytes(60)
    chText = encASCII.GetChars(byText)

The problem is that the "Text" field has some funky characters used for padding. Mostly seems to be 0x00 null characters.

  1. Is there any way to get rid of these 0x00 characters by some Encoding?

  2. Otherwise, how can I do a replace on the chText array to get rid of the 0x00 characters? I am trying to serialize the resulting DataTable to XML, and it fails on these non-compliant characters. I am able to loop through the array, but I cannot figure out how to do the replace.

UPDATE:

This is where I am at, with a lot of help from the guys/gals below. The first solution works but is not as flexible as I had hoped; the second one fails for one use case but is much more generic.

Ad 1) I can solve the issue by passing the string to this subroutine

    Public Function StripBad(ByVal InString As String) As String
        Dim sb As New System.Text.StringBuilder(InString.Length)
        For Each ch As Char In InString
            ' Replace every character up to and including 0x25 with a space.
            If AscW(ch) <= &H25 Then
                sb.Append(" "c)
            Else
                sb.Append(ch)
            End If
        Next

        Return sb.ToString()
    End Function

Ad 2) This routine takes out several offending characters, but fails for 0x00. It was adapted from MSDN, http://msdn.microsoft.com/en-us/library/kdcak6ye.aspx.

    Public Function StripBadwithConvert(ByVal InString As String) As String
        Dim unicodeString As String
        unicodeString = InString
        ' Create two different encodings.
        Dim ascii As Encoding = Encoding.ASCII
        Dim [unicode] As Encoding = Encoding.UTF8

        ' Convert the string into a byte[].
        Dim unicodeBytes As Byte() = [unicode].GetBytes(unicodeString)

        Dim asciiBytes As Byte() = Encoding.Convert([unicode], ascii, unicodeBytes)

        Dim asciiChars(ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length) - 1) As Char
        ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0)
        Dim asciiString As New String(asciiChars)

        Return asciiString
    End Function
A: 

If the null characters are used as right padding (i.e. terminating) the text, which would be the normal case, this is fairly easy:

Dim strText As String = encASCII.GetString(byText)
Dim strlen As Integer = strText.IndexOf(Chr(0))
If strlen <> -1 Then
    ' Keep everything before the first null character.
    strText = strText.Substring(0, strlen)
End If

If not, you can still do a normal Replace on the string. It would be slightly “cleaner” if you did the pruning in the byte array, before converting it to a string. The principle remains the same, though.

' CByte makes the search compare bytes rather than a boxed Integer.
Dim strlen As Integer = Array.IndexOf(byText, CByte(0))
If strlen = -1 Then
    strlen = byText.Length
End If
Dim strText = encASCII.GetString(byText, 0, strlen)
Konrad Rudolph
Thanks for the effort, however this does not seem to work for me (I tried the second code listing). The 00 characters are not just at the end of the field. When looking in a hex editor, I see 00 in the place of the bad characters. They are interspersed in several spots through the string: "20 43 68 61 72 67 65 00 00 00 00 00 00 67 65 00 00 00". I used your code and the characters remained.
Paul
unknown, are you sure it was written as ASCII?
Henk Holterman
Well, in that case you can simply use `String.Replace`. However, Henk is right: your data most probably isn’t ASCII-encoded in the first place. You should definitely try to get more information on the input data.
Konrad Rudolph
Konrad, I was able to do a `strText.Replace(Chr(0), " ")` to get rid of most of the offending characters. However, I am now stuck with a single bad character: "error: illegal character 0x19". It does not go away with chr(19). Any other suggestions?
Paul
Henk, the file is a binary file that I am trying to load into a database. I want to strip out any binary characters and load just the plain ASCII text (well, at least in the text fields). The fields are fixed width, however the text fields seem to contain some binary garbage and also the aforementioned null padding, and I want to get rid of that. I had been using Char.IsLetterOrDigit() to weed out bad characters, however that is too general and takes out symbols that I need to keep in the text, so now I am trying to replace only the bad chars individually.
Paul
unknown, if the text (part) was written as UTF-8 then those binaries aren't garbage but escape codes.
Henk Holterman
Henk, I believe you are correct that the input is UTF. If that is the case, how can I get rid of those escape characters? I need to write to XML, and those escape codes are not valid in XML and are of no use to me for my application of the data. This code gets rid of all of them except for 0x19: `Dim ascii As Encoding = Encoding.ASCII` `Dim [unicode] As Encoding = Encoding.Unicode` `Dim asciiBytes As Byte() = Encoding.Convert([unicode], ascii, unicodeBytes)`
Paul
Konrad, the input data was created by user input in a legacy application where anyone could have posted data to these text fields, so they could essentially contain any characters.
Paul
+3  A: 

First of all you should find out what the format for the text is, so that you are not just blindly removing something without knowing what you hit.

Depending on the format, you use different methods to remove the characters.

To remove only the zero characters:

Dim len As Integer = 0
For pos As Integer = 0 To byText.Length - 1
   If byText(pos) <> 0 Then
      byText(len) = byText(pos)
      len += 1
   End If
Next
strText = Encoding.ASCII.GetString(byText, 0, len)

To remove everything from the first zero character to the end of the array:

Dim len As Integer
While len < byText.Length AndAlso byText(len) <> 0
   len += 1
End While
strText = Encoding.ASCII.GetString(byText, 0, len)

Edit:
If you just want to keep any junk that happens to be ASCII characters:

Dim len As Integer = 0
For pos As Integer = 0 To byText.Length - 1
   ' Keep printable ASCII only (32 to 126); 127 is the DEL control character.
   If byText(pos) >= 32 AndAlso byText(pos) <= 126 Then
      byText(len) = byText(pos)
      len += 1
   End If
Next
strText = Encoding.ASCII.GetString(byText, 0, len)
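
Since the stated goal is XML serialization, a variation on the loop above is to test each character for XML validity instead of a fixed ASCII range. The following is only a sketch: the module and function names are made up for illustration, and it assumes .NET 4.0 or later, where `System.Xml.XmlConvert.IsXmlChar` is available. It also assumes the field really is single-byte text; bytes above 127 are turned into "?" by `Encoding.ASCII`.

```vbnet
Imports System.Text
Imports System.Xml

Module XmlTextCleaner
    ' Hypothetical helper: decodes the bytes as ASCII and keeps only
    ' the characters that XML 1.0 allows, dropping everything else.
    Public Function CleanForXml(ByVal raw As Byte()) As String
        Dim decoded As String = Encoding.ASCII.GetString(raw)
        Dim sb As New StringBuilder(decoded.Length)
        For Each ch As Char In decoded
            ' IsXmlChar rejects control characters such as 0x00 and 0x19,
            ' but accepts tab, line feed and carriage return.
            If XmlConvert.IsXmlChar(ch) Then
                sb.Append(ch)
            End If
        Next
        Return sb.ToString()
    End Function
End Module
```

Unlike the range check, this keeps tabs and line breaks, which may or may not be wanted in a fixed-width text field.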
Guffa
Guffa, I am looking to keep only valid ASCII characters. There is no rhyme or reason to what characters are in there, because the legacy app allowed users to cut and paste into that field, and some were copying in Word docs, etc. I need to serialize to XML, so I believe that I need to be valid ASCII.
Paul
I see. I added another option above that might be useful.
Guffa
Guffa, this last bit did the trick. Thank you, and thanks to all who helped.
Paul
A: 

You can use a struct to load the data:

// Sequential layout with Pack = 1 mirrors the 62-byte record with no padding.
[System.Runtime.InteropServices.StructLayout(System.Runtime.InteropServices.LayoutKind.Sequential, Pack = 1, CharSet = System.Runtime.InteropServices.CharSet.Ansi)]
internal struct TextFileRecord
{
    public byte Category;
    public byte Code;
    // ByValTStr marshals a fixed-length character array stored inline in the record.
    [System.Runtime.InteropServices.MarshalAs(System.Runtime.InteropServices.UnmanagedType.ByValTStr, SizeConst = 60)]
    public string Text;
}

You have to adjust the marshalling attributes to fit your string encoding.

PVitt