I have an ASP.NET (VB.NET) application with a textbox into which users paste text from Microsoft Word. As a result, characters such as the long dash (character code 150), smart quotes, and accented characters come through as input. My app encodes the input as XML and passes it to the database as an XML parameter to a SQL stored procedure, so it is inserted into the database exactly as the user entered it.

The problem is that the app that reads this data doesn't like these characters, so I need to translate them into the lower ASCII (7-bit, I think) character set. How do I do that? How do I determine what encoding they are in, so that I can do something like the code below? And would simply requesting the ASCII equivalent translate them intelligently, or do I have to write some code for that?

It might also be easier to solve this problem in the web page to begin with. When you copy a selection from Word, it puts several formats on the clipboard, and the plain-text one is the one I want. Is there a way to have the HTML textbox take that format when the user pastes into it? Do I have to set the encoding of the web page somehow?

System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text))

Code from the app that encodes the input into XML:

   Protected Function RequestStringItem( _
      ByVal strName As System.String) As System.String

      Dim strValue As System.String

      strValue = Me.Request.Item(strName)
      If Not (strValue Is Nothing) Then
         RequestStringItem = strValue.Trim()
      Else
         RequestStringItem = ""
      End If

   End Function

     ' I get the input from the textboxes into an array like this
     m_arrInsertDesc(intIndex) = RequestStringItem("txtInsertDesc" & strValue)
     m_arrInsertFolder(intIndex) = RequestInt32Item("cboInsertFolder" & strValue)

  ' create xml file for inserts
  strmInsertList = New System.IO.MemoryStream()
  wrtInsertList = New System.Xml.XmlTextWriter(strmInsertList, System.Text.Encoding.Unicode)

  ' start document and add root element
  wrtInsertList.WriteStartDocument()
  wrtInsertList.WriteStartElement("Root")

  ' cycle through inserts
  For intIndex = 0 To m_intInsertCount - 1

     ' if there is an insert description
     If m_arrInsertDesc(intIndex).Length > 0 Then

        ' if the insert description is of the appropriate length
        If m_arrInsertDesc(intIndex).Length <= 96 Then

           ' add element to xml
           wrtInsertList.WriteStartElement("Insert")
           wrtInsertList.WriteAttributeString("insertdesc", m_arrInsertDesc(intIndex))
           wrtInsertList.WriteAttributeString("insertfolder", m_arrInsertFolder(intIndex).ToString())
           wrtInsertList.WriteEndElement()

        ' if insert description is too long
        Else

           m_strError = "ERROR: INSERT DESCRIPTION TOO LONG"
           Exit Function

        End If

     End If

  Next

  ' close root element and document
  wrtInsertList.WriteEndElement()
  wrtInsertList.WriteEndDocument()
  wrtInsertList.Close()

  ' when I add the xml as a parameter to the stored procedure I do this
  cmdAddRequest.Parameters.Add("@insert_list", OdbcType.NText).Value = System.Text.Encoding.Unicode.GetString(strmInsertList.ToArray())
A: 

How big is the range of these input characters? If it is 256 (that is, each character fits into a single byte), it wouldn't be hard to implement a 256-entry lookup table. I haven't toyed with BASIC in years, but basically you'd DIM an array of 256 bytes and fill it with translated values: the entry for 'a' would hold 'a' (since it's OK as is), while entry 150 would hold a hyphen.
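
A minimal sketch of such a lookup table in VB.NET, assuming the input has already been turned into single bytes in the Windows-1252 code page (where 150 is the en dash). The module name, the helper names, and the choice of "?" for unmapped values are made up for illustration; extend the table with whatever mappings you need.

```vbnet
Imports System.Text

Module AsciiLookup

    ' Hypothetical helper: builds a 256-entry table indexed by byte value.
    ' Assumes the byte values follow Windows-1252.
    Private Function BuildLookup() As Char()
        Dim map(255) As Char
        For i As Integer = 0 To 255
            ' Plain 7-bit ASCII maps to itself; everything else defaults to "?".
            map(i) = If(i < 128, ChrW(i), "?"c)
        Next
        map(145) = "'"c   ' left single smart quote
        map(146) = "'"c   ' right single smart quote
        map(147) = """"c  ' left double smart quote
        map(148) = """"c  ' right double smart quote
        map(150) = "-"c   ' en dash
        map(151) = "-"c   ' em dash
        Return map
    End Function

    Private ReadOnly Lookup As Char() = BuildLookup()

    ' Translates each input byte through the table.
    Public Function ToAscii(ByVal bytes As Byte()) As String
        Dim sb As New StringBuilder(bytes.Length)
        For Each b As Byte In bytes
            sb.Append(Lookup(b))
        Next
        Return sb.ToString()
    End Function

End Module
```

One way to get the bytes from a .NET string would be `System.Text.Encoding.GetEncoding(1252).GetBytes(text)`, assuming that code page is available on your runtime.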

Arthur Kalliokoski
A: 

I tried

System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text))

But what I got was question marks instead of an intelligent translation. That is, the long dash should become a regular dash and smart quotes should become regular quotes.
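
The question marks are expected as far as I can tell: ASCII has no slot for any byte above 127, so the ASCII decoder substitutes "?" rather than picking a look-alike. Below is a rough sketch of doing the translation by hand, assuming the string holds real Unicode characters (for example U+2013 for the en dash). `FoldToAscii` is a made-up name, and the list of replacements is only a starting point.

```vbnet
Imports System.Globalization
Imports System.Text

Module AsciiFolding

    ' Hypothetical helper: folds common Word punctuation and accented
    ' letters down to 7-bit ASCII.
    Public Function FoldToAscii(ByVal text As String) As String
        ' Explicit replacements for punctuation that has no decomposition.
        Dim s As String = text
        s = s.Replace(ChrW(&H2018), "'"c).Replace(ChrW(&H2019), "'"c)    ' single smart quotes
        s = s.Replace(ChrW(&H201C), """"c).Replace(ChrW(&H201D), """"c)  ' double smart quotes
        s = s.Replace(ChrW(&H2013), "-"c).Replace(ChrW(&H2014), "-"c)    ' en and em dashes

        ' FormD splits an accented letter into a base letter plus a
        ' combining mark; the combining marks are then skipped.
        Dim sb As New StringBuilder()
        For Each ch As Char In s.Normalize(NormalizationForm.FormD)
            If CharUnicodeInfo.GetUnicodeCategory(ch) <> UnicodeCategory.NonSpacingMark Then
                ' Anything still outside ASCII becomes "?".
                sb.Append(If(AscW(ch) < 128, ch, "?"c))
            End If
        Next
        Return sb.ToString()
    End Function

End Module
```

This drops accents (so "é" ends up as "e"), which may or may not be acceptable for the app that reads the data.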

Will Rickards
A: 

The following seems to work for converting the long dash to a short dash and smart quotes to regular quotes, given that my HTML pages declare the content type shown below. However, it converts all the accented characters to question marks, which is not what the text version of the clipboard contains. So I'm closer; I just think I have the target encoding wrong.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(m_arrFolderDesc(intIndex)))

Edit: I found the correct target encoding for my purposes, which is 1252.

System.Text.Encoding.GetEncoding(1252).GetString(System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(m_arrFolderDesc(intIndex)))
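
For anyone reading later, here is the same round trip wrapped in a function, as a sketch only. One case where it clearly helps is when the request was decoded as ISO-8859-1 although the browser really sent Windows-1252 bytes, so the string holds control characters such as U+0096 where punctuation should be. Whether that is what happened in my app is an assumption; `RepairCp1252` is a made-up name, and code page 1252 may need extra setup on runtimes other than the .NET Framework.

```vbnet
Imports System.Text

Module Latin1RoundTrip

    ' Hypothetical helper: re-reads ISO-8859-1 text as Windows-1252.
    Public Function RepairCp1252(ByVal text As String) As String
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)

        ' ISO-8859-1 maps U+0000..U+00FF straight to bytes 0..255,
        ' so this recovers the raw byte values for those characters.
        Dim raw As Byte() = latin1.GetBytes(text)

        ' Windows-1252 assigns punctuation to bytes 128..159,
        ' for example byte 150 decodes to the en dash U+2013.
        Return cp1252.GetString(raw)
    End Function

End Module
```

Note that the result is still not 7-bit ASCII: accented letters survive as 8-bit characters.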
Will Rickards
+1  A: 

If you convert to a non-Unicode character set, you will lose some characters in the process. If the legacy app reading the data doesn't need to do any string transformations, you might want to consider using UTF-7 and converting back once the data returns to the Unicode world; this preserves all special characters.
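
A small sketch of that round trip, assuming `System.Text.Encoding.UTF7` is usable on your runtime (it exists in the .NET Framework but is marked obsolete in newer .NET versions). The module and function names are made up for illustration.

```vbnet
Imports System.Text

Module Utf7RoundTrip

    ' Hypothetical helper: encodes any .NET string as 7-bit-safe UTF-7 text.
    Public Function ToUtf7Text(ByVal text As String) As String
        Dim bytes As Byte() = Encoding.UTF7.GetBytes(text)
        ' UTF-7 output only uses byte values below 128,
        ' so decoding it as ASCII loses nothing.
        Return Encoding.ASCII.GetString(bytes)
    End Function

    ' Hypothetical helper: reverses ToUtf7Text.
    Public Function FromUtf7Text(ByVal utf7Text As String) As String
        Return Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(utf7Text))
    End Function

End Module
```

Non-ASCII characters show up in the stored text as base64 runs introduced by "+", so the legacy app would display them as gibberish unless it only passes the text through unchanged.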

bdonlan