I have a PHP script (running on a Linux server) that outputs the names of some files on the server. It outputs these file names in a simple text-only format.
This output is read from a VB.NET program by using HttpWebRequest, HttpWebResponse, and a StreamReader.
The problem is that some of the file names being output contain... unusual characters. Specifically, the "section" symbol (§).
If I view the output of the PHP script in a web browser, the symbol appears fine.
But when I read the output of the PHP script into my .NET program, the symbol doesn't appear correctly (it appears as a generic "block" symbol).
I've tried all the different character encoding options that you can use when reading the response stream (from the HttpWebResponse). I've tried outputting the stream directly to a text file (no good), displaying it in a TextBox (no good), and even when viewing the results directly in the Visual Studio debugger, the character appears as a block instead of as the "section" symbol.
I've examined the output in a hex editor (as suggested by a related question, "how do you troubleshoot character encoding problems").
When I write out the section symbol (§) from .NET itself, the hex bytes I see representing it are "c2 a7" (makes sense if it's unicode, right? requires two bytes?). When I write out the output from the PHP script directly to a file and examine that with a hex editor, the symbol shows up as "ef bf bd" - three bytes instead of two?
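A quick way to check what those two byte sequences represent is to decode them as UTF-8 and look up the resulting code points. The sketch below uses Python only because it is convenient for poking at raw bytes; it is not part of the PHP/.NET setup described here, and `describe` is a made-up helper name.

```python
import unicodedata

def describe(raw: bytes) -> list[str]:
    """Decode raw bytes as UTF-8 and name each resulting code point."""
    text = raw.decode("utf-8")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

section_bytes = bytes.fromhex("c2a7")    # what .NET wrote for the symbol
mystery_bytes = bytes.fromhex("efbfbd")  # what showed up from the PHP output

print(describe(section_bytes))  # ['U+00A7 SECTION SIGN']
print(describe(mystery_bytes))  # ['U+FFFD REPLACEMENT CHARACTER']
```

If this is right, "c2 a7" is simply the two-byte UTF-8 form of U+00A7, while "ef bf bd" is the three-byte UTF-8 form of U+FFFD, the Unicode replacement character. That would suggest the symbol had already been replaced somewhere before those bytes were written, although it does not show where.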
I'm at a loss as to what to do - if I need to specify some other character encoding, or if I'm missing something obvious about this.
Here's the code that's used to get the output of the PHP script:
Imports System.IO
Imports System.Net

' request the output of the PHP script
Dim myRequest As HttpWebRequest = DirectCast(WebRequest.Create("http://www.example.com/sample.php"), HttpWebRequest)
Dim myResponse As HttpWebResponse = DirectCast(myRequest.GetResponse(), HttpWebResponse)
' read the response stream (StreamReader assumes UTF-8 when no encoding is passed)
Dim myReader As New StreamReader(myResponse.GetResponseStream())
' read the entire output in one block (just as an example)
Dim theOutput As String = myReader.ReadToEnd()
Any ideas?
- Am I using the wrong kind of StreamReader? (I've tried passing the character encoding in the call to create the new StreamReader - I've tried all the ones that are in System.Text.Encoding - UTF-8, UTF-7, ASCII, UTF-32, Unicode, etc.)
- Should I be using a different method for reading the output of the PHP script?
- Is there something I should be doing differently on the PHP side when outputting the text?
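One possible reason why switching encodings on the reader makes no difference can be sketched as follows. This is a rough Python illustration, not the .NET code, and it rests on an assumption: that the bytes arriving for the symbol really are "ef bf bd". The codec names are Python's approximate counterparts of some of the System.Text.Encoding options, and `try_decodings` is a hypothetical helper.

```python
# Assumption: these are the bytes actually received for the symbol.
received = bytes.fromhex("efbfbd")

# Python codec names roughly matching some System.Text.Encoding choices.
codecs_tried = ["utf-8", "ascii", "utf-16-le", "utf-32-le", "latin-1"]

def try_decodings(raw: bytes, codecs: list[str]) -> dict[str, str]:
    """Decode raw with each codec, substituting U+FFFD for undecodable bytes."""
    return {name: raw.decode(name, errors="replace") for name in codecs}

results = try_decodings(received, codecs_tried)
for name, text in results.items():
    print(f"{name:10} -> {text!r}  contains section sign: {'§' in text}")
```

Under that assumption, none of the decoders yields "§", because the original character is no longer present in the bytes. If so, the fix would have to happen before the bytes are produced rather than in the choice of reader encoding.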
UPDATED INFO:
- The output from PHP is explicitly encoded as UTF-8 by calling:
utf8_encode($file);
- When I wrote out the symbol from .NET, I copied and pasted the symbol from the Character Map app in Windows. I also copied & pasted it directly from the file's name (in Windows) and from this web page itself - all gave the same hex value when written out (c2 a7).
- Yes, the "section symbol" I'm talking about is U+00A7 (ALT+0167 on Windows, according to Character Map).
- The content-type is set explicitly via
header('Content-Type: text/html; charset=utf-8');
right at the beginning of the PHP script.
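A note on the utf8_encode call above: PHP documents it as converting ISO-8859-1 input to UTF-8, so its result depends on how the file name bytes are stored on the server. The Python sketch below mimics that conversion for both cases. `mimic_utf8_encode` is a made-up stand-in, not a PHP function, and how the names are actually stored on this server is not known.

```python
def mimic_utf8_encode(raw: bytes) -> bytes:
    """Hypothetical stand-in for PHP's utf8_encode: treat every input byte
    as an ISO-8859-1 character and re-encode the result as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

latin1_name = "§".encode("latin-1")  # a7: symbol stored as ISO-8859-1
utf8_name = "§".encode("utf-8")      # c2 a7: symbol already stored as UTF-8

print(mimic_utf8_encode(latin1_name).hex(" "))  # c2 a7
print(mimic_utf8_encode(utf8_name).hex(" "))    # c3 82 c2 a7
```

In this model, a Latin-1 name comes out as the expected "c2 a7", and a name that is already UTF-8 comes out double-encoded as "c3 82 c2 a7". Neither case produces "ef bf bd", so if the sketch is accurate, utf8_encode on its own would not account for the bytes seen in the hex editor.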
UPDATE:
Figured it out myself, but I couldn't have done it without the help from the people who answered. Thank you!