utf-8

What is "ANSI as UTF-8" and how can I make fputcsv() generate UTF-8 w/BOM?

I made a PHP script that generates CSV files that were previously generated by another process. The CSV files then have to be imported by yet another process. The import of the old CSV files works fine, but when importing the new CSV files there are issues with special characters. When I open old CSVs with Notepad++, it says t...

What happens if I connect to a utf8 MySQL DB table using latin1?

Interesting question... if I have a MySQL table with CHARSET=utf8, and I open a connection with latin1 encoding, what happens? I tried this, and even characters such as ß and æ could be stored and retrieved properly. Those characters are represented with different byte sequences in utf8 and in latin1, so I didn't expect it to work. Is ...
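
One way to reason about this is to simulate the transcoding step in Python. This is only a sketch under stated assumptions: the server is assumed to convert between the connection character set and the column character set on every insert and select, MySQL's latin1 is approximated by Python's `cp1252` codec, and `server_store` / `server_fetch` are hypothetical helper names, not MySQL APIs.

```python
# Assumption: MySQL's "latin1" behaves like Windows-1252 for these characters.
CONNECTION = "cp1252"
COLUMN = "utf-8"


def server_store(client_bytes: bytes) -> bytes:
    """Simulate an INSERT: interpret the bytes as the connection charset,
    then re-encode them in the column charset."""
    return client_bytes.decode(CONNECTION).encode(COLUMN)


def server_fetch(stored: bytes) -> bytes:
    """Simulate a SELECT: convert the stored bytes back to the connection charset."""
    return stored.decode(COLUMN).encode(CONNECTION)


# Case 1: the client really sends latin1 bytes, so the column holds proper UTF-8.
latin1_bytes = "ßæ".encode(CONNECTION)
stored_ok = server_store(latin1_bytes)

# Case 2: the client sends UTF-8 bytes over a connection labelled latin1.
# The column then holds double-encoded text, but the round trip still
# hands back exactly the bytes that were sent.
utf8_bytes = "ßæ".encode("utf-8")
stored_double = server_store(utf8_bytes)
```

In both cases the client gets back what it sent, which would explain why the test appears to work. In the second case the stored data is only correct when read back through a connection with the same mislabelled charset.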

Encoding problem (UTF-8) in PHP

Hello! I want to output the following string in PHP: ä ö ü ß € Therefore, I've encoded it to utf8 manually: ä ö ü ß € So my script is: <?php header('content-type: text/html; charset=utf-8'); echo 'ä ö ü ß €'; ?> The first 4 characters are correct (ä ö ü ß) but unfortunately the € sign isn't correct: ä ö ü ß € Here you...

Java modified UTF-8 strings in Python

Hello, I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which is not supported by Python (yet, at least). Can anybody point me in the right direction to construct java modified utf-8 strings in py...
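
A minimal sketch of such an encoder, assuming the format described in the Java `DataInput` documentation: U+0000 is written as the two bytes C0 80, characters outside the Basic Multilingual Plane are written as two separately encoded surrogates, and `readUTF()` expects a 2-byte big-endian length prefix. The function names `encode_modified_utf8` and `write_utf` are hypothetical, chosen for this example.

```python
import struct


def encode_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 does (no length prefix)."""
    out = bytearray()
    # Java strings are sequences of 16-bit code units, so walk UTF-16 units.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if 0x0001 <= unit <= 0x007F:
            # Plain ASCII except NUL: one byte.
            out.append(unit)
        elif unit <= 0x07FF:
            # Two bytes; this branch also handles U+0000, giving C0 80.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes; surrogates are encoded individually.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def write_utf(text: str) -> bytes:
    """Frame the payload with the 2-byte length that readUTF() reads first."""
    payload = encode_modified_utf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError("encoded string too long for a 2-byte length prefix")
    return struct.pack(">H", len(payload)) + payload
```

For example, `write_utf("hi")` yields `b"\x00\x02hi"`. For strings that contain neither NUL nor supplementary characters, the payload is identical to standard UTF-8.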

Handling special characters in XML when transforming with Saxon

I'm attempting to apply a stylesheet to an XML document using Saxon. Given an XML file that was generated in Microsoft Word and that has Microsoft Word-style quotes, such as around FOO in the following document <?xml version="1.0" encoding="UTF-8"?> <doc> <act> <performer typeCode=“FOO“ /> <performer typeCode="BAR" /> ...

Guessing UTF-8 encoding

I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu. Suppose I do echo "t" > test.txt. If I then run file test.txt, I get test.txt: ASCII text. If I then do echo "å" > test.txt, I get test.txt: UTF-8 Unicode text. How does that happen? How does file...
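
The kind of guess involved can be sketched in a few lines of Python. This is an illustration of the general heuristic (look at the bytes and see which encodings they are valid in), not the actual implementation of the `file` utility; `guess_encoding` is a hypothetical name.

```python
def guess_encoding(data: bytes) -> str:
    """Classify bytes as ASCII, UTF-8, or neither, based only on validity."""
    # Every byte below 0x80 means the data is valid ASCII (and also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ASCII"
    # Otherwise, check whether the high bytes form well-formed UTF-8 sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown 8-bit"
    return "UTF-8"


print(guess_encoding(b"t\n"))                # bytes written by: echo "t"
print(guess_encoding("å\n".encode("utf-8")))  # bytes written by: echo "å" in a UTF-8 terminal
print(guess_encoding(b"\xe5\n"))             # "å" as a single Latin-1 byte
```

Nothing in the file itself records the encoding. The result is an inference from the byte values, which is why it is only ever a guess.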

url encode a utf-8 string in c?

Should I write my own or is there a library function that already does that? I need this for a pidgin plugin, so if there is something in the pidgin/purple/gnome libraries, that would be ideal. But other sources are fine, too. ...

Which code set is /etc/passwd stored in? Can it be UTF-8? What limits are placed on user names?

On a modern Unix or Linux system, how can you tell which code set the /etc/passwd file stores user names in? Are user names allowed to contain accented characters (from the range 0x80..0xFF in, say, ISO 8859-1 or 8859-15)? Can the /etc/passwd file contain UTF-8? Can you tell that it contains UTF-8? What about the plain text of passwo...

NSData to NSString conversion problem

I'm getting an HTML file as NSData and need to extract some parts of it. For that I need to convert it to NSString with UTF8 encoding. The thing is that this conversion fails, probably because the NSData contains bytes that are invalid for UTF8. I have tried to get the byte array of the data and go over it, but each time I come across no...

Which programming languages were designed with Unicode support from the beginning?

Which widely used programming languages were designed ground-up with Unicode support? A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one? ...

Servlet request.getParameters non english character help!

Heya guys, I'm in desperate need of help. I have a Java servlet that is accessed by an HTTP GET URL with eight parameters in it. The problem is that the parameters are not exclusive to English. Any other language can be in those parameters, like Hebrew, for example. Now, when I send the data - either from the class that is supposed to...

In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

We are having trouble getting a Unicode string to convert to a UTF-8 string to send over the wire: // Start with our unicode string. string unicode = "Convert: \u10A0"; // Get an array of bytes representing the unicode string, two for each character. byte[] source = Encoding.Unicode.GetBytes(unicode); // Convert the Unicode bytes to U...

Encoding issues with python's etree.tostring

I'm using python 2.6.2's xml.etree.cElementTree to create an xml document: import xml.etree.cElementTree as etree elem = etree.Element('tag') elem.text = (u"Würth Elektronik Midcom").encode('utf-8') xml = etree.tostring(elem,encoding='UTF-8') At the end of the day, xml looks like: <?xml version='1.0' encoding='UTF-8'?> <tag>W&#195;&#...

enabling UTF-8 encoding for clojure source files.

Hi all, I'm working on a project which involves Maven, Java and Clojure. The problem I'm facing is this: I have some UTF-8 chars in my Clojure source files, because of which my source code is not interpreted correctly by the Java compiler. I kinda got it working by setting the environment variable JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF...

Can anyone tell me what this ascii character is?

I have this character showing up occasionally and I can't seem to find it in the ASCII table. I'd like to run a filter on the data before it's sent to the database, but I have to know what it is first. Maybe someone can clue me in. I am using a WYSIWYG editor and this is where it's coming from. The character appears very sporadically but se...

PHP: replace invalid characters in utf-8 string in

How can I replace invalid characters in a UTF-8 string with whitespace characters, using a regex in PHP 5? ...

R character encodings across windows, mac and linux

I use OS X and I am currently cooperating with a Windows user and deploying the scripts on a Linux server. We use git for version control, and I keep getting R scripts from his end that mix latin1 and utf8 encodings. So I have a couple of questions. Is there a simple-to-use editor for Windows that h...

Parsing UTF-8-encoded XML in MSXML/ASP

I'm at the receiving end of an HTTP POST (x-www-form-urlencoded), where one of the fields contains an XML document. I need to receive that document, look at a couple of elements, and store it in a database (for later use). The document is in UTF-8 format (and has the appropriate header), and can contain lots of strange characters. When I...

How to decode a string that has been UTF-8 encoded twice to simple UTF-8?

I have a huge MySQL table which has its rows encoded in UTF-8 twice. For example "Újratárgyalja" is stored as "ÃšjratÃ¡rgyalja". The MySQL .NET connector downloads them this way. I tried lots of combinations with System.Text.Encoding.Convert() but none of them worked. Sending set names 'utf8' (or other charset) won't solve it. How can...
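
The general idea of reversing a double encoding can be sketched in Python, even though the question concerns .NET. It assumes the UTF-8 bytes were misread as Windows-1252 before being encoded a second time, which is an assumption about this data and not something the question states; `undo_double_utf8` is a hypothetical helper name.

```python
def undo_double_utf8(text: str, wrong_codec: str = "cp1252") -> str:
    """Reverse one extra round of UTF-8 encoding.

    Encoding with the codec that was wrongly applied recovers the original
    UTF-8 bytes, which can then be decoded properly.
    """
    return text.encode(wrong_codec).decode("utf-8")


# Reproduce the damage: UTF-8 bytes misread as Windows-1252 text.
garbled = "Újratárgyalja".encode("utf-8").decode("cp1252")

# Undo it.
fixed = undo_double_utf8(garbled)
print(garbled, "->", fixed)
```

If the intermediate codec was something else (for example ISO-8859-1), pass it as `wrong_codec`. Text that was never double-encoded will usually raise `UnicodeDecodeError` or `UnicodeEncodeError` here, so rows should be checked before being overwritten.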

Mysql ASCII vs Unicode

Just a quick one: Will a SELECT ... WHERE name LIKE '...' query be faster if the name column is ASCII rather than UTF-8? Thanks! ...