questions about utf-8 | ansaurus

utf-8

UCS2/HexEncoded characters to UTF8 in php

Hi guys, I previously asked how to get a UCS-2/hex-encoded string from UTF-8, and I got some help at the following link: UCS2/HexEncoded characters. But now I need to get the correct UTF-8 from a UCS-2/hex-encoded string in PHP. For the following strings: 00480065006C006C006F will return 'Hello' 06450631062d0628...
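
The excerpt above asks about PHP; as a language-neutral illustration of the conversion it describes, here is a minimal Python 3 sketch. It assumes the hex string is big-endian UCS-2, which the 'Hello' example suggests, and decodes it with the `utf-16-be` codec (UCS-2 coincides with UTF-16 for characters in the Basic Multilingual Plane). The function names are hypothetical and not taken from the original question.

```python
def ucs2_hex_to_text(hex_string: str) -> str:
    """Decode a hex-encoded, big-endian UCS-2 string into a Python str.

    Assumption: the input is big-endian and has an even number of
    16-bit code units, as in the 'Hello' example from the question.
    """
    raw = bytes.fromhex(hex_string)   # "0048..." -> b"\x00H..."
    return raw.decode("utf-16-be")    # UCS-2 is a subset of UTF-16


def ucs2_hex_to_utf8(hex_string: str) -> bytes:
    """Return the UTF-8 bytes for a hex-encoded UCS-2 string."""
    return ucs2_hex_to_text(hex_string).encode("utf-8")


if __name__ == "__main__":
    print(ucs2_hex_to_text("00480065006C006C006F"))  # prints Hello
```
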

Python Encoding Issue

I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming perfectly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing. korean_text = korean_text.encode('utf-8', 'ignore') korean_text = unicode(korean_text, 'utf-8') I save the above ...

Cannot decode/encode in UTF-8.

I have a text-box which allows users to enter a word. The user enters: über In the backend, I get the word like this: def form_process(request): word = request.GET.get('the_word') word = word.encode('utf-8') #word = word.decode('utf-8') print word For some reason, I cannot decode or encode this!! It gives me the err...

Windows-1252 to UTF-8 encoding

I've copied certain files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the "recode" utility for that. How can I specify that the "recode" utility should only convert windows-1252 enco...
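
The question is about the "recode" utility, which this sketch does not use. It only illustrates the "convert only if not already UTF-8" idea in Python 3: bytes that decode cleanly as UTF-8 are left untouched, and anything else is assumed to be windows-1252. That is a heuristic, since some windows-1252 byte sequences also happen to be valid UTF-8, and the function name is hypothetical.

```python
def to_utf8_bytes(data: bytes) -> bytes:
    """Return data unchanged if it is valid UTF-8, else convert from cp1252.

    Assumption: every input is either UTF-8 or windows-1252. Bytes that are
    undefined in Python's cp1252 codec (e.g. 0x81) raise UnicodeDecodeError.
    """
    try:
        data.decode("utf-8")      # validity check only
        return data               # already UTF-8: leave as-is
    except UnicodeDecodeError:
        return data.decode("cp1252").encode("utf-8")


if __name__ == "__main__":
    print(to_utf8_bytes(b"\xfcber"))  # cp1252 bytes for "über", re-encoded as UTF-8
```
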

A problem with passing Japanese characters(UTF-8) via json_encode

Hi, I'm having trouble transferring Japanese characters from PHP to JavaScript via json_encode. Here is the raw data read from a CSV file. PRODUCT1,QA,テスト PRODUCT2,QA,aテスト PRODUCT3,QA,1テスト The problem is that when passing those data by echo json_encode($return_value), where $return_value is a 2-dimensional array containing above dat...

character-encoding

ISO-8859-1 to UTF-8 Charset conversion in PHP

Hello all. I am having to import data from a database where the character encoding being used is ISO-8859-1 and the new site that we are using is using UTF-8. The site that the data is being pulled from is old, hence the reason that it is in ISO still I presume. I have tried the following solutions with no results: iconv Neverthe...

character-encoding

Allowing non-English (ASCII) characters in the URL for SEO?

I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards. Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These inc...

internationalization

Is StringComparer.CurrentCulture the right choice to use in this case?

I have a list of UTF-8 strings that I want to sort using Enumerable.OrderBy. The strings may contain any number of character sets - e.g., English, German, and Japanese, or a mix of them, even. For example, here is a sample input list: ["東京","North 東京", "München", "New York", "Chicago", "大阪市"] I am confused as to whether using String...

string-comparison

PHP PREG Regex: What does "\W" mean when using the UTF-8 modifier?

I know that in normal php regex (ASCII mode) "\w" (word) means "letter, number, and _". But what does it mean when you are using multibyte regex with the "u" modifier? preg_replace('/\W/u', '', $string); ...

PHP: is urlencode() a safe way to allow valid UTF-8 strings in the URL?

I have user-submitted tags that can be any type of (valid) UTF-8 string. I want to know if it is safe to include them in the URL merely by running them through urlencode(). In other words, is urlencode() safe to use for valid UTF-8 strings? (By valid I mean I'd have already force-encoded them to UTF-8.) ...

Storing multi-language geodata in MySQL

My application needs to use geodata for displaying location names. I'm very familiar with large-scale complex geodata generally (e.g. Geonames.org) but not so much with the possible MySQL implementation. I have a custom dataset of four layers, including lat/lon data for each: - Continents (approx 10) - Countries (approx 200) - Regions/S...

Preamble is empty for (new Utf8Encoding()).GetPreamble() - weird

Can anyone explain the difference between calling GetPreamble() on a newly instantiated utf8 encoding as opposed to the public ones available from the Encoding class? byte[] p1 = Encoding.UTF8.GetPreamble(); byte[] p2 = new UTF8Encoding().GetPreamble(); p1 is the normal 3 byte utf-8 preamble, but p2 ends up being empty, which seems ve...

Problems with utf8 encoding in database.yml discarding strings on insert

So I'm doing some screen scraping with this rails app I author, and when I go to insert some text from the page into the database ... rails refuses to do it (inserting empty strings into the db column instead). I looked more closely and realized that it was doing it if the string contains 'weird' characters. Weird character would be som...

How can I treat command-line arguments as UTF-8 in Perl?

How do I treat the elements of @ARGV as UTF-8 in Perl? Currently I'm using the following work-around .. use Encode qw(decode encode); my $foo = $ARGV[0]; $foo = decode("utf-8", $foo); .. which works but is not very elegant. I'm using Perl v5.8.8 which is being called from bash v3.2.25 with a LANG set to en_US.UTF-8. ...

Looking for case insensitive MySQL collation where "a" != "ä"

Hi all, I'm looking for a MySQL collation for UTF8 which is case insensitive and distinguishes between "a" and "ä" (or more generally, between umlauted / accented characters and their "pure" form). utf8_general_ci does the former, utf8_bin the latter, but neither does both. If there is no such collation, what can I do to get as close as po...

case-insensitive

Converting these types of unicode to UTF8 in PHP

Hi, I am trying to convert this into readable UTF8 text in PHP: Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv. Any ideas on how to do so? Tried several methods...
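
The question targets PHP; purely to illustrate what the conversion involves, here is a minimal Python 3 sketch that replaces each literal `\uXXXX` escape with the character it names and then encodes the result as UTF-8. It assumes four-digit escapes for Basic Multilingual Plane characters only (surrogate pairs are not combined), and `unescape_u` is a hypothetical helper name.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text: str) -> str:
    """Replace each \\uXXXX escape in text with the character it denotes.

    Assumption: escapes name BMP code points; surrogate pairs are left
    as two separate code units rather than merged.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    sample = r"Tall \u02bcAb\u012bb"           # fragment from the question
    readable = unescape_u(sample)
    print(readable)
    print(readable.encode("utf-8"))            # the UTF-8 byte form
```
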

Universal Sorting Function for PHP without the Locale Hassle

I asked a very similar question a while back and I was wondering if correctly sorting an array with UTF-8 chars got a little easier with the new improvements of PHP 5.3+. The solution provided in my previous question works, but I'm looking for a universal solution; one that doesn't depend on the locale specified - kind of what MySQL doe...

internationalization

How do I use UTF in a Rails URL?

I have the following route in routes.rb: map.resources 'protégés', :controller => 'Proteges', :only => [:index] # # this version doesn't work any better: # map.resources 'proteges', :as => 'protégés', :only => [:index] When I go to "http://localhost:3000/protégés" I get the following: No route matches "/prot%C3%A9g%C3%A9s" with {:met...

internationalization

How to search and replace utf-8 special characters in Python?

I'm a Python beginner, and I have a utf-8 problem. I have a utf-8 string and I would like to replace all german umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue'). u-umlaut has unicode code point 252, so I tried this: >>> str = unichr(252) + 'ber' >>> print repr(str) u'\xfcber' >>> print repr(str).repl...
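
The excerpt uses Python 2 (`unichr`, `u'...'` literals). As a hedged sketch of the replacement it describes, here is one way to do it in Python 3, where `str` is already Unicode, using `str.translate` with a mapping table. The table covers only the six umlauted letters, and the names `UMLAUT_MAP` and `transliterate_umlauts` are hypothetical.

```python
# Map each umlauted letter to its conventional two-letter ASCII spelling.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def transliterate_umlauts(text: str) -> str:
    """Return text with German umlauts replaced by ASCII digraphs."""
    return text.translate(UMLAUT_MAP)


if __name__ == "__main__":
    word = chr(252) + "ber"                # code point 252 is 'ü'
    print(transliterate_umlauts(word))     # prints ueber
```

The replacement works on the decoded string, not on its UTF-8 bytes, so input read as bytes would need `.decode("utf-8")` first.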

Determine input encoding

I'm getting console input from the user and want to encode it to UTF-8. My understanding is C++ does not have a standard encoding for input streams, and that it instead depends on the compiler, the runtime environment, localization, and what not. How can I determine the input encoding? ...
