questions about multibyte | ansaurus

multibyte

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .NET and C/C++, an...

character-encoding

P/Invoke with [Out] StringBuilder / LPTSTR and multibyte chars: Garbled text?

I'm trying to use P/Invoke to fetch a string (among other things) from an unmanaged DLL, but the string comes out garbled, no matter what I try. I'm not a native Windows coder, so I'm unsure about the character encoding bits. The DLL is set to use "Multi-Byte Character Set", which I can't change (because that would break other project...

PHP Multibyte String Functions

Today I ran into a problem with the PHP function strpos(), because it returned FALSE even though the correct result was obviously 0. This was because one parameter was encoded in UTF-8, but the other (which originated from an HTTP GET parameter) obviously was not. I have now noticed that using the mb_strpos function solved my problem. My question is now: I...

What is a multibyte character set?

Does the term multibyte refer to a charset whose characters can be, but don't have to be, wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are always wider than 1 byte (e.g. UTF-16)? In other words: what is meant when somebody talks about multibyte character sets? ...

JavaScript equivalent for PHP's md5() which will also work with multibyte strings?

EDIT: the script mentioned in the question, and the other script pointed to in the answers, both work just fine with multibyte strings - it turned out my problem was elsewhere. Does anyone know of such an implementation? The script at http://phpjs.org/functions/view/469 works well, just not on multibyte strings. ...

rename not supporting multi-byte characters

If I write: rename('php109.tmp','test.jpg'); then it works fine. But if I change it to: rename('php109.tmp','中文.jpg'); it reports "No such file or directory...". The multi-byte characters can be written into the database and read back fine, so why does it fail with rename? ...

How to detect the end of multi-byte character input with JavaScript?

Currently I listen for the "Enter" key to start sending a message, but for multi-byte characters the "Enter" key is supposed to choose a certain character. The problem is that I've no idea how to detect whether a user is in the middle of inputting a multi-byte character, and even if he's in that process, the message will be sent the first ...

Where can I get a complete list of all multi-byte functions for PHP?

Where can I get a complete list of all multi-byte functions for PHP? I need to go through my application and switch the non MB string functions to the new mb functions. ...

Should I use mb_* or iconv_* functions for multibyte strings?

Hi there! As we all know, handling multibyte strings is not that easy in PHP. For example, I want to get the length of the following string: ä strlen('ä'); // 2, because ä equals 2 bytes mb_strlen('ä', 'UTF-8'); // 1 iconv_strlen('ä', 'UTF-8'); // 1 Which functions should I use? The mb_* or iconv_*? Why? Considering that the encoding ...
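The byte-count versus character-count distinction in the excerpt above can be sketched in Python. This is only an analogue of the idea, not the PHP functions themselves: the helper names `byte_length` and `char_length` are made up for this illustration, and UTF-8 is assumed as the encoding.

```python
def byte_length(text, encoding="utf-8"):
    """Number of bytes the text occupies once encoded (assumes UTF-8 by default)."""
    return len(text.encode(encoding))


def char_length(text):
    """Number of Unicode code points in the text."""
    return len(text)


# "\u00e4" is the precomposed letter ä; the second sample is three Chinese characters.
for sample in ["\u00e4", "\u666e\u901a\u8bdd"]:
    print(byte_length(sample), char_length(sample))
```

A byte-oriented function sees the first number in each pair, while an encoding-aware function sees the second.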

How to safely parse multibyte feeds in Ruby/Rails?

(Sorry if a newb question...I've done quite a bit of research, honestly...) I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is throwing up on a pesky '£' symbol. I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else: descr = self.description.mb_ch...

How do I use CharNext in the Windows API properly?

I have a multi-byte string containing a mixture of Japanese and Latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some of the characters use one byte and other characters use two. When copying parts of the string, I must not copy "half" Japanese characters. To be ab...

How to get the exact number of multibyte characters?

I tried: mb_strlen('普通话'); strlen('普通话'); both of them output 9, while in fact there are only 3 characters. What's the right way to count characters? ...

How to convert multi-byte punctuation marks to single-byte ones with PHP?

For example, both ， and , are commas, but the first one takes 2 bytes, while the second takes only 1. How do I convert the 2-byte one to the 1-byte one? ...

Why is there an extra empty row when split by multibyte punctuation?

Try this: $pattern = '/[\x{ff0c},]/u'; //$string = "something here ; and there, oh,that's all!"; $string = 'hei,nihao，a '; echo '<pre>', print_r( preg_split( $pattern, $string ), 1 ), '</pre>'; exit(); output: <pre>Array ( [0] => hei,nihao，a ) </pre> ...

Newline Control for JasperReports 3.6.

I'm working with JasperReports 3.6 and trying to create a report in PDF format which will meet our customer's needs. When exporting to PDF, I found that multi-byte characters (Japanese/Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) were automatically printed on a new line without setting the "/n" code. The field would be expected to s...

How to make json_encode work with multibyte characters?

echo '<a title=' .json_encode("按时间先后进行排序") . '>test</a>'; The above will generate something like "\u6309\u65f6\u95f4\u5148\u540e\u8fdb\u884c\u6392\u5e8f" and it's a mess! ...

Replace "abc123def" with "abc 123 def" in multibyte string

Normally I would just do this: $str = preg_replace('#(\d+)#', ' $1 ', $str); If I knew it was going to be UTF-8, I would add a lowercase "u" modifier to the pattern and I think I would be good. But because of reports of UTF-8 taking 2x, and in some cases 3x, the storage space it would take if the native character set were used, I'm ...

How does UTF-8 "variable-width encoding" work?

The Unicode standard has enough code points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding". In fact, it manages to represent the first 128 characters of US-ASCII in just one...
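As a rough illustration of the variable-width idea in the excerpt above, the Python sketch below encodes a few sample characters and reports how many bytes each takes in UTF-8 versus UTF-32. The sample characters and the helper name `encoded_widths` are chosen here for illustration and are not taken from the question.

```python
def encoded_widths(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for a single character."""
    # "utf-32-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


# An ASCII letter, a Latin letter with a diaeresis, a CJK character, and an emoji.
samples = ["A", "\u00e4", "\u4e2d", "\U0001F600"]

for ch in samples:
    utf8_len, utf32_len = encoded_widths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```

UTF-32 spends four bytes on every code point, while UTF-8 uses between one and four depending on how large the code point is.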

character-encoding

Detect Chinese (multibyte) characters in a string

$str = "This is a string containing 中文 characters. Some more characters - 中华人民共和国 "; How do I detect Chinese characters in this string and print the part which starts with the first such character and ends with "-"? (It would be "中文 characters. Some more characters -".) Thank you! ...

chinese-characters

PHP mbstring.func_overload vs using mbstring functions

I want to conform my site's string handling to support other languages via UTF-8. It seems that the best way to do this is to forsake all the standard string functions. So I have two options: I can set the mbstring.func_overload option in php.ini, or I can go back over my code and just replace all the functions with mb_*. I would assume ...
