multibyte

Parsing multibyte string in PHP

I would like to write an (HTML) parser based on a state machine, but I have doubts about how to actually read and use the input. I decided to load the whole input into one string, work with it as an array, and hold an index as the current parsing position. There would be no problems with a single-byte encoding, but in a multi-byte encoding each ...

Merging two Regular Expressions to Truncate Words in Strings

I'm trying to come up with the following function that truncates string to whole words (if possible, otherwise it should truncate to chars): function Text_Truncate($string, $limit, $more = '...') { $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8')); if (strlen(utf8_decode($string)) > $limit) { $string ...

multibyte strtr() -> mb_strtr()

Has anyone written a multibyte variant of the function strtr()? I need one. Edit 1 (example of desired usage): Example: $from = 'ľľščťžýáíŕďňäô'; // these chars are in UTF-8 $to = 'llsctzyaiŕdnao'; // input - in UTF-8 $str = 'Kŕdeľ ďatľov učí koňa žrať kôru.'; $str = mb_strtr( $str, $from, $to ); // output - str without di...
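
The per-character mapping that strtr() performs can be sketched on code points instead of bytes. The example below uses Python rather than PHP, and the name `mb_strtr` is taken from the question, not from any existing library. Raising on unequal lengths is a choice made for this sketch.

```python
def mb_strtr(text: str, src: str, dst: str) -> str:
    """Replace each character of `src` found in `text` with the
    character at the same position in `dst` (code points, not bytes)."""
    if len(src) != len(dst):
        # Assumption for this sketch: reject unequal lengths outright.
        raise ValueError("src and dst must have the same number of characters")
    return text.translate(str.maketrans(src, dst))


# Small made-up mapping, in the spirit of the question's example.
result = mb_strtr("žrať kôru", "ľščťžô", "lsctzo")
print(result)
```

Because Python strings are sequences of code points, a multi-byte UTF-8 character is never split in the middle.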

C/C++ I18N mbstowcs question

I am working on internationalizing the input for a C/C++ application. I have currently hit an issue with converting from a multi-byte string to wide character string. The code needs to be cross platform compatible, so I am using mbstowcs and wcstombs as much as possible. I am currently working on a WIN32 machine and I have set the loc...

How to check if the word is Japanese or English using PHP

I want to process English words and Japanese words differently in this function: function process_word($word) { if($word is english) { ///////// }else if($word is japanese) { //////// } } Thank you ...
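
One possible way to tell the two apart is to look for code points in the Japanese Unicode blocks. This is a Python sketch, not PHP, and the function name and the choice of ranges are assumptions. In particular, kanji share the CJK Unified Ideographs block with Chinese, so a kanji-only word cannot be distinguished from Chinese this way.

```python
# Unicode blocks checked in this sketch (an assumption, not a complete list).
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji, shared with Chinese)
)


def contains_japanese(word: str) -> bool:
    """Return True if any character falls inside one of the ranges above."""
    return any(
        low <= ord(ch) <= high
        for ch in word
        for low, high in JAPANESE_RANGES
    )


def process_word(word: str) -> str:
    # Hypothetical dispatcher mirroring the structure in the question.
    return "japanese" if contains_japanese(word) else "english"
```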

Merge two bytes in java/android

Hi, I have a frame of 22 bytes. The frame is the input stream from an accelerometer via Bluetooth. The accelerometer readings are 16-bit numbers split over two bytes. When I try to merge the bytes with buffer[1] + buffer[2], rather than adding the bytes, it just puts the results side by side, so 1+2 = 12. Could someone tell me how to ...
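
Combining two bytes into one 16-bit number is usually done with a shift and a bitwise OR, not with `+`. The sketch below is in Python rather than Java. It assumes the high byte comes first and that readings may be signed; neither is stated in the question. The frame contents are made up.

```python
def merge_bytes(high: int, low: int) -> int:
    """Combine two bytes into an unsigned 16-bit value, high byte first."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def to_signed16(value: int) -> int:
    """Reinterpret an unsigned 16-bit value as two's-complement signed."""
    return value - 0x10000 if value & 0x8000 else value


# Made-up frame fragment for illustration only.
buffer = bytes([0x00, 0x01, 0x02, 0xFF, 0xFE])

reading = merge_bytes(buffer[1], buffer[2])             # 0x0102
signed = to_signed16(merge_bytes(buffer[3], buffer[4]))  # 0xFFFE as signed
print(reading, signed)
```

The `& 0xFF` masks matter in languages where a byte is signed, such as Java; in Python they only guard against out-of-range input.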

Convert Multi Byte character into Hex

Hi, I have an incoming file that will pass through a BizTalk mapper. I need to identify whether there is a 3-byte Chinese character in one of the fields of the file (the file is XML). I already have an idea of how to find the 3-byte character. However, how can I convert it into its hex value? The hex value is what I will send to the output schema then se...

Mac/iPhone - NSString - comparing multibyte character and wide character strings

I'm using an NSString that is a combination of Japanese and English characters. All are two-byte (multi-byte) characters. From a web service I'm receiving a string that is also a combination of Japanese and English characters, but as far as I know the English characters in that string are one-byte characters. I want to compare my string ...

How to get byte size of multibyte string

How do I get the byte size of a multibyte-character string in Visual C? Is there a function, or do I have to count the characters myself? Or, more generally, how do I get the right byte size of a TCHAR string? Solution: _tcslen(_T("TCHAR string")) * sizeof(TCHAR) EDIT: I was talking about null-terminated strings only. ...

Why doesn't wstring::c_str cause a memory leak if not properly deleted

Code Segment 1: wchar_t *aString() { wchar_t *str = new wchar_t[5]; wcscpy(str, L"asdf"); return str; } wchar_t *value1 = aString(); Code Segment 2: wstring wstr = L"a value"; const wchar_t *value = wstr.c_str(); If value from code segment 2 is not deleted then a memory leak does not occur. However, if value1 from code seg...

mb_str_replace()... is slow. any alternatives?

Heya all. I want to make sure some string replacements I'm running are multibyte safe. I've found a few mb_str_replace functions around the net but they're slow. I'm talking 20% increase after passing maybe 500-900 bytes through it. Any recommendations? I'm thinking about using preg_replace as it's native and compiled in so it might b...

removing multibyte characters from a file using sed

I need to remove all multibyte characters from a file. I don't know what they are, so I need to cover the whole range. I can find them using grep like so: grep -P "[\x80-\xFF]" 'myfile' I'm trying to do a similar thing with sed, but delete them instead. Cheers ...
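
The same byte range the grep pattern matches (0x80 through 0xFF) can be deleted at the byte level. The sketch below uses Python rather than sed; the function name is hypothetical. It assumes the file is UTF-8 or another ASCII-compatible encoding in which every byte of a multibyte character is 0x80 or above.

```python
# Every byte value from 0x80 through 0xFF, matching the grep range.
HIGH_BYTES = bytes(range(0x80, 0x100))


def strip_high_bytes(data: bytes) -> bytes:
    """Delete all bytes >= 0x80, leaving plain ASCII untouched."""
    return data.translate(None, HIGH_BYTES)


original = "naïve café\n".encode("utf-8")
cleaned = strip_high_bytes(original)
print(cleaned)
```

To apply it to a file, read the file in binary mode (`open(path, "rb")`), pass the contents through `strip_high_bytes`, and write the result back in binary mode.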

strpos searching for unicode in PHP (and handling inline UTF-8)

I am having a problem with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8. Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have...

PHP mb_ereg_replace not replacing while preg_replace works as intended.

I am trying to replace all non-word characters in a string, except for spaces, and then collapse multiple spaces into a single space. The following code does this. $cleanedString = preg_replace('/[^\w]/', ' ', $name); $cleanedString = preg_replace('/\s+/', ' ', $cleanedString); But when I am trying to use mb_ereg...

str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do? $string = str_replace('"', '\\"', $string); In particular, if the input was in a character set that might have a valid character like 0xbf5c, so an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid chara...
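
The byte sequence described in the question can be reproduced as a small demonstration. This Python sketch does a byte-level replacement comparable to str_replace() and then decodes the result. GBK is used as the character set because 0xBF5C is a valid character there; the question does not name a charset, so that choice is an assumption.

```python
def naive_escape(raw: bytes) -> bytes:
    """Byte-level escaping: put a backslash before every double quote."""
    return raw.replace(b'"', b'\\"')


# Attacker-supplied bytes: 0xBF followed by a double quote (0x22).
payload = b'\xbf"'

escaped = naive_escape(payload)   # bytes BF 5C 22
decoded = escaped.decode("gbk")   # BF 5C is read as a single GBK character

# The backslash has been absorbed into the multibyte character,
# so the quote that follows is no longer escaped.
print(escaped.hex(), len(decoded), decoded[-1])
```

Escaping after decoding to characters, or using an escaping function that knows the connection's character set, avoids this class of problem.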

Multi-byte safe wordwrap() function for UTF-8

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8. There are a few examples of mb safe functions in the comments, but with some different test data they all seem to have some problems. The function should take the exact same parameters as wordwrap(). Specifically be sure it works to: cut mid-word if ...

How to output multiple byte characters normally in c/c++ console application?

printf("%s\n", multibytestring); By default the multi-byte characters will show up like ??? in console, how can I fix it? ...

PHP extension: how do I use the mb_* functions

There's a lot of functionality available in PHP for scripts. Is this functionality available somehow to the extension writer? I'd really like to use the multibyte functions but can't find an example thereof. ...

Multibyte Safe Url Title Conversion in PHP

I'm trying to create a multibyte-safe title => URL string converter; however, I've run into the problem of not knowing how to allow legal Asian (and other) characters in the URL while removing others. This is the function set at the moment. public static function convertAccentedCharacters($string) { $table ...