multibyte

Converting accented characters in PostgreSQL?

Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively. The closest thing I could find is the translate function, given the example in the comments section found here. Some commonly used accented characters can be searched us...
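The translate-style approach mentioned in the question can be sketched as a fixed character-to-character mapping. This is an illustration in Python rather than PostgreSQL, and the mapping table below is a hypothetical, deliberately tiny example, not a complete list of accented characters:

```python
# Hypothetical, deliberately small mapping: each accented character in the
# first string is replaced by the character at the same position in the second.
ACCENT_MAP = str.maketrans("åøÅØ", "aoAO")


def unaccent(text: str) -> str:
    """Replace the mapped accented characters; leave everything else as-is."""
    return text.translate(ACCENT_MAP)


print(unaccent("å and ø"))
```

An explicit table is used here because a character such as ø has no Unicode decomposition into a base letter plus a combining mark, so approaches based purely on normalization would leave it unchanged.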

What do these PHP mbstring settings do?

I'm trying to figure out exactly what these php.ini settings do. What happens when they're set to different values? When are they necessary? When are they harmful? The settings are mbstring.language, mbstring.http_input, mbstring.http_output, and mbstring.encoding_translation. As usual, the PHP manual is less than helpful. EDIT: Just to clarify, I understan...

Finding bytes in strings with PHP's mbstring.func_overload on

I have PHP configured with mbstring.func_overload = 7, so all the single-byte-string functions are mapped to their multi-byte equivalents. But I still sometimes need to treat strings as byte arrays; for example, when calculating their size or doing encryption. What's the best approach here? Can I just use the multi-byte functions and pa...

Search MultiByte Strings using RegEx C# Winforms

I am working on HTML documents using the WebBrowser control, and I need to make a utility which searches for a word and highlights it in the browser. It works well if the string is in English, but for strings in other languages, for example Korean, it doesn't seem to work. The scenario where the code mentioned below works is: consider user has s...

How do I make Emacs display a multi-byte encoded file properly? Is it MULE?

When I open a multi-byte file, I get this: ...

Are the PHP preg_functions multibyte safe?

There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_functions are all mb safe? I couldn't find any mention of this in the PHP documentation. ...

Ruby 1.9: how do I get a byte-index-based slice of a String?

I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based. I found references on the web to String#subseq, which is supposed to be like String#[], but for bytes. Alas, it seems not to have made it to 1.9.1. Now, why would I want to do that? There's a chance I'll end up with an invalid string should ...

Ruby 1.9: how to properly upcase/downcase multibyte strings?

So matz took the questionable decision to keep upcase and downcase limited to /[A-Z]/i in ruby 1.9.1. ActiveSupport::Multibyte has long had great i18n case jiggering in ruby 1.8.x via String#mb_chars. However, when tried under ruby 1.9.1, it doesn't seem to work. Here's a simple test script I wrote, along with the output I'm getting: ...

_tcslen in Multibyte character set: how to convert WCHAR [1] to const char * ?

I searched the internet for about 2 hours and didn't find any working solution. My program uses the multibyte character set. In my code I have: WCHAR value[1]; _tcslen(value); When compiling, I get the error: 'strlen' : cannot convert parameter 1 from 'WCHAR [1]' to 'const char *' How do I convert this WCHAR[1] to const char *? ...

Problem with diacritics and mb_substr

I am slicing a Unicode string with diacritics using the mb_substr function, but it behaves as if I were using the plain substr function. It splits Unicode characters in half, displaying a question-mark diamond. E.g. echo mb_substr('ááááá', 0, 5); //Displays áá� What might be wrong? ...
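The output shown in the question is what one would expect when a UTF-8 string is cut by byte offset instead of by character. The following is a minimal sketch in Python (not PHP) that only reproduces the symptom; it assumes the string is UTF-8 encoded with precomposed characters and makes no claim about the asker's actual configuration:

```python
text = "\u00e1" * 5  # five precomposed "á" characters, 2 bytes each in UTF-8

# Slicing by characters keeps every code point intact.
by_chars = text[:5]

# Slicing the first 5 bytes cuts the third character in half; decoding with
# errors="replace" turns the dangling lead byte into U+FFFD (the "diamond").
by_bytes = text.encode("utf-8")[:5].decode("utf-8", errors="replace")

print(by_chars)
print(by_bytes)
```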

Converting Multibyte characters to UTF-8

Hi all, my application has to write data to an XML file which will be read by a swf file. The swf expects the data in the XML to be in UTF-8 encoding. I have to convert some multibyte characters in my app (Simplified Chinese, Japanese, Korean, etc.) to UTF-8. Are there any API calls which could allow me to do this? I would pre...

Set character set to multi byte by using code

Is there a way to set the character set to multi-byte in code? By that I mean without going into the properties of the compiler and setting it there. I mean, well... in code. :p Thanks in advance. An answer would mean a lot, as this has been bugging me for a while. :D ...

Split a sentence into separate words

Hi, guys! I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走). At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will: 1) try to find th...

Is there a faster implementation for converting a multibyte character string to a Unicode wstring?

Hi, in my project I adopted the Aho-Corasick algorithm to do some message filtering on the server side, and the messages the server receives are strings of multibyte characters. But after several tests I found the bottleneck is the conversion between the multibyte string and the Unicode wstring. What I use now is the pair of mbstowcs_s and wcstombs_s, whic...

Truncate a multibyte String to n chars

I am trying to get this method in a String Filter working: public function truncate($string, $chars = 50, $terminator = ' …'); I'd expect this $in = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890"; $out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …"; and also this $in = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀā...

Unicode vs Multi-byte

I'm really confused by this unicode vs multi-byte thing. Say I'm compiling my program in Unicode (but ultimately, I want a solution that is independent of the character set used). 1) Will all 'char' be interpreted as wide characters? 2) If I have a simple printf statement, i.e. printf("Hello World\n"); with no character strings, can I...

Converting from ANSI to Unicode

Hi all, I'm using Visual Studio .NET 2003, and I'm trying to convert a program written in purely ANSI characters to be independent of Unicode/Multi-byte characters. The program has a callback function of pcap_loop, called "got_packet". It's defined as void got_packet(u_char *user, const struct pcap_pkthdr *header, const u_char *cpacke...

How to locate a sequence of values (specifically, bytes) within a larger collection in .NET.

I need to parse the bytes from a file so that I only take the data after a certain sequence of bytes has been identified. For example, if the sequence is simply 0xFF (one byte), then I can use LINQ on the collection: byte[] allBytes = new byte[] {0x00, 0xFF, 0x01}; var importantBytes = allBytes.SkipWhile((byte b) => b != 0xFF); // importa...

C++: getting the ASCII value of a wide char

Hi all! Let's say I have a char array like "äa". Is there a way to get the ASCII value (e.g. 228) of the first char, which is multibyte? Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long. Is there a way to do this? I've been trying for 2 days now :( I'm using gcc. Thanks! ...
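The value 228 in the question is the Unicode code point of "ä" (U+00E4), which is also its byte value in Latin-1; it lies outside the ASCII range, and in UTF-8 the character takes two bytes. A small sketch in Python (rather than C++) of the distinction, assuming the text is UTF-8 with a precomposed "ä":

```python
ch = "\u00e4a"[0]  # first character of "äa"

code_point = ord(ch)             # Unicode code point of the character
utf8 = ch.encode("utf-8")        # its UTF-8 representation (two bytes)
latin1 = ch.encode("latin-1")    # its Latin-1 representation (one byte)

print(code_point, list(utf8), list(latin1))
```

Reading the first byte of the UTF-8 data gives 0xC3, not 228, which is why indexing a char array does not yield the expected number; the bytes have to be decoded to a code point first.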

Variable-byte encoding clarification

Hello: I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way. I am trying to understand variable-byte encoding. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Variable-width_encoding) as well as a book chapter from an Inf...