Where does glibc get its database of Unicode attributes for functions such as wcwidth()? I'd like to correct a few errant entries, but I can't find where this information lives in its source distribution.

If it matters, I'm primarily interested in this under Debian or Ubuntu Linux.

A: 

I believe that it's defined in the locale definition file. See this page for more information about locales. glibc includes a bunch of locale definitions in localedata/locales, although none of them seem to have any width information.

Adam Rosenfield
However, locales are generated by the localedef application, which comes with glibc. I'm more interested in finding the canonical location to edit this information.
bdonlan
+1  A: 

Okay, so I'm just poking around myself so I'm not absolutely sure, but it appears that the table you are looking for is found in the following location relative to the glibc root:

localedata/locales/i18n

This appears to be the Unicode (version 5) locale. It contains the following, which is where I believe you need to make your changes:

% ENCLOSED ALPHANUMERICS/
   <U24D0>..<U24E9>;/
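
To make the range notation in that excerpt concrete, here is a small Python sketch that expands one such entry into code points. The helper name parse_range and the regular expression are my own invention, not anything from glibc, and it only handles the plain <Uxxxx>..<Uxxxx> form shown above.

```python
import re

# Matches the simple "<Uxxxx>..<Uxxxx>" range form only (hypothetical helper,
# not part of glibc or localedef).
RANGE_RE = re.compile(r"<U([0-9A-Fa-f]{4,8})>\.\.<U([0-9A-Fa-f]{4,8})>")


def parse_range(entry: str) -> range:
    """Return the inclusive code point range named by one locale-file entry."""
    match = RANGE_RE.search(entry)
    if match is None:
        raise ValueError(f"no <U...>..<U...> range found in {entry!r}")
    first = int(match.group(1), 16)
    last = int(match.group(2), 16)
    return range(first, last + 1)  # +1 because the notation is inclusive


if __name__ == "__main__":
    code_points = parse_range("   <U24D0>..<U24E9>;/")
    print(f"{len(code_points)} code points, "
          f"U+{code_points[0]:04X} through U+{code_points[-1]:04X}")
```

The trailing ";/" is ignored here; it appears to be just a separator plus a line continuation.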

In case you're wondering, the function ctype_output (ld-ctype.c) calls allocate_arrays, which calls wcwidth_table_init. The function wcwidth_table_init is generated by 3level.h (which also generates other tables that follow the same template). This is the chain that I followed to track down the files in localedata/locales.
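
For anyone unfamiliar with the idea of a three-level table, here is a rough Python sketch of the general technique: the code point is split into three bit fields, and blocks that hold only the default value are never allocated. This is not a port of 3level.h; the class name, the bit widths p and q, and the default value are all assumptions chosen for illustration.

```python
class ThreeLevelTable:
    """Sparse code point -> small integer map using three index levels.

    Illustrative only: the names and bit widths are assumptions, not glibc's.
    """

    def __init__(self, default=1, p=5, q=7):
        self.default = default
        self.p = p          # number of low bits used to index level 3
        self.q = q          # number of middle bits used to index level 2
        self.level1 = []    # indexed by the remaining high bits

    def _split(self, cp):
        i1 = cp >> (self.q + self.p)
        i2 = (cp >> self.p) & ((1 << self.q) - 1)
        i3 = cp & ((1 << self.p) - 1)
        return i1, i2, i3

    def set(self, cp, value):
        i1, i2, i3 = self._split(cp)
        # Grow level 1 lazily; unused slots stay None.
        while len(self.level1) <= i1:
            self.level1.append(None)
        if self.level1[i1] is None:
            self.level1[i1] = [None] * (1 << self.q)
        level2 = self.level1[i1]
        # Allocate a level-3 block only when a value is actually stored in it.
        if level2[i2] is None:
            level2[i2] = [self.default] * (1 << self.p)
        level2[i2][i3] = value

    def get(self, cp):
        i1, i2, i3 = self._split(cp)
        if i1 >= len(self.level1) or self.level1[i1] is None:
            return self.default
        level3 = self.level1[i1][i2]
        if level3 is None:
            return self.default
        return level3[i3]


if __name__ == "__main__":
    table = ThreeLevelTable()
    table.set(0x4E00, 2)  # pretend this code point is two columns wide
    for cp in (0x0041, 0x4E00, 0x10FFFF):
        print(f"U+{cp:04X} -> {table.get(cp)}")
```

The point of the layout is that a lookup costs three array indexings, while large unassigned stretches of the code space take no storage.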

Like I said, I'm not 100% sure that this is the right table, but I thought I'd share what I had found.

Naaff
The comments in that file suggest it's generated by localedata/gen-unicode-ctype.c, which talks about a UnicodeData file, but where is the UnicodeData file that's used in the glibc distribution...? I don't want to patch a generated file, it seems like that'd get sticky the next time there's a new release.
bdonlan
Hmmm... that's a good point. Have you tried modifying the generated file anyway, just to verify that wcwidth() returns the correct values? This might be useful as it would prove that we're on the right path. Then we could put more effort into finding out how the files are generated so the problem can be fixed at its root.
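
A minimal sketch of such a check, assuming a Linux system with Python 3, where ctypes.CDLL(None) exposes the C library's symbols. The helper name libc_wcwidth is made up, and the result depends on the LC_CTYPE locale in effect, so run it under the locale you rebuilt.

```python
import ctypes
import locale

# Use the environment's locale so wcwidth() consults the matching tables.
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # keep whatever locale is already active

# Assumption: on Linux, loading with None gives access to libc's symbols.
_libc = ctypes.CDLL(None)
_libc.wcwidth.argtypes = [ctypes.c_wchar]
_libc.wcwidth.restype = ctypes.c_int


def libc_wcwidth(code_point: int) -> int:
    """Return the C library's wcwidth() result for a single code point."""
    return _libc.wcwidth(chr(code_point))


if __name__ == "__main__":
    # Sample code points; substitute the entries you believe are wrong.
    for cp in (0x0041, 0x24D0, 0x4E00):
        print(f"U+{cp:04X}: wcwidth = {libc_wcwidth(cp)}")
```

Running it before and after regenerating the locale should show whether the edited file is the one wcwidth() actually reads.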
Naaff
A: 

I believe it is explained somewhere around there

dmityugov
A: 

It looks like the data is generated by the (apparently manually run) localedata/gen-unicode-ctype.c from the Unicode data files published at http://unicode.org/Public/UNIDATA/ . Thanks to Naaff for pointing me in the right direction!
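
For reference, UnicodeData.txt in that directory is a plain text file with one code point per line and semicolon-separated fields; the first three are the code point in hex, the character name, and the general category. Here is a rough Python sketch of reading those fields. The names UnicodeEntry and parse_unicode_data_line are mine, and this is not how gen-unicode-ctype.c itself is written.

```python
from typing import NamedTuple


class UnicodeEntry(NamedTuple):
    code_point: int
    name: str
    category: str


def parse_unicode_data_line(line: str) -> UnicodeEntry:
    """Extract the first three fields of one UnicodeData.txt line."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 3:
        raise ValueError(f"not a UnicodeData.txt line: {line!r}")
    return UnicodeEntry(int(fields[0], 16), fields[1], fields[2])


if __name__ == "__main__":
    # Sample line in the UnicodeData.txt layout; a real run would read the
    # file downloaded from the URL above instead.
    sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    entry = parse_unicode_data_line(sample)
    print(f"U+{entry.code_point:04X} {entry.category} {entry.name}")
```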

bdonlan