How should I handle digits from different sets of UNICODE digits in the same string?

tags:

perl
unicode

views:

109

answers:

+4 Q:

I am writing a function that transliterates UNICODE digits into ASCII digits, and I am a bit stumped on what to do if the string contains digits from different sets of UNICODE digits. For example, given the string "\x{2463}\x{24F6}" ("④⓶"), should my function

1. return 42?
2. croak that the string contains mixed sets?
3. carp that the string contains mixed sets and return 42?
4. give the user an additional argument to specify one of the three above behaviours?
5. do something else?

+1 A:

Your current function appears to do #1.

I suggest that you also write another function to do #4, but only when the requirement appears, and not before.

I'm sure Joel wrote about "premature implementation" in a blog article sometime recently, but I can't find it.

Alnitak 2009-05-21 14:17:53

Well, since this is going to go up on CPAN, I won't know how people will want to use it. It is easy enough to add an optional parameter to the current function and do some checking if the parameter is passed in, or do what I am doing now if it isn't. What I don't know is if anyone wants that functionality.

Chas. Owens 2009-05-21 15:17:13
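
For what it's worth, the optional-parameter idea could look something like the sketch below. It is only an illustration, not the module's actual code: the function name digits_to_ascii_sketch and the on_mixed option are made up here. It handles only characters that match \d, and it assumes each digit set is a contiguous run of ten code points, so the code point of a set's zero can stand in for the set.

```perl
use strict;
use warnings;
use Carp qw(carp croak);
use Unicode::UCD qw(charinfo);

# Hypothetical helper, NOT the Unicode::Digits API.
# on_mixed => 'ignore' (default), 'warn', or 'die'
sub digits_to_ascii_sketch {
    my ($string, %opt) = @_;
    my $on_mixed = $opt{on_mixed} // 'ignore';

    my %zeros;      # code point of the zero digit of every set seen
    my $out = '';
    for my $char (split //, $string) {
        if ($char =~ /\A\d\z/) {
            # For characters matching \d the "decimal" field holds 0..9.
            my $value = charinfo(ord $char)->{decimal};
            # Assumption: digits of one set are contiguous, zero first.
            $zeros{ ord($char) - $value } = 1;
            $out .= $value;
        }
        else {
            $out .= $char;      # leave non-digits untouched
        }
    }

    if (scalar(keys %zeros) > 1) {
        croak "string contains digits from mixed sets" if $on_mixed eq 'die';
        carp  "string contains digits from mixed sets" if $on_mixed eq 'warn';
    }
    return $out;
}

# Mongolian 4 followed by Arabic-Indic 2, default behaviour
print digits_to_ascii_sketch("\x{1814}\x{0662}"), "\n";   # prints 42
```

With no option this behaves like #1, with on_mixed => 'warn' like #3, and with on_mixed => 'die' like #2, so callers who never pass the option see no change.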

so add it later when someone asks ;-)

Alnitak 2009-05-21 15:49:10

I'm not sure I see a problem.

You support numeric conversion from a range of scripts, which is to say, you are aware of the Unicode codepoints for their numeric characters.

If you find an unknown codepoint in your input data, it is an error.

It is up to you what you do in the event of an error; you may insert a space or underscore, or you may abort conversion. What you would do will depend on the environment in which your function executes; it is not something we can tell you.

Blank Xavier 2009-05-21 14:24:33

I think you are wrong. The codepoints specifically _do_ have the same meaning as the digits 0..9, it's just that some cultures use completely different glyphs for them.

Alnitak 2009-05-21 14:30:18

@Alnitak: you are correct, I mis-read the question.

Blank Xavier 2009-05-21 14:35:31

@Alnitak: answer rewritten

Blank Xavier 2009-05-21 14:38:18

I am only transliterating digits. There are 570 characters I can find in UNICODE that have the digit property, and they can all be mapped to 0 - 9 (you can check the ranges in the code I linked to). If you know of a counterexample, please share it.

Chas. Owens 2009-05-21 15:21:21

Note that some characters do not have the digit property, but could still be usefully transliterated: super/sub script numbers for example, are not considered digits, but do have a numeric value. Just in case you want to increase coverage of your function.

mirod 2009-05-21 16:17:37
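
To see the difference between "has the digit property" and "has a numeric value" for a given character, one option is to look at what Unicode::UCD's charinfo() reports. The sketch below is just an illustration: the helper name number_fields and the choice of sample code points are mine. As far as I know, \d corresponds to general category Nd, while the decimal, digit and numeric fields come from the Unicode character database.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: pick out the category and the three numeric fields.
sub number_fields {
    my ($code_point) = @_;
    my $info = charinfo($code_point) or return;   # undef if unassigned
    return +{ map { ($_ => $info->{$_}) } qw(name category decimal digit numeric) };
}

# Mongolian digit four, superscript two, double circled number ten,
# Roman numeral ten thousand
for my $cp (0x1814, 0x00B2, 0x24FE, 0x2182) {
    my $f = number_fields($cp) or next;
    printf "U+%04X %-28s cat=%s decimal=%-2s digit=%-2s numeric=%s\n",
        $cp, @{$f}{qw(name category decimal digit numeric)};
}
```

Characters with a non-empty decimal field are the ones a \d-based transliteration would pick up; the others would need a separate decision about whether and how to convert them.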

@Chas: don't forget, you can only transliterate numbers written in an Arabic-style counting system. Roman numerals, for example, cannot be converted to Arabic numerals codepoint by codepoint.

Blank Xavier 2009-05-21 16:23:17

Roman numerals do not have the digit property set (see http://www.fileformat.info/info/unicode/char/2182/index.htm for example)

mirod 2009-05-21 16:55:47

@mirod The purpose of the module is to convert things that match \d to a number you can do math with; that is why I named it Unicode::Digits, not Unicode::Numbers. Otherwise I would also need to handle things like U+24FE (DOUBLE CIRCLED NUMBER TEN). Should ⓾⓷ be 103? That way madness lies. Roman numerals likewise would need to be handled by their own transliteration engine.

Chas. Owens 2009-05-21 17:16:30

@Chas I agree, don't worry about the Roman numerals, there are already at least 4 modules on CPAN that deal with them ;--)

mirod 2009-05-21 17:32:57

@Blank Xavier I am not worried about unknown characters at this point, I am worried about mixed ranges. Should a string consisting of Mongolian 4 and Arabic 2 be treated as 42 or as an error? I think I can't make that decision and must push it off on the user (option 4). Of course, that leads me to the question of what the default should be.

Chas. Owens 2009-05-21 17:52:10
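
As one data point for the default: newer Perls (5.14 and later, as far as I know) ship a num() function in Unicode::UCD that takes the strict route. Per its documentation, for a string of more than one character it returns undef unless all characters are decimal digits from the same script. A minimal sketch of how it treats the two cases discussed here, using Mongolian 4 with Mongolian 2 and Mongolian 4 with Arabic-Indic 2 as the sample inputs:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $same  = num("\x{1814}\x{1812}");   # Mongolian 4, Mongolian 2
my $mixed = num("\x{1814}\x{0662}");   # Mongolian 4, Arabic-Indic 2

# num() returns undef when it does not consider the string a safe number
print "same set:   ", (defined $same  ? $same  : 'rejected'), "\n";
print "mixed sets: ", (defined $mixed ? $mixed : 'rejected'), "\n";
```

That is one existing precedent for treating mixed sets as an error; it does not settle what the right default is for a transliteration function that also passes non-digits through.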

My initial thought was #4, strictly because I like options. However, I changed my mind when I viewed your function.

The purpose of the function seems to be, simply, to get the resulting digits 0..9. Users may find it useful to send in mixed sets (a feature :). I'll use it.

Fran Corpier 2009-05-21 16:11:31

If you ever have to handle input in bases greater than 10, you may end up having to treat many variants on the first 6 letters of the Latin alphabet ('ABCDEF') as digits in all their forms.

Novelocrat 2009-05-30 05:43:17

Those wouldn't be UNICODE digits then, would they? This is related to turning what \d matches (i.e. characters with the digit property) in Perl back into something you can do math with. Matching numbers is something different that I leave to the individual. For instance, "IV" is sometimes considered a number (4), and sometimes an abbreviation (intravenous). There is no way (barring natural language processing) to determine which meaning (if any) "IV" has. However, "\x{1814}\x{1812}" is unambiguously 42 in Mongolian digits.

Chas. Owens 2009-05-30 17:06:07
