What is the best way to find out whether a scalar value holds ASCII/UTF-8 text or binary data in Perl? Is this code right?

if (is_utf8($scalar, 1) or ($scalar =~ m/\A [[:ascii:]]* \Z/xms)) {
     # $scalar is a text
}
else {
     # $scalar is a binary
}

Is there a better way?

+1  A: 

is_utf8 tests whether Perl's internal UTF8 flag is turned on. A scalar can contain correctly formed UTF-8 without having the flag turned on. I think it's also possible to deliberately turn the flag on for a scalar holding malformed UTF-8, but I'm not sure.
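
To illustrate the difference between the flag and the content, here is a minimal sketch. The sample character and variable names are chosen only for illustration. It builds the same smiley once as UTF-8 encoded bytes and once as a decoded character string, then reports the length and the flag for each.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# The same smiley (U+263A), once as encoded bytes and once as a character.
my $bytes = encode_utf8("\x{263A}");   # well-formed UTF-8, three bytes, flag off
my $chars = decode_utf8($bytes);       # one character, flag on

printf "bytes: length %d, flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

Both scalars describe the same text, yet only the decoded one carries the flag, so the flag alone does not say whether a scalar holds UTF-8 data.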

To check whether the scalar contains UTF-8 data, you need to check the flag and, if it is not set, also try something like

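use Encode qw(decode_utf8);
eval {
    # FB_CROAK makes decode_utf8 die on malformed input instead of
    # substituting U+FFFD; LEAVE_SRC keeps $scalar unmodified.
    my $utf8 = decode_utf8($scalar, Encode::FB_CROAK | Encode::LEAVE_SRC);
};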

and then check for errors in $@.
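
Put together, the two checks might look like the sketch below. The helper name looks_like_utf8 is hypothetical and not part of any module. It assumes a strict decode with Encode::FB_CROAK is an acceptable test for well-formed UTF-8. Plain ASCII passes, because ASCII is a subset of UTF-8, and binary data that happens to be well-formed UTF-8 passes as well, so this remains a heuristic.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the scalar carries the UTF8 flag or
# its bytes decode as strict UTF-8, and 0 otherwise.
sub looks_like_utf8 {
    my ($scalar) = @_;

    # Already a decoded character string as far as Perl is concerned.
    return 1 if utf8::is_utf8($scalar);

    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy = $scalar;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

for my $sample ("plain ascii", "\xE2\x98\xBA", "\xFF\xFE\x00") {
    printf "%-3s %vd\n", looks_like_utf8($sample) ? 'yes' : 'no', $sample;
}
```

Returning the result of the eval rather than inspecting $@ afterwards avoids depending on the value of $@, which can be clobbered between the eval and the check.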

To check whether a non-UTF-8 scalar contains non-ASCII data, your idea $scalar =~ m/\A [[:ascii:]]* \Z/xms looks ok.

Kinopiko
This is yet another example of a correct answer on a Perl question being downvoted (twice).
Kinopiko
A: 

The best way, clearly, is to simply keep track when you are reading the data. You as the programmer should already know whether you are getting text (and its encoding) or binary data. When you're reading text, you Encode::decode() it (see http://p3rl.org/UNI for details) into Perl text strings.
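
A minimal sketch of that bookkeeping, assuming the input is known to be UTF-8 encoded text (the sample bytes and variable names are invented for illustration): decode once where the data enters the program, work on characters, and encode again on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
my $wire = "caf\xC3\xA9";

# Decode at the boundary: from here on $text is a string of characters.
my $text = decode('UTF-8', $wire);

# Text operations such as ucfirst belong on the decoded string.
my $title = ucfirst $text;

# Encode again only when the data leaves the program.
my $out = encode('UTF-8', $title);

printf "%d bytes in, %d characters, %d bytes out\n",
    length($wire), length($text), length($out);
```

Binary data would skip the decode step entirely and stay in its own variables, which is what makes it possible to know at every point which kind of data a scalar holds.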

If you really don't know beforehand, the -T and -B file tests offer a heuristic.
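
The -T and -B tests operate on files and filehandles, not on scalars, so applying them to data already in memory takes a detour. The sketch below uses a hypothetical helper, guess_kind, which assumes its argument is a byte string and that writing it to a temporary file is acceptable. Note that an empty file satisfies both tests, so the helper reports empty input as text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write the bytes to a temporary file and ask
# Perl's -T / -B heuristics what the content looks like.
sub guess_kind {
    my ($data) = @_;

    my ($fh, $path) = tempfile(UNLINK => 1);   # removed at program exit
    binmode $fh;                               # write the bytes unchanged
    print {$fh} $data;
    close $fh or die "close failed: $!";

    return 'text'   if -T $path;
    return 'binary' if -B $path;
    return 'unknown';
}

my $text_guess   = guess_kind("Just some plain words.\n" x 20);
my $binary_guess = guess_kind("\x00\x01\x02\xFF" x 64);
print "$text_guess $binary_guess\n";
```

Because the tests only examine the beginning of the file and count unusual bytes, the result is a guess and not a guarantee.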

Disregard Kinopiko's answer. In the vast majority of cases you should not need to know about the internal representation of data, and messing with the utility functions from the utf8 pragma module is the wrong approach.

daxim
I don't think you've thought this through. It's very possible that a module author, for example, might need to know whether data is UTF-8 or not.
Kinopiko
I don't think you know what you're talking about. The documentation of `utf8` itself states its functions are off-hand, so generally it's wrong to recommend using them. Do read it. Pointing this fact out is justified, and I find petty retaliation through downvoting from you is bad form. (continued)
daxim
To expound my argument: An author knows whether it is text or binary data because you cannot have the same codepath for both and treat them the same way; e.g. unpack is for binary only, and ucfirst is for text only. Now, if the __encoding__ of text is unknown, that's a completely different topic, and its solution is `Encode::Detect`.
daxim
This is the only correct answer. Using the utf8 flag is ignorant, horrible, and wrong, and will break your code.
hobbs
@hobbs: You and daxim are the ignorant ones. If you are only writing small scripts for yourself, maybe you can always keep track. But, as I said above, and you and daxim both seem to have ignored, it's quite conceivable that a module author would need to check for this. For example, the DBI module contains is_utf8 in several places. In some cases, there is no other way except using that. You shouldn't just tell people that it's "horrible and wrong" without thinking about it.
Kinopiko
For my purpose, serialization of anything passed to the function, there is no way to know what is inside. /me wonders how, for example, JSON and other serializing modules handle this. In XS?
Jozef