Background. I'm working with netlists, and in general, people specify different hierarchies by using /. However, it's not illegal to actually use a / as a part of an instance name.

For example, X1/X2/X3/X4 might refer to instance X4 inside another instance named X1/X2/X3. Or it might refer to an instance named X3/X4 inside an instance named X2 inside an instance named X1. Got it?

There's really no "regular" character that cannot be used as a part of an instance name, so you resort to a non-printable one, or ... perhaps one outside of the standard 0..127 ASCII chars.

I thought I'd try (decimal) 166, because for me it shows up as the pipe: ¦.

So... I've got some C++ code which constructs the path name using ¦ as the hierarchical separator, so the path above looks like X1¦X2/X3¦X4.
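The C++ code itself isn't shown here, so the following is only a minimal sketch of how such a path might be built. The names joinPath and kSeparator are made up for illustration; the only detail taken from the description is that the separator is the single byte 166 (0xA6).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical separator: the single byte 0xA6 (decimal 166),
// which is BROKEN BAR when the bytes are read as Latin-1.
const char kSeparator = '\xA6';

// Join instance names into one path, putting kSeparator between levels.
// A '/' inside an instance name is left untouched.
std::string joinPath(const std::vector<std::string>& instances) {
    std::string path;
    for (std::size_t i = 0; i < instances.size(); ++i) {
        if (i != 0) {
            path += kSeparator;
        }
        path += instances[i];
    }
    return path;
}
```

Calling joinPath({"X1", "X2/X3", "X4"}) yields an 11-byte string in which each separator is one raw 0xA6 byte.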

Now the GUI is written in Tcl/Tk, and to properly translate this into human readable terms I need to do something like the following:

set path [getPathFromC++] ;# returns X1¦X2/X3¦X4
set humanreadable [join [split $path ¦] /]

Basically, replace the ¦ with / (I could also accomplish this with [string map]).

Now, the problem is that the ¦ in the string I get from C++ doesn't match the ¦ I can create in Tcl; i.e., this fails:

set path [getPathFromC++] ;# returns X1¦X2/X3¦X4
string match $path [format X1%cX2/X3%cX4 166 166]

Visually, the two strings look identical, but string match fails. I even tried using scan to see if I'd mixed up the bit values. But

set path [getPathFromC++] ;# returns X1¦X2/X3¦X4
set path2 [format X1%cX2/X3%cX4 166 166]
for {set i 0} {$i < [string length $path]} {incr i} {
   set p [string range $path $i $i]
   set p2 [string range $path2 $i $i]
   scan $p %c c   ;# scan takes the string first, then the format
   scan $p2 %c c2
   puts [list $p $c :::: $p2 $c2 equal? [string equal $c $c2]]
}

Produces output which looks like everything should match, except the [string equal] fails for the ¦ characters with a print line:

¦ 166 :::: ¦ 166 equal? 0

For what it's worth, the character in C++ is defined as:

const char SEPARATOR = 166;

Any ideas why a character outside the regular ASCII range would fail like this? When I changed the separator to (decimal) 28 (^\), things worked fine. I just don't want to get bitten by a similar problem on a different platform. (I'm currently using Red Hat Linux.)

+6  A: 

Latin-1 has two different vertical bar characters:

  • 124 | VERTICAL LINE
  • 166 ¦ BROKEN BAR

Some older fonts mixed up the two glyphs.

dan04
Right, the issue is that `[scan $string %c]` returns 166 for **both** my Tcl and C++ generated characters. If the problem were as you described, `[scan | %c]` would return 124 (not 166).
Trey Jackson
@dan04: I am only able to type | [vertical line] from my keyboard. How did you type the [broken bar]?
Lazer
I used Character Map.
dan04
+4  A: 

On my system, the Tcl script puts [format %c 166] writes the character as UTF-8 (the two bytes "\xC2\xA6"), while the C++ statement cout << "\xA6"; writes the single Latin-1 byte. Make sure encoding differences aren't throwing you off.
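One way to check for this is to look at the raw bytes rather than at the glyph on screen. Below is a small sketch of a hex-dump helper for that purpose; hexBytes is a made-up name, not part of any library.

```cpp
#include <cstdio>
#include <string>

// Return the bytes of s as space-separated lowercase hex, e.g. "c2 a6".
std::string hexBytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        // Format one byte as two hex digits.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

A Latin-1 separator dumps as "a6", while the same character encoded as UTF-8 dumps as "c2 a6", even though both may be displayed as ¦.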

outis
+3  A: 

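As I understand it, modern versions of Tcl use UTF-8 internally for string representation. In UTF-8, a lone byte with value 166 (0xA6) is only a continuation byte, i.e. a fragment of a multi-byte character rather than a character on its own, so it's no wonder that all hell is breaking loose. ;-)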

My guess is that your C++ code is using a Latin-1 string (i.e., char *) and you're passing that to Tcl, which is interpreting it as a UTF-8 string. You need to convert your C++ string to UTF-8 before passing it to any Tcl C functions. Tcl provides some functions for this purpose.
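To illustrate what that conversion does to the separator, here is a hand-rolled sketch that assumes the input really is Latin-1. latin1ToUtf8 is a made-up helper, not a Tcl function; in real code Tcl's own conversion routines are probably the better choice.

```cpp
#include <string>

// Convert an ISO-8859-1 (Latin-1) byte string to UTF-8.
// Assumes every input byte is one Latin-1 character.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII bytes are unchanged in UTF-8.
            utf8 += static_cast<char>(c);
        } else {
            // Code points 0x80..0xFF become two bytes: 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

With this, the single byte 0xA6 becomes the two bytes 0xC2 0xA6, which is the form Tcl expects for ¦.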

You can read more about Tcl and UTF-8.

Daniel Stutzbach
NB: Modern means "from 8.1 onwards" and has been this way for well over a decade. Also, the function the questioner is looking for is `Tcl_ExternalToUtfDString`.
Donal Fellows