Hello, does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that essentially does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)

Thanks!

+3  A: 

Do you just want to use it, or do you for some reason insist on the code?

On my Debian system, it seems the strings command can do this out of the box. See this excerpt from the manpage:

  --encoding=encoding
       Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859,
       etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
       for finding wide character strings.

Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.

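byte b;
int i = 0;                       // number of characters in the current run
while (!endOfInput()) {
  b = getNextByte();             // candidate low byte of a 16-bit unit
LoopBegin:
  if (!isEnglish(b)) {
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    continue;
  }
  if (endOfInput()) break;
  if ((b = getNextByte()) != 0) {
    // high byte is not zero, so the run ends here;
    // re-examine this byte as a possible low byte
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    goto LoopBegin;
  }
  i++; // found another character
}
if (i > 0) { /* report the final match of length i */ }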

This should work for little-endian UTF-16, where each English character's byte comes first and its zero byte second.
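For anyone who wants something runnable, here is a minimal Python sketch of the same idea (not C#, and not a port of any existing tool). The function name `utf16le_strings`, the `min_len` default of 4, and the choice of tab plus printable ASCII as "English characters" are all assumptions made for illustration.

```python
import re

def utf16le_strings(data, min_len=4):
    """Return runs of at least min_len English characters stored as UTF-16LE.

    "English" is assumed here to mean tab plus printable ASCII (0x20-0x7e).
    """
    # One character = one English byte followed by a zero high byte.
    pattern = re.compile(rb"(?:[\t\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]

if __name__ == "__main__":
    sample = b"\x01\x02" + "Hello, world".encode("utf-16-le") + b"\xff\xfe"
    print(utf16le_strings(sample))
```

Because the regular expression scans byte by byte, it does not depend on the string starting at an even offset in the file.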

jpalecek
I need the code... I need to incorporate it in a system I'm writing (in C#, if it matters).
Evan
Thanks, exactly what I needed. Pretty obvious, now that I think about it; just skip the null bytes.
Evan
A: 

Thanks Jpalecek, that's exactly the answer I needed for UTF-16.

And now that I think about it, English strings in UTF-8 will look exactly like ASCII; that's the whole point of UTF-8. So the standard strings program will pull them.
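A quick way to convince yourself of this, sketched in Python: the check below confirms that English text encodes to the same bytes in UTF-8 and ASCII, and `ascii_strings` is a hypothetical helper (not the real strings program) that pulls printable ASCII runs out of a byte buffer. The `min_len` default of 4 is an assumption.

```python
import re

def ascii_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes (plus tab)."""
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

text = "plain English text"
# English-only text produces identical bytes in UTF-8 and ASCII.
same_bytes = text.encode("utf-8") == text.encode("ascii")

if __name__ == "__main__":
    print(same_bytes)
    print(ascii_strings(b"\x00\x01" + text.encode("utf-8") + b"\x00"))
```

Non-English characters in UTF-8 use bytes of 0x80 and above, so they simply end a run rather than being mistaken for English text.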

Last time I ask a question before noon on a Monday :-)

Evan
This is not a forum; post your comments as comments.
SilentGhost