ansaurus

Question

Answer 1

+3 A:

Do you just want to use it, or do you for some reason insist on the code?

On my Debian system, it seems the strings command can do this out of the box. See this excerpt from the manpage:

  --encoding=encoding
       Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859,
       etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
       for finding wide character strings.

Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.

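// Pseudocode: scan for runs of (English byte, zero byte) pairs.
// endOfInput(), getNextByte(), isEnglish() and report() are assumed helpers.
byte b;
int i = 0;                      // characters matched in the current run
while (!endOfInput()) {
  b = getNextByte();
LoopBegin:
  if (!isEnglish(b)) {
    if (i > 0) report(i);       // report successful match of length i
    i = 0;
    continue;
  }
  if (endOfInput()) break;
  if ((b = getNextByte()) != 0) {
    if (i > 0) report(i);       // the run ended at the previous character
    i = 0;
    goto LoopBegin;             // re-examine b as a possible first byte
  }
  i++;                          // found another character
}
if (i > 0) report(i);           // flush a run that reaches the end of input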

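This should work for little-endian UTF-16, where each English character is followed by a zero byte. For big-endian, the zero byte comes first.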
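The same idea can be sketched in runnable form. This is only an illustration in Python, not the C# the asker needs, and it swaps the explicit loop for a regular expression. The function name `utf16le_strings`, the `min_len` parameter, and the choice of printable ASCII (0x20 to 0x7e) as the definition of "English" are all assumptions made for the example.

```python
import re


def utf16le_strings(data, min_len=4):
    """Return runs of printable ASCII characters stored as UTF-16LE.

    Hypothetical helper: "English" is assumed to mean bytes 0x20-0x7e,
    and min_len is an assumed minimum run length, similar in spirit to
    the minimum length used by strings(1).
    """
    # One character = one printable ASCII byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


# Sample input: two junk bytes, "hello" in UTF-16LE, a junk byte, then "hi".
sample = b"\x01\x02" + "hello".encode("utf-16-le") + b"\xff" + "hi".encode("utf-16-le")
found = utf16le_strings(sample)  # "hi" is shorter than min_len, so it is skipped
```

For big-endian data the pattern would need the zero byte before the character byte instead of after it.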

jpalecek 2009-02-23 16:02:16

I need the code... I need to incorporate it in a system I'm writing (in C#, if it matters).

Evan 2009-02-23 16:05:26

Thanks, exactly what I needed. Pretty obvious, now that I think about it; just skip the null bytes.

Evan 2009-02-23 17:06:49

Answer 2

A:

Thanks Jpalecek, that's exactly the answer I needed for UTF-16.

And now that I think about it, English strings in UTF-8 will look exactly like ASCII; that's the whole point of UTF-8. So the standard strings program will pull them out.
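A quick way to check this, sketched in Python (the sample text is arbitrary and the variable names are just for illustration):

```python
text = "Hello"  # arbitrary pure-ASCII sample

utf8 = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
utf16le = text.encode("utf-16-le")

# For pure-ASCII text, UTF-8 and ASCII produce identical bytes,
# so a plain ASCII string scanner finds UTF-8 English text unchanged.
same_as_ascii = utf8 == ascii_bytes

# UTF-16LE puts a zero byte after each ASCII character.
print(utf8)     # b'Hello'
print(utf16le)  # b'H\x00e\x00l\x00l\x00o\x00'
```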

Last time I ask a question before noon on a Monday :-)

Evan 2009-02-23 17:09:08

That's not a forum; post your comments as comments.

SilentGhost 2009-02-23 17:17:43


Unicode-aware strings(1) program
