I have a ksh script that generates a long, random string using /dev/urandom and tr:

STRING="$(cat /dev/urandom|tr -dc 'a-zA-Z0-9-_'|fold -w 64 |head -1)"

On the Linux and AIX servers where I used this it resulted in 64 characters of upper and lower case alpha chars, digits, dash and underscore characters. Example:

W-uch3_4fbnk34u2nc08w_nj23n089023ncNjxz979823n23-n88h30pmLCxkMKj

When I used the script on Solaris the ranges were interpreted as literals and it resulted in strings from the set aAzZ09-_. Example:

AA0z9_aZ-a-z00aZ9_azAZa0zZza9-Az0-_za-9aa0az_a0z-0a0z000-A9Z_0a

Oddly, on this Solaris server the man page for tr indicates that the syntax used should have produced the desired result.

The idea is to use /dev/urandom to produce a pseudo-random string from which we extract characters so that the result a) does not contain spaces and b) does not contain shell special characters. The string will be used on the command line as an argument later on in the script. We don't want to use classes like :alnum: because locale can convert these into multi-byte values that don't work on the command line. This ksh one-liner did the trick perfectly on a great many installations until we got to Solaris.

We have temporarily converted this to a somewhat nasty Perl regex. Is there a syntax for tr or some other utility or ksh built-in that will perform this task consistently across UNIX variants and is universally installed? Doesn't have to be a one-liner but simplicity is appreciated.

Update: We tried the Locale settings with no luck. Waiting on results of using xpg6 version.

$ uname -a
SunOS hostname 5.10 Generic_142900-04 sun4u sparc SUNW,SPARC-Enterprise
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0-a9-z9a_zzZAa_a_0az-9_z0a_90Z_9az09aZzZAa-9aa_-__za0ZA9_ZzzZazA
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=en_US
LC_MESSAGES=en_US
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0900z9az99_a0za09__0zA0_Z--Z_-Aa-AaA9zAZz-Aa90A00z__ZzA9A-Z0aA_-
$ unset LC_ALL; export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_AA9aA_Za-A0-AZa_A-0ZA--a_za-a9zZZz__a0az_-0A-9-0aA-0za00A-__9-0
$ unset LANG LC_COLLATE LC_NUMERIC LC_TIME
$ set | grep '^L[AC]'
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_-_9zz9Z-Z-Z-Z_0_a9zzzZZaAa--9_zAZaaAZz-ZaAZ09Z-_z-zz09ZZAzAz0Z0
$ unset LC_CTYPE LC_MESSAGES LC_MONETARY
$ set | grep '^L[AC]'
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_0aAa9_Z_a_Z--_Az-aa0ZA0ZzZ-9Aa9-Z0--0A_Z0Zaz-AA_Zz0z---Z_99z_a9
$ export LANG=C LC_ALL=C LC_COLLATE=C LC_CTYPE=C LC_MESSAGES=C LC_MONETARY=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=C
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
Za_000z9aa--aA00zAAZza0AA90090--z0a00_zZ9ZA0_---aZZ09a0ZA0_0zZaa
$ cat /dev/urandom | tr -dc "[a-z][A-Z][0-9]-_" | fold -w 64 | head -1 | sed 's/^-/_/'
x7dni9gIXVF6AHQc3B-H6hjnBVHChJ9zM-z5EQ5UEruATI_NNFaCoVLOqM6gVaT5
$

Of course, on Linux that last version spits out square brackets.

A: 

Try:

LANG=C tr -dc 'a-zA-Z0-9-_'

Also try specifying the full path to tr (and compare the results from /usr/bin/tr with those from the xpg version).

What is the difference between -c ("values") and -C ("characters") on Solaris? On Linux they're the same.

An aside: Are you able to use head -c 64 to replace fold -w 64 |head -1? Also, you can eliminate cat: tr ... < /dev/urandom | ...
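As a rough sketch of that aside (not a tested fix for the Solaris problem), the pipeline could read /dev/urandom through a redirection and stop after 64 bytes. It assumes head accepts -c, which is common but not guaranteed on every UNIX, and it adds LC_ALL=C and moves the dash to the end of the set as extra precautions that are not part of the original command:

```shell
# Read /dev/urandom via redirection (no cat) and keep only the wanted bytes.
# Assumption: head supports -c on this platform.
# LC_ALL=C is an added precaution so tr treats the input as plain bytes.
STRING="$(LC_ALL=C tr -dc 'a-zA-Z0-9_-' < /dev/urandom | head -c 64)"
echo "$STRING"
```

Because tr removes newlines as well, head -c 64 yields exactly 64 characters from the allowed set wherever the ranges are honoured.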

Ultimately, depending on availability one of these may work for you (but the character set may be a little different than what you want):

base64 /dev/urandom | head -c 64

or

uuencode /dev/urandom | head -c 64
Dennis Williamson
Thanks, I'll give these a shot and let you know. My client is in UTC+1 so I'll have to wait until tomorrow for them to test.
T.Rob
@Dennis: `LANG=C` may or may not work, depending on how the `LC_COLLATE` category was set (see [my answer](http://stackoverflow.com/questions/3567882/consistent-implementation-of-tr/3569199#3569199)).
Gilles
I was really encouraged by the Locale responses but when we tested on the Solaris host it had no effect on the behavior of tr. Thanks, though.
T.Rob
+2  A: 

What you've observed is not a difference between operating systems, but different machines having different locale settings. Your Solaris machine has LC_COLLATE set to a non-default value, which is a sure recipe for the kind of problems you have.

Locale settings are set from the environment as follows:

  • If the environment variable LC_ALL is set, its value is used for all categories.

  • Otherwise, if LC_FOO is set, its value is used for category LC_FOO.

  • Otherwise, if LANG is set, its value is used for categories that weren't explicitly set.

  • The default locale is called C. On Unix systems, POSIX is a synonym for C.
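The precedence rules above can be sketched as a small shell function. The name effective_locale is hypothetical and exists only for illustration; on a real system the locale command reports the effective values.

```shell
# effective_locale CATEGORY
# Hypothetical helper: print the locale that applies to CATEGORY
# (e.g. LC_COLLATE), following the precedence rules listed above.
effective_locale() {
  # Fetch the value of the variable whose name is in $1 (empty if unset).
  eval "cat_value=\${$1:-}"
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"      # LC_ALL overrides every category
  elif [ -n "$cat_value" ]; then
    printf '%s\n' "$cat_value"   # then the category's own variable
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"        # then LANG as the fallback
  else
    printf 'C\n'                 # nothing set: the default locale
  fi
}

# Settings as in the question's first transcript: LC_ALL wins.
( LC_ALL=C LC_COLLATE=en_US; effective_locale LC_COLLATE )   # prints: C
```

Under these rules, a session with LC_ALL=C already collates in the C locale regardless of LC_COLLATE, so changing the individual categories would not be expected to alter tr's behaviour.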

The main locale categories are:

  • LC_CTYPE indicates the character set and encoding used for file names, file contents and terminal I/O. You should carefully preserve this setting unless you know it's inaccurate (e.g. because a particular file format specifies a particular encoding).

  • LC_MESSAGES is the language of the messages that the user sees. You should preserve this setting. If you really need to parse an error message, set LC_MESSAGES=C.

  • LC_COLLATE indicates the sorting order of characters. It's nearly always undesirable in scripts. Most values other than C cause trouble, such as A-Z matching lowercase letters.

  • Occasionally LC_NUMERIC may cause trouble because numbers may be printed with different punctuation, and LC_TIME influences the way some commands show a date and time. The other categories hardly ever matter in scripts.

Here's a reasonable strategy for scripts (warning, typed directly into the browser):

unset LANGUAGE  # a GNU-specific setting
if [ -n "$LC_ALL" ]; then
  export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
  unset LC_ALL
elif [ -n "$LANG" ]; then
  export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
else
  unset LC_COLLATE LC_NUMERIC LC_TIME
fi

Standard shell utilities obey the locale settings. Perl doesn't unless you tell it to.

Gilles
Thanks, this looks very promising. I've emailed my client and asked them to test in the morning.
T.Rob
+1 for a detailed answer that led me to investigate Locale more deeply. Unfortunately, as you can see from the update to the question, nothing we did to Locale settings resulted in the correct behavior.
T.Rob
+2  A: 

If you put /usr/xpg6/bin/ at the front of your PATH, tr will work as expected. The locale seems to have no effect here. A cross-platform hack is:

tr -dc '[a-z][A-Z][0-9]_-' < /dev/urandom | tr -d '][' | fold -w64 | head -n1
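If changing PATH for the whole script is undesirable, one possible variation is to pick the tr binary explicitly. This is only a sketch: the variable name TR is made up for illustration, and it assumes that /usr/xpg6/bin/tr exists only on systems (such as Solaris) where it is the preferred implementation.

```shell
# Prefer the XPG6 tr when it is installed (Solaris); otherwise use
# whatever tr is found on PATH (Linux, AIX, ...).
if [ -x /usr/xpg6/bin/tr ]; then
  TR=/usr/xpg6/bin/tr
else
  TR=tr
fi

# Same pipeline as in the question, with cat replaced by a redirection
# and the dash moved to the end of the set.
STRING="$("$TR" -dc 'a-zA-Z0-9_-' < /dev/urandom | fold -w 64 | head -n 1)"
echo "$STRING"
```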
pixelbeat
We had the same experience - all the Locale settings had no effect whatsoever. But I heard back this morning about the testing and the xpg6 version of tr worked perfectly. Our revised version of the scripts now works on Solaris, AIX and all the Linux versions I have access to.
T.Rob