views:

441

answers:

8

My application needs to be I18N compliant. One of the requirements is that the initial password I generate should be in the user's chosen language. What API in Java can I use to achieve this? I have the user's language and need to pick a few random characters from the Unicode character set of that language.

Edit: Thank you for the answers so far. Some clarifications: This is for a web-based application. For security reasons, we cannot maintain a stock of characters/words in different languages.

Edit 2: In response to the comments and some answers: This application will be deployed in various geographies, and I don't want the people deploying it to have to do extra work to set up passwords in whichever language they deploy in. That and security are the reasons I've not accepted the most popular answer yet.

+7  A: 

While I'm not sure setting a default password is a desirable action (people often don't know how to change it and then forget it), if I were doing this I would get a load of wordlists in various languages, pick perhaps two random words, concatenate them and use the result as the password.

Means you'll have to do some leg work to find the wordlists but it should be a fairly simple process once you've got them.
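As a rough illustration of the two-word idea, here is a minimal sketch. The class name `WordPairPassword` and the way the wordlist is passed in are assumptions for the example; loading the list for the user's language is left out.

```java
import java.security.SecureRandom;
import java.util.List;

final class WordPairPassword {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Picks two random words (possibly the same one twice) and concatenates them. */
    static String generate(List<String> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("wordlist must not be empty");
        }
        String first = words.get(RANDOM.nextInt(words.size()));
        String second = words.get(RANDOM.nextInt(words.size()));
        return first + second;
    }
}
```

Calling `WordPairPassword.generate(wordsForLanguage)` with a list of known clean words would give something like two dictionary words run together.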

Edit: If you're just making random strings, it gets a lot simpler: you just store a file of available characters for each language. Open the right one when you come to generate and then pick random letters. Bish bash bosh. Done.
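Something along these lines would do for the random-letters variant. It is only a sketch: the class name and the Greek sample alphabet are made up for illustration, and reading the per-language character file is not shown.

```java
import java.security.SecureRandom;

final class RandomCharPassword {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Picks `length` random code points from `alphabet`, so surrogate pairs stay intact. */
    static String generate(String alphabet, int length) {
        int[] codePoints = alphabet.codePoints().toArray();
        if (codePoints.length == 0) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        StringBuilder password = new StringBuilder();
        for (int i = 0; i < length; i++) {
            password.appendCodePoint(codePoints[RANDOM.nextInt(codePoints.length)]);
        }
        return password.toString();
    }

    public static void main(String[] args) {
        // Hypothetical alphabet; in practice it would come from the per-language file.
        String greekLower = "αβγδεζηθικλμνξοπρστυφχψω";
        System.out.println(generate(greekLower, 8));
    }
}
```

`SecureRandom` is used rather than `java.util.Random` because the output is a credential.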

Edit 2: As Marcelo correctly commented, you could run into the problem of generating some obscene password for the user. It might be worth also keeping localised blacklisted strings to check your password for. If any of the strings appear in the password (just in it, not the whole thing), generate a different password. This does mean you'll never generate an innocent enough password like scunthorpe but it also means you won't get things like assclown slipping through either.

As you may have gathered, this is starting to look like a lot of work:

  • Get all the valid characters for every language you plan to support
  • Get all the obscene words for every language you plan to support
  • Generate a password based on the letters
  • Check none of them contain a swear word
  • Remember that some obscene words are adopted by other languages but might not feature on language-specific blacklists so keep an international black-list too.
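The generate-then-check loop from the list above could be sketched as follows. The names are hypothetical, the blacklist is assumed to be already merged (language-specific plus international), and the match is a plain case-insensitive substring test, so it will not catch look-alike spellings.

```java
import java.security.SecureRandom;
import java.util.List;
import java.util.Locale;

final class ScreenedPassword {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** True if any blacklisted string occurs anywhere inside the candidate (case-insensitive). */
    static boolean containsBlacklisted(String candidate, List<String> blacklist) {
        String lower = candidate.toLowerCase(Locale.ROOT);
        for (String banned : blacklist) {
            if (lower.contains(banned.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    /** Generates random strings from `alphabet` until one passes the blacklist check. */
    static String generate(String alphabet, int length, List<String> blacklist, int maxAttempts) {
        int[] codePoints = alphabet.codePoints().toArray();
        if (codePoints.length == 0) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StringBuilder candidate = new StringBuilder();
            for (int i = 0; i < length; i++) {
                candidate.appendCodePoint(codePoints[RANDOM.nextInt(codePoints.length)]);
            }
            String password = candidate.toString();
            if (!containsBlacklisted(password, blacklist)) {
                return password;
            }
        }
        throw new IllegalStateException("no clean password found in " + maxAttempts + " attempts");
    }
}
```

The attempt limit is there so that an overly broad blacklist fails loudly instead of looping forever.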

You might find that setting a pass-phrase using known clean words from each language (per my original answer) works better.

If all that looks too stressful, you might be better off reconsidering the reason for setting a random password in the first place. If it's an email-verification device, there are other, easier methods to use, e.g. sending a unique link to be clicked.


Edit: Would numbers be okay? They're a lot safer, don't need combing, are international and can be long enough to be unique, they're just rarely memorable. If they're one-off copy-and-paste jobs, they should do you fine.

If you need them to be short but highly unique (and need lots) perhaps mixing numbers with letters in predictable patterns like (a = letter, n = number) annn-annn gives 676,000,000 combinations. Even simple things like annn give enough to not be guessed (26000 combos) if they don't need to be unique... If these are passwords, there's nothing wrong with two being the same.
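A small sketch of the pattern idea, assuming 'a' means a lowercase ASCII letter and 'n' a digit; the class and method names are invented for the example. The `combinations` helper just multiplies 26 per letter slot and 10 per digit slot, which is where 26 × 10³ = 26,000 and 26² × 10⁶ = 676,000,000 come from.

```java
import java.security.SecureRandom;

final class PatternPassword {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Expands a pattern: 'a' becomes a random letter a-z, 'n' a random digit, anything else is copied. */
    static String generate(String pattern) {
        StringBuilder out = new StringBuilder(pattern.length());
        for (char c : pattern.toCharArray()) {
            if (c == 'a') {
                out.append((char) ('a' + RANDOM.nextInt(26)));
            } else if (c == 'n') {
                out.append((char) ('0' + RANDOM.nextInt(10)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Number of distinct strings the pattern can produce. */
    static long combinations(String pattern) {
        long total = 1;
        for (char c : pattern.toCharArray()) {
            if (c == 'a') {
                total *= 26;
            } else if (c == 'n') {
                total *= 10;
            }
        }
        return total;
    }
}
```

`PatternPassword.generate("annn-annn")` produces strings shaped like one letter, three digits, a hyphen, then the same again.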

Oli
You've read http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx, haven't you?
dan04
I have now. Had Brian and Barry followed my advice, words like `fukushita` and `kakashite` would have been screened... And it wouldn't take long to generate a phonetic-synonym list to screen out `fukusuka`. Don't get me wrong. Random strings suck because you have to coddle them so much... If you need a random string but you can't be arsed combing over the text, **just use numbers**.
Oli
@Oli: "fukusuka". Nice one. Perfectly offensive for bilingual English/Russian speakers. "need a random string but you can't be arsed combing over the text, just use numbers" I'm pretty sure you can write obscenities using leetspeak with numbers only.
SigTerm
But you wouldn't usually attempt to pronounce a long string of numbers as a single word. I think you have to look a lot harder to find evil.
Oli
+3  A: 

If you want to have meaningful words in every language, you could use a set of English words translated into the target language (via Google Translate, for example).

Then you can wrap them around with numbers and special characters for password strength.

And of course, in your place, I would instruct the user to immediately change any initial password you provide to something of his choice.

ruslanoid
+1. While an online translator may not be perfect, this solution definitely lets you stay away from Brian/Barry's curse generator. If you're worried about someone discovering your list of English words (and honestly, if they can access that list, I'd be more worried about the other things they can accomplish than discovering default passwords), you could additionally randomly convert letters in the word to numbers or symbols. 'a' _might_ get converted to '@', or to '4', or not be converted at all.
Brian S
+1  A: 

The Unicode Character Database and the Unicode Common Locale Data Repository (CLDR) contain a lot of information about languages and scripts. You can access the information using, for instance, IBM ICU (an I18N/L10N library more complete than Java's built-in locales).

In particular, you can get a set of exemplar characters using LocaleData.getExemplarSet. Note, however, that the CLDR is far from complete; you should expect missing data.

Po' Lazarus
Note that my answer doesn't lessen in any way the relevance of the comment of Marcelo...
Po' Lazarus
This looks promising. Will have a look - thank you.
Vivek Kodira
+1  A: 

Have a look at http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx where something similar is described. The key words to look for seem to be "Markov chains".

Edit: just noticed that dan04 already mentioned that link.
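For reference, a character-level Markov chain can be sketched roughly as below. This is not the code from the article, just a first-order illustration under stated assumptions: it works on UTF-16 chars, assumes '^' never occurs in the training words, and the class name is made up. As the article shows, output like this can still spell something unfortunate, so it would need screening.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MarkovPassword {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final char START = '^'; // marker for "beginning of a word" (assumed absent from the data)

    // For each character, every character that followed it in the training words (duplicates kept as weights).
    private final Map<Character, List<Character>> successors = new HashMap<>();

    MarkovPassword(List<String> trainingWords) {
        for (String word : trainingWords) {
            char previous = START;
            for (char c : word.toCharArray()) {
                successors.computeIfAbsent(previous, k -> new ArrayList<>()).add(c);
                previous = c;
            }
        }
    }

    /** Walks the chain, restarting from the word-start state whenever it hits a dead end. */
    String generate(int length) {
        StringBuilder out = new StringBuilder(length);
        char previous = START;
        while (out.length() < length) {
            List<Character> options = successors.get(previous);
            if (options == null) {
                options = successors.get(START);
            }
            if (options == null) {
                throw new IllegalStateException("no training data");
            }
            char next = options.get(RANDOM.nextInt(options.size()));
            out.append(next);
            previous = next;
        }
        return out.toString();
    }
}
```

Trained on a wordlist for the user's language, `new MarkovPassword(words).generate(10)` yields strings whose letter transitions mimic that language.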

Jörn Horstmann
+2  A: 

The JDK includes the display country names, display languages and display names for each locale, translated into all available locales. You could use this to build an initial character set for each locale, and use Character.toUpperCase and Character.toLowerCase (or the String equivalents, as below) to also get case variants.

   import java.text.DecimalFormatSymbols;
   import java.util.Arrays;
   import java.util.Locale;

   // Collects the characters used in the JDK's display names as localized for `locale`.
   String getCharsForLocale(Locale locale) {
      Locale[] allLocales = Locale.getAvailableLocales();
      StringBuilder buf = new StringBuilder();
      for (Locale l: allLocales) {
          // case-convert with the target locale's rules (matters for e.g. Turkish i)
          buf.append(l.getDisplayName(locale).toLowerCase(locale))
             .append(l.getDisplayName(locale).toUpperCase(locale))
             .append(l.getDisplayCountry(locale).toLowerCase(locale))
             .append(l.getDisplayCountry(locale).toUpperCase(locale))
             .append(l.getDisplayLanguage(locale).toLowerCase(locale))
             .append(l.getDisplayLanguage(locale).toUpperCase(locale));
      }
      // add other chars, such as decimals
      DecimalFormatSymbols decimals = DecimalFormatSymbols.getInstance(locale);
      buf.append(decimals.getMinusSign())
         .append(decimals.getPercent())
         .append(decimals.getDecimalSeparator());
      // etc..

      // now remove duplicates: sort, then keep each char only once
      // (the result still contains spaces and punctuation from the display names)
      char[] chars = new char[buf.length()];
      buf.getChars(0, chars.length, chars, 0);
      Arrays.sort(chars);
      buf = new StringBuilder();
      char last = '\0';
      for (char c: chars) {
          if (c!=last) buf.append(c);
          last = c;
      }
      return buf.toString();
   }

While this doesn't exhaustively list every single possible character, the intent is that it lists those that are commonly used.

mdma
+1  A: 

1) Use a Latin dictionary, picking random words and joining them in CamelCase until the result reaches the desired length.

2) If you insist on getting Unicode fancy, try the UN UDHR (the Universal Declaration of Human Rights, which is available in many languages). Just scrape it with some simple regex. Due to the smaller dictionary size you might need longer passwords, but who cares.

Chad Brewbaker
+1  A: 

Maybe this sounds a bit tricky, but you can keep a lot of words in a database and translate them into the user's language with some translation tool. (Maybe Google, but I don't know if Google permits that.)
It's a bit like what reCAPTCHA does. They have a lot of text (like Lorem Ipsum) from which they pick out two words.

You can combine two or three words (up to a maximum length) with a number, maybe with the country code in front. So you get things like:

  • BE_VogelHuis9751
  • FR_ChienVoiture3104
  • ES_GatoEdificio9554
  • D_BrustTausend9672
  • ...
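
A possible sketch of that format, assuming the words have already been translated and that the country prefix is supplied by the caller; the class and method names are invented here.

```java
import java.security.SecureRandom;
import java.util.List;
import java.util.Locale;

final class PrefixedWordPassword {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Upper-cases the first character; simplistic, ignores locale-specific casing rules. */
    static String capitalize(String word) {
        if (word.isEmpty()) {
            return word;
        }
        return word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
    }

    /** Builds CODE_WordWord1234 from two random words and a random four-digit number. */
    static String generate(String countryCode, List<String> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("wordlist must not be empty");
        }
        String first = capitalize(words.get(RANDOM.nextInt(words.size())));
        String second = capitalize(words.get(RANDOM.nextInt(words.size())));
        int number = 1000 + RANDOM.nextInt(9000); // always four digits
        return countryCode + "_" + first + second + number;
    }
}
```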
Martijn Courteaux
A: 

Maybe you can pick an English word and translate it with Google Translate.

Fincha