I need to do natural language detection (with confidence scores), preferably in Java; I'd really rather not introduce more platforms or technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so a web-based solution won't cut it (also, Google is unclear about this in its TOS).
While it doesn't provide confidence scores, you could at least start from my cue.language library, and modify its language-detection stuff to return all of the languages it found, with a simple confidence score. (It would be an easy modification.)
We sell one, with confidence scores: www.basistech.com. If you are an academic, we can probably let you use it for free. If you need open source, of course, there are other answers being posted.
You need to determine what language a text is in? I've used some n-gram-based algorithms for this. NGramJ is a Java implementation (one of several; I have no opinion on which Java implementation is the best).
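To give a rough idea of how the n-gram approach can produce confidence scores, here is a minimal sketch in plain Java. It is not NGramJ's API or its algorithm; the class and method names (NGramLanguageSketch, trigrams, cosine, detect) are made up for illustration, and the two tiny "profiles" are toy data. It counts character trigrams, compares the input against each language profile by cosine similarity, and normalises the similarities so they sum to 1. A real system would build its profiles from large corpora.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character-trigram language guesser. All names here are hypothetical. */
class NGramLanguageSketch {

    /** Counts overlapping character trigrams in lower-cased, space-padded text. */
    static Map<String, Integer> trigrams(String text) {
        // Collapse anything that is not a letter into a single space.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String s = " " + cleaned + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors, in [0, 1]. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int va = e.getValue();
            normA += (double) va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += (double) va * vb;
            }
        }
        for (int vb : b.values()) {
            normB += (double) vb * vb;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Scores the text against every profile; scores are normalised to sum to 1. */
    static Map<String, Double> detect(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> target = trigrams(text);
        Map<String, Double> scores = new LinkedHashMap<>();
        double total = 0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double sim = cosine(target, p.getValue());
            scores.put(p.getKey(), sim);
            total += sim;
        }
        if (total > 0) {
            final double t = total;
            scores.replaceAll((lang, sim) -> sim / t);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy profiles built from one sentence each (an assumption for the demo).
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("de", trigrams("der schnelle braune fuchs springt über den faulen hund und die katze"));
        System.out.println(detect("the dog and the fox", profiles));
    }
}
```

The normalised score is only a relative confidence among the languages you have profiles for; it says nothing about text in a language you haven't profiled, so you may also want a threshold on the raw similarity.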
Take a look at Nutch's Language Identifier. I used it a couple of years back with excellent results.