Either look for a data structure allowing you to keep a compacted dictionary in memory, or simply give your process more memory. Three hundred thousand words is not that much.
I think one way to do this would be to use a TreeSet: put the whole dictionary into it, then use the subSet method to retrieve all the words beginning with the desired letter, and pick a random element from that subset.
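A minimal sketch of that idea, assuming the dictionary has already been read into a List<String> (class and variable names here are just illustrative):

import java.util.List;
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;

class TreeSetPicker {
    private final TreeSet<String> dictionary = new TreeSet<String>();
    private final Random random = new Random();

    TreeSetPicker(List<String> words) {
        dictionary.addAll(words);
    }

    // Return a random word starting with the given letter, or null if none exist.
    String randomWordStartingWith(char letter) {
        // subSet(from, to) is the half-open range [letter, letter+1), i.e. all
        // words whose first character is 'letter'.
        SortedSet<String> range = dictionary.subSet(
                String.valueOf(letter), String.valueOf((char) (letter + 1)));
        if (range.isEmpty()) return null;
        // A SortedSet has no random access, so step an iterator forward.
        int skip = random.nextInt(range.size());
        for (String word : range) {
            if (skip-- == 0) return word;
        }
        return null; // unreachable
    }
}

The subSet view is backed by the tree, so no words are copied when you narrow down to one starting letter.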
But in my opinion, given the quantity of data, the best way to do this would be to use a database with SQL queries instead of plain Java.
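As a rough sketch of the database idea, assuming a SQLite file words.db with a one-column words table and a SQLite JDBC driver on the classpath (ORDER BY RANDOM() is SQLite syntax; other databases use a different function), the random pick can be pushed into the query so the dictionary never has to sit in the Java heap:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class RandomWordFromDb {
    public static void main(String[] args) throws SQLException {
        // Pick one random word starting with 'a' directly in the database.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:words.db");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT word FROM words WHERE word LIKE ? ORDER BY RANDOM() LIMIT 1")) {
            ps.setString(1, "a%");
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getString("word"));
                }
            }
        }
    }
}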
The goal is to increase your English language vocabulary - not to increase your computer's English language vocabulary.
If you do not share this goal, why are you (or your parents) paying tuition?
If I do this:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

class LoadWords {
    public static void main(String... args) {
        try {
            Scanner s = new Scanner(new File("/usr/share/dict/words"));
            ArrayList<String> ss = new ArrayList<String>();
            while (s.hasNextLine())
                ss.add(s.nextLine());
            System.out.format("Read %d words\n", ss.size());
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
        }
    }
}
I can run it with java -mx16m LoadWords, which limits the Java heap size to 16 MB; that is not much memory for Java. My /usr/share/dict/words file has approximately 250,000 words in it, so it may be a bit smaller than yours.
You'll need to use a different data structure than the simple ArrayList<String> that I've used. Perhaps a HashMap of ArrayList<String>, keyed on the starting letter of the word, would be a good starting choice.
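A minimal sketch of that grouping, assuming the words have already been read into a List<String> as above (names here are illustrative, not part of the original answer):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class WordIndex {
    private final Map<Character, List<String>> byFirstLetter =
            new HashMap<Character, List<String>>();
    private final Random random = new Random();

    WordIndex(List<String> words) {
        // Group every word under its (lower-cased) first letter.
        for (String word : words) {
            if (word.isEmpty()) continue;
            char first = Character.toLowerCase(word.charAt(0));
            List<String> bucket = byFirstLetter.get(first);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                byFirstLetter.put(first, bucket);
            }
            bucket.add(word);
        }
    }

    // Return a random word starting with the given letter, or null if none exist.
    String randomWordStartingWith(char letter) {
        List<String> bucket = byFirstLetter.get(Character.toLowerCase(letter));
        if (bucket == null || bucket.isEmpty()) return null;
        return bucket.get(random.nextInt(bucket.size()));
    }
}

Looking up a letter's bucket is constant time, and picking a random word from an ArrayList is a single index operation, so both steps stay cheap even with hundreds of thousands of words.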
Hope this doesn't spoil your fun or something, but if I were you I'd take this approach.
Pseudo java:
abstract class Word {
    String word;
    char last();
    char first();
}

abstract class DynamicDictionary {
    Map<Character, Set<Word>> first_indexed;

    Word removeNext(Word word) {
        // Candidates are the words whose first letter matches this word's last letter.
        Set<Word> candidates = first_indexed.get(word.last());
        return removeRandom(candidates);
    }

    /**
     * Remove and return a random word from the entire dictionary.
     */
    Word removeRandom();

    /**
     * Remove and return a random word from the set provided.
     */
    Word removeRandom(Set<Word> wordset);
}
and then
Word primer = dynamicDictionary.removeRandom();
List<Word> list = new ArrayList<Word>(500);
list.add(primer);
Word cur = primer;
for (int i = 0; i < 499; i++) {
    cur = dynamicDictionary.removeNext(cur);
    list.add(cur);
}
NOTE: Not intended to be viewed as actual Java code, just a way to roughly explain the approach (no error handling, not a good class structure if it were really used, no encapsulation, etc.)
Should I encounter memory issues, maybe I'll do this:
abstract class Word {
    int lineNumber;
    char last();
    char first();
}
If that is not sufficient, I guess I'll use a binary search on the file or put it in a DB, etc.
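A hedged sketch of that memory-saving idea, storing a byte offset rather than a line number (a plain text file cannot be jumped to by line number directly) so the word text is only read back from the dictionary file when it is actually needed; the class name and fields are illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;

class LazyWord {
    private final long byteOffset; // where this word's line starts in the file
    private final char first;
    private final char last;

    LazyWord(long byteOffset, char first, char last) {
        this.byteOffset = byteOffset;
        this.first = first;
        this.last = last;
    }

    char first() { return first; }
    char last()  { return last; }

    // Read the actual word back from disk only when it is needed.
    String read(RandomAccessFile dictionary) throws IOException {
        dictionary.seek(byteOffset);
        return dictionary.readLine();
    }
}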
Here are some word frequency lists: http://www.robwaring.org/vocab/wordlists/vocfreq.html
This text file, reachable from the above link, contains the first 2000 words that are used most frequently: http://www.robwaring.org/vocab/wordlists/1-2000.txt