views:

1573

answers:

7

G'day!

I have one million different words which I'd like to query for in a table with 15 million rows. The result of synonyms together with the word is getting processed after each query.

table looks like this:

    synonym      word
    ---------------------
    ancient      old
    anile        old
    centenarian  old
    darkened     old
    distant      far
    remote       far
    calm         gentle
    quite        gentle

This is how it is done in Java currently:

....
PreparedStatement stmt;
ResultSet wordList;
ResultSet syns;
...

stmt = conn.prepareStatement("select distinct word from table");
wordList = stmt.executeQuery();

while (wordList.next()) {
    stmt = conn.prepareStatement("select synonym from table where word=?");
    stmt.setString(1, wordList.getString(1));
    syns = stmt.executeQuery();

    process(syns, wordList.getString(1));
}
...

This is incredible slow. What's the fastest way to do stuff like this?

Cheers, Chris

+4  A: 

Two ideas:

a) How about making it one query:

select synonym from table where word in (select distinct word from table)

b) Or, if you process method needs to deal with them as a set of synonyms of one word, why not sort them by word and start process anew each time word is different? That query would be:

select word, synonym 
from table 
order by word
sblundy
I refined my querstion. The processing has to happen after each word.
chris
This would return all synonyms in the order they appear in the original table. It would also not return the words themselves. The "in (select distinct word from table)" would only act as a slow-down.
configurator
I've got two options here. I've added a query for option b
sblundy
The second point is exactly it! Sometimes I'm just blind for the easiest solution. configurator also mentiond it, but you provided code. This might be easier for others to follow, therefore: Accepted and Upvote for both ;) Cheers
chris
Just wanted to point out that running a sort on 15 million rows will take a while. Better have some coffee ready =)
depending on indexing, the sort could take little or no time, actually.
John Gardner
+5  A: 
  1. Ensure that there is an index on the 'word' column.

  2. Move the second prepareStatement outside the word loop. Each time you create a new statement, the database compiles and optimizes the query - but in this case the query is the same, so this is unnecessary.

  3. Combine the statements as sblundy above has done.

JeeBee
+3  A: 

Why are you querying the synonyms inside the loop if you're querying all of them anyway? You should use a single select word, synonym from table order by word, and then split by words in the Java code.

configurator
+1  A: 
PreparedStatement stmt;
ResultSet syns;
...

stmt = conn.prepareStatement("select distinct " + 
                             "  sy.synonm " + 
                             "from " +
                             "  table sy " +
                             "  table wd " +
                             "where sy.word = wd.word");
syns = stmt.executeQuery();
process(syns);
JosephStyons
A: 

The problem is solved. The important point is, that the table can be sorted by word. Therefore, I can easily iterate through the whole table. Like this:

....
Statement stmt;
ResultSet rs;
String currentWord;
HashSet<String> syns = new HashSet<String>();
...

stmt = conn.createStatement();
rs = stmt.executeQuery(select word, synonym from table order by word);

rs.next();
currentWord = rs.getString(1);
syns.add(rs.getString(2));

while (rs.next()) {
    if (rs.getString(1) != currentWord) {
        process(syns, currentWord);
        syns.clear();
        currentWord = rs.getString(1);
    }
    syns.add(rs.getString(2));
}
...
chris
+1  A: 

related but unrelated:

while (wordList.next()) {
    stmt = conn.prepareStatement("select synonym from table where word=?");
    stmt.setString(1, wordList.getString(1));
    syns = stmt.executeQuery();

    process(syns, wordList.getString(1));
}

You should move that preparestatement call outside the loop:

stmt = conn.prepareStatement("select synonym from table where word=?");
while (wordList.next()) {
    stmt.setString(1, wordList.getString(1));
    syns = stmt.executeQuery();

    process(syns, wordList.getString(1));
}

The whole point of preparing a statement is for the db to compile/cache/etc because you're going to use the statement repeatedly. You also may need to clean up your result sets explicitly if you're going to do that many queries, to ensure that you don't run out of cursors.

John Gardner
+1  A: 

You should also consider utilizing the statement object's setFetchSize method to reduce the context switches between your application and the database. If you know you are going to process a million records, you should use setFetchSize(someRelativelyHighNumberLike1000). This tells java to grab up to 1000 records each time it needs more from Oracle [instead of grabbing them one at a time, which is a worst-case-scenario for this kind of batch processing operation]. This will improve the speed of your program. You should also consider refactoring and doing batch processing of your word/synonyms, as

  1. fetch 1
  2. process 1
  3. repeat

is slower than

  1. fetch 50/100/1000
  2. process 50/100/1000
  3. repeat

just hold the 50/100/1000 [or however many you retrieve at once] in some array structure until you process them.