I'm experimenting with moving a JDBC webapp to JDO DataNucleus 2.1.1.
Assume I have some classes that look something like this:
public class Position { private Integer id; private String title; }
public class Employee { private Integer id; private String name; private Position position; }
The contents of the Position SQL table really don't change very often. Using JDBC, I read the entire table into memory (with the ability to refresh periodically or at will). Then, when I read an Employee into memory, I simply retrieve the position ID from the Employee table and use it to obtain the in-memory Position instance.
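Stripped down, that hand-rolled cache looks something like this (illustrative names only; resource handling and the refresh trigger are elided, and I'm assuming the obvious constructor and getters on Position):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PositionCache {
    // All Positions, keyed by primary key; replaced wholesale by refresh().
    private final Map<Integer, Position> byId = new ConcurrentHashMap<Integer, Position>();

    // Re-read the whole POSITION table (run at startup, on a timer, or on demand).
    public void refresh(Connection conn) throws SQLException {
        Map<Integer, Position> fresh = new HashMap<Integer, Position>();
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ID, TITLE FROM MYSCHEMA.\"POSITION\"");
        while (rs.next()) {
            fresh.put(rs.getInt("ID"), new Position(rs.getInt("ID"), rs.getString("TITLE")));
        }
        byId.clear();
        byId.putAll(fresh);
    }

    // Called while materializing an Employee from its POSITION_ID column.
    public Position get(Integer positionId) {
        return byId.get(positionId);
    }
}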
However, using DataNucleus, if I iterate over all Positions:
Extent<Position> extent = pm.getExtent(Position.class, true);
Iterator<Position> iter = extent.iterator();
while (iter.hasNext()) {
    Position position = iter.next();
    System.out.println(position.toString());
}
And then later, with a different PersistenceManager, iterate over all Employees, obtaining their Position:
Extent<Employee> extent = pm.getExtent(Employee.class, true);
Iterator<Employee> iter = extent.iterator();
while (iter.hasNext()) {
    Employee employee = iter.next();
    System.out.println(employee.getPosition());
}
Then DataNucleus appears to produce SQL joining the two tables when I obtain an Employee's Position:
SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = <1>
My understanding is that DataNucleus will use a cached Position instance when one is available. (Is that correct?) However, I'm concerned that the joins will degrade performance, and I'm not yet far enough along to run benchmarks. Are my fears misplaced? Should I just continue and benchmark? Is there a way to have DataNucleus avoid the join?
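For what it's worth, I believe I have the level-2 cache enabled when building the PersistenceManagerFactory; the setup is roughly this (the class and property names reflect my understanding of DataNucleus 2.1, so treat it as a sketch):
Properties props = new Properties();
props.setProperty("javax.jdo.PersistenceManagerFactoryClass",
        "org.datanucleus.jdo.JDOPersistenceManagerFactory");
// Connection URL, driver, and credentials elided.
props.setProperty("datanucleus.cache.level2.type", "soft"); // keep instances across PersistenceManagers
PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory(props);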
My .jdo metadata:
<jdo>
<package name="com.example.staff">
<class name="Position" identity-type="application" schema="MYSCHEMA" table="Position">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="title">
<column name="TITLE" jdbc-type="varchar"/>
</field>
</class>
</package>
</jdo>
<jdo>
<package name="com.example.staff">
<class name="Employee" identity-type="application" schema="MYSCHEMA" table="EMPLOYEE">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="name">
<column name="NAME" jdbc-type="varchar"/>
</field>
<field name="position" table="Position">
<column name="POSITION_ID" jdbc-type="int" />
<join column="ID" />
</field>
</class>
</package>
</jdo>
I guess what I'm hoping to be able to do is tell DataNucleus to go ahead and read the POSITION_ID int as part of the default fetch group, and see if the corresponding Position is already cached. If so, then set that field. If not, then do the join later, if called upon. Better yet, go ahead and stash that int ID somewhere, and use it if getPosition() is later called. That would avoid the join in all cases.
I would think that knowing the class and the primary key value would be enough to avoid the naive case, but I don't yet know enough about DataNucleus.
Update: With the helpful feedback I've received, my .jdo is now cleaned up. However, after adding the POSITION_ID field to the default fetch group, I'm still getting a join:
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME",A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID
I understand why it is doing that: the naive approach always works. DataNucleus might not read all the columns from the result set, returning the cached Position instead, but it is still asking the datastore to access a second table, with everything that entails, including possible disk seeks and reads. The fact that it throws that work away is little consolation.
What I was hoping to do was tell DataNucleus that all Positions will be cached: trust me on that, and if for some reason it finds one that isn't, blame me for the cache miss and (transparently) perform a separate select on the Position table. Even better, pin any Position it does have to fetch because of a miss, so there is never a cache miss on that object again.
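As far as I can tell, the closest JDO-level expression of that "trust me" contract is to warm and pin the level-2 cache myself, something like this (a sketch, assuming the L2 cache is enabled on the PMF):
// Pin every current and future Position in the level-2 cache.
DataStoreCache cache = pmf.getDataStoreCache();
cache.pinAll(true, Position.class);

// Warm the cache once by walking the extent; instances loaded here stay
// pinned and can satisfy later lookups by identity without more SQL.
PersistenceManager pm = pmf.getPersistenceManager();
try {
    Extent<Position> extent = pm.getExtent(Position.class, true);
    Iterator<Position> iter = extent.iterator();
    while (iter.hasNext()) {
        iter.next();
    }
} finally {
    pm.close();
}
Even with that in place, the generated SQL still joins to the Position table whenever an Employee's position field is loaded, which is exactly the part I'm trying to avoid.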
That is what I'm doing now using JDBC, by way of a DAO. One of the reasons for investigating a persistence layer was to ditch these DAOs. It is difficult to imagine moving to a persistence layer that can't move beyond naive fetches resulting in expensive joins.
As soon as Employee has not only a Position but also a Department and other such fields, an Employee fetch touches half a dozen tables, even though all of those objects are already pinned in the cache and are addressable given their class and primary key. In fact, I can implement this myself by changing Employee.position to an Integer, creating an IntIdentity, and passing it to PersistenceManager.getObjectById().
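Concretely, that do-it-yourself version would be something like this (a sketch; getPositionId() is a hypothetical accessor for the raw POSITION_ID value):
// Resolve the Position by identity; with validate=false this can be
// answered from the cache and only falls back to SQL on a genuine miss.
Object oid = new IntIdentity(Position.class, employee.getPositionId());
Position position = (Position) pm.getObjectById(oid, false);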
What I think I'm hearing is that DataNucleus is not capable of this optimization. Is that right? It's fine, just not what I expected.