Hello.

First, here is a motivating example:

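import java.util.LinkedList;
import java.util.List;

public class Algorithm
{
    public static void compute(Data data)
    {
        List<Task> tasks = new LinkedList<Task>();
        Client client = new Client();
        int totalTasks = 10;

        // Every task receives a reference to the same Data instance.
        for (int i = 0; i < totalTasks; i++)
            tasks.add(new Task(data));

        // The tasks are serialized when they are submitted.
        client.submit(tasks);
    }
}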

// AbstractTask implements Serializable
public class Task extends AbstractTask
{
    private final Data data;

    public Task(Data data)
    {
        this.data = data;
    }

    public void run()
    {
        // Do some stuff with the data.
    }
}

I am doing some parallel programming and have a method that creates a large number of tasks. The tasks share the data they operate on, but I am having trouble giving each task a reference to that data: when the tasks are serialized, a copy of the data is made for each task.

I could make the data a static field of the task class so that it is stored only once, but that doesn't make much sense in the context of the task class. My idea is instead to store the object as a static in a separate, external class and have the tasks request the object from that class. The object would be stored before the tasks are sent, most likely in the compute method of the example above.

Do you think this is appropriate? Can anyone offer alternative solutions or tips regarding this idea? Thanks!

A: 

Edit: The answer below is not actually relevant, due to a misunderstanding about what was being asked. Leaving it here pending more details from the question's author.


This is precisely why the transient keyword was invented.

Declares that an instance field is not part of the default serialized form of an object. When an object is serialized, only the values of its non-transient instance fields are included in the default serial representation. When an object is deserialized, transient fields are initialized only to their default value.

public class Task extends AbstractTask {
    private final transient Data data;

    public Task(Data data) {
        this.data = data;
    }

    public void run() {
        // Do some stuff with the data.
    }
}
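
To make the quoted behaviour concrete, here is a minimal sketch of a serialization round trip with a transient field. The names Payload, TransientTask and TransientDemo are hypothetical stand-ins for the classes in the question, not part of any framework. Note that because the field is also final, it cannot simply be reassigned after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the question's Data class.
class Payload implements Serializable {
    private static final long serialVersionUID = 1L;
    final String value;

    Payload(String value) {
        this.value = value;
    }
}

// Hypothetical stand-in for Task, with the data field marked transient.
class TransientTask implements Serializable {
    private static final long serialVersionUID = 1L;
    private final transient Payload data;

    TransientTask(Payload data) {
        this.data = data;
    }

    Payload getData() {
        return data;
    }
}

class TransientDemo {
    // Serializes the task into a byte array and reads it back.
    static TransientTask roundTrip(TransientTask task)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(task);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (TransientTask) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientTask copy = roundTrip(new TransientTask(new Payload("shared")));
        // The transient field was skipped, so the copy holds the default value.
        System.out.println("data is null after round trip: " + (copy.getData() == null));
    }
}
```

The deserialized task therefore has no data at all, which is why this approach only helps if the data can be obtained some other way on the receiving side.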
William Brendel
Yes, but how do I recover the object for each task? They still need to access the data.
So you are not trying to exclude the "data" field from serialization then, right? You are trying to avoid serializing the data thousands of times. Is that it?
William Brendel
Yeah, pretty much. However, if I were to store the data as a static in an external class, I'm not sure serialization would be the right word to use. It depends on how this ends up being implemented. Regardless, to answer your question more loosely, I would like to avoid sending the data across the network more than once.
A: 

I'm not sure I fully understand the question, but it sounds to me as though Tasks are actually serialized for later execution.

If this is the case, an important question would be whether all of the Task objects are written to the same ObjectOutputStream. If so, the Data will only be serialized the first time it is encountered. Later "copies" will just reference the same object handle from the stream.
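
A minimal sketch of that behaviour, assuming all tasks really are written to one stream. BulkData, StreamTask and SharedStreamDemo are hypothetical names used only for illustration; whether the framework in the question uses a single stream is not known.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the question's Data class.
class BulkData implements Serializable {
    private static final long serialVersionUID = 1L;
    final byte[] bytes = new byte[100_000];
}

// Hypothetical stand-in for Task; it holds a plain (non-transient) reference.
class StreamTask implements Serializable {
    private static final long serialVersionUID = 1L;
    final BulkData data;

    StreamTask(BulkData data) {
        this.data = data;
    }
}

class SharedStreamDemo {
    // Writes every task to one ObjectOutputStream and returns the raw bytes.
    static byte[] writeAll(List<StreamTask> tasks) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(tasks.size());
            for (StreamTask task : tasks) {
                out.writeObject(task);
            }
        }
        return bytes.toByteArray();
    }

    // Reads the tasks back from one ObjectInputStream.
    static List<StreamTask> readAll(byte[] serialized)
            throws IOException, ClassNotFoundException {
        List<StreamTask> tasks = new ArrayList<>();
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                tasks.add((StreamTask) in.readObject());
            }
        }
        return tasks;
    }

    public static void main(String[] args) throws Exception {
        BulkData shared = new BulkData();
        List<StreamTask> tasks = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            tasks.add(new StreamTask(shared));
        }
        byte[] serialized = writeAll(tasks);
        List<StreamTask> copies = readAll(serialized);
        // The stream size stays close to one payload rather than ten.
        System.out.println("bytes written: " + serialized.length);
        System.out.println("same instance after reading: "
                + (copies.get(0).data == copies.get(9).data));
    }
}
```

One caveat: calling reset() on the ObjectOutputStream discards the handles written so far, so the data would be written in full again after a reset.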

Perhaps one could take advantage of that to avoid static references to the data (which can cause a number of problems in OO design).

erickson
They aren't. I am not in control of the underlying implementation, but I know that when I call submit, the Data object is being copied for each task. I think the reason may be that the tasks are being sent to different locations.
Okay, if they are sent to different remote locations, then each location needs a copy of the data, right?
erickson
Yes, that's correct. However, hundreds of tasks may go to the same location, so it is worth sharing the data among these tasks.
A: 

Have you considered making a singleton instead of making it static?

Jesse
A: 

Can you explain more about this serialization situation you're in? How do the Tasks report a result, and where does it go -- do they modify the Data? Do they produce some output? Do all tasks need access to all the Data? Are any of the Tasks written to the same ObjectOutputStream?

Abstractly, I guess I can see two classes of solutions.

  1. If the Tasks don't all need access to all the Data, I would try to give each Task only the data that it needs.
  2. If they do all need all of it, then instead of having the Task contain the Data itself, I would have it contain an ID of some kind that it can use to get the data. Without understanding the overall situation better, I'm not sure how to get just one copy of the Data transferred to each place a Task could run and give the Task access to it. But I would suggest trying to manage the Data separately.
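
The second option could be sketched roughly as follows. DataRegistry and IdTask are hypothetical names, and the sketch deliberately leaves open how each worker JVM gets its copy of the data registered in the first place, since that depends on the framework. An int[] stands in for the real data.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-JVM registry. It is NOT shared between JVMs: every
// worker process has to register its own copy of the data once.
final class DataRegistry {
    private static final Map<String, int[]> ENTRIES = new ConcurrentHashMap<>();

    private DataRegistry() {
    }

    static void register(String id, int[] data) {
        ENTRIES.put(id, data);
    }

    static int[] lookup(String id) {
        int[] data = ENTRIES.get(id);
        if (data == null) {
            throw new IllegalStateException("No data registered for id: " + id);
        }
        return data;
    }
}

// Hypothetical task that serializes only a small ID, never the data itself.
final class IdTask implements Serializable, Runnable {
    private static final long serialVersionUID = 1L;
    private final String dataId;
    private long result;

    IdTask(String dataId) {
        this.dataId = dataId;
    }

    @Override
    public void run() {
        // Resolve the data locally at execution time, then do some work with it.
        long sum = 0;
        for (int value : DataRegistry.lookup(dataId)) {
            sum += value;
        }
        result = sum;
    }

    long getResult() {
        return result;
    }
}
```

With this shape, hundreds of tasks sent to the same location each carry only a short string, and the data itself has to reach that location just once.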
David Moles
A: 

My idea is to store the object as a static in another external class and have the tasks request the object from the class.

Forget about this idea. When the tasks are serialized and sent over the network, that object will not be sent; static data is not (and cannot be) shared in any way between JVMs.

Basically, if your Tasks are serialized separately, the only way to share the data is to send it separately, or send it only in one task and somehow have the others acquire it on the receiving machine. This could happen via a static field that the one task that has the data sets and the others query, but of course that requires that one task to be run first. And it could lead to synchronization problems.
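
A rough sketch of the "one task carries the data, the others wait for it" variant, using a static holder on the receiving JVM. ReceivedData and HandoffDemo are hypothetical names, an int[] stands in for the real data, and the threads only simulate tasks arriving in an unlucky order.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical static holder that exists once per receiving JVM.
final class ReceivedData {
    private static final CompletableFuture<int[]> DATA = new CompletableFuture<>();

    private ReceivedData() {
    }

    // Called by the single task that actually carries the data.
    static void publish(int[] data) {
        DATA.complete(data);
    }

    // Called by every other task; blocks until the data arrives
    // or the timeout expires.
    static int[] await(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        try {
            return DATA.get(timeout, unit);
        } catch (ExecutionException e) {
            // Only reachable if the future was completed exceptionally.
            throw new IllegalStateException(e.getCause());
        }
    }
}

class HandoffDemo {
    public static void main(String[] args) throws Exception {
        // A dependent task that happens to start before the data has arrived.
        Thread dependent = new Thread(() -> {
            try {
                int[] data = ReceivedData.await(5, TimeUnit.SECONDS);
                System.out.println("dependent task saw " + data.length + " values");
            } catch (InterruptedException | TimeoutException e) {
                System.out.println("data never arrived: " + e);
            }
        });
        dependent.start();

        // The carrier task runs later and publishes the data.
        ReceivedData.publish(new int[] {1, 2, 3});
        dependent.join();
    }
}
```

The timeout matters: if waiting tasks occupy every worker thread, the task carrying the data may never get to run, which is one form the synchronization problems mentioned above can take.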

But actually, it sounds like you are using some sort of processing queue that assumes tasks to be self-contained. By trying to have them share data, you are going against that concept. How big is your data anyway? Is it really absolutely necessary to share the data?

Michael Borgwardt