I went through answers on similar topics here on SO but couldn't find a satisfying answer. Since I know this is a rather large topic, I will try to be more specific.

I want to write a program which processes files. The processing is nontrivial, so the best way is to split the different phases into standalone modules which would then be used as necessary (sometimes I will only be interested in the output of module A, sometimes I will need the output of five other modules, etc.). The thing is that I need the modules to cooperate, because the output of one might be the input of another. And I need it to be FAST. Moreover, I want to avoid doing certain processing more than once (if module A creates some data which then needs to be processed by modules B and C, I don't want to run module A twice to create the input for B and C).

The information the modules need to share would mostly be blocks of binary data and/or offsets into the processed files. The task of the main program would be quite simple - just parse arguments, run required modules (and perhaps give some output, or should this be the task of the modules?).

I don't need the modules to be loaded at runtime. It's perfectly fine to have libs with a .h file and recompile the program every time there is a new module or some module is updated. The idea of modules is here mainly for code readability and maintainability, and so that more people can work on different modules without needing some predefined interface or whatever (on the other hand, some "guidelines" on how to write the modules would probably be required, I know that). We can assume that the file processing is a read-only operation; the original file is not changed.

Could someone point me in a good direction on how to do this in C++? Any advice is welcome (links, tutorials, PDF books...).

+2  A: 

I am wondering if C++ is the right level to think about for this purpose. In my experience, it has always proven useful to have separate programs that are piped together, in the UNIX philosophy.

If your data is not overly large, there are many advantages in splitting. First, you gain the ability to test every phase of your processing independently: you run one program and redirect the output to a file, so you can easily check the result. Then, you take advantage of multi-core systems even if each of your programs is single-threaded, and thus much easier to create and debug. You also take advantage of the operating system's synchronization through the pipes between your programs. Maybe some of your programs could even be replaced by already existing utility programs?

Your final program will be the glue that gathers all of your utilities into a single program, piping data from one program to another (no more files at this point), and replicating it as required for all your computations.
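
As a rough illustration of what one such stage could look like, here is a minimal sketch of a filter that reads from one stream and writes to another. The name `to_lower_stage` and the lower-casing transformation are placeholders of mine, not something from the question; a real stage would have a `main` that passes `std::cin` and `std::cout` (opened in binary mode on Windows if the data is binary).

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical pipeline stage: copies `in` to `out`, lower-casing each byte.
void to_lower_stage(std::istream& in, std::ostream& out)
{
    char c;
    while (in.get(c))
        out.put(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
}

// Checks the stage in isolation by feeding it a string instead of a pipe.
void check_to_lower_stage()
{
    std::istringstream in("Hello, World");
    std::ostringstream out;
    to_lower_stage(in, out);
    assert(out.str() == "hello, world");
}
```

Taking the streams as parameters is what makes the stage testable on its own: the same function runs on a pipe in production and on string streams in a test.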

Didier Trosset
Forgot to say that I'm bound to Windows. And I really want just one program, not a set of programs which would work together (since it is quite possible that the modules I create won't be used only in my app, but also in others). Anyway, thanks for your answer.
PeterK
There are libraries for piping independent of the OS (or more precisely, abstracting it).
Matthieu M.
Being bound to Windows is not a show-stopper for creating several programs and piping them together. Even Windows can do this perfectly!
Didier Trosset
+1  A: 

This looks very similar to a plugin architecture. I recommend starting with an (informal) data flow chart to identify:

  • how these blocks process data
  • what data needs to be transferred
  • what results come back from one block to another (data/error codes/ exceptions)

With this information you can start to build generic interfaces which can be bound to other interfaces at runtime. Then I would add a factory function to each module to request the real processing object from it. I don't recommend getting the processing objects directly from the module interface; instead, return a factory object from which the processing objects can be retrieved. These processing objects are then used to build the entire processing chain.

An oversimplified outline would look like this:

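// Data and WhichDoIWant are placeholder types for this outline
struct Processor
{
    void doSomething(Data);
};

struct Module
{
    std::string name();
    Processor* getProcessor(WhichDoIWant);
    void deleteprocessor(Processor*); // hand the processor back to its module
};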
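
One way to keep the "processor goes back to its module" rule without relying on the caller to remember it is a `std::unique_ptr` with a custom deleter. The following is only a sketch under my own assumptions: `Returner`, `Handle`, the `alive` counter and the byte-counting `doSomething` are hypothetical names and placeholder behavior, and the module is assumed to outlive every handle it gives out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct Processor
{
    // Placeholder work: counts how many bytes it has been given.
    void doSomething(const std::string& data) { bytes += data.size(); }
    std::size_t bytes = 0;
};

struct Module
{
    // The handle's deleter hands the processor back to the module that made it.
    struct Returner
    {
        Module* owner;
        void operator()(Processor* p) const { owner->deleteProcessor(p); }
    };
    typedef std::unique_ptr<Processor, Returner> Handle;

    Handle getProcessor()
    {
        ++alive;
        return Handle(new Processor, Returner{this});
    }

    int alive = 0; // processors currently handed out

private:
    // The module stays free to pool or reuse here; this sketch just deletes.
    void deleteProcessor(Processor* p) { --alive; delete p; }
};

void check_handle_returns_processor()
{
    Module module;
    {
        Module::Handle p = module.getProcessor();
        p->doSomething("abc");
        assert(p->bytes == 3 && module.alive == 1);
    } // the handle leaves scope here and returns the processor on its own
    assert(module.alive == 0);
}
```

The module still decides what "deletion" means (delete, return to a pool, keep a singleton), while the caller cannot forget the hand-back.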

Off the top of my head, these patterns are likely to appear:

  • factory function: to get objects from modules
  • composite and decorator: forming the processing chain
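
To make the composite half of that list a bit more concrete, here is a small sketch of a chain that is itself a processor. It assumes a `process` function that takes and returns a `std::string`, which differs from the `doSomething(Data)` outline above; `LowerCase`, `StripSpaces` and `Chain` are made-up names for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Processor
{
    virtual ~Processor() {}
    virtual std::string process(const std::string& data) = 0;
};

struct LowerCase : Processor
{
    std::string process(const std::string& data) override
    {
        std::string out(data);
        for (char& c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    }
};

struct StripSpaces : Processor
{
    std::string process(const std::string& data) override
    {
        std::string out;
        for (char c : data)
            if (c != ' ') out += c;
        return out;
    }
};

// Composite: a chain is itself a Processor that runs its stages in order.
struct Chain : Processor
{
    void add(std::unique_ptr<Processor> stage) { stages.push_back(std::move(stage)); }

    std::string process(const std::string& data) override
    {
        std::string current(data);
        for (auto& stage : stages) current = stage->process(current);
        return current;
    }

    std::vector<std::unique_ptr<Processor>> stages;
};

void check_chain()
{
    Chain chain;
    chain.add(std::unique_ptr<Processor>(new LowerCase));
    chain.add(std::unique_ptr<Processor>(new StripSpaces));
    assert(chain.process("A b C") == "abc");
}
```

Because `Chain` is a `Processor`, a chain can be added as a stage of another chain, which is where the composite pattern pays off.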
Rudi
Thank you for your answer, the factory pattern approach looks good!
PeterK
The implementation of the factory looks wrong though. Use RAII and stop asking the client to return its `Processor` to the `Module`: we know they'll forget!
Matthieu M.
@Matthieu M. Even if there was no delete method, the client side must perform the deletion, since the objects can't be passed by value, only by pointer. So RAII does not prevent any damage at this point. The reason to have a deletion method is to have more freedom for the factory implementation, and not to be forced to use new for the object construction. I use this pattern in one project where some factories create objects on demand, whereas others return pointers to singletons or objects from a pool.
Rudi
Hmm, I think I understand: the `deleteprocessor` method is in fact there to ask the `Module` (factory) to remove an item from the "constructible" objects, is that it? I usually use the "id" for that, so as not to ask the client to retrieve the object first.
Matthieu M.
@Matthieu M. My approach is that the factory returns the processor object, the requesting code binds this processor object into some processing context, then the processing happens, and afterwards the processor object gets passed back to its factory for deletion. This way I can have more than one processor object alive at the same time. Say I have two pipelines where every char should be converted to lower case: my factory can return two independent lower-case processors (or one singleton instance), and whenever one of these pipes is done, it returns its lower-case processor.
Rudi
Ah, then I don't like your solution: why do you explicitly have to return the processor to the factory? Using the RAII concept, it would be returned automatically (if you don't want to simply delete it) when the handle it's bound to goes out of scope. That's much cleaner.
Matthieu M.
@Matthieu M. You are mixing up RAII with smart pointers. RAII means that allocation of a resource is also the initialization; nothing is said about the resource deallocation.
Rudi
Well, technically RAII only speaks of initialization. However it is generally used for guaranteed deallocation (that is implied by the ownership bit). So I am not mixing it up, but I may not be expressing myself clearly enough...
Matthieu M.
A: 

This really seems quite trivial, so I suppose we are missing some requirements.

Use memoization to avoid computing a result more than once. This should be done in the framework.

You could use some flowchart to determine how to pass the information from one module to another... but the simplest way is to have each module directly call those it depends upon. With memoization it does not cost much: if the result has already been computed, you're fine.
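
Stripped of everything else, the idea can be sketched like this. `Runner`, `Step` and the `int` results are stand-ins I made up to keep the example short; the point is only that B and C both ask for A, and A still runs once.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical memoizing runner: each step is a function that may ask the
// runner for the results of other steps.
class Runner
{
public:
    typedef std::function<int(Runner&)> Step;

    void add(const std::string& id, Step step) { mSteps[id] = step; }

    int get(const std::string& id)
    {
        std::map<std::string, int>::const_iterator hit = mResults.find(id);
        if (hit != mResults.end()) return hit->second; // already computed

        int value = mSteps.at(id)(*this); // throws std::out_of_range if unknown
        mResults[id] = value;
        return value;
    }

private:
    std::map<std::string, Step> mSteps;
    std::map<std::string, int> mResults; // memoized results
};

void check_memoization()
{
    int runsOfA = 0;
    Runner runner;
    runner.add("A", [&runsOfA](Runner&) { ++runsOfA; return 10; });
    runner.add("B", [](Runner& r) { return r.get("A") + 1; });
    runner.add("C", [](Runner& r) { return r.get("A") * 2; });

    assert(runner.get("B") == 11);
    assert(runner.get("C") == 20);
    assert(runsOfA == 1); // A ran once even though B and C both need it
}
```

Note that this sketch does nothing about cyclic dependencies; a step that indirectly asks for itself would recurse forever.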

Since you need to be able to launch just about any module, you need to give them IDs and register them somewhere with a way to look them up at runtime. There are two ways to do this.

  • Exemplar: You get the unique exemplar of this kind of module and execute it.
  • Factory: You create a module of the kind requested, execute it and throw it away.

The downside of the Exemplar method is that if you execute the module twice, you'll not be starting from a clean state but from the state that the last (possibly failed) execution left it in. For memoization it might be seen as an advantage, but if it failed the result is not computed (urgh), so I would recommend against it.
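
The difference is easy to see with a module that keeps state between runs. This is a toy sketch: `Counter`, `ExemplarRegistry` and `FactoryRegistry` are hypothetical names, and a real module would obviously do more than count its own calls.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical module that accumulates state while it runs.
struct Counter
{
    int run() { return ++calls; }
    int calls = 0;
};

// Exemplar: one shared instance per id, so state survives between runs.
struct ExemplarRegistry
{
    std::map<std::string, std::shared_ptr<Counter>> exemplars;
    int run(const std::string& id) { return exemplars.at(id)->run(); }
};

// Factory: a fresh instance per run, discarded afterwards.
struct FactoryRegistry
{
    std::map<std::string, std::function<std::unique_ptr<Counter>()>> factories;
    int run(const std::string& id) { return factories.at(id)()->run(); }
};

void check_registries()
{
    ExemplarRegistry exemplars;
    exemplars.exemplars["counter"] = std::make_shared<Counter>();
    assert(exemplars.run("counter") == 1);
    assert(exemplars.run("counter") == 2); // second run sees leftover state

    FactoryRegistry factories;
    factories.factories["counter"] = [] { return std::unique_ptr<Counter>(new Counter); };
    assert(factories.run("counter") == 1);
    assert(factories.run("counter") == 1); // every run starts clean
}
```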

So how do you ... ?

Let's begin with the factory.

#include <map>
#include <memory>
#include <string>

class Module;
class Result;

class Organizer
{
public:
  void AddModule(std::string id, const Module& module);
  void RemoveModule(const std::string& id);

  const Result* GetResult(const std::string& id) const;

private:
  typedef std::map< std::string, std::shared_ptr<const Module> > ModulesType;
  typedef std::map< std::string, std::shared_ptr<const Result> > ResultsType;

  ModulesType mModules;
  mutable ResultsType mResults; // Memoization
};

It's a very basic interface really. However, since we want a new instance of the module each time we invoke the Organizer (to avoid reentrancy problems), we will need to work on our Module interface.

class Module
{
public:
  // std::auto_ptr was removed in C++17; std::unique_ptr is the replacement
  typedef std::unique_ptr<const Result> ResultPointer;

  virtual ~Module() {}               // it's a base class
  virtual Module* Clone() const = 0; // traditional cloning concept

  virtual ResultPointer Execute(const Organizer& organizer) = 0;
}; // class Module

And now, it's easy:

// Organizer implementation
const Result* Organizer::GetResult(const std::string& id) const
{
  ResultsType::const_iterator res = mResults.find(id);

  // Memoized ?
  if (res != mResults.end()) return res->second.get();

  // Need to compute it
  // Look module up
  ModulesType::const_iterator mod = mModules.find(id);
  if (mod == mModules.end()) return 0; // unknown module

  // Create a throw away clone
  std::unique_ptr<Module> module(mod->second->Clone());

  // Compute
  std::shared_ptr<const Result> result(module->Execute(*this).release());
  if (!result.get()) return 0;

  // Store result as part of the Memoization thingy
  mResults[id] = result;

  return result.get();
}
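
The registration side is not shown above, so here is one possible sketch of it, trimmed down to just that part (`GetResult` and `Execute` are left out). Cloning inside `AddModule` is my assumption about how a `const Module&` ends up in a `shared_ptr`, dropping the memoized result on add/remove is likewise an assumption, and `HasModule` is a hypothetical helper added only so the behavior can be checked.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct Result { virtual ~Result() {} };

struct Module
{
    virtual ~Module() {}
    virtual Module* Clone() const = 0;
};

class Organizer
{
public:
    // Stores a private copy, so the caller may pass a temporary.
    void AddModule(const std::string& id, const Module& module)
    {
        mModules[id] = std::shared_ptr<const Module>(module.Clone());
        mResults.erase(id); // a result from a replaced module is stale
    }

    void RemoveModule(const std::string& id)
    {
        mModules.erase(id);
        mResults.erase(id);
    }

    bool HasModule(const std::string& id) const { return mModules.count(id) != 0; }

private:
    std::map<std::string, std::shared_ptr<const Module> > mModules;
    std::map<std::string, std::shared_ptr<const Result> > mResults;
};

struct FooModule : Module
{
    FooModule* Clone() const override { return new FooModule(*this); }
};

void check_add_remove()
{
    Organizer org;
    org.AddModule("FooModule", FooModule()); // the temporary is cloned
    assert(org.HasModule("FooModule"));
    org.RemoveModule("FooModule");
    assert(!org.HasModule("FooModule"));
}
```

This only drops the result stored under the same id; results of other modules that were computed from it would stay memoized, which a real implementation would have to decide how to handle.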

And a simple Module/Result example:

struct FooResult: Result { FooResult(int r): mResult(r) {} int mResult; };

struct FooModule: Module
{
  virtual FooModule* Clone() const { return new FooModule(*this); }

  virtual ResultPointer Execute(const Organizer& organizer)
  {
    // check that the file has the correct format
    if(!organizer.GetResult("CheckModule")) return ResultPointer();

    return ResultPointer(new FooResult(42));
  }
};
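
Since `GetResult` only hands back a `const Result*`, a module that wants the actual data of another module has to get at the concrete type somehow. One option is a checked downcast, sketched below. `ResultAs` and `BarResult` are hypothetical names, and the sketch assumes `Result` has a virtual destructor (the base class is only forward-declared above, so that part is a guess).

```cpp
#include <cassert>
#include <memory>

struct Result
{
    virtual ~Result() {} // polymorphic base, so dynamic_cast works
};

struct FooResult : Result
{
    explicit FooResult(int r) : mResult(r) {}
    int mResult;
};

struct BarResult : Result {};

// Hypothetical helper: returns the result as T, or a null pointer when the
// pointer is null or the result is of a different concrete type.
template <typename T>
const T* ResultAs(const Result* result)
{
    return dynamic_cast<const T*>(result);
}

void check_result_as()
{
    std::unique_ptr<const Result> stored(new FooResult(42));

    const FooResult* foo = ResultAs<FooResult>(stored.get());
    assert(foo && foo->mResult == 42);

    assert(ResultAs<BarResult>(stored.get()) == nullptr); // wrong type
    assert(ResultAs<FooResult>(nullptr) == nullptr);      // missing result
}
```

Inside a module's `Execute` this would be used on whatever `organizer.GetResult(...)` returns, treating a null pointer as "dependency failed or has the wrong type".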

And from main:

#include <iostream>

#include "project/organizer.h"
#include "project/foo.h"
#include "project/bar.h"

int main(int argc, char* argv[])
{
  Organizer org;

  org.AddModule("FooModule", FooModule());
  org.AddModule("BarModule", BarModule());

  for (int i = 1; i < argc; ++i)
  {
    const Result* result = org.GetResult(argv[i]);
    if (result) result->print();
    else std::cout << "Error while playing: " << argv[i] << "\n";
  }
  return 0;
}
Matthieu M.