+1  A: 

The simplest solution to your problem is to just count the number of people infected whenever you want to know the answer. This does mean you have to loop over every host at query time, but it's not combinatorial.

If that is too slow, then I would suggest maintaining a multimap from serotype to each infected host. This has the same synchrony problems as your fourth option, but it has the advantage that you easily have access to all the infected hosts to calculate more statistics.

If you're really doing a lot of queries based on lots of different attributes, and you don't want to manually build up the data structures to make each query efficient, you may want to consider using an SQL relational database. These are really designed for such queries.

mathmike
Thanks for voting for option four. I need to know the number of hosts in each age class with each serotype at frequent intervals, but it's probably simpler for me to update counts at relevant event times (e.g., when a host gets infected or dies) rather than looping over all hosts at every query. The multimap option is a good one, and I might adopt it in the future.
Sarah
+1  A: 

To implement option 2, define an enumerated type for the serotypes within another class, then write a templated key extractor. I'm not certain that boost::multi_index_container supports reindexing (I doubt it does), so this approach may be doomed from the start.

class ... {
    enum Serotype {
        ...
        sup // 1 greater than the highest serotype
    };
    ...
};

template <class H, typename H::Serotype s>
struct serotype_key_extractor {
    typedef bool result_type;

    result_type operator()(const boost::shared_ptr<Host>& h) const 
        { ! return h->isInfected(s); }
};

Of course, this requires an index for each serotype on your multi_index_container (Serotype::sup total), with a probable performance of O(I * lg(n)) for each index over the lifetime of the simulation (where I is the number of infection events and n is the number of hosts). If there are a one or two common queries, you could use composite keys, with the serotype key extractors one component.

Option 4, keeping track of infected hosts in a map of sets std::map<H::Serotype, std::set<boost::shared_ptr<Host> > >, can be made more performant over infection events at the cost of querying. Insertion to/removal from a set might be O(lg(n)), so it might be better to use a hash or list rather than a set, though this will increase the cost of queries. Since you're running a simulation, keeping this structure synchronized isn't too bad. Update it during an infection (or cure, if they exist) event. If you're using event based programming, it's quite easy.

For querying with option 4, you could build other sets based on the other attributes, then find the intersection of those sets. This is O(f(I)+f(n)+g(S,T)) overall, where f(x) is the cost of inserting x elements, g(X,Y) is the cost of intersecting sets and S and T are the size of the sets (the O(f(I)+f(N)) comes from building all the sets; I'm playing it a little loose with the timing, since I might be assuming f(x) is homomorphic, which it likely isn't). At best, f(x)=x and g(X,Y)=X+Y, depending on your set implementation. Alternatively, fold over one of the sets (e.g. find all the hosts of a particular age, then count infected; count_if is a particular fold you could use, where count_if(start, end, pred) is equivalent to foldl(start, end, [](int count, const boost::shared_ptr<Host>& h){return count + pred(h); })). In the latter case, make use of the most restrictive index to reduce the sub-population you need to scan as much as possible. This is O(f(I)+S) overall.

Part of the difficulty in designing a performant solution is that you can really only make use of one independent index at a time. Within the groups from an index, other attributes you might index on are arranged essentially randomly. If two indices aren't independent (those that are geographically close are more likely to have the same infections), you might be able to make use of both simultaneously, though this doesn't gain you much: you can get most of the sub-population early in the query, but you'd still have to examine the rest of the group from the primary index.

If you're not married to C++, you could switch to C# and handle the queries with LINQ.

As for switching to a database, as mathmike suggests, you'd probably only want to use an object-oriented, in-memory database. Christopher Brown lists some, including FastDB and OOFile, though I can't say how suitable they are for your project.

outis
Thank you for the thorough answer, including all the embedded references. I'm not testing the templated key extractor just now, since I'm trying to make a very rudimentary version of option 4 work, and you're right that it's pretty simple with with event-driven programming. I'm pretty married to C++ (not a professional programmer, mainly need to get scientific results).
Sarah
+1  A: 

I find your question very realistic.

As the number of serotypes seems to be big (~100) I don't think the option to indexing by serotype can be a valid option. If the number of serotype is reduced '~10) the answer of outis is the correct way to go on to make indexes on serotypes.

If I had to implement your problem, and if one of the main goals is to "be able to determine at any time in the simulation the number of people infected with a given serotype" I would maintain a counter for each possible serotype and increase/decrease each time I add an element in the container with this serotype. You could of course encapsulate this on a class that maintain the counters and the multi-index. Of course this is intrusive, but what it is important it to provide what you need.

This is simple and efficient.

If you need to iterate over all the Host satisfying a given query "after counting the number infected with serotype s, count the number of those infected who are also in a particular household and have a particular age" you could do the following

  • get the index related to the age and get the range of those with the given age (equal_range) or
  • get the index relative to the household and get the range of those with the given household (equal_range)

and on this index count all the host satisfying the other conditions using 'count_if'.

There is also a Project iterator feature that could be helpful.

Even if you had no serotype, the more simple query "count the number of those infected who are also in a particular household and have a particular age" for two information elements that have an index you will need to use count_if on one of the index. There is no such thing as a merge on two ranges.

This is simple, has a good compromise between efficiency and data.

If you need to make a complex query that need to be very efficient you will need to add an index that take care of all the conditions. For example for the preceding query you could add an index sorted by the household and age. The use of this index will give you directly with equal_range the concerned elements, but you will need yet to count them using count_if. This will be more efficient as the set of elements to iterate on will be shorter.

Vicente Botet Escriba
Vicente, I appreciate your suggesting I go with the external counters of total infectons by serotype and explaining why this is efficient. (I will also index by age.) I did not mention above that I am already using single and composite keys in my MultiIndex to extract hosts by age, household, id, marriage status, etc.--those numbers are easy to count--but I'm still faced with an inability to sort first with respect to any serotype s. I'm struggling to find documentation for count_if with MultiIndex. Do you have examples? What kinds of arguments can it take?
Sarah
+1  A: 

If all serotype states are fully independent, and you can't afford to create, say, a vector per serotype and push on / erase shared_ptrs appropriately, there's nothing you can do to go faster than just looping through them all to determine the count of an arbitrary serotype. Now that could be optimized in some ways, like a binary string, but fundamentally, it's the same.

I've been thinking about a sort of, minimum serotype value. If you're going to be querying many different serotypes in quick order, you could store in the object the maximum serotype and the minimum serotype it has, efficiently excluding it from all the remaining searches at once.

DeadMG
The min serotype value might be smart if we end up simulating with a lot of serotypes--thanks.
Sarah
+1  A: 

Solution 2 (maintaining a separate index for each serotype can be done with a little metaprogramming:

#include <boost/mpl/push_back.hpp>
#include <boost/mpl/range_c.hpp>
#include <boost/mpl/fold.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/shared_ptr.hpp>

using namespace boost::multi_index;

struct host
{
  static const int NUM_SEROTYPES=8;

  bool infections[NUM_SEROTYPES];
};

typedef boost::shared_ptr<host> host_ptr;

template<typename Num>
struct infection_extractor
{
  typedef bool result_type;

  result_type& operator()(const host_ptr& p)const
  {
    return p->infections[Num::value];
  }
};

typedef boost::mpl::fold<
  boost::mpl::range_c<int,0,host::NUM_SEROTYPES>,
  boost::mpl::vector<>,
  boost::mpl::push_back<
    boost::mpl::_1,
    ordered_non_unique<infection_extractor<boost::mpl::_2> >
  >
>::type index_list_type;

typedef multi_index_container<
  host_ptr,
  index_list_type
> multi_t;

template<int Num>
std::size_t infected_with(const multi_t& m)
{
  return m.get<Num>().count(true);
};

typedef std::size_t (*infected_with_type)(const multi_t&);

struct infected_with_table_assigner
{
  infected_with_table_assigner(infected_with_type* table):table(table)
  {
    boost::mpl::for_each<
      boost::mpl::range_c<int,0,host::NUM_SEROTYPES>
    >(*this);
  }

  template<typename Num>
  void operator()(Num)const{table[Num::value]=&infected_with<Num::value>;}

  infected_with_type* table;
};

std::size_t infected_with(const multi_t& m,int n)
{
  static infected_with_type           table[host::NUM_SEROTYPES];
  static infected_with_table_assigner assigner(table);

  return table[n](m);
}

#include <iostream>

int main()
{

  multi_t m;
  host h;
  for(int i=0;i<200;++i){
    for(int j=0;j<host::NUM_SEROTYPES;++j)h.infections[j]=(i&(1<<j))?true:false;
    m.insert(host_ptr(new host(h)));
  }

  for(int n=0;n<host::NUM_SEROTYPES;++n)
  {
    std::cout<<"infected with serotype "<<n<<": "
      <<infected_with(m,n)<<std::endl;
  }
}

But take into account that maintaining 8~10 indices has a memory and insertion time penalty. A simpler solution is to just maintain one random access index and sort this appropriately (with a custom comparer suited to the particular serotype of interest in each moment) before querying.

Joaquín M López Muñoz
This is exactly the sort of thing that I haven't done before and wasn't sure could be done. Thanks also for pointing out the memory and insertion time penalties. I think I will avoid it for now, since we will probably start with many serotypes. The random access index, if I understand it correctly, is also slightly costly with insertions and deletions. Option four appears cheapest, assuming that most of the time I need only the number of people of a given age infected with a serotype.
Sarah