Boost.MultiIndex Tutorial: Index types


Classification

Boost.MultiIndex provides six different index types, which can be classified as shown in the table below. Ordered and sequenced indices, which are the most commonly used, have been explained in the basics section; the rest of index types can be regarded as variations of the former providing some added benefits, functionally or in the area of performance.

Boost.MultiIndex indices
 

type specifier
  key-based     ordered     ordered_unique  
  ordered_non_unique  
  hashed     hashed_unique  
  hashed_non_unique  
  non key-based     sequenced  
  random_access  

Key-based indices, of which ordered indices are the usual example, provide efficient lookup of elements based on some piece of information called the element key: There is an extensive suite of key extraction utility classes allowing for the specification of such keys. Fast lookup imposes an internally managed order on these indices that the user is not allowed to modify; non key-based indices, on the other hand, can be freely rearranged at the expense of lacking lookup facilities. Sequenced indices, modeled after the interface of std::list, are the customary example of a non key-based index.


Hashed indices

Hashed indices constitute a trade-off with respect to ordered indices: if correctly used, they provide much faster lookup of elements, at the expense of losing sorting information. Revisit the employee_set example: suppose a field for storing the Social Security number is added, with the requisite that lookup by this number should be as fast as possible. Instead of the usual ordered index, a hashed index can be resorted to:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/member.hpp>

struct employee
{
  int         id;
  std::string name;
  int         ssnumber;

  employee(int id,const std::string& name,int ssnumber):
    id(id),name(name),ssnumber(ssnumber){}

  bool operator<(const employee& e)const{return id<e.id;}
};

typedef multi_index_container<
  employee,
  indexed_by<
    // sort by employee::operator<
    ordered_unique<identity<employee> >,
    
    // sort by less<string> on name
    ordered_non_unique<member<employee,std::string,&employee::name> >,
    
    // hashed on ssnumber
    hashed_unique<member<employee,int,&employee::ssnumber> >
  >
> employee_set

Note that the hashed index does not guarantee any particular ordering of the elements: so, for instance, the user cannot efficiently query the employees whose SSN is greater than a given number. Consider these restrictions when determining whether a hashed index is preferred over an ordered one.

If familiar with non-standard hash_sets provided by some compiler vendors, then learning to use hashed indices should be straightforward. However, the interface of hashed indices is modeled after the specification for unordered associative containers by the C++ Standard Library Technical Report (TR1), which differs in some significant aspects from existing pre-standard implementations:

Check the reference for a complete specification of the interface of hashed indices, and example 8 and example 9 for practical applications.

Unique and non-unique variants

Just like ordered indices, hashed indices have unique and non-unique variants, selected with the specifiers hashed_unique and hashed_non_unique, respectively. In the latter case, elements with equivalent keys are kept together and can be jointly retrieved by means of the equal_range member function.

Specification

Hashed indices specifiers have two alternative syntaxes, depending on whether tags are provided or not:

(hashed_unique | hashed_non_unique)
  <[(tag)[,(key extractor)[,(hash function)[,(equality predicate)]]]]>

(hashed_unique | hashed_non_unique)
  <[(key extractor)[,(hash function)[,(equality predicate)]]]>

The key extractor parameter works in exactly the same way as for ordered indices; lookup, insertion, etc., are based on the key returned by the extractor rather than the whole element.

The hash function is the very core of the fast lookup capabilities of this type of indices: a hasher is just a Unary Function returning an std::size_t value for any given key. In general, it is impossible that every key map to a different hash value, for the space of keys can be greater than the number of permissible hash codes: what makes for a good hasher is that the probability of a collision (two different keys with the same hash value) is as close to zero as possible. This is a statistical property depending on the typical distribution of keys in a given application, so it is not feasible to have a general-purpose hash function with excellent results in every possible scenario; the default value for this parameter uses Boost.Hash, which often provides good enough results.

The equality predicate is used to determine whether two keys are to be treated as the same. The default value std::equal_to<KeyFromValue::result_type> is in most cases exactly what is needed, so the user very rarely provides an own predicate. Note that hashed indices require that two equivalent keys have the same hash value, which in practice reduces the freedom in choosing an equality predicate.

Lookup

The lookup interface of hashed indices consists in member functions find, count and equal_range. Note that lower_bound and upper_bound are not provided, as there is no intrinsic ordering of keys in this type of indices.

Just as with ordered indices, these member functions take keys as their search arguments, rather than entire objects. Remember that ordered indices lookup operations are further augmented to accept compatible keys, which can roughly be regarded as "subkeys". For hashed indices, a concept of compatible key is also supported, though its usefulness is much more limited: basically, a compatible key is an object which is entirely equivalent to a native object of key_type value, though maybe with a different internal representation:

// US SSN numbering scheme
struct ssn
{
  ssn(int area_no,int group_no,int serial_no):
    area_no(area_no),group_no(group_no),serial_no(serial_no)
  {}

  int to_int()const
  {
    return serial_no+10000*group_no+1000000*area_no;
  }

private:
  int area_no;
  int group_no;
  int serial_no;
};

// interoperability with SSNs in raw int form

struct ssn_equal
{
  bool operator()(const ssn& x,int y)const
  {
    return x.to_int()==y;
  }

  bool operator()(int x,const ssn& y)const
  {
    return x==y.to_int();
  }
};

struct ssn_hash
{
  std::size_t operator()(const ssn& x)const
  {
    return boost::hash<int>()(x.to_int());
  }

  std::size_t operator()(int x)const
  {
    return boost::hash<int>()(x);
  }
};

typedef employee_set::nth_index<2>::type employee_set_by_ssn;

employee_set         es;
employee_set_by_ssn& ssn_index=es.get<2>();
...
// find an employee by ssn
employee e=*(ssn_index.find(ssn(12,1005,20678),ssn_hash(),ssn_equal()));

In the example, provided a hash functor ssn_hash and an equality predicate ssn_equal allowing for interoperability between ssn objects and the raw ints stored as SSNs in employee_set.

By far, the most useful application of compatible keys in the context of hashed indices lies in the fact that they allow for seamless usage of composite keys.

Updating

Hashed indices have replace, modify and modify_key member functions, with the same functionality as in ordered indices.

Guarantees on iterator validity and exception safety

Due to the internal constraints imposed by the Boost.MultiIndex framework, hashed indices provide guarantees on iterator validity and exception safety that are actually stronger than required by the C++ Standard Library Technical Report (TR1) with respect to unordered associative containers:

In general, these stronger guarantees play in favor of the user's convenience, specially that which refers to iterator stability. A (hopefully minimal) degradation in performance might result in exchange for these commodities.


Random access indices

Random access indices offer the same kind of functionality as sequenced indices, with the extra advantages that their iterators are random access, and operator[] and at() are provided for accessing elements based on their position in the index. Let us rewrite a container used in a previous example, using random access instead of sequenced indices:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/identity.hpp>

// text container with fast lookup based on a random access index
typedef multi_index_container<
  std::string,
  indexed_by<
    random_access<>,
    ordered_non_unique<identity<std::string> >
  >
> text_container;

// global text container object
text_container tc;

Random access capabilities allow the user to write a code like the following:

void print_page(std::size_t page_num)
{
  static const std::size_t words_per_page=50;

  std::size_t pos0=std::min(tc.size(),page_num*words_per_page);
  std::size_t pos1=std::min(tc.size(),pos0+words_per_page);

  // note random access iterators can be added offsets 
  std::copy(
    tc.begin()+pos0,tc.begin()+pos1,
    std::ostream_iterator<std::string>(std::cout));
}

void print_random_word()
{
  std::cout<<tc[rand()%tc.size()];
}

This added flexibility comes at a price: insertions and deletions at positions other than the end of the index have linear complexity, whereas these operations are constant time for sequenced indices. This situation is reminiscent of the differences in complexity behavior between std::list and std::vector: in the case of random access indices, however, insertions and deletions never incur any element copying, so the actual performance of these operations can be acceptable, despite the theoretical disadvantage with respect to sequenced indices.

Example 10 and example 11 in the examples section put random access indices in practice.

Specification

Random access indices are specified with the random_access construct, where the tag parameter is, as usual, optional:

random_access<[(tag)]>

Interface

All public functions offered by sequenced indices are also provided by random access indices, so that the latter can act as a drop-in replacement of the former (save with respect to their complexity bounds, as explained above). Besides, random access indices have operator[] and at() for positional access to the elements, and member functions capacity and reserve that control internal reallocation in a similar manner as the homonym facilities in std::vector. Check the reference for details.

Comparison with std::vector

It is tempting to see random access indices as an analogue of std::vector for use in Boost.MultiIndex, but this metaphor can be misleading, as both constructs, though similar in many respects, show important semantic differences. An advantage of random access indices is that their iterators, as well as references to their elements, are stable, that is, they remain valid after any insertions or deletions. On the other hand, random access indices have several disadvantages with respect to std::vectors:

The latter shortcoming can be partially remedied by means of the rearranging interface these indices provide.

In general, it is more instructive to regard random access indices as a variation of sequenced indices providing random access semantics, instead of insisting on the std::vector analogy.


Index rearranging

By design, index elements are immutable, i.e. iterators only grant const access to them, and only through the provided updating interface (replace, modify and modify_key) can the elements be modified. This restriction is set up so that the internal invariants of key-based indices are not broken (for instance, ascending order traversal in ordered indices), but induces important limitations in non key-based indices:

typedef multi_index_container<
  int,
  indexed_by<
    random_access<>,
    ordered_unique<identity<int> >
  >
> container;	

container c;
...
// compiler error: assignment to read-only objects
std::random_shuffle(c.begin(),c.end());

What is unfortunate about the previous example is that the operation performed by std::random_shuffle is potentially compatible with multi_index_container invariants, as its result can be described by a permutation of the elements in the random access index with no actual modifications to the elements themselves. There are many more examples of such compatible algorithms in the C++ standard library, like for instance all sorting and partition functions.

Sequenced and random access indices provide a means to take advantage of such external algorithms. In order to introduce this facility we need a preliminary concept: a view of an index is defined as some iterator range [first,last) over the elements of the index such that all its elements are contained in the range exactly once. Continuing with the example, apply std::random_suffle on an ad hoc view obtained from the container:

// note that the elements of the view are not copies of the elements
// in c, but references to them
std::vector<boost::reference_wrapper<const int> > v;
BOOST_FOREACH(const int& i,c)v.push_back(boost::cref(i));

// this compiles OK, as reference_wrappers are assignable
std::random_shuffle(v.begin(),v.end());

Elements of v are reference_wrappers (from Boost.Ref) to the actual elements in the multi-index container. These objects still do not allow modification of the referenced entities, but they are Assignable, which is the only requirement std::random_suffle imposes. Once the desired rearrange is stored in the view, transfer it to the container with

c.rearrange(v.begin());

rearrange accepts an input iterator signaling the beginning of the external view (and end iterator is not needed since the length of the view is the same as that of the index) and internally relocates the elements of the index so that their traversal order matches the view. Albeit with some circumventions, rearrange allows for the application of a varied range of algorithms to non key-based indices. Please note that the view concept is very general, and in no way tied to the particular implementation example shown above. For instance, indices of a multi_index_container are indeed views with respect to its non key-based indices:

// rearrange as index #1 (ascending order)
c.rearrange(c.get<1>().begin());

// rearrange in descending order
c.rearrange(c.get<1>().rbegin());

The only important requirement imposed on views is that they must be free, i.e. they are not affected by relocations on the base index: thus, rearrange does not accept the following:

// undefined behavior: [rbegin(),rend()) is not free with respect
// to the base index
c.rearrange(c.rbegin());

The view concept is defined in detail in the reference. See example 11 in the examples section for a demonstration of use of rearrange.


Ordered indices node compression

Ordered indices are implemented by means of a data structure known as a red-black tree. Nodes of a red-black tree contain pointers to the parent and the two children nodes, plus a 1-bit field referred to as the node color (hence the name of the structure). Due to alignment issues, on most architectures the color field occupies one entire word, that is, 4 bytes in 32-bit systems and 8 bytes in 64-bit environments. This waste of space can be avoided by embedding the color bit inside one of the node pointers, provided not all the bits of the pointer representation contain useful information: this is precisely the case in many architectures where such nodes are aligned to even addresses, which implies that the least significant bit of the address must always be zero.

Boost.MultiIndex ordered indices implement this type of node compression whenever applicable. As compared with common implementations of the STL container std::set, node compression can result in a reduction of header overload by 25% (from 16 to 12 bytes on typical 32-bit architectures, and from 32 to 24 bytes on 64-bit systems). The impact on performance of this optimization has been checked to be negligible for moderately sized containers, whereas containers with many elements (hundreds of thousands or more) perform faster with this optimization, most likely due to L1 and L2 cache effects.

Node compression can be disabled by globally setting the macro BOOST_MULTI_INDEX_DISABLE_COMPRESSED_ORDERED_INDEX_NODES.


© Copyright 2003-2006 Joaquín M López Muñoz. Distributed under the Boost Software License, Version 1.0. (See accompanying file License.html or copy at http://www.boost.org/LICENSE_1_0.txt)