====================================================
ATRIP: An MPI-asynchronous implementation of CCSD(T)
====================================================


.. contents::


1 Introduction
--------------

The algorithm uses two main data types, the ``Slice`` and the
``SliceUnion`` as a container and resource manager of the ``Slice``.

2 The slice
-----------

The following section introduces the idea of a slice.

2.1 Introduction
~~~~~~~~~~~~~~~~

A slice is the concept of a subset of values of a given tensor.
As an example, for the doubles amplitudes :math:`T^{ab}_{ij}`, one need two kinds of objects:

- the object :math:`\mathsf{T}(a)^b_{ij}` which for every :math:`a` gets assigned the
  tensor :math:`T^{ab}_{ij}` of size :math:`N_\mathrm{o}^2 \times N_\mathrm{v}`

- the object :math:`\mathsf{T}(a,b)_{ij}` which for every pair of :math:`a, b`
  corresponds the :math:`N_\mathrm{o}^2`-sized tensor :math:`T^{ab}_{ij}`.

2.2 Location
~~~~~~~~~~~~

Every slice set, for instance,


.. math::

    S_k = \left\{
        a \mapsto \mathsf{T}(a)^{b}_{ij}
        \mid
        a \in A_k
        \right\}


where :math:`A_k` is some subset of
:math:`\mathsf{N}_\mathrm{v}`,
gets stored in some rank :math:`k`.
In general however, the number of elements in :math:`A_k` can be bigger
than the number of processes :math:`n_p`. Therefore in order to uniquely
indentify a given slice in :math:`S_k` we need two identifiers,
the rank :math:`k`, which tells us in which core's memory the slice is
allocated, and an additional tag which we will call ``source``.

The datatype that simply models this state of affairs
is therefore a simple structure:

.. code:: c++

    struct Location { size_t rank; size_t source; };

2.3 Type
~~~~~~~~

Due to the permutation operators in the equations
it is noticeable that for every one dimensional
slice and triple :math:`(a,b,c)`


.. math::

    a \mapsto \mathsf{t}(a)


one needs at the same time
:math:`\mathsf{t}(a)`,
:math:`\mathsf{t}(b)` and
:math:`\mathsf{t}(c)`.
For two dimensional slices, i.e., slices of the form


.. math::

    (a,b) \mapsto \mathsf{t}(a,b)


one needs in the equations the slices
:math:`\mathsf{t}(a,b)`,
:math:`\mathsf{t}(b,c)` and
:math:`\mathsf{t}(a,c)`.
In addition, in the case of diagrams where
the integral :math:`V^{ab}_{ci}` appears,
we additionaly need the permuted slices
from before, i.e.
:math:`\mathsf{t}(b,a)`,
:math:`\mathsf{t}(c,b)` and
:math:`\mathsf{t}(c,a)`.

This means, every slice has associated with it
a type which denotes which permutation it is.


.. code:: c++

    enum Type
      { A = 10
      , B
      , C
      // Two-parameter slices
      , AB = 20
      , BC
      , AC
      // for abci and the doubles
      , CB
      , BA
      , CA
      // The non-typed slice
      , Blank = 404
      };

2.4 State
~~~~~~~~~

Every slice can be in different states and every state
denotes which function the slice is going to provide
and which relations they have between themselves.

Fetch
    A slice is in state ``Fetch`` when it
    has a valid data pointer that ****must**** be written to.
    A ``Fetch`` slice should not live very long, this means
    that after the database send and receive phase,
    ``Fetch`` slices should be changed into ``Dispatched``
    in order to start the process of writing to the
    data pointer from some other rank.

Dispatched
    A ``Dispatched`` slice indicates that at some point
    send and receive MPI calls have been dispatched
    in order to get the data.
    However, the calls have just been dispatched and there
    is no warranty for the data to be there, for that,
    the slice must be unwrapped.

Ready
    ``Ready`` means that the data pointer can be read from
    directly.

SelfSufficient
    A slice is ``SelfSufficient`` when its contents are located
    in the same rank that it lives, so that it does not have to
    fetch from no other rank.
    This is important in order to handle the data pointers correctly
    and in order to save calls to MPI receive and send functions.

Recycled
    ``Recycled`` means that this slice gets its data pointer from another
    slice, so it should not be written to

Acceptor
    ``Acceptor`` means that the slice can accept a new slice, it is
    the counterpart of the ``Blank`` type, but for states

Again the implementation is a simple enum type.

.. code:: c++

    enum State {
      Fetch = 0,
      Dispatched = 2,
      Ready = 1,
      SelfSufficient = 911,
      Recycled = 123,
      Acceptor = 405
    };

2.5 Data pointer
~~~~~~~~~~~~~~~~

The data pointer type is an abstraction of where the data
of the slice are.
In a CPU architecture it will simply be ``F*``,
otherwise it will be a pointer to a GPU address in the case
of a GPU architecture.

.. code:: c++

    #pragma once
    #include <atrip/Complex.hpp>
    #include <atrip/Atrip.hpp>

    namespace atrip {

    template <typename F>
    struct DataField;

    template <>
    struct DataField<double> {
      using type = double;
    };

    #if defined(HAVE_CUDA)

    template <typename F>
    using DataPtr = CUdeviceptr;
    #define DataNullPtr 0x00

    template <>
    struct DataField<Complex> {
      using type = cuDoubleComplex;
    };


    #else

    template <typename F>
    using DataPtr = F*;
    #define DataNullPtr nullptr

    template <>
    struct DataField<Complex> {
      using type = Complex;
    };

    #endif


    template <typename F>
    using DataFieldType = typename DataField<F>::type;

    }

2.6 The Info structure
~~~~~~~~~~~~~~~~~~~~~~

Every slice has an information structure associated with it
that keeps track of the ****variable**** type, state and so on.

.. code:: c++

    struct Info {
      // which part of a,b,c the slice holds
      PartialTuple tuple;
      // The type of slice for the user to retrieve the correct one
      Type type;
      // What is the state of the slice
      State state;
      // Where the slice is to be retrieved
      Location from;
      // If the data are actually to be found in this other slice
      Type recycling;

      Info() : tuple{0,0}
              , type{Blank}
              , state{Acceptor}
              , from{0,0}
              , recycling{Blank}
              {}
    };

    using Ty_x_Tu = std::pair< Type, PartialTuple >;

2.7 Name
~~~~~~~~

CCSD(T) needs in this algorithm 5 types of tensor slices,
namely
:math:`V^{ij}_{ka}`, :math:`V^{ab}_{ci}`,
:math:`V^{ab}_{ij}`
and two times :math:`T^{ab}_{ij}`.
The reason why we need two times the doubles
amplitudes is because in the doubles contribution
to the energy, the :math:`T` amplidutes will be sliced
through one parameter for the particle contribution
and through two parameters for the hole contribution.

.. code:: c++

    enum Name
      { TA    = 100
      , VIJKA = 101
      , VABCI = 200
      , TABIJ = 201
      , VABIJ = 202
      };

2.8 Database
~~~~~~~~~~~~

The database is a simple representation of the slices of a slice union.
Every element of the database is given by the name of the tensor it
represents and the internal information structure.

.. code:: c++

    struct LocalDatabaseElement {
      Slice<F>::Name name;
      Slice<F>::Info info;
    };

A local database (of a given rank) and the global database is thus simply
a vector of these elements.

.. code:: c++

    using LocalDatabase = std::vector<LocalDatabaseElement>;
    using Database = LocalDatabase;

2.9 MPI Types
~~~~~~~~~~~~~

.. code:: c++

    struct mpi {

      static MPI_Datatype vector(size_t n, MPI_Datatype const& DT) {
        MPI_Datatype dt;
        MPI_Type_vector(n, 1, 1, DT, &dt);
        MPI_Type_commit(&dt);
        return dt;
      }

      static MPI_Datatype sliceLocation () {
        constexpr int n = 2;
        // create a sliceLocation to measure in the current architecture
        // the packing of the struct
        Slice<F>::Location measure;
        MPI_Datatype dt;
        const std::vector<int> lengths(n, 1);
        const MPI_Datatype types[n] = {usizeDt(), usizeDt()};

        static_assert(sizeof(Slice<F>::Location) == 2 * sizeof(size_t),
                      "The Location packing is wrong in your compiler");

        // measure the displacements in the struct
        size_t j = 0;
        MPI_Aint base_address, displacements[n];
        MPI_Get_address(&measure,        &base_address);
        MPI_Get_address(&measure.rank,   &displacements[j++]);
        MPI_Get_address(&measure.source, &displacements[j++]);
        for (size_t i = 0; i < n; i++)
          displacements[i] = MPI_Aint_diff(displacements[i], base_address);

        MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
        MPI_Type_commit(&dt);
        return dt;
      }

      static MPI_Datatype usizeDt() { return MPI_UINT64_T; }

      static MPI_Datatype sliceInfo () {
        constexpr int n = 5;
        MPI_Datatype dt;
        Slice<F>::Info measure;
        const std::vector<int> lengths(n, 1);
        const MPI_Datatype types[n]
          = { vector(2, usizeDt())
            , vector(sizeof(enum Type), MPI_CHAR)
            , vector(sizeof(enum State), MPI_CHAR)
            , sliceLocation()
            , vector(sizeof(enum Type), MPI_CHAR)
            // TODO: Why this does not work on intel mpi?
            /*, MPI_UINT64_T*/
            };

        static_assert(sizeof(enum Type)  == 4, "Enum type not 4 bytes long");
        static_assert(sizeof(enum State) == 4, "Enum State not 4 bytes long");
        static_assert(sizeof(enum Name)  == 4, "Enum Name not 4 bytes long");

        // create the displacements from the info measurement struct
        size_t j = 0;
        MPI_Aint base_address, displacements[n];
        MPI_Get_address(&measure,             &base_address);
        MPI_Get_address(&measure.tuple[0],    &displacements[j++]);
        MPI_Get_address(&measure.type,        &displacements[j++]);
        MPI_Get_address(&measure.state,       &displacements[j++]);
        MPI_Get_address(&measure.from,        &displacements[j++]);
        MPI_Get_address(&measure.recycling,   &displacements[j++]);
        for (size_t i = 0; i < n; i++)
          displacements[i] = MPI_Aint_diff(displacements[i], base_address);

        MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
        MPI_Type_commit(&dt);
        return dt;
      }

      static MPI_Datatype localDatabaseElement () {
        constexpr int n = 2;
        MPI_Datatype dt;
        LocalDatabaseElement measure;
        const std::vector<int> lengths(n, 1);
        const MPI_Datatype types[n]
          = { vector(sizeof(enum Name), MPI_CHAR)
            , sliceInfo()
            };

        // measure the displacements in the struct
        size_t j = 0;
        MPI_Aint base_address, displacements[n];
        MPI_Get_address(&measure,      &base_address);
        MPI_Get_address(&measure.name, &displacements[j++]);
        MPI_Get_address(&measure.info, &displacements[j++]);
        for (size_t i = 0; i < n; i++)
          displacements[i] = MPI_Aint_diff(displacements[i], base_address);

        static_assert( sizeof(LocalDatabaseElement) == sizeof(measure)
                     , "Measure has bad size");

        MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
        MPI_Type_commit(&dt);
        return vector(sizeof(LocalDatabaseElement), MPI_CHAR);
        // TODO: write tests in order to know if this works
        return dt;
      }

    };

2.10 Static utilities
~~~~~~~~~~~~~~~~~~~~~

This section presents some functions which are useful to work with
slices and are inside the namespace created by the slice struct.


The function ``subtupleBySlice`` gives to every ``Slice::Type``
its meaning in terms of the triples :math:`(a,b,c)`.

Notice that since in general the relation
:math:`a < b < c` holds (in our implementation), the case
of one-dimensional parametrizations ``A``, ``B`` and ``C`` is well
defined.

The function should only throw if there is an implementation
error where the ``Slice::Type`` enum has been expanded and this
function has not been updated accordingly.

.. code:: c++

    static
    PartialTuple subtupleBySlice(ABCTuple abc, Type sliceType) {
      switch (sliceType) {
        case AB: return {abc[0], abc[1]};
        case BC: return {abc[1], abc[2]};
        case AC: return {abc[0], abc[2]};
        case CB: return {abc[2], abc[1]};
        case BA: return {abc[1], abc[0]};
        case CA: return {abc[2], abc[0]};
        case  A: return {abc[0], 0};
        case  B: return {abc[1], 0};
        case  C: return {abc[2], 0};
        default: throw "Switch statement not exhaustive!";
      }
    }

In the context of cleaning up slices during the main loop,
it is important to check if a given slice has some slices
referencing to it in quality of recycled slices.

This function should therefore return a vector of pointers
of slices referencing to the given slice's info, when
the length of the vector is zero, then there are no dangling
links.

.. code:: c++

    static std::vector<Slice<F>*> hasRecycledReferencingToIt
      ( std::vector<Slice<F>> &slices
      , Info const& info
      ) {
      std::vector<Slice<F>*> result;

      for (auto& s: slices)
        if (  s.info.recycling == info.type
           && s.info.tuple == info.tuple
           && s.info.state == Recycled
           ) result.push_back(&s);

      return result;
    }

The rest of the coming functions are utilities in order to find in a vector
of slices a given slice by reference. Mostly they are merely convenience
wrappers to the standard library function ``std::find_if``.

They are named as ``find<...>``, where ``<...>`` represents some condition
and must always return a reference to the found slice, i.e., ``Slice&``.
``Atrip`` relies on these functions to find the sought for slices,
therefore these functions will throw a ``std::domain_error`` if the
given slice could not be found.

.. code:: c++

    static Slice<F>& findOneByType(std::vector<Slice<F>> &slices, Slice<F>::Type type) {
        const auto sliceIt
          = std::find_if(slices.begin(), slices.end(),
                         [&type](Slice<F> const& s) {
                           return type == s.info.type;
                         });
        WITH_CRAZY_DEBUG
        WITH_RANK
          << "\t__ looking for " << type << "\n";
        if (sliceIt == slices.end())
          throw std::domain_error("Slice by type not found!");
        return *sliceIt;
    }

.. code:: c++

    static Slice<F>&
    findRecycledSource (std::vector<Slice<F>> &slices, Slice<F>::Info info) {
      const auto sliceIt
        = std::find_if(slices.begin(), slices.end(),
                       [&info](Slice<F> const& s) {
                         return info.recycling == s.info.type
                             && info.tuple == s.info.tuple
                             && State::Recycled != s.info.state
                             ;
                       });

      WITH_CRAZY_DEBUG
      WITH_RANK << "__slice__:find: recycling source of "
                << pretty_print(info) << "\n";
      if (sliceIt == slices.end())
        throw std::domain_error( "Slice not found: "
                               + pretty_print(info)
                               + " rank: "
                               + pretty_print(Atrip::rank)
                               );
      WITH_RANK << "__slice__:find: " << pretty_print(sliceIt->info) << "\n";
      return *sliceIt;
    }

.. code:: c++

    static Slice<F>& findByTypeAbc
      ( std::vector<Slice<F>> &slices
      , Slice<F>::Type type
      , ABCTuple const& abc
      ) {
        const auto tuple = Slice<F>::subtupleBySlice(abc, type);
        const auto sliceIt
          = std::find_if(slices.begin(), slices.end(),
                         [&type, &tuple](Slice<F> const& s) {
                           return type == s.info.type
                               && tuple == s.info.tuple
                               ;
                         });
        WITH_CRAZY_DEBUG
        WITH_RANK << "__slice__:find:" << type << " and tuple "
                  << pretty_print(tuple)
                  << "\n";
        if (sliceIt == slices.end())
          throw std::domain_error( "Slice not found: "
                                 + pretty_print(tuple)
                                 + ", "
                                 + pretty_print(type)
                                 + " rank: "
                                 + pretty_print(Atrip::rank)
                                 );
        return *sliceIt;
    }

.. code:: c++

    static Slice<F>& findByInfo(std::vector<Slice<F>> &slices,
                             Slice<F>::Info const& info) {
      const auto sliceIt
        = std::find_if(slices.begin(), slices.end(),
                       [&info](Slice<F> const& s) {
                         // TODO: maybe implement comparison in Info struct
                         return info.type == s.info.type
                             && info.state == s.info.state
                             && info.tuple == s.info.tuple
                             && info.from.rank == s.info.from.rank
                             && info.from.source == s.info.from.source
                              ;
                       });
      WITH_CRAZY_DEBUG
      WITH_RANK << "__slice__:find:looking for " << pretty_print(info) << "\n";
      if (sliceIt == slices.end())
        throw std::domain_error( "Slice by info not found: "
                               + pretty_print(info));
      return *sliceIt;
    }

2.11 Attributes
~~~~~~~~~~~~~~~

A slice object does not own data, it is just a container
or a pointer to data together with additional bookkeeping facilities.

It includes an info structure with the information about the slice,
``Type``, ``State`` etc, which will be later communicated to other ranks.

.. code:: c++

    Info info;

A pointer to data is also necessary for the ``Slice`` but not necessary
to be communicated to other ranks. The ``Slice`` should never allocate
or deallocate itself the pointer.

.. code:: c++

        DataPtr<F> data;
    #if defined(HAVE_CUDA)
        F* mpi_data;
    #endif

An ``MPI_Request`` handle is also included so that the slices that are
to receive data through MPI can know which request they belong to.

.. code:: c++

    MPI_Request request;

For practical purposes in MPI calls, the number of elements in ``data`` is also included.

.. code:: c++

    const size_t size;

2.12 Member functions
~~~~~~~~~~~~~~~~~~~~~

It is important to note that a ready slice should not be recycled from
any other slice, so that it can have access by itself to the data.

.. code:: c++

    void markReady() noexcept {
      info.state = Ready;
      info.recycling = Blank;
    }


The following function asks wether or not
the slice has effectively been unwrapped or not,
i.e., wether or not the data are accessible and already
there. This can only happen in two ways, either
is the slice ``Ready`` or it is ``SelfSufficient``,
i.e., the data pointed to was pre-distributed to the current node.

.. code:: c++

    bool isUnwrapped() const noexcept {
      return info.state == Ready
          || info.state == SelfSufficient
          ;
    }

The function ``isUnwrappable`` answers which slices can be unwrapped
potentially. Unwrapped slices can be unwrapped again idempotentially.
Also ``Recycled`` slices can be unwrapped, i.e. the slices pointed to by them
will be unwrapped.
The only other possibility is that the slice has been dispatched
in the past and can be unwrapped. The case where the state
is ``Dispatched`` is the canonical intuitive case where a real process
of unwrapping, i.e. waiting for the data to get through the network,
is done.

.. code:: c++

    bool isUnwrappable() const noexcept {
      return isUnwrapped()
          || info.state == Recycled
          || info.state == Dispatched
          ;
    }

    inline bool isDirectlyFetchable() const noexcept {
      return info.state == Ready || info.state == Dispatched;
    }

    void free() noexcept {
      info.tuple      = {0, 0};
      info.type       = Blank;
      info.state      = Acceptor;
      info.from       = {0, 0};
      info.recycling  = Blank;
      data            = DataNullPtr;
    }

    inline bool isFree() const noexcept {
      return info.tuple       == PartialTuple{0, 0}
          && info.type        == Blank
          && info.state       == Acceptor
          && info.from.rank   == 0
          && info.from.source == 0
          && info.recycling   == Blank
          && data             == DataNullPtr
           ;
    }

The function ``isRecylable`` answers the question, which slices can be recycled.

A slice can only be recycled if it is Fetch or Ready and has
a valid datapointer.

In particular, SelfSufficient are not recyclable, since it is easier
just to create a SelfSufficient slice than deal with data dependencies.

Furthermore, a recycled slice is not recyclable, if this is the case
then it is either bad design or a bug.

.. code:: c++

    inline bool isRecyclable() const noexcept {
      return (  info.state == Dispatched
             || info.state == Ready
             || info.state == Fetch
             )
          && hasValidDataPointer()
          ;
    }


The function ``hasValidDataPointer`` describes if a slice has a valid
data pointer.

This is important to know if the slice has some data to it, also
some structural checks are done, so that it should not be ``Acceptor``
or ``Blank``, if this is the case then it is a bug.

.. code:: c++

    inline bool hasValidDataPointer() const noexcept {
      return data       != DataNullPtr
          && info.state != Acceptor
          && info.type  != Blank
          ;
    }


The function
``unwrapAndMarkReady``
calls the low-level MPI functions
in order to wait whenever the state of the slice is correct.
The main behaviour of the function should

- return if state is ``Ready``, since then there is nothing to be done.

- throw if the state is not ``Dispatched``, only a dispatched slice
  can be unwrapped through MPI.

- throw if an MPI error happens.


.. code:: c++

        void unwrapAndMarkReady() {
          if (info.state == Ready) return;
          if (info.state != Dispatched)
            throw
              std::domain_error("Can't unwrap a non-ready, non-dispatched slice!");
          markReady();
          MPI_Status status;
    #ifdef HAVE_OCD
            WITH_RANK << "__slice__:mpi: waiting " << "\n";
    #endif
          const int errorCode = MPI_Wait(&request, &status);
          if (MPI_SUCCESS != MPI_Request_free(&request))
            throw "Error freeing MPI request";
          if (errorCode != MPI_SUCCESS)
            throw "MPI ERROR HAPPENED....";

    #if defined(HAVE_CUDA)
          // copy the retrieved mpi data to the device
          cuMemcpyHtoD(data, (void*)mpi_data, sizeof(F) * size);
          std::free(mpi_data);
    #endif

    #ifdef HAVE_OCD
          char errorString[MPI_MAX_ERROR_STRING];
          int errorSize;
          MPI_Error_string(errorCode, errorString, &errorSize);

          WITH_RANK << "__slice__:mpi: status "
                    << "{ .source="    << status.MPI_SOURCE
                    << ", .tag="       << status.MPI_TAG
                    << ", .error="     << status.MPI_ERROR
                    << ", .errCode="   << errorCode
                    << ", .err="       << errorString
                    << " }"
                    << "\n";
    #endif
        }

3 Utils
-------

This section presents some utilities

3.1 Pretty printing
~~~~~~~~~~~~~~~~~~~

The pretty printing uses the `dbg-macro <https://github.com/sharkdp/dbg-macro>`_ package.

.. code:: c++

    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wunused-parameter"
      template <typename T>
      std::string pretty_print(T&& value) {
        std::stringstream stream;
    #if ATRIP_DEBUG > 2
        dbg::pretty_print(stream, std::forward<T>(value));
    #endif
        return stream.str();
      }
    #pragma GCC diagnostic pop

3.2 Chrono
~~~~~~~~~~

The chrono is just a simple wrapper for a high resolution clock
that can be found in the ``std::chrono`` namespace of the standard library.

.. code:: c++

    #define WITH_CHRONO(__chrono_name, ...)         \
      Atrip::chrono[__chrono_name].start();         \
      __VA_ARGS__                                   \
      Atrip::chrono[__chrono_name].stop();

    struct Timer {
      using Clock = std::chrono::high_resolution_clock;
      using Event = std::chrono::time_point<Clock>;
      std::chrono::duration<double> duration;
      Event _start;
      inline void start() noexcept { _start = Clock::now(); }
      inline void stop() noexcept { duration += Clock::now() - _start; }
      inline void clear() noexcept { duration *= 0; }
      inline double count() const noexcept { return duration.count(); }
    };
    using Timings = std::map<std::string, Timer>;

4 The rank mapping
------------------

This section introduces the concept of rank mapping,
which defines how slices will be allocated to every
rank.

.. code:: c++

    #pragma once

    #include <vector>
    #include <algorithm>

    #include <atrip/Slice.hpp>
    #include <atrip/Tuples.hpp>

    namespace atrip {

      template <typename F=double>
      struct RankMap {

        static bool RANK_ROUND_ROBIN;
        std::vector<size_t> const lengths;
        size_t const np, size;
        ClusterInfo const clusterInfo;

        RankMap(std::vector<size_t> lens, size_t np_, MPI_Comm comm)
          : lengths(lens)
          , np(np_)
          , size(std::accumulate(lengths.begin(), lengths.end(),
                                1UL, std::multiplies<size_t>()))
          , clusterInfo(getClusterInfo(comm))
        { assert(lengths.size() <= 2); }

        size_t find(typename Slice<F>::Location const& p) const noexcept {
          if (RANK_ROUND_ROBIN) {
            return p.source * np + p.rank;
          } else {
            const size_t
              rankPosition = p.source * clusterInfo.ranksPerNode
                           + clusterInfo.rankInfos[p.rank].localRank
                           ;
            return rankPosition * clusterInfo.nNodes
                 + clusterInfo.rankInfos[p.rank].nodeId
                 ;
          }
        }

        size_t nSources() const noexcept {
          return size / np + size_t(size % np != 0);
        }


        bool isPaddingRank(size_t rank) const noexcept {
          return size % np == 0
              ? false
              : rank > (size % np - 1)
              ;
        }

        bool isSourcePadding(const size_t rank, const size_t source)
          const noexcept {
          return source == nSources() && isPaddingRank(rank);
        }

        typename Slice<F>::Location
        find(ABCTuple const& abc, typename Slice<F>::Type sliceType) const {
          // tuple = {11, 8} when abc = {11, 8, 9} and sliceType = AB
          // tuple = {11, 0} when abc = {11, 8, 9} and sliceType = A
          const auto tuple = Slice<F>::subtupleBySlice(abc, sliceType);

          const size_t index
            = tuple[0]
            + tuple[1] * (lengths.size() > 1 ? lengths[0] : 0)
            ;

          size_t rank, source;

          if (RANK_ROUND_ROBIN) {

            rank = index % np;
            source = index / np;

          } else {

            size_t const

              // the node that will be assigned to
                nodeId = index % clusterInfo.nNodes

              // how many times it has been assigned to the node
              , s_n = index / clusterInfo.nNodes

              // which local rank in the node should be
              , localRank = s_n % clusterInfo.ranksPerNode

              // and the local source (how many times we chose this local rank)
              , localSource = s_n / clusterInfo.ranksPerNode
              ;

            // find the localRank-th entry in clusterInfo
            auto const& it =
              std::find_if(clusterInfo.rankInfos.begin(),
                           clusterInfo.rankInfos.end(),
                           [nodeId, localRank](RankInfo const& ri) {
                             return ri.nodeId == nodeId
                                 && ri.localRank == localRank
                                 ;
                           });
            if (it == clusterInfo.rankInfos.end()) {
              throw "FATAL! Error in node distribution of the slices";
            }

            rank = (*it).globalRank;
            source = localSource;

          }

          return
            { rank
            , source
            };
        }

      };

    }

5 The slice union
-----------------

6 Tuples
--------

This section introduces the types for tuples :math:`(a,b,c)`
as well as their distribution to nodes and cores.

6.1 Tuples types
~~~~~~~~~~~~~~~~

The main tuple types are simple type aliases for finite-size arrays.
A tuple is thus simply 3 natural numbers :math:`(a,b,c)`
whereas a partial tuple is a two dimensional subset of these three.

.. code:: c++

    using ABCTuple = std::array<size_t, 3>;
    using PartialTuple = std::array<size_t, 2>;
    using ABCTuples = std::vector<ABCTuple>;

    constexpr ABCTuple FAKE_TUPLE = {0, 0, 0};
    constexpr ABCTuple INVALID_TUPLE = {1, 1, 1};

6.2 Distributing the tuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general it is our task to distribute all the tuples
:math:`(a,b,c)` among the ranks. Every distribution should
make sure to allocate the same amount of tuples to every rank,
padding the list with ``FAKE_TUPLE`` elements as necessary.

The interface that we propose for this is simplye

.. code:: c++

    struct TuplesDistribution {
      virtual ABCTuples getTuples(size_t Nv, MPI_Comm universe) = 0;
      virtual bool tupleIsFake(ABCTuple const& t) { return t == FAKE_TUPLE; }
    };

6.3 Node information
~~~~~~~~~~~~~~~~~~~~

nodeList
    List of hostnames of size :math:`N_n`

nodeInfos
    List of (hostname, local rank Id)
    of size :math:`N_p`, i.e., size of ranks
    where local rank id goes from 0 to 48.


``getNodeNames`` gets the names of the nodes used,
i.e., the size of the resulting vector gives the
number of nodes.

.. code:: c++

    std::vector<std::string> getNodeNames(MPI_Comm comm){
      int rank, np;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &np);

      std::vector<std::string> nodeList(np);
      char nodeName[MPI_MAX_PROCESSOR_NAME];
      char *nodeNames = (char*)malloc(np * MPI_MAX_PROCESSOR_NAME);
      std::vector<int> nameLengths(np)
                     , off(np)
                     ;
      int nameLength;
      MPI_Get_processor_name(nodeName, &nameLength);
      MPI_Allgather(&nameLength,
                    1,
                    MPI_INT,
                    nameLengths.data(),
                    1,
                    MPI_INT,
                    comm);
      for (int i(1); i < np; i++)
        off[i] = off[i-1] + nameLengths[i-1];
      MPI_Allgatherv(nodeName,
                     nameLengths[rank],
                     MPI_BYTE,
                     nodeNames,
                     nameLengths.data(),
                     off.data(),
                     MPI_BYTE,
                     comm);
      for (int i(0); i < np; i++) {
        std::string const s(&nodeNames[off[i]], nameLengths[i]);
        nodeList[i] = s;
      }
      std::free(nodeNames);
      return nodeList;
    }


``getNodeInfos``

.. code:: c++

    struct RankInfo {
      const std::string name;
      const size_t nodeId;
      const size_t globalRank;
      const size_t localRank;
      const size_t ranksPerNode;
    };

    template <typename A>
    A unique(A const &xs) {
      auto result = xs;
      std::sort(std::begin(result), std::end(result));
      auto const& last = std::unique(std::begin(result), std::end(result));
      result.erase(last, std::end(result));
      return result;
    }

    std::vector<RankInfo>
    getNodeInfos(std::vector<string> const& nodeNames) {
      std::vector<RankInfo> result;
      auto const uniqueNames = unique(nodeNames);
      auto const index = [&uniqueNames](std::string const& s) {
        auto const& it = std::find(uniqueNames.begin(), uniqueNames.end(), s);
        return std::distance(uniqueNames.begin(), it);
      };
      std::vector<size_t> localRanks(uniqueNames.size(), 0);
      size_t globalRank = 0;
      for (auto const& name: nodeNames) {
        const size_t nodeId = index(name);
        result.push_back({name,
                          nodeId,
                          globalRank++,
                          localRanks[nodeId]++,
                          (size_t)
                          std::count(nodeNames.begin(),
                                     nodeNames.end(),
                                     name)
                          });
      }
      return result;
    }

    struct ClusterInfo {
      const size_t nNodes, np, ranksPerNode;
      const std::vector<RankInfo> rankInfos;
    };

    ClusterInfo
    getClusterInfo(MPI_Comm comm) {
      auto const names = getNodeNames(comm);
      auto const rankInfos = getNodeInfos(names);

      return ClusterInfo {
        unique(names).size(),
        names.size(),
        rankInfos[0].ranksPerNode,
        rankInfos
      };

    }

6.4 Naive list
~~~~~~~~~~~~~~

The naive implementation of the global tuples list is simple
three for loops creating tuples of the sort
:math:`(a,b,c)` where the following conditions are met at the same time:

- :math:`a \leq b \leq c`

- :math:`a \neq b \land b \neq c`

This means,
:math:`(1, 2, 3) , (1, 1, 3) , (1, 2, 2)`
are acceptable tuples wherease :math:`(2, 1, 1)` and :math:`(1, 1, 1)` are not.


.. code:: c++

    ABCTuples getTuplesList(size_t Nv, size_t rank, size_t np) {

      const size_t
        // total number of tuples for the problem
           n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv

        // all ranks should have the same number of tuples_per_rank
        , tuples_per_rank = n / np + size_t(n % np != 0)

        // start index for the global tuples list
        , start = tuples_per_rank * rank

        // end index for the global tuples list
        , end = tuples_per_rank * (rank + 1)
        ;

      LOG(1,"Atrip") << "tuples_per_rank = " << tuples_per_rank << "\n";
      WITH_RANK << "start, end = " << start << ", " << end << "\n";
      ABCTuples result(tuples_per_rank, FAKE_TUPLE);

      for (size_t a(0), r(0), g(0); a < Nv; a++)
      for (size_t b(a);             b < Nv; b++)
      for (size_t c(b);             c < Nv; c++){
        if ( a == b && b == c ) continue;
        if ( start <= g && g < end) result[r++] = {a, b, c};
        g++;
      }

      return result;

    }

and all tuples would simply be

.. code:: c++

    ABCTuples getAllTuplesList(const size_t Nv) {
      const size_t n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv;
      ABCTuples result(n);

      for (size_t a(0), u(0); a < Nv; a++)
      for (size_t b(a); b < Nv; b++)
      for (size_t c(b); c < Nv; c++){
        if ( a == b && b == c ) continue;
        result[u++] = {a, b, c};
      }

      return result;
    }


With ``getTupleList`` we can easily define a tuple distribution like

.. code:: c++

    struct NaiveDistribution : public TuplesDistribution {
      ABCTuples getTuples(size_t Nv, MPI_Comm universe) override {
        int rank, np;
        MPI_Comm_rank(universe, &rank);
        MPI_Comm_size(universe, &np);
        return getTuplesList(Nv, (size_t)rank, (size_t)np);
      }
    };

6.5 Group and sort list
~~~~~~~~~~~~~~~~~~~~~~~

6.5.1 Utils
^^^^^^^^^^^

.. code:: c++


    // Provides the node on which the slice-element is found
    // Right now we distribute the slices in a round robin fashion
    // over the different nodes (NOTE: not mpi ranks but nodes)
    inline
    size_t isOnNode(size_t tuple, size_t nNodes) { return tuple % nNodes; }


    // return the node (or all nodes) where the elements of this
    // tuple are located
    std::vector<size_t> getTupleNodes(ABCTuple const& t, size_t nNodes) {
      std::vector<size_t>
        nTuple = { isOnNode(t[0], nNodes)
                 , isOnNode(t[1], nNodes)
                 , isOnNode(t[2], nNodes)
                 };
      return unique(nTuple);
    }

    struct Info {
      size_t nNodes;
      size_t nodeId;
    };

6.5.2 Distribution
^^^^^^^^^^^^^^^^^^

wording: home element = element which is located on the given node

1. we distribute the tuples such that each tuple has at least one 'home element'

2. we sort each tuple in a way that the 'home element' are the fastest indices

3. we sort the list of tuples on every node

4. we resort the tuples that for every tuple abc the following holds: a<b<c

.. code:: c++

    ABCTuples specialDistribution(Info const& info, ABCTuples const& allTuples) {

      ABCTuples nodeTuples;
      size_t const nNodes(info.nNodes);

      std::vector<ABCTuples>
          container1d(nNodes)
        , container2d(nNodes * nNodes)
        , container3d(nNodes * nNodes * nNodes)
        ;

      WITH_DBG if (info.nodeId == 0)
        std::cout << "\tGoing through all "
                  << allTuples.size()
                  << " tuples in "
                  << nNodes
                  << " nodes\n";

      // build container-n-d's
      for (auto const& t: allTuples) {
        // one which node(s) are the tuple elements located...
        // put them into the right container
        auto const _nodes = getTupleNodes(t, nNodes);

        switch (_nodes.size()) {
          case 1:
            container1d[_nodes[0]].push_back(t);
            break;
          case 2:
            container2d[ _nodes[0]
                       + _nodes[1] * nNodes
                       ].push_back(t);
            break;
          case 3:
            container3d[ _nodes[0]
                       + _nodes[1] * nNodes
                       + _nodes[2] * nNodes * nNodes
                       ].push_back(t);
            break;
        }

      }

      WITH_DBG if (info.nodeId == 0)
        std::cout << "\tBuilding 1-d containers\n";
      // DISTRIBUTE 1-d containers
      // every tuple which is only located at one node belongs to this node
      {
        auto const& _tuples = container1d[info.nodeId];
        nodeTuples.resize(_tuples.size(), INVALID_TUPLE);
        std::copy(_tuples.begin(), _tuples.end(), nodeTuples.begin());
      }

      WITH_DBG if (info.nodeId == 0)
        std::cout << "\tBuilding 2-d containers\n";
      // DISTRIBUTE 2-d containers
      //the tuples which are located at two nodes are half/half given to these nodes
      for (size_t yx = 0; yx < container2d.size(); yx++) {

        auto const& _tuples = container2d[yx];
          const
        size_t idx = yx % nNodes
             // remeber: yx = idy * nNodes + idx
             , idy = yx / nNodes
             , n_half = _tuples.size() / 2
             , size = nodeTuples.size()
             ;

        size_t nbeg, nend;
        if (info.nodeId == idx) {
          nbeg = 0 * n_half;
          nend = n_half;
        } else if (info.nodeId == idy) {
          nbeg = 1 * n_half;
          nend = _tuples.size();
        } else {
          // either idx or idy is my node
          continue;
        }

        size_t const nextra = nend - nbeg;
        nodeTuples.resize(size + nextra, INVALID_TUPLE);
        std::copy(_tuples.begin() + nbeg,
                  _tuples.begin() + nend,
                  nodeTuples.begin() + size);

      }

      WITH_DBG if (info.nodeId == 0)
        std::cout << "\tBuilding 3-d containers\n";
      // DISTRIBUTE 3-d containers
      for (size_t zyx = 0; zyx < container3d.size(); zyx++) {
        auto const& _tuples = container3d[zyx];

          const
        size_t idx = zyx % nNodes
             , idy = (zyx / nNodes) % nNodes
             // remember: zyx = idx + idy * nNodes + idz * nNodes^2
             , idz = zyx / nNodes / nNodes
             , n_third = _tuples.size() / 3
             , size = nodeTuples.size()
             ;

        size_t nbeg, nend;
        if (info.nodeId == idx) {
          nbeg = 0 * n_third;
          nend = 1 * n_third;
        } else if (info.nodeId == idy) {
          nbeg = 1 * n_third;
          nend = 2 * n_third;
        } else if (info.nodeId == idz) {
          nbeg = 2 * n_third;
          nend = _tuples.size();
        } else {
          // either idx or idy or idz is my node
          continue;
        }

        size_t const nextra = nend - nbeg;
        nodeTuples.resize(size + nextra, INVALID_TUPLE);
        std::copy(_tuples.begin() + nbeg,
                  _tuples.begin() + nend,
                  nodeTuples.begin() + size);

      }


      WITH_DBG if (info.nodeId == 0) std::cout << "\tswapping tuples...\n";
      /*
       *  sort part of group-and-sort algorithm
       *  every tuple on a given node is sorted in a way that
       *  the 'home elements' are the fastest index.
       *  1:yyy 2:yyn(x) 3:yny(x) 4:ynn(x) 5:nyy 6:nyn(x) 7:nny 8:nnn
       */
      for (auto &nt: nodeTuples){
        if ( isOnNode(nt[0], nNodes) == info.nodeId ){ // 1234
          if ( isOnNode(nt[2], nNodes) != info.nodeId ){ // 24
            size_t const x(nt[0]);
            nt[0] = nt[2];         // switch first and last
            nt[2] = x;
          }
          else if ( isOnNode(nt[1], nNodes) != info.nodeId){ // 3
            size_t const x(nt[0]);
            nt[0] = nt[1];         // switch first two
            nt[1] = x;
          }
        } else {
          if ( isOnNode(nt[1], nNodes) == info.nodeId   // 56
            && isOnNode(nt[2], nNodes) != info.nodeId
            ) { // 6
            size_t const x(nt[1]);
            nt[1] = nt[2];         // switch last two
            nt[2] = x;
          }
        }
      }

      WITH_DBG if (info.nodeId == 0) std::cout << "\tsorting list of tuples...\n";
      //now we sort the list of tuples
      std::sort(nodeTuples.begin(), nodeTuples.end());

      WITH_DBG if (info.nodeId == 0) std::cout << "\trestoring tuples...\n";
      // we bring the tuples abc back in the order a<b<c
      for (auto &t: nodeTuples)  std::sort(t.begin(), t.end());

    #if ATRIP_DEBUG > 1
      WITH_DBG if (info.nodeId == 0)
      std::cout << "checking for validity of " << nodeTuples.size() << std::endl;
      const bool anyInvalid
        = std::any_of(nodeTuples.begin(),
                      nodeTuples.end(),
                      [](ABCTuple const& t) { return t == INVALID_TUPLE; });
      if (anyInvalid) throw "Some tuple is invalid in group-and-sort algorithm";
    #endif

      WITH_DBG if (info.nodeId == 0) std::cout << "\treturning tuples...\n";
      return nodeTuples;

    }

6.5.3 Main
^^^^^^^^^^

The main routine should return the list of tuples to be handled by the current rank.

Let :math:`N_p` be the number of ranks or processes.
Let :math:`N_n` be the number of nodes or sockets.

Then we have the following

::

    Global rank | 0 1 2 3 4 5 6 7 8
    key         | global rank
    nodeId      | 0 1 0 1 1 0 2 2 2
    Local rank  | 0 0 1 1 2 2 0 1 2
    intra color | 0 1 0 1 1 0 2 2 2


.. code:: c++

    std::vector<ABCTuple> main(MPI_Comm universe, size_t Nv) {

      int rank, np;
      MPI_Comm_rank(universe, &rank);
      MPI_Comm_size(universe, &np);

      std::vector<ABCTuple> result;

      auto const nodeNames(getNodeNames(universe));
      size_t const nNodes = unique(nodeNames).size();
      auto const nodeInfos = getNodeInfos(nodeNames);

      // We want to construct a communicator which only contains of one
      // element per node
      bool const computeDistribution
        = nodeInfos[rank].localRank == 0;

      std::vector<ABCTuple>
        nodeTuples
          = computeDistribution
          ? specialDistribution(Info{nNodes, nodeInfos[rank].nodeId},
                                getAllTuplesList(Nv))
          : std::vector<ABCTuple>()
          ;

      LOG(1,"Atrip") << "got nodeTuples\n";

      // now we have to send the data from **one** rank on each node
      // to all others ranks of this node
        const
      int color = nodeInfos[rank].nodeId
        , key = nodeInfos[rank].localRank
        ;


      MPI_Comm INTRA_COMM;
      MPI_Comm_split(universe, color, key, &INTRA_COMM);

Every node has to distribute ****at least****
``nodeTuples.size() / nodeInfos[rank].ranksPerNode``
tuples among the ranks.

We have to communicate this quantity among all nodes.

.. code:: c++


    size_t const
      tuplesPerRankLocal
         = nodeTuples.size() / nodeInfos[rank].ranksPerNode
         + size_t(nodeTuples.size() % nodeInfos[rank].ranksPerNode != 0)
         ;

    size_t tuplesPerRankGlobal;

    MPI_Reduce(&tuplesPerRankLocal,
               &tuplesPerRankGlobal,
               1,
               MPI_UINT64_T,
               MPI_MAX,
               0,
               universe);

    MPI_Bcast(&tuplesPerRankGlobal,
              1,
              MPI_UINT64_T,
              0,
              universe);

    LOG(1,"Atrip") << "Tuples per rank: " << tuplesPerRankGlobal << "\n";
    LOG(1,"Atrip") << "ranks per node " << nodeInfos[rank].ranksPerNode << "\n";
    LOG(1,"Atrip") << "#nodes " << nNodes << "\n";

Now we have the tuples that every rank has to have, i.e.,
``tuplesPerRankGlobal``.

However before this,
the tuples in ``nodeTuples`` now have to be sent from the local rank
in every node to all the ranks in the given node,
and we have to make sure that every rank inside a given node
gets the same amount of tuples, in this case it should be
``tuplesPerRankLocal``, and in our node the total number
of tuples should be ``tuplesPerRankLocal * nodeInfos[rank].ranksPerNode``,
however this might not be the case up to now due to divisibility issues.

Up to now we have exactly ``nodeTuples.size()`` tuples, we have to make sure by
resizing that the condition above is met, i.e., so we can resize
and add some fake tuples at the end as padding.

.. code:: c++

    size_t const totalTuples
      = tuplesPerRankGlobal * nodeInfos[rank].ranksPerNode;

    if (computeDistribution) {
      // pad with FAKE_TUPLEs
      nodeTuples.insert(nodeTuples.end(),
                        totalTuples - nodeTuples.size(),
                        FAKE_TUPLE);
    }

And now we can simply scatter the tuples in nodeTuples and send
``tuplesPerRankGlobal`` to the different ranks in the node,

.. code:: c++

    {
      // construct mpi type for abctuple
      MPI_Datatype MPI_ABCTUPLE;
      MPI_Type_vector(nodeTuples[0].size(), 1, 1, MPI_UINT64_T, &MPI_ABCTUPLE);
      MPI_Type_commit(&MPI_ABCTUPLE);

      LOG(1,"Atrip") << "scattering tuples \n";

      result.resize(tuplesPerRankGlobal);
      MPI_Scatter(nodeTuples.data(),
                  tuplesPerRankGlobal,
                  MPI_ABCTUPLE,
                  result.data(),
                  tuplesPerRankGlobal,
                  MPI_ABCTUPLE,
                  0,
                  INTRA_COMM);

      MPI_Type_free(&MPI_ABCTUPLE);

    }


The next step is sending the tuples in the local root rank
to the other ranks in the node, this we do with the MPI function
``MPI_Scatterv``.
Every rank gets ``tuplesPerRankLocal`` tuples and
the ``nodeTuples`` vector is now homogeneous and divisible by the number
of ranks per node in our node.
Therefore, the ``displacements`` are simply the vector


.. math::

    \left\{
      k * \mathrm{tuplesPerNodeLocal}
      \mid
      k \in
      \left\{ 0
            , \ldots
            , \#\text{ranks in node} - 1
            \right\}
      \right\}

and the ``sendCounts`` vector is simply the constant vector
``tuplesPerRankLocal`` of size ``ranksPerNode``.

.. code:: c++


      return result;

    }

6.5.4 Interface
^^^^^^^^^^^^^^^

The distribution interface will then simply be

.. code:: c++

    struct Distribution : public TuplesDistribution {
      ABCTuples getTuples(size_t Nv, MPI_Comm universe) override {
        return main(universe, Nv);
      }
    };

7 Unions
--------

Every slice pertaining to every different tensor
is sliced differently.


.. code:: c++

    #pragma once
    #include <atrip/SliceUnion.hpp>

    namespace atrip {

      template <typename F=double>
      void sliceIntoVector
        ( std::vector<F> &v
        , CTF::Tensor<F> &toSlice
        , std::vector<int64_t> const low
        , std::vector<int64_t> const up
        , CTF::Tensor<F> const& origin
        , std::vector<int64_t> const originLow
        , std::vector<int64_t> const originUp
        ) {
        // Thank you CTF for forcing me to do this
        struct { std::vector<int> up, low; }
            toSlice_ = { {up.begin(), up.end()}
                       , {low.begin(), low.end()} }
          , origin_ = { {originUp.begin(), originUp.end()}
                      , {originLow.begin(), originLow.end()} }
          ;

        WITH_OCD
        WITH_RANK << "slicing into " << pretty_print(toSlice_.up)
                              << "," << pretty_print(toSlice_.low)
                  << " from " << pretty_print(origin_.up)
                       << "," << pretty_print(origin_.low)
                  << "\n";

    #ifndef ATRIP_DONT_SLICE
        toSlice.slice( toSlice_.low.data()
                     , toSlice_.up.data()
                     , 0.0
                     , origin
                     , origin_.low.data()
                     , origin_.up.data()
                     , 1.0);
        memcpy(v.data(), toSlice.data, sizeof(F) * v.size());
    #endif

      }


      template <typename F=double>
      struct TAPHH : public SliceUnion<F> {
        TAPHH( CTF::Tensor<F> const& sourceTensor
             , size_t No
             , size_t Nv
             , size_t np
             , MPI_Comm child_world
             , MPI_Comm global_world
             ) : SliceUnion<F>( {Slice<F>::A, Slice<F>::B, Slice<F>::C}
                              , {Nv, No, No} // size of the slices
                              , {Nv}
                              , np
                              , child_world
                              , global_world
                              , Slice<F>::TA
                              , 6) {
               this->init(sourceTensor);
             }

        void sliceIntoBuffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override
        {
          const int Nv = this->sliceLength[0]
                  , No = this->sliceLength[1]
                  , a = this->rankMap.find({static_cast<size_t>(Atrip::rank), it});
                  ;


          sliceIntoVector<F>( this->sources[it]
                            , to,   {0, 0, 0},    {Nv, No, No}
                            , from, {a, 0, 0, 0}, {a+1, Nv, No, No}
                            );

        }

      };


      template <typename F=double>
      struct HHHA : public SliceUnion<F> {
        HHHA( CTF::Tensor<F> const& sourceTensor
            , size_t No
            , size_t Nv
            , size_t np
            , MPI_Comm child_world
            , MPI_Comm global_world
            ) : SliceUnion<F>( {Slice<F>::A, Slice<F>::B, Slice<F>::C}
                             , {No, No, No} // size of the slices
                             , {Nv}         // size of the parametrization
                             , np
                             , child_world
                             , global_world
                             , Slice<F>::VIJKA
                             , 6) {
               this->init(sourceTensor);
             }

        void sliceIntoBuffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override
        {

          const int No = this->sliceLength[0]
                  , a = this->rankMap.find({static_cast<size_t>(Atrip::rank), it})
                  ;

          sliceIntoVector<F>( this->sources[it]
                            , to,   {0, 0, 0},    {No, No, No}
                            , from, {0, 0, 0, a}, {No, No, No, a+1}
                            );

        }
      };

      template <typename F=double>
      struct ABPH : public SliceUnion<F> {
        ABPH( CTF::Tensor<F> const& sourceTensor
            , size_t No
            , size_t Nv
            , size_t np
            , MPI_Comm child_world
            , MPI_Comm global_world
            ) : SliceUnion<F>( { Slice<F>::AB, Slice<F>::BC, Slice<F>::AC
                               , Slice<F>::BA, Slice<F>::CB, Slice<F>::CA
                               }
                             , {Nv, No} // size of the slices
                             , {Nv, Nv} // size of the parametrization
                             , np
                             , child_world
                             , global_world
                             , Slice<F>::VABCI
                             , 2*6) {
               this->init(sourceTensor);
             }

        void sliceIntoBuffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {

          const int Nv = this->sliceLength[0]
                  , No = this->sliceLength[1]
                  , el = this->rankMap.find({static_cast<size_t>(Atrip::rank), it})
                  , a = el % Nv
                  , b = el / Nv
                  ;


          sliceIntoVector<F>( this->sources[it]
                            , to,   {0, 0},       {Nv, No}
                            , from, {a, b, 0, 0}, {a+1, b+1, Nv, No}
                            );

        }

      };

      template <typename F=double>
      struct ABHH : public SliceUnion<F> {
        ABHH( CTF::Tensor<F> const& sourceTensor
            , size_t No
            , size_t Nv
            , size_t np
            , MPI_Comm child_world
            , MPI_Comm global_world
            ) : SliceUnion<F>( {Slice<F>::AB, Slice<F>::BC, Slice<F>::AC}
                             , {No, No} // size of the slices
                             , {Nv, Nv} // size of the parametrization
                             , np
                             , child_world
                             , global_world
                             , Slice<F>::VABIJ
                             , 6) {
               this->init(sourceTensor);
             }

        void sliceIntoBuffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {

          const int Nv = from.lens[0]
                  , No = this->sliceLength[1]
                  , el = this->rankMap.find({static_cast<size_t>(Atrip::rank), it})
                  , a = el % Nv
                  , b = el / Nv
                  ;

          sliceIntoVector<F>( this->sources[it]
                            , to,   {0, 0},       {No, No}
                            , from, {a, b, 0, 0}, {a+1, b+1, No, No}
                            );


        }

      };


      template <typename F=double>
      struct TABHH : public SliceUnion<F> {
        TABHH( CTF::Tensor<F> const& sourceTensor
             , size_t No
             , size_t Nv
             , size_t np
             , MPI_Comm child_world
             , MPI_Comm global_world
             ) : SliceUnion<F>( {Slice<F>::AB, Slice<F>::BC, Slice<F>::AC}
                              , {No, No} // size of the slices
                              , {Nv, Nv} // size of the parametrization
                              , np
                              , child_world
                              , global_world
                              , Slice<F>::TABIJ
                              , 6) {
               this->init(sourceTensor);
             }

        void sliceIntoBuffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {
          // TODO: maybe generalize this with ABHH

          const int Nv = from.lens[0]
                  , No = this->sliceLength[1]
                  , el = this->rankMap.find({static_cast<size_t>(Atrip::rank), it})
                  , a = el % Nv
                  , b = el / Nv
                  ;

          sliceIntoVector<F>( this->sources[it]
                            , to,   {0, 0},       {No, No}
                            , from, {a, b, 0, 0}, {a+1, b+1, No, No}
                            );


        }

      };

    }

8 Equations
-----------

This section presents on of the main physical contents of the program,
the equations used.

The equations are of two types, energy equations and intermediate
tensor contractions.

8.1 Energy
~~~~~~~~~~

For the energy we have two types of functions,
a function for when the tuples :math:`(a,b,c)`
are distinct and where they are the same,
they are accordingly called:


.. code:: c++

    template <typename F=double>
    double getEnergyDistinct
      ( F const epsabc
      , size_t const No
      , F* const epsi
      , F* const Tijk
      , F* const Zijk
      );

    template <typename F=double>
    double getEnergySame
      ( F const epsabc
      , size_t const No
      , F* const epsi
      , F* const Tijk
      , F* const Zijk
      );

Their implementations follow, and they are
written in a cache-friendly style for CPU architectures
through the ``blockSize`` variable.

.. code:: c++

    template <typename F=double>
    double getEnergyDistinct
      ( F const epsabc
      , size_t const No
      , F* const epsi
      , F* const Tijk
      , F* const Zijk
      ) {
      constexpr size_t blockSize=16;
      F energy(0.);
      for (size_t kk=0; kk<No; kk+=blockSize){
        const size_t kend( std::min(No, kk+blockSize) );
        for (size_t jj(kk); jj<No; jj+=blockSize){
          const size_t jend( std::min( No, jj+blockSize) );
          for (size_t ii(jj); ii<No; ii+=blockSize){
            const size_t iend( std::min( No, ii+blockSize) );
            for (size_t k(kk); k < kend; k++){
              const F ek(epsi[k]);
              const size_t jstart = jj > k ? jj : k;
              for (size_t j(jstart); j < jend; j++){
                F const ej(epsi[j]);
                F const facjk = j == k ? F(0.5) : F(1.0);
                size_t istart = ii > j ? ii : j;
                for (size_t i(istart); i < iend; i++){
                  const F
                      ei(epsi[i])
                    , facij = i == j ? F(0.5) : F(1.0)
                    , denominator(epsabc - ei - ej - ek)
                    , U(Zijk[i + No*j + No*No*k])
                    , V(Zijk[i + No*k + No*No*j])
                    , W(Zijk[j + No*i + No*No*k])
                    , X(Zijk[j + No*k + No*No*i])
                    , Y(Zijk[k + No*i + No*No*j])
                    , Z(Zijk[k + No*j + No*No*i])
                    , A(maybeConjugate<F>(Tijk[i + No*j + No*No*k]))
                    , B(maybeConjugate<F>(Tijk[i + No*k + No*No*j]))
                    , C(maybeConjugate<F>(Tijk[j + No*i + No*No*k]))
                    , D(maybeConjugate<F>(Tijk[j + No*k + No*No*i]))
                    , E(maybeConjugate<F>(Tijk[k + No*i + No*No*j]))
                    , _F(maybeConjugate<F>(Tijk[k + No*j + No*No*i]))
                    , value
                      = 3.0 * ( A * U
                                + B * V
                                + C * W
                                + D * X
                                + E * Y
                                + _F * Z )
                     + ( ( U + X + Y )
                       - 2.0 * ( V + W + Z )
                       ) * ( A + D + E )
                     + ( ( V + W + Z )
                       - 2.0 * ( U + X + Y )
                       ) * ( B + C + _F )
                    ;
                  energy += 2.0 * value / denominator * facjk * facij;
                } // i
              } // j
            } // k
          } // ii
        } // jj
      } // kk
      return std::real(energy);
    }


    template <typename F=double>
    double getEnergySame
      ( F const epsabc
      , size_t const No
      , F* const epsi
      , F* const Tijk
      , F* const Zijk
      ) {
      constexpr size_t blockSize = 16;
      F energy = F(0.);
      for (size_t kk=0; kk<No; kk+=blockSize){
        const size_t kend( std::min( kk+blockSize, No) );
        for (size_t jj(kk); jj<No; jj+=blockSize){
          const size_t jend( std::min( jj+blockSize, No) );
          for (size_t ii(jj); ii<No; ii+=blockSize){
            const size_t iend( std::min( ii+blockSize, No) );
            for (size_t k(kk); k < kend; k++){
              const F ek(epsi[k]);
              const size_t jstart = jj > k ? jj : k;
              for(size_t j(jstart); j < jend; j++){
                const F facjk( j == k ? F(0.5) : F(1.0));
                const F ej(epsi[j]);
                const size_t istart = ii > j ? ii : j;
                for(size_t i(istart); i < iend; i++){
                  const F
                    ei(epsi[i])
                  , facij ( i==j ? F(0.5) : F(1.0))
                  , denominator(epsabc - ei - ej - ek)
                  , U(Zijk[i + No*j + No*No*k])
                  , V(Zijk[j + No*k + No*No*i])
                  , W(Zijk[k + No*i + No*No*j])
                  , A(maybeConjugate<F>(Tijk[i + No*j + No*No*k]))
                  , B(maybeConjugate<F>(Tijk[j + No*k + No*No*i]))
                  , C(maybeConjugate<F>(Tijk[k + No*i + No*No*j]))
                  , value
                    = F(3.0) * ( A * U
                               + B * V
                               + C * W
                               )
                    - ( A + B + C ) * ( U + V + W )
                  ;
                  energy += F(2.0) * value / denominator * facjk * facij;
                } // i
              } // j
            } // k
          } // ii
        } // jj
      } // kk
      return std::real(energy);
    }

And we explicitly instantiate these templated functions:

.. code:: c++

    // instantiate double
    template
    double getEnergyDistinct
      ( double const epsabc
      , size_t const No
      , double* const epsi
      , double* const Tijk
      , double* const Zijk
      );

    template
    double getEnergySame
      ( double const epsabc
      , size_t const No
      , double* const epsi
      , double* const Tijk
      , double* const Zijk
      );

    // instantiate Complex
    template
    double getEnergyDistinct
      ( Complex const epsabc
      , size_t const No
      , Complex* const epsi
      , Complex* const Tijk
      , Complex* const Zijk
      );

    template
    double getEnergySame
      ( Complex const epsabc
      , size_t const No
      , Complex* const epsi
      , Complex* const Tijk
      , Complex* const Zijk
      );

8.2 Contractions
~~~~~~~~~~~~~~~~

8.2.1 Singles contribution
^^^^^^^^^^^^^^^^^^^^^^^^^^

The first contraction we discuss is the one involving :math:`t_i^a`
singles amplitudes.

For every :math:`(a,b,c)` we construct the object

\begin{align*}
Z_{ijk} &= t^a_i V^{bc}_{jk} \\
        & + t^b_j V^{ac}_{ik} \\
        & + t^c_k V^{ab}_{ij}
\end{align*}

and for it we will need the corresponding slices
of the :math:`V^{pp}_{ph}` integrals, i.e., ``AB``, ``AC`` and ``BC``.


.. code:: c++


      template <typename F=double>
      void singlesContribution
        ( size_t No
        , size_t Nv
        , const ABCTuple &abc
        , F* const Tph
        , F* const VABij
        , F* const VACij
        , F* const VBCij
        , F* Zijk
        ) {
        const size_t a(abc[0]), b(abc[1]), c(abc[2]);
        const size_t NoNo = No*No;
        // TODO: change order of for loops
        for (size_t k = 0; k < No; k++)
        for (size_t i = 0; i < No; i++)
        for (size_t j = 0; j < No; j++) {
          const size_t ijk = i + j*No + k*NoNo;
          Zijk[ijk] += Tph[ a + i * Nv ] * VBCij[ j + k * No ];
          Zijk[ijk] += Tph[ b + j * Nv ] * VACij[ i + k * No ];
          Zijk[ijk] += Tph[ c + k * Nv ] * VABij[ i + j * No ];
        }
      }

    // instantiate
      template void singlesContribution( size_t No
                                       , size_t Nv
                                       , const ABCTuple &abc
                                       , double* const Tph
                                       , double* const VABij
                                       , double* const VACij
                                       , double* const VBCij
                                       , double* Zijk
                                       );

      template void singlesContribution( size_t No
                                       , size_t Nv
                                       , const ABCTuple &abc
                                       , Complex* const Tph
                                       , Complex* const VABij
                                       , Complex* const VACij
                                       , Complex* const VBCij
                                       , Complex* Zijk
                                       );

8.2.2 Doubles contribution
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next we build the tensor :math:`T_{ijk}` from the contraction
of the coulomb integrals with the doubles amplitudes :math:`t^{ab}_{ij}`.
There are two kinds of contractions, which are differently challenging
in terms of computation.

\begin{align*}
T_{ijk}  &=  \\
\sum_{{\color{red}L} = 0}^{N_\mathrm{o}}
&- T^{{\color{blue} ab}}_{{\color{red}L}j} V_{ik}^{{\color{red}L}{\color{blue}c}} \\
&- T^{{\color{blue} ab}}_{i{\color{red}L}} V_{jk}^{{\color{red}L}{\color{blue}c}} \\
&- T^{{\color{blue} ac}}_{{\color{red}L}k} V_{ij}^{{\color{red}L}{\color{blue}b}} \\
&- T^{{\color{blue} ac}}_{i{\color{red}L}} V_{kj}^{{\color{red}L}{\color{blue}b}} \\
&- T^{{\color{blue} bc}}_{{\color{red}L}k} V_{ji}^{{\color{red}L}{\color{blue}a}} \\
&- T^{{\color{blue} bc}}_{j{\color{red}L}} V_{ki}^{{\color{red}L}{\color{blue}a}}
\end{align*}

\begin{align*}
T_{ijk}  &=  \\
\sum_{{\color{red}e} = 0}^{N_\mathrm{v}}
 &\phantom{+}
V^{{\color{blue}ab}}_{{\color{red}e}i}  T^{{\color{blue}c}{\color{red}e}}_{ij} \\
        &+ V^{{\color{blue}ba}}_{{\color{red}e}j}  T^{{\color{blue}c}{\color{red}e}}_{ji} \\
        &+ V^{{\color{blue}cb}}_{{\color{red}e}k}  T^{{\color{blue}a}{\color{red}e}}_{kj} \\
        &+ V^{{\color{blue}ac}}_{{\color{red}e}i}  T^{{\color{blue}b}{\color{red}e}}_{ik} \\
        &+ V^{{\color{blue}bc}}_{{\color{red}e}j}  T^{{\color{blue}a}{\color{red}e}}_{jk} \\
        &+ V^{{\color{blue}ca}}_{{\color{red}e}k}  T^{{\color{blue}b}{\color{red}e}}_{ki} \\
\end{align*}


.. code:: c++

    #if defined(HAVE_CUDA)
      __device__
    #endif
      template <typename F=double>
      void doublesContribution
        ( const ABCTuple &abc
        , size_t const No
        , size_t const Nv
        // -- VABCI
        , DataPtr<F> const VABph
        , DataPtr<F> const VACph
        , DataPtr<F> const VBCph
        , DataPtr<F> const VBAph
        , DataPtr<F> const VCAph
        , DataPtr<F> const VCBph
        // -- VHHHA
        , DataPtr<F> const VhhhA
        , DataPtr<F> const VhhhB
        , DataPtr<F> const VhhhC
        // -- TA
        , DataPtr<F> const TAphh
        , DataPtr<F> const TBphh
        , DataPtr<F> const TCphh
        // -- TABIJ
        , DataPtr<F> const TABhh
        , DataPtr<F> const TAChh
        , DataPtr<F> const TBChh
        // -- TIJK
        // , DataPtr<F> Tijk
        , DataFieldType<F>* Tijk_
        );


.. code:: c++

    #if defined(HAVE_CUDA)
      __device__
    #endif
      template <typename F=double>
      void doublesContribution
        ( const ABCTuple &abc
        , size_t const No
        , size_t const Nv
        // -- VABCI
        , DataPtr<F> const VABph
        , DataPtr<F> const VACph
        , DataPtr<F> const VBCph
        , DataPtr<F> const VBAph
        , DataPtr<F> const VCAph
        , DataPtr<F> const VCBph
        // -- VHHHA
        , DataPtr<F> const VhhhA
        , DataPtr<F> const VhhhB
        , DataPtr<F> const VhhhC
        // -- TA
        , DataPtr<F> const TAphh
        , DataPtr<F> const TBphh
        , DataPtr<F> const TCphh
        // -- TABIJ
        , DataPtr<F> const TABhh
        , DataPtr<F> const TAChh
        , DataPtr<F> const TBChh
        // -- TIJK
        // , DataPtr<F> Tijk_
        , DataFieldType<F>* Tijk_
        ) {

        const size_t a = abc[0], b = abc[1], c = abc[2]
                  , NoNo = No*No, NoNv = No*Nv
                  ;

        typename DataField<F>::type* Tijk = (typename DataField<F>::type*) Tijk_;

    #if defined(ATRIP_USE_DGEMM)
    #define _IJK_(i, j, k) i + j*No + k*NoNo
    #define REORDER(__II, __JJ, __KK)                                   \
        WITH_CHRONO("doubles:reorder",                                  \
        for (size_t k = 0; k < No; k++)                                 \
        for (size_t j = 0; j < No; j++)                                 \
        for (size_t i = 0; i < No; i++) {                               \
          Tijk[_IJK_(i, j, k)] += _t_buffer_p[_IJK_(__II, __JJ, __KK)];   \
        }                                                               \
        )
    #if defined(HAVE_CUDA)
    #define __TO_DEVICEPTR(_v)                      \
        ((DataFieldType<F>*)                        \
         (CUdeviceptr)                              \
         thrust::raw_pointer_cast((_v)))
    #define DGEMM_PARTICLES(__A, __B)                   \
      atrip::xgemm<F>("T",                              \
                      "N",                              \
                      (int const*)&NoNo,                \
                      (int const*)&No,                  \
                      (int const*)&Nv,                  \
                      &one,                             \
                      (DataFieldType<F>*)__A,           \
                      (int const*)&Nv,                  \
                      (DataFieldType<F>*)__B,           \
                      (int const*)&Nv,                  \
                      &zero,                            \
                      _t_buffer_p,                      \
                      (int const*)&NoNo);
    #define DGEMM_HOLES(__A, __B, __TRANSB)             \
      atrip::xgemm<F>("N",                              \
                      __TRANSB,                         \
                      (int const*)&NoNo,                \
                      (int const*)&No,                  \
                      (int const*)&No,                  \
                      &m_one,                           \
                      __TO_DEVICEPTR(__A),              \
                      (int const*)&NoNo,                \
                      (DataFieldType<F>*)__B,           \
                      (int const*)&No,                  \
                      &zero,                            \
                      _t_buffer_p,                      \
                      (int const*)&NoNo                 \
                      );
    #define MAYBE_CONJ(_conj, _buffer)                                      \
      for (size_t __i = 0; __i < NoNoNo; ++__i)                             \
        __TO_DEVICEPTR(_conj.data())[__i] =                                 \
          maybeConjugate<DataFieldType<F>>(((DataFieldType<F>*)(_buffer))[__i]);
    #else
    #define __TO_DEVICEPTR(_v) (_v)
    #define DGEMM_PARTICLES(__A, __B)               \
      atrip::xgemm<F>("T",                          \
                      "N",                          \
                      (int const*)&NoNo,            \
                      (int const*)&No,              \
                      (int const*)&Nv,              \
                      &one,                         \
                      __A,                          \
                      (int const*)&Nv,              \
                      __B,                          \
                      (int const*)&Nv,              \
                      &zero,                        \
                      _t_buffer_p,                  \
                      (int const*)&NoNo             \
                      );
    #define DGEMM_HOLES(__A, __B, __TRANSB)         \
      atrip::xgemm<F>("N",                          \
                      __TRANSB,                     \
                      (int const*)&NoNo,            \
                      (int const*)&No,              \
                      (int const*)&No,              \
                      &m_one,                       \
                      __A,                          \
                      (int const*)&NoNo,            \
                      __B,                          \
                      (int const*)&No,              \
                      &zero,                        \
                      _t_buffer_p,                  \
                      (int const*)&NoNo             \
                      );
    #define MAYBE_CONJ(_conj, _buffer)                \
      for (size_t __i = 0; __i < NoNoNo; ++__i)       \
        _conj[__i] = maybeConjugate<F>(_buffer[__i]);
    #endif

        const size_t NoNoNo = No*NoNo;
    #ifdef HAVE_CUDA
        thrust::device_vector< DataFieldType<F> > _t_buffer;
    #else
        std::vector<F> _t_buffer;
    #endif
        _t_buffer.reserve(NoNoNo);
        DataFieldType<F>* _t_buffer_p = __TO_DEVICEPTR(_t_buffer.data());
        F one{1.0}, m_one{-1.0}, zero{0.0};

        WITH_CHRONO("double:reorder",
          for (size_t k = 0; k < NoNoNo; k++) {
            Tijk[k] = DataFieldType<F>{0.0};
           })

        // TOMERGE: replace chronos
        WITH_CHRONO("doubles:holes",
          { // Holes part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    #ifdef HAVE_CUDA
            thrust::device_vector< DataFieldType<F> > _vhhh;
            _vhhh.reserve(NoNoNo);
    #else
            std::vector<F> _vhhh(NoNoNo);
    #endif

            // VhhhC[i + k*No + L*NoNo] * TABhh[L + j*No]; H1
            MAYBE_CONJ(_vhhh, VhhhC)
            WITH_CHRONO("doubles:holes:1",
              DGEMM_HOLES(_vhhh.data(), TABhh, "N")
              REORDER(i, k, j)
            )
            // VhhhC[j + k*No + L*NoNo] * TABhh[i + L*No]; H0
            WITH_CHRONO("doubles:holes:2",
              DGEMM_HOLES(_vhhh.data(), TABhh, "T")
              REORDER(j, k, i)
            )

            // VhhhB[i + j*No + L*NoNo] * TAChh[L + k*No]; H5
            MAYBE_CONJ(_vhhh, VhhhB)
            WITH_CHRONO("doubles:holes:3",
              DGEMM_HOLES(_vhhh.data(), TAChh, "N")
              REORDER(i, j, k)
            )
            // VhhhB[k + j*No + L*NoNo] * TAChh[i + L*No]; H3
            WITH_CHRONO("doubles:holes:4",
              DGEMM_HOLES(_vhhh.data(), TAChh, "T")
              REORDER(k, j, i)
            )

            // VhhhA[j + i*No + L*NoNo] * TBChh[L + k*No]; H1
            MAYBE_CONJ(_vhhh, VhhhA)
            WITH_CHRONO("doubles:holes:5",
              DGEMM_HOLES(_vhhh.data(), TBChh, "N")
              REORDER(j, i, k)
            )
            // VhhhA[k + i*No + L*NoNo] * TBChh[j + L*No]; H4
            WITH_CHRONO("doubles:holes:6",
              DGEMM_HOLES(_vhhh.data(), TBChh, "T")
              REORDER(k, i, j)
            )

          }
        )
      #undef MAYBE_CONJ

        WITH_CHRONO("doubles:particles",
          { // Particle part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            // TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv]; P0
            WITH_CHRONO("doubles:particles:1",
              DGEMM_PARTICLES(TAphh, VBCph)
              REORDER(i, j, k)
            )
            // TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv]; P3
            WITH_CHRONO("doubles:particles:2",
              DGEMM_PARTICLES(TAphh, VCBph)
              REORDER(i, k, j)
            )
            // TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv]; P5
            WITH_CHRONO("doubles:particles:3",
              DGEMM_PARTICLES(TCphh, VABph)
              REORDER(k, i, j)
            )
            // TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv]; P2
            WITH_CHRONO("doubles:particles:4",
              DGEMM_PARTICLES(TCphh, VBAph)
              REORDER(k, j, i)
            )
            // TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv]; P1
            WITH_CHRONO("doubles:particles:5",
              DGEMM_PARTICLES(TBphh, VACph)
              REORDER(j, i, k)
            )
            // TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv]; P4
            WITH_CHRONO("doubles:particles:6",
              DGEMM_PARTICLES(TBphh, VCAph)
              REORDER(j, k, i)
            )
          }
        )

      #undef REORDER
      #undef DGEMM_HOLES
      #undef DGEMM_PARTICLES
      #undef _IJK_
      #else
        for (size_t k = 0; k < No; k++)
        for (size_t j = 0; j < No; j++)
        for (size_t i = 0; i < No; i++){
          const size_t ijk = i + j*No + k*NoNo
                    ,  jk = j + k*No
                    ;
          Tijk[ijk] = 0.0; // :important
          // HOLE DIAGRAMS: TABHH and VHHHA
          for (size_t L = 0; L < No; L++){
            // t[abLj] * V[Lcik]        H1
            // t[baLi] * V[Lcjk]        H0      TODO: conjugate T for complex
            Tijk[ijk] -= TABhh[L + j*No] * VhhhC[i + k*No + L*NoNo];
            Tijk[ijk] -= TABhh[i + L*No] * VhhhC[j + k*No + L*NoNo];

            // t[acLk] * V[Lbij]        H5
            // t[caLi] * V[Lbkj]        H3
            Tijk[ijk] -= TAChh[L + k*No] * VhhhB[i + j*No + L*NoNo];
            Tijk[ijk] -= TAChh[i + L*No] * VhhhB[k + j*No + L*NoNo];

            // t[bcLk] * V[Laji]        H2
            // t[cbLj] * V[Laki]        H4
            Tijk[ijk] -= TBChh[L + k*No] * VhhhA[j + i*No + L*NoNo];
            Tijk[ijk] -= TBChh[j + L*No] * VhhhA[k + i*No + L*NoNo];
          }
          // PARTILCE DIAGRAMS: TAPHH and VABPH
          for (size_t E = 0; E < Nv; E++) {
            // t[aEij] * V[bcEk]        P0
            // t[aEik] * V[cbEj]        P3 // TODO: CHECK THIS ONE, I DONT KNOW
            Tijk[ijk] += TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv];
            Tijk[ijk] += TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv];

            // t[cEki] * V[abEj]        P5
            // t[cEkj] * V[baEi]        P2
            Tijk[ijk] += TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv];
            Tijk[ijk] += TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv];

            // t[bEji] * V[acEk]        P1
            // t[bEjk] * V[caEi]        P4
            Tijk[ijk] += TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv];
            Tijk[ijk] += TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv];
          }

        }
    #endif
      }


      // instantiate templates
    #if defined(HAVE_CUDA)
      __device__
    #endif
      template
      void doublesContribution<double>
        ( const ABCTuple &abc
        , size_t const No
        , size_t const Nv
        // -- VABCI
        , DataPtr<double> const VABph
        , DataPtr<double> const VACph
        , DataPtr<double> const VBCph
        , DataPtr<double> const VBAph
        , DataPtr<double> const VCAph
        , DataPtr<double> const VCBph
        // -- VHHHA
        , DataPtr<double> const VhhhA
        , DataPtr<double> const VhhhB
        , DataPtr<double> const VhhhC
        // -- TA
        , DataPtr<double> const TAphh
        , DataPtr<double> const TBphh
        , DataPtr<double> const TCphh
        // -- TABIJ
        , DataPtr<double> const TABhh
        , DataPtr<double> const TAChh
        , DataPtr<double> const TBChh
        // -- TIJK
        , DataFieldType<double>* Tijk
        );

    #if defined(HAVE_CUDA)
      __device__
    #endif
      template
      void doublesContribution<Complex>
        ( const ABCTuple &abc
        , size_t const No
        , size_t const Nv
        // -- VABCI
        , DataPtr<Complex> const VABph
        , DataPtr<Complex> const VACph
        , DataPtr<Complex> const VBCph
        , DataPtr<Complex> const VBAph
        , DataPtr<Complex> const VCAph
        , DataPtr<Complex> const VCBph
        // -- VHHHA
        , DataPtr<Complex> const VhhhA
        , DataPtr<Complex> const VhhhB
        , DataPtr<Complex> const VhhhC
        // -- TA
        , DataPtr<Complex> const TAphh
        , DataPtr<Complex> const TBphh
        , DataPtr<Complex> const TCphh
        // -- TABIJ
        , DataPtr<Complex> const TABhh
        , DataPtr<Complex> const TAChh
        , DataPtr<Complex> const TBChh
        // -- TIJK
        , DataFieldType<Complex>* Tijk
        );

9 Blas
------

The main matrix-matrix multiplication method used in this algorithm
is mainly using the ``DGEMM`` function, which we declare as
``extern`` since it should be known only at link-time.

.. code:: c++

    #include <atrip/Blas.hpp>
    #include <atrip/Atrip.hpp>

    #if defined(HAVE_CUDA)
    #  include <cstring>

      static inline
      cublasOperation_t char_to_cublasOperation(const char* trans) {
        if (strncmp("N", trans, 1) == 0)
          return CUBLAS_OP_N;
        else if (strncmp("T", trans, 1) == 0)
          return CUBLAS_OP_T;
        else
          return CUBLAS_OP_C;
      }

    #endif

    namespace atrip {


      template <>
      void xgemm<double>(const char *transa,
                 const char *transb,
                 const int *m,
                 const int *n,
                 const int *k,
                 double *alpha,
                 const typename DataField<double>::type *A,
                 const int *lda,
                 const typename DataField<double>::type *B,
                 const int *ldb,
                 double *beta,
                 typename DataField<double>::type *C,
                 const int *ldc) {
    #if defined(HAVE_CUDA)
        cublasDgemm(Atrip::cuda.handle,
                    char_to_cublasOperation(transa),
                    char_to_cublasOperation(transb),
                    *m, *n, *k,
                    alpha, A, *lda,
                    B, *ldb, beta,
                    C, *ldc);
    #else
        dgemm_(transa, transb,
               m, n, k,
               alpha, A, lda,
               B, ldb, beta,
               C, ldc);
    #endif
      }

      template <>
      void xgemm<Complex>(const char *transa,
                 const char *transb,
                 const int *m,
                 const int *n,
                 const int *k,
                 Complex *alpha,
                 const typename DataField<Complex>::type *A,
                 const int *lda,
                 const typename DataField<Complex>::type *B,
                 const int *ldb,
                 Complex *beta,
                 typename DataField<Complex>::type *C,
                 const int *ldc) {
    #if defined(HAVE_CUDA)
    #pragma warning HAVE_CUDA
        cuDoubleComplex
          cu_alpha = {std::real(*alpha), std::imag(*alpha)},
          cu_beta = {std::real(*beta), std::imag(*beta)};
        cublasZgemm(Atrip::cuda.handle,
                    char_to_cublasOperation(transa),
                    char_to_cublasOperation(transb),
                    *m, *n, *k,
                    &cu_alpha,

                    A, *lda,
                    B, *ldb,
                    &cu_beta,
                    C, *ldc);
    #else
        zgemm_(transa, transb,
               m, n, k,
               alpha, A, lda,
               B, ldb, beta,
               C, ldc);
    #endif
      }

    }

10 Atrip
--------

10.1 Header
~~~~~~~~~~~

.. code:: c++

    #pragma once
    #include <sstream>
    #include <string>
    #include <map>
    #include <mpi.h>
    #include "config.h"

    #if defined(HAVE_CUDA)
    #include <cuda.h>
    #define CUBLASAPI
    #include <cublas_api.h>
    #include <cublas_v2.h>
    #endif

    #include <atrip/Utils.hpp>
    #include <atrip/Types.hpp>

    #define ADD_ATTRIBUTE(_type, _name, _default)   \
      _type _name = _default;                       \
      Input& with_ ## _name(_type i) {              \
        _name = i;                                  \
        return *this;                               \
      }

    namespace atrip {

      struct Atrip {

        static size_t rank;
        static size_t np;
        static MPI_Comm communicator;
        static Timings chrono;
    #if defined(HAVE_CUDA)
        struct CudaContext {
          cublasStatus_t status;
          cublasHandle_t handle;
        };
        static CudaContext cuda;
    #endif

        static void init(MPI_Comm);


        template <typename F=double>
        struct Input {
          CTF::Tensor<F> *ei = nullptr
                       , *ea = nullptr
                       , *Tph = nullptr
                       , *Tpphh = nullptr
                       , *Vpphh = nullptr
                       , *Vhhhp = nullptr
                       , *Vppph = nullptr
                       ;
          Input& with_epsilon_i(CTF::Tensor<F> * t) { ei = t; return *this; }
          Input& with_epsilon_a(CTF::Tensor<F> * t) { ea = t; return *this; }
          Input& with_Tai(CTF::Tensor<F> * t) { Tph = t; return *this; }
          Input& with_Tabij(CTF::Tensor<F> * t) { Tpphh = t; return *this; }
          Input& with_Vabij(CTF::Tensor<F> * t) { Vpphh = t; return *this; }
          Input& with_Vijka(CTF::Tensor<F> * t) { Vhhhp = t; return *this; }
          Input& with_Vabci(CTF::Tensor<F> * t) { Vppph = t; return *this; }

          enum TuplesDistribution {
            NAIVE,
            GROUP_AND_SORT,
          };

          ADD_ATTRIBUTE(bool, deleteVppph, false)
          ADD_ATTRIBUTE(bool, rankRoundRobin, false)
          ADD_ATTRIBUTE(bool, chrono, false)
          ADD_ATTRIBUTE(bool, barrier, false)
          ADD_ATTRIBUTE(int, maxIterations, 0)
          ADD_ATTRIBUTE(int, iterationMod, -1)
          ADD_ATTRIBUTE(int, percentageMod, -1)
          ADD_ATTRIBUTE(TuplesDistribution, tuplesDistribution, NAIVE)
          ADD_ATTRIBUTE(std::string, checkpointPath, "atrip-checkpoint.yaml")
          ADD_ATTRIBUTE(bool, readCheckpointIfExists, true)
          ADD_ATTRIBUTE(bool, writeCheckpoint, true)
          ADD_ATTRIBUTE(float, checkpointAtPercentage, 10)
          ADD_ATTRIBUTE(size_t, checkpointAtEveryIteration, 0)

        };

        struct Output {
          double energy;
        };
        template <typename F=double>
        static Output run(Input<F> const& in);
      };

    }

    #undef ADD_ATTRIBUTE

10.2 Main
~~~~~~~~~

.. code:: c++

    #include <iomanip>

    #include <atrip/Atrip.hpp>
    #include <atrip/Utils.hpp>
    #include <atrip/Equations.hpp>
    #include <atrip/SliceUnion.hpp>
    #include <atrip/Unions.hpp>
    #include <atrip/Checkpoint.hpp>

    using namespace atrip;

    template <typename F> bool RankMap<F>::RANK_ROUND_ROBIN;
    template bool RankMap<double>::RANK_ROUND_ROBIN;
    template bool RankMap<Complex>::RANK_ROUND_ROBIN;
    size_t Atrip::rank;
    size_t Atrip::np;
    #if defined(HAVE_CUDA)
    typename Atrip::CudaContext Atrip::cuda;
    #endif
    MPI_Comm Atrip::communicator;
    Timings Atrip::chrono;

    // user printing block
    IterationDescriptor IterationDescription::descriptor;
    void atrip::registerIterationDescriptor(IterationDescriptor d) {
      IterationDescription::descriptor = d;
    }

    void Atrip::init(MPI_Comm world)  {
      Atrip::communicator = world;
      MPI_Comm_rank(world, (int*)&Atrip::rank);
      MPI_Comm_size(world, (int*)&Atrip::np);

    #if defined(HAVE_CUDA)
      Atrip::cuda.status = cublasCreate(&Atrip::cuda.handle);
    #endif

    }

    template <typename F>
    Atrip::Output Atrip::run(Atrip::Input<F> const& in) {

      const size_t np = Atrip::np;
      const size_t rank = Atrip::rank;
      MPI_Comm universe = Atrip::communicator;

      const size_t No = in.ei->lens[0];
      const size_t Nv = in.ea->lens[0];
      LOG(0,"Atrip") << "No: " << No << "\n";
      LOG(0,"Atrip") << "Nv: " << Nv << "\n";
      LOG(0,"Atrip") << "np: " << np << "\n";

      // allocate the three scratches, see piecuch
      // we need local copies of the following tensors on every
      // rank
      std::vector<F> _epsi(No)
                   , _epsa(Nv)
                   , _Tai(No * Nv)
                   ;

      in.ei->read_all(_epsi.data());
      in.ea->read_all(_epsa.data());
      in.Tph->read_all(_Tai.data());

    #if defined(HAVE_CUDA)
      DataPtr<F> Tai, epsi, epsa;
      cuMemAlloc(&Tai, sizeof(F) * _Tai.size());
      cuMemAlloc(&epsi, sizeof(F) * _epsi.size());
      cuMemAlloc(&epsa, sizeof(F) * _epsa.size());

      cuMemcpyHtoD(Tai, (void*)_Tai.data(), sizeof(F) * _Tai.size());
      cuMemcpyHtoD(epsi,(void*)_epsi.data(), sizeof(F) * _epsi.size());
      cuMemcpyHtoD(epsa, (void*)_epsa.data(), sizeof(F) * _epsa.size());

      DataPtr<F> Tijk, Zijk;
      cuMemAlloc(&Tijk, sizeof(F) * No * No * No);
      cuMemAlloc(&Zijk, sizeof(F) * No * No * No);
    #else
      std::vector<F> &Tai = _Tai, &epsi = _epsi, &epsa = _epsa;
      std::vector<F> Tijk(No*No*No), Zijk(No*No*No);
    #endif

      RankMap<F>::RANK_ROUND_ROBIN = in.rankRoundRobin;
      if (RankMap<F>::RANK_ROUND_ROBIN) {
        LOG(0,"Atrip") << "Doing rank round robin slices distribution\n";
      } else {
        LOG(0,"Atrip")
          << "Doing node > local rank round robin slices distribution\n";
      }


      // COMMUNICATOR CONSTRUCTION ========================================={{{1
      //
      // Construct a new communicator living only on a single rank
      int child_size = 1
        , child_rank
        ;
      const
      int color = rank / child_size
        , crank = rank % child_size
        ;
      MPI_Comm child_comm;
      if (np == 1) {
        child_comm = universe;
      } else {
        MPI_Comm_split(universe, color, crank, &child_comm);
        MPI_Comm_rank(child_comm, &child_rank);
        MPI_Comm_size(child_comm, &child_size);
      }

      // BUILD SLICES PARAMETRIZED BY NV x NV =============================={{{1
      WITH_CHRONO("nv-nv-slices",
        LOG(0,"Atrip") << "BUILD NV x NV-SLICES\n";
        ABPH<F> abph(*in.Vppph, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
        ABHH<F> abhh(*in.Vpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
        TABHH<F> tabhh(*in.Tpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
      )

      // delete the Vppph so that we don't have a HWM situation for the NV slices
      if (in.deleteVppph) {
        delete in.Vppph;
      }

      // BUILD SLICES PARAMETRIZED BY NV ==================================={{{1
      WITH_CHRONO("nv-slices",
        LOG(0,"Atrip") << "BUILD NV-SLICES\n";
        TAPHH<F> taphh(*in.Tpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
        HHHA<F>  hhha(*in.Vhhhp, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
      )

      // all tensors
      std::vector< SliceUnion<F>* > unions = {&taphh, &hhha, &abph, &abhh, &tabhh};

      // get tuples for the current rank
      TuplesDistribution *distribution;

      if (in.tuplesDistribution == Atrip::Input<F>::TuplesDistribution::NAIVE) {
        LOG(0,"Atrip") << "Using the naive distribution\n";
        distribution = new NaiveDistribution();
      } else {
        LOG(0,"Atrip") << "Using the group-and-sort distribution\n";
        distribution = new group_and_sort::Distribution();
      }

      LOG(0,"Atrip") << "BUILDING TUPLE LIST\n";
      WITH_CHRONO("tuples:build",
        auto const tuplesList = distribution->getTuples(Nv, universe);
        )
      const size_t nIterations = tuplesList.size();
      {
        LOG(0,"Atrip") << "#iterations: "
                      << nIterations
                      << "/"
                      << nIterations * np
                      << "\n";
      }

      const size_t
          iterationMod = (in.percentageMod > 0)
                      ? nIterations * in.percentageMod / 100.0
                      : in.iterationMod

        , iteration1Percent = nIterations * 0.01
        ;


      auto const isFakeTuple
        = [&tuplesList, distribution](size_t const i) {
          return distribution->tupleIsFake(tuplesList[i]);
        };


      using Database = typename Slice<F>::Database;
      auto communicateDatabase
        = [ &unions
          , np
          ] (ABCTuple const& abc, MPI_Comm const& c) -> Database {

            WITH_CHRONO("db:comm:type:do",
              auto MPI_LDB_ELEMENT = Slice<F>::mpi::localDatabaseElement();
            )

            WITH_CHRONO("db:comm:ldb",
              typename Slice<F>::LocalDatabase ldb;
              for (auto const& tensor: unions) {
                auto const& tensorDb = tensor->buildLocalDatabase(abc);
                ldb.insert(ldb.end(), tensorDb.begin(), tensorDb.end());
              }
            )

            Database db(np * ldb.size(), ldb[0]);

            WITH_CHRONO("oneshot-db:comm:allgather",
            WITH_CHRONO("db:comm:allgather",
              MPI_Allgather( ldb.data()
                           , ldb.size()
                           , MPI_LDB_ELEMENT
                           , db.data()
                           , ldb.size()
                           , MPI_LDB_ELEMENT
                           , c);
            ))

            WITH_CHRONO("db:comm:type:free",
              MPI_Type_free(&MPI_LDB_ELEMENT);
            )

            return db;
          };

      auto doIOPhase
        = [&unions, &rank, &np, &universe] (Database const& db) {

        const size_t localDBLength = db.size() / np;

        size_t sendTag = 0
             , recvTag = rank * localDBLength
             ;

        // RECIEVE PHASE ======================================================
        {
          // At this point, we have already send to everyone that fits
          auto const& begin = &db[rank * localDBLength]
                    , end   = begin + localDBLength
                    ;
          for (auto it = begin; it != end; ++it) {
            recvTag++;
            auto const& el = *it;
            auto& u = unionByName(unions, el.name);

            WITH_DBG std::cout
              << rank << ":r"
              << "♯" << recvTag << " =>"
              << " «n" << el.name
              << ", t" << el.info.type
              << ", s" << el.info.state
              << "»"
              << " ⊙ {" << rank << "⇐" << el.info.from.rank
                        << ", "
                        << el.info.from.source << "}"
              << " ∴ {" << el.info.tuple[0]
                        << ", "
                        << el.info.tuple[1]
                        << "}"
              << "\n"
              ;

            WITH_CHRONO("db:io:recv",
              u.receive(el.info, recvTag);
            )

          } // recv
        }

        // SEND PHASE =========================================================
        for (size_t otherRank = 0; otherRank<np; otherRank++) {
          auto const& begin = &db[otherRank * localDBLength]
                    , end = begin + localDBLength
                    ;
          for (auto it = begin; it != end; ++it) {
            sendTag++;
            typename Slice<F>::LocalDatabaseElement const& el = *it;

            if (el.info.from.rank != rank) continue;

            auto& u = unionByName(unions, el.name);
            WITH_DBG std::cout
              << rank << ":s"
              << "♯" << sendTag << " =>"
              << " «n" << el.name
              << ", t" << el.info.type
              << ", s" << el.info.state
              << "»"
              << " ⊙ {" << el.info.from.rank << "⇒" << otherRank
                        << ", "
                        << el.info.from.source << "}"
              << " ∴ {" << el.info.tuple[0]
                        << ", "
                        << el.info.tuple[1]
                        << "}"
              << "\n"
              ;

            WITH_CHRONO("db:io:send",
              u.send(otherRank, el, sendTag);
            )

          } // send phase

        } // otherRank


      };

    #if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
      std::map<ABCTuple, double> tupleEnergies;
    #endif

      const double doublesFlops
        = double(No)
        * double(No)
        * double(No)
        * (double(No) + double(Nv))
        * 2.0
        * (traits::isComplex<F>() ? 2.0 : 1.0)
        * 6.0
        / 1e9
        ;

      // START MAIN LOOP ======================================================{{{1

      double energy(0.);
      size_t first_iteration = 0;
      Checkpoint c;
      const size_t checkpoint_mod
        = in.checkpointAtEveryIteration != 0
        ? in.checkpointAtEveryIteration
        : nIterations * in.checkpointAtPercentage / 100;
      if (in.readCheckpointIfExists) {
        std::ifstream fin(in.checkpointPath);
        if (fin.is_open()) {
          LOG(0, "Atrip") <<  "Reading checkpoint from "
                          << in.checkpointPath << "\n";
          c = read_checkpoint(fin);
          first_iteration = (size_t)c.iteration;
          if (first_iteration > nIterations) {
            // TODO: throw an error here
            // first_iteration is bigger than nIterations,
            // you probably started the program with a different number
            // of cores
          }
          if (No != c.no) {/* TODO: write warning */}
          if (Nv != c.nv) {/* TODO: write warning */}
          // TODO write warnings for nrank and so on
          if (Atrip::rank == 0) {
            // take the negative of the energy to correct for the
            // negativity of the equations, the energy in the checkpoint
            // should always be the correct physical one.
            energy = - (double)c.energy;
          }
          LOG(0, "Atrip") << "energy from checkpoint "
                          << energy << "\n";
          LOG(0, "Atrip") << "iteration from checkpoint "
                          << first_iteration << "\n";
        }
      }

      for ( size_t
              i = first_iteration,
              iteration = first_iteration + 1
          ; i < tuplesList.size()
          ; i++, iteration++
          ) {
        Atrip::chrono["iterations"].start();

        // check overhead from chrono over all iterations
        WITH_CHRONO("start:stop", {})

        // check overhead of doing a barrier at the beginning
        WITH_CHRONO("oneshot-mpi:barrier",
        WITH_CHRONO("mpi:barrier",
          if (in.barrier) MPI_Barrier(universe);
        ))

        // write checkpoints
        if (iteration % checkpoint_mod == 0) {
            double globalEnergy = 0;
            MPI_Reduce(&energy, &globalEnergy, 1, MPI_DOUBLE, MPI_SUM, 0, universe);
            Checkpoint out
              = {No,
                 Nv,
                 0, // TODO
                 0, // TODO
                 - globalEnergy,
                 iteration - 1,
                 in.rankRoundRobin};
            LOG(0, "Atrip") << "Writing checkpoint\n";
            if (Atrip::rank == 0) write_checkpoint(out, in.checkpointPath);
        }

        // write reporting
        if (iteration % iterationMod == 0 || iteration == iteration1Percent) {

          if (IterationDescription::descriptor) {
            IterationDescription::descriptor({
              iteration,
              nIterations,
              Atrip::chrono["iterations"].count()
            });
          }

          LOG(0,"Atrip")
            << "iteration " << iteration
            << " [" << 100 * iteration / nIterations << "%]"
            << " (" << doublesFlops * iteration / Atrip::chrono["doubles"].count()
            << "GF)"
            << " (" << doublesFlops * iteration / Atrip::chrono["iterations"].count()
            << "GF)"
            << "\n";


          // PRINT TIMINGS
          if (in.chrono)
          for (auto const& pair: Atrip::chrono)
            LOG(1, " ") << pair.first << " :: "
                        << pair.second.count()
                        << std::endl;

        }

        const ABCTuple abc = isFakeTuple(i)
                           ? tuplesList[tuplesList.size() - 1]
                           : tuplesList[i]
                     , *abcNext = i == (tuplesList.size() - 1)
                                ? nullptr
                                : &tuplesList[i + 1]
                     ;

        WITH_CHRONO("with_rank",
          WITH_RANK << " :it " << iteration
                    << " :abc " << pretty_print(abc)
                    << " :abcN "
                    << (abcNext ? pretty_print(*abcNext) : "None")
                    << "\n";
        )


        // COMM FIRST DATABASE ================================================{{{1
        if (i == first_iteration) {
          WITH_RANK << "__first__:first database ............ \n";
          const auto db = communicateDatabase(abc, universe);
          WITH_RANK << "__first__:first database communicated \n";
          WITH_RANK << "__first__:first database io phase \n";
          doIOPhase(db);
          WITH_RANK << "__first__:first database io phase DONE\n";
          WITH_RANK << "__first__::::Unwrapping all slices for first database\n";
          for (auto& u: unions) u->unwrapAll(abc);
          WITH_RANK << "__first__::::Unwrapping slices for first database DONE\n";
          MPI_Barrier(universe);
        }

        // COMM NEXT DATABASE ================================================={{{1
        if (abcNext) {
          WITH_RANK << "__comm__:" << iteration << "th communicating database\n";
          WITH_CHRONO("db:comm",
            const auto db = communicateDatabase(*abcNext, universe);
          )
          WITH_CHRONO("db:io",
            doIOPhase(db);
          )
          WITH_RANK << "__comm__:" <<  iteration << "th database io phase DONE\n";
        }

        // COMPUTE DOUBLES ===================================================={{{1
        OCD_Barrier(universe);
        if (!isFakeTuple(i)) {
          WITH_RANK << iteration << "-th doubles\n";
          WITH_CHRONO("oneshot-unwrap",
          WITH_CHRONO("unwrap",
          WITH_CHRONO("unwrap:doubles",
            for (auto& u: decltype(unions){&abph, &hhha, &taphh, &tabhh}) {
              u->unwrapAll(abc);
            }
          )))
          WITH_CHRONO("oneshot-doubles",
          WITH_CHRONO("doubles",
            doublesContribution<F>( abc, (size_t)No, (size_t)Nv
                                  // -- VABCI
                                  , abph.unwrapSlice(Slice<F>::AB, abc)
                                  , abph.unwrapSlice(Slice<F>::AC, abc)
                                  , abph.unwrapSlice(Slice<F>::BC, abc)
                                  , abph.unwrapSlice(Slice<F>::BA, abc)
                                  , abph.unwrapSlice(Slice<F>::CA, abc)
                                  , abph.unwrapSlice(Slice<F>::CB, abc)
                                  // -- VHHHA
                                  , hhha.unwrapSlice(Slice<F>::A, abc)
                                  , hhha.unwrapSlice(Slice<F>::B, abc)
                                  , hhha.unwrapSlice(Slice<F>::C, abc)
                                  // -- TA
                                  , taphh.unwrapSlice(Slice<F>::A, abc)
                                  , taphh.unwrapSlice(Slice<F>::B, abc)
                                  , taphh.unwrapSlice(Slice<F>::C, abc)
                                  // -- TABIJ
                                  , tabhh.unwrapSlice(Slice<F>::AB, abc)
                                  , tabhh.unwrapSlice(Slice<F>::AC, abc)
                                  , tabhh.unwrapSlice(Slice<F>::BC, abc)
                                  // -- TIJK
    #if defined(HAVE_CUDA)
                                  , (DataFieldType<F>*)Tijk
    #else
                                  , Tijk.data()
    #endif
                                  );
            WITH_RANK << iteration << "-th doubles done\n";
          ))
        }

        // COMPUTE SINGLES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% {{{1
        OCD_Barrier(universe);
        if (!isFakeTuple(i)) {
          WITH_CHRONO("oneshot-unwrap",
          WITH_CHRONO("unwrap",
          WITH_CHRONO("unwrap:singles",
            abhh.unwrapAll(abc);
          )))
          WITH_CHRONO("reorder",
            #pragma acc parallel
            for (size_t I(0); I < No*No*No; I++)
              ((DataFieldType<F>*)Zijk)[I] = ((DataFieldType<F>*)Tijk)[I];
          )
          WITH_CHRONO("singles",
          singlesContribution<F>( No, Nv, abc
    #if defined(HAVE_CUDA)
                                , (F*)Tai
    #else
                                , Tai.data()
    #endif
                                , (F*)abhh.unwrapSlice(Slice<F>::AB, abc)
                                , (F*)abhh.unwrapSlice(Slice<F>::AC, abc)
                                , (F*)abhh.unwrapSlice(Slice<F>::BC, abc)
    #if defined(HAVE_CUDA)
                                , (F*)Zijk);
    #else
                                , Zijk.data());
    #endif
          )
        }


        // COMPUTE ENERGY %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% {{{1
        if (!isFakeTuple(i)) {
          double tupleEnergy(0.);

          int distinct(0);
          if (abc[0] == abc[1]) distinct++;
          if (abc[1] == abc[2]) distinct--;
          const F epsabc(_epsa[abc[0]] + _epsa[abc[1]] + _epsa[abc[2]]);

          WITH_CHRONO("energy",
            if ( distinct == 0)
              tupleEnergy = getEnergyDistinct<F>(epsabc, No, (F*)epsi, (F*)Tijk, (F*)Zijk);
            else
              tupleEnergy = getEnergySame<F>(epsabc, No, (F*)epsi, (F*)Tijk, (F*)Zijk);
          )

    #if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
          tupleEnergies[abc] = tupleEnergy;
    #endif

          energy += tupleEnergy;

        }

        // TODO: remove this
        if (isFakeTuple(i)) {
          // fake iterations should also unwrap whatever they got
          WITH_RANK << iteration
                    << "th unwrapping because of fake in "
                    << i << "\n";
          for (auto& u: unions) u->unwrapAll(abc);
        }

    #ifdef HAVE_OCD
        for (auto const& u: unions) {
          WITH_RANK << "__dups__:"
                    << iteration
                    << "-th n" << u->name << " checking duplicates\n";
          u->checkForDuplicates();
        }
    #endif


        // CLEANUP UNIONS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
        OCD_Barrier(universe);
        if (abcNext) {
          WITH_RANK << "__gc__:" << iteration << "-th cleaning up.......\n";
          for (auto& u: unions) {

            u->unwrapAll(abc);
            WITH_RANK << "__gc__:n" << u->name  << " :it " << iteration
                      << " :abc " << pretty_print(abc)
                      << " :abcN " << pretty_print(*abcNext)
                      << "\n";
            // for (auto const& slice: u->slices)
            //   WITH_RANK << "__gc__:guts:" << slice.info << "\n";
            u->clearUnusedSlicesForNext(*abcNext);

            WITH_RANK << "__gc__: checking validity\n";

    #ifdef HAVE_OCD
            // check for validity of the slices
            for (auto type: u->sliceTypes) {
              auto tuple = Slice<F>::subtupleBySlice(abc, type);
            for (auto& slice: u->slices) {
              if ( slice.info.type == type
                 && slice.info.tuple == tuple
                 && slice.isDirectlyFetchable()
                 ) {
                if (slice.info.state == Slice<F>::Dispatched)
                  throw std::domain_error( "This slice should not be undispatched! "
                                         + pretty_print(slice.info));
              }
            }
            }
    #endif


          }
        }

          WITH_RANK << iteration << "-th cleaning up....... DONE\n";

        Atrip::chrono["iterations"].stop();
        // ITERATION END %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1

      }
        // END OF MAIN LOOP

      MPI_Barrier(universe);

      // PRINT TUPLES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
    #if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
      LOG(0,"Atrip") << "tuple energies" << "\n";
      for (size_t i = 0; i < np; i++) {
        MPI_Barrier(universe);
        for (auto const& pair: tupleEnergies) {
          if (i == rank)
            std::cout << pair.first[0]
                      << " " << pair.first[1]
                      << " " << pair.first[2]
                      << std::setprecision(15) << std::setw(23)
                      << " tupleEnergy: " << pair.second
                      << "\n"
                      ;
        }
      }
    #endif

      // COMMUNICATE THE ENERGIES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
      LOG(0,"Atrip") << "COMMUNICATING ENERGIES \n";
      double globalEnergy = 0;
      MPI_Reduce(&energy, &globalEnergy, 1, MPI_DOUBLE, MPI_SUM, 0, universe);

      WITH_RANK << "local energy " << energy << "\n";
      LOG(0, "Atrip") << "Energy: "
        << std::setprecision(15) << std::setw(23)
        << (- globalEnergy) << std::endl;

      // PRINT TIMINGS {{{1
      if (in.chrono)
      for (auto const& pair: Atrip::chrono)
        LOG(0,"atrip:chrono") << pair.first << " "
                              << pair.second.count() << std::endl;


      LOG(0, "atrip:flops(doubles)")
        << nIterations * doublesFlops / Atrip::chrono["doubles"].count() << "\n";
      LOG(0, "atrip:flops(iterations)")
        << nIterations * doublesFlops / Atrip::chrono["iterations"].count() << "\n";

      // TODO: change the sign in  the getEnergy routines
      return { - globalEnergy };

    }
    // instantiate
    template Atrip::Output Atrip::run(Atrip::Input<double> const& in);
    template Atrip::Output Atrip::run(Atrip::Input<Complex> const& in);

11 Debug and Logging
--------------------

11.1 Macros
~~~~~~~~~~~

.. code:: c++

    #pragma once
    #include <functional>
    #define ATRIP_BENCHMARK
    //#define ATRIP_DONT_SLICE
    #ifndef ATRIP_DEBUG
    #  define ATRIP_DEBUG 1
    #endif
    //#define ATRIP_WORKLOAD_DUMP
    #define ATRIP_USE_DGEMM
    //#define ATRIP_PRINT_TUPLES

    #ifndef ATRIP_DEBUG
    #define ATRIP_DEBUG 1
    #endif

    #if ATRIP_DEBUG == 4
    #  pragma message("WARNING: You have OCD debugging ABC triples "    \
                      "expect GB of output and consult your therapist")
    #  include <dbg.h>
    #  define HAVE_OCD
    #  define OCD_Barrier(com) MPI_Barrier(com)
    #  define WITH_OCD
    #  define WITH_ROOT if (atrip::Atrip::rank == 0)
    #  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
    #  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
    #  define WITH_CRAZY_DEBUG
    #  define WITH_DBG
    #  define DBG(...) dbg(__VA_ARGS__)
    #elif ATRIP_DEBUG == 3
    #  pragma message("WARNING: You have crazy debugging ABC triples,"  \
                      " expect GB of output")
    #  include <dbg.h>
    #  define OCD_Barrier(com)
    #  define WITH_OCD if (false)
    #  define WITH_ROOT if (atrip::Atrip::rank == 0)
    #  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
    #  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
    #  define WITH_CRAZY_DEBUG
    #  define WITH_DBG
    #  define DBG(...) dbg(__VA_ARGS__)
    #elif ATRIP_DEBUG == 2
    #  pragma message("WARNING: You have some debugging info for ABC triples")
    #  define OCD_Barrier(com)
    #  define WITH_OCD if (false)
    #  define WITH_ROOT if (atrip::Atrip::rank == 0)
    #  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
    #  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
    #  define WITH_CRAZY_DEBUG if (false)
    #  define WITH_DBG
    #  define DBG(...) dbg(__VA_ARGS__)
    #else
    #  define OCD_Barrier(com)
    #  define WITH_OCD if (false)
    #  define WITH_ROOT if (false)
    #  define WITH_SPECIAL(r) if (false)
    #  define WITH_RANK if (false) std::cout << atrip::Atrip::rank << ": "
    #  define WITH_DBG if (false)
    #  define WITH_CRAZY_DEBUG if (false)
    #  define DBG(...)
    #endif

And users of the library can redefine the ``LOG`` macro
which in case of not being defined is defined as follows:

.. code:: c++

    #ifndef LOG
    #define LOG(level, name) if (Atrip::rank == 0) std::cout << name << ": "
    #endif

Furthermore, if you do not wish to see any output from ATRIP, simply
define ``ATRIP_NO_OUTPUT``


.. code:: c++

    #ifdef ATRIP_NO_OUTPUT
    #  undef LOG
    #  define LOG(level, name) if (false) std::cout << name << ": "
    #endif

11.2 Iteration informer
~~~~~~~~~~~~~~~~~~~~~~~

In general a code writer will want to write some messages in every iteration.
A developer then can register a function to be used in this sense.
The input of the function is an `IterationDescriptor`_ structure and the output
should be nothing.

.. code:: c++
    :name: IterationDescriptor

    namespace atrip {

      struct IterationDescription;
      using IterationDescriptor = std::function<void(IterationDescription const&)>;
      struct IterationDescription {
        static IterationDescriptor descriptor;
        size_t currentIteration;
        size_t totalIterations;
        double currentElapsedTime;
      };

      void registerIterationDescriptor(IterationDescriptor);

    }

12 Checkpoints and restarts
---------------------------

12.1 Introduction
~~~~~~~~~~~~~~~~~

For very heavy workloads and possible bugs in the packages it is often
useful to restart from a given state of the calculation.

An advantage of the ``atrip`` algorithm is that the state is essentially given
by the

.. code:: yaml

    No: number of occupied orbitals
    Nv: number of virtual orbitals
    Nranks: number of ranks
    Nnodes: number of nodes
    Energy: the current total energy of the iterations
    Iteration: the iteration number
    Distribution: the type of distribution
    RankRoundRobin: wether the round robin is done through the ranks or
      nodes

This information we can encode in a simple struct

.. code:: c++
    :name: checkpoint-definition

    // template <typename F>
    struct Checkpoint {
      size_t no, nv;
      size_t nranks;
      size_t nnodes;
      double energy;
      size_t iteration;
      // TODO
      // Input<F>::TuplesDistribution distribution(GROUP_AND_SORT);
      bool rankRoundRobin;
    };

12.2 Input and output
~~~~~~~~~~~~~~~~~~~~~

In order to read and write the `checkpoint information <checkpoint-definition>`_, we need
to define a format. We choose a simple yaml format without any kind
of depth, so that we can write quite easily a parser.

.. code:: c++

    void write_checkpoint(Checkpoint const& c, std::string const& filepath) {
      std::ofstream out(filepath);
      out << "No: " << c.no
          << "\n"
          << "Nv: " << c.nv
          << "\n"
          << "Nranks: " << c.nranks
          << "\n"
          << "Nnodes: " << c.nnodes
          << "\n"
          << "Energy: " << std::setprecision(19) << c.energy
          << "\n"
          << "Iteration: " << c.iteration
          << "\n"
          << "RankRoundRobin: " << (c.rankRoundRobin ? "true" : "false")
          << "\n";
    }


    Checkpoint read_checkpoint(std::ifstream& in) {
      Checkpoint c;
      // trim chars from the string, to be more sure and not use regexes
      auto trim = [](std::string& s, std::string const& chars) {
        s.erase(0, s.find_first_not_of(chars));
        s.erase(s.find_last_not_of(chars) + 1);
        return s;
      };
      for (std::string header, value; std::getline(in, header, ':');) {
        std::getline(in, value, '\n');
        trim(value, " \t"); // trim all whitespaces
        trim(header, " \t");

        /**/ if (header == "No")        c.no = std::atoi(value.c_str());
        else if (header == "Nv")        c.nv = std::atoi(value.c_str());
        else if (header == "Nranks")    c.nranks = std::atoi(value.c_str());
        else if (header == "Nnodes")    c.nnodes = std::atoi(value.c_str());
        else if (header == "Energy")    c.energy = std::atof(value.c_str());
        else if (header == "Iteration") c.iteration = std::atoi(value.c_str());
        else if (header == "RankRoundRobin") c.rankRoundRobin = (value[0] == 't');
      }
      return c;
    }


    Checkpoint read_checkpoint(std::string const& filepath) {
      std::ifstream in(filepath);
      return read_checkpoint(in);
    }

13 Miscellaneous
----------------

13.1 Complex numbers
~~~~~~~~~~~~~~~~~~~~

.. code:: c++

    #pragma once

    #include <complex>
    #include <mpi.h>
    #include "config.h"
    #if defined(HAVE_CUDA)
    #include <cuComplex.h>
    #endif

    namespace atrip {

      using Complex = std::complex<double>;

      template <typename F> F maybeConjugate(const F);

    #if defined(HAVE_CUDA)
      cuDoubleComplex& operator+=(cuDoubleComplex& lz, cuDoubleComplex const& rz);
    #endif

      namespace traits {

        template <typename FF> bool isComplex();

        namespace mpi {
          template <typename F> MPI_Datatype datatypeOf(void);
        }

      }

    }

.. code:: c++

    #include <atrip/Complex.hpp>

    namespace atrip {

      template <> double maybeConjugate(const double a) { return a; }
      template <> Complex maybeConjugate(const Complex a) { return std::conj(a); }

    #if defined(HAVE_CUDA)
      /*
      __device__
      template <> double2
      maybeConjugate(const double2 a) {
        return {a.x, -a.y};
      }
      */
      __device__
      template <> cuDoubleComplex
      maybeConjugate(const cuDoubleComplex a) {
        return {a.x, -a.y};
      }
      /*
      __device__
      double2& operator+=(double2& lz, double2 const& rz) {
        lz.x += rz.x;
        lz.y += rz.y;
        return lz;
      }
      */
      __device__
      cuDoubleComplex& operator+=(cuDoubleComplex& lz, cuDoubleComplex const& rz) {
        lz.x += rz.x;
        lz.y += rz.y;
        return lz;
      }
    #endif


      namespace traits {
        template <typename F> bool isComplex() { return false; }
        template <> bool isComplex<double>() { return false; }
        template <> bool isComplex<Complex>() { return true; }
      namespace mpi {
        template <> MPI_Datatype datatypeOf<double>() { return MPI_DOUBLE; }
        template <> MPI_Datatype datatypeOf<Complex>() { return MPI_DOUBLE_COMPLEX; }
      }
      }

    }

14 Include header
-----------------

.. code:: c++

    #pragma once

    #include <atrip/Atrip.hpp>