bcb2dfaeSJed Brown# Interface Concepts
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThis page provides a brief description of the theoretical foundations and the
bcb2dfaeSJed Brownpractical implementation of the libCEED library.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown(theoretical-framework)=
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown## Theoretical Framework
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownIn finite element formulations, the weak form of a Partial Differential Equation
bcb2dfaeSJed Brown(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results
bcb2dfaeSJed Brownare composed into a larger system of equations that models the entire problem on
bcb2dfaeSJed Brownthe global domain $\Omega$. In particular, when high-order finite elements or
bcb2dfaeSJed Brownspectral elements are used, the resulting sparse matrix representation of the global
bcb2dfaeSJed Brownoperator is computationally expensive, with respect to both the memory transfer and
bcb2dfaeSJed Brownfloating point operations needed for its evaluation. libCEED provides an interface
bcb2dfaeSJed Brownfor matrix-free operator description that enables efficient evaluation on a variety
bcb2dfaeSJed Brownof computational device types (selectable at run time). We present here the notation
bcb2dfaeSJed Brownand the mathematical formulation adopted in libCEED.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownWe start by considering the discrete residual $F(u)=0$ formulation
bcb2dfaeSJed Brownin weak form. We first define the $L^2$ inner product between real-valued functions
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown\langle v, u \rangle = \int_\Omega v u d \bm{x},
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown
bcb2dfaeSJed Brownwhere $\bm{x} \in \mathbb{R}^d \supset \Omega$.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownWe want to find $u$ in a suitable space $V_D$,
bcb2dfaeSJed Brownsuch that
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown\langle  \bm v,  \bm f(u) \rangle = \int_\Omega  \bm v \cdot  \bm f_0 (u, \nabla u) + \nabla \bm v :  \bm f_1 (u, \nabla u) = 0
bcb2dfaeSJed Brown$$ (residual)
bcb2dfaeSJed Brown
bcb2dfaeSJed Brownfor all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$
bcb2dfaeSJed Brownand $\bm f_1$ contain all possible sources in the problem. We notice here that
8791656fSJed Brown$\bm f_0$ represents all terms in {eq}`residual` which multiply the (possibly vector-valued) test
bcb2dfaeSJed Brownfunction $\bm v$ and $\bm f_1$ all terms which multiply its gradient $\nabla \bm v$.
For an $n$-component problem in $d$ dimensions, $\bm f_0 \in \mathbb{R}^n$ and
$\bm f_1 \in \mathbb{R}^{nd}$.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown:::{note}
bcb2dfaeSJed BrownThe notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both
bcb2dfaeSJed Brownfields and spatial dimensions while a single dot represents contraction in just one,
bcb2dfaeSJed Brownwhich should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts only over
bcb2dfaeSJed Brownfields.
bcb2dfaeSJed Brown:::
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown:::{note}
bcb2dfaeSJed BrownIn the code, the function that represents the weak form at quadrature
bcb2dfaeSJed Brownpoints is called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the
bcb2dfaeSJed Brownlibrary (in the {file}`examples/` directory), we store the term $\bm f_0$ directly
bcb2dfaeSJed Browninto `v`, and the term $\bm f_1$ directly into `dv` (which stands for
8791656fSJed Brown$\nabla \bm v$). If equation {eq}`residual` only presents a term of the
bcb2dfaeSJed Browntype $\bm f_0$, the {ref}`CeedQFunction` will only have one output argument,
8791656fSJed Brownnamely `v`. If equation {eq}`residual` also presents a term of the type
bcb2dfaeSJed Brown$\bm f_1$, then the {ref}`CeedQFunction` will have two output arguments, namely,
bcb2dfaeSJed Brown`v` and `dv`.
bcb2dfaeSJed Brown:::
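As a rough illustration of this convention, the sketch below mimics the shape of a QFunction in plain C for a 1D scalar problem with $f_0 = u$ and $f_1 = \partial u / \partial x$. The typedefs, the function name `example_residual`, and the ordering of the `in`/`out` arrays are assumptions made for this sketch, not definitions taken from the library.

```c
#include <stddef.h>

/* Stand-ins for the library's scalar and integer types (assumed for this sketch). */
typedef double Scalar;
typedef int Int;

/* Hypothetical pointwise residual for a 1D scalar problem.
 *   in[0]  = u      at Q quadrature points
 *   in[1]  = du/dx  at Q quadrature points
 *   in[2]  = quadrature weight times Jacobian determinant
 *   out[0] = v  (receives the f_0 term)
 *   out[1] = dv (receives the f_1 term)
 */
int example_residual(void *ctx, Int Q, const Scalar *const *in, Scalar *const *out) {
  (void)ctx; /* no user context is needed here */
  const Scalar *u = in[0], *du = in[1], *wdetJ = in[2];
  Scalar *v = out[0], *dv = out[1];
  for (Int i = 0; i < Q; i++) {
    v[i]  = wdetJ[i] * u[i];  /* term multiplying the test function */
    dv[i] = wdetJ[i] * du[i]; /* term multiplying its gradient      */
  }
  return 0;
}
```

A problem with only an $\bm f_0$ term would drop `out[1]` and write to `v` alone.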
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown## Finite Element Operator Decomposition
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownFinite element operators are typically defined through weak formulations of
bcb2dfaeSJed Brownpartial differential equations that involve integration over a computational
bcb2dfaeSJed Brownmesh. The required integrals are computed by splitting them as a sum over the
bcb2dfaeSJed Brownmesh elements, mapping each element to a simple *reference* element (e.g. the
bcb2dfaeSJed Brownunit square) and applying a quadrature rule in reference space.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThis sequence of operations highlights an inherent hierarchical structure
bcb2dfaeSJed Brownpresent in all finite element operators where the evaluation starts on *global
bcb2dfaeSJed Brown(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
bcb2dfaeSJed Brown*dofs on subdomains* (groups of elements), then moves to independent
bcb2dfaeSJed Brown*dofs on each element*, transitions to independent *quadrature points* in
bcb2dfaeSJed Brownreference space, performs the integration, and then goes back in reverse order
bcb2dfaeSJed Brownto global (test) degrees of freedom on the whole mesh.
bcb2dfaeSJed Brown
This is illustrated below for the simple case of a symmetric linear operator on
third-order ($Q_3$) scalar continuous ($H^1$) elements, where we use
the terms **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
the sets corresponding to the (true) degrees of freedom on the global mesh, the split
local degrees of freedom on the subdomains, the split degrees of freedom on the
mesh elements, and the values at quadrature points, respectively.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownWe refer to the operators that connect the different types of vectors as:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Subdomain restriction $\bm{P}$
0fe925dfSnbeams- Element restriction $\bm{\mathcal{E}}$
bcb2dfaeSJed Brown- Basis (Dofs-to-Qpts) evaluator $\bm{B}$
bcb2dfaeSJed Brown- Operator at quadrature points $\bm{D}$
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownMore generally, when the test and trial space differ, they get their own
0fe925dfSnbeamsversions of $\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown(fig-operator-decomp)=
bcb2dfaeSJed Brown
0fe925dfSnbeams:::{figure} ../../img/libCEED.svg
bcb2dfaeSJed BrownOperator Decomposition
bcb2dfaeSJed Brown:::
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownNote that in the case of adaptive mesh refinement (AMR), the restrictions
0fe925dfSnbeams$\bm{P}$ and $\bm{\mathcal{E}}$ will involve not just extracting sub-vectors,
bcb2dfaeSJed Brownbut evaluating values at constrained degrees of freedom through the AMR interpolation.
bcb2dfaeSJed BrownThere can also be several levels of subdomains ($\bm P_1$, $\bm P_2$,
bcb2dfaeSJed Brownetc.), and it may be convenient to split $\bm{D}$ as the product of several
bcb2dfaeSJed Brownoperators ($\bm D_1$, $\bm D_2$, etc.).
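Assuming for simplicity that the test and trial spaces coincide, so that a single $\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$ suffice, the decomposition described above can be summarized as

$$
\bm{A} = \bm{P}^T \bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}} \bm{P},
$$

which is read from right to left: a **T-vector** is mapped to an **L-vector**, then to an **E-vector**, then to a **Q-vector**, where $\bm{D}$ acts point-wise, and the transposed operators carry the result back to the global (test) degrees of freedom.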
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown### Terminology and Notation
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownVector representation/storage categories:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- True degrees of freedom/unknowns, **T-vector**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$
bcb2dfaeSJed Brown  > - this is a non-overlapping vector decomposition
bcb2dfaeSJed Brown  > - usually includes any essential (fixed) dofs.
bcb2dfaeSJed Brown  >
bcb2dfaeSJed Brown  > ```{image} ../../img/T-vector.svg
bcb2dfaeSJed Brown  > ```
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - each unknown $i$ has exactly one copy on each processor that owns an
bcb2dfaeSJed Brown  >   element containing $i$
bcb2dfaeSJed Brown  > - this is an overlapping vector decomposition with overlaps only across
bcb2dfaeSJed Brown  >   different processors---there is no duplication of unknowns on a single
bcb2dfaeSJed Brown  >   processor
bcb2dfaeSJed Brown  > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have
bcb2dfaeSJed Brown  >   more than one copy, on different processors.
bcb2dfaeSJed Brown  >
bcb2dfaeSJed Brown  > ```{image} ../../img/L-vector.svg
bcb2dfaeSJed Brown  > ```
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Per element decomposition, **E-vector**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - each unknown $i$ has as many copies as the number of elements that contain
bcb2dfaeSJed Brown  >   $i$
bcb2dfaeSJed Brown  > - usually, the copies of the unknowns are grouped by the element they belong
bcb2dfaeSJed Brown  >   to.
bcb2dfaeSJed Brown  >
bcb2dfaeSJed Brown  > ```{image} ../../img/E-vector.svg
bcb2dfaeSJed Brown  > ```
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- In the case of AMR with hanging nodes (giving rise to hanging dofs):
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - the **L-vector** is enhanced with the hanging/dependent dofs
bcb2dfaeSJed Brown  > - the additional hanging/dependent dofs are duplicated when they are shared
bcb2dfaeSJed Brown  >   by multiple processors
bcb2dfaeSJed Brown  > - this way, an **E-vector** can be derived from an **L-vector** without any
bcb2dfaeSJed Brown  >   communications and without additional computations to derive the dependent
bcb2dfaeSJed Brown  >   dofs
bcb2dfaeSJed Brown  > - in other words, an entry in an **E-vector** is obtained by copying an entry
bcb2dfaeSJed Brown  >   from the corresponding **L-vector**, optionally switching the sign of the
  >   entry (for $H(\mathrm{div})$- and $H(\mathrm{curl})$-conforming spaces).
bcb2dfaeSJed Brown  >
bcb2dfaeSJed Brown  > ```{image} ../../img/L-vector-AMR.svg
bcb2dfaeSJed Brown  > ```
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- In the case of variable order spaces:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - the dependent dofs (usually on the higher-order side of a face/edge) can
bcb2dfaeSJed Brown  >   be treated just like the hanging/dependent dofs case.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Quadrature point vector, **Q-vector**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - this is similar to **E-vector** where instead of dofs, the vector represents
bcb2dfaeSJed Brown  >   values at quadrature points, grouped by element.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- In many cases it is useful to distinguish two types of vectors:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector**
bcb2dfaeSJed Brown  > - here X can be any of the T, L, E, or Q categories
bcb2dfaeSJed Brown  > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector**
  > - the solution vector is a **T-vector**, and the RHS vector is a **T'-vector**
bcb2dfaeSJed Brown  > - using the parallel prolongation operator, one can map the solution
bcb2dfaeSJed Brown  >   **T-vector** to a solution **L-vector**, etc.
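The relation between an **L-vector** and an **E-vector** can be sketched in a few lines of C. This is a minimal illustration under our own assumptions (the function names and the flat `ind` index array are hypothetical and not part of the libCEED API); it ignores sign changes and hanging-node constraints.

```c
#include <stddef.h>

/* Gather: copy L-vector entries into element-grouped E-vector storage.
 * ind has num_elem * elem_size entries; ind[e * elem_size + j] is the
 * L-vector index of local node j of element e. */
void restrict_gather(size_t num_elem, size_t elem_size, const size_t *ind,
                     const double *l_vec, double *e_vec) {
  for (size_t k = 0; k < num_elem * elem_size; k++) e_vec[k] = l_vec[ind[k]];
}

/* Transpose (scatter-add): sum element contributions back into the L-vector.
 * The caller is expected to zero l_vec beforehand. */
void restrict_scatter_add(size_t num_elem, size_t elem_size, const size_t *ind,
                          const double *e_vec, double *l_vec) {
  for (size_t k = 0; k < num_elem * elem_size; k++) l_vec[ind[k]] += e_vec[k];
}
```

For two linear elements sharing one node (`ind = {0, 1, 1, 2}`), the gather duplicates the shared entry and the scatter-add sums the two element contributions into it.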
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOperator representation/storage/action categories:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Full true-dof parallel assembly, **TA**, or **A**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - ParCSR or similar format
bcb2dfaeSJed Brown  > - the T in TA indicates that the data format represents an operator from a
bcb2dfaeSJed Brown  >   **T-vector** to a **T'-vector**.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Full local assembly, **LA**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - CSR matrix on each rank
bcb2dfaeSJed Brown  > - the parallel prolongation operator, $\bm{P}$, (and its transpose) should use
bcb2dfaeSJed Brown  >   optimized matrix-free action
bcb2dfaeSJed Brown  > - note that $\bm{P}$ is the operator mapping T-vectors to L-vectors.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Element matrix assembly, **EA**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - each element matrix is stored as a dense matrix
bcb2dfaeSJed Brown  > - optimized element and parallel prolongation operators
bcb2dfaeSJed Brown  > - note that the element prolongation operator is the mapping from an
bcb2dfaeSJed Brown  >   **L-vector** to an **E-vector**.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- Quadrature-point/partial assembly, **QA** or **PA**:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown  > - precompute and store $w\det(J)$ at all quadrature points in all mesh elements
bcb2dfaeSJed Brown  > - the stored data can be viewed as a **Q-vector**.
bcb2dfaeSJed Brown
- Unassembled option, **UA** or **U**:

  > - no assembly step
  > - the action directly uses the mesh node coordinates and assumes a specific
  >   form of the coefficient, e.g. constant, piecewise-constant, or given as a
  >   **Q-vector** (Q-coefficient).
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown### Partial Assembly
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownSince the global operator $\bm{A}$ is just a series of variational restrictions
0fe925dfSnbeamswith $\bm{B}$, $\bm{\mathcal{E}}$ and $\bm{P}$, starting from its
bcb2dfaeSJed Brownpoint-wise kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be
bcb2dfaeSJed Brownperformed by evaluating and storing some of the innermost variational restriction
bcb2dfaeSJed Brownmatrices, and applying the rest of the operators "on-the-fly". For example, one can
bcb2dfaeSJed Browncompute and store a global matrix on **T-vector** level. Alternatively, one can compute
bcb2dfaeSJed Brownand store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
bcb2dfaeSJed Brownperform the action of $\bm{A}$ using matvecs with $\bm{P}$ or
0fe925dfSnbeams$\bm{P}$ and $\bm{\mathcal{E}}$. While these options are natural for
bcb2dfaeSJed Brownlow-order discretizations, they are not a good fit for high-order methods due to
bcb2dfaeSJed Brownthe amount of FLOPs needed for their evaluation, as well as the memory transfer
bcb2dfaeSJed Brownneeded for a matvec.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOur focus in libCEED, instead, is on **partial assembly**, where we compute and
bcb2dfaeSJed Brownstore only $\bm{D}$ (or portions of it) and evaluate the actions of
0fe925dfSnbeams$\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$ on-the-fly.
bcb2dfaeSJed BrownCritically for performance, we take advantage of the tensor-product structure of the
bcb2dfaeSJed Browndegrees of freedom and quadrature points on *quad* and *hex* elements to perform the
bcb2dfaeSJed Brownaction of $\bm{B}$ without storing it as a matrix.
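To illustrate how the tensor-product structure is exploited, here is a minimal sketch of interpolation on a *quad* element, applying the 1D matrix once per direction instead of forming the full 2D matrix. The function name, argument order, and row-major storage are assumptions of this sketch and do not describe a particular libCEED backend.

```c
#include <stddef.h>

/* Apply the 2D interpolation B = B1d (x) B1d without forming it.
 * b1d : Q x P row-major 1D interpolation matrix
 * u   : P x P nodal values, u[i * P + j]
 * tmp : Q x P workspace
 * out : Q x Q values at quadrature points, out[a * Q + b]
 */
void interp_2d_sum_factorized(size_t P, size_t Q, const double *b1d,
                              const double *u, double *tmp, double *out) {
  /* Contract the first index: tmp[a][j] = sum_i b1d[a][i] * u[i][j]. */
  for (size_t a = 0; a < Q; a++)
    for (size_t j = 0; j < P; j++) {
      double s = 0.0;
      for (size_t i = 0; i < P; i++) s += b1d[a * P + i] * u[i * P + j];
      tmp[a * P + j] = s;
    }
  /* Contract the second index: out[a][b] = sum_j b1d[b][j] * tmp[a][j]. */
  for (size_t a = 0; a < Q; a++)
    for (size_t b = 0; b < Q; b++) {
      double s = 0.0;
      for (size_t j = 0; j < P; j++) s += b1d[b * P + j] * tmp[a * P + j];
      out[a * Q + b] = s;
    }
}
```

The two passes cost $Q P^2 + Q^2 P$ multiply-adds, compared with $Q^2 P^2$ for a dense matrix-vector product with the assembled 2D matrix.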
bcb2dfaeSJed Brown
Implemented properly, the partial assembly algorithm requires an optimal amount of
memory transfers (with respect to the polynomial order) and near-optimal FLOPs
for operator evaluation. It consists of an operator *setup* phase that
evaluates and stores $\bm{D}$ and an operator *apply* (evaluation) phase that
computes the action of $\bm{A}$ on an input vector. When desired, the setup
bcb2dfaeSJed Brownphase may be done as a side-effect of evaluating a different operator, such as a
bcb2dfaeSJed Brownnonlinear residual. The relative costs of the setup and apply phases are
bcb2dfaeSJed Browndifferent depending on the physics being expressed and the representation of
bcb2dfaeSJed Brown$\bm{D}$.
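To make the storage argument concrete, the following sketch counts stored values per element for a scalar field on a tensor-product element with `P` nodes and `Q` quadrature points per dimension. It assumes one stored scalar per quadrature point, as in the $w\det(J)$ case; the helper names are hypothetical.

```c
#include <stddef.h>

/* Integer power base^exp for small arguments. */
size_t ipow(size_t base, unsigned exp) {
  size_t r = 1;
  while (exp--) r *= base;
  return r;
}

/* Entries in one dense element matrix: (P^dim) x (P^dim). */
size_t element_matrix_entries(size_t P, unsigned dim) {
  size_t n = ipow(P, dim);
  return n * n;
}

/* Stored values per element with one scalar per quadrature point: Q^dim. */
size_t qdata_entries(size_t Q, unsigned dim) { return ipow(Q, dim); }
```

With `P = 5` and `Q = 8` in three dimensions, this gives 15625 entries for a dense element matrix against 512 stored quadrature values.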
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown### Parallel Decomposition
bcb2dfaeSJed Brown
After the application of each of the first three transition operators,
$\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$, the operator evaluation
is decoupled on their ranges, so $\bm{P}$, $\bm{\mathcal{E}}$ and
$\bm{B}$ allow us to "zoom in" to the subdomain, element and quadrature point
levels, ignoring the coupling at higher levels.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThus, a natural mapping of $\bm{A}$ on a parallel computer is to split the
bcb2dfaeSJed Brown**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
bcb2dfaeSJed Brownused for sparse matrices), and then split the rest of the vector types over
bcb2dfaeSJed Browncomputational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
bcb2dfaeSJed Brownthe diagram above.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOne of the advantages of the decomposition perspective in these settings is that
0fe925dfSnbeamsthe operators $\bm{P}$, $\bm{\mathcal{E}}$, $\bm{B}$ and
bcb2dfaeSJed Brown$\bm{D}$ clearly separate the MPI parallelism
bcb2dfaeSJed Brownin the operator ($\bm{P}$) from the unstructured mesh topology
0fe925dfSnbeams($\bm{\mathcal{E}}$), the choice of the finite element space/basis ($\bm{B}$)
bcb2dfaeSJed Brownand the geometry and point-wise physics $\bm{D}$. These components also
bcb2dfaeSJed Brownnaturally fall in different classes of numerical algorithms -- parallel (multi-device)
bcb2dfaeSJed Brownlinear algebra for $\bm{P}$, sparse (on-device) linear algebra for
0fe925dfSnbeams$\bm{\mathcal{E}}$, dense/structured linear algebra (tensor contractions) for
bcb2dfaeSJed Brown$\bm{B}$ and parallel point-wise evaluations for $\bm{D}$.
bcb2dfaeSJed Brown
Currently in libCEED, it is assumed that the host application manages the global
**T-vectors** and the required communications among devices (which are generally
on different compute nodes) with $\bm{P}$. Our API is thus focused on the
bcb2dfaeSJed Brown**L-vector** level, where the logical devices, which in the library are
bcb2dfaeSJed Brownrepresented by the {ref}`Ceed` object, are independent. Each MPI rank can use one or
bcb2dfaeSJed Brownmore {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or more physical
bcb2dfaeSJed Browndevices, as long as libCEED backends support such configurations. The idea is
bcb2dfaeSJed Brownthat every MPI rank can use any logical device it is assigned at runtime. For
bcb2dfaeSJed Brownexample, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI
bcb2dfaeSJed Brownranks (each using a single {ref}`Ceed` object): 2 ranks using 1 CPU socket each, and
bcb2dfaeSJed Brown4 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node
bcb2dfaeSJed Brownand use 5 {ref}`Ceed` objects: 1 managing all CPU cores on the 2 sockets and 4
bcb2dfaeSJed Brownmanaging 1 GPU each. The communications among the devices, e.g. required for
bcb2dfaeSJed Brownapplying the action of $\bm{P}$, are currently out of scope of libCEED. The
bcb2dfaeSJed Browninterface is non-blocking for all operations involving more than O(1) data,
bcb2dfaeSJed Brownallowing operations performed on a coprocessor or worker threads to overlap with
bcb2dfaeSJed Brownoperations on the host.
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown## API Description
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThe libCEED API takes an algebraic approach, where the user essentially
describes in the *frontend* the operators $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$ and the library
bcb2dfaeSJed Brownprovides *backend* implementations and coordinates their action to the original
bcb2dfaeSJed Brownoperator on **L-vector** level (i.e. independently on each device / MPI task).
*02076a18SnbeamsThis is visualized in the schematic below; "active" and "passive" inputs/outputs
*02076a18Snbeamswill be discussed in more detail later.
*02076a18Snbeams
*02076a18Snbeams(fig-operator-schematic)=
*02076a18Snbeams
*02076a18Snbeams:::{figure} ../../img/libceed_schematic.svg
Flow of data through vector types inside libCEED Operators, through backend implementations
of $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$
*02076a18Snbeams:::
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOne of the advantages of this purely algebraic description is that it already
bcb2dfaeSJed Brownincludes all the finite element information, so the backends can operate on
bcb2dfaeSJed Brownlinear algebra level without explicit finite element code. The frontend
bcb2dfaeSJed Browndescription is general enough to support a wide variety of finite element
algorithms, as well as some other types of algorithms such as spectral finite
differences. The separation of the front- and backends enables applications to
bcb2dfaeSJed Browneasily switch/try different backends. It also enables backend developers to
bcb2dfaeSJed Brownimpact many applications from a single implementation.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOur long-term vision is to include a variety of backend implementations in
bcb2dfaeSJed BrownlibCEED, ranging from reference kernels to highly optimized kernels targeting
bcb2dfaeSJed Brownspecific devices (e.g. GPUs) or specific polynomial orders. A simple reference
bcb2dfaeSJed Brownbackend implementation is provided in the file
bcb2dfaeSJed Brown[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c).
bcb2dfaeSJed Brown
52006392Snbeams
bcb2dfaeSJed BrownOn the frontend, the mapping between the decomposition concepts and the code
bcb2dfaeSJed Brownimplementation is as follows:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown- **L-**, **E-** and **Q-vector** are represented as variables of type {ref}`CeedVector`.
bcb2dfaeSJed Brown  (A backend may choose to operate incrementally without forming explicit **E-** or
bcb2dfaeSJed Brown  **Q-vectors**.)
- $\bm{\mathcal{E}}$ is represented as a variable of type {ref}`CeedElemRestriction`.
- $\bm{B}$ is represented as a variable of type {ref}`CeedBasis`.
- the action of $\bm{D}$ is represented as a variable of type {ref}`CeedQFunction`.
- the overall operator $\bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}}$
  is represented as a variable of type
  {ref}`CeedOperator` and its action is accessible through {c:func}`CeedOperatorApply()`.
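Before turning to the library interface, it may help to see the composition $\bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}}$ written out by hand. The sketch below is plain C with no libCEED calls, for a 1D mass operator with linear elements and a 2-point Gauss-Legendre rule; all names (`mass_setup`, `mass_apply`, `qdata`) are hypothetical and chosen for this illustration only.

```c
#include <stddef.h>

enum { P = 2, Q = 2 }; /* nodes and quadrature points per element */

/* Setup phase: store D as w_q * det(J) for every quadrature point.
 * x holds the num_elem + 1 node coordinates of the 1D mesh. */
void mass_setup(size_t num_elem, const double *x, double *qdata) {
  const double w[Q] = {1.0, 1.0}; /* 2-point Gauss-Legendre weights on [-1, 1] */
  for (size_t e = 0; e < num_elem; e++) {
    double detJ = 0.5 * (x[e + 1] - x[e]);
    for (size_t q = 0; q < Q; q++) qdata[e * Q + q] = w[q] * detJ;
  }
}

/* Apply phase: v = E^T B^T D B E u, one element at a time. */
void mass_apply(size_t num_elem, const double *qdata, const double *u, double *v) {
  const double xi = 0.57735026918962576; /* 1 / sqrt(3), Gauss point on [-1, 1] */
  const double B[Q][P] = {{0.5 * (1 + xi), 0.5 * (1 - xi)},  /* basis at -xi */
                          {0.5 * (1 - xi), 0.5 * (1 + xi)}}; /* basis at +xi */
  for (size_t i = 0; i <= num_elem; i++) v[i] = 0.0;
  for (size_t e = 0; e < num_elem; e++) {
    double ue[P] = {u[e], u[e + 1]}; /* E: L-vector -> E-vector */
    double ve[P] = {0.0, 0.0};
    for (size_t q = 0; q < Q; q++) {
      double uq = B[q][0] * ue[0] + B[q][1] * ue[1]; /* B: E-vector -> Q-vector */
      double vq = qdata[e * Q + q] * uq;             /* D: point-wise scaling   */
      ve[0] += B[q][0] * vq;                         /* B^T: back to the dofs   */
      ve[1] += B[q][1] * vq;
    }
    v[e] += ve[0]; /* E^T: scatter-add into the L-vector */
    v[e + 1] += ve[1];
  }
}
```

Applying this operator to a vector of ones on a mesh of the unit interval and summing the entries of `v` recovers the length of the domain, which is a convenient sanity check for a mass operator.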
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownTo clarify these concepts and illustrate how they are combined in the API,
bcb2dfaeSJed Brownconsider the implementation of the action of a simple 1D mass matrix
bcb2dfaeSJed Brown(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)).
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:linenos: true
bcb2dfaeSJed Brown```
*02076a18SnbeamsIn the following figure, we specialize the schematic used above for general operators so that
*02076a18Snbeamsit corresponds to the mass matrix as implemented in the sample code. Notations marked as "L[#]"
*02076a18Snbeamsdenote the line number in the code where that object or evaluation mode is set for the operator.
*02076a18Snbeams
*02076a18Snbeams(fig-operator-schematic-mass)=
*02076a18Snbeams
*02076a18Snbeams:::{figure} ../../img/libceed_schematic_op_mass.svg
Specific combination of $\bm{\mathcal{E}}$, $\bm{B}$, $\bm{D}$, and input/output vectors
*02076a18Snbeamscorresponding to t500-operator
*02076a18Snbeams:::
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThe constructor
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:end-at: CeedInit
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:start-at: CeedInit
bcb2dfaeSJed Brown```
bcb2dfaeSJed Brown
bcb2dfaeSJed Browncreates a logical device `ceed` on the specified *resource*, which could also be
bcb2dfaeSJed Browna coprocessor such as `"/nvidia/0"`. There can be any number of such devices,
bcb2dfaeSJed Brownincluding multiple logical devices driving the same resource (though performance
bcb2dfaeSJed Brownmay suffer in case of oversubscription). The resource is used to locate a
bcb2dfaeSJed Brownsuitable backend which will have discretion over the implementations of all
bcb2dfaeSJed Brownobjects created with this logical device.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThe `setup` routine above computes and stores $\bm{D}$, in this case a
bcb2dfaeSJed Brownscalar value in each quadrature point, while `mass` uses these saved values to perform
bcb2dfaeSJed Brownthe action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction`
bcb2dfaeSJed Brownvariables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()` calls:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:end-before: //! [QFunction Create]
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:start-after: //! [QFunction Create]
bcb2dfaeSJed Brown```
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownA {ref}`CeedQFunction` performs independent operations at each quadrature point and
bcb2dfaeSJed Brownthe interface is intended to facilitate vectorization.  The second argument is
bcb2dfaeSJed Brownan expected vector length. If greater than 1, the caller must ensure that the
bcb2dfaeSJed Brownnumber of quadrature points `Q` is divisible by the vector length. This is
bcb2dfaeSJed Brownoften satisfied automatically due to the element size or by batching elements
bcb2dfaeSJed Browntogether to facilitate vectorization in other stages, and can always be ensured
bcb2dfaeSJed Brownby padding.
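The padding mentioned above amounts to rounding the number of quadrature points up to the next multiple of the vector length. A minimal sketch, using a hypothetical helper that is not part of libCEED:

```c
#include <stddef.h>

/* Smallest multiple of vec_length that is >= num_qpts (vec_length must be > 0). */
size_t padded_qpts(size_t num_qpts, size_t vec_length) {
  return (num_qpts + vec_length - 1) / vec_length * vec_length;
}
```

For instance, 27 quadrature points with a vector length of 8 would be padded to 32, while 64 points need no padding.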
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownIn addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction`
bcb2dfaeSJed Brownconstructors take a string representation specifying where the source for the
bcb2dfaeSJed Brownimplementation is found. This is used by backends that support Just-In-Time
(JIT) compilation (e.g., CUDA and OCCA) to compile for coprocessors.
bcb2dfaeSJed BrownFor full support across all backends, these {ref}`CeedQFunction` source files must only contain constructs mutually supported by C99, C++11, and CUDA.
bcb2dfaeSJed BrownFor example, explicit type casting of void pointers and explicit use of compatible arguments for {code}`math` library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA` macro.
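As an illustration of the explicit-cast requirement, the sketch below reads a user context through a `void *` argument. The struct, the function name, and the use of plain `int` and `double` in place of the library's typedefs are assumptions of this sketch.

```c
/* Hypothetical user context carrying a scaling coefficient. */
struct ScaleContext {
  double alpha;
};

/* Point-wise scaling out[0][i] = alpha * in[0][i].
 * The explicit cast from void * is optional in C but required in C++ and CUDA,
 * so writing it keeps a single source file valid for all three. */
int scale_qfunction(void *ctx, int Q, const double *const *in, double *const *out) {
  const struct ScaleContext *context = (const struct ScaleContext *)ctx;
  for (int i = 0; i < Q; i++) out[0][i] = context->alpha * in[0][i];
  return 0;
}
```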
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownDifferent input and output fields are added individually, specifying the field
bcb2dfaeSJed Brownname, size of the field, and evaluation mode.
bcb2dfaeSJed Brown
The size of the field is provided by a combination of the number of components
and the effect of any basis evaluations.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThe evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_INTERP`
bcb2dfaeSJed Brownfor both input and output fields indicates that the mass operator only contains terms of
bcb2dfaeSJed Brownthe form
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u)
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown
bcb2dfaeSJed Brownwhere $v$ are test functions (see the {ref}`theoretical-framework`).
bcb2dfaeSJed BrownMore general operators, such as those of the form
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u)
bcb2dfaeSJed Brown$$
bcb2dfaeSJed Brown
bcb2dfaeSJed Browncan be expressed.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownFor fields with derivatives, such as with the basis evaluation mode
bcb2dfaeSJed Brown(see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the
bcb2dfaeSJed Brownfield needs to reflect both the number of components and the geometric dimension.
A 3-dimensional gradient on four components would therefore mean the field has a size
of 12.
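The rule for field sizes can be written as a one-line helper. This is only a sketch of the arithmetic described here; the function and its `with_gradient` flag are hypothetical stand-ins for the evaluation mode.

```c
/* Size of a field per quadrature point.
 * with_gradient != 0 corresponds to a gradient evaluation,
 * with_gradient == 0 to plain interpolation. */
int field_size(int num_comp, int dim, int with_gradient) {
  return with_gradient ? num_comp * dim : num_comp;
}
```

Calling `field_size(4, 3, 1)` gives 12, and `field_size(4, 3, 0)` gives 4 for the same four components under interpolation.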
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownThe $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field,
bcb2dfaeSJed Brown`basis_u`, are defined in the calls to the function {c:func}`CeedBasisCreateTensorH1Lagrange()`.
bcb2dfaeSJed BrownIn this example, both the mesh and the unknown field use $H^1$ Lagrange finite
bcb2dfaeSJed Brownelements of order 1 and 4 respectively (the `P` argument represents the number of 1D
bcb2dfaeSJed Browndegrees of freedom on each element). Both basis operators use the same integration rule,
bcb2dfaeSJed Brownwhich is Gauss-Legendre with 8 points (the `Q` argument).
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:end-before: //! [Basis Create]
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:start-after: //! [Basis Create]
bcb2dfaeSJed Brown```
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownOther elements with this structure can be specified in terms of the `Q×P`
bcb2dfaeSJed Brownmatrices that evaluate values and gradients at quadrature points in one
bcb2dfaeSJed Browndimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have tensor
bcb2dfaeSJed Brownproduct structure, such as symmetric elements on simplices, will be created
bcb2dfaeSJed Brownusing different constructors.
bcb2dfaeSJed Brown
The $\bm{\mathcal{E}}$ operators for the mesh nodes, `elem_restr_x`, and the unknown field,
`elem_restr_u`, are specified in the calls to {c:func}`CeedElemRestrictionCreate()`. Both of these
directly specify the dof indices for each element in the `ind_x` and `ind_u`
arrays:
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:end-before: //! [ElemRestr Create]
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:start-after: //! [ElemRestr Create]
bcb2dfaeSJed Brown```
bcb2dfaeSJed Brown
bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
bcb2dfaeSJed Brown:end-before: //! [ElemRestrU Create]
bcb2dfaeSJed Brown:language: c
bcb2dfaeSJed Brown:start-after: //! [ElemRestrU Create]
bcb2dfaeSJed Brown```
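For a 1D mesh in which neighboring elements share their endpoint node, such index arrays can be filled with a simple double loop. The helper below is a sketch under that assumption; its name and signature are hypothetical.

```c
#include <stddef.h>

/* Fill ind (num_elem * P entries) for a 1D mesh of num_elem elements with P
 * nodes each; element e starts at global node e * (P - 1). */
void build_indices_1d(size_t num_elem, size_t P, int *ind) {
  for (size_t e = 0; e < num_elem; e++)
    for (size_t j = 0; j < P; j++) ind[e * P + j] = (int)(e * (P - 1) + j);
}
```

With three linear elements (`P = 2`) this produces `{0, 1, 1, 2, 2, 3}`, and the corresponding **L-vector** has `num_elem * (P - 1) + 1 = 4` entries.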
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownIf the user has arrays available on a device, they can be provided using
bcb2dfaeSJed Brown`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all
bcb2dfaeSJed Browncontexts that involve problem-sized data.
bcb2dfaeSJed Brown
bcb2dfaeSJed BrownFor discontinuous Galerkin and for applications such as Nek5000 that only
bcb2dfaeSJed Brownexplicitly store **E-vectors** (inter-element continuity has been subsumed by
0fe925dfSnbeamsthe parallel restriction $\bm{P}$), the element restriction $\bm{\mathcal{E}}$
bcb2dfaeSJed Brownis the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used instead.
0fe925dfSnbeamsWe plan to support other structured representations of $\bm{\mathcal{E}}$ which will
bcb2dfaeSJed Brownbe added according to demand.
0fe925dfSnbeamsThere are two common approaches for supporting non-conforming elements: applying the node constraints via $\bm P$ so that the **L-vector** can be processed uniformly and applying the constraints via $\bm{\mathcal{E}}$ so that the **E-vector** is uniform.
bcb2dfaeSJed BrownThe former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes.
bcb2dfaeSJed Brown
These operations, $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$,
are combined with a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added
separately with a matching field name, basis ($\bm{B}$), element restriction
($\bm{\mathcal{E}}$), and **L-vector**. The flag
bcb2dfaeSJed Brown`CEED_VECTOR_ACTIVE` indicates that the vector corresponding to that field will
bcb2dfaeSJed Brownbe provided to the operator when {c:func}`CeedOperatorApply()` is called. Otherwise the
bcb2dfaeSJed Browninput/output will be read from/written to the specified **L-vector**.
bcb2dfaeSJed Brown
With partial assembly, we first perform a setup stage where $\bm{D}$ is evaluated
and stored. This is accomplished by the operator `op_setup` and its application
to `X`, the nodes of the mesh (these are needed to compute Jacobians at
quadrature points). Note that the corresponding {c:func}`CeedOperatorApply()` has no basis
evaluation on the output, as the quadrature data is not needed at the dofs:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Create]
:language: c
:start-after: //! [Setup Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Set]
:language: c
:start-after: //! [Setup Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Apply]
:language: c
:start-after: //! [Setup Apply]
```

The action of the operator is then represented by the operator `op_mass` and its
application via {c:func}`CeedOperatorApply()` to the input **L-vector** `U`, with output in `V`:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Create]
:language: c
:start-after: //! [Operator Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Set]
:language: c
:start-after: //! [Operator Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Apply]
:language: c
:start-after: //! [Operator Apply]
```

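To make the two stages concrete without the library, here is a minimal plain-C sketch of partial assembly for a 1D mass operator with linear elements and 2-point Gauss quadrature. It does not call libCEED: the function names `mass_setup` and `mass_apply`, the hard-coded basis and connectivity, and the array layouts are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

enum { NODES_PER_ELEM = 2, QPTS_PER_ELEM = 2 };

// Setup stage: evaluate and store D. For each element e with end points
// x[e] and x[e + 1], store q_data = (quadrature weight) * (Jacobian).
void mass_setup(int n_elem, const double *x, double *q_data) {
  const double w[QPTS_PER_ELEM] = {1.0, 1.0};  // 2-point Gauss weights on [-1, 1]
  assert(n_elem > 0);
  for (int e = 0; e < n_elem; e++) {
    const double jac = 0.5 * (x[e + 1] - x[e]);  // dx/dxi on the reference element
    for (int q = 0; q < QPTS_PER_ELEM; q++) q_data[e * QPTS_PER_ELEM + q] = w[q] * jac;
  }
}

// Apply stage: v = E^T B^T D B E u, reusing the stored q_data.
void mass_apply(int n_elem, const double *q_data, const double *u, double *v) {
  const double g = 0.57735026918962576;  // 1 / sqrt(3), the Gauss point offset
  // B[q][i]: linear shape function i evaluated at quadrature point q
  const double B[QPTS_PER_ELEM][NODES_PER_ELEM] = {
      {0.5 * (1.0 + g), 0.5 * (1.0 - g)},
      {0.5 * (1.0 - g), 0.5 * (1.0 + g)},
  };
  assert(n_elem > 0);
  for (int i = 0; i <= n_elem; i++) v[i] = 0.0;
  for (int e = 0; e < n_elem; e++) {
    const double u_e[NODES_PER_ELEM] = {u[e], u[e + 1]};  // E: L-vector to E-vector
    double       v_e[NODES_PER_ELEM] = {0.0, 0.0};
    for (int q = 0; q < QPTS_PER_ELEM; q++) {
      const double u_q = B[q][0] * u_e[0] + B[q][1] * u_e[1];  // B: interpolate to quadrature point
      const double v_q = q_data[e * QPTS_PER_ELEM + q] * u_q;  // D: pointwise scaling
      for (int i = 0; i < NODES_PER_ELEM; i++) v_e[i] += B[q][i] * v_q;  // B^T
    }
    v[e] += v_e[0];  // E^T: scatter-add back to the L-vector
    v[e + 1] += v_e[1];
  }
}
```

Applying the sketch to a vector of ones on a mesh of the unit interval should give entries that sum to the domain length, which is a quick sanity check for the mass operator.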
A number of function calls in the interface, such as {c:func}`CeedOperatorApply()`, are
intended to support asynchronous execution via their last argument,
`CeedRequest*`. The specific (pointer) value used in the above example,
`CEED_REQUEST_IMMEDIATE`, expresses the user's request for the
operation to complete before returning from the function call, i.e., to make sure
that the result of the operation is available in the output parameters
immediately after the call. For a truly asynchronous call, one needs to provide
the address of a user-defined variable. Such a variable can be used later to
explicitly wait for the completion of the operation.

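The calling pattern can be pictured with a toy model in plain C. This is only a sketch of the idea of a sentinel "immediate" request versus a user-owned request variable; it is not how libCEED implements `CeedRequest`, and every name here (`ToyRequest`, `TOY_REQUEST_IMMEDIATE`, `toy_apply`, `toy_request_wait`) is hypothetical. The toy defers the work until the wait call, which stands in for real asynchronous execution.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical request record: remembers deferred work until it is waited on.
typedef struct {
  const double *in;
  double       *out;
  size_t        n;
  int           pending;
} ToyRequest;

// Sentinel address meaning "finish before returning" (toy analogue only).
static ToyRequest toy_immediate_storage;
#define TOY_REQUEST_IMMEDIATE (&toy_immediate_storage)

// The "operation" of this toy: out = 2 * in
static void toy_do_work(const double *in, double *out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = 2.0 * in[i];
}

// With the sentinel, results are ready on return; otherwise the work is
// recorded in the caller's request variable and completed by toy_request_wait().
void toy_apply(const double *in, double *out, size_t n, ToyRequest *request) {
  assert(request != NULL);
  if (request == TOY_REQUEST_IMMEDIATE) {
    toy_do_work(in, out, n);
    return;
  }
  request->in      = in;
  request->out     = out;
  request->n       = n;
  request->pending = 1;
}

// Block until the operation tied to this request has completed.
void toy_request_wait(ToyRequest *request) {
  assert(request != NULL);
  if (request != TOY_REQUEST_IMMEDIATE && request->pending) {
    toy_do_work(request->in, request->out, request->n);
    request->pending = 0;
  }
}
```

In this model the output array must not be read between `toy_apply` and `toy_request_wait` when a user-owned request is passed, which mirrors the contract described above.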
## Gallery of QFunctions

libCEED provides a gallery of built-in {ref}`CeedQFunction`s in the {file}`gallery/` directory.
The available QFunctions are those associated with the mass, Laplacian, and
identity operators. To illustrate how the user can declare a {ref}`CeedQFunction`
via the gallery of available QFunctions, consider the selection of the
{ref}`CeedQFunction` associated with a simple 1D mass matrix
(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)).

```{literalinclude} ../../../tests/t410-qfunction.c
:language: c
:linenos: true
```

## Interface Principles and Evolution

libCEED is intended to be extensible via backends that are packaged with the
library or packaged separately (possibly as a binary containing proprietary
code). Backends are registered by calling

```{literalinclude} ../../../backends/ref/ceed-ref.c
:end-before: //! [Register]
:language: c
:start-after: //! [Register]
```

typically in a library initializer or "constructor" that runs automatically.
`CeedInit` uses the registered prefix to find an appropriate backend for the resource.

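The idea of looking up a backend by resource prefix can be sketched with a toy registry in plain C. This is not libCEED's registration code: the names `toy_register` and `toy_find_backend`, the fixed-size table, and the matching rule (longest common leading substring, with the lower priority value winning ties) are all assumptions chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BACKENDS 8

// Hypothetical registry entry: a resource prefix and a tie-breaking priority.
typedef struct {
  const char  *prefix;
  unsigned int priority;  // lower value is preferred in this sketch
} ToyBackend;

static ToyBackend toy_backends[TOY_MAX_BACKENDS];
static size_t     toy_num_backends = 0;

// Add a backend to the table; returns 0 on success, 1 if the table is full.
int toy_register(const char *prefix, unsigned int priority) {
  assert(prefix != NULL);
  if (toy_num_backends >= TOY_MAX_BACKENDS) return 1;
  toy_backends[toy_num_backends].prefix   = prefix;
  toy_backends[toy_num_backends].priority = priority;
  toy_num_backends++;
  return 0;
}

// Number of leading characters shared by a and b.
static size_t toy_common_len(const char *a, const char *b) {
  size_t n = 0;
  while (a[n] != '\0' && a[n] == b[n]) n++;
  return n;
}

// Return the prefix of the best-matching backend, or NULL if nothing matches.
const char *toy_find_backend(const char *resource) {
  const ToyBackend *best     = NULL;
  size_t            best_len = 0;
  assert(resource != NULL);
  for (size_t i = 0; i < toy_num_backends; i++) {
    const ToyBackend *b = &toy_backends[i];
    const size_t      n = toy_common_len(b->prefix, resource);
    if (n > best_len || (best != NULL && n == best_len && b->priority < best->priority)) {
      best     = b;
      best_len = n;
    }
  }
  return best ? best->prefix : NULL;
}
```

With such a rule, a short resource string like `/cpu/self` can resolve to one of several registered backends that share that leading path, and the priority decides which one is picked.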
Source (API) and binary (ABI) stability are important to libCEED. Prior to
reaching version 1.0, libCEED does not implement strict [semantic versioning](https://semver.org) across the entire interface. However, user code,
including libraries of {ref}`CeedQFunction`s, should be source and binary
compatible moving from 0.x.y to any later release 0.x.z. We have less experience
with external packaging of backends and do not presently guarantee source or
binary stability, but we intend to define stability guarantees for libCEED 1.0.
We'd love to talk with you if you're interested in packaging backends
externally, and will work with you on a practical stability policy.