# Interface Concepts

This page provides a brief description of the theoretical foundations and the
practical implementation of the libCEED library.

(theoretical-framework)=

## Theoretical Framework

In finite element formulations, the weak form of a Partial Differential Equation
(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results
are composed into a larger system of equations that models the entire problem on
the global domain $\Omega$. In particular, when high-order finite elements or
spectral elements are used, the resulting sparse matrix representation of the global
operator is computationally expensive, with respect to both the memory transfer and
floating point operations needed for its evaluation. libCEED provides an interface
for matrix-free operator description that enables efficient evaluation on a variety
of computational device types (selectable at run time). We present here the notation
and the mathematical formulation adopted in libCEED.

We start by considering the discrete residual $F(u)=0$ formulation
in weak form. We first define the $L^2$ inner product between real-valued functions

$$
\langle v, u \rangle = \int_\Omega v u \, d\bm{x},
$$

where $\bm{x} \in \mathbb{R}^d \supset \Omega$.

We want to find $u$ in a suitable space $V_D$,
such that

$$
\langle  \bm v,  \bm f(u) \rangle = \int_\Omega  \bm v \cdot  \bm f_0 (u, \nabla u) + \nabla \bm v :  \bm f_1 (u, \nabla u) = 0
$$ (residual)

for all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$
and $\bm f_1$ contain all possible sources in the problem. Note that
$\bm f_0$ represents all terms in {math:numref}`residual` which multiply the (possibly vector-valued) test
function $\bm v$ and $\bm f_1$ all terms which multiply its gradient $\nabla \bm v$.
For an $n$-component problem in $d$ dimensions, $\bm f_0 \in \mathbb{R}^n$ and
$\bm f_1 \in \mathbb{R}^{nd}$.
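
As a quick illustration (a standard textbook case, not one taken from the libCEED
examples), consider the scalar Poisson problem $-\nabla^2 u = g$, where the source
$g$ is a symbol introduced here only for this example. Multiplying by a test
function $v \in V_0$ and integrating by parts, with the boundary term vanishing
because $v$ is zero on the essential boundary, gives

$$
\int_\Omega \nabla v \cdot \nabla u - v \, g = 0 ,
$$

which matches {math:numref}`residual` with $f_0 = -g$ and $\bm f_1 = \nabla u$.
By the same reading, the mass ($L^2$ projection) operator $\langle v, u \rangle$
corresponds to $f_0 = u$ and $\bm f_1 = \bm 0$.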

:::{note}
The notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both
fields and spatial dimensions while a single dot represents contraction in just one,
which should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts only over
fields.
:::

:::{note}
In the code, the function that represents the weak form at quadrature
points is called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the
library (in the {file}`examples/` directory), we store the term $\bm f_0$ directly
into `v`, and the term $\bm f_1$ directly into `dv` (which stands for
$\nabla \bm v$). If equation {math:numref}`residual` only presents a term of the
type $\bm f_0$, the {ref}`CeedQFunction` will only have one output argument,
namely `v`. If equation {math:numref}`residual` also presents a term of the type
$\bm f_1$, then the {ref}`CeedQFunction` will have two output arguments, namely,
`v` and `dv`.
:::
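
The following is a minimal sketch of such a pointwise function in plain C, for
the 1D problem $-u'' = g$. It is not a libCEED QFunction: the name
`poisson1d_residual`, the ordering of the inputs, and the use of `int` and
`double` in place of the library's integer and scalar types are all assumptions
made for illustration. It only shows the convention of writing $f_0$ into `v`
and $f_1$ into `dv`, each scaled by the quadrature weight.

```c
#include <stddef.h>

// Hypothetical pointwise residual for -u'' = g in 1D (illustrative only).
//   in[0]  = du    : derivative of u at each quadrature point,
//                    assumed already mapped to physical space
//   in[1]  = g     : source term at each quadrature point
//   in[2]  = wdetj : quadrature weight times Jacobian determinant
//   out[0] = v     : receives f_0 = -g (the term multiplying the test function)
//   out[1] = dv    : receives f_1 = du (the term multiplying its gradient)
int poisson1d_residual(void *ctx, int Q, const double *const *in, double *const *out) {
  (void)ctx;  // no user context is needed in this sketch
  const double *du = in[0], *g = in[1], *wdetj = in[2];
  double *v = out[0], *dv = out[1];

  for (int i = 0; i < Q; i++) {
    v[i]  = -wdetj[i] * g[i];
    dv[i] = wdetj[i] * du[i];
  }
  return 0;
}
```

A problem with only an $f_0$ term, such as the mass operator, would drop the
`dv` output and keep the single output `v`.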

## Finite Element Operator Decomposition

Finite element operators are typically defined through weak formulations of
partial differential equations that involve integration over a computational
mesh. The required integrals are computed by splitting them as a sum over the
mesh elements, mapping each element to a simple *reference* element (e.g. the
unit square) and applying a quadrature rule in reference space.

This sequence of operations highlights an inherent hierarchical structure
present in all finite element operators where the evaluation starts on *global
(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
*dofs on subdomains* (groups of elements), then moves to independent
*dofs on each element*, transitions to independent *quadrature points* in
reference space, performs the integration, and then goes back in reverse order
to global (test) degrees of freedom on the whole mesh.

This is illustrated below for the simple case of a symmetric linear operator on
third order ($Q_3$) scalar continuous ($H^1$) elements, where we use
the terms **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
the sets corresponding to the (true) degrees of freedom on the global mesh, the split
local degrees of freedom on the subdomains, the split degrees of freedom on the
mesh elements, and the values at quadrature points, respectively.

We refer to the operators that connect the different types of vectors as:

- Subdomain restriction $\bm{P}$
- Element restriction $\bm{G}$
- Basis (Dofs-to-Qpts) evaluator $\bm{B}$
- Operator at quadrature points $\bm{D}$

More generally, when the test and trial space differ, they get their own
versions of $\bm{P}$, $\bm{G}$ and $\bm{B}$.

(fig-operator-decomp)=

:::{figure} ../../img/libCEED.png
Operator Decomposition
:::

Note that in the case of adaptive mesh refinement (AMR), the restrictions
$\bm{P}$ and $\bm{G}$ will involve not just extracting sub-vectors,
but evaluating values at constrained degrees of freedom through the AMR interpolation.
There can also be several levels of subdomains ($\bm P_1$, $\bm P_2$,
etc.), and it may be convenient to split $\bm{D}$ as the product of several
operators ($\bm D_1$, $\bm D_2$, etc.).
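
In the simplest (conforming, non-AMR) case, the element restriction $\bm{G}$ is
a gather and its transpose is a scatter-add. The sketch below shows this in
plain C; the function names and the `offsets` array are illustrative
assumptions and are not part of the libCEED interface.

```c
#include <stddef.h>

// Apply an element restriction G: copy L-vector entries into an E-vector.
// `offsets` holds num_elem * elem_size indices into the L-vector.
void restrict_apply(int num_elem, int elem_size, const int *offsets, const double *l_vec, double *e_vec) {
  for (int i = 0; i < num_elem * elem_size; i++) e_vec[i] = l_vec[offsets[i]];
}

// Apply the transpose G^T: sum E-vector entries back into the L-vector.
// A node shared by several elements receives one contribution per element.
void restrict_apply_transpose(int num_elem, int elem_size, const int *offsets, const double *e_vec, size_t l_size,
                              double *l_vec) {
  for (size_t i = 0; i < l_size; i++) l_vec[i] = 0.0;
  for (int i = 0; i < num_elem * elem_size; i++) l_vec[offsets[i]] += e_vec[i];
}
```

For two linear elements sharing their middle node, `offsets = {0, 1, 1, 2}`
turns the L-vector `{1, 2, 3}` into the E-vector `{1, 2, 2, 3}`, and applying
the transpose to that E-vector gives `{1, 4, 3}`, since the shared node is
summed twice.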

### Terminology and Notation

Vector representation/storage categories:

- True degrees of freedom/unknowns, **T-vector**:

  > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$
  > - this is a non-overlapping vector decomposition
  > - usually includes any essential (fixed) dofs.
  >
  > ```{image} ../../img/T-vector.svg
  > ```

- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**:

  > - each unknown $i$ has exactly one copy on each processor that owns an
  >   element containing $i$
  > - this is an overlapping vector decomposition with overlaps only across
  >   different processors---there is no duplication of unknowns on a single
  >   processor
  > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have
  >   more than one copy, on different processors.
  >
  > ```{image} ../../img/L-vector.svg
  > ```

- Per element decomposition, **E-vector**:

  > - each unknown $i$ has as many copies as the number of elements that contain
  >   $i$
  > - usually, the copies of the unknowns are grouped by the element they belong
  >   to.
  >
  > ```{image} ../../img/E-vector.svg
  > ```

- In the case of AMR with hanging nodes (giving rise to hanging dofs):

  > - the **L-vector** is enhanced with the hanging/dependent dofs
  > - the additional hanging/dependent dofs are duplicated when they are shared
  >   by multiple processors
  > - this way, an **E-vector** can be derived from an **L-vector** without any
  >   communications and without additional computations to derive the dependent
  >   dofs
  > - in other words, an entry in an **E-vector** is obtained by copying an entry
  >   from the corresponding **L-vector**, optionally switching the sign of the
  >   entry (for $H(\mathrm{div})$- and $H(\mathrm{curl})$-conforming spaces).
  >
  > ```{image} ../../img/L-vector-AMR.svg
  > ```

- In the case of variable order spaces:

  > - the dependent dofs (usually on the higher-order side of a face/edge) can
  >   be treated just like the hanging/dependent dofs case.

- Quadrature point vector, **Q-vector**:

  > - this is similar to **E-vector** where instead of dofs, the vector represents
  >   values at quadrature points, grouped by element.

- In many cases it is useful to distinguish two types of vectors:

  > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector**
  > - here X can be any of the T, L, E, or Q categories
  > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector**
  > - the solution vector is a **T-vector**, and the RHS vector is a **T'-vector**
  > - using the parallel prolongation operator, one can map the solution
  >   **T-vector** to a solution **L-vector**, etc.

Operator representation/storage/action categories:

- Full true-dof parallel assembly, **TA**, or **A**:

  > - ParCSR or similar format
  > - the T in TA indicates that the data format represents an operator from a
  >   **T-vector** to a **T'-vector**.

- Full local assembly, **LA**:

  > - CSR matrix on each rank
  > - the parallel prolongation operator, $\bm{P}$, (and its transpose) should use
  >   optimized matrix-free action
  > - note that $\bm{P}$ is the operator mapping T-vectors to L-vectors.

- Element matrix assembly, **EA**:

  > - each element matrix is stored as a dense matrix
  > - optimized element and parallel prolongation operators
  > - note that the element prolongation operator is the mapping from an
  >   **L-vector** to an **E-vector**.

- Quadrature-point/partial assembly, **QA** or **PA**:

  > - precompute and store $w\det(J)$ at all quadrature points in all mesh elements
  > - the stored data can be viewed as a **Q-vector**.

- Unassembled option, **UA** or **U**:

  > - no assembly step
  > - the action uses the mesh node coordinates directly, and assumes a specific
  >   form of the coefficient, e.g. constant, piecewise-constant, or given as a
  >   **Q-vector** (Q-coefficient).

### Partial Assembly

Since the global operator $\bm{A}$ is just a series of variational restrictions
with $\bm{B}$, $\bm{G}$ and $\bm{P}$, starting from its
point-wise kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be
performed by evaluating and storing some of the innermost variational restriction
matrices, and applying the rest of the operators "on-the-fly". For example, one can
compute and store a global matrix on **T-vector** level. Alternatively, one can compute
and store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
perform the action of $\bm{A}$ using matvecs with $\bm{P}$ or
$\bm{P}$ and $\bm{G}$. While these options are natural for
low-order discretizations, they are not a good fit for high-order methods due to
the amount of FLOPs needed for their evaluation, as well as the memory transfer
needed for a matvec.

Our focus in libCEED, instead, is on **partial assembly**, where we compute and
store only $\bm{D}$ (or portions of it) and evaluate the actions of
$\bm{P}$, $\bm{G}$ and $\bm{B}$ on-the-fly.
Critically for performance, we take advantage of the tensor-product structure of the
degrees of freedom and quadrature points on *quad* and *hex* elements to perform the
action of $\bm{B}$ without storing it as a matrix.
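
To make the tensor-product idea concrete, here is a minimal plain-C sketch of
sum factorization on a single *quad* element, where the 2D basis is the
Kronecker product of one 1D basis with itself. The function name, argument
names, and row-major layouts are assumptions chosen for this illustration; they
do not describe how any libCEED backend stores or applies $\bm{B}$.

```c
// Evaluate a 2D tensor-product basis at quadrature points by sum factorization.
//   b1d : Q x P matrix (row-major) of 1D basis values at the 1D quadrature points
//   u   : P x P nodal values on one element, stored as u[a * P + b]
//   tmp : scratch space of size P * Q
//   uq  : Q x Q values at quadrature points, stored as uq[i * Q + j]
// The work is O(P*Q*(P+Q)) rather than O(P^2 * Q^2) for a dense 2D basis matrix.
void basis_apply_2d(int P, int Q, const double *b1d, const double *u, double *tmp, double *uq) {
  // Contract the fast (b) index: tmp[a][j] = sum_b b1d[j][b] * u[a][b]
  for (int a = 0; a < P; a++) {
    for (int j = 0; j < Q; j++) {
      double sum = 0.0;
      for (int b = 0; b < P; b++) sum += b1d[j * P + b] * u[a * P + b];
      tmp[a * Q + j] = sum;
    }
  }
  // Contract the slow (a) index: uq[i][j] = sum_a b1d[i][a] * tmp[a][j]
  for (int i = 0; i < Q; i++) {
    for (int j = 0; j < Q; j++) {
      double sum = 0.0;
      for (int a = 0; a < P; a++) sum += b1d[i * P + a] * tmp[a * Q + j];
      uq[i * Q + j] = sum;
    }
  }
}
```

Only the small `Q`-by-`P` 1D matrix is ever stored; the same pattern extends to
*hex* elements with a third contraction.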

Implemented properly, the partial assembly algorithm requires an optimal amount of
memory transfers (with respect to the polynomial order) and near-optimal FLOPs
for operator evaluation. It consists of an operator *setup* phase, that
evaluates and stores $\bm{D}$ and an operator *apply* (evaluation) phase that
computes the action of $\bm{A}$ on an input vector. When desired, the setup
phase may be done as a side-effect of evaluating a different operator, such as a
nonlinear residual. The relative costs of the setup and apply phases are
different depending on the physics being expressed and the representation of
$\bm{D}$.
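
The two phases can be sketched for a 1D mass operator with linear elements and
2-point Gauss-Legendre quadrature. This is a self-contained plain-C
illustration under those assumptions, with hypothetical function names; it does
not use the libCEED API and ignores $\bm{P}$ (a single process is assumed).

```c
#define NQ 2  // quadrature points per element (2-point Gauss-Legendre)
#define NP 2  // nodes per element (linear elements)

// 1D linear shape functions evaluated at the Gauss points xi = -/+ 1/sqrt(3).
static void basis_values(double b[NQ][NP]) {
  const double xi[NQ] = {-0.5773502691896257, 0.5773502691896257};
  for (int q = 0; q < NQ; q++) {
    b[q][0] = 0.5 * (1.0 - xi[q]);
    b[q][1] = 0.5 * (1.0 + xi[q]);
  }
}

// Setup phase: store D as w * det(J) at every quadrature point.
// x holds num_elem + 1 node coordinates; both Gauss weights equal 1.
void mass_setup(int num_elem, const double *x, double *q_data) {
  for (int e = 0; e < num_elem; e++) {
    const double det_j = 0.5 * (x[e + 1] - x[e]);
    for (int q = 0; q < NQ; q++) q_data[e * NQ + q] = 1.0 * det_j;
  }
}

// Apply phase: v = G^T B^T D B G u, element by element, without forming a matrix.
void mass_apply(int num_elem, const double *q_data, const double *u, double *v) {
  double b[NQ][NP];
  basis_values(b);
  for (int i = 0; i <= num_elem; i++) v[i] = 0.0;
  for (int e = 0; e < num_elem; e++) {
    const double u_e[NP] = {u[e], u[e + 1]};  // G: gather the element dofs
    for (int q = 0; q < NQ; q++) {
      double u_q = b[q][0] * u_e[0] + b[q][1] * u_e[1];  // B: dofs to quadrature point
      u_q *= q_data[e * NQ + q];                          // D: pointwise scaling
      v[e] += b[q][0] * u_q;                              // B^T, then G^T as scatter-add
      v[e + 1] += b[q][1] * u_q;
    }
  }
}
```

On the mesh with nodes `{0, 0.5, 2}` and `u` equal to one everywhere, the
result is the row sums of the mass matrix, `{0.25, 1.0, 0.75}`, which add up to
the domain length 2.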

### Parallel Decomposition

After the application of each of the first three transition operators,
$\bm{P}$, $\bm{G}$ and $\bm{B}$, the operator evaluation
is decoupled on their ranges, so $\bm{P}$, $\bm{G}$ and
$\bm{B}$ allow us to "zoom-in" to subdomain, element and quadrature point
level, ignoring the coupling at higher levels.

Thus, a natural mapping of $\bm{A}$ on a parallel computer is to split the
**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
used for sparse matrices), and then split the rest of the vector types over
computational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
the diagram above.

One of the advantages of the decomposition perspective in these settings is that
the operators $\bm{P}$, $\bm{G}$, $\bm{B}$ and
$\bm{D}$ clearly separate the MPI parallelism
in the operator ($\bm{P}$) from the unstructured mesh topology
($\bm{G}$), the choice of the finite element space/basis ($\bm{B}$)
and the geometry and point-wise physics $\bm{D}$. These components also
naturally fall in different classes of numerical algorithms -- parallel (multi-device)
linear algebra for $\bm{P}$, sparse (on-device) linear algebra for
$\bm{G}$, dense/structured linear algebra (tensor contractions) for
$\bm{B}$ and parallel point-wise evaluations for $\bm{D}$.

Currently in libCEED, it is assumed that the host application manages the global
**T-vectors** and the required communications among devices (which are generally
on different compute nodes) with **P**. Our API is thus focused on the
**L-vector** level, where the logical devices, which in the library are
represented by the {ref}`Ceed` object, are independent. Each MPI rank can use one or
more {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or more physical
devices, as long as libCEED backends support such configurations. The idea is
that every MPI rank can use any logical device it is assigned at runtime. For
example, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI
ranks (each using a single {ref}`Ceed` object): 2 ranks using 1 CPU socket each, and
4 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node
and use 5 {ref}`Ceed` objects: 1 managing all CPU cores on the 2 sockets and 4
managing 1 GPU each. The communications among the devices, e.g. required for
applying the action of $\bm{P}$, are currently out of scope of libCEED. The
interface is non-blocking for all operations involving more than O(1) data,
allowing operations performed on a coprocessor or worker threads to overlap with
operations on the host.

## API Description

The libCEED API takes an algebraic approach, where the user essentially
describes in the *frontend* the operators **G**, **B** and **D** and the library
provides *backend* implementations and coordinates their action to the original
operator on **L-vector** level (i.e. independently on each device / MPI task).

One of the advantages of this purely algebraic description is that it already
includes all the finite element information, so the backends can operate on
linear algebra level without explicit finite element code. The frontend
description is general enough to support a wide variety of finite element
algorithms, as well as some other types of algorithms such as spectral finite
differences. The separation of the front- and backends enables applications to
easily switch/try different backends. It also enables backend developers to
impact many applications from a single implementation.

Our long-term vision is to include a variety of backend implementations in
libCEED, ranging from reference kernels to highly optimized kernels targeting
specific devices (e.g. GPUs) or specific polynomial orders. A simple reference
backend implementation is provided in the file
[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c).

On the frontend, the mapping between the decomposition concepts and the code
implementation is as follows:

- **L-**, **E-** and **Q-vector** are represented as variables of type {ref}`CeedVector`.
  (A backend may choose to operate incrementally without forming explicit **E-** or
  **Q-vectors**.)
- $\bm{G}$ is represented as a variable of type {ref}`CeedElemRestriction`.
- $\bm{B}$ is represented as a variable of type {ref}`CeedBasis`.
- the action of $\bm{D}$ is represented as a variable of type {ref}`CeedQFunction`.
- the overall operator $\bm{G}^T \bm{B}^T \bm{D} \bm{B} \bm{G}$
  is represented as a variable of type
  {ref}`CeedOperator` and its action is accessible through {c:func}`CeedOperatorApply()`.

To clarify these concepts and illustrate how they are combined in the API,
consider the implementation of the action of a simple 1D mass matrix
(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)).

```{literalinclude} ../../../tests/t500-operator.c
:language: c
:linenos: true
```

The constructor

```{literalinclude} ../../../tests/t500-operator.c
:end-at: CeedInit
:language: c
:start-at: CeedInit
```

creates a logical device `ceed` on the specified *resource*, which could also be
a coprocessor such as `"/nvidia/0"`. There can be any number of such devices,
including multiple logical devices driving the same resource (though performance
may suffer in case of oversubscription). The resource is used to locate a
suitable backend which will have discretion over the implementations of all
objects created with this logical device.

The `setup` routine above computes and stores $\bm{D}$, in this case a
scalar value in each quadrature point, while `mass` uses these saved values to perform
the action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction`
variables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()` calls:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [QFunction Create]
:language: c
:start-after: //! [QFunction Create]
```

A {ref}`CeedQFunction` performs independent operations at each quadrature point and
the interface is intended to facilitate vectorization. The second argument is
an expected vector length. If greater than 1, the caller must ensure that the
number of quadrature points `Q` is divisible by the vector length. This is
often satisfied automatically due to the element size or by batching elements
together to facilitate vectorization in other stages, and can always be ensured
by padding.
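
The arithmetic behind padding is simply rounding the point count up to the next
multiple of the vector length. A minimal sketch, using a hypothetical helper
name that is not part of libCEED:

```c
// Smallest multiple of vec_length that is >= num_qpts (requires vec_length >= 1).
int padded_num_qpts(int num_qpts, int vec_length) {
  return ((num_qpts + vec_length - 1) / vec_length) * vec_length;
}
```

For example, 27 quadrature points with a vector length of 8 would be padded to
32, while 64 points need no padding. How the extra entries are filled is up to
the caller; this sketch only computes the padded size.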

In addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction`
constructors take a string representation specifying where the source for the
implementation is found. This is used by backends that support Just-In-Time
(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors.
For full support across all backends, these {ref}`CeedQFunction` source files must only contain constructs mutually supported by C99, C++11, and CUDA.
For example, explicit type casting of void pointers and explicit use of compatible arguments for {code}`math` library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA` macro.

Different input and output fields are added individually, specifying the field
name, size of the field, and evaluation mode.

The size of the field is provided by a combination of the number of components
and the effect of any basis evaluations.

The evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_INTERP`
for both input and output fields indicates that the mass operator only contains terms of
the form

$$
\int_\Omega v \cdot f_0 (u, \nabla u)
$$

where $v$ are test functions (see the {ref}`theoretical-framework`).
More general operators, such as those of the form

$$
\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u)
$$

can be expressed.

For fields with derivatives, such as with the basis evaluation mode
(see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the
field needs to reflect both the number of components and the geometric dimension.
A 3-dimensional gradient on four components would therefore mean the field has a
size of 12.

The $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field,
`basis_u`, are defined in the calls to the function {c:func}`CeedBasisCreateTensorH1Lagrange()`.
In this example, both the mesh and the unknown field use $H^1$ Lagrange finite
elements of order 1 and 4 respectively (the `P` argument represents the number of 1D
degrees of freedom on each element). Both basis operators use the same integration rule,
which is Gauss-Legendre with 8 points (the `Q` argument).

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Basis Create]
:language: c
:start-after: //! [Basis Create]
```

Other elements with this structure can be specified in terms of the `Q×P`
matrices that evaluate values and gradients at quadrature points in one
dimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have tensor
product structure, such as symmetric elements on simplices, will be created
using different constructors.

The $\bm{G}$ operators for the mesh nodes, `elem_restr_x`, and the unknown field,
`elem_restr_u`, are specified in the {c:func}`CeedElemRestrictionCreate()`. Both of these
specify directly the dof indices for each element in the `ind_x` and `ind_u`
arrays:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [ElemRestr Create]
:language: c
:start-after: //! [ElemRestr Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [ElemRestrU Create]
:language: c
:start-after: //! [ElemRestrU Create]
```
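
For a 1D mesh in which neighbouring elements share their end node, index arrays
of this kind follow a simple pattern. The sketch below is a plain-C illustration
with a hypothetical helper name; it is not copied from the test file, and the
array it fills merely plays the role that `ind_x` or `ind_u` play above.

```c
// Fill element-to-node offsets for a 1D mesh of num_elem elements with P nodes
// per element, where neighbouring elements share their end node.
// `offsets` must have room for num_elem * P entries; the corresponding
// L-vector then has num_elem * (P - 1) + 1 entries.
void fill_offsets_1d(int num_elem, int P, int *offsets) {
  for (int e = 0; e < num_elem; e++) {
    for (int j = 0; j < P; j++) offsets[e * P + j] = e * (P - 1) + j;
  }
}
```

With three linear elements (`P = 2`) this yields `{0, 1, 1, 2, 2, 3}`, and with
two quadratic elements (`P = 3`) it yields `{0, 1, 2, 2, 3, 4}`.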

If the user has arrays available on a device, they can be provided using
`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all
contexts that involve problem-sized data.

For discontinuous Galerkin and for applications such as Nek5000 that only
explicitly store **E-vectors** (inter-element continuity has been subsumed by
the parallel restriction $\bm{P}$), the element restriction $\bm{G}$
is the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used instead.
We plan to support other structured representations of $\bm{G}$ which will
be added according to demand.
There are two common approaches for supporting non-conforming elements: applying the node constraints via $\bm P$ so that the **L-vector** can be processed uniformly and applying the constraints via $\bm G$ so that the **E-vector** is uniform.
The former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes.

These operations, $\bm{G}$, $\bm{B}$, and $\bm{D}$,
are combined with a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added
separately with a matching field name, basis ($\bm{B}$), element restriction
($\bm{G}$), and **L-vector**. The flag
`CEED_VECTOR_ACTIVE` indicates that the vector corresponding to that field will
be provided to the operator when {c:func}`CeedOperatorApply()` is called. Otherwise the
input/output will be read from/written to the specified **L-vector**.

*bcb2dfaeSJed BrownWith partial assembly, we first perform a setup stage where $\bm{D}$ is evaluated
*bcb2dfaeSJed Brownand stored. This is accomplished by the operator `op_setup` and its application
*bcb2dfaeSJed Brownto `X`, the nodes of the mesh (these are needed to compute Jacobians at
*bcb2dfaeSJed Brownquadrature points). Note that the corresponding {c:func}`CeedOperatorApply()` has no basis
*bcb2dfaeSJed Brownevaluation on the output, as the quadrature data is not needed at the dofs:
*bcb2dfaeSJed Brown
```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Create]
:language: c
:start-after: //! [Setup Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Set]
:language: c
:start-after: //! [Setup Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Apply]
:language: c
:start-after: //! [Setup Apply]
```

The action of the operator is then represented by the operator `op_mass` and its
application, via {c:func}`CeedOperatorApply()`, to the input **L-vector** `U` with output in `V`:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Create]
:language: c
:start-after: //! [Operator Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Set]
:language: c
:start-after: //! [Operator Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Apply]
:language: c
:start-after: //! [Operator Apply]
```
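
For the 1D mass operator used in this example, the two stages can be written out pointwise. This is a sketch that assumes the QFunctions in `tests/t500-operator.h` have the usual form for a mass operator. The symbols $w_i$ (quadrature weight), $J_i$ (the Jacobian $\partial x / \partial X$ at quadrature point $i$), and $\rho_i$ (the stored quadrature data) are notation introduced here for illustration, and $u_i$, $v_i$ denote values at quadrature point $i$:

$$
\begin{aligned}
\text{setup:} \quad & \rho_i = w_i \, J_i, \\
\text{apply:} \quad & v_i = \rho_i \, u_i .
\end{aligned}
$$

Under these assumptions $\bm{D} = \operatorname{diag}(\rho_i)$, and applying `op_mass` to `U` corresponds to $V = \bm{G}^T \bm{B}^T \bm{D} \bm{B} \bm{G} \, U$, with $\bm{D}$ computed once by `op_setup` and reused on every application.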

A number of function calls in the interface, such as {c:func}`CeedOperatorApply()`, are
intended to support asynchronous execution via their last argument,
`CeedRequest*`. The specific (pointer) value used in the above example,
`CEED_REQUEST_IMMEDIATE`, expresses the user's request for the
operation to complete before returning from the function call, i.e., to make sure
that the result of the operation is available in the output parameters
immediately after the call. For a truly asynchronous call, one needs to provide
the address of a user-defined variable. Such a variable can be used later to
explicitly wait for the completion of the operation.
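
The difference between the two modes can be illustrated with a toy model in plain C. Everything here (`ToyRequest`, `TOY_REQUEST_IMMEDIATE`, `ToyApply`, `ToyRequestWait`) is a hypothetical stand-in and not part of libCEED, and deferral is simulated by postponing the work until the wait call rather than by real concurrency:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request handle; this is NOT a libCEED type. */
typedef struct {
  int           pending; /* nonzero while the work has not been done yet */
  const double *in;
  double       *out;
  size_t        n;
} ToyRequest;

/* Sentinel pointer value meaning "complete before returning". */
static ToyRequest toy_immediate_sentinel;
#define TOY_REQUEST_IMMEDIATE (&toy_immediate_sentinel)

/* The operation itself: out[i] = 2 * in[i]. */
static void ToyDoWork(const double *in, double *out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = 2.0 * in[i];
}

/* Apply the operation, either immediately or deferred via a request. */
int ToyApply(const double *in, double *out, size_t n, ToyRequest *request) {
  assert(request != NULL);
  if (request == TOY_REQUEST_IMMEDIATE) {
    ToyDoWork(in, out, n); /* result is available on return */
    return 0;
  }
  /* Record the work so that it can be completed later. */
  request->in      = in;
  request->out     = out;
  request->n       = n;
  request->pending = 1;
  return 0;
}

/* Block until the operation attached to the request has completed. */
int ToyRequestWait(ToyRequest *request) {
  assert(request != NULL && request != TOY_REQUEST_IMMEDIATE);
  if (request->pending) {
    ToyDoWork(request->in, request->out, request->n);
    request->pending = 0;
  }
  return 0;
}
```

With the sentinel, `out` is filled by the time `ToyApply` returns. With the address of a `ToyRequest` variable, the caller must call `ToyRequestWait` on it before reading `out`.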

## Gallery of QFunctions

libCEED provides a gallery of built-in {ref}`CeedQFunction`s in the {file}`gallery/` directory.
The available QFunctions are the ones associated with the mass, the Laplacian, and
the identity operators. To illustrate how the user can declare a {ref}`CeedQFunction`
via the gallery of available QFunctions, consider the selection of the
{ref}`CeedQFunction` associated with a simple 1D mass matrix
(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)).

```{literalinclude} ../../../tests/t410-qfunction.c
:language: c
:linenos: true
```

## Interface Principles and Evolution

libCEED is intended to be extensible via backends that are either packaged with the
library or packaged separately (possibly as a binary containing proprietary
code). Backends are registered by calling

```{literalinclude} ../../../backends/ref/ceed-ref.c
:end-before: //! [Register]
:language: c
:start-after: //! [Register]
```

typically in a library initializer or "constructor" that runs automatically.
`CeedInit` uses the prefix supplied at registration to find an appropriate backend for the requested resource.
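
A minimal sketch of prefix-based selection, in plain C, may make the idea concrete. It assumes one plausible rule: choose the backend whose registered prefix shares the most leading characters with the requested resource, break ties in favor of the lower priority number, and reject matches of a single character. The names `ToyBackend` and `ToySelectBackend` are hypothetical, and this is an illustration rather than a copy of the logic in `CeedInit`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical registry entry; this is NOT a libCEED type. */
typedef struct {
  const char  *prefix;   /* resource prefix given at registration */
  unsigned int priority; /* assumed: lower value wins a tie */
} ToyBackend;

/* Number of leading characters that a and b have in common. */
static size_t ToyCommonLength(const char *a, const char *b) {
  size_t n = 0;
  while (a[n] && a[n] == b[n]) n++;
  return n;
}

/* Return the index of the best-matching backend, or -1 if none is suitable. */
int ToySelectBackend(const ToyBackend *backends, size_t count, const char *resource) {
  int          best          = -1;
  size_t       best_len      = 0;
  unsigned int best_priority = 0;

  assert(resource != NULL);
  for (size_t i = 0; i < count; i++) {
    size_t len = ToyCommonLength(backends[i].prefix, resource);
    if (len > best_len || (len == best_len && best >= 0 && backends[i].priority < best_priority)) {
      best          = (int)i;
      best_len      = len;
      best_priority = backends[i].priority;
    }
  }
  /* Assumed threshold: a match on the leading "/" alone is not meaningful. */
  return best_len > 1 ? best : -1;
}
```

Under this rule a short resource such as `/cpu/self` is resolved by priority among all backends registered beneath it, while a longer resource string narrows the choice to a specific backend.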

Source (API) and binary (ABI) stability are important to libCEED. Prior to
reaching version 1.0, libCEED does not implement strict
[semantic versioning](https://semver.org) across the entire interface. However, user code,
including libraries of {ref}`CeedQFunction`s, should be source and binary
compatible when moving from 0.x.y to any later release 0.x.z. We have less experience
with external packaging of backends and do not presently guarantee source or
binary stability for them, but we intend to define stability guarantees for libCEED 1.0.
We'd love to talk with you if you're interested in packaging backends
externally, and will work with you on a practical stability policy.