1*bcb2dfaeSJed Brown# Interface Concepts 2*bcb2dfaeSJed Brown 3*bcb2dfaeSJed BrownThis page provides a brief description of the theoretical foundations and the 4*bcb2dfaeSJed Brownpractical implementation of the libCEED library. 5*bcb2dfaeSJed Brown 6*bcb2dfaeSJed Brown(theoretical-framework)= 7*bcb2dfaeSJed Brown 8*bcb2dfaeSJed Brown## Theoretical Framework 9*bcb2dfaeSJed Brown 10*bcb2dfaeSJed BrownIn finite element formulations, the weak form of a Partial Differential Equation 11*bcb2dfaeSJed Brown(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results 12*bcb2dfaeSJed Brownare composed into a larger system of equations that models the entire problem on 13*bcb2dfaeSJed Brownthe global domain $\Omega$. In particular, when high-order finite elements or 14*bcb2dfaeSJed Brownspectral elements are used, the resulting sparse matrix representation of the global 15*bcb2dfaeSJed Brownoperator is computationally expensive, with respect to both the memory transfer and 16*bcb2dfaeSJed Brownfloating point operations needed for its evaluation. libCEED provides an interface 17*bcb2dfaeSJed Brownfor matrix-free operator description that enables efficient evaluation on a variety 18*bcb2dfaeSJed Brownof computational device types (selectable at run time). We present here the notation 19*bcb2dfaeSJed Brownand the mathematical formulation adopted in libCEED. 20*bcb2dfaeSJed Brown 21*bcb2dfaeSJed BrownWe start by considering the discrete residual $F(u)=0$ formulation 22*bcb2dfaeSJed Brownin weak form. We first define the $L^2$ inner product between real-valued functions 23*bcb2dfaeSJed Brown 24*bcb2dfaeSJed Brown$$ 25*bcb2dfaeSJed Brown\langle v, u \rangle = \int_\Omega v u d \bm{x}, 26*bcb2dfaeSJed Brown$$ 27*bcb2dfaeSJed Brown 28*bcb2dfaeSJed Brownwhere $\bm{x} \in \mathbb{R}^d \supset \Omega$. 29*bcb2dfaeSJed Brown 30*bcb2dfaeSJed BrownWe want to find $u$ in a suitable space $V_D$, 31*bcb2dfaeSJed Brownsuch that 32*bcb2dfaeSJed Brown 33*bcb2dfaeSJed Brown$$ 34*bcb2dfaeSJed Brown\langle \bm v, \bm f(u) \rangle = \int_\Omega \bm v \cdot \bm f_0 (u, \nabla u) + \nabla \bm v : \bm f_1 (u, \nabla u) = 0 35*bcb2dfaeSJed Brown$$ (residual) 36*bcb2dfaeSJed Brown 37*bcb2dfaeSJed Brownfor all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$ 38*bcb2dfaeSJed Brownand $\bm f_1$ contain all possible sources in the problem. We notice here that 39*bcb2dfaeSJed Brown$\bm f_0$ represents all terms in {math:numref}`residual` which multiply the (possibly vector-valued) test 40*bcb2dfaeSJed Brownfunction $\bm v$ and $\bm f_1$ all terms which multiply its gradient $\nabla \bm v$. 41*bcb2dfaeSJed BrownFor an n-component problems in $d$ dimensions, $\bm f_0 \in \mathbb{R}^n$ and 42*bcb2dfaeSJed Brown$\bm f_1 \in \mathbb{R}^{nd}$. 43*bcb2dfaeSJed Brown 44*bcb2dfaeSJed Brown:::{note} 45*bcb2dfaeSJed BrownThe notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both 46*bcb2dfaeSJed Brownfields and spatial dimensions while a single dot represents contraction in just one, 47*bcb2dfaeSJed Brownwhich should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts only over 48*bcb2dfaeSJed Brownfields. 49*bcb2dfaeSJed Brown::: 50*bcb2dfaeSJed Brown 51*bcb2dfaeSJed Brown:::{note} 52*bcb2dfaeSJed BrownIn the code, the function that represents the weak form at quadrature 53*bcb2dfaeSJed Brownpoints is called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the 54*bcb2dfaeSJed Brownlibrary (in the {file}`examples/` directory), we store the term $\bm f_0$ directly 55*bcb2dfaeSJed Browninto `v`, and the term $\bm f_1$ directly into `dv` (which stands for 56*bcb2dfaeSJed Brown$\nabla \bm v$). If equation {math:numref}`residual` only presents a term of the 57*bcb2dfaeSJed Browntype $\bm f_0$, the {ref}`CeedQFunction` will only have one output argument, 58*bcb2dfaeSJed Brownnamely `v`. If equation {math:numref}`residual` also presents a term of the type 59*bcb2dfaeSJed Brown$\bm f_1$, then the {ref}`CeedQFunction` will have two output arguments, namely, 60*bcb2dfaeSJed Brown`v` and `dv`. 61*bcb2dfaeSJed Brown::: 62*bcb2dfaeSJed Brown 63*bcb2dfaeSJed Brown## Finite Element Operator Decomposition 64*bcb2dfaeSJed Brown 65*bcb2dfaeSJed BrownFinite element operators are typically defined through weak formulations of 66*bcb2dfaeSJed Brownpartial differential equations that involve integration over a computational 67*bcb2dfaeSJed Brownmesh. The required integrals are computed by splitting them as a sum over the 68*bcb2dfaeSJed Brownmesh elements, mapping each element to a simple *reference* element (e.g. the 69*bcb2dfaeSJed Brownunit square) and applying a quadrature rule in reference space. 70*bcb2dfaeSJed Brown 71*bcb2dfaeSJed BrownThis sequence of operations highlights an inherent hierarchical structure 72*bcb2dfaeSJed Brownpresent in all finite element operators where the evaluation starts on *global 73*bcb2dfaeSJed Brown(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to 74*bcb2dfaeSJed Brown*dofs on subdomains* (groups of elements), then moves to independent 75*bcb2dfaeSJed Brown*dofs on each element*, transitions to independent *quadrature points* in 76*bcb2dfaeSJed Brownreference space, performs the integration, and then goes back in reverse order 77*bcb2dfaeSJed Brownto global (test) degrees of freedom on the whole mesh. 78*bcb2dfaeSJed Brown 79*bcb2dfaeSJed BrownThis is illustrated below for the simple case of symmetric linear operator on 80*bcb2dfaeSJed Brownthird order ($Q_3$) scalar continuous ($H^1$) elements, where we use 81*bcb2dfaeSJed Brownthe notions **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent 82*bcb2dfaeSJed Brownthe sets corresponding to the (true) degrees of freedom on the global mesh, the split 83*bcb2dfaeSJed Brownlocal degrees of freedom on the subdomains, the split degrees of freedom on the 84*bcb2dfaeSJed Brownmesh elements, and the values at quadrature points, respectively. 85*bcb2dfaeSJed Brown 86*bcb2dfaeSJed BrownWe refer to the operators that connect the different types of vectors as: 87*bcb2dfaeSJed Brown 88*bcb2dfaeSJed Brown- Subdomain restriction $\bm{P}$ 89*bcb2dfaeSJed Brown- Element restriction $\bm{G}$ 90*bcb2dfaeSJed Brown- Basis (Dofs-to-Qpts) evaluator $\bm{B}$ 91*bcb2dfaeSJed Brown- Operator at quadrature points $\bm{D}$ 92*bcb2dfaeSJed Brown 93*bcb2dfaeSJed BrownMore generally, when the test and trial space differ, they get their own 94*bcb2dfaeSJed Brownversions of $\bm{P}$, $\bm{G}$ and $\bm{B}$. 95*bcb2dfaeSJed Brown 96*bcb2dfaeSJed Brown(fig-operator-decomp)= 97*bcb2dfaeSJed Brown 98*bcb2dfaeSJed Brown:::{figure} ../../img/libCEED.png 99*bcb2dfaeSJed BrownOperator Decomposition 100*bcb2dfaeSJed Brown::: 101*bcb2dfaeSJed Brown 102*bcb2dfaeSJed BrownNote that in the case of adaptive mesh refinement (AMR), the restrictions 103*bcb2dfaeSJed Brown$\bm{P}$ and $\bm{G}$ will involve not just extracting sub-vectors, 104*bcb2dfaeSJed Brownbut evaluating values at constrained degrees of freedom through the AMR interpolation. 105*bcb2dfaeSJed BrownThere can also be several levels of subdomains ($\bm P_1$, $\bm P_2$, 106*bcb2dfaeSJed Brownetc.), and it may be convenient to split $\bm{D}$ as the product of several 107*bcb2dfaeSJed Brownoperators ($\bm D_1$, $\bm D_2$, etc.). 108*bcb2dfaeSJed Brown 109*bcb2dfaeSJed Brown### Terminology and Notation 110*bcb2dfaeSJed Brown 111*bcb2dfaeSJed BrownVector representation/storage categories: 112*bcb2dfaeSJed Brown 113*bcb2dfaeSJed Brown- True degrees of freedom/unknowns, **T-vector**: 114*bcb2dfaeSJed Brown 115*bcb2dfaeSJed Brown > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$ 116*bcb2dfaeSJed Brown > - this is a non-overlapping vector decomposition 117*bcb2dfaeSJed Brown > - usually includes any essential (fixed) dofs. 118*bcb2dfaeSJed Brown > 119*bcb2dfaeSJed Brown > ```{image} ../../img/T-vector.svg 120*bcb2dfaeSJed Brown > ``` 121*bcb2dfaeSJed Brown 122*bcb2dfaeSJed Brown- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**: 123*bcb2dfaeSJed Brown 124*bcb2dfaeSJed Brown > - each unknown $i$ has exactly one copy on each processor that owns an 125*bcb2dfaeSJed Brown > element containing $i$ 126*bcb2dfaeSJed Brown > - this is an overlapping vector decomposition with overlaps only across 127*bcb2dfaeSJed Brown > different processors---there is no duplication of unknowns on a single 128*bcb2dfaeSJed Brown > processor 129*bcb2dfaeSJed Brown > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have 130*bcb2dfaeSJed Brown > more than one copy, on different processors. 131*bcb2dfaeSJed Brown > 132*bcb2dfaeSJed Brown > ```{image} ../../img/L-vector.svg 133*bcb2dfaeSJed Brown > ``` 134*bcb2dfaeSJed Brown 135*bcb2dfaeSJed Brown- Per element decomposition, **E-vector**: 136*bcb2dfaeSJed Brown 137*bcb2dfaeSJed Brown > - each unknown $i$ has as many copies as the number of elements that contain 138*bcb2dfaeSJed Brown > $i$ 139*bcb2dfaeSJed Brown > - usually, the copies of the unknowns are grouped by the element they belong 140*bcb2dfaeSJed Brown > to. 141*bcb2dfaeSJed Brown > 142*bcb2dfaeSJed Brown > ```{image} ../../img/E-vector.svg 143*bcb2dfaeSJed Brown > ``` 144*bcb2dfaeSJed Brown 145*bcb2dfaeSJed Brown- In the case of AMR with hanging nodes (giving rise to hanging dofs): 146*bcb2dfaeSJed Brown 147*bcb2dfaeSJed Brown > - the **L-vector** is enhanced with the hanging/dependent dofs 148*bcb2dfaeSJed Brown > - the additional hanging/dependent dofs are duplicated when they are shared 149*bcb2dfaeSJed Brown > by multiple processors 150*bcb2dfaeSJed Brown > - this way, an **E-vector** can be derived from an **L-vector** without any 151*bcb2dfaeSJed Brown > communications and without additional computations to derive the dependent 152*bcb2dfaeSJed Brown > dofs 153*bcb2dfaeSJed Brown > - in other words, an entry in an **E-vector** is obtained by copying an entry 154*bcb2dfaeSJed Brown > from the corresponding **L-vector**, optionally switching the sign of the 155*bcb2dfaeSJed Brown > entry (for $H(\mathrm{div})$---and $H(\mathrm{curl})$-conforming spaces). 156*bcb2dfaeSJed Brown > 157*bcb2dfaeSJed Brown > ```{image} ../../img/L-vector-AMR.svg 158*bcb2dfaeSJed Brown > ``` 159*bcb2dfaeSJed Brown 160*bcb2dfaeSJed Brown- In the case of variable order spaces: 161*bcb2dfaeSJed Brown 162*bcb2dfaeSJed Brown > - the dependent dofs (usually on the higher-order side of a face/edge) can 163*bcb2dfaeSJed Brown > be treated just like the hanging/dependent dofs case. 164*bcb2dfaeSJed Brown 165*bcb2dfaeSJed Brown- Quadrature point vector, **Q-vector**: 166*bcb2dfaeSJed Brown 167*bcb2dfaeSJed Brown > - this is similar to **E-vector** where instead of dofs, the vector represents 168*bcb2dfaeSJed Brown > values at quadrature points, grouped by element. 169*bcb2dfaeSJed Brown 170*bcb2dfaeSJed Brown- In many cases it is useful to distinguish two types of vectors: 171*bcb2dfaeSJed Brown 172*bcb2dfaeSJed Brown > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector** 173*bcb2dfaeSJed Brown > - here X can be any of the T, L, E, or Q categories 174*bcb2dfaeSJed Brown > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector** 175*bcb2dfaeSJed Brown > - the solutions vector is a **T-vector**, and the RHS vector is a **T'-vector** 176*bcb2dfaeSJed Brown > - using the parallel prolongation operator, one can map the solution 177*bcb2dfaeSJed Brown > **T-vector** to a solution **L-vector**, etc. 178*bcb2dfaeSJed Brown 179*bcb2dfaeSJed BrownOperator representation/storage/action categories: 180*bcb2dfaeSJed Brown 181*bcb2dfaeSJed Brown- Full true-dof parallel assembly, **TA**, or **A**: 182*bcb2dfaeSJed Brown 183*bcb2dfaeSJed Brown > - ParCSR or similar format 184*bcb2dfaeSJed Brown > - the T in TA indicates that the data format represents an operator from a 185*bcb2dfaeSJed Brown > **T-vector** to a **T'-vector**. 186*bcb2dfaeSJed Brown 187*bcb2dfaeSJed Brown- Full local assembly, **LA**: 188*bcb2dfaeSJed Brown 189*bcb2dfaeSJed Brown > - CSR matrix on each rank 190*bcb2dfaeSJed Brown > - the parallel prolongation operator, $\bm{P}$, (and its transpose) should use 191*bcb2dfaeSJed Brown > optimized matrix-free action 192*bcb2dfaeSJed Brown > - note that $\bm{P}$ is the operator mapping T-vectors to L-vectors. 193*bcb2dfaeSJed Brown 194*bcb2dfaeSJed Brown- Element matrix assembly, **EA**: 195*bcb2dfaeSJed Brown 196*bcb2dfaeSJed Brown > - each element matrix is stored as a dense matrix 197*bcb2dfaeSJed Brown > - optimized element and parallel prolongation operators 198*bcb2dfaeSJed Brown > - note that the element prolongation operator is the mapping from an 199*bcb2dfaeSJed Brown > **L-vector** to an **E-vector**. 200*bcb2dfaeSJed Brown 201*bcb2dfaeSJed Brown- Quadrature-point/partial assembly, **QA** or **PA**: 202*bcb2dfaeSJed Brown 203*bcb2dfaeSJed Brown > - precompute and store $w\det(J)$ at all quadrature points in all mesh elements 204*bcb2dfaeSJed Brown > - the stored data can be viewed as a **Q-vector**. 205*bcb2dfaeSJed Brown 206*bcb2dfaeSJed Brown- Unassembled option, **UA** or **U**: 207*bcb2dfaeSJed Brown 208*bcb2dfaeSJed Brown > - no assembly step 209*bcb2dfaeSJed Brown > - the action uses directly the mesh node coordinates, and assumes specific 210*bcb2dfaeSJed Brown > form of the coefficient, e.g. constant, piecewise-constant, or given as a 211*bcb2dfaeSJed Brown > **Q-vector** (Q-coefficient). 212*bcb2dfaeSJed Brown 213*bcb2dfaeSJed Brown### Partial Assembly 214*bcb2dfaeSJed Brown 215*bcb2dfaeSJed BrownSince the global operator $\bm{A}$ is just a series of variational restrictions 216*bcb2dfaeSJed Brownwith $\bm{B}$, $\bm{G}$ and $\bm{P}$, starting from its 217*bcb2dfaeSJed Brownpoint-wise kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be 218*bcb2dfaeSJed Brownperformed by evaluating and storing some of the innermost variational restriction 219*bcb2dfaeSJed Brownmatrices, and applying the rest of the operators "on-the-fly". For example, one can 220*bcb2dfaeSJed Browncompute and store a global matrix on **T-vector** level. Alternatively, one can compute 221*bcb2dfaeSJed Brownand store only the subdomain (**L-vector**) or element (**E-vector**) matrices and 222*bcb2dfaeSJed Brownperform the action of $\bm{A}$ using matvecs with $\bm{P}$ or 223*bcb2dfaeSJed Brown$\bm{P}$ and $\bm{G}$. While these options are natural for 224*bcb2dfaeSJed Brownlow-order discretizations, they are not a good fit for high-order methods due to 225*bcb2dfaeSJed Brownthe amount of FLOPs needed for their evaluation, as well as the memory transfer 226*bcb2dfaeSJed Brownneeded for a matvec. 227*bcb2dfaeSJed Brown 228*bcb2dfaeSJed BrownOur focus in libCEED, instead, is on **partial assembly**, where we compute and 229*bcb2dfaeSJed Brownstore only $\bm{D}$ (or portions of it) and evaluate the actions of 230*bcb2dfaeSJed Brown$\bm{P}$, $\bm{G}$ and $\bm{B}$ on-the-fly. 231*bcb2dfaeSJed BrownCritically for performance, we take advantage of the tensor-product structure of the 232*bcb2dfaeSJed Browndegrees of freedom and quadrature points on *quad* and *hex* elements to perform the 233*bcb2dfaeSJed Brownaction of $\bm{B}$ without storing it as a matrix. 234*bcb2dfaeSJed Brown 235*bcb2dfaeSJed BrownImplemented properly, the partial assembly algorithm requires optimal amount of 236*bcb2dfaeSJed Brownmemory transfers (with respect to the polynomial order) and near-optimal FLOPs 237*bcb2dfaeSJed Brownfor operator evaluation. It consists of an operator *setup* phase, that 238*bcb2dfaeSJed Brownevaluates and stores $\bm{D}$ and an operator *apply* (evaluation) phase that 239*bcb2dfaeSJed Browncomputes the action of $\bm{A}$ on an input vector. When desired, the setup 240*bcb2dfaeSJed Brownphase may be done as a side-effect of evaluating a different operator, such as a 241*bcb2dfaeSJed Brownnonlinear residual. The relative costs of the setup and apply phases are 242*bcb2dfaeSJed Browndifferent depending on the physics being expressed and the representation of 243*bcb2dfaeSJed Brown$\bm{D}$. 244*bcb2dfaeSJed Brown 245*bcb2dfaeSJed Brown### Parallel Decomposition 246*bcb2dfaeSJed Brown 247*bcb2dfaeSJed BrownAfter the application of each of the first three transition operators, 248*bcb2dfaeSJed Brown$\bm{P}$, $\bm{G}$ and $\bm{B}$, the operator evaluation 249*bcb2dfaeSJed Brownis decoupled on their ranges, so $\bm{P}$, $\bm{G}$ and 250*bcb2dfaeSJed Brown$\bm{B}$ allow us to "zoom-in" to subdomain, element and quadrature point 251*bcb2dfaeSJed Brownlevel, ignoring the coupling at higher levels. 252*bcb2dfaeSJed Brown 253*bcb2dfaeSJed BrownThus, a natural mapping of $\bm{A}$ on a parallel computer is to split the 254*bcb2dfaeSJed Brown**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically 255*bcb2dfaeSJed Brownused for sparse matrices), and then split the rest of the vector types over 256*bcb2dfaeSJed Browncomputational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in 257*bcb2dfaeSJed Brownthe diagram above. 258*bcb2dfaeSJed Brown 259*bcb2dfaeSJed BrownOne of the advantages of the decomposition perspective in these settings is that 260*bcb2dfaeSJed Brownthe operators $\bm{P}$, $\bm{G}$, $\bm{B}$ and 261*bcb2dfaeSJed Brown$\bm{D}$ clearly separate the MPI parallelism 262*bcb2dfaeSJed Brownin the operator ($\bm{P}$) from the unstructured mesh topology 263*bcb2dfaeSJed Brown($\bm{G}$), the choice of the finite element space/basis ($\bm{B}$) 264*bcb2dfaeSJed Brownand the geometry and point-wise physics $\bm{D}$. These components also 265*bcb2dfaeSJed Brownnaturally fall in different classes of numerical algorithms -- parallel (multi-device) 266*bcb2dfaeSJed Brownlinear algebra for $\bm{P}$, sparse (on-device) linear algebra for 267*bcb2dfaeSJed Brown$\bm{G}$, dense/structured linear algebra (tensor contractions) for 268*bcb2dfaeSJed Brown$\bm{B}$ and parallel point-wise evaluations for $\bm{D}$. 269*bcb2dfaeSJed Brown 270*bcb2dfaeSJed BrownCurrently in libCEED, it is assumed that the host application manages the global 271*bcb2dfaeSJed Brown**T-vectors** and the required communications among devices (which are generally 272*bcb2dfaeSJed Brownon different compute nodes) with **P**. Our API is thus focused on the 273*bcb2dfaeSJed Brown**L-vector** level, where the logical devices, which in the library are 274*bcb2dfaeSJed Brownrepresented by the {ref}`Ceed` object, are independent. Each MPI rank can use one or 275*bcb2dfaeSJed Brownmore {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or more physical 276*bcb2dfaeSJed Browndevices, as long as libCEED backends support such configurations. The idea is 277*bcb2dfaeSJed Brownthat every MPI rank can use any logical device it is assigned at runtime. For 278*bcb2dfaeSJed Brownexample, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI 279*bcb2dfaeSJed Brownranks (each using a single {ref}`Ceed` object): 2 ranks using 1 CPU socket each, and 280*bcb2dfaeSJed Brown4 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node 281*bcb2dfaeSJed Brownand use 5 {ref}`Ceed` objects: 1 managing all CPU cores on the 2 sockets and 4 282*bcb2dfaeSJed Brownmanaging 1 GPU each. The communications among the devices, e.g. required for 283*bcb2dfaeSJed Brownapplying the action of $\bm{P}$, are currently out of scope of libCEED. The 284*bcb2dfaeSJed Browninterface is non-blocking for all operations involving more than O(1) data, 285*bcb2dfaeSJed Brownallowing operations performed on a coprocessor or worker threads to overlap with 286*bcb2dfaeSJed Brownoperations on the host. 287*bcb2dfaeSJed Brown 288*bcb2dfaeSJed Brown## API Description 289*bcb2dfaeSJed Brown 290*bcb2dfaeSJed BrownThe libCEED API takes an algebraic approach, where the user essentially 291*bcb2dfaeSJed Browndescribes in the *frontend* the operators **G**, **B** and **D** and the library 292*bcb2dfaeSJed Brownprovides *backend* implementations and coordinates their action to the original 293*bcb2dfaeSJed Brownoperator on **L-vector** level (i.e. independently on each device / MPI task). 294*bcb2dfaeSJed Brown 295*bcb2dfaeSJed BrownOne of the advantages of this purely algebraic description is that it already 296*bcb2dfaeSJed Brownincludes all the finite element information, so the backends can operate on 297*bcb2dfaeSJed Brownlinear algebra level without explicit finite element code. The frontend 298*bcb2dfaeSJed Browndescription is general enough to support a wide variety of finite element 299*bcb2dfaeSJed Brownalgorithms, as well as some other types algorithms such as spectral finite 300*bcb2dfaeSJed Browndifferences. The separation of the front- and backends enables applications to 301*bcb2dfaeSJed Browneasily switch/try different backends. It also enables backend developers to 302*bcb2dfaeSJed Brownimpact many applications from a single implementation. 303*bcb2dfaeSJed Brown 304*bcb2dfaeSJed BrownOur long-term vision is to include a variety of backend implementations in 305*bcb2dfaeSJed BrownlibCEED, ranging from reference kernels to highly optimized kernels targeting 306*bcb2dfaeSJed Brownspecific devices (e.g. GPUs) or specific polynomial orders. A simple reference 307*bcb2dfaeSJed Brownbackend implementation is provided in the file 308*bcb2dfaeSJed Brown[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c). 309*bcb2dfaeSJed Brown 310*bcb2dfaeSJed BrownOn the frontend, the mapping between the decomposition concepts and the code 311*bcb2dfaeSJed Brownimplementation is as follows: 312*bcb2dfaeSJed Brown 313*bcb2dfaeSJed Brown- **L-**, **E-** and **Q-vector** are represented as variables of type {ref}`CeedVector`. 314*bcb2dfaeSJed Brown (A backend may choose to operate incrementally without forming explicit **E-** or 315*bcb2dfaeSJed Brown **Q-vectors**.) 316*bcb2dfaeSJed Brown- $\bm{G}$ is represented as variable of type {ref}`CeedElemRestriction`. 317*bcb2dfaeSJed Brown- $\bm{B}$ is represented as variable of type {ref}`CeedBasis`. 318*bcb2dfaeSJed Brown- the action of $\bm{D}$ is represented as variable of type {ref}`CeedQFunction`. 319*bcb2dfaeSJed Brown- the overall operator $\bm{G}^T \bm{B}^T \bm{D} \bm{B} \bm{G}$ 320*bcb2dfaeSJed Brown is represented as variable of type 321*bcb2dfaeSJed Brown {ref}`CeedOperator` and its action is accessible through {c:func}`CeedOperatorApply()`. 322*bcb2dfaeSJed Brown 323*bcb2dfaeSJed BrownTo clarify these concepts and illustrate how they are combined in the API, 324*bcb2dfaeSJed Brownconsider the implementation of the action of a simple 1D mass matrix 325*bcb2dfaeSJed Brown(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)). 326*bcb2dfaeSJed Brown 327*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 328*bcb2dfaeSJed Brown:language: c 329*bcb2dfaeSJed Brown:linenos: true 330*bcb2dfaeSJed Brown``` 331*bcb2dfaeSJed Brown 332*bcb2dfaeSJed BrownThe constructor 333*bcb2dfaeSJed Brown 334*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 335*bcb2dfaeSJed Brown:end-at: CeedInit 336*bcb2dfaeSJed Brown:language: c 337*bcb2dfaeSJed Brown:start-at: CeedInit 338*bcb2dfaeSJed Brown``` 339*bcb2dfaeSJed Brown 340*bcb2dfaeSJed Browncreates a logical device `ceed` on the specified *resource*, which could also be 341*bcb2dfaeSJed Browna coprocessor such as `"/nvidia/0"`. There can be any number of such devices, 342*bcb2dfaeSJed Brownincluding multiple logical devices driving the same resource (though performance 343*bcb2dfaeSJed Brownmay suffer in case of oversubscription). The resource is used to locate a 344*bcb2dfaeSJed Brownsuitable backend which will have discretion over the implementations of all 345*bcb2dfaeSJed Brownobjects created with this logical device. 346*bcb2dfaeSJed Brown 347*bcb2dfaeSJed BrownThe `setup` routine above computes and stores $\bm{D}$, in this case a 348*bcb2dfaeSJed Brownscalar value in each quadrature point, while `mass` uses these saved values to perform 349*bcb2dfaeSJed Brownthe action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction` 350*bcb2dfaeSJed Brownvariables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()` calls: 351*bcb2dfaeSJed Brown 352*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 353*bcb2dfaeSJed Brown:end-before: //! [QFunction Create] 354*bcb2dfaeSJed Brown:language: c 355*bcb2dfaeSJed Brown:start-after: //! [QFunction Create] 356*bcb2dfaeSJed Brown``` 357*bcb2dfaeSJed Brown 358*bcb2dfaeSJed BrownA {ref}`CeedQFunction` performs independent operations at each quadrature point and 359*bcb2dfaeSJed Brownthe interface is intended to facilitate vectorization. The second argument is 360*bcb2dfaeSJed Brownan expected vector length. If greater than 1, the caller must ensure that the 361*bcb2dfaeSJed Brownnumber of quadrature points `Q` is divisible by the vector length. This is 362*bcb2dfaeSJed Brownoften satisfied automatically due to the element size or by batching elements 363*bcb2dfaeSJed Browntogether to facilitate vectorization in other stages, and can always be ensured 364*bcb2dfaeSJed Brownby padding. 365*bcb2dfaeSJed Brown 366*bcb2dfaeSJed BrownIn addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction` 367*bcb2dfaeSJed Brownconstructors take a string representation specifying where the source for the 368*bcb2dfaeSJed Brownimplementation is found. This is used by backends that support Just-In-Time 369*bcb2dfaeSJed Brown(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors. 370*bcb2dfaeSJed BrownFor full support across all backends, these {ref}`CeedQFunction` source files must only contain constructs mutually supported by C99, C++11, and CUDA. 371*bcb2dfaeSJed BrownFor example, explicit type casting of void pointers and explicit use of compatible arguments for {code}`math` library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA` macro. 372*bcb2dfaeSJed Brown 373*bcb2dfaeSJed BrownDifferent input and output fields are added individually, specifying the field 374*bcb2dfaeSJed Brownname, size of the field, and evaluation mode. 375*bcb2dfaeSJed Brown 376*bcb2dfaeSJed BrownThe size of the field is provided by a combination of the number of components 377*bcb2dfaeSJed Brownthe effect of any basis evaluations. 378*bcb2dfaeSJed Brown 379*bcb2dfaeSJed BrownThe evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_INTERP` 380*bcb2dfaeSJed Brownfor both input and output fields indicates that the mass operator only contains terms of 381*bcb2dfaeSJed Brownthe form 382*bcb2dfaeSJed Brown 383*bcb2dfaeSJed Brown$$ 384*bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u) 385*bcb2dfaeSJed Brown$$ 386*bcb2dfaeSJed Brown 387*bcb2dfaeSJed Brownwhere $v$ are test functions (see the {ref}`theoretical-framework`). 388*bcb2dfaeSJed BrownMore general operators, such as those of the form 389*bcb2dfaeSJed Brown 390*bcb2dfaeSJed Brown$$ 391*bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u) 392*bcb2dfaeSJed Brown$$ 393*bcb2dfaeSJed Brown 394*bcb2dfaeSJed Browncan be expressed. 395*bcb2dfaeSJed Brown 396*bcb2dfaeSJed BrownFor fields with derivatives, such as with the basis evaluation mode 397*bcb2dfaeSJed Brown(see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the 398*bcb2dfaeSJed Brownfield needs to reflect both the number of components and the geometric dimension. 399*bcb2dfaeSJed BrownA 3-dimensional gradient on four components would therefore mean the field has a size of 400*bcb2dfaeSJed Brown12\. 401*bcb2dfaeSJed Brown 402*bcb2dfaeSJed BrownThe $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field, 403*bcb2dfaeSJed Brown`basis_u`, are defined in the calls to the function {c:func}`CeedBasisCreateTensorH1Lagrange()`. 404*bcb2dfaeSJed BrownIn this example, both the mesh and the unknown field use $H^1$ Lagrange finite 405*bcb2dfaeSJed Brownelements of order 1 and 4 respectively (the `P` argument represents the number of 1D 406*bcb2dfaeSJed Browndegrees of freedom on each element). Both basis operators use the same integration rule, 407*bcb2dfaeSJed Brownwhich is Gauss-Legendre with 8 points (the `Q` argument). 408*bcb2dfaeSJed Brown 409*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 410*bcb2dfaeSJed Brown:end-before: //! [Basis Create] 411*bcb2dfaeSJed Brown:language: c 412*bcb2dfaeSJed Brown:start-after: //! [Basis Create] 413*bcb2dfaeSJed Brown``` 414*bcb2dfaeSJed Brown 415*bcb2dfaeSJed BrownOther elements with this structure can be specified in terms of the `Q×P` 416*bcb2dfaeSJed Brownmatrices that evaluate values and gradients at quadrature points in one 417*bcb2dfaeSJed Browndimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have tensor 418*bcb2dfaeSJed Brownproduct structure, such as symmetric elements on simplices, will be created 419*bcb2dfaeSJed Brownusing different constructors. 420*bcb2dfaeSJed Brown 421*bcb2dfaeSJed BrownThe $\bm{G}$ operators for the mesh nodes, `elem_restr_x`, and the unknown field, 422*bcb2dfaeSJed Brown`elem_restr_u`, are specified in the {c:func}`CeedElemRestrictionCreate()`. Both of these 423*bcb2dfaeSJed Brownspecify directly the dof indices for each element in the `ind_x` and `ind_u` 424*bcb2dfaeSJed Brownarrays: 425*bcb2dfaeSJed Brown 426*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 427*bcb2dfaeSJed Brown:end-before: //! [ElemRestr Create] 428*bcb2dfaeSJed Brown:language: c 429*bcb2dfaeSJed Brown:start-after: //! [ElemRestr Create] 430*bcb2dfaeSJed Brown``` 431*bcb2dfaeSJed Brown 432*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 433*bcb2dfaeSJed Brown:end-before: //! [ElemRestrU Create] 434*bcb2dfaeSJed Brown:language: c 435*bcb2dfaeSJed Brown:start-after: //! [ElemRestrU Create] 436*bcb2dfaeSJed Brown``` 437*bcb2dfaeSJed Brown 438*bcb2dfaeSJed BrownIf the user has arrays available on a device, they can be provided using 439*bcb2dfaeSJed Brown`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all 440*bcb2dfaeSJed Browncontexts that involve problem-sized data. 441*bcb2dfaeSJed Brown 442*bcb2dfaeSJed BrownFor discontinuous Galerkin and for applications such as Nek5000 that only 443*bcb2dfaeSJed Brownexplicitly store **E-vectors** (inter-element continuity has been subsumed by 444*bcb2dfaeSJed Brownthe parallel restriction $\bm{P}$), the element restriction $\bm{G}$ 445*bcb2dfaeSJed Brownis the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used instead. 446*bcb2dfaeSJed BrownWe plan to support other structured representations of $\bm{G}$ which will 447*bcb2dfaeSJed Brownbe added according to demand. 448*bcb2dfaeSJed BrownThere are two common approaches for supporting non-conforming elements: applying the node constraints via $\bm P$ so that the **L-vector** can be processed uniformly and applying the constraints via $\bm G$ so that the **E-vector** is uniform. 449*bcb2dfaeSJed BrownThe former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes. 450*bcb2dfaeSJed Brown 451*bcb2dfaeSJed BrownThese operations, $\bm{P}$, $\bm{B}$, and $\bm{D}$, 452*bcb2dfaeSJed Brownare combined with a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added 453*bcb2dfaeSJed Brownseparately with a matching field name, basis ($\bm{B}$), element restriction 454*bcb2dfaeSJed Brown($\bm{G}$), and **L-vector**. The flag 455*bcb2dfaeSJed Brown`CEED_VECTOR_ACTIVE` indicates that the vector corresponding to that field will 456*bcb2dfaeSJed Brownbe provided to the operator when {c:func}`CeedOperatorApply()` is called. Otherwise the 457*bcb2dfaeSJed Browninput/output will be read from/written to the specified **L-vector**. 458*bcb2dfaeSJed Brown 459*bcb2dfaeSJed BrownWith partial assembly, we first perform a setup stage where $\bm{D}$ is evaluated 460*bcb2dfaeSJed Brownand stored. This is accomplished by the operator `op_setup` and its application 461*bcb2dfaeSJed Brownto `X`, the nodes of the mesh (these are needed to compute Jacobians at 462*bcb2dfaeSJed Brownquadrature points). Note that the corresponding {c:func}`CeedOperatorApply()` has no basis 463*bcb2dfaeSJed Brownevaluation on the output, as the quadrature data is not needed at the dofs: 464*bcb2dfaeSJed Brown 465*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 466*bcb2dfaeSJed Brown:end-before: //! [Setup Create] 467*bcb2dfaeSJed Brown:language: c 468*bcb2dfaeSJed Brown:start-after: //! [Setup Create] 469*bcb2dfaeSJed Brown``` 470*bcb2dfaeSJed Brown 471*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 472*bcb2dfaeSJed Brown:end-before: //! [Setup Set] 473*bcb2dfaeSJed Brown:language: c 474*bcb2dfaeSJed Brown:start-after: //! [Setup Set] 475*bcb2dfaeSJed Brown``` 476*bcb2dfaeSJed Brown 477*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 478*bcb2dfaeSJed Brown:end-before: //! [Setup Apply] 479*bcb2dfaeSJed Brown:language: c 480*bcb2dfaeSJed Brown:start-after: //! [Setup Apply] 481*bcb2dfaeSJed Brown``` 482*bcb2dfaeSJed Brown 483*bcb2dfaeSJed BrownThe action of the operator is then represented by operator `op_mass` and its 484*bcb2dfaeSJed Brown{c:func}`CeedOperatorApply()` to the input **L-vector** `U` with output in `V`: 485*bcb2dfaeSJed Brown 486*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 487*bcb2dfaeSJed Brown:end-before: //! [Operator Create] 488*bcb2dfaeSJed Brown:language: c 489*bcb2dfaeSJed Brown:start-after: //! [Operator Create] 490*bcb2dfaeSJed Brown``` 491*bcb2dfaeSJed Brown 492*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 493*bcb2dfaeSJed Brown:end-before: //! [Operator Set] 494*bcb2dfaeSJed Brown:language: c 495*bcb2dfaeSJed Brown:start-after: //! [Operator Set] 496*bcb2dfaeSJed Brown``` 497*bcb2dfaeSJed Brown 498*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c 499*bcb2dfaeSJed Brown:end-before: //! [Operator Apply] 500*bcb2dfaeSJed Brown:language: c 501*bcb2dfaeSJed Brown:start-after: //! [Operator Apply] 502*bcb2dfaeSJed Brown``` 503*bcb2dfaeSJed Brown 504*bcb2dfaeSJed BrownA number of function calls in the interface, such as {c:func}`CeedOperatorApply()`, are 505*bcb2dfaeSJed Brownintended to support asynchronous execution via their last argument, 506*bcb2dfaeSJed Brown`CeedRequest*`. The specific (pointer) value used in the above example, 507*bcb2dfaeSJed Brown`CEED_REQUEST_IMMEDIATE`, is used to express the request (from the user) for the 508*bcb2dfaeSJed Brownoperation to complete before returning from the function call, i.e. to make sure 509*bcb2dfaeSJed Brownthat the result of the operation is available in the output parameters 510*bcb2dfaeSJed Brownimmediately after the call. For a true asynchronous call, one needs to provide 511*bcb2dfaeSJed Brownthe address of a user defined variable. Such a variable can be used later to 512*bcb2dfaeSJed Brownexplicitly wait for the completion of the operation. 513*bcb2dfaeSJed Brown 514*bcb2dfaeSJed Brown## Gallery of QFunctions 515*bcb2dfaeSJed Brown 516*bcb2dfaeSJed BrownLibCEED provides a gallery of built-in {ref}`CeedQFunction`s in the {file}`gallery/` directory. 517*bcb2dfaeSJed BrownThe available QFunctions are the ones associated with the mass, the Laplacian, and 518*bcb2dfaeSJed Brownthe identity operators. To illustrate how the user can declare a {ref}`CeedQFunction` 519*bcb2dfaeSJed Brownvia the gallery of available QFunctions, consider the selection of the 520*bcb2dfaeSJed Brown{ref}`CeedQFunction` associated with a simple 1D mass matrix 521*bcb2dfaeSJed Brown(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)). 522*bcb2dfaeSJed Brown 523*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t410-qfunction.c 524*bcb2dfaeSJed Brown:language: c 525*bcb2dfaeSJed Brown:linenos: true 526*bcb2dfaeSJed Brown``` 527*bcb2dfaeSJed Brown 528*bcb2dfaeSJed Brown## Interface Principles and Evolution 529*bcb2dfaeSJed Brown 530*bcb2dfaeSJed BrownLibCEED is intended to be extensible via backends that are packaged with the 531*bcb2dfaeSJed Brownlibrary and packaged separately (possibly as a binary containing proprietary 532*bcb2dfaeSJed Browncode). Backends are registered by calling 533*bcb2dfaeSJed Brown 534*bcb2dfaeSJed Brown```{literalinclude} ../../../backends/ref/ceed-ref.c 535*bcb2dfaeSJed Brown:end-before: //! [Register] 536*bcb2dfaeSJed Brown:language: c 537*bcb2dfaeSJed Brown:start-after: //! [Register] 538*bcb2dfaeSJed Brown``` 539*bcb2dfaeSJed Brown 540*bcb2dfaeSJed Browntypically in a library initializer or "constructor" that runs automatically. 541*bcb2dfaeSJed Brown`CeedInit` uses this prefix to find an appropriate backend for the resource. 542*bcb2dfaeSJed Brown 543*bcb2dfaeSJed BrownSource (API) and binary (ABI) stability are important to libCEED. Prior to 544*bcb2dfaeSJed Brownreaching version 1.0, libCEED does not implement strict [semantic versioning](https://semver.org) across the entire interface. However, user code, 545*bcb2dfaeSJed Brownincluding libraries of {ref}`CeedQFunction`s, should be source and binary 546*bcb2dfaeSJed Browncompatible moving from 0.x.y to any later release 0.x.z. We have less experience 547*bcb2dfaeSJed Brownwith external packaging of backends and do not presently guarantee source or 548*bcb2dfaeSJed Brownbinary stability, but we intend to define stability guarantees for libCEED 1.0. 549*bcb2dfaeSJed BrownWe'd love to talk with you if you're interested in packaging backends 550*bcb2dfaeSJed Brownexternally, and will work with you on a practical stability policy. 551