# Interface Concepts

This page provides a brief description of the theoretical foundations and the
practical implementation of the libCEED library.

(theoretical-framework)=

## Theoretical Framework

In finite element formulations, the weak form of a Partial Differential Equation
(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results are
composed into a larger system of equations that models the entire problem on the
global domain $\Omega$. In particular, when high-order finite elements or
spectral elements are used, the resulting sparse matrix representation of the
global operator is computationally expensive, with respect to both the memory
transfer and floating point operations needed for its evaluation. libCEED
provides an interface for matrix-free operator description that enables
efficient evaluation on a variety of computational device types (selectable at
run time). We present here the notation and the mathematical formulation adopted
in libCEED.

We start by considering the discrete residual $F(u)=0$ formulation in weak form.
We first define the $L^2$ inner product between real-valued functions

$$
\langle v, u \rangle = \int_\Omega v u \, d\bm{x},
$$

where $\bm{x} \in \mathbb{R}^d \supset \Omega$.

We want to find $u$ in a suitable space $V_D$, such that

$$
\langle \bm v, \bm f(u) \rangle = \int_\Omega \bm v \cdot \bm f_0 (u, \nabla u) + \nabla \bm v : \bm f_1 (u, \nabla u) \, d\bm{x} = 0
$$ (residual)

for all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$
and $\bm f_1$ contain all possible sources in the problem. We notice here that
$\bm f_0$ represents all terms in {eq}`residual` which multiply the (possibly
vector-valued) test function $\bm v$ and $\bm f_1$ all terms which multiply its
gradient $\nabla \bm v$. For an $n$-component problem in $d$ dimensions,
$\bm f_0 \in \mathbb{R}^n$ and $\bm f_1 \in \mathbb{R}^{nd}$.

:::{note}
The notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both
fields and spatial dimensions while a single dot represents contraction in just
one, which should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts
only over fields.
:::

:::{note}
In the code, the function that represents the weak form at quadrature points is
called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the
library (in the {file}`examples/` directory), we store the term $\bm f_0$
directly into `v`, and the term $\bm f_1$ directly into `dv` (which stands for
$\nabla \bm v$). If equation {eq}`residual` only contains a term of the type
$\bm f_0$, the {ref}`CeedQFunction` will only have one output argument, namely
`v`. If equation {eq}`residual` also contains a term of the type $\bm f_1$, then
the {ref}`CeedQFunction` will have two output arguments, namely `v` and `dv`.
:::

## Finite Element Operator Decomposition

Finite element operators are typically defined through weak formulations of
partial differential equations that involve integration over a computational
mesh. The required integrals are computed by splitting them as a sum over the
mesh elements, mapping each element to a simple *reference* element (e.g. the
unit square) and applying a quadrature rule in reference space.
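For instance, for the $\bm f_0$ term in {eq}`residual`, this sequence amounts to
the following chain (a sketch, writing $J_e$ for the Jacobian of the map from
the reference element $\hat\Omega$ to $\Omega_e$ and $(\hat{\bm x}_q, w_q)$ for
the quadrature points and weights):

$$
\int_\Omega \bm v \cdot \bm f_0 \, d\bm{x}
= \sum_e \int_{\Omega_e} \bm v \cdot \bm f_0 \, d\bm{x}
= \sum_e \int_{\hat\Omega} \bm v \cdot \bm f_0 \, |\det J_e| \, d\hat{\bm x}
\approx \sum_e \sum_q w_q \, |\det J_e(\hat{\bm x}_q)| \, (\bm v \cdot \bm f_0)(\hat{\bm x}_q).
$$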
This sequence of operations highlights an inherent hierarchical structure
present in all finite element operators where the evaluation starts on *global
(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
*dofs on subdomains* (groups of elements), then moves to independent *dofs on
each element*, transitions to independent *quadrature points* in reference
space, performs the integration, and then goes back in reverse order to global
(test) degrees of freedom on the whole mesh.

This is illustrated below for the simple case of a symmetric linear operator on
third order ($Q_3$) scalar continuous ($H^1$) elements, where we use the
notions **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
the sets corresponding to the (true) degrees of freedom on the global mesh, the
split local degrees of freedom on the subdomains, the split degrees of freedom
on the mesh elements, and the values at quadrature points, respectively.

We refer to the operators that connect the different types of vectors as:

- Subdomain restriction $\bm{P}$
- Element restriction $\bm{\mathcal{E}}$
- Basis (Dofs-to-Qpts) evaluator $\bm{B}$
- Operator at quadrature points $\bm{D}$

More generally, when the test and trial spaces differ, they get their own
versions of $\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$.
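In this notation, the action of the assembled global operator for the symmetric
linear case illustrated below can be written schematically as the composition
(anticipating the partial assembly discussion later on this page)

$$
\bm{A} = \bm{P}^T \bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}} \bm{P},
$$

where each transpose maps values back toward the global (test) degrees of
freedom; when the test and trial spaces differ, the factors to the left of
$\bm{D}$ are replaced by their test-space counterparts.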
(fig-operator-decomp)=

:::{figure} ../../img/libCEED.svg
Operator Decomposition
:::

Note that in the case of adaptive mesh refinement (AMR), the restrictions
$\bm{P}$ and $\bm{\mathcal{E}}$ will involve not just extracting sub-vectors,
but evaluating values at constrained degrees of freedom through the AMR
interpolation. There can also be several levels of subdomains ($\bm P_1$,
$\bm P_2$, etc.), and it may be convenient to split $\bm{D}$ as the product of
several operators ($\bm D_1$, $\bm D_2$, etc.).

### Terminology and Notation

Vector representation/storage categories:

- True degrees of freedom/unknowns, **T-vector**:

  > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$
  > - this is a non-overlapping vector decomposition
  > - usually includes any essential (fixed) dofs.
  >
  > ```{image} ../../img/T-vector.svg
  > ```

- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**:

  > - each unknown $i$ has exactly one copy on each processor that owns an
  >   element containing $i$
  > - this is an overlapping vector decomposition with overlaps only across
  >   different processors---there is no duplication of unknowns on a single
  >   processor
  > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have
  >   more than one copy, on different processors.
  >
  > ```{image} ../../img/L-vector.svg
  > ```

- Per element decomposition, **E-vector**:

  > - each unknown $i$ has as many copies as the number of elements that contain
  >   $i$
  > - usually, the copies of the unknowns are grouped by the element they belong
  >   to (see the small worked example after this list).
  >
  > ```{image} ../../img/E-vector.svg
  > ```

- In the case of AMR with hanging nodes (giving rise to hanging dofs):

  > - the **L-vector** is enhanced with the hanging/dependent dofs
  > - the additional hanging/dependent dofs are duplicated when they are shared
  >   by multiple processors
  > - this way, an **E-vector** can be derived from an **L-vector** without any
  >   communications and without additional computations to derive the dependent
  >   dofs
  > - in other words, an entry in an **E-vector** is obtained by copying an entry
  >   from the corresponding **L-vector**, optionally switching the sign of the
  >   entry (for $H(\mathrm{div})$- and $H(\mathrm{curl})$-conforming spaces).
  >
  > ```{image} ../../img/L-vector-AMR.svg
  > ```

- In the case of variable order spaces:

  > - the dependent dofs (usually on the higher-order side of a face/edge) can
  >   be treated just like the hanging/dependent dofs case.

- Quadrature point vector, **Q-vector**:

  > - this is similar to an **E-vector**, where instead of dofs the vector
  >   represents values at quadrature points, grouped by element.

- In many cases it is useful to distinguish two types of vectors:

  > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector**
  > - here X can be any of the T, L, E, or Q categories
  > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector**
  > - the solution vector is a **T-vector**, and the RHS vector is a **T'-vector**
  > - using the parallel prolongation operator, one can map the solution
  >   **T-vector** to a solution **L-vector**, etc.
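As a small worked example (not taken from the library), consider a 1D mesh of
two linear elements with nodes $(x_0, x_1, x_2)$ on a single process, so the
**T-vector** and **L-vector** coincide. The **L-vector** is $(u_0, u_1, u_2)$,
the **E-vector** stacks the dofs of both elements, and the element restriction
acts as

$$
\bm{\mathcal{E}} =
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix},
\qquad
\bm{\mathcal{E}} \begin{pmatrix} u_0 \\ u_1 \\ u_2 \end{pmatrix}
= \begin{pmatrix} u_0 \\ u_1 \\ u_1 \\ u_2 \end{pmatrix},
$$

so the shared node value $u_1$ is duplicated, and the transpose
$\bm{\mathcal{E}}^T$ sums the duplicated copies back into the **L-vector**.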
Operator representation/storage/action categories:

- Full true-dof parallel assembly, **TA**, or **A**:

  > - ParCSR or similar format
  > - the T in TA indicates that the data format represents an operator from a
  >   **T-vector** to a **T'-vector**.

- Full local assembly, **LA**:

  > - CSR matrix on each rank
  > - the parallel prolongation operator $\bm{P}$ (and its transpose) should use
  >   an optimized matrix-free action
  > - note that $\bm{P}$ is the operator mapping **T-vectors** to **L-vectors**.

- Element matrix assembly, **EA**:

  > - each element matrix is stored as a dense matrix
  > - optimized element and parallel prolongation operators
  > - note that the element prolongation operator is the mapping from an
  >   **L-vector** to an **E-vector**.

- Quadrature-point/partial assembly, **QA** or **PA**:

  > - precompute and store $w\det(J)$ at all quadrature points in all mesh
  >   elements
  > - the stored data can be viewed as a **Q-vector**.

- Unassembled option, **UA** or **U**:

  > - no assembly step
  > - the action directly uses the mesh node coordinates, and assumes a specific
  >   form of the coefficient, e.g. constant, piecewise-constant, or given as a
  >   **Q-vector** (Q-coefficient).

### Partial Assembly

Since the global operator $\bm{A}$ is just a series of variational restrictions
with $\bm{B}$, $\bm{\mathcal{E}}$ and $\bm{P}$, starting from its point-wise
kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be performed by evaluating and
storing some of the innermost variational restriction matrices, and applying
the rest of the operators "on-the-fly". For example, one can compute and store
a global matrix at the **T-vector** level. Alternatively, one can compute and
store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
perform the action of $\bm{A}$ using matvecs with $\bm{P}$, or with $\bm{P}$
and $\bm{\mathcal{E}}$. While these options are natural for low-order
discretizations, they are not a good fit for high-order methods due to the
amount of FLOPs needed for their evaluation, as well as the memory transfer
needed for a matvec.

Our focus in libCEED, instead, is on **partial assembly**, where we compute and
store only $\bm{D}$ (or portions of it) and evaluate the actions of $\bm{P}$,
$\bm{\mathcal{E}}$ and $\bm{B}$ on-the-fly. Critically for performance, we take
advantage of the tensor-product structure of the degrees of freedom and
quadrature points on *quad* and *hex* elements to perform the action of
$\bm{B}$ without storing it as a matrix.
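Concretely, on a hexahedral element the basis evaluation factors into
one-dimensional pieces (a standard sum-factorization identity, where
$\bm{B}^\mathrm{1d}$ is the $Q \times P$ one-dimensional interpolation matrix):

$$
\bm{B} = \bm{B}^\mathrm{1d} \otimes \bm{B}^\mathrm{1d} \otimes \bm{B}^\mathrm{1d},
$$

so the action of $\bm{B}$ can be applied as three successive tensor
contractions, reducing the per-element cost from $\mathcal{O}(P^3 Q^3)$ for the
full matrix to roughly $\mathcal{O}(\max(P,Q)^4)$ operations.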
Implemented properly, the partial assembly algorithm requires an optimal amount
of memory transfer (with respect to the polynomial order) and near-optimal
FLOPs for operator evaluation. It consists of an operator *setup* phase that
evaluates and stores $\bm{D}$, and an operator *apply* (evaluation) phase that
computes the action of $\bm{A}$ on an input vector. When desired, the setup
phase may be done as a side-effect of evaluating a different operator, such as
a nonlinear residual. The relative costs of the setup and apply phases differ
depending on the physics being expressed and the representation of $\bm{D}$.

### Parallel Decomposition

After the application of each of the first three transition operators,
$\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$, the operator evaluation is decoupled
on their ranges, so $\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$ allow us to
"zoom-in" to subdomain, element and quadrature point level, ignoring the
coupling at higher levels.

Thus, a natural mapping of $\bm{A}$ on a parallel computer is to split the
**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
used for sparse matrices), and then split the rest of the vector types over
computational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
the diagram above.

One of the advantages of the decomposition perspective in these settings is
that the operators $\bm{P}$, $\bm{\mathcal{E}}$, $\bm{B}$ and $\bm{D}$ clearly
separate the MPI parallelism in the operator ($\bm{P}$) from the unstructured
mesh topology ($\bm{\mathcal{E}}$), the choice of the finite element
space/basis ($\bm{B}$) and the geometry and point-wise physics ($\bm{D}$).
These components also naturally fall in different classes of numerical
algorithms -- parallel (multi-device) linear algebra for $\bm{P}$, sparse
(on-device) linear algebra for $\bm{\mathcal{E}}$, dense/structured linear
algebra (tensor contractions) for $\bm{B}$ and parallel point-wise evaluations
for $\bm{D}$.

Currently in libCEED, it is assumed that the host application manages the
global **T-vectors** and the required communications among devices (which are
generally on different compute nodes) with $\bm{P}$. Our API is thus focused on
the **L-vector** level, where the logical devices, which in the library are
represented by the {ref}`Ceed` object, are independent. Each MPI rank can use
one or more {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or
more physical devices, as long as libCEED backends support such configurations.
The idea is that every MPI rank can use any logical device it is assigned at
runtime. For example, on a node with 2 CPU sockets and 4 GPUs, one may decide
to use 6 MPI ranks (each using a single {ref}`Ceed` object): 2 ranks using 1
CPU socket each, and 4 using 1 GPU each. Another choice could be to run 1 MPI
rank on the whole node and use 5 {ref}`Ceed` objects: 1 managing all CPU cores
on the 2 sockets and 4 managing 1 GPU each. The communications among the
devices, e.g. required for applying the action of $\bm{P}$, are currently out
of scope of libCEED. The interface is non-blocking for all operations involving
more than $O(1)$ data, allowing operations performed on a coprocessor or worker
threads to overlap with operations on the host.

## API Description

The libCEED API takes an algebraic approach, where the user essentially
describes in the *frontend* the operators $\bm{\mathcal{E}}$, $\bm{B}$, and
$\bm{D}$ and the library provides *backend* implementations and coordinates
their action to apply the original operator at the **L-vector** level (i.e.
independently on each device / MPI task). This is visualized in the schematic
below; "active" and "passive" inputs/outputs will be discussed in more detail
later.

(fig-operator-schematic)=

:::{figure} ../../img/libceed_schematic.svg
Flow of data through vector types inside libCEED Operators, through backend
implementations of $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$
:::

One of the advantages of this purely algebraic description is that it already
includes all the finite element information, so the backends can operate at the
linear algebra level without explicit finite element code. The frontend
description is general enough to support a wide variety of finite element
algorithms, as well as some other types of algorithms such as spectral finite
differences. The separation of the front- and backends enables applications to
easily switch/try different backends. It also enables backend developers to
impact many applications from a single implementation.

Our long-term vision is to include a variety of backend implementations in
libCEED, ranging from reference kernels to highly optimized kernels targeting
specific devices (e.g. GPUs) or specific polynomial orders. A simple reference
backend implementation is provided in the file
[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c).

On the frontend, the mapping between the decomposition concepts and the code
implementation is as follows:

- **L-**, **E-** and **Q-vector** are represented as variables of type
  {ref}`CeedVector`. (A backend may choose to operate incrementally without
  forming explicit **E-** or **Q-vectors**.)
- $\bm{\mathcal{E}}$ is represented as a variable of type {ref}`CeedElemRestriction`.
- $\bm{B}$ is represented as a variable of type {ref}`CeedBasis`.
- the action of $\bm{D}$ is represented as a variable of type {ref}`CeedQFunction`.
- the overall operator $\bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}}$
  is represented as a variable of type {ref}`CeedOperator` and its action is
  accessible through {c:func}`CeedOperatorApply()`.
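To make the composition concrete before walking through the real API, below is
a plain-C sketch (not libCEED code) of one unoptimized way a backend could
realize $\bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}}$ for a 1D
mass matrix on a deliberately tiny configuration (two linear elements); all
sizes and values are illustrative:

```c
#include <stdio.h>

/* Hypothetical dense sketch of v = E^T B^T D B E u for a 1D mass matrix:
   2 linear elements on [0, 2], 3 L-vector nodes, 2 quadrature points each. */
enum { NELEM = 2, P = 2, Q = 2, NNODES = 3 };

int main(void) {
  /* Element restriction E: global node indices of each element's dofs */
  const int ind[NELEM][P] = {{0, 1}, {1, 2}};
  /* B: values of the P basis functions at the Q reference quadrature points
     (linear elements, Gauss points +-1/sqrt(3) on [-1, 1]) */
  const double B[Q][P] = {{0.78867513, 0.21132487}, {0.21132487, 0.78867513}};
  /* D: quadrature weight times det(J), precomputed by a setup phase;
     each element has length 1, so det(J) = 1/2 and the Gauss weights are 1 */
  const double qdata[NELEM][Q] = {{0.5, 0.5}, {0.5, 0.5}};

  const double u[NNODES] = {1.0, 1.0, 1.0}; /* input L-vector */
  double v[NNODES] = {0.0, 0.0, 0.0};       /* output L-vector */

  for (int e = 0; e < NELEM; e++) {
    double ue[P], uq[Q], vq[Q], ve[P];
    for (int i = 0; i < P; i++) ue[i] = u[ind[e][i]];        /* E   */
    for (int q = 0; q < Q; q++) {                            /* B   */
      uq[q] = 0.0;
      for (int i = 0; i < P; i++) uq[q] += B[q][i] * ue[i];
    }
    for (int q = 0; q < Q; q++) vq[q] = qdata[e][q] * uq[q]; /* D   */
    for (int i = 0; i < P; i++) {                            /* B^T */
      ve[i] = 0.0;
      for (int q = 0; q < Q; q++) ve[i] += B[q][i] * vq[q];
    }
    for (int i = 0; i < P; i++) v[ind[e][i]] += ve[i];       /* E^T */
  }
  /* For u = 1, v holds the mass matrix row sums: 0.5, 1.0, 0.5 */
  for (int i = 0; i < NNODES; i++) printf("v[%d] = %g\n", i, v[i]);
  return 0;
}
```

A real backend replaces each of these loops with optimized, possibly fused
kernels, and for tensor-product elements never forms $\bm{B}$ as a dense
matrix, as discussed under partial assembly above.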
To clarify these concepts and illustrate how they are combined in the API,
consider the implementation of the action of a simple 1D mass matrix
(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)).

```{literalinclude} ../../../tests/t500-operator.c
:language: c
:linenos: true
```

In the following figure, we specialize the schematic used above for general
operators so that it corresponds to the mass matrix as implemented in the
sample code. Notations marked as "L[#]" denote the line number in the code
where that object or evaluation mode is set for the operator.

(fig-operator-schematic-mass)=

:::{figure} ../../img/libceed_schematic_op_mass.svg
Specific combination of $\bm{\mathcal{E}}$, $\bm{B}$, $\bm{D}$, and
input/output vectors corresponding to t500-operator
:::

The constructor

```{literalinclude} ../../../tests/t500-operator.c
:end-at: CeedInit
:language: c
:start-at: CeedInit
```

creates a logical device `ceed` on the specified *resource*, which could also
be a coprocessor such as `"/nvidia/0"`. There can be any number of such
devices, including multiple logical devices driving the same resource (though
performance may suffer in case of oversubscription). The resource is used to
locate a suitable backend which will have discretion over the implementations
of all objects created with this logical device.

The `setup` routine above computes and stores $\bm{D}$, in this case a scalar
value at each quadrature point, while `mass` uses these saved values to perform
the action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction`
variables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()`
calls:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [QFunction Create]
:language: c
:start-after: //! [QFunction Create]
```

A {ref}`CeedQFunction` performs independent operations at each quadrature point
and the interface is intended to facilitate vectorization. The second argument
is an expected vector length. If greater than 1, the caller must ensure that
the number of quadrature points `Q` is divisible by the vector length. This is
often satisfied automatically due to the element size or by batching elements
together to facilitate vectorization in other stages, and can always be ensured
by padding.

In addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction`
constructors take a string representation specifying where the source for the
implementation is found. This is used by backends that support Just-In-Time
(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors. For full
support across all backends, these {ref}`CeedQFunction` source files must only
contain constructs mutually supported by C99, C++11, and CUDA. For example,
explicit type casting of void pointers and explicit use of compatible arguments
for {code}`math` library functions is required, and variable-length array (VLA)
syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA`
macro.
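Since these source constraints apply to the QFunction code itself, it is worth
seeing its shape. A user {ref}`CeedQFunction` is an ordinary C function; the
following is a hedged sketch in the spirit of the `setup` and `mass` routines
of t500-operator.c (not a verbatim copy), assuming {file}`ceed.h` provides the
`CeedInt` and `CeedScalar` types:

```c
#include <ceed.h>

/* Sketch of user QFunctions for the 1D mass operator (illustrative only).
   Each is called with Q quadrature points, input arrays, and output arrays. */
static int setup(void *ctx, CeedInt Q, const CeedScalar *const *in, CeedScalar *const *out) {
  const CeedScalar *weight = in[0], *dxdX = in[1]; /* quadrature weight, dx/dX */
  CeedScalar       *qdata  = out[0];               /* stored representation of D */
  for (CeedInt i = 0; i < Q; i++) qdata[i] = weight[i] * dxdX[i];
  return 0;
}

static int mass(void *ctx, CeedInt Q, const CeedScalar *const *in, CeedScalar *const *out) {
  const CeedScalar *qdata = in[0], *u = in[1];
  CeedScalar       *v     = out[0];
  for (CeedInt i = 0; i < Q; i++) v[i] = qdata[i] * u[i]; /* pointwise action of D */
  return 0;
}
```

The `ctx` pointer carries optional user data, and the `in`/`out` arrays hold
one pointer per field, in the order the fields are added, as discussed next.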
Different input and output fields are added individually, specifying the field
name, size of the field, and evaluation mode.

The size of the field is provided by a combination of the number of components
and the effect of any basis evaluations.

The evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`)
`CEED_EVAL_INTERP` for both input and output fields indicates that the mass
operator only contains terms of the form

$$
\int_\Omega v \cdot f_0 (u, \nabla u)
$$

where $v$ are test functions (see the {ref}`theoretical-framework`). More
general operators, such as those of the form

$$
\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u)
$$

can be expressed.

For fields with derivatives, such as with the basis evaluation mode (see
{ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the
field needs to reflect both the number of components and the geometric
dimension. A 3-dimensional gradient on four components would therefore mean
the field has a size of 12.

The $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field,
`basis_u`, are defined in the calls to the function
{c:func}`CeedBasisCreateTensorH1Lagrange()`. In this example, the mesh and the
unknown field use $H^1$ Lagrange finite elements of order 1 and 4,
respectively (the `P` argument represents the number of 1D degrees of freedom
on each element). Both basis operators use the same integration rule, which is
Gauss-Legendre with 8 points (the `Q` argument).

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Basis Create]
:language: c
:start-after: //! [Basis Create]
```

Other elements with this structure can be specified in terms of the `Q×P`
matrices that evaluate values and gradients at quadrature points in one
dimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have
tensor product structure, such as symmetric elements on simplices, will be
created using different constructors.

The $\bm{\mathcal{E}}$ operators for the mesh nodes, `elem_restr_x`, and the
unknown field, `elem_restr_u`, are specified in the calls to
{c:func}`CeedElemRestrictionCreate()`. Both of these directly specify the dof
indices for each element in the `ind_x` and `ind_u` arrays:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [ElemRestr Create]
:language: c
:start-after: //! [ElemRestr Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [ElemRestrU Create]
:language: c
:start-after: //! [ElemRestrU Create]
```

If the user has arrays available on a device, they can be provided using
`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all
contexts that involve problem-sized data.
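For instance, a minimal sketch (assuming `d_u` is a device array owned by the
application and `num_dofs` its length; both names are hypothetical):

```c
/* Sketch: wrap an application-owned device array in a CeedVector, no copy. */
CeedVector u;
CeedVectorCreate(ceed, num_dofs, &u);
CeedVectorSetArray(u, CEED_MEM_DEVICE, CEED_USE_POINTER, d_u);
```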
For discontinuous Galerkin and for applications such as Nek5000 that only
explicitly store **E-vectors** (inter-element continuity has been subsumed by
the parallel restriction $\bm{P}$), the element restriction $\bm{\mathcal{E}}$
is the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used
instead. We plan to support other structured representations of
$\bm{\mathcal{E}}$, which will be added according to demand. There are two
common approaches for supporting non-conforming elements: applying the node
constraints via $\bm P$ so that the **L-vector** can be processed uniformly,
and applying the constraints via $\bm{\mathcal{E}}$ so that the **E-vector** is
uniform. The former can be done with the existing interface, while the latter
will require a generalization of the element restriction that would define
field values at constrained nodes as linear combinations of the values at
primary nodes.

These operations, $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$, are combined with
a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added
separately with a matching field name, basis ($\bm{B}$), element restriction
($\bm{\mathcal{E}}$), and **L-vector**. The flag `CEED_VECTOR_ACTIVE` indicates
that the vector corresponding to that field will be provided to the operator
when {c:func}`CeedOperatorApply()` is called. Otherwise the input/output will
be read from/written to the specified **L-vector**.

With partial assembly, we first perform a setup stage where $\bm{D}$ is
evaluated and stored. This is accomplished by the operator `op_setup` and its
application to `X`, the nodes of the mesh (these are needed to compute
Jacobians at quadrature points). Note that the corresponding
{c:func}`CeedOperatorApply()` has no basis evaluation on the output, as the
quadrature data is not needed at the dofs:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Create]
:language: c
:start-after: //! [Setup Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Set]
:language: c
:start-after: //! [Setup Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Setup Apply]
:language: c
:start-after: //! [Setup Apply]
```

The action of the operator is then represented by the operator `op_mass` and
its application via {c:func}`CeedOperatorApply()` to the input **L-vector**
`U`, with output in `V`:

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Create]
:language: c
:start-after: //! [Operator Create]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Set]
:language: c
:start-after: //! [Operator Set]
```

```{literalinclude} ../../../tests/t500-operator.c
:end-before: //! [Operator Apply]
:language: c
:start-after: //! [Operator Apply]
```

A number of function calls in the interface, such as
{c:func}`CeedOperatorApply()`, are intended to support asynchronous execution
via their last argument, `CeedRequest*`. The specific (pointer) value used in
the above example, `CEED_REQUEST_IMMEDIATE`, is used to express the request
(from the user) for the operation to complete before returning from the
function call, i.e. to make sure that the result of the operation is available
in the output parameters immediately after the call. For a true asynchronous
call, one needs to provide the address of a user-defined variable. Such a
variable can be used later to explicitly wait for the completion of the
operation.
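A hedged sketch of that pattern (assuming {c:func}`CeedRequestWait()` is the
call used for the explicit wait):

```c
/* Sketch: asynchronous operator application with an explicit wait. */
CeedRequest request;
CeedOperatorApply(op_mass, U, V, &request);
/* ... independent host work can overlap here ... */
CeedRequestWait(&request); /* block until V is ready */
```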
## Gallery of QFunctions

LibCEED provides a gallery of built-in {ref}`CeedQFunction`s in the
{file}`gallery/` directory. The available QFunctions are the ones associated
with the mass, the Laplacian, and the identity operators. To illustrate how the
user can declare a {ref}`CeedQFunction` via the gallery of available
QFunctions, consider the selection of the {ref}`CeedQFunction` associated with
a simple 1D mass matrix
(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)).

```{literalinclude} ../../../tests/t410-qfunction.c
:language: c
:linenos: true
```

## Interface Principles and Evolution

LibCEED is intended to be extensible via backends that are packaged with the
library or packaged separately (possibly as a binary containing proprietary
code). Backends are registered by calling

```{literalinclude} ../../../backends/ref/ceed-ref.c
:end-before: //! [Register]
:language: c
:start-after: //! [Register]
```

typically in a library initializer or "constructor" that runs automatically.
`CeedInit` uses this prefix to find an appropriate backend for the resource.

Source (API) and binary (ABI) stability are important to libCEED. Prior to
reaching version 1.0, libCEED does not implement strict
[semantic versioning](https://semver.org) across the entire interface. However,
user code, including libraries of {ref}`CeedQFunction`s, should be source and
binary compatible moving from 0.x.y to any later release 0.x.z. We have less
experience with external packaging of backends and do not presently guarantee
source or binary stability, but we intend to define stability guarantees for
libCEED 1.0. We'd love to talk with you if you're interested in packaging
backends externally, and will work with you on a practical stability policy.