xref: /libCEED/doc/sphinx/source/libCEEDapi.md (revision cdb4dd67cd8edd07e3207a1c021923acf216e178)
1# Interface Concepts
2
3This page provides a brief description of the theoretical foundations and the
4practical implementation of the libCEED library.
5
6(theoretical-framework)=
7
8## Theoretical Framework
9
10In finite element formulations, the weak form of a Partial Differential Equation
11(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results
12are composed into a larger system of equations that models the entire problem on
13the global domain $\Omega$. In particular, when high-order finite elements or
14spectral elements are used, the resulting sparse matrix representation of the global
15operator is computationally expensive, with respect to both the memory transfer and
16floating point operations needed for its evaluation. libCEED provides an interface
17for matrix-free operator description that enables efficient evaluation on a variety
18of computational device types (selectable at run time). We present here the notation
19and the mathematical formulation adopted in libCEED.
20
21We start by considering the discrete residual $F(u)=0$ formulation
22in weak form. We first define the $L^2$ inner product between real-valued functions
23
24$$
25\langle v, u \rangle = \int_\Omega v u d \bm{x},
26$$
27
28where $\bm{x} \in \mathbb{R}^d \supset \Omega$.
29
30We want to find $u$ in a suitable space $V_D$,
31such that
32
33$$
34\langle  \bm v,  \bm f(u) \rangle = \int_\Omega  \bm v \cdot  \bm f_0 (u, \nabla u) + \nabla \bm v :  \bm f_1 (u, \nabla u) = 0
35$$ (residual)
36
37for all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$
38and $\bm f_1$ contain all possible sources in the problem. We notice here that
39$\bm f_0$ represents all terms in {eq}`residual` which multiply the (possibly vector-valued) test
40function $\bm v$ and $\bm f_1$ all terms which multiply its gradient $\nabla \bm v$.
41For an n-component problems in $d$ dimensions, $\bm f_0 \in \mathbb{R}^n$ and
42$\bm f_1 \in \mathbb{R}^{nd}$.
43
44:::{note}
45The notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both
46fields and spatial dimensions while a single dot represents contraction in just one,
47which should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts only over
48fields.
49:::
50
51:::{note}
52In the code, the function that represents the weak form at quadrature
53points is called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the
54library (in the {file}`examples/` directory), we store the term $\bm f_0$ directly
55into `v`, and the term $\bm f_1$ directly into `dv` (which stands for
56$\nabla \bm v$). If equation {eq}`residual` only presents a term of the
57type $\bm f_0$, the {ref}`CeedQFunction` will only have one output argument,
58namely `v`. If equation {eq}`residual` also presents a term of the type
59$\bm f_1$, then the {ref}`CeedQFunction` will have two output arguments, namely,
60`v` and `dv`.
61:::
62
63## Finite Element Operator Decomposition
64
65Finite element operators are typically defined through weak formulations of
66partial differential equations that involve integration over a computational
67mesh. The required integrals are computed by splitting them as a sum over the
68mesh elements, mapping each element to a simple *reference* element (e.g. the
69unit square) and applying a quadrature rule in reference space.
70
71This sequence of operations highlights an inherent hierarchical structure
72present in all finite element operators where the evaluation starts on *global
73(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
74*dofs on subdomains* (groups of elements), then moves to independent
75*dofs on each element*, transitions to independent *quadrature points* in
76reference space, performs the integration, and then goes back in reverse order
77to global (test) degrees of freedom on the whole mesh.
78
79This is illustrated below for the simple case of symmetric linear operator on
80third order ($Q_3$) scalar continuous ($H^1$) elements, where we use
81the notions **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
82the sets corresponding to the (true) degrees of freedom on the global mesh, the split
83local degrees of freedom on the subdomains, the split degrees of freedom on the
84mesh elements, and the values at quadrature points, respectively.
85
86We refer to the operators that connect the different types of vectors as:
87
88- Subdomain restriction $\bm{P}$
89- Element restriction $\bm{\mathcal{E}}$
90- Basis (Dofs-to-Qpts) evaluator $\bm{B}$
91- Operator at quadrature points $\bm{D}$
92
93More generally, when the test and trial space differ, they get their own
94versions of $\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$.
95
96(fig-operator-decomp)=
97
98:::{figure} ../../img/libCEED.svg
99Operator Decomposition
100:::
101
102Note that in the case of adaptive mesh refinement (AMR), the restrictions
103$\bm{P}$ and $\bm{\mathcal{E}}$ will involve not just extracting sub-vectors,
104but evaluating values at constrained degrees of freedom through the AMR interpolation.
105There can also be several levels of subdomains ($\bm P_1$, $\bm P_2$,
106etc.), and it may be convenient to split $\bm{D}$ as the product of several
107operators ($\bm D_1$, $\bm D_2$, etc.).
108
109### Terminology and Notation
110
111Vector representation/storage categories:
112
113- True degrees of freedom/unknowns, **T-vector**:
114
115  > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$
116  > - this is a non-overlapping vector decomposition
117  > - usually includes any essential (fixed) dofs.
118  >
119  > ```{image} ../../img/T-vector.svg
120  > ```
121
122- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**:
123
124  > - each unknown $i$ has exactly one copy on each processor that owns an
125  >   element containing $i$
126  > - this is an overlapping vector decomposition with overlaps only across
127  >   different processors---there is no duplication of unknowns on a single
128  >   processor
129  > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have
130  >   more than one copy, on different processors.
131  >
132  > ```{image} ../../img/L-vector.svg
133  > ```
134
135- Per element decomposition, **E-vector**:
136
137  > - each unknown $i$ has as many copies as the number of elements that contain
138  >   $i$
139  > - usually, the copies of the unknowns are grouped by the element they belong
140  >   to.
141  >
142  > ```{image} ../../img/E-vector.svg
143  > ```
144
145- In the case of AMR with hanging nodes (giving rise to hanging dofs):
146
147  > - the **L-vector** is enhanced with the hanging/dependent dofs
148  > - the additional hanging/dependent dofs are duplicated when they are shared
149  >   by multiple processors
150  > - this way, an **E-vector** can be derived from an **L-vector** without any
151  >   communications and without additional computations to derive the dependent
152  >   dofs
153  > - in other words, an entry in an **E-vector** is obtained by copying an entry
154  >   from the corresponding **L-vector**, optionally switching the sign of the
155  >   entry (for $H(\mathrm{div})$---and $H(\mathrm{curl})$-conforming spaces).
156  >
157  > ```{image} ../../img/L-vector-AMR.svg
158  > ```
159
160- In the case of variable order spaces:
161
162  > - the dependent dofs (usually on the higher-order side of a face/edge) can
163  >   be treated just like the hanging/dependent dofs case.
164
165- Quadrature point vector, **Q-vector**:
166
167  > - this is similar to **E-vector** where instead of dofs, the vector represents
168  >   values at quadrature points, grouped by element.
169
170- In many cases it is useful to distinguish two types of vectors:
171
172  > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector**
173  > - here X can be any of the T, L, E, or Q categories
174  > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector**
175  > - the solutions vector is a **T-vector**, and the RHS vector is a **T'-vector**
176  > - using the parallel prolongation operator, one can map the solution
177  >   **T-vector** to a solution **L-vector**, etc.
178
179Operator representation/storage/action categories:
180
181- Full true-dof parallel assembly, **TA**, or **A**:
182
183  > - ParCSR or similar format
184  > - the T in TA indicates that the data format represents an operator from a
185  >   **T-vector** to a **T'-vector**.
186
187- Full local assembly, **LA**:
188
189  > - CSR matrix on each rank
190  > - the parallel prolongation operator, $\bm{P}$, (and its transpose) should use
191  >   optimized matrix-free action
192  > - note that $\bm{P}$ is the operator mapping T-vectors to L-vectors.
193
194- Element matrix assembly, **EA**:
195
196  > - each element matrix is stored as a dense matrix
197  > - optimized element and parallel prolongation operators
198  > - note that the element prolongation operator is the mapping from an
199  >   **L-vector** to an **E-vector**.
200
201- Quadrature-point/partial assembly, **QA** or **PA**:
202
203  > - precompute and store $w\det(J)$ at all quadrature points in all mesh elements
204  > - the stored data can be viewed as a **Q-vector**.
205
206- Unassembled option,  **UA** or **U**:
207
208  > - no assembly step
209  > - the action uses directly the mesh node coordinates, and assumes specific
210  >   form of the coefficient, e.g. constant, piecewise-constant, or given as a
211  >   **Q-vector** (Q-coefficient).
212
213### Partial Assembly
214
215Since the global operator $\bm{A}$ is just a series of variational restrictions
216with $\bm{B}$, $\bm{\mathcal{E}}$ and $\bm{P}$, starting from its
217point-wise kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be
218performed by evaluating and storing some of the innermost variational restriction
219matrices, and applying the rest of the operators "on-the-fly". For example, one can
220compute and store a global matrix on **T-vector** level. Alternatively, one can compute
221and store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
222perform the action of $\bm{A}$ using matvecs with $\bm{P}$ or
223$\bm{P}$ and $\bm{\mathcal{E}}$. While these options are natural for
224low-order discretizations, they are not a good fit for high-order methods due to
225the amount of FLOPs needed for their evaluation, as well as the memory transfer
226needed for a matvec.
227
228Our focus in libCEED, instead, is on **partial assembly**, where we compute and
229store only $\bm{D}$ (or portions of it) and evaluate the actions of
230$\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$ on-the-fly.
231Critically for performance, we take advantage of the tensor-product structure of the
232degrees of freedom and quadrature points on *quad* and *hex* elements to perform the
233action of $\bm{B}$ without storing it as a matrix.
234
235Implemented properly, the partial assembly algorithm requires optimal amount of
236memory transfers (with respect to the polynomial order) and near-optimal FLOPs
237for operator evaluation. It consists of an operator *setup* phase, that
238evaluates and stores $\bm{D}$ and an operator *apply* (evaluation) phase that
239computes the action of $\bm{A}$ on an input vector. When desired, the setup
240phase may be done as a side-effect of evaluating a different operator, such as a
241nonlinear residual. The relative costs of the setup and apply phases are
242different depending on the physics being expressed and the representation of
243$\bm{D}$.
244
245### Parallel Decomposition
246
247After the application of each of the first three transition operators,
248$\bm{P}$, $\bm{\mathcal{E}}$ and $\bm{B}$, the operator evaluation
249is decoupled  on their ranges, so $\bm{P}$, $\bm{\mathcal{E}}$ and
250$\bm{B}$ allow us to "zoom-in" to subdomain, element and quadrature point
251level, ignoring the coupling at higher levels.
252
253Thus, a natural mapping of $\bm{A}$ on a parallel computer is to split the
254**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
255used for sparse matrices), and then split the rest of the vector types over
256computational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
257the diagram above.
258
259One of the advantages of the decomposition perspective in these settings is that
260the operators $\bm{P}$, $\bm{\mathcal{E}}$, $\bm{B}$ and
261$\bm{D}$ clearly separate the MPI parallelism
262in the operator ($\bm{P}$) from the unstructured mesh topology
263($\bm{\mathcal{E}}$), the choice of the finite element space/basis ($\bm{B}$)
264and the geometry and point-wise physics $\bm{D}$. These components also
265naturally fall in different classes of numerical algorithms -- parallel (multi-device)
266linear algebra for $\bm{P}$, sparse (on-device) linear algebra for
267$\bm{\mathcal{E}}$, dense/structured linear algebra (tensor contractions) for
268$\bm{B}$ and parallel point-wise evaluations for $\bm{D}$.
269
270Currently in libCEED, it is assumed that the host application manages the global
271**T-vectors** and the required communications among devices (which are generally
272on different compute nodes) with **P**. Our API is thus focused on the
273**L-vector** level, where the logical devices, which in the library are
274represented by the {ref}`Ceed` object, are independent. Each MPI rank can use one or
275more {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or more physical
276devices, as long as libCEED backends support such configurations. The idea is
277that every MPI rank can use any logical device it is assigned at runtime. For
278example, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI
279ranks (each using a single {ref}`Ceed` object): 2 ranks using 1 CPU socket each, and
2804 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node
281and use 5 {ref}`Ceed` objects: 1 managing all CPU cores on the 2 sockets and 4
282managing 1 GPU each. The communications among the devices, e.g. required for
283applying the action of $\bm{P}$, are currently out of scope of libCEED. The
284interface is non-blocking for all operations involving more than O(1) data,
285allowing operations performed on a coprocessor or worker threads to overlap with
286operations on the host.
287
288## API Description
289
290The libCEED API takes an algebraic approach, where the user essentially
291describes in the *frontend* the operators $\bm{\bm{\mathcal{E}}}$, $\bm{B}$, and $\bm{D}$ and the library
292provides *backend* implementations and coordinates their action to the original
293operator on **L-vector** level (i.e. independently on each device / MPI task).
294This is visualized in the schematic below; "active" and "passive" inputs/outputs
295will be discussed in more detail later.
296
297(fig-operator-schematic)=
298
299:::{figure} ../../img/libceed_schematic.svg
300Flow of data through vector types inside libCEED Operators, through backend implementations
301of $\bm{\bm{\mathcal{E}}}$, $\bm{B}$, and $\bm{D}$
302:::
303
304One of the advantages of this purely algebraic description is that it already
305includes all the finite element information, so the backends can operate on
306linear algebra level without explicit finite element code. The frontend
307description is general enough to support a wide variety of finite element
308algorithms, as well as some other types algorithms such as spectral finite
309differences. The separation of the front- and backends enables applications to
310easily switch/try different backends. It also enables backend developers to
311impact many applications from a single implementation.
312
313Our long-term vision is to include a variety of backend implementations in
314libCEED, ranging from reference kernels to highly optimized kernels targeting
315specific devices (e.g. GPUs) or specific polynomial orders. A simple reference
316backend implementation is provided in the file
317[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c).
318
319
320On the frontend, the mapping between the decomposition concepts and the code
321implementation is as follows:
322
323- **L-**, **E-** and **Q-vector** are represented as variables of type {ref}`CeedVector`.
324  (A backend may choose to operate incrementally without forming explicit **E-** or
325  **Q-vectors**.)
326- $\bm{\mathcal{E}}$ is represented as variable of type {ref}`CeedElemRestriction`.
327- $\bm{B}$ is represented as variable of type {ref}`CeedBasis`.
328- the action of $\bm{D}$ is represented as variable of type {ref}`CeedQFunction`.
329- the overall operator $\bm{\mathcal{E}}^T \bm{B}^T \bm{D} \bm{B} \bm{\mathcal{E}}$
330  is represented as variable of type
331  {ref}`CeedOperator` and its action is accessible through {c:func}`CeedOperatorApply()`.
332
333To clarify these concepts and illustrate how they are combined in the API,
334consider the implementation of the action of a simple 1D mass matrix
335(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)).
336
337```{literalinclude} ../../../tests/t500-operator.c
338:language: c
339:linenos: true
340```
341In the following figure, we specialize the schematic used above for general operators so that
342it corresponds to the specific setup and mass operators as implemented in the sample code. We show
343that the active output of the setup operator, combining the quadrature weights with the Jacobian
344information for the mesh transformation, becomes a passive input to the mass operator.  Notations
345denote the libCEED function used to set the properties of the input and output fields.
346
347(fig-operator-schematic-mass)=
348
349:::{figure} ../../img/libceed_schematic_op_setup_mass.svg
350Specific combination of $\bm{\bm{\mathcal{E}}}$, $\bm{B}$, $\bm{D}$, and input/output vectors
351corresponding to the libCEED operators in the t500-operator test
352:::
353
354The constructor
355
356```{literalinclude} ../../../tests/t500-operator.c
357:end-at: CeedInit
358:language: c
359:start-at: CeedInit
360```
361
362creates a logical device `ceed` on the specified *resource*, which could also be
363a coprocessor such as `"/nvidia/0"`. There can be any number of such devices,
364including multiple logical devices driving the same resource (though performance
365may suffer in case of oversubscription). The resource is used to locate a
366suitable backend which will have discretion over the implementations of all
367objects created with this logical device.
368
369The `setup` routine above computes and stores $\bm{D}$, in this case a
370scalar value in each quadrature point, while `mass` uses these saved values to perform
371the action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction`
372variables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()` calls:
373
374```{literalinclude} ../../../tests/t500-operator.c
375:end-before: //! [QFunction Create]
376:language: c
377:start-after: //! [QFunction Create]
378```
379
380A {ref}`CeedQFunction` performs independent operations at each quadrature point and
381the interface is intended to facilitate vectorization.  The second argument is
382an expected vector length. If greater than 1, the caller must ensure that the
383number of quadrature points `Q` is divisible by the vector length. This is
384often satisfied automatically due to the element size or by batching elements
385together to facilitate vectorization in other stages, and can always be ensured
386by padding.
387
388In addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction`
389constructors take a string representation specifying where the source for the
390implementation is found. This is used by backends that support Just-In-Time
391(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors.
392For full support across all backends, these {ref}`CeedQFunction` source files must only contain constructs mutually supported by C99, C++11, and CUDA.
393For example, explicit type casting of void pointers and explicit use of compatible arguments for {code}`math` library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA` macro.
394
395Different input and output fields are added individually, specifying the field
396name, size of the field, and evaluation mode.
397
398The size of the field is provided by a combination of the number of components
399the effect of any basis evaluations.
400
401The evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_INTERP`
402for both input and output fields indicates that the mass operator only contains terms of
403the form
404
405$$
406\int_\Omega v \cdot f_0 (u, \nabla u)
407$$
408
409where $v$ are test functions (see the {ref}`theoretical-framework`).
410More general operators, such as those of the form
411
412$$
413\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u)
414$$
415
416can be expressed.
417
418For fields with derivatives, such as with the basis evaluation mode
419(see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the
420field needs to reflect both the number of components and the geometric dimension.
421A 3-dimensional gradient on four components would therefore mean the field has a size of
42212\.
423
424The $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field,
425`basis_u`, are defined in the calls to the function {c:func}`CeedBasisCreateTensorH1Lagrange()`.
426In this example, both the mesh and the unknown field use $H^1$ Lagrange finite
427elements of order 1 and 4 respectively (the `P` argument represents the number of 1D
428degrees of freedom on each element). Both basis operators use the same integration rule,
429which is Gauss-Legendre with 8 points (the `Q` argument).
430
431```{literalinclude} ../../../tests/t500-operator.c
432:end-before: //! [Basis Create]
433:language: c
434:start-after: //! [Basis Create]
435```
436
437Other elements with this structure can be specified in terms of the `Q×P`
438matrices that evaluate values and gradients at quadrature points in one
439dimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have tensor
440product structure, such as symmetric elements on simplices, will be created
441using different constructors.
442
443The $\bm{\mathcal{E}}$ operators for the mesh nodes, `elem_restr_x`, and the unknown field,
444`elem_restr_u`, are specified in the {c:func}`CeedElemRestrictionCreate()`. Both of these
445specify directly the dof indices for each element in the `ind_x` and `ind_u`
446arrays:
447
448```{literalinclude} ../../../tests/t500-operator.c
449:end-before: //! [ElemRestr Create]
450:language: c
451:start-after: //! [ElemRestr Create]
452```
453
454```{literalinclude} ../../../tests/t500-operator.c
455:end-before: //! [ElemRestrU Create]
456:language: c
457:start-after: //! [ElemRestrU Create]
458```
459
460If the user has arrays available on a device, they can be provided using
461`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all
462contexts that involve problem-sized data.
463
464For discontinuous Galerkin and for applications such as Nek5000 that only
465explicitly store **E-vectors** (inter-element continuity has been subsumed by
466the parallel restriction $\bm{P}$), the element restriction $\bm{\mathcal{E}}$
467is the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used instead.
468We plan to support other structured representations of $\bm{\mathcal{E}}$ which will
469be added according to demand.
470There are two common approaches for supporting non-conforming elements: applying the node constraints via $\bm P$ so that the **L-vector** can be processed uniformly and applying the constraints via $\bm{\mathcal{E}}$ so that the **E-vector** is uniform.
471The former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes.
472
473These operations, $\bm{\mathcal{E}}$, $\bm{B}$, and $\bm{D}$,
474are combined with a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added
475separately with a matching field name, basis ($\bm{B}$), element restriction
476($\bm{\mathcal{E}}$), and **L-vector**. The flag
477`CEED_VECTOR_ACTIVE` indicates that the vector corresponding to that field will
478be provided to the operator when {c:func}`CeedOperatorApply()` is called. Otherwise the
479input/output will be read from/written to the specified **L-vector**.
480
481With partial assembly, we first perform a setup stage where $\bm{D}$ is evaluated
482and stored. This is accomplished by the operator `op_setup` and its application
483to `X`, the nodes of the mesh (these are needed to compute Jacobians at
484quadrature points). Note that the corresponding {c:func}`CeedOperatorApply()` has no basis
485evaluation on the output, as the quadrature data is not needed at the dofs:
486
487```{literalinclude} ../../../tests/t500-operator.c
488:end-before: //! [Setup Create]
489:language: c
490:start-after: //! [Setup Create]
491```
492
493```{literalinclude} ../../../tests/t500-operator.c
494:end-before: //! [Setup Set]
495:language: c
496:start-after: //! [Setup Set]
497```
498
499```{literalinclude} ../../../tests/t500-operator.c
500:end-before: //! [Setup Apply]
501:language: c
502:start-after: //! [Setup Apply]
503```
504
505The action of the operator is then represented by operator `op_mass` and its
506{c:func}`CeedOperatorApply()` to the input **L-vector** `U` with output in `V`:
507
508```{literalinclude} ../../../tests/t500-operator.c
509:end-before: //! [Operator Create]
510:language: c
511:start-after: //! [Operator Create]
512```
513
514```{literalinclude} ../../../tests/t500-operator.c
515:end-before: //! [Operator Set]
516:language: c
517:start-after: //! [Operator Set]
518```
519
520```{literalinclude} ../../../tests/t500-operator.c
521:end-before: //! [Operator Apply]
522:language: c
523:start-after: //! [Operator Apply]
524```
525
526A number of function calls in the interface, such as {c:func}`CeedOperatorApply()`, are
527intended to support asynchronous execution via their last argument,
528`CeedRequest*`. The specific (pointer) value used in the above example,
529`CEED_REQUEST_IMMEDIATE`, is used to express the request (from the user) for the
530operation to complete before returning from the function call, i.e. to make sure
531that the result of the operation is available in the output parameters
532immediately after the call. For a true asynchronous call, one needs to provide
533the address of a user defined variable. Such a variable can be used later to
534explicitly wait for the completion of the operation.
535
536## Gallery of QFunctions
537
538LibCEED provides a gallery of built-in {ref}`CeedQFunction`s in the {file}`gallery/` directory.
539The available QFunctions are the ones associated with the mass, the Laplacian, and
540the identity operators. To illustrate how the user can declare a {ref}`CeedQFunction`
541via the gallery of available QFunctions, consider the selection of the
542{ref}`CeedQFunction` associated with a simple 1D mass matrix
543(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)).
544
545```{literalinclude} ../../../tests/t410-qfunction.c
546:language: c
547:linenos: true
548```
549
550## Interface Principles and Evolution
551
552LibCEED is intended to be extensible via backends that are packaged with the
553library and packaged separately (possibly as a binary containing proprietary
554code). Backends are registered by calling
555
556```{literalinclude} ../../../backends/ref/ceed-ref.c
557:end-before: //! [Register]
558:language: c
559:start-after: //! [Register]
560```
561
562typically in a library initializer or "constructor" that runs automatically.
563`CeedInit` uses this prefix to find an appropriate backend for the resource.
564
565Source (API) and binary (ABI) stability are important to libCEED. Prior to
566reaching version 1.0, libCEED does not implement strict [semantic versioning](https://semver.org) across the entire interface. However, user code,
567including libraries of {ref}`CeedQFunction`s, should be source and binary
568compatible moving from 0.x.y to any later release 0.x.z. We have less experience
569with external packaging of backends and do not presently guarantee source or
570binary stability, but we intend to define stability guarantees for libCEED 1.0.
571We'd love to talk with you if you're interested in packaging backends
572externally, and will work with you on a practical stability policy.
573