xref: /libCEED/doc/sphinx/source/libCEEDapi.md (revision bcb2dfae4c301ddfdddf58806f08f6e7d17f4ea5)
1*bcb2dfaeSJed Brown# Interface Concepts
2*bcb2dfaeSJed Brown
3*bcb2dfaeSJed BrownThis page provides a brief description of the theoretical foundations and the
4*bcb2dfaeSJed Brownpractical implementation of the libCEED library.
5*bcb2dfaeSJed Brown
6*bcb2dfaeSJed Brown(theoretical-framework)=
7*bcb2dfaeSJed Brown
8*bcb2dfaeSJed Brown## Theoretical Framework
9*bcb2dfaeSJed Brown
10*bcb2dfaeSJed BrownIn finite element formulations, the weak form of a Partial Differential Equation
11*bcb2dfaeSJed Brown(PDE) is evaluated on a subdomain $\Omega_e$ (element) and the local results
12*bcb2dfaeSJed Brownare composed into a larger system of equations that models the entire problem on
13*bcb2dfaeSJed Brownthe global domain $\Omega$. In particular, when high-order finite elements or
14*bcb2dfaeSJed Brownspectral elements are used, the resulting sparse matrix representation of the global
15*bcb2dfaeSJed Brownoperator is computationally expensive, with respect to both the memory transfer and
16*bcb2dfaeSJed Brownfloating point operations needed for its evaluation. libCEED provides an interface
17*bcb2dfaeSJed Brownfor matrix-free operator description that enables efficient evaluation on a variety
18*bcb2dfaeSJed Brownof computational device types (selectable at run time). We present here the notation
19*bcb2dfaeSJed Brownand the mathematical formulation adopted in libCEED.
20*bcb2dfaeSJed Brown
21*bcb2dfaeSJed BrownWe start by considering the discrete residual $F(u)=0$ formulation
22*bcb2dfaeSJed Brownin weak form. We first define the $L^2$ inner product between real-valued functions
23*bcb2dfaeSJed Brown
24*bcb2dfaeSJed Brown$$
25*bcb2dfaeSJed Brown\langle v, u \rangle = \int_\Omega v u d \bm{x},
26*bcb2dfaeSJed Brown$$
27*bcb2dfaeSJed Brown
28*bcb2dfaeSJed Brownwhere $\bm{x} \in \mathbb{R}^d \supset \Omega$.
29*bcb2dfaeSJed Brown
30*bcb2dfaeSJed BrownWe want to find $u$ in a suitable space $V_D$,
31*bcb2dfaeSJed Brownsuch that
32*bcb2dfaeSJed Brown
33*bcb2dfaeSJed Brown$$
34*bcb2dfaeSJed Brown\langle  \bm v,  \bm f(u) \rangle = \int_\Omega  \bm v \cdot  \bm f_0 (u, \nabla u) + \nabla \bm v :  \bm f_1 (u, \nabla u) = 0
35*bcb2dfaeSJed Brown$$ (residual)
36*bcb2dfaeSJed Brown
37*bcb2dfaeSJed Brownfor all $\bm v$ in the corresponding homogeneous space $V_0$, where $\bm f_0$
38*bcb2dfaeSJed Brownand $\bm f_1$ contain all possible sources in the problem. We notice here that
39*bcb2dfaeSJed Brown$\bm f_0$ represents all terms in {math:numref}`residual` which multiply the (possibly vector-valued) test
40*bcb2dfaeSJed Brownfunction $\bm v$ and $\bm f_1$ all terms which multiply its gradient $\nabla \bm v$.
41*bcb2dfaeSJed BrownFor an n-component problems in $d$ dimensions, $\bm f_0 \in \mathbb{R}^n$ and
42*bcb2dfaeSJed Brown$\bm f_1 \in \mathbb{R}^{nd}$.
43*bcb2dfaeSJed Brown
44*bcb2dfaeSJed Brown:::{note}
45*bcb2dfaeSJed BrownThe notation $\nabla \bm v \!:\! \bm f_1$ represents contraction over both
46*bcb2dfaeSJed Brownfields and spatial dimensions while a single dot represents contraction in just one,
47*bcb2dfaeSJed Brownwhich should be clear from context, e.g., $\bm v \cdot \bm f_0$ contracts only over
48*bcb2dfaeSJed Brownfields.
49*bcb2dfaeSJed Brown:::
50*bcb2dfaeSJed Brown
51*bcb2dfaeSJed Brown:::{note}
52*bcb2dfaeSJed BrownIn the code, the function that represents the weak form at quadrature
53*bcb2dfaeSJed Brownpoints is called the {ref}`CeedQFunction`. In the {ref}`Examples` provided with the
54*bcb2dfaeSJed Brownlibrary (in the {file}`examples/` directory), we store the term $\bm f_0$ directly
55*bcb2dfaeSJed Browninto `v`, and the term $\bm f_1$ directly into `dv` (which stands for
56*bcb2dfaeSJed Brown$\nabla \bm v$). If equation {math:numref}`residual` only presents a term of the
57*bcb2dfaeSJed Browntype $\bm f_0$, the {ref}`CeedQFunction` will only have one output argument,
58*bcb2dfaeSJed Brownnamely `v`. If equation {math:numref}`residual` also presents a term of the type
59*bcb2dfaeSJed Brown$\bm f_1$, then the {ref}`CeedQFunction` will have two output arguments, namely,
60*bcb2dfaeSJed Brown`v` and `dv`.
61*bcb2dfaeSJed Brown:::
62*bcb2dfaeSJed Brown
63*bcb2dfaeSJed Brown## Finite Element Operator Decomposition
64*bcb2dfaeSJed Brown
65*bcb2dfaeSJed BrownFinite element operators are typically defined through weak formulations of
66*bcb2dfaeSJed Brownpartial differential equations that involve integration over a computational
67*bcb2dfaeSJed Brownmesh. The required integrals are computed by splitting them as a sum over the
68*bcb2dfaeSJed Brownmesh elements, mapping each element to a simple *reference* element (e.g. the
69*bcb2dfaeSJed Brownunit square) and applying a quadrature rule in reference space.
70*bcb2dfaeSJed Brown
71*bcb2dfaeSJed BrownThis sequence of operations highlights an inherent hierarchical structure
72*bcb2dfaeSJed Brownpresent in all finite element operators where the evaluation starts on *global
73*bcb2dfaeSJed Brown(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
74*bcb2dfaeSJed Brown*dofs on subdomains* (groups of elements), then moves to independent
75*bcb2dfaeSJed Brown*dofs on each element*, transitions to independent *quadrature points* in
76*bcb2dfaeSJed Brownreference space, performs the integration, and then goes back in reverse order
77*bcb2dfaeSJed Brownto global (test) degrees of freedom on the whole mesh.
78*bcb2dfaeSJed Brown
79*bcb2dfaeSJed BrownThis is illustrated below for the simple case of symmetric linear operator on
80*bcb2dfaeSJed Brownthird order ($Q_3$) scalar continuous ($H^1$) elements, where we use
81*bcb2dfaeSJed Brownthe notions **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
82*bcb2dfaeSJed Brownthe sets corresponding to the (true) degrees of freedom on the global mesh, the split
83*bcb2dfaeSJed Brownlocal degrees of freedom on the subdomains, the split degrees of freedom on the
84*bcb2dfaeSJed Brownmesh elements, and the values at quadrature points, respectively.
85*bcb2dfaeSJed Brown
86*bcb2dfaeSJed BrownWe refer to the operators that connect the different types of vectors as:
87*bcb2dfaeSJed Brown
88*bcb2dfaeSJed Brown- Subdomain restriction $\bm{P}$
89*bcb2dfaeSJed Brown- Element restriction $\bm{G}$
90*bcb2dfaeSJed Brown- Basis (Dofs-to-Qpts) evaluator $\bm{B}$
91*bcb2dfaeSJed Brown- Operator at quadrature points $\bm{D}$
92*bcb2dfaeSJed Brown
93*bcb2dfaeSJed BrownMore generally, when the test and trial space differ, they get their own
94*bcb2dfaeSJed Brownversions of $\bm{P}$, $\bm{G}$ and $\bm{B}$.
95*bcb2dfaeSJed Brown
96*bcb2dfaeSJed Brown(fig-operator-decomp)=
97*bcb2dfaeSJed Brown
98*bcb2dfaeSJed Brown:::{figure} ../../img/libCEED.png
99*bcb2dfaeSJed BrownOperator Decomposition
100*bcb2dfaeSJed Brown:::
101*bcb2dfaeSJed Brown
102*bcb2dfaeSJed BrownNote that in the case of adaptive mesh refinement (AMR), the restrictions
103*bcb2dfaeSJed Brown$\bm{P}$ and $\bm{G}$ will involve not just extracting sub-vectors,
104*bcb2dfaeSJed Brownbut evaluating values at constrained degrees of freedom through the AMR interpolation.
105*bcb2dfaeSJed BrownThere can also be several levels of subdomains ($\bm P_1$, $\bm P_2$,
106*bcb2dfaeSJed Brownetc.), and it may be convenient to split $\bm{D}$ as the product of several
107*bcb2dfaeSJed Brownoperators ($\bm D_1$, $\bm D_2$, etc.).
108*bcb2dfaeSJed Brown
109*bcb2dfaeSJed Brown### Terminology and Notation
110*bcb2dfaeSJed Brown
111*bcb2dfaeSJed BrownVector representation/storage categories:
112*bcb2dfaeSJed Brown
113*bcb2dfaeSJed Brown- True degrees of freedom/unknowns, **T-vector**:
114*bcb2dfaeSJed Brown
115*bcb2dfaeSJed Brown  > - each unknown $i$ has exactly one copy, on exactly one processor, $rank(i)$
116*bcb2dfaeSJed Brown  > - this is a non-overlapping vector decomposition
117*bcb2dfaeSJed Brown  > - usually includes any essential (fixed) dofs.
118*bcb2dfaeSJed Brown  >
119*bcb2dfaeSJed Brown  > ```{image} ../../img/T-vector.svg
120*bcb2dfaeSJed Brown  > ```
121*bcb2dfaeSJed Brown
122*bcb2dfaeSJed Brown- Local (w.r.t. processors) degrees of freedom/unknowns, **L-vector**:
123*bcb2dfaeSJed Brown
124*bcb2dfaeSJed Brown  > - each unknown $i$ has exactly one copy on each processor that owns an
125*bcb2dfaeSJed Brown  >   element containing $i$
126*bcb2dfaeSJed Brown  > - this is an overlapping vector decomposition with overlaps only across
127*bcb2dfaeSJed Brown  >   different processors---there is no duplication of unknowns on a single
128*bcb2dfaeSJed Brown  >   processor
129*bcb2dfaeSJed Brown  > - the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have
130*bcb2dfaeSJed Brown  >   more than one copy, on different processors.
131*bcb2dfaeSJed Brown  >
132*bcb2dfaeSJed Brown  > ```{image} ../../img/L-vector.svg
133*bcb2dfaeSJed Brown  > ```
134*bcb2dfaeSJed Brown
135*bcb2dfaeSJed Brown- Per element decomposition, **E-vector**:
136*bcb2dfaeSJed Brown
137*bcb2dfaeSJed Brown  > - each unknown $i$ has as many copies as the number of elements that contain
138*bcb2dfaeSJed Brown  >   $i$
139*bcb2dfaeSJed Brown  > - usually, the copies of the unknowns are grouped by the element they belong
140*bcb2dfaeSJed Brown  >   to.
141*bcb2dfaeSJed Brown  >
142*bcb2dfaeSJed Brown  > ```{image} ../../img/E-vector.svg
143*bcb2dfaeSJed Brown  > ```
144*bcb2dfaeSJed Brown
145*bcb2dfaeSJed Brown- In the case of AMR with hanging nodes (giving rise to hanging dofs):
146*bcb2dfaeSJed Brown
147*bcb2dfaeSJed Brown  > - the **L-vector** is enhanced with the hanging/dependent dofs
148*bcb2dfaeSJed Brown  > - the additional hanging/dependent dofs are duplicated when they are shared
149*bcb2dfaeSJed Brown  >   by multiple processors
150*bcb2dfaeSJed Brown  > - this way, an **E-vector** can be derived from an **L-vector** without any
151*bcb2dfaeSJed Brown  >   communications and without additional computations to derive the dependent
152*bcb2dfaeSJed Brown  >   dofs
153*bcb2dfaeSJed Brown  > - in other words, an entry in an **E-vector** is obtained by copying an entry
154*bcb2dfaeSJed Brown  >   from the corresponding **L-vector**, optionally switching the sign of the
155*bcb2dfaeSJed Brown  >   entry (for $H(\mathrm{div})$---and $H(\mathrm{curl})$-conforming spaces).
156*bcb2dfaeSJed Brown  >
157*bcb2dfaeSJed Brown  > ```{image} ../../img/L-vector-AMR.svg
158*bcb2dfaeSJed Brown  > ```
159*bcb2dfaeSJed Brown
160*bcb2dfaeSJed Brown- In the case of variable order spaces:
161*bcb2dfaeSJed Brown
162*bcb2dfaeSJed Brown  > - the dependent dofs (usually on the higher-order side of a face/edge) can
163*bcb2dfaeSJed Brown  >   be treated just like the hanging/dependent dofs case.
164*bcb2dfaeSJed Brown
165*bcb2dfaeSJed Brown- Quadrature point vector, **Q-vector**:
166*bcb2dfaeSJed Brown
167*bcb2dfaeSJed Brown  > - this is similar to **E-vector** where instead of dofs, the vector represents
168*bcb2dfaeSJed Brown  >   values at quadrature points, grouped by element.
169*bcb2dfaeSJed Brown
170*bcb2dfaeSJed Brown- In many cases it is useful to distinguish two types of vectors:
171*bcb2dfaeSJed Brown
172*bcb2dfaeSJed Brown  > - **X-vector**, or **primal X-vector**, and **X'-vector**, or **dual X-vector**
173*bcb2dfaeSJed Brown  > - here X can be any of the T, L, E, or Q categories
174*bcb2dfaeSJed Brown  > - for example, the mass matrix operator maps a **T-vector** to a **T'-vector**
175*bcb2dfaeSJed Brown  > - the solutions vector is a **T-vector**, and the RHS vector is a **T'-vector**
176*bcb2dfaeSJed Brown  > - using the parallel prolongation operator, one can map the solution
177*bcb2dfaeSJed Brown  >   **T-vector** to a solution **L-vector**, etc.
178*bcb2dfaeSJed Brown
179*bcb2dfaeSJed BrownOperator representation/storage/action categories:
180*bcb2dfaeSJed Brown
181*bcb2dfaeSJed Brown- Full true-dof parallel assembly, **TA**, or **A**:
182*bcb2dfaeSJed Brown
183*bcb2dfaeSJed Brown  > - ParCSR or similar format
184*bcb2dfaeSJed Brown  > - the T in TA indicates that the data format represents an operator from a
185*bcb2dfaeSJed Brown  >   **T-vector** to a **T'-vector**.
186*bcb2dfaeSJed Brown
187*bcb2dfaeSJed Brown- Full local assembly, **LA**:
188*bcb2dfaeSJed Brown
189*bcb2dfaeSJed Brown  > - CSR matrix on each rank
190*bcb2dfaeSJed Brown  > - the parallel prolongation operator, $\bm{P}$, (and its transpose) should use
191*bcb2dfaeSJed Brown  >   optimized matrix-free action
192*bcb2dfaeSJed Brown  > - note that $\bm{P}$ is the operator mapping T-vectors to L-vectors.
193*bcb2dfaeSJed Brown
194*bcb2dfaeSJed Brown- Element matrix assembly, **EA**:
195*bcb2dfaeSJed Brown
196*bcb2dfaeSJed Brown  > - each element matrix is stored as a dense matrix
197*bcb2dfaeSJed Brown  > - optimized element and parallel prolongation operators
198*bcb2dfaeSJed Brown  > - note that the element prolongation operator is the mapping from an
199*bcb2dfaeSJed Brown  >   **L-vector** to an **E-vector**.
200*bcb2dfaeSJed Brown
201*bcb2dfaeSJed Brown- Quadrature-point/partial assembly, **QA** or **PA**:
202*bcb2dfaeSJed Brown
203*bcb2dfaeSJed Brown  > - precompute and store $w\det(J)$ at all quadrature points in all mesh elements
204*bcb2dfaeSJed Brown  > - the stored data can be viewed as a **Q-vector**.
205*bcb2dfaeSJed Brown
206*bcb2dfaeSJed Brown- Unassembled option,  **UA** or **U**:
207*bcb2dfaeSJed Brown
208*bcb2dfaeSJed Brown  > - no assembly step
209*bcb2dfaeSJed Brown  > - the action uses directly the mesh node coordinates, and assumes specific
210*bcb2dfaeSJed Brown  >   form of the coefficient, e.g. constant, piecewise-constant, or given as a
211*bcb2dfaeSJed Brown  >   **Q-vector** (Q-coefficient).
212*bcb2dfaeSJed Brown
213*bcb2dfaeSJed Brown### Partial Assembly
214*bcb2dfaeSJed Brown
215*bcb2dfaeSJed BrownSince the global operator $\bm{A}$ is just a series of variational restrictions
216*bcb2dfaeSJed Brownwith $\bm{B}$, $\bm{G}$ and $\bm{P}$, starting from its
217*bcb2dfaeSJed Brownpoint-wise kernel $\bm{D}$, a "matvec" with $\bm{A}$ can be
218*bcb2dfaeSJed Brownperformed by evaluating and storing some of the innermost variational restriction
219*bcb2dfaeSJed Brownmatrices, and applying the rest of the operators "on-the-fly". For example, one can
220*bcb2dfaeSJed Browncompute and store a global matrix on **T-vector** level. Alternatively, one can compute
221*bcb2dfaeSJed Brownand store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
222*bcb2dfaeSJed Brownperform the action of $\bm{A}$ using matvecs with $\bm{P}$ or
223*bcb2dfaeSJed Brown$\bm{P}$ and $\bm{G}$. While these options are natural for
224*bcb2dfaeSJed Brownlow-order discretizations, they are not a good fit for high-order methods due to
225*bcb2dfaeSJed Brownthe amount of FLOPs needed for their evaluation, as well as the memory transfer
226*bcb2dfaeSJed Brownneeded for a matvec.
227*bcb2dfaeSJed Brown
228*bcb2dfaeSJed BrownOur focus in libCEED, instead, is on **partial assembly**, where we compute and
229*bcb2dfaeSJed Brownstore only $\bm{D}$ (or portions of it) and evaluate the actions of
230*bcb2dfaeSJed Brown$\bm{P}$, $\bm{G}$ and $\bm{B}$ on-the-fly.
231*bcb2dfaeSJed BrownCritically for performance, we take advantage of the tensor-product structure of the
232*bcb2dfaeSJed Browndegrees of freedom and quadrature points on *quad* and *hex* elements to perform the
233*bcb2dfaeSJed Brownaction of $\bm{B}$ without storing it as a matrix.
234*bcb2dfaeSJed Brown
235*bcb2dfaeSJed BrownImplemented properly, the partial assembly algorithm requires optimal amount of
236*bcb2dfaeSJed Brownmemory transfers (with respect to the polynomial order) and near-optimal FLOPs
237*bcb2dfaeSJed Brownfor operator evaluation. It consists of an operator *setup* phase, that
238*bcb2dfaeSJed Brownevaluates and stores $\bm{D}$ and an operator *apply* (evaluation) phase that
239*bcb2dfaeSJed Browncomputes the action of $\bm{A}$ on an input vector. When desired, the setup
240*bcb2dfaeSJed Brownphase may be done as a side-effect of evaluating a different operator, such as a
241*bcb2dfaeSJed Brownnonlinear residual. The relative costs of the setup and apply phases are
242*bcb2dfaeSJed Browndifferent depending on the physics being expressed and the representation of
243*bcb2dfaeSJed Brown$\bm{D}$.
244*bcb2dfaeSJed Brown
245*bcb2dfaeSJed Brown### Parallel Decomposition
246*bcb2dfaeSJed Brown
247*bcb2dfaeSJed BrownAfter the application of each of the first three transition operators,
248*bcb2dfaeSJed Brown$\bm{P}$, $\bm{G}$ and $\bm{B}$, the operator evaluation
249*bcb2dfaeSJed Brownis decoupled  on their ranges, so $\bm{P}$, $\bm{G}$ and
250*bcb2dfaeSJed Brown$\bm{B}$ allow us to "zoom-in" to subdomain, element and quadrature point
251*bcb2dfaeSJed Brownlevel, ignoring the coupling at higher levels.
252*bcb2dfaeSJed Brown
253*bcb2dfaeSJed BrownThus, a natural mapping of $\bm{A}$ on a parallel computer is to split the
254*bcb2dfaeSJed Brown**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
255*bcb2dfaeSJed Brownused for sparse matrices), and then split the rest of the vector types over
256*bcb2dfaeSJed Browncomputational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
257*bcb2dfaeSJed Brownthe diagram above.
258*bcb2dfaeSJed Brown
259*bcb2dfaeSJed BrownOne of the advantages of the decomposition perspective in these settings is that
260*bcb2dfaeSJed Brownthe operators $\bm{P}$, $\bm{G}$, $\bm{B}$ and
261*bcb2dfaeSJed Brown$\bm{D}$ clearly separate the MPI parallelism
262*bcb2dfaeSJed Brownin the operator ($\bm{P}$) from the unstructured mesh topology
263*bcb2dfaeSJed Brown($\bm{G}$), the choice of the finite element space/basis ($\bm{B}$)
264*bcb2dfaeSJed Brownand the geometry and point-wise physics $\bm{D}$. These components also
265*bcb2dfaeSJed Brownnaturally fall in different classes of numerical algorithms -- parallel (multi-device)
266*bcb2dfaeSJed Brownlinear algebra for $\bm{P}$, sparse (on-device) linear algebra for
267*bcb2dfaeSJed Brown$\bm{G}$, dense/structured linear algebra (tensor contractions) for
268*bcb2dfaeSJed Brown$\bm{B}$ and parallel point-wise evaluations for $\bm{D}$.
269*bcb2dfaeSJed Brown
270*bcb2dfaeSJed BrownCurrently in libCEED, it is assumed that the host application manages the global
271*bcb2dfaeSJed Brown**T-vectors** and the required communications among devices (which are generally
272*bcb2dfaeSJed Brownon different compute nodes) with **P**. Our API is thus focused on the
273*bcb2dfaeSJed Brown**L-vector** level, where the logical devices, which in the library are
274*bcb2dfaeSJed Brownrepresented by the {ref}`Ceed` object, are independent. Each MPI rank can use one or
275*bcb2dfaeSJed Brownmore {ref}`Ceed`s, and each {ref}`Ceed`, in turn, can represent one or more physical
276*bcb2dfaeSJed Browndevices, as long as libCEED backends support such configurations. The idea is
277*bcb2dfaeSJed Brownthat every MPI rank can use any logical device it is assigned at runtime. For
278*bcb2dfaeSJed Brownexample, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI
279*bcb2dfaeSJed Brownranks (each using a single {ref}`Ceed` object): 2 ranks using 1 CPU socket each, and
280*bcb2dfaeSJed Brown4 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node
281*bcb2dfaeSJed Brownand use 5 {ref}`Ceed` objects: 1 managing all CPU cores on the 2 sockets and 4
282*bcb2dfaeSJed Brownmanaging 1 GPU each. The communications among the devices, e.g. required for
283*bcb2dfaeSJed Brownapplying the action of $\bm{P}$, are currently out of scope of libCEED. The
284*bcb2dfaeSJed Browninterface is non-blocking for all operations involving more than O(1) data,
285*bcb2dfaeSJed Brownallowing operations performed on a coprocessor or worker threads to overlap with
286*bcb2dfaeSJed Brownoperations on the host.
287*bcb2dfaeSJed Brown
288*bcb2dfaeSJed Brown## API Description
289*bcb2dfaeSJed Brown
290*bcb2dfaeSJed BrownThe libCEED API takes an algebraic approach, where the user essentially
291*bcb2dfaeSJed Browndescribes in the *frontend* the operators **G**, **B** and **D** and the library
292*bcb2dfaeSJed Brownprovides *backend* implementations and coordinates their action to the original
293*bcb2dfaeSJed Brownoperator on **L-vector** level (i.e. independently on each device / MPI task).
294*bcb2dfaeSJed Brown
295*bcb2dfaeSJed BrownOne of the advantages of this purely algebraic description is that it already
296*bcb2dfaeSJed Brownincludes all the finite element information, so the backends can operate on
297*bcb2dfaeSJed Brownlinear algebra level without explicit finite element code. The frontend
298*bcb2dfaeSJed Browndescription is general enough to support a wide variety of finite element
299*bcb2dfaeSJed Brownalgorithms, as well as some other types algorithms such as spectral finite
300*bcb2dfaeSJed Browndifferences. The separation of the front- and backends enables applications to
301*bcb2dfaeSJed Browneasily switch/try different backends. It also enables backend developers to
302*bcb2dfaeSJed Brownimpact many applications from a single implementation.
303*bcb2dfaeSJed Brown
304*bcb2dfaeSJed BrownOur long-term vision is to include a variety of backend implementations in
305*bcb2dfaeSJed BrownlibCEED, ranging from reference kernels to highly optimized kernels targeting
306*bcb2dfaeSJed Brownspecific devices (e.g. GPUs) or specific polynomial orders. A simple reference
307*bcb2dfaeSJed Brownbackend implementation is provided in the file
308*bcb2dfaeSJed Brown[ceed-ref.c](https://github.com/CEED/libCEED/blob/main/backends/ref/ceed-ref.c).
309*bcb2dfaeSJed Brown
310*bcb2dfaeSJed BrownOn the frontend, the mapping between the decomposition concepts and the code
311*bcb2dfaeSJed Brownimplementation is as follows:
312*bcb2dfaeSJed Brown
313*bcb2dfaeSJed Brown- **L-**, **E-** and **Q-vector** are represented as variables of type {ref}`CeedVector`.
314*bcb2dfaeSJed Brown  (A backend may choose to operate incrementally without forming explicit **E-** or
315*bcb2dfaeSJed Brown  **Q-vectors**.)
316*bcb2dfaeSJed Brown- $\bm{G}$ is represented as variable of type {ref}`CeedElemRestriction`.
317*bcb2dfaeSJed Brown- $\bm{B}$ is represented as variable of type {ref}`CeedBasis`.
318*bcb2dfaeSJed Brown- the action of $\bm{D}$ is represented as variable of type {ref}`CeedQFunction`.
319*bcb2dfaeSJed Brown- the overall operator $\bm{G}^T \bm{B}^T \bm{D} \bm{B} \bm{G}$
320*bcb2dfaeSJed Brown  is represented as variable of type
321*bcb2dfaeSJed Brown  {ref}`CeedOperator` and its action is accessible through {c:func}`CeedOperatorApply()`.
322*bcb2dfaeSJed Brown
323*bcb2dfaeSJed BrownTo clarify these concepts and illustrate how they are combined in the API,
324*bcb2dfaeSJed Brownconsider the implementation of the action of a simple 1D mass matrix
325*bcb2dfaeSJed Brown(cf. [tests/t500-operator.c](https://github.com/CEED/libCEED/blob/main/tests/t500-operator.c)).
326*bcb2dfaeSJed Brown
327*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
328*bcb2dfaeSJed Brown:language: c
329*bcb2dfaeSJed Brown:linenos: true
330*bcb2dfaeSJed Brown```
331*bcb2dfaeSJed Brown
332*bcb2dfaeSJed BrownThe constructor
333*bcb2dfaeSJed Brown
334*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
335*bcb2dfaeSJed Brown:end-at: CeedInit
336*bcb2dfaeSJed Brown:language: c
337*bcb2dfaeSJed Brown:start-at: CeedInit
338*bcb2dfaeSJed Brown```
339*bcb2dfaeSJed Brown
340*bcb2dfaeSJed Browncreates a logical device `ceed` on the specified *resource*, which could also be
341*bcb2dfaeSJed Browna coprocessor such as `"/nvidia/0"`. There can be any number of such devices,
342*bcb2dfaeSJed Brownincluding multiple logical devices driving the same resource (though performance
343*bcb2dfaeSJed Brownmay suffer in case of oversubscription). The resource is used to locate a
344*bcb2dfaeSJed Brownsuitable backend which will have discretion over the implementations of all
345*bcb2dfaeSJed Brownobjects created with this logical device.
346*bcb2dfaeSJed Brown
347*bcb2dfaeSJed BrownThe `setup` routine above computes and stores $\bm{D}$, in this case a
348*bcb2dfaeSJed Brownscalar value in each quadrature point, while `mass` uses these saved values to perform
349*bcb2dfaeSJed Brownthe action of $\bm{D}$. These functions are turned into the {ref}`CeedQFunction`
350*bcb2dfaeSJed Brownvariables `qf_setup` and `qf_mass` in the {c:func}`CeedQFunctionCreateInterior()` calls:
351*bcb2dfaeSJed Brown
352*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
353*bcb2dfaeSJed Brown:end-before: //! [QFunction Create]
354*bcb2dfaeSJed Brown:language: c
355*bcb2dfaeSJed Brown:start-after: //! [QFunction Create]
356*bcb2dfaeSJed Brown```
357*bcb2dfaeSJed Brown
358*bcb2dfaeSJed BrownA {ref}`CeedQFunction` performs independent operations at each quadrature point and
359*bcb2dfaeSJed Brownthe interface is intended to facilitate vectorization.  The second argument is
360*bcb2dfaeSJed Brownan expected vector length. If greater than 1, the caller must ensure that the
361*bcb2dfaeSJed Brownnumber of quadrature points `Q` is divisible by the vector length. This is
362*bcb2dfaeSJed Brownoften satisfied automatically due to the element size or by batching elements
363*bcb2dfaeSJed Browntogether to facilitate vectorization in other stages, and can always be ensured
364*bcb2dfaeSJed Brownby padding.
365*bcb2dfaeSJed Brown
366*bcb2dfaeSJed BrownIn addition to the function pointers (`setup` and `mass`), {ref}`CeedQFunction`
367*bcb2dfaeSJed Brownconstructors take a string representation specifying where the source for the
368*bcb2dfaeSJed Brownimplementation is found. This is used by backends that support Just-In-Time
369*bcb2dfaeSJed Brown(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors.
370*bcb2dfaeSJed BrownFor full support across all backends, these {ref}`CeedQFunction` source files must only contain constructs mutually supported by C99, C++11, and CUDA.
371*bcb2dfaeSJed BrownFor example, explicit type casting of void pointers and explicit use of compatible arguments for {code}`math` library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED's {code}`CEED_Q_VLA` macro.
372*bcb2dfaeSJed Brown
373*bcb2dfaeSJed BrownDifferent input and output fields are added individually, specifying the field
374*bcb2dfaeSJed Brownname, size of the field, and evaluation mode.
375*bcb2dfaeSJed Brown
376*bcb2dfaeSJed BrownThe size of the field is provided by a combination of the number of components
377*bcb2dfaeSJed Brownthe effect of any basis evaluations.
378*bcb2dfaeSJed Brown
379*bcb2dfaeSJed BrownThe evaluation mode (see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_INTERP`
380*bcb2dfaeSJed Brownfor both input and output fields indicates that the mass operator only contains terms of
381*bcb2dfaeSJed Brownthe form
382*bcb2dfaeSJed Brown
383*bcb2dfaeSJed Brown$$
384*bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u)
385*bcb2dfaeSJed Brown$$
386*bcb2dfaeSJed Brown
387*bcb2dfaeSJed Brownwhere $v$ are test functions (see the {ref}`theoretical-framework`).
388*bcb2dfaeSJed BrownMore general operators, such as those of the form
389*bcb2dfaeSJed Brown
390*bcb2dfaeSJed Brown$$
391*bcb2dfaeSJed Brown\int_\Omega v \cdot f_0 (u, \nabla u) + \nabla v : f_1 (u, \nabla u)
392*bcb2dfaeSJed Brown$$
393*bcb2dfaeSJed Brown
394*bcb2dfaeSJed Browncan be expressed.
395*bcb2dfaeSJed Brown
396*bcb2dfaeSJed BrownFor fields with derivatives, such as with the basis evaluation mode
397*bcb2dfaeSJed Brown(see {ref}`CeedBasis-Typedefs and Enumerations`) `CEED_EVAL_GRAD`, the size of the
398*bcb2dfaeSJed Brownfield needs to reflect both the number of components and the geometric dimension.
399*bcb2dfaeSJed BrownA 3-dimensional gradient on four components would therefore mean the field has a size of
400*bcb2dfaeSJed Brown12\.
401*bcb2dfaeSJed Brown
402*bcb2dfaeSJed BrownThe $\bm{B}$ operators for the mesh nodes, `basis_x`, and the unknown field,
403*bcb2dfaeSJed Brown`basis_u`, are defined in the calls to the function {c:func}`CeedBasisCreateTensorH1Lagrange()`.
404*bcb2dfaeSJed BrownIn this example, both the mesh and the unknown field use $H^1$ Lagrange finite
405*bcb2dfaeSJed Brownelements of order 1 and 4 respectively (the `P` argument represents the number of 1D
406*bcb2dfaeSJed Browndegrees of freedom on each element). Both basis operators use the same integration rule,
407*bcb2dfaeSJed Brownwhich is Gauss-Legendre with 8 points (the `Q` argument).
408*bcb2dfaeSJed Brown
409*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
410*bcb2dfaeSJed Brown:end-before: //! [Basis Create]
411*bcb2dfaeSJed Brown:language: c
412*bcb2dfaeSJed Brown:start-after: //! [Basis Create]
413*bcb2dfaeSJed Brown```
414*bcb2dfaeSJed Brown
415*bcb2dfaeSJed BrownOther elements with this structure can be specified in terms of the `Q×P`
416*bcb2dfaeSJed Brownmatrices that evaluate values and gradients at quadrature points in one
417*bcb2dfaeSJed Browndimension using {c:func}`CeedBasisCreateTensorH1()`. Elements that do not have tensor
418*bcb2dfaeSJed Brownproduct structure, such as symmetric elements on simplices, will be created
419*bcb2dfaeSJed Brownusing different constructors.
420*bcb2dfaeSJed Brown
421*bcb2dfaeSJed BrownThe $\bm{G}$ operators for the mesh nodes, `elem_restr_x`, and the unknown field,
422*bcb2dfaeSJed Brown`elem_restr_u`, are specified in the {c:func}`CeedElemRestrictionCreate()`. Both of these
423*bcb2dfaeSJed Brownspecify directly the dof indices for each element in the `ind_x` and `ind_u`
424*bcb2dfaeSJed Brownarrays:
425*bcb2dfaeSJed Brown
426*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
427*bcb2dfaeSJed Brown:end-before: //! [ElemRestr Create]
428*bcb2dfaeSJed Brown:language: c
429*bcb2dfaeSJed Brown:start-after: //! [ElemRestr Create]
430*bcb2dfaeSJed Brown```
431*bcb2dfaeSJed Brown
432*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
433*bcb2dfaeSJed Brown:end-before: //! [ElemRestrU Create]
434*bcb2dfaeSJed Brown:language: c
435*bcb2dfaeSJed Brown:start-after: //! [ElemRestrU Create]
436*bcb2dfaeSJed Brown```
437*bcb2dfaeSJed Brown
438*bcb2dfaeSJed BrownIf the user has arrays available on a device, they can be provided using
439*bcb2dfaeSJed Brown`CEED_MEM_DEVICE`. This technique is used to provide no-copy interfaces in all
440*bcb2dfaeSJed Browncontexts that involve problem-sized data.
441*bcb2dfaeSJed Brown
442*bcb2dfaeSJed BrownFor discontinuous Galerkin and for applications such as Nek5000 that only
443*bcb2dfaeSJed Brownexplicitly store **E-vectors** (inter-element continuity has been subsumed by
444*bcb2dfaeSJed Brownthe parallel restriction $\bm{P}$), the element restriction $\bm{G}$
445*bcb2dfaeSJed Brownis the identity and {c:func}`CeedElemRestrictionCreateStrided()` is used instead.
446*bcb2dfaeSJed BrownWe plan to support other structured representations of $\bm{G}$ which will
447*bcb2dfaeSJed Brownbe added according to demand.
448*bcb2dfaeSJed BrownThere are two common approaches for supporting non-conforming elements: applying the node constraints via $\bm P$ so that the **L-vector** can be processed uniformly and applying the constraints via $\bm G$ so that the **E-vector** is uniform.
449*bcb2dfaeSJed BrownThe former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes.
450*bcb2dfaeSJed Brown
451*bcb2dfaeSJed BrownThese operations, $\bm{P}$, $\bm{B}$, and $\bm{D}$,
452*bcb2dfaeSJed Brownare combined with a {ref}`CeedOperator`. As with {ref}`CeedQFunction`s, operator fields are added
453*bcb2dfaeSJed Brownseparately with a matching field name, basis ($\bm{B}$), element restriction
454*bcb2dfaeSJed Brown($\bm{G}$), and **L-vector**. The flag
455*bcb2dfaeSJed Brown`CEED_VECTOR_ACTIVE` indicates that the vector corresponding to that field will
456*bcb2dfaeSJed Brownbe provided to the operator when {c:func}`CeedOperatorApply()` is called. Otherwise the
457*bcb2dfaeSJed Browninput/output will be read from/written to the specified **L-vector**.
458*bcb2dfaeSJed Brown
459*bcb2dfaeSJed BrownWith partial assembly, we first perform a setup stage where $\bm{D}$ is evaluated
460*bcb2dfaeSJed Brownand stored. This is accomplished by the operator `op_setup` and its application
461*bcb2dfaeSJed Brownto `X`, the nodes of the mesh (these are needed to compute Jacobians at
462*bcb2dfaeSJed Brownquadrature points). Note that the corresponding {c:func}`CeedOperatorApply()` has no basis
463*bcb2dfaeSJed Brownevaluation on the output, as the quadrature data is not needed at the dofs:
464*bcb2dfaeSJed Brown
465*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
466*bcb2dfaeSJed Brown:end-before: //! [Setup Create]
467*bcb2dfaeSJed Brown:language: c
468*bcb2dfaeSJed Brown:start-after: //! [Setup Create]
469*bcb2dfaeSJed Brown```
470*bcb2dfaeSJed Brown
471*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
472*bcb2dfaeSJed Brown:end-before: //! [Setup Set]
473*bcb2dfaeSJed Brown:language: c
474*bcb2dfaeSJed Brown:start-after: //! [Setup Set]
475*bcb2dfaeSJed Brown```
476*bcb2dfaeSJed Brown
477*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
478*bcb2dfaeSJed Brown:end-before: //! [Setup Apply]
479*bcb2dfaeSJed Brown:language: c
480*bcb2dfaeSJed Brown:start-after: //! [Setup Apply]
481*bcb2dfaeSJed Brown```
482*bcb2dfaeSJed Brown
483*bcb2dfaeSJed BrownThe action of the operator is then represented by operator `op_mass` and its
484*bcb2dfaeSJed Brown{c:func}`CeedOperatorApply()` to the input **L-vector** `U` with output in `V`:
485*bcb2dfaeSJed Brown
486*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
487*bcb2dfaeSJed Brown:end-before: //! [Operator Create]
488*bcb2dfaeSJed Brown:language: c
489*bcb2dfaeSJed Brown:start-after: //! [Operator Create]
490*bcb2dfaeSJed Brown```
491*bcb2dfaeSJed Brown
492*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
493*bcb2dfaeSJed Brown:end-before: //! [Operator Set]
494*bcb2dfaeSJed Brown:language: c
495*bcb2dfaeSJed Brown:start-after: //! [Operator Set]
496*bcb2dfaeSJed Brown```
497*bcb2dfaeSJed Brown
498*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t500-operator.c
499*bcb2dfaeSJed Brown:end-before: //! [Operator Apply]
500*bcb2dfaeSJed Brown:language: c
501*bcb2dfaeSJed Brown:start-after: //! [Operator Apply]
502*bcb2dfaeSJed Brown```
503*bcb2dfaeSJed Brown
504*bcb2dfaeSJed BrownA number of function calls in the interface, such as {c:func}`CeedOperatorApply()`, are
505*bcb2dfaeSJed Brownintended to support asynchronous execution via their last argument,
506*bcb2dfaeSJed Brown`CeedRequest*`. The specific (pointer) value used in the above example,
507*bcb2dfaeSJed Brown`CEED_REQUEST_IMMEDIATE`, is used to express the request (from the user) for the
508*bcb2dfaeSJed Brownoperation to complete before returning from the function call, i.e. to make sure
509*bcb2dfaeSJed Brownthat the result of the operation is available in the output parameters
510*bcb2dfaeSJed Brownimmediately after the call. For a true asynchronous call, one needs to provide
511*bcb2dfaeSJed Brownthe address of a user defined variable. Such a variable can be used later to
512*bcb2dfaeSJed Brownexplicitly wait for the completion of the operation.
513*bcb2dfaeSJed Brown
514*bcb2dfaeSJed Brown## Gallery of QFunctions
515*bcb2dfaeSJed Brown
516*bcb2dfaeSJed BrownLibCEED provides a gallery of built-in {ref}`CeedQFunction`s in the {file}`gallery/` directory.
517*bcb2dfaeSJed BrownThe available QFunctions are the ones associated with the mass, the Laplacian, and
518*bcb2dfaeSJed Brownthe identity operators. To illustrate how the user can declare a {ref}`CeedQFunction`
519*bcb2dfaeSJed Brownvia the gallery of available QFunctions, consider the selection of the
520*bcb2dfaeSJed Brown{ref}`CeedQFunction` associated with a simple 1D mass matrix
521*bcb2dfaeSJed Brown(cf. [tests/t410-qfunction.c](https://github.com/CEED/libCEED/blob/main/tests/t410-qfunction.c)).
522*bcb2dfaeSJed Brown
523*bcb2dfaeSJed Brown```{literalinclude} ../../../tests/t410-qfunction.c
524*bcb2dfaeSJed Brown:language: c
525*bcb2dfaeSJed Brown:linenos: true
526*bcb2dfaeSJed Brown```
527*bcb2dfaeSJed Brown
528*bcb2dfaeSJed Brown## Interface Principles and Evolution
529*bcb2dfaeSJed Brown
530*bcb2dfaeSJed BrownLibCEED is intended to be extensible via backends that are packaged with the
531*bcb2dfaeSJed Brownlibrary and packaged separately (possibly as a binary containing proprietary
532*bcb2dfaeSJed Browncode). Backends are registered by calling
533*bcb2dfaeSJed Brown
534*bcb2dfaeSJed Brown```{literalinclude} ../../../backends/ref/ceed-ref.c
535*bcb2dfaeSJed Brown:end-before: //! [Register]
536*bcb2dfaeSJed Brown:language: c
537*bcb2dfaeSJed Brown:start-after: //! [Register]
538*bcb2dfaeSJed Brown```
539*bcb2dfaeSJed Brown
540*bcb2dfaeSJed Browntypically in a library initializer or "constructor" that runs automatically.
541*bcb2dfaeSJed Brown`CeedInit` uses this prefix to find an appropriate backend for the resource.
542*bcb2dfaeSJed Brown
543*bcb2dfaeSJed BrownSource (API) and binary (ABI) stability are important to libCEED. Prior to
544*bcb2dfaeSJed Brownreaching version 1.0, libCEED does not implement strict [semantic versioning](https://semver.org) across the entire interface. However, user code,
545*bcb2dfaeSJed Brownincluding libraries of {ref}`CeedQFunction`s, should be source and binary
546*bcb2dfaeSJed Browncompatible moving from 0.x.y to any later release 0.x.z. We have less experience
547*bcb2dfaeSJed Brownwith external packaging of backends and do not presently guarantee source or
548*bcb2dfaeSJed Brownbinary stability, but we intend to define stability guarantees for libCEED 1.0.
549*bcb2dfaeSJed BrownWe'd love to talk with you if you're interested in packaging backends
550*bcb2dfaeSJed Brownexternally, and will work with you on a practical stability policy.
551