# Developer Notes

## Library Design

libCEED has a single user-facing API for creating and using the libCEED objects ({ref}`CeedVector`, {ref}`CeedBasis`, etc.).
Different Ceed backends are selected by instantiating a different {ref}`Ceed` object to create the other libCEED objects, in a [bridge pattern](https://en.wikipedia.org/wiki/Bridge_pattern).
At runtime, the user can select the different backend implementations to target different hardware, such as CPUs or GPUs.

When designing new features, developers should place the function declarations for the user-facing API in the header `/include/ceed/ceed.h`.
The basic implementation of these functions should typically be placed in `/interface/*.c` files.
The interface should pass any computationally expensive or hardware specific operations to a backend implementation.
A new method for the associated libCEED object can be added in `/include/ceed-impl.h`, with a corresponding `CEED_FTABLE_ENTRY` in `/interface/ceed.c` to allow backends to set their own implementations of this method.
Then, when creating the backend specific implementation of the object, typically found in `/backends/[impl]/ceed-[impl]-[object].c`, the developer writes the backend implementation of the specific method and calls {c:func}`CeedSetBackendFunction` to set this implementation of the method for the backend.
Any supplemental functions intended to be used in the interface or by the backends may be added to the backend API in the header `/include/ceed/backend.h`.
The basic implementation of these functions should also be placed in `/interface/*.c` files.
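
The dispatch mechanism described above can be pictured with a small self-contained sketch.
It does not use the libCEED headers: `MiniVector`, `MiniVectorScale`, `MiniVectorScale_Ref`, and `MiniVectorSetBackendFunction` are hypothetical names that stand in for a libCEED object, a user-facing function, a backend implementation, and the role played by {c:func}`CeedSetBackendFunction`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical miniature object; it only illustrates the idea of an object
// that carries a slot for a backend-provided method.
typedef struct MiniVector_private *MiniVector;
struct MiniVector_private {
  size_t  length;
  double *array;
  // Backend method slot, NULL until a backend registers an implementation
  int (*Scale)(MiniVector vec, double alpha);
};

// Interface level: register a backend implementation by method name
int MiniVectorSetBackendFunction(MiniVector vec, const char *name, int (*f)(MiniVector, double)) {
  if (strcmp(name, "Scale") != 0) return 1;  // unknown method name
  vec->Scale = f;
  return 0;
}

// Backend level: the implementation that does the actual work
int MiniVectorScale_Ref(MiniVector vec, double alpha) {
  for (size_t i = 0; i < vec->length; i++) vec->array[i] *= alpha;
  return 0;
}

// User-facing API: cheap checks here, the work itself goes to the backend
int MiniVectorScale(MiniVector vec, double alpha) {
  assert(vec != NULL);
  if (!vec->Scale) return 1;  // no backend implementation registered
  return vec->Scale(vec, alpha);
}
```

In this sketch, a call to `MiniVectorScale` fails until the backend has registered `MiniVectorScale_Ref` under the name `"Scale"`, which mirrors how a libCEED backend sets its own implementation while creating the object.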

libCEED generally follows a "CPU first" implementation strategy when adding new functionality to the user-facing API.
If there are no performance specific considerations, it is generally recommended to include a basic CPU default implementation in `/interface/*.c`.
Any new functions must be well documented and tested.
Once the user-facing API and the default implementation are in place and verified correct via tests, then the developer can focus on hardware specific implementations (AVX, CUDA, HIP, etc.) as necessary.

## Backend Inheritance

A Ceed backend is not required to implement all libCEED objects or {ref}`CeedOperator` methods.
There are three mechanisms by which a Ceed backend can inherit implementations from another Ceed backend.

1. Delegation - Developers may use {c:func}`CeedSetDelegate` to set a general delegate {ref}`Ceed` object.
   This delegate {ref}`Ceed` will provide the implementation of any libCEED objects that the parent backend does not implement.
   For example, the `/cpu/self/xsmm/serial` backend implements the `CeedTensorContract` object itself but delegates all other functionality to the `/cpu/self/opt/serial` backend.

2. Object delegation - Developers may use {c:func}`CeedSetObjectDelegate` to set a delegate {ref}`Ceed` object for a specific libCEED object.
   This delegate {ref}`Ceed` will only provide the implementation of that specific libCEED object for the parent backend.
   Object delegation has higher precedence than delegation.

3. Operator fallback - Developers may use {c:func}`CeedSetOperatorFallbackCeed` to set a {ref}`Ceed` object to provide any unimplemented {ref}`CeedOperator` methods that support preconditioning, such as {c:func}`CeedOperatorLinearAssemble`.
   The parent backend must implement the basic {ref}`CeedOperator` functionality.
   Like the delegates above, this fallback {ref}`Ceed` object should be created and set in the backend `CeedInit` function.
   In order to use operator fallback, the parent backend and fallback backend must use compatible E-vector and Q-vector layouts.
   For example, `/gpu/cuda/gen` falls back to `/gpu/cuda/ref` for missing {ref}`CeedOperator` preconditioning support methods.
   If an unimplemented method is called, then the parent `/gpu/cuda/gen` {ref}`Ceed` object uses its fallback `/gpu/cuda/ref` {ref}`Ceed` object to create a clone of the {ref}`CeedOperator`.
   This clone {ref}`CeedOperator` is then used for the unimplemented preconditioning support methods.
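
The precedence between the first two mechanisms can be illustrated with a toy lookup.
This is a sketch under stated assumptions, not the libCEED implementation: `MiniCeed` and `MiniCeedResolve` are hypothetical, objects are identified by plain strings, and operator fallback is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MINI_MAX_OBJ_DELEGATES 4

// Hypothetical miniature of a backend context; not the libCEED Ceed struct
typedef struct MiniCeed MiniCeed;
struct MiniCeed {
  const char *resource;
  MiniCeed   *delegate;  // general delegate, may be NULL
  struct {
    const char *obj_name;
    MiniCeed   *delegate;
  } obj_delegates[MINI_MAX_OBJ_DELEGATES];  // per-object delegates
  int         num_obj_delegates;
  const char *implemented[4];  // object names this backend implements itself
  int         num_implemented;
};

// Return the backend that provides `obj_name` for `ceed`: the backend itself,
// then a matching object delegate, then the general delegate.
const MiniCeed *MiniCeedResolve(const MiniCeed *ceed, const char *obj_name) {
  assert(ceed != NULL && obj_name != NULL);
  for (int i = 0; i < ceed->num_implemented; i++) {
    if (!strcmp(ceed->implemented[i], obj_name)) return ceed;
  }
  // Object delegation has higher precedence than general delegation
  for (int i = 0; i < ceed->num_obj_delegates; i++) {
    if (!strcmp(ceed->obj_delegates[i].obj_name, obj_name)) {
      return MiniCeedResolve(ceed->obj_delegates[i].delegate, obj_name);
    }
  }
  if (ceed->delegate) return MiniCeedResolve(ceed->delegate, obj_name);
  return NULL;  // no backend in the chain provides this object
}
```

With a parent that implements only `"TensorContract"`, has an object delegate for `"Basis"`, and a general delegate for everything else, the lookup returns the parent, the object delegate, and the general delegate respectively.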

## Backend Families

There are 4 general 'families' of backend implementations.
As internal data layouts are specific to backend families, it is generally not possible to delegate between backend families.

### CPU Backends

The basic CPU backend with the simplest implementation is `/cpu/self/ref/serial`.
This backend contains the basic implementations of most objects that other backends rely upon.
Most of the other CPU backends only update the {ref}`CeedOperator` and `CeedTensorContract` objects.

The `/cpu/self/ref/blocked` and `/cpu/self/opt/*` backends delegate to the `/cpu/self/ref/serial` backend.
The `/cpu/self/ref/blocked` backend updates the {ref}`CeedOperator` to use an E-vector and Q-vector ordering in which data for 8 elements is interlaced to provide better vectorization.
The `/cpu/self/opt/*` backends update the {ref}`CeedOperator` to apply the action of the operator in 1 or 8 element batches, depending on whether the blocking strategy is used.
This significantly reduces the memory required by these backends.

The `/cpu/self/avx/*` and `/cpu/self/xsmm/*` backends delegate to the corresponding `/cpu/self/opt/*` backends.
These backends update the `CeedTensorContract` objects using AVX intrinsics and libXSMM functions, respectively.

The `/cpu/self/memcheck/*` backends delegate to the `/cpu/self/ref/*` backends.
These backends replace many of the implementations with methods that include more verification checks and a memory management model that more closely matches the memory management for GPU backends.
These backends rely upon the [Valgrind](https://valgrind.org/) Memcheck tool and Valgrind headers.

### GPU Backends

The CUDA, HIP, and SYCL backend families all follow similar designs.
The CUDA and HIP backends are very similar, with minor differences.
While the SYCL backend was based upon the CUDA and HIP backends, there are more internal differences to accommodate OpenCL and Intel hardware.

The `/gpu/*/ref` backends provide basic functionality.
In these backends, the operator is applied in multiple separate kernel launches, following the libCEED operator decomposition.
First, {ref}`CeedElemRestriction` kernels map from the L-vectors to E-vectors, then {ref}`CeedBasis` kernels map from the E-vectors to Q-vectors, and then the {ref}`CeedQFunction` kernel provides the action of the user quadrature point function.
Finally, the transpose {ref}`CeedBasis` and {ref}`CeedElemRestriction` kernels are applied to go back to the E-vectors and then the L-vectors.
These kernels apply to all points across all elements in order to maximize the amount of work each kernel launch has.
Some of these kernels are compiled at runtime via NVRTC, HIPRTC, or OpenCL RTC.
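
The staged application described above can be sketched on the CPU in plain C, with one loop nest per stage standing in for one kernel launch.
Everything in this sketch is an assumption made for illustration: the 1D mesh with two linear elements, the `offsets` and `interp` values, the element-major E-vector ordering, and the name `MiniOperatorApply`; the QFunction simply scales its input.

```c
#include <assert.h>

enum { NUM_NODES = 3, NUM_ELEM = 2, P = 2, Q = 2 };

// Element restriction: L-vector index of node i in element k (hypothetical mesh)
static const int offsets[NUM_ELEM * P] = {0, 1, 1, 2};
// Basis interpolation matrix, shape [Q, P] (hypothetical values)
static const double interp[Q * P] = {0.75, 0.25, 0.25, 0.75};

// Apply v_L = E^T B^T D B E u_L, one stage at a time over all elements
void MiniOperatorApply(const double *u_L, double *v_L, double scale) {
  double u_E[NUM_ELEM * P], u_Q[NUM_ELEM * Q], v_Q[NUM_ELEM * Q], v_E[NUM_ELEM * P];

  assert(u_L != v_L);
  // Stage 1: element restriction, L-vector to E-vector
  for (int k = 0; k < NUM_ELEM; k++) {
    for (int i = 0; i < P; i++) u_E[k * P + i] = u_L[offsets[k * P + i]];
  }
  // Stage 2: basis, E-vector to Q-vector
  for (int k = 0; k < NUM_ELEM; k++) {
    for (int q = 0; q < Q; q++) {
      u_Q[k * Q + q] = 0.0;
      for (int i = 0; i < P; i++) u_Q[k * Q + q] += interp[q * P + i] * u_E[k * P + i];
    }
  }
  // Stage 3: QFunction, applied at every quadrature point of every element
  for (int j = 0; j < NUM_ELEM * Q; j++) v_Q[j] = scale * u_Q[j];
  // Stage 4: transpose basis, Q-vector to E-vector
  for (int k = 0; k < NUM_ELEM; k++) {
    for (int i = 0; i < P; i++) {
      v_E[k * P + i] = 0.0;
      for (int q = 0; q < Q; q++) v_E[k * P + i] += interp[q * P + i] * v_Q[k * Q + q];
    }
  }
  // Stage 5: transpose restriction, summing contributions at shared nodes
  for (int n = 0; n < NUM_NODES; n++) v_L[n] = 0.0;
  for (int k = 0; k < NUM_ELEM; k++) {
    for (int i = 0; i < P; i++) v_L[offsets[k * P + i]] += v_E[k * P + i];
  }
}
```

Each stage sweeps all elements before the next stage starts, which is why the intermediate E-vectors and Q-vectors are fully formed in this style of implementation.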

The `/gpu/*/shared` backends delegate to the corresponding `/gpu/*/ref` backends.
These backends use shared memory to improve performance for the {ref}`CeedBasis` kernels.
All other libCEED objects are delegated to `/gpu/*/ref`.
These kernels are compiled at runtime via NVRTC, HIPRTC, or OpenCL RTC.

The `/gpu/*/gen` backends delegate to the corresponding `/gpu/*/shared` backends.
These backends write a single comprehensive kernel to apply the action of the {ref}`CeedOperator`, significantly improving performance by eliminating intermediate data structures and reducing the total number of kernel launches required.
This kernel is compiled at runtime via NVRTC, HIPRTC, or OpenCL RTC.

The `/gpu/*/magma` backends delegate to the corresponding `/gpu/cuda/ref` and `/gpu/hip/ref` backends.
These backends provide better performance for {ref}`CeedBasis` kernels but do not have the improvements from the `/gpu/*/gen` backends for {ref}`CeedOperator`.

The `/*/*/occa` backends are an experimental feature and not part of any family.

## Internal Layouts

Ceed backends are free to use any E-vector and Q-vector data layout (including never fully forming these vectors) so long as the backend passes the `t5**` series tests and all examples.
There are several common layouts for L-vectors, E-vectors, and Q-vectors, detailed below:

- **L-vector** layouts

  - L-vectors described by a standard {ref}`CeedElemRestriction` have a layout described by the `offsets` array and `comp_stride` parameter.
    Data for node `i`, component `j`, element `k` can be found in the L-vector at index `offsets[i + k*elem_size] + j*comp_stride`.
  - L-vectors described by a strided {ref}`CeedElemRestriction` have a layout described by the `strides` array.
    Data for node `i`, component `j`, element `k` can be found in the L-vector at index `i*strides[0] + j*strides[1] + k*strides[2]`.

- **E-vector** layouts

  - If possible, backends should use {c:func}`CeedElemRestrictionSetELayout()` to use the `t2**` tests.
    If the backend uses a strided E-vector layout, then the data for node `i`, component `j`, element `k` can be found in the E-vector at index `i*layout[0] + j*layout[1] + k*layout[2]`.
  - Backends may choose to use a non-strided E-vector layout; however, the `t2**` tests will not function correctly in this case and these tests will need to be marked as allowable failures for this backend in the test suite.

- **Q-vector** layouts

  - When the size of a {ref}`CeedQFunction` field is greater than `1`, data for quadrature point `i`, component `j` can be found in the Q-vector at index `i + Q*j`, where `Q` is the total number of quadrature points in the Q-vector.
    Backends are free to provide the quadrature points in any order.
  - When the {ref}`CeedQFunction` field has `emode` `CEED_EVAL_GRAD`, data for quadrature point `i`, component `j`, derivative `k` can be found in the Q-vector at index `i + Q*j + Q*num_comp*k`.
  - Backend developers must take special care to ensure that the data in the Q-vectors for a field with `emode` `CEED_EVAL_NONE` is properly ordered when the backend uses different layouts for E-vectors and Q-vectors.
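
The index formulas listed above can be written as small helper functions.
The function names and the use of `int` and `size_t` are assumptions made for this sketch; libCEED itself uses its own integer types and does not expose helpers with these names.

```c
#include <assert.h>
#include <stddef.h>

// L-vector index for a standard (offsets based) element restriction
size_t LVectorIndexOffsets(const int *offsets, size_t elem_size, size_t comp_stride, size_t i, size_t j, size_t k) {
  assert(i < elem_size);
  return (size_t)offsets[i + k * elem_size] + j * comp_stride;
}

// L-vector index for a strided element restriction
size_t LVectorIndexStrided(const size_t strides[3], size_t i, size_t j, size_t k) {
  return i * strides[0] + j * strides[1] + k * strides[2];
}

// E-vector index for a backend that uses a strided E-vector layout
size_t EVectorIndexStrided(const size_t layout[3], size_t i, size_t j, size_t k) {
  return i * layout[0] + j * layout[1] + k * layout[2];
}

// Q-vector index for quadrature point i, component j, derivative k;
// with k = 0 this reduces to the non-gradient formula i + Q*j
size_t QVectorIndex(size_t Q, size_t num_comp, size_t i, size_t j, size_t k) {
  assert(i < Q && j < num_comp);
  return i + Q * j + Q * num_comp * k;
}
```

For example, with `Q = 4` and `num_comp = 3`, the entry for quadrature point 2, component 1, derivative 1 sits at index `2 + 4*1 + 4*3*1 = 18`.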

## CeedVector Array Access

Backend implementations are expected to separately track 'owned' and 'borrowed' memory locations.
Backends are responsible for freeing 'owned' memory; 'borrowed' memory is set by the user and backends only have read/write access to 'borrowed' memory.
For any given precision and memory type, a backend should only have 'owned' or 'borrowed' memory, not both.

Backends are responsible for tracking which memory locations contain valid data.
If the user calls {c:func}`CeedVectorTakeArray` on the only memory location that contains valid data, then the {ref}`CeedVector` is left in an *invalid state*.
To repair an *invalid state*, the user must set valid data by calling {c:func}`CeedVectorSetValue`, {c:func}`CeedVectorSetArray`, or {c:func}`CeedVectorGetArrayWrite`.

Some checks for consistency and data validity with {ref}`CeedVector` array access are performed at the interface level.
All backends may assume that array access will conform to these guidelines:

- Borrowed memory

  - {ref}`CeedVector` access to borrowed memory is set with {c:func}`CeedVectorSetArray` with `copy_mode = CEED_USE_POINTER` and revoked with {c:func}`CeedVectorTakeArray`.
    The user must first call {c:func}`CeedVectorSetArray` with `copy_mode = CEED_USE_POINTER` for the appropriate precision and memory type before calling {c:func}`CeedVectorTakeArray`.
  - {c:func}`CeedVectorTakeArray` cannot be called on a vector in an *invalid state*.

- Owned memory

  - Owned memory can be allocated by calling {c:func}`CeedVectorSetValue` or by calling {c:func}`CeedVectorSetArray` with `copy_mode = CEED_COPY_VALUES`.
  - Owned memory can be set by calling {c:func}`CeedVectorSetArray` with `copy_mode = CEED_OWN_POINTER`.
  - Owned memory can also be allocated by calling {c:func}`CeedVectorGetArrayWrite`.
    The user is responsible for manually setting the contents of the array in this case.

- Data validity

  - Internal synchronization and user calls to {c:func}`CeedVectorSync` cannot be made on a vector in an *invalid state*.
  - Calls to {c:func}`CeedVectorGetArray` and {c:func}`CeedVectorGetArrayRead` cannot be made on a vector in an *invalid state*.
  - Calls to {c:func}`CeedVectorSetArray` and {c:func}`CeedVectorSetValue` can be made on a vector in an *invalid state*.
  - Calls to {c:func}`CeedVectorGetArrayWrite` can be made on a vector in an *invalid state*.
    Data synchronization is not required for the memory location returned by {c:func}`CeedVectorGetArrayWrite`.
    The caller should assume that all data at the memory location returned by {c:func}`CeedVectorGetArrayWrite` is *invalid*.
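
A toy model of the borrowed memory rules above may help when reading backend code.
It is a sketch under strong simplifying assumptions: a single host memory location, no owned memory, and hypothetical names (`MiniVec` and its functions) that only imitate the behavior of {c:func}`CeedVectorSetArray` with `CEED_USE_POINTER`, {c:func}`CeedVectorTakeArray`, and {c:func}`CeedVectorGetArrayRead`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical host-only vector that tracks one borrowed array and its validity
typedef struct {
  double *borrowed;  // user memory the vector may read and write, never free
  bool    valid;     // whether the vector currently holds valid data
} MiniVec;

// Imitates setting a borrowed array; this also repairs an invalid state
int MiniVecSetArrayUsePointer(MiniVec *vec, double *array) {
  assert(vec != NULL && array != NULL);
  vec->borrowed = array;
  vec->valid    = true;
  return 0;
}

// Imitates revoking access to the borrowed array
int MiniVecTakeArray(MiniVec *vec, double **array) {
  assert(vec != NULL && array != NULL);
  if (!vec->borrowed) return 1;  // no borrowed array was set first
  if (!vec->valid) return 1;     // not allowed in an invalid state
  *array        = vec->borrowed;
  vec->borrowed = NULL;
  vec->valid    = false;  // the only valid memory location was taken
  return 0;
}

// Imitates read access, which requires valid data
int MiniVecGetArrayRead(const MiniVec *vec, const double **array) {
  assert(vec != NULL && array != NULL);
  if (!vec->valid) return 1;  // not allowed in an invalid state
  *array = vec->borrowed;
  return 0;
}
```

In this model, taking the array leaves the vector in an invalid state, so a later read fails until a new array is set.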

## Shape

Backends often manipulate tensors of dimension greater than 2.
It is awkward to pass fully-specified multi-dimensional arrays using C99 and certain operations will flatten/reshape the tensors for computational convenience.
We frequently use comments to document shapes using a lexicographic ordering.
For example, the comment

```c
// u has shape [dim, num_comp, Q, num_elem]
```

means that it can be traversed as

```c
for (int d = 0; d < dim; d++) {
  for (int c = 0; c < num_comp; c++) {
    for (int q = 0; q < Q; q++) {
      for (int e = 0; e < num_elem; e++) {
        u[((d*num_comp + c)*Q + q)*num_elem + e] = 0.0;  // or any per-entry value
      }
    }
  }
}
```

This ordering is sometimes referred to as row-major or C-style.
Note that flattening such as

```c
// u has shape [dim, num_comp, Q*num_elem]
```

and

```c
// u has shape [dim*num_comp, Q, num_elem]
```

are purely implicit -- one just indexes the same array using the appropriate convention.
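
To make the implicit flattening concrete, the following sketch computes the offset of the same entry under two of the shape conventions above.
The helper names `Index4` and `Index3` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

// Offset into u viewed with shape [dim, num_comp, Q, num_elem]
size_t Index4(size_t num_comp, size_t Q, size_t num_elem, size_t d, size_t c, size_t q, size_t e) {
  assert(c < num_comp && q < Q && e < num_elem);
  return ((d * num_comp + c) * Q + q) * num_elem + e;
}

// Offset into the same array viewed with shape [dim, num_comp, Q*num_elem],
// where qe = q*num_elem + e is the combined trailing index
size_t Index3(size_t num_comp, size_t Q_num_elem, size_t d, size_t c, size_t qe) {
  assert(c < num_comp && qe < Q_num_elem);
  return (d * num_comp + c) * Q_num_elem + qe;
}
```

With `num_comp = 3`, `Q = 4`, and `num_elem = 5`, the entry `(d, c, q, e) = (1, 2, 3, 4)` has offset 119 under both conventions, since the combined index is `qe = 3*5 + 4 = 19`.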

## `restrict` Semantics

QFunction arguments can be assumed to have `restrict` semantics.
That is, each input and output array must reside in distinct memory without overlap.
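
As an illustration of what this assumption permits, here is a minimal pointwise kernel written with C99 `restrict` qualifiers.
It is not a libCEED QFunction and does not use the libCEED QFunction signature; `ScaleQ` and its parameters are hypothetical.

```c
#include <assert.h>

// Hypothetical pointwise kernel: v[i] = alpha * u[i] for Q quadrature points.
// The restrict qualifiers tell the compiler that u and v never overlap,
// so it is free to reorder or vectorize the loads and stores.
int ScaleQ(int Q, double alpha, const double *restrict u, double *restrict v) {
  assert(u != v);  // callers must pass distinct, non-overlapping arrays
  for (int i = 0; i < Q; i++) v[i] = alpha * u[i];
  return 0;
}
```

Passing the same array as both input and output would violate the contract, which is why the input and output arrays must not alias.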

## Style Guide

Please check your code for style issues by running

`make format`

In addition to those automatically enforced style rules, libCEED generally follows these code style conventions:

- Variable names: `snake_case`
- Struct members: `snake_case`
- Function and method names: `PascalCase` or language specific style
- Type names: `PascalCase` or language specific style
- Constant names: `CAPS_SNAKE_CASE` or language specific style

Also, documentation files should have one sentence per line to help make git diffs clearer and less disruptive.

## Clang-tidy

Please check your code for common issues by running

`make tidy`

which uses the `clang-tidy` utility included in recent releases of Clang.
This tool is much slower than actual compilation (`make -j8` parallelism helps).
To run on a single file, use

`make interface/ceed.c.tidy`

for example.
All issues reported by `make tidy` should be fixed.

## Include-What-You-Use

Header inclusion for source files should follow the principle of 'include what you use' rather than relying upon transitive `#include` to define all symbols.

Every symbol that is used in the source file `foo.c` should be defined in `foo.c`, `foo.h`, or in a header file `#include`d in one of these two locations.
Please check your code by running the tool [`include-what-you-use`](https://include-what-you-use.org/) to see recommendations for changes to your source.
Most issues reported by `include-what-you-use` should be fixed; however, this rule is flexible to account for differences in header file organization in external libraries.
If you have `include-what-you-use` installed in a sibling directory to libCEED or set the environment variable `IWYU_CC`, then you can use the makefile target `make iwyu`.

Header files should be listed in alphabetical order, with installed headers preceding local headers and `ceed` headers being listed first.
The `ceed-f64.h` and `ceed-f32.h` headers should only be included in `ceed.h`.

```c
#include <ceed.h>
#include <ceed/backend.h>
#include <stdbool.h>
#include <string.h>
#include "ceed-avx.h"
```