# Introduction

Historically, conventional high-order finite element methods were rarely used for
industrial problems because the Jacobian rapidly loses sparsity as the order is
increased, leading to unaffordable solve times and memory requirements
{cite}`brown2010`. This effect typically limited the order of accuracy to at most
quadratic, especially because quadratic finite element formulations are
computationally advantageous in terms of floating point operations (FLOPs) per
degree of freedom (DOF) (see {numref}`fig-assembledVsmatrix-free`), despite the
fast convergence and favorable stability properties offered by higher-order
discretizations. Nowadays, high-order numerical methods, such as the spectral
element method (SEM)---a special case of the nodal p-version Finite Element
Method (FEM) which can reuse the interpolation nodes for quadrature---are
employed, especially with (nearly) affine elements, because linear
constant-coefficient problems can be solved very efficiently using the fast
diagonalization method combined with a multilevel coarse solve. In
{numref}`fig-assembledVsmatrix-free` we analyze and compare the theoretical costs
of different configurations: assembling the sparse matrix representing the action
of the operator (labeled *assembled*), not assembling the matrix and storing
only the metric terms needed in an operator setup phase (labeled *tensor-qstore*),
and not assembling the matrix but instead computing the metric terms on the fly
and storing a compact representation of the linearization at quadrature points
(labeled *tensor*). In the right panel, we show the cost in terms of FLOPs/DOF.
This metric for computational efficiency made sense historically, when
performance was mostly limited by processors' clock speed. A more relevant
performance plot for current state-of-the-art high-performance machines (for
which the performance bottleneck is mostly memory bandwidth) is shown in the left
panel of {numref}`fig-assembledVsmatrix-free`, where the memory bandwidth is
measured in bytes/DOF. We can see that high-order methods, implemented properly
with only partial assembly, require an optimal amount of memory transfers (with
respect to the polynomial order) and near-optimal FLOPs for operator evaluation.
Thus, high-order methods in matrix-free representation not only possess favorable
properties, such as higher accuracy and faster convergence to the solution, but
also manifest an efficiency gain compared to their corresponding assembled
representations.

(fig-assembledvsmatrix-free)=

:::{figure} ../../img/TensorVsAssembly.png
Comparison of memory transfer and floating point operations per
degree of freedom for different representations of a linear operator for a PDE in
3D with $b$ components and variable coefficients arising due to Newton
linearization of a material nonlinearity. The representation labeled as *tensor*
computes metric terms on the fly and stores a compact representation of the
linearization at quadrature points. The representation labeled as *tensor-qstore*
pulls the metric terms into the stored representation. The *assembled* representation
uses a (block) CSR format.
:::

Furthermore, software packages that provide high-performance implementations have often
been special-purpose and intrusive. libCEED {cite}`libceed-joss-paper` is a new library that offers a purely
algebraic interface for matrix-free operator representation and supports run-time
selection of implementations tuned for a variety of computational device types,
including CPUs and GPUs. libCEED's purely algebraic interface can be unobtrusively
integrated into new and legacy software to provide performance-portable interfaces.
While libCEED's focus is on high-order finite elements, the approach is algebraic
and thus applicable to other discretizations in factored form. libCEED's role, as
a lightweight portable library that allows a wide variety of applications to share
highly optimized discretization kernels, is illustrated in
{numref}`fig-libCEED-backends`, where a non-exhaustive list of specialized
implementations (backends) is provided. libCEED provides a low-level Application
Programming Interface (API) for user codes so that applications with their own
discretization infrastructure (e.g., those in [PETSc](https://www.mcs.anl.gov/petsc/),
[MFEM](https://mfem.org/), and [Nek5000](https://nek5000.mcs.anl.gov/)) can evaluate
and use the core operations provided by libCEED. GPU implementations are available via
pure [CUDA](https://developer.nvidia.com/about-cuda) as well as the
[OCCA](http://github.com/libocca/occa) and [MAGMA](https://bitbucket.org/icl/magma)
libraries. CPU implementations are available via pure C and AVX intrinsics as well as
the [LIBXSMM](http://github.com/hfp/libxsmm) library. libCEED provides a unified
interface, so that users only need to write a single source code and can select the
desired specialized implementation at run time. Moreover, each process or thread can
instantiate an arbitrary number of backends.

(fig-libceed-backends)=

:::{figure} ../../img/libCEEDBackends.png
The role of libCEED as a lightweight, portable library which provides a low-level
API for efficient, specialized implementations. libCEED allows different applications
to share highly optimized discretization kernels.
:::
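The run-time backend selection described above can be sketched with libCEED's
basic entry points (`CeedInit`, `CeedGetResource`, `CeedDestroy`). This fragment
assumes libCEED is installed and the program is linked against it; the resource
strings shown are examples, and availability of a given backend depends on how
the library was built.

```c
#include <ceed.h>
#include <stdio.h>

int main(int argc, char **argv) {
  Ceed ceed;
  /* The backend is chosen by a resource string at run time: "/cpu/self"
     requests a CPU implementation, while e.g. "/gpu/cuda" would request a
     CUDA backend on systems built with CUDA support. */
  const char *request = (argc > 1) ? argv[1] : "/cpu/self";
  CeedInit(request, &ceed);

  /* Query which backend actually satisfied the request */
  const char *backend;
  CeedGetResource(ceed, &backend);
  printf("Using libCEED backend: %s\n", backend);

  CeedDestroy(&ceed);
  return 0;
}
```

Because the choice is made by a plain string, an application can expose it as a
command-line option and switch between CPU and GPU implementations without
recompiling, which is how the unified interface described above is typically used.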