# Introduction

Historically, conventional high-order finite element methods were rarely used for
industrial problems because the Jacobian rapidly loses sparsity as the order is
increased, leading to unaffordable solve times and memory requirements
{cite}`brown2010`. This effect typically limited the order of accuracy to at most
quadratic, especially because quadratic finite element formulations are
computationally advantageous in terms of floating point operations (FLOPS) per
degree of freedom (DOF)---see {numref}`fig-assembledVsmatrix-free`---despite the
fast convergence and favorable stability properties offered by higher-order
discretizations. Nowadays, high-order numerical methods, such as the spectral
element method (SEM)---a special case of the nodal p-version Finite Element Method
(FEM) which can reuse the interpolation nodes for quadrature---are employed,
especially with (nearly) affine elements, because linear constant-coefficient
problems can be solved very efficiently using the fast diagonalization method
combined with a multilevel coarse solve.
In {numref}`fig-assembledVsmatrix-free` we analyze and compare the theoretical costs
of different configurations: assembling the sparse matrix representing the action
of the operator (labeled as *assembled*), not assembling the matrix and storing
only the metric terms needed in an operator setup phase (labeled as *tensor-qstore*),
and not assembling the matrix but computing the metric terms on the fly and storing
a compact representation of the linearization at quadrature points (labeled as
*tensor*). In the right panel, we show the cost in terms of FLOPS/DOF. This metric
for computational efficiency made sense historically, when performance was mostly
limited by processor clock speed. A more relevant performance plot for current
state-of-the-art high-performance machines (for which the performance bottleneck is
mostly memory bandwidth) is shown in the left panel of
{numref}`fig-assembledVsmatrix-free`, where the memory traffic is measured in terms
of bytes/DOF. We can see that high-order methods, implemented properly with only
partial assembly, require an optimal amount of memory transfers (with respect to the
polynomial order) and near-optimal FLOPS for operator evaluation. Thus, high-order
methods in matrix-free representation not only possess favorable properties, such as
higher accuracy and faster convergence to the solution, but also manifest an
efficiency gain compared to their corresponding assembled representations.
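The asymptotic scaling behind this comparison can be sketched as follows (a
standard sum-factorization estimate; the cost model in the figure also accounts
for constants such as the number of components and quadrature points). For
tensor-product elements of polynomial degree $p$ in $d$ dimensions, each element
matrix couples $(p+1)^d$ basis functions, so an assembled matrix-vector product
costs

$$
\mathcal{O}\big((p+1)^d\big) \ \text{FLOPS/DOF} \quad \text{and} \quad
\mathcal{O}\big((p+1)^d\big) \ \text{bytes/DOF},
$$

while a matrix-free application that exploits the tensor-product structure applies
$d$ one-dimensional operators of size $(p+1) \times (p+1)$ in each direction,
costing only

$$
\mathcal{O}\big(d\,(p+1)\big) \ \text{FLOPS/DOF} \quad \text{and} \quad
\mathcal{O}(1) \ \text{bytes/DOF}
$$

for the stored linearization data at quadrature points.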
(fig-assembledvsmatrix-free)=

:::{figure} ../../img/TensorVsAssembly.png
Comparison of memory transfer and floating point operations per
degree of freedom for different representations of a linear operator for a PDE in
3D with $b$ components and variable coefficients arising due to Newton
linearization of a material nonlinearity. The representation labeled as *tensor*
computes metric terms on the fly and stores a compact representation of the
linearization at quadrature points. The representation labeled as *tensor-qstore*
pulls the metric terms into the stored representation. The *assembled*
representation uses a (block) CSR format.
:::

Furthermore, software packages that provide high-performance implementations have
often been special-purpose and intrusive. libCEED {cite}`libceed-joss-paper` is a
new library that offers a purely algebraic interface for matrix-free operator
representation and supports run-time selection of implementations tuned for a
variety of computational device types, including CPUs and GPUs. libCEED's purely
algebraic interface can unobtrusively be integrated in new and legacy software to
provide performance-portable interfaces. While libCEED's focus is on high-order
finite elements, the approach is algebraic and thus applicable to other
discretizations in factored form.
libCEED's role, as
a lightweight portable library that allows a wide variety of applications to share
highly optimized discretization kernels, is illustrated in
{numref}`fig-libCEED-backends`, where a non-exhaustive list of specialized
implementations (backends) is provided. libCEED provides a low-level Application
Programming Interface (API) for user codes so that applications with their own
discretization infrastructure (e.g., those in [PETSc](https://www.mcs.anl.gov/petsc/),
[MFEM](https://mfem.org/) and [Nek5000](https://nek5000.mcs.anl.gov/)) can evaluate
and use the core operations provided by libCEED. GPU implementations are available
via pure [CUDA](https://developer.nvidia.com/about-cuda) as well as the
[OCCA](http://github.com/libocca/occa) and [MAGMA](https://bitbucket.org/icl/magma)
libraries. CPU implementations are available via pure C and AVX intrinsics as well
as the [LIBXSMM](http://github.com/hfp/libxsmm) library. libCEED provides a unified
interface, so that users only need to write a single source code and can select the
desired specialized implementation at run time. Moreover, each process or thread
can instantiate an arbitrary number of backends.

(fig-libceed-backends)=

:::{figure} ../../img/libCEEDBackends.png
The role of libCEED as a lightweight, portable library which provides a low-level
API for efficient, specialized implementations. libCEED allows different
applications to share highly optimized discretization kernels.
:::