# Introduction

Historically, conventional high-order finite element methods were rarely used for
industrial problems because the Jacobian rapidly loses sparsity as the order is
increased, leading to unaffordable solve times and memory requirements
{cite}`brown2010`. This effect typically limited the order of accuracy to at most
quadratic, especially because quadratic finite element formulations are
computationally advantageous in terms of floating point operations (FLOPS) per
degree of freedom (DOF)---see {numref}`fig-assembledVsmatrix-free`---despite the
fast convergence and favorable stability properties offered by higher-order
discretizations. Nowadays, high-order numerical methods, such as the spectral
element method (SEM)---a special case of the nodal p-version Finite Element Method
(FEM) which can reuse the interpolation nodes for quadrature---are employed,
especially with (nearly) affine elements, because linear constant-coefficient
problems can be solved very efficiently using the fast diagonalization method
combined with a multilevel coarse solve.
In {numref}`fig-assembledVsmatrix-free` we analyze and compare the theoretical costs
of different configurations: assembling the sparse matrix representing the action
of the operator (labeled as *assembled*), not assembling the matrix and storing
only the metric terms needed in an operator setup phase (labeled as *tensor-qstore*),
and not assembling the matrix but computing the metric terms on the fly and storing
a compact representation of the linearization at quadrature points (labeled as
*tensor*). In the right panel, we show the cost in terms of FLOPS/DOF. This metric
for computational efficiency made sense historically, when performance was mostly
limited by processor clock speed. A more relevant performance plot for current
state-of-the-art high-performance machines (for which the performance bottleneck is
mostly memory bandwidth) is shown in the left panel of
{numref}`fig-assembledVsmatrix-free`, where the memory traffic is measured in terms
of bytes/DOF. We can see that high-order methods, implemented properly with only
partial assembly, require an optimal amount of memory transfers (with respect to the
polynomial order) and near-optimal FLOPS for operator evaluation. Thus, high-order
methods in matrix-free representation not only possess favorable properties, such as
higher accuracy and faster convergence to the solution, but also manifest an
efficiency gain compared to their corresponding assembled representations.
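The asymptotic scaling behind this comparison can be sketched as follows (a
standard sum-factorization estimate; the cost model in the figure also accounts
for constants such as the number of components and quadrature points). For
tensor-product elements of polynomial degree $p$ in $d$ dimensions, each element
matrix couples $(p+1)^d$ basis functions, so an assembled matrix-vector product
costs

$$
\mathcal{O}\big((p+1)^d\big) \ \text{FLOPS/DOF} \quad \text{and} \quad
\mathcal{O}\big((p+1)^d\big) \ \text{bytes/DOF},
$$

while a matrix-free application that exploits the tensor-product structure applies
$d$ one-dimensional operators of size $(p+1) \times (p+1)$ in each direction,
costing only

$$
\mathcal{O}\big(d\,(p+1)\big) \ \text{FLOPS/DOF} \quad \text{and} \quad
\mathcal{O}(1) \ \text{bytes/DOF}
$$

for the stored linearization data at quadrature points.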
(fig-assembledvsmatrix-free)=

:::{figure} ../../img/TensorVsAssembly.png
Comparison of memory transfer and floating point operations per
degree of freedom for different representations of a linear operator for a PDE in
3D with $b$ components and variable coefficients arising due to Newton
linearization of a material nonlinearity. The representation labeled as *tensor*
computes metric terms on the fly and stores a compact representation of the
linearization at quadrature points. The representation labeled as *tensor-qstore*
pulls the metric terms into the stored representation. The *assembled*
representation uses a (block) CSR format.
:::

Furthermore, software packages that provide high-performance implementations have
often been special-purpose and intrusive. libCEED {cite}`libceed-joss-paper` is a
new library that offers a purely algebraic interface for matrix-free operator
representation and supports run-time selection of implementations tuned for a
variety of computational device types, including CPUs and GPUs. libCEED's purely
algebraic interface can unobtrusively be integrated in new and legacy software to
provide performance-portable interfaces. While libCEED's focus is on high-order
finite elements, the approach is algebraic and thus applicable to other
discretizations in factored form.
libCEED's role, as
a lightweight portable library that allows a wide variety of applications to share
highly optimized discretization kernels, is illustrated in
{numref}`fig-libCEED-backends`, where a non-exhaustive list of specialized
implementations (backends) is provided. libCEED provides a low-level Application
Programming Interface (API) for user codes so that applications with their own
discretization infrastructure (e.g., those in [PETSc](https://www.mcs.anl.gov/petsc/),
[MFEM](https://mfem.org/) and [Nek5000](https://nek5000.mcs.anl.gov/)) can evaluate
and use the core operations provided by libCEED. GPU implementations are available
via pure [CUDA](https://developer.nvidia.com/about-cuda) as well as the
[OCCA](http://github.com/libocca/occa) and [MAGMA](https://bitbucket.org/icl/magma)
libraries. CPU implementations are available via pure C and AVX intrinsics as well
as the [LIBXSMM](http://github.com/hfp/libxsmm) library. libCEED provides a unified
interface, so that users only need to write a single source code and can select the
desired specialized implementation at run time. Moreover, each process or thread
can instantiate an arbitrary number of backends.

(fig-libceed-backends)=

:::{figure} ../../img/libCEEDBackends.png
The role of libCEED as a lightweight, portable library which provides a low-level
API for efficient, specialized implementations. libCEED allows different
applications to share highly optimized discretization kernels.
:::