Reference documentation for deal.II version Git 0943bc0020 20211022 11:23:14 0400

This tutorial depends on step33, step48, step59.
Table of contents  

This program was contributed by Martin Kronbichler. Many ideas presented here are the result of common code development with Niklas Fehn, Katharina Kormann, Peter Munch, and Svenja Schoeder.
This work was partly supported by the German Research Foundation (DFG) through the project "Highorder discontinuous Galerkin for the exascale" (ExaDG) within the priority program "Software for Exascale Computing" (SPPEXA).
This tutorial program solves the Euler equations of fluid dynamics using an explicit time integrator with the matrixfree framework applied to a highorder discontinuous Galerkin discretization in space. For details about the Euler system and an alternative implicit approach, we also refer to the step33 tutorial program. You might also want to look at step69 for an alternative approach to solving these equations.
The Euler equations are a conservation law, describing the motion of a compressible inviscid gas,
\[ \frac{\partial \mathbf{w}}{\partial t} + \nabla \cdot \mathbf{F}(\mathbf{w}) = \mathbf{G}(\mathbf w), \]
where the \(d+2\) components of the solution vector are \(\mathbf{w}=(\rho, \rho u_1,\ldots,\rho u_d,E)^{\mathrm T}\). Here, \(\rho\) denotes the fluid density, \({\mathbf u}=(u_1,\ldots, u_d)^\mathrm T\) the fluid velocity, and \(E\) the energy density of the gas. The velocity is not directly solved for, but rather the variable \(\rho \mathbf{u}\), the linear momentum (since this is the conserved quantity).
The Euler flux function, a \((d+2)\times d\) matrix, is defined as
\[ \mathbf F(\mathbf w) = \begin{pmatrix} \rho \mathbf{u}\\ \rho \mathbf{u} \otimes \mathbf{u} + \mathbb{I}p\\ (E+p)\mathbf{u} \end{pmatrix} \]
with \(\mathbb{I}\) the \(d\times d\) identity matrix and \(\otimes\) the outer product; its components denote the mass, momentum, and energy fluxes, respectively. The right hand side forcing is given by
\[ \mathbf G(\mathbf w) = \begin{pmatrix} 0\\ \rho\mathbf{g}\\ \rho \mathbf{u} \cdot \mathbf{g} \end{pmatrix}, \]
where the vector \(\mathbf g\) denotes the direction and magnitude of gravity. It could, however, also denote any other external force per unit mass that is acting on the fluid. (Think, for example, of the electrostatic forces exerted by an external electric field on charged particles.)
The three blocks of equations, the second involving \(d\) components, describe the conservation of mass, momentum, and energy. The pressure is not a solution variable but needs to be expressed through a "closure relationship" by the other variables; we here choose the relationship appropriate for a gas with molecules composed of two atoms, which at moderate temperatures is given by \(p=(\gamma  1) \left(E\frac 12 \rho \mathbf{u}\cdot \mathbf{u}\right)\) with the constant \(\gamma = 1.4\).
For spatial discretization, we use a highorder discontinuous Galerkin (DG) discretization, using a solution expansion of the form
\[ \mathbf{w}_h(\mathbf{x}, t) = \sum_{j=1}^{n_\mathbf{dofs}} \boldsymbol{\varphi}_j(\mathbf{x}) {w}_j(t). \]
Here, \(\boldsymbol{\varphi}_j\) denotes the \(j\)th basis function, written in vector form with separate shape functions for the different components and letting \(w_j(t)\) go through the density, momentum, and energy variables, respectively. In this form, the space dependence is contained in the shape functions and the time dependence in the unknown coefficients \(w_j\). As opposed to the continuous finite element method where some shape functions span across element boundaries, the shape functions are local to a single element in DG methods, with a discontinuity from one element to the next. The connection of the solution from one cell to its neighbors is instead imposed by the numerical fluxes specified below. This allows for some additional flexibility, for example to introduce directionality in the numerical method by, e.g., upwinding.
DG methods are popular methods for solving problems of transport character because they combine low dispersion errors with controllable dissipation on barely resolved scales. This makes them particularly attractive for simulation in the field of fluid dynamics where a wide range of active scales needs to be represented and inadequately resolved features are prone to disturb the important wellresolved features. Furthermore, highorder DG methods are wellsuited for modern hardware with the right implementation. At the same time, DG methods are no silver bullet. In particular when the solution develops discontinuities (shocks), as is typical for the Euler equations in some flow regimes, highorder DG methods tend to oscillatory solutions, like all highorder methods when not using flux or slopelimiters. This is a consequence of Godunov's theorem that states that any total variation limited (TVD) scheme that is linear (like a basic DG discretization) can at most be firstorder accurate. Put differently, since DG methods aim for higher order accuracy, they cannot be TVD on solutions that develop shocks. Even though some communities claim that the numerical flux in DG methods can control dissipation, this is of limited value unless all shocks in a problem align with cell boundaries. Any shock that passes through the interior of cells will again produce oscillatory components due to the highorder polynomials. In the finite element and DG communities, there exist a number of different approaches to deal with shocks, for example the introduction of artificial diffusion on troubled cells (using a troubledcell indicator based e.g. on a modal decomposition of the solution), a switch to dissipative loworder finite volume methods on a subgrid, or the addition of some limiting procedures. Given the ample possibilities in this context, combined with the considerable implementation effort, we here refrain from the regime of the Euler equations with pronounced shocks, and rather concentrate on the regime of subsonic flows with wavelike phenomena. For a method that works well with shocks (but is more expensive per unknown), we refer to the step69 tutorial program.
For the derivation of the DG formulation, we multiply the Euler equations with test functions \(\mathbf{v}\) and integrate over an individual cell \(K\), which gives
\[ \left(\mathbf{v}, \frac{\partial \mathbf{w}}{\partial t}\right)_{K} + \left(\mathbf{v}, \nabla \cdot \mathbf{F}(\mathbf{w})\right)_{K} = \left(\mathbf{v},\mathbf{G}(\mathbf w)\right)_{K}. \]
We then integrate the second term by parts, moving the divergence from the solution slot to the test function slot, and producing an integral over the element boundary:
\[ \left(\mathbf{v}, \frac{\partial \mathbf{w}}{\partial t}\right)_{K}  \left(\nabla \mathbf{v}, \mathbf{F}(\mathbf{w})\right)_{K} + \left<\mathbf{v}, \mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w}) \right>_{\partial K} = \left(\mathbf{v},\mathbf{G}(\mathbf w)\right)_{K}. \]
In the surface integral, we have replaced the term \(\mathbf{F}(\mathbf w)\) by the term \(\widehat{\mathbf{F}}(\mathbf w)\), the numerical flux. The role of the numerical flux is to connect the solution on neighboring elements and weakly impose continuity of the solution. This ensures that the global coupling of the PDE is reflected in the discretization, despite independent basis functions on the cells. The connectivity to the neighbor is included by defining the numerical flux as a function \(\widehat{\mathbf{F}}(\mathbf w^, \mathbf w^+)\) of the solution from both sides of an interior face, \(\mathbf w^\) and \(\mathbf w^+\). A basic property we require is that the numerical flux needs to be conservative. That is, we want all information (i.e., mass, momentum, and energy) that leaves a cell over a face to enter the neighboring cell in its entirety and vice versa. This can be expressed as \(\widehat{\mathbf{F}}(\mathbf w^, \mathbf w^+) = \widehat{\mathbf{F}}(\mathbf w^+, \mathbf w^)\), meaning that the numerical flux evaluates to the same result from either side. Combined with the fact that the numerical flux is multiplied by the unit outer normal vector on the face under consideration, which points in opposite direction from the two sides, we see that the conservation is fulfilled. An alternative point of view of the numerical flux is as a singlevalued intermediate state that links the solution weakly from both sides.
There is a large number of numerical flux functions available, also called Riemann solvers. For the Euler equations, there exist socalled exact Riemann solvers – meaning that the states from both sides are combined in a way that is consistent with the Euler equations along a discontinuity – and approximate Riemann solvers, which violate some physical properties and rely on other mechanisms to render the scheme accurate overall. Approximate Riemann solvers have the advantage of beging cheaper to compute. Most flux functions have their origin in the finite volume community, which are similar to DG methods with polynomial degree 0 within the cells (called volumes). As the volume integral of the Euler operator \(\mathbf{F}\) would disappear for constant solution and test functions, the numerical flux must fully represent the physical operator, explaining why there has been a large body of research in that community. For DG methods, consistency is guaranteed by higher order polynomials within the cells, making the numerical flux less of an issue and usually affecting only the convergence rate, e.g., whether the solution converges as \(\mathcal O(h^p)\), \(\mathcal O(h^{p+1/2})\) or \(\mathcal O(h^{p+1})\) in the \(L_2\) norm for polynomials of degree \(p\). The numerical flux can thus be seen as a mechanism to select more advantageous dissipation/dispersion properties or regarding the extremal eigenvalue of the discretized and linearized operator, which affect the maximal admissible time step size in explicit time integrators.
In this tutorial program, we implement two variants of fluxes that can be controlled via a switch in the program (of course, it would be easy to make them a run time parameter controlled via an input file). The first flux is the local Lax–Friedrichs flux
\[ \hat{\mathbf{F}}(\mathbf{w}^,\mathbf{w}^+) = \frac{\mathbf{F}(\mathbf{w}^)+\mathbf{F}(\mathbf{w}^+)}{2} + \frac{\lambda}{2}\left[\mathbf{w}^\mathbf{w}^+\right]\otimes \mathbf{n^}. \]
In the original definition of the Lax–Friedrichs flux, a factor \(\lambda = \max\left(\\mathbf{u}^\+c^, \\mathbf{u}^+\+c^+\right)\) is used (corresponding to the maximal speed at which information is moving on the two sides of the interface), stating that the difference between the two states, \([\![\mathbf{w}]\!]\) is penalized by the largest eigenvalue in the Euler flux, which is \(\\mathbf{u}\+c\), where \(c=\sqrt{\gamma p / \rho}\) is the speed of sound. In the implementation below, we modify the penalty term somewhat, given that the penalty is of approximate nature anyway. We use
\begin{align*} \lambda &= \frac{1}{2}\max\left(\sqrt{\\mathbf{u^}\^2+(c^)^2}, \sqrt{\\mathbf{u}^+\^2+(c^+)^2}\right) \\ &= \frac{1}{2}\sqrt{\max\left(\\mathbf{u^}\^2+(c^)^2, \\mathbf{u}^+\^2+(c^+)^2\right)}. \end{align*}
The additional factor \(\frac 12\) reduces the penalty strength (which results in a reduced negative real part of the eigenvalues, and thus increases the admissible time step size). Using the squares within the sums allows us to reduce the number of expensive square root operations, which is 4 for the original Lax–Friedrichs definition, to a single one. This simplification leads to at most a factor of 2 in the reduction of the parameter \(\lambda\), since \(\\mathbf{u}\^2+c^2 \leq \\mathbf{u}\^2+2 c \mathbf{u}\ + c^2 = \left(\\mathbf{u}\+c\right)^2 \leq 2 \left(\\mathbf{u}\^2+c^2\right)\), with the last inequality following from Young's inequality.
The second numerical flux is one proposed by Harten, Lax and van Leer, called the HLL flux. It takes the different directions of propagation of the Euler equations into account, depending on the speed of sound. It utilizes some intermediate states \(\bar{\mathbf{u}}\) and \(\bar{c}\) to define the two branches \(s^\mathrm{p} = \max\left(0, \bar{\mathbf{u}}\cdot \mathbf{n} + \bar{c}\right)\) and \(s^\mathrm{n} = \min\left(0, \bar{\mathbf{u}}\cdot \mathbf{n}  \bar{c}\right)\). From these branches, one then defines the flux
\[ \hat{\mathbf{F}}(\mathbf{w}^,\mathbf{w}^+) = \frac{s^\mathrm{p} \mathbf{F}(\mathbf{w}^)s^\mathrm{n} \mathbf{F}(\mathbf{w}^+)} {s^\mathrm p  s^\mathrm{n} } + \frac{s^\mathrm{p} s^\mathrm{n}}{s^\mathrm{p}s^\mathrm{n}} \left[\mathbf{w}^\mathbf{w}^+\right]\otimes \mathbf{n^}. \]
Regarding the definition of the intermediate state \(\bar{\mathbf{u}}\) and \(\bar{c}\), several variants have been proposed. The variant originally proposed uses a densityaveraged definition of the velocity, \(\bar{\mathbf{u}} = \frac{\sqrt{\rho^} \mathbf{u}^ + \sqrt{\rho^+}\mathbf{u}^+}{\sqrt{\rho^} + \sqrt{\rho^+}}\). Since we consider the Euler equations without shocks, we simply use arithmetic means, \(\bar{\mathbf{u}} = \frac{\mathbf{u}^ + \mathbf{u}^+}{2}\) and \(\bar{c} = \frac{c^ + c^+}{2}\), with \(c^{\pm} = \sqrt{\gamma p^{\pm} / \rho^{\pm}}\), in this tutorial program, and leave other variants to a possible extension. We also note that the HLL flux has been extended in the literature to the socalled HLLC flux, where C stands for the ability to represent contact discontinuities.
At the boundaries with no neighboring state \(\mathbf{w}^+\) available, it is common practice to deduce suitable exterior values from the boundary conditions (see the general literature on DG methods for details). In this tutorial program, we consider three types of boundary conditions, namely inflow boundary conditions where all components are prescribed,
\[ \mathbf{w}^+ = \begin{pmatrix} \rho_\mathrm{D}(t)\\ (\rho \mathbf u)_{\mathrm D}(t) \\ E_\mathrm{D}(t)\end{pmatrix} \quad \text{(Dirichlet)}, \]
subsonic outflow boundaries, where we do not prescribe exterior solutions as the flow field is leaving the domain and use the interior values instead; we still need to prescribe the energy as there is one incoming characteristic left in the Euler flux,
\[ \mathbf{w}^+ = \begin{pmatrix} \rho^\\ (\rho \mathbf u)^ \\ E_\mathrm{D}(t)\end{pmatrix} \quad \text{(mixed Neumann/Dirichlet)}, \]
and wall boundary condition which describe a nopenetration configuration:
\[ \mathbf{w}^+ = \begin{pmatrix} \rho^\\ (\rho \mathbf u)^  2 [(\rho \mathbf u)^\cdot \mathbf n] \mathbf{n} \\ E^\end{pmatrix}. \]
The polynomial expansion of the solution is finally inserted to the weak form and test functions are replaced by the basis functions. This gives a discrete in space, continuous in time nonlinear system with a finite number of unknown coefficient values \(w_j\), \(j=1,\ldots,n_\text{dofs}\). Regarding the choice of the polynomial degree in the DG method, there is no consensus in literature as of 2019 as to what polynomial degrees are most efficient and the decision is problemdependent. Higher order polynomials ensure better convergence rates and are thus superior for moderate to high accuracy requirements for smooth solutions. At the same time, the volumetosurface ratio of where degrees of freedom are located, increases with higher degrees, and this makes the effect of the numerical flux weaker, typically reducing dissipation. However, in most of the cases the solution is not smooth, at least not compared to the resolution that can be afforded. This is true for example in incompressible fluid dynamics, compressible fluid dynamics, and the related topic of wave propagation. In this preasymptotic regime, the error is approximately proportional to the numerical resolution, and other factors such as dispersion errors or the dissipative behavior become more important. Very high order methods are often ruled out because they come with more restrictive CFL conditions measured against the number of unknowns, and they are also not as flexible when it comes to representing complex geometries. Therefore, polynomial degrees between two and six are most popular in practice, see e.g. the efficiency evaluation in [50] and references cited therein.
To discretize in time, we slightly rearrange the weak form and sum over all cells:
\[ \sum_{K \in \mathcal T_h} \left(\boldsymbol{\varphi}_i, \frac{\partial \mathbf{w}}{\partial t}\right)_{K} = \sum_{K\in \mathcal T_h} \left[ \left(\nabla \boldsymbol{\varphi}_i, \mathbf{F}(\mathbf{w})\right)_{K} \left<\boldsymbol{\varphi}_i, \mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w})\right>_{\partial K} + \left(\boldsymbol{\varphi}_i,\mathbf{G}(\mathbf w)\right)_{K} \right], \]
where \(\boldsymbol{\varphi}_i\) runs through all basis functions with from 1 to \(n_\text{dofs}\).
We now denote by \(\mathcal M\) the mass matrix with entries \(\mathcal M_{ij} = \sum_{K} \left(\boldsymbol{\varphi}_i, \boldsymbol{\varphi}_j\right)_K\), and by
\[ \mathcal L_h(t,\mathbf{w}_h) = \left[\sum_{K\in \mathcal T_h} \left[ \left(\nabla \boldsymbol{\varphi}_i, \mathbf{F}(\mathbf{w}_h)\right)_{K}  \left<\boldsymbol{\varphi}_i, \mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w}_h)\right>_{\partial K} + \left(\boldsymbol{\varphi}_i,\mathbf{G}(\mathbf w_h)\right)_{K} \right]\right]_{i=1,\ldots,n_\text{dofs}}. \]
the operator evaluating the righthand side of the Euler operator, given a function \(\mathbf{w}_h\) associated with a global vector of unknowns and the finite element in use. This function \(\mathcal L_h\) is explicitly timedependent as the numerical flux evaluated at the boundary will involve timedependent data \(\rho_\mathrm{D}\), \((\rho \mathbf{u})_\mathrm{D}\), and \(E_\mathbf{D}\) on some parts of the boundary, depending on the assignment of boundary conditions. With this notation, we can write the discrete in space, continuous in time system compactly as
\[ \mathcal M \frac{\partial \mathbf{w}_h}{\partial t} = \mathcal L_h(t, \mathbf{w}_h), \]
where we have taken the liberty to also denote the global solution vector by \(\mathbf{w}_h\) (in addition to the the corresponding finite element function). Equivalently, the system above has the form
\[ \frac{\partial \mathbf{w}_h}{\partial t} = \mathcal M^{1} \mathcal L_h(t, \mathbf{w}_h). \]
For hyperbolic systems discretized by highorder discontinuous Galerkin methods, explicit time integration of this system is very popular. This is due to the fact that the mass matrix \(\mathcal M\) is blockdiagonal (with each block corresponding to only variables of the same kind defined on the same cell) and thus easily inverted. In each time step – or stage of a Runge–Kutta scheme – one only needs to evaluate the differential operator once using the given data and subsequently apply the inverse of the mass matrix. For implicit time stepping, on the other hand, one would first have to linearize the equations and then iteratively solve the linear system, which involves several residual evaluations and at least a dozen applications of the linearized operator, as has been demonstrated in the step33 tutorial program.
Of course, the simplicity of explicit time stepping comes with a price, namely conditional stability due to the socalled Courant–Friedrichs–Lewy (CFL) condition. It states that the time step cannot be larger than the fastest propagation of information by the discretized differential operator. In more modern terms, the speed of propagation corresponds to the largest eigenvalue in the discretized operator, and in turn depends on the mesh size, the polynomial degree \(p\) and the physics of the Euler operator, i.e., the eigenvalues of the linearization of \(\mathbf F(\mathbf w)\) with respect to \(\mathbf{w}\). In this program, we set the time step as follows:
\[ \Delta t = \frac{\mathrm{Cr}}{p^{1.5}}\left(\frac{1} {\max\left[\frac{\\mathbf{u}\}{h_u} + \frac{c}{h_c}\right]}\right), \]
with the maximum taken over all quadrature points and all cells. The dimensionless number \(\mathrm{Cr}\) denotes the Courant number and can be chosen up to a maximally stable number \(\mathrm{Cr}_\text{max}\), whose value depends on the selected time stepping method and its stability properties. The power \(p^{1.5}\) used for the polynomial scaling is heuristic and represents the closest fit for polynomial degrees between 1 and 8, see e.g. [117]. In the limit of higher degrees, \(p>10\), a scaling of \(p^2\) is more accurate, related to the inverse estimates typically used for interior penalty methods. Regarding the effective mesh sizes \(h_u\) and \(h_c\) used in the formula, we note that the convective transport is directional. Thus an appropriate scaling is to use the element length in the direction of the velocity \(\mathbf u\). The code below derives this scaling from the inverse of the Jacobian from the reference to real cell, i.e., we approximate \(\frac{\\mathbf{u}\}{h_u} \approx \J^{1} \mathbf u\_{\infty}\). The acoustic waves, instead, are isotropic in character, which is why we use the smallest feature size, represented by the smallest singular value of \(J\), for the acoustic scaling \(h_c\). Finally, we need to add the convective and acoustic limits, as the Euler equations can transport information with speed \(\\mathbf{u}\+c\).
In this tutorial program, we use a specific variant of explicit Runge–Kutta methods, which in general use the following update procedure from the state \(\mathbf{w}_h^{n}\) at time \(t^n\) to the new time \(t^{n+1}\) with \(\Delta t = t^{n+1}t^n\):
\[ \begin{aligned} \mathbf{k}_1 &= \mathcal M^{1} \mathcal L_h\left(t^n, \mathbf{w}_h^n\right), \\ \mathbf{k}_2 &= \mathcal M^{1} \mathcal L_h\left(t^n+c_2\Delta t, \mathbf{w}_h^n + a_{21} \Delta t \mathbf{k}_1\right), \\ &\vdots \\ \mathbf{k}_s &= \mathcal M^{1} \mathcal L_h\left(t^n+c_s\Delta t, \mathbf{w}_h^n + \sum_{j=1}^{s1} a_{sj} \Delta t \mathbf{k}_j\right), \\ \mathbf{w}_h^{n+1} &= \mathbf{w}_h^n + \Delta t\left(b_1 \mathbf{k}_1 + b_2 \mathbf{k}_2 + \ldots + b_s \mathbf{k}_s\right). \end{aligned} \]
The vectors \(\mathbf{k}_i\), \(i=1,\ldots,s\), in an \(s\)stage scheme are evaluations of the operator at some intermediate state and used to define the endofstep value \(\mathbf{w}_h^{n+1}\) via some linear combination. The scalar coefficients in this scheme, \(c_i\), \(a_{ij}\), and \(b_j\), are defined such that certain conditions are satisfied for higher order schemes, the most basic one being \(c_i = \sum_{j=1}^{i1}a_{ij}\). The parameters are typically collected in the form of a socalled Butcher tableau that collects all of the coefficients that define the scheme. For a fivestage scheme, it would look like this:
\[ \begin{array}{cccccc} 0 \\ c_2 & a_{21} \\ c_3 & a_{31} & a_{32} \\ c_4 & a_{41} & a_{42} & a_{43} \\ c_5 & a_{51} & a_{52} & a_{53} & a_{54} \\ \hline & b_1 & b_2 & b_3 & b_4 & b_5 \end{array} \]
In this tutorial program, we use a subset of explicit Runge–Kutta methods, socalled lowstorage Runge–Kutta methods (LSRK), which assume additional structure in the coefficients. In the variant used by reference [80], the assumption is to use Butcher tableaus of the form
\[ \begin{array}{cccccc} 0 \\ c_2 & a_1 \\ c_3 & b_1 & a_2 \\ c_4 & b_1 & b_2 & a_3 \\ c_5 & b_1 & b_2 & b_3 & a_4 \\ \hline & b_1 & b_2 & b_3 & b_4 & b_5 \end{array} \]
With such a definition, the update to \(\mathbf{w}_h^n\) shares the storage with the information for the intermediate values \(\mathbf{k}_i\). Starting with \(\mathbf{w}^{n+1}=\mathbf{w}^n\) and \(\mathbf{r}_1 = \mathbf{w}^n\), the update in each of the \(s\) stages simplifies to
\[ \begin{aligned} \mathbf{k}_i &= \mathcal M^{1} \mathcal L_h\left(t^n+c_i\Delta t, \mathbf{r}_{i} \right),\\ \mathbf{r}_{i+1} &= \mathbf{w}_h^{n+1} + \Delta t \, a_i \mathbf{k}_i,\\ \mathbf{w}_h^{n+1} &= \mathbf{w}_h^{n+1} + \Delta t \, b_i \mathbf{k}_i. \end{aligned} \]
Besides the vector \(\mathbf w_h^{n+1}\) that is successively updated, this scheme only needs two auxiliary vectors, namely the vector \(\mathbf{k}_i\) to hold the evaluation of the differential operator, and the vector \(\mathbf{r}_i\) that holds the righthand side for the differential operator application. In subsequent stages \(i\), the values \(\mathbf{k}_i\) and \(\mathbf{r}_i\) can use the same storage.
The main advantages of lowstorage variants are the reduced memory consumption on the one hand (if a very large number of unknowns must be fit in memory, holding all \(\mathbf{k}_i\) to compute subsequent updates can be a limit already for \(s\) in between five and eight – recall that we are using an explicit scheme, so we do not need to store any matrices that are typically much larger than a few vectors), and the reduced memory access on the other. In this program, we are particularly interested in the latter aspect. Since cost of operator evaluation is only a small multiple of the cost of simply streaming the input and output vector from memory with the optimized matrixfree methods of deal.II, we must consider the cost of vector updates, and lowstorage variants can deliver up to twice the throughput of conventional explicit Runge–Kutta methods for this reason, see e.g. the analysis in [117].
Besides three variants for third, fourth and fifth order accuracy from the reference [80], we also use a fourthorder accurate variant with seven stages that was optimized for acoustics setups from [126]. Acoustic problems are one of the interesting aspects of the subsonic regime of the Euler equations where compressibility leads to the transmission of sound waves; often, one uses further simplifications of the linearized Euler equations around a background state or the acoustic wave equation around a fixed frame.
The major ingredients used in this program are the fast matrixfree techniques we use to evaluate the operator \(\mathcal L_h\) and the inverse mass matrix \(\mathcal M\). Actually, the term matrixfree is a slight misnomer, because we are working with a nonlinear operator and do not linearize the operator that in turn could be represented by a matrix. However, fast evaluation of integrals has become popular as a replacement of sparse matrixvector products, as shown in step37 and step59, and we have coined this infrastructure matrixfree functionality in deal.II for this reason. Furthermore, the inverse mass matrix is indeed applied in a matrixfree way, as detailed below.
The matrixfree infrastructure allows us to quickly evaluate the integrals in the weak forms. The ingredients are the fast interpolation from solution coefficients into values and derivatives at quadrature points, pointwise operations at quadrature points (where we implement the differential operator as derived above), as well as multiplication by all test functions and summation over quadrature points. The first and third component make use of sum factorization and have been extensively discussed in the step37 tutorial program for the cell integrals and step59 for the face integrals. The only difference is that we now deal with a system of \(d+2\) components, rather than the scalar systems in previous tutorial programs. In the code, all that changes is a template argument of the FEEvaluation and FEFaceEvaluation classes, the one to set the number of components. The access to the vector is the same as before, all handled transparently by the evaluator. We also note that the variant with a single evaluator chosen in the code below is not the only choice – we could also have used separate evalators for the separate components \(\rho\), \(\rho \mathbf u\), and \(E\); given that we treat all components similarly (also reflected in the way we state the equation as a vector system), this would be more complicated here. As before, the FEEvaluation class provides explicit vectorization by combining the operations on several cells (and faces), involving data types called VectorizedArray. Since the arithmetic operations are overloaded for this type, we do not have to bother with it all that much, except for the evaluation of functions through the Function interface, where we need to provide particular vectorized evaluations for several quadrature point locations at once.
A more substantial change in this program is the operation at quadrature points: Here, the multicomponent evaluators provide us with return types not discussed before. Whereas FEEvaluation::get_value() would return a scalar (more precisely, a VectorizedArray type due to vectorization across cells) for the Laplacian of step37, it now returns a type that is Tensor<1,dim+2,VectorizedArray<Number>>
. Likewise, the gradient type is now Tensor<1,dim+2,Tensor<1,dim,VectorizedArray<Number>>>
, where the outer tensor collects the dim+2
components of the Euler system, and the inner tensor the partial derivatives in the various directions. For example, the flux \(\mathbf{F}(\mathbf{w})\) of the Euler system is of this type. In order to reduce the amount of code we have to write for spelling out these types, we use the C++ auto
keyword where possible.
From an implementation point of view, the nonlinearity is not a big difficulty: It is introduced naturally as we express the terms of the Euler weak form, for example in the form of the momentum term \(\rho \mathbf{u} \otimes \mathbf{u}\). To obtain this expression, we first deduce the velocity \(\mathbf{u}\) from the momentum variable \(\rho \mathbf{u}\). Given that \(\rho \mathbf{u}\) is represented as a \(p\)degree polynomial, as is \(\rho\), the velocity \(\mathbf{u}\) is a rational expression in terms of the reference coordinates \(\hat{\mathbf{x}}\). As we perform the multiplication \((\rho \mathbf{u})\otimes \mathbf{u}\), we obtain an expression that is the ratio of two polynomials, with polynomial degree \(2p\) in the numerator and polynomial degree \(p\) in the denominator. Combined with the gradient of the test function, the integrand is of degree \(3p\) in the numerator and \(p\) in the denominator already for affine cells, i.e., for parallelograms/ parallelepipeds. For curved cells, additional polynomial and rational expressions appear when multiplying the integrand by the determinant of the Jacobian of the mapping. At this point, one usually needs to give up on insisting on exact integration, and take whatever accuracy the Gaussian (more precisely, Gauss–Legrende) quadrature provides. The situation is then similar to the one for the Laplace equation, where the integrand contains rational expressions on nonaffince cells and is also only integrated approximately. As these formulas only integrate polynomials exactly, we have to live with the variational crime in the form of an integration error.
While inaccurate integration is usually tolerable for elliptic problems, for hyperbolic problems inexact integration causes some headache in the form of an effect called aliasing. The term comes from signal processing and expresses the situation of inappropriate, too coarse sampling. In terms of quadrature, the inappropriate sampling means that we use too few quadrature points compared to what would be required to accurately sample the variablecoefficient integrand. It has been shown in the DG literature that aliasing errors can introduce unphysical oscillations in the numerical solution for barely resolved simulations. The fact that aliasing mostly affects coarse resolutions – whereas finer meshes with the same scheme work fine – is not surprising because wellresolved simulations tend to be smooth on lengthscales of a cell (i.e., they have small coefficients in the higher polynomial degrees that are missed by too few quadrature points, whereas the main solution contribution in the lower polynomial degrees is still wellcaptured – this is simply a consequence of Taylor's theorem). To address this topic, various approaches have been proposed in the DG literature. One technique is filtering which damps the solution components pertaining to higher polynomial degrees. As the chosen nodal basis is not hierarchical, this would mean to transform from the nodal basis into a hierarchical one (e.g., a modal one based on Legendre polynomials) where the contributions within a cell are split by polynomial degrees. In that basis, one could then multiply the solution coefficients associated with higher degrees by a small number, keep the lower ones intact (to not destroy consistency), and then transform back to the nodal basis. However, filters reduce the accuracy of the method. Another, in some sense simpler, strategy is to use more quadrature points to capture nonlinear terms more accurately. Using more than \(p+1\) quadrature points per coordinate directions is sometimes called overintegration or consistent integration. The latter name is most common in the context of the incompressible NavierStokes equations, where the \(\mathbf{u}\otimes \mathbf{u}\) nonlinearity results in polynomial integrands of degree \(3p\) (when also considering the test function), which can be integrated exactly with \(\textrm{floor}\left(\frac{3p}{2}\right)+1\) quadrature points per direction as long as the element geometry is affine. In the context of the Euler equations with nonpolynomial integrands, the choice is less clear. Depending on the variation in the various variables both \(\textrm{floor}\left(\frac{3p}{2}\right)+1\) or \(2p+1\) points (integrating exactly polynomials of degree \(3p\) or \(4p\), respectively) are common.
To reflect this variability in the choice of quadrature in the program, we keep the number of quadrature points a variable to be specified just as the polynomial degree, and note that one would make different choices depending also on the flow configuration. The default choice is \(p+2\) points – a bit more than the minimum possible of \(p+1\) points. The FEEvaluation and FEFaceEvaluation classes allow to seamlessly change the number of points by a template parameter, such that the program does not get more complicated because of that.
The last ingredient is the evaluation of the inverse mass matrix \(\mathcal M^{1}\). In DG methods with explicit time integration, mass matrices are blockdiagonal and thus easily inverted – one only needs to invert the diagonal blocks. However, given the fact that matrixfree evaluation of integrals is closer in cost to the access of the vectors only, even the application of a blockdiagonal matrix (e.g. via an array of LU factors) would be several times more expensive than evaluation of \(\mathcal L_h\) simply because just storing and loading matrices of size dofs_per_cell
times dofs_per_cell
for higher order finite elements repeatedly is expensive. As this is clearly undesirable, part of the community has moved to bases where the mass matrix is diagonal, for example the L_{2}orthogonal Legendre basis using hierarchical polynomials or Lagrange polynomials on the points of the Gaussian quadrature (which is just another way of utilizing Legendre information). While the diagonal property breaks down for deformed elements, the error made by taking a diagonal mass matrix and ignoring the rest (a variant of mass lumping, though not the one with an additional integration error as utilized in step48) has been shown to not alter discretization accuracy. The Lagrange basis in the points of Gaussian quadrature is sometimes also referred to as a collocation setup, as the nodal points of the polynomials coincide (= are "colocated") with the points of quadrature, obviating some interpolation operations. Given the fact that we want to use more quadrature points for nonlinear terms in \(\mathcal L_h\), however, the collocation property is lost. (More precisely, it is still used in FEEvaluation and FEFaceEvaluation after a change of basis, see the matrixfree paper [84].)
In this tutorial program, we use the collocation idea for the application of the inverse mass matrix, but with a slight twist. Rather than using the collocation via Lagrange polynomials in the points of Gaussian quadrature, we prefer a conventional Lagrange basis in GaussLobatto points as those make the evaluation of face integrals cheap. This is because for GaussLobatto points, some of the node points are located on the faces of the cell and it is not difficult to show that on any given face, the only shape functions with nonzero values are exactly the ones whose node points are in fact located on that face. One could of course also use the GaussLobatto quadrature (with some additional integration error) as was done in step48, but we do not want to sacrifice accuracy as these quadrature formulas are generally of lower order than the general Gauss quadrature formulas. Instead, we use an idea described in the reference [86] where it was proposed to change the basis for the sake of applying the inverse mass matrix. Let us denote by \(S\) the matrix of shape functions evaluated at quadrature points, with shape functions in the row of the matrix and quadrature points in columns. Then, the mass matrix on a cell \(K\) is given by
\[ \mathcal M^K = S J^K S^\mathrm T. \]
Here, \(J^K\) is the diagonal matrix with the determinant of the Jacobian times the quadrature weight (JxW) as entries. The matrix \(S\) is constructed as the Kronecker product (tensor product) of onedimensional matrices, e.g. in 3D as
\[ S = S_{\text{1D}}\otimes S_{\text{1D}}\otimes S_{\text{1D}}, \]
which is the result of the basis functions being a tensor product of onedimensional shape functions and the quadrature formula being the tensor product of 1D quadrature formulas. For the case that the number of polynomials equals the number of quadrature points, all matrices in \(S J^K S^\mathrm T\) are square, and also the ingredients to \(S\) in the Kronecker product are square. Thus, one can invert each matrix to form the overall inverse,
\[ \left(\mathcal M^K\right)^{1} = S_{\text{1D}}^{\mathrm T}\otimes S_{\text{1D}}^{\mathrm T}\otimes S_{\text{1D}}^{\mathrm T} \left(J^K\right)^{1} S_{\text{1D}}^{1}\otimes S_{\text{1D}}^{1}\otimes S_{\text{1D}}^{1}. \]
This formula is of exactly the same structure as the steps in the forward evaluation of integrals with sum factorization techniques (i.e., the FEEvaluation and MatrixFree framework of deal.II). Hence, we can utilize the same code paths with a different interpolation matrix, \(S_{\mathrm{1D}}^{\mathrm{T}}\) rather than \(S_{\mathrm{1D}}\).
The class MatrixFreeOperators::CellwiseInverseMassMatrix implements this operation: It changes from the basis contained in the finite element (in this case, FE_DGQ) to the Lagrange basis in Gaussian quadrature points. Here, the inverse of a diagonal mass matrix can be evaluated, which is simply the inverse of the JxW
factors (i.e., the quadrature weight times the determinant of the Jacobian from reference to real coordinates). Once this is done, we can change back to the standard nodal GaussLobatto basis.
The advantage of this particular way of applying the inverse mass matrix is a cost similar to the forward application of a mass matrix, which is cheaper than the evaluation of the spatial operator \(\mathcal L_h\) with overintegration and face integrals. (We will demonstrate this with detailed timing information in the results section.) In fact, it is so cheap that it is limited by the bandwidth of reading the source vector, reading the diagonal, and writing into the destination vector on most modern architectures. The hardware used for the result section allows to do the computations at least twice as fast as the streaming of the vectors from memory.
In this tutorial program, we implement two test cases. The first case is a convergence test limited to two space dimensions. It runs a socalled isentropic vortex which is transported via a background flow field. The second case uses a more exciting setup: We start with a cylinder immersed in a channel, using the GridGenerator::channel_with_cylinder() function. Here, we impose a subsonic initial field at Mach number of \(\mathrm{Ma}=0.307\) with a constant velocity in \(x\) direction. At the top and bottom walls as well as at the cylinder, we impose a nopenetration (i.e., tangential flow) condition. This setup forces the flow to reorient as compared to the initial condition, which results in a big sound wave propagating away from the cylinder. In upstream direction, the wave travels more slowly (as it has to move against the oncoming gas), including a discontinuity in density and pressure. In downstream direction, the transport is faster as sound propagation and fluid flow go in the same direction, which smears out the discontinuity somewhat. Once the sound wave hits the upper and lower walls, the sound is reflected back, creating some nice shapes as illustrated in the results section below.
The include files are similar to the previous matrixfree tutorial programs step37, step48, and step59
The following file includes the CellwiseInverseMassMatrix data structure that we will use for the mass matrix inversion, the only new include file for this tutorial program:
Similarly to the other matrixfree tutorial programs, we collect all parameters that control the execution of the program at the top of the file. Besides the dimension and polynomial degree we want to run with, we also specify a number of points in the Gaussian quadrature formula we want to use for the nonlinear terms in the Euler equations. Furthermore, we specify the time interval for the timedependent problem, and implement two different test cases. The first one is an analytical solution in 2D, whereas the second is a channel flow around a cylinder as described in the introduction. Depending on the test case, we also change the final time up to which we run the simulation, and a variable output_tick
that specifies in which intervals we want to write output (assuming that the tick is larger than the time step size).
Next off are some details of the time integrator, namely a Courant number that scales the time step size in terms of the formula \(\Delta t = \text{Cr} n_\text{stages} \frac{h}{(p+1)^{1.5} (\\mathbf{u} + c)_\text{max}}\), as well as a selection of a few lowstorage Runge–Kutta methods. We specify the Courant number per stage of the Runge–Kutta scheme, as this gives a more realistic expression of the numerical cost for schemes of various numbers of stages.
Eventually, we select a detail of the spatial discretization, namely the numerical flux (Riemann solver) at the faces between cells. For this program, we have implemented a modified variant of the Lax–Friedrichs flux and the Harten–Lax–van Leer (HLL) flux.
We now define a class with the exact solution for the test case 0 and one with a background flow field for test case 1 of the channel. Given that the Euler equations are a problem with \(d+2\) equations in \(d\) dimensions, we need to tell the Function base class about the correct number of components.
As far as the actual function implemented is concerned, the analytical test case is an isentropic vortex case (see e.g. the book by Hesthaven and Warburton, Example 6.1 in Section 6.6 on page 209) which fulfills the Euler equations with zero force term on the right hand side. Given that definition, we return either the density, the momentum, or the energy depending on which component is requested. Note that the original definition of the density involves the \(\frac{1}{\gamma 1}\)th power of some expression. Since std::pow()
has pretty slow implementations on some systems, we replace it by logarithm followed by exponentiation (of base 2), which is mathematically equivalent but usually much better optimized. This formula might lose accuracy in the last digits for very small numbers compared to std::pow()
, but we are happy with it anyway, since small numbers map to data close to 1.
For the channel test case, we simply select a density of 1, a velocity of 0.4 in \(x\) direction and zero in the other directions, and an energy that corresponds to a speed of sound of 1.3 measured against the background velocity field, computed from the relation \(E = \frac{c^2}{\gamma (\gamma 1)} + \frac 12 \rho \u\^2\).
The next few lines implement a few lowstorage variants of Runge–Kutta methods. These methods have specific Butcher tableaux with coefficients \(b_i\) and \(a_i\) as shown in the introduction. As usual in Runge–Kutta method, we can deduce time steps, \(c_i = \sum_{j=1}^{i2} b_i + a_{i1}\) from those coefficients. The main advantage of this kind of scheme is the fact that only two vectors are needed per stage, namely the accumulated part of the solution \(\mathbf{w}\) (that will hold the solution \(\mathbf{w}^{n+1}\) at the new time \(t^{n+1}\) after the last stage), the update vector \(\mathbf{r}_i\) that gets evaluated during the stages, plus one vector \(\mathbf{k}_i\) to hold the evaluation of the operator. Such a Runge–Kutta setup reduces the memory storage and memory access. As the memory bandwidth is often the performancelimiting factor on modern hardware when the evaluation of the differential operator is welloptimized, performance can be improved over standard time integrators. This is true also when taking into account that a conventional Runge–Kutta scheme might allow for slightly larger time steps as more free parameters allow for better stability properties.
In this tutorial programs, we concentrate on a few variants of lowstorage schemes defined in the article by Kennedy, Carpenter, and Lewis (2000), as well as one variant described by Tselios and Simos (2007). There is a large series of other schemes available, which could be addressed by additional sets of coefficients or slightly different update formulas.
We define a single class for the four integrators, distinguished by the enum described above. To each scheme, we then fill the vectors for the \(b_i\) and \(a_i\) to the given variables in the class.
First comes the threestage scheme of order three by Kennedy et al. (2000). While its stability region is significantly smaller than for the other schemes, it only involves three stages, so it is very competitive in terms of the work per stage.
The next scheme is a fivestage scheme of order four, again defined in the paper by Kennedy et al. (2000).
The following scheme of seven stages and order four has been explicitly derived for acoustics problems. It is a balance of accuracy for imaginary eigenvalues among fourth order schemes, combined with a large stability region. Since DG schemes are dissipative among the highest frequencies, this does not necessarily translate to the highest possible time step per stage. In the context of the present tutorial program, the numerical flux plays a crucial role in the dissipation and thus also the maximal stable time step size. For the modified Lax–Friedrichs flux, this scheme is similar to the stage_5_order_4
scheme in terms of step size per stage if only stability is considered, but somewhat less efficient for the HLL flux.
The last scheme included here is the ninestage scheme of order five from Kennedy et al. (2000). It is the most accurate among the schemes used here, but the higher order of accuracy sacrifices some stability, so the step length normalized per stage is less than for the fourth order schemes.
The main function of the time integrator is to go through the stages, evaluate the operator, prepare the \(\mathbf{r}_i\) vector for the next evaluation, and update the solution vector \(\mathbf{w}\). We hand off the work to the pde_operator
involved in order to be able to merge the vector operations of the Runge–Kutta setup with the evaluation of the differential operator for better performance, so all we do here is to delegate the vectors and coefficients.
We separately call the operator for the first stage because we need slightly modified arguments there: We evaluate the solution from the old solution \(\mathbf{w}^n\) rather than a \(\mathbf r_i\) vector, so the first argument is solution
. We here let the stage vector \(\mathbf{r}_i\) also hold the temporary result of the evaluation, as it is not used otherwise. For all subsequent stages, we use the vector vec_ki
as the second vector argument to store the result of the operator evaluation. Finally, when we are at the last stage, we must skip the computation of the vector \(\mathbf{r}_{s+1}\) as there is no coefficient \(a_s\) available (nor will it be used).
In the following functions, we implement the various problemspecific operators pertaining to the Euler equations. Each function acts on the vector of conserved variables \([\rho, \rho\mathbf{u}, E]\) that we hold in the solution vectors, and computes various derived quantities.
First out is the computation of the velocity, that we derive from the momentum variable \(\rho \mathbf{u}\) by division by \(\rho\). One thing to note here is that we decorate all those functions with the keyword DEAL_II_ALWAYS_INLINE
. This is a special macro that maps to a compilerspecific keyword that tells the compiler to never create a function call for any of those functions, and instead move the implementation inline to where they are called. This is critical for performance because we call into some of those functions millions or billions of times: For example, we both use the velocity for the computation of the flux further down, but also for the computation of the pressure, and both of these places are evaluated at every quadrature point of every cell. Making sure these functions are inlined ensures not only that the processor does not have to execute a jump instruction into the function (and the corresponding return jump), but also that the compiler can reuse intermediate information from one function's context in code that comes after the place where the function was called. (We note that compilers are generally quite good at figuring out which functions to inline by themselves. Here is a place where compilers may or may not have figured it out by themselves but where we know for sure that inlining is a win.)
Another trick we apply is a separate variable for the inverse density \(\frac{1}{\rho}\). This enables the compiler to only perform a single division for the flux, despite the division being used at several places. As divisions are around ten to twenty times as expensive as multiplications or additions, avoiding redundant divisions is crucial for performance. We note that taking the inverse first and later multiplying with it is not equivalent to a division in floating point arithmetic due to roundoff effects, so the compiler is not allowed to exchange one way by the other with standard optimization flags. However, it is also not particularly difficult to write the code in the right way.
To summarize, the chosen strategy of always inlining and careful definition of expensive arithmetic operations allows us to write compact code without passing all intermediate results around, despite making sure that the code maps to excellent machine code.
The next function computes the pressure from the vector of conserved variables, using the formula \(p = (\gamma  1) \left(E  \frac 12 \rho \mathbf{u}\cdot \mathbf{u}\right)\). As explained above, we use the velocity from the euler_velocity()
function. Note that we need to specify the first template argument dim
here because the compiler is not able to deduce it from the arguments of the tensor, whereas the second argument (number type) can be automatically deduced.
Here is the definition of the Euler flux function, i.e., the definition of the actual equation. Given the velocity and pressure (that the compiler optimization will make sure are done only once), this is straightforward given the equation stated in the introduction.
This next function is a helper to simplify the implementation of the numerical flux, implementing the action of a tensor of tensors (with nonstandard outer dimension of size dim + 2
, so the standard overloads provided by deal.II's tensor classes do not apply here) with another tensor of the same inner dimension, i.e., a matrixvector product.
This function implements the numerical flux (Riemann solver). It gets the state from the two sides of an interface and the normal vector, oriented from the side of the solution \(\mathbf{w}^\) towards the solution \(\mathbf{w}^+\). In finite volume methods which rely on piecewise constant data, the numerical flux is the central ingredient as it is the only place where the physical information is entered. In DG methods, the numerical flux is less central due to the polynomials within the elements and the physical flux used there. As a result of higherdegree interpolation with consistent values from both sides in the limit of a continuous solution, the numerical flux can be seen as a control of the jump of the solution from both sides to weakly impose continuity. It is important to realize that a numerical flux alone cannot stabilize a highorder DG method in the presence of shocks, and thus any DG method must be combined with further shockcapturing techniques to handle those cases. In this tutorial, we focus on wavelike solutions of the Euler equations in the subsonic regime without strong discontinuities where our basic scheme is sufficient.
Nonetheless, the numerical flux is decisive in terms of the numerical dissipation of the overall scheme and influences the admissible time step size with explicit Runge–Kutta methods. We consider two choices, a modified Lax–Friedrichs scheme and the widely used Harten–Lax–van Leer (HLL) flux. For both variants, we first need to get the velocities and pressures from both sides of the interface and evaluate the physical Euler flux.
For the local Lax–Friedrichs flux, the definition is \(\hat{\mathbf{F}} =\frac{\mathbf{F}(\mathbf{w}^)+\mathbf{F}(\mathbf{w}^+)}{2} + \frac{\lambda}{2}\left[\mathbf{w}^\mathbf{w}^+\right]\otimes \mathbf{n^}\), where the factor \(\lambda = \max\left(\\mathbf{u}^\+c^, \\mathbf{u}^+\+c^+\right)\) gives the maximal wave speed and \(c = \sqrt{\gamma p / \rho}\) is the speed of sound. Here, we choose two modifications of that expression for reasons of computational efficiency, given the small impact of the flux on the solution. For the above definition of the factor \(\lambda\), we would need to take four square roots, two for the two velocity norms and two for the speed of sound on either side. The first modification is hence to rather use \(\sqrt{\\mathbf{u}\^2+c^2}\) as an estimate of the maximal speed (which is at most a factor of 2 away from the actual maximum, as shown in the introduction). This allows us to pull the square root out of the maximum and get away with a single square root computation. The second modification is to further relax on the parameter \(\lambda\)—the smaller it is, the smaller the dissipation factor (which is multiplied by the jump in \(\mathbf{w}\), which might result in a smaller or bigger dissipation in the end). This allows us to fit the spectrum into the stability region of the explicit Runge–Kutta integrator with bigger time steps. However, we cannot make dissipation too small because otherwise imaginary eigenvalues grow larger. Finally, the current conservative formulation is not energystable in the limit of \(\lambda\to 0\) as it is not skewsymmetric, and would need additional measures such as splitform DG schemes in that case.
For the HLL flux, we follow the formula from literature, introducing an additional weighting of the two states from Lax–Friedrichs by a parameter \(s\). It is derived from the physical transport directions of the Euler equations in terms of the current direction of velocity and sound speed. For the velocity, we here choose a simple arithmetic average which is sufficient for DG scenarios and moderate jumps in material parameters.
Since the numerical flux is multiplied by the normal vector in the weak form, we multiply by the result by the normal vector for all terms in the equation. In these multiplications, the operator*
defined above enables a compact notation similar to the mathematical definition.
In this and the following functions, we use variable suffixes _m
and _p
to indicate quantities derived from \(\mathbf{w}^\) and \(\mathbf{w}^+\), i.e., values "here" and "there" relative to the current cell when looking at a neighbor cell.
This and the next function are helper functions to provide compact evaluation calls as multiple points get batched together via a VectorizedArray argument (see the step37 tutorial for details). This function is used for the subsonic outflow boundary conditions where we need to set the energy component to a prescribed value. The next one requests the solution on all components and is used for inflow boundaries where all components of the solution are set.
This class implements the evaluators for the Euler problem, in analogy to the LaplaceOperator
class of step37 or step59. Since the present operator is nonlinear and does not require a matrix interface (to be handed over to preconditioners), we skip the various vmult
functions otherwise present in matrixfree operators and only implement an apply
function as well as the combination of apply
with the required vector updates for the lowstorage Runge–Kutta time integrator mentioned above (called perform_stage
). Furthermore, we have added three additional functions involving matrixfree routines, namely one to compute an estimate of the time step scaling (that is combined with the Courant number for the actual time step size) based on the velocity and speed of sound in the elements, one for the projection of solutions (specializing VectorTools::project() for the DG case), and one to compute the errors against a possible analytical solution or norms against some background state.
The rest of the class is similar to other matrixfree tutorials. As discussed in the introduction, we provide a few functions to allow a user to pass in various forms of boundary conditions on different parts of the domain boundary marked by types::boundary_id variables, as well as possible body forces.
For the initialization of the Euler operator, we set up the MatrixFree variable contained in the class. This can be done given a mapping to describe possible curved boundaries as well as a DoFHandler object describing the degrees of freedom. Since we use a discontinuous Galerkin discretization in this tutorial program where no constraints are imposed strongly on the solution field, we do not need to pass in an AffineConstraints object and rather use a dummy for the construction. With respect to quadrature, we want to select two different ways of computing the underlying integrals: The first is a flexible one, based on a template parameter n_points_1d
(that will be assigned the n_q_points_1d
value specified at the top of this file). More accurate integration is necessary to avoid the aliasing problem due to the variable coefficients in the Euler operator. The second less accurate quadrature formula is a tight one based on fe_degree+1
and needed for the inverse mass matrix. While that formula provides an exact inverse only on affine element shapes and not on deformed elements, it enables the fast inversion of the mass matrix by tensor product techniques, necessary to ensure optimal computational efficiency overall.
The subsequent four member functions are the ones that must be called from outside to specify the various types of boundaries. For an inflow boundary, we must specify all components in terms of density \(\rho\), momentum \(\rho \mathbf{u}\) and energy \(E\). Given this information, we then store the function alongside the respective boundary id in a map member variable of this class. Likewise, we proceed for the subsonic outflow boundaries (where we request a function as well, which we use to retrieve the energy) and for wall (nopenetration) boundaries where we impose zero normal velocity (no function necessary, so we only request the boundary id). For the present DG code where boundary conditions are solely applied as part of the weak form (during time integration), the call to set the boundary conditions can appear both before or after the reinit()
call to this class. This is different from continuous finite element codes where the boundary conditions determine the content of the AffineConstraints object that is sent into MatrixFree for initialization, thus requiring to be set before the initialization of the matrixfree data structures.
The checks added in each of the four function are used to ensure that boundary conditions are mutually exclusive on the various parts of the boundary, i.e., that a user does not accidentally designate a boundary as both an inflow and say a subsonic outflow boundary.
Now we proceed to the local evaluators for the Euler problem. The evaluators are relatively simple and follow what has been presented in step37, step48, or step59. The first notable difference is the fact that we use an FEEvaluation with a nonstandard number of quadrature points. Whereas we previously always set the number of quadrature points to equal the polynomial degree plus one (ensuring exact integration on affine element shapes), we now set the number quadrature points as a separate variable (e.g. the polynomial degree plus two or three halves of the polynomial degree) to more accurately handle nonlinear terms. Since the evaluator is fed with the appropriate loop lengths via the template argument and keeps the number of quadrature points in the whole cell in the variable FEEvaluation::n_q_points, we now automatically operate on the more accurate formula without further changes.
The second difference is due to the fact that we are now evaluating a multicomponent system, as opposed to the scalar systems considered previously. The matrixfree framework provides several ways to handle the multicomponent case. The variant shown here utilizes an FEEvaluation object with multiple components embedded into it, specified by the fourth template argument dim + 2
for the components in the Euler system. As a consequence, the return type of FEEvaluation::get_value() is not a scalar any more (that would return a VectorizedArray type, collecting data from several elements), but a Tensor of dim+2
components. The functionality is otherwise similar to the scalar case; it is handled by a template specialization of a base class, called FEEvaluationAccess. An alternative variant would have been to use several FEEvaluation objects, a scalar one for the density, a vectorvalued one with dim
components for the momentum, and another scalar evaluator for the energy. To ensure that those components point to the correct part of the solution, the constructor of FEEvaluation takes three optional integer arguments after the required MatrixFree field, namely the number of the DoFHandler for multiDoFHandler systems (taking the first by default), the number of the quadrature point in case there are multiple Quadrature objects (see more below), and as a third argument the component within a vector system. As we have a single vector for all components, we would go with the third argument, and set it to 0
for the density, 1
for the vectorvalued momentum, and dim+1
for the energy slot. FEEvaluation then picks the appropriate subrange of the solution vector during FEEvaluationBase::read_dof_values() and FEEvaluation::distributed_local_to_global() or the more compact FEEvaluation::gather_evaluate() and FEEvaluation::integrate_scatter() calls.
When it comes to the evaluation of the body force vector, we distinguish between two cases for efficiency reasons: In case we have a constant function (derived from Functions::ConstantFunction), we can precompute the value outside the loop over quadrature points and simply use the value everywhere. For a more general function, we instead need to call the evaluate_function()
method we provided above; this path is more expensive because we need to access the memory associated with the quadrature point data.
The rest follows the other tutorial programs. Since we have implemented all physics for the Euler equations in the separate euler_flux()
function, all we have to do here is to call this function given the current solution evaluated at quadrature points, returned by phi.get_value(q)
, and tell the FEEvaluation object to queue the flux for testing it by the gradients of the shape functions (which is a Tensor of outer dim+2
components, each holding a tensor of dim
components for the \(x,y,z\) component of the Euler flux). One final thing worth mentioning is the order in which we queue the data for testing by the value of the test function, phi.submit_value()
, in case we are given an external function: We must do this after calling phi.get_value(q)
, because get_value()
(reading the solution) and submit_value()
(queuing the value for multiplication by the test function and summation over quadrature points) access the same underlying data field. Here it would be easy to achieve also without temporary variable w_q
since there is no mixing between values and gradients. For more complicated setups, one has to first copy out e.g. both the value and gradient at a quadrature point and then queue results again by FEEvaluationBase::submit_value() and FEEvaluationBase::submit_gradient().
As a final note, we mention that we do not use the first MatrixFree argument of this function, which is a callback from MatrixFree::loop(). The interfaces imposes the present list of arguments, but since we are in a member function where the MatrixFree object is already available as the data
variable, we stick with that to avoid confusion.
The next function concerns the computation of integrals on interior faces, where we need evaluators from both cells adjacent to the face. We associate the variable phi_m
with the solution component \(\mathbf{w}^\) and the variable phi_p
with the solution component \(\mathbf{w}^+\). We distinguish the two sides in the constructor of FEFaceEvaluation by the second argument, with true
for the interior side and false
for the exterior side, with interior and exterior denoting the orientation with respect to the normal vector.
Note that the calls FEFaceEvaluation::gather_evaluate() and FEFaceEvaluation::integrate_scatter() combine the access to the vectors and the sum factorization parts. This combined operation not only saves a line of code, but also contains an important optimization: Given that we use a nodal basis in terms of the Lagrange polynomials in the points of the GaussLobatto quadrature formula, only \((p+1)^{d1}\) out of the \((p+1)^d\) basis functions evaluate to nonzero on each face. Thus, the evaluator only accesses the necessary data in the vector and skips the parts which are multiplied by zero. If we had first read the vector, we would have needed to load all data from the vector, as the call in isolation would not know what data is required in subsequent operations. If the subsequent FEFaceEvaluation::evaluate() call requests values and derivatives, indeed all \((p+1)^d\) vector entries for each component are needed, as the normal derivative is nonzero for all basis functions.
The arguments to the evaluators as well as the procedure is similar to the cell evaluation. We again use the more accurate (over)integration scheme due to the nonlinear terms, specified as the third template argument in the list. At the quadrature points, we then go to our freestanding function for the numerical flux. It receives the solution evaluated at quadrature points from both sides (i.e., \(\mathbf{w}^\) and \(\mathbf{w}^+\)), as well as the normal vector onto the minus side. As explained above, the numerical flux is already multiplied by the normal vector from the minus side. We need to switch the sign because the boundary term comes with a minus sign in the weak form derived in the introduction. The flux is then queued for testing both on the minus sign and on the plus sign, with switched sign as the normal vector from the plus side is exactly opposed to the one from the minus side.
For faces located at the boundary, we need to impose the appropriate boundary conditions. In this tutorial program, we implement four cases as mentioned above. (A fifth case, for supersonic outflow conditions is discussed in the "Results" section below.) The discontinuous Galerkin method imposes boundary conditions not as constraints, but only weakly. Thus, the various conditions are imposed by finding an appropriate exterior quantity \(\mathbf{w}^+\) that is then handed to the numerical flux function also used for the interior faces. In essence, we "pretend" a state on the outside of the domain in such a way that if that were reality, the solution of the PDE would satisfy the boundary conditions we want.
For wall boundaries, we need to impose a nonormalflux condition on the momentum variable, whereas we use a Neumann condition for the density and energy with \(\rho^+ = \rho^\) and \(E^+ = E^\). To achieve the nonormal flux condition, we set the exterior values to the interior values and subtract two times the velocity in wallnormal direction, i.e., in the direction of the normal vector.
For inflow boundaries, we simply set the given Dirichlet data \(\mathbf{w}_\mathrm{D}\) as a boundary value. An alternative would have been to use \(\mathbf{w}^+ = \mathbf{w}^ + 2 \mathbf{w}_\mathrm{D}\), the socalled mirror principle.
The imposition of outflow is essentially a Neumann condition, i.e., setting \(\mathbf{w}^+ = \mathbf{w}^\). For the case of subsonic outflow, we still need to impose a value for the energy, which we derive from the respective function. A special step is needed for the case of backflow, i.e., the case where there is a momentum flux into the domain on the Neumann portion. According to the literature (a fact that can be derived by appropriate energy arguments), we must switch to another variant of the flux on inflow parts, see Gravemeier, Comerford, Yoshihara, Ismail, Wall, "A novel formulation for Neumann inflow conditions in biomechanics", Int. J. Numer. Meth. Biomed. Eng., vol. 28 (2012). Here, the momentum term needs to be added once again, which corresponds to removing the flux contribution on the momentum variables. We do this in a postprocessing step, and only for the case when we both are at an outflow boundary and the dot product between the normal vector and the momentum (or, equivalently, velocity) is negative. As we work on data of several quadrature points at once for SIMD vectorizations, we here need to explicitly loop over the array entries of the SIMD array.
In the implementation below, we check for the various types of boundaries at the level of quadrature points. Of course, we could also have moved the decision out of the quadrature point loop and treat entire faces as of the same kind, which avoids some map/set lookups in the inner loop over quadrature points. However, the loss of efficiency is hardly noticeable, so we opt for the simpler code here. Also note that the final else
clause will catch the case when some part of the boundary was not assigned any boundary condition via EulerOperator::set_..._boundary(...)
.
The next function implements the inverse mass matrix operation. The algorithms and rationale have been discussed extensively in the introduction, so we here limit ourselves to the technicalities of the MatrixFreeOperators::CellwiseInverseMassMatrix class. It does similar operations as the forward evaluation of the mass matrix, except with a different interpolation matrix, representing the inverse \(S^{1}\) factors. These represent a change of basis from the specified basis (in this case, the Lagrange basis in the points of the Gauss–Lobatto quadrature formula) to the Lagrange basis in the points of the Gauss quadrature formula. In the latter basis, we can apply the inverse of the pointwise JxW
factor, i.e., the quadrature weight times the determinant of the Jacobian of the mapping from reference to real coordinates. Once this is done, the basis is changed back to the nodal GaussLobatto basis again. All of these operations are done by the apply()
function below. What we need to provide is the local fields to operate on (which we extract from the global vector by an FEEvaluation object) and write the results back to the destination vector of the mass matrix operation.
One thing to note is that we added two integer arguments (that are optional) to the constructor of FEEvaluation, the first being 0 (selecting among the DoFHandler in multiDoFHandler systems; here, we only have one) and the second being 1 to make the quadrature formula selection. As we use the quadrature formula 0 for the overintegration of nonlinear terms, we use the formula 1 with the default \(p+1\) (or fe_degree+1
in terms of the variable name) points for the mass matrix. This leads to square contributions to the mass matrix and ensures exact integration, as explained in the introduction.
We now come to the function which implements the evaluation of the Euler operator as a whole, i.e., \(\mathcal M^{1} \mathcal L(t, \mathbf{w})\), calling into the local evaluators presented above. The steps should be clear from the previous code. One thing to note is that we need to adjust the time in the functions we have associated with the various parts of the boundary, in order to be consistent with the equation in case the boundary data is timedependent. Then, we call MatrixFree::loop() to perform the cell and face integrals, including the necessary ghost data exchange in the src
vector. The seventh argument to the function, true
, specifies that we want to zero the dst
vector as part of the loop, before we start accumulating integrals into it. This variant is preferred over explicitly calling dst = 0.;
before the loop as the zeroing operation is done on a subrange of the vector in parts that are written by the integrals nearby. This enhances data locality and allows for caching, saving one roundtrip of vector data to main memory and enhancing performance. The last two arguments to the loop determine which data is exchanged: Since we only access the values of the shape functions one faces, typical of firstorder hyperbolic problems, and since we have a nodal basis with nodes at the reference element surface, we only need to exchange those parts. This again saves precious memory bandwidth.
Once the spatial operator \(\mathcal L\) is applied, we need to make a second round and apply the inverse mass matrix. Here, we call MatrixFree::cell_loop() since only cell integrals appear. The cell loop is cheaper than the full loop as access only goes to the degrees of freedom associated with the locally owned cells, which is simply the locally owned degrees of freedom for DG discretizations. Thus, no ghost exchange is needed here.
Around all these functions, we put timer scopes to record the computational time for statistics about the contributions of the various parts.
Let us move to the function that does an entire stage of a Runge–Kutta update. It calls EulerOperator::apply() followed by some updates to the vectors, namely next_ri = solution + factor_ai * k_i
and solution += factor_solution * k_i
. Rather than performing these steps through the vector interfaces, we here present an alternative strategy that is faster on cachebased architectures. As the memory consumed by the vectors is often much larger than what fits into caches, the data has to effectively come from the slow RAM memory. The situation can be improved by loop fusion, i.e., performing both the updates to next_ki
and solution
within a single sweep. In that case, we would read the two vectors rhs
and solution
and write into next_ki
and solution
, compared to at least 4 reads and two writes in the baseline case. Here, we go one step further and perform the loop immediately when the mass matrix inversion has finished on a part of the vector. MatrixFree::cell_loop() provides a mechanism to attach an std::function
both before the loop over cells first touches a vector entry (which we do not use here, but is e.g. used for zeroing the vector) and a second std::function
to be called after the loop last touches an entry. The callback is in form of a range over the given vector (in terms of the local index numbering in the MPI universe) that can be addressed by local_element()
functions.
For this second callback, we create a lambda that works on a range and write the respective update on this range. Ideally, we would add the DEAL_II_OPENMP_SIMD_PRAGMA
before the local loop to suggest to the compiler to SIMD parallelize this loop (which means in practice that we ensure that there is no overlap, also called aliasing, between the index ranges of the pointers we use inside the loops). It turns out that at the time of this writing, GCC 7.2 fails to compile an OpenMP pragma inside a lambda function, so we comment this pragma out below. If your compiler is newer, you should be able to uncomment these lines again.
Note that we select a different code path for the last Runge–Kutta stage when we do not need to update the next_ri
vector. This strategy gives a considerable speedup. Whereas the inverse mass matrix and vector updates take more than 60% of the computational time with default vector updates on a 40core machine, the percentage is around 35% with the more optimized variant. In other words, this is a speedup of around a third.
Having discussed the implementation of the functions that deal with advancing the solution by one time step, let us now move to functions that implement other, ancillary operations. Specifically, these are functions that compute projections, evaluate errors, and compute the speed of information transport on a cell.
The first of these functions is essentially equivalent to VectorTools::project(), just much faster because it is specialized for DG elements where there is no need to set up and solve a linear system, as each element has independent basis functions. The reason why we show the code here, besides a small speedup of this noncritical operation, is that it shows additional functionality provided by MatrixFreeOperators::CellwiseInverseMassMatrix.
The projection operation works as follows: If we denote the matrix of shape functions evaluated at quadrature points by \(S\), the projection on cell \(K\) is an operation of the form \(\underbrace{S J^K S^\mathrm T}_{\mathcal M^K} \mathbf{w}^K = S J^K \tilde{\mathbf{w}}(\mathbf{x}_q)_{q=1:n_q}\), where \(J^K\) is the diagonal matrix containing the determinant of the Jacobian times the quadrature weight (JxW), \(\mathcal M^K\) is the cellwise mass matrix, and \(\tilde{\mathbf{w}}(\mathbf{x}_q)_{q=1:n_q}\) is the evaluation of the field to be projected onto quadrature points. (In reality the matrix \(S\) has additional structure through the tensor product, as explained in the introduction.) This system can now equivalently be written as \(\mathbf{w}^K = \left(S J^K S^\mathrm T\right)^{1} S J^K \tilde{\mathbf{w}}(\mathbf{x}_q)_{q=1:n_q} = S^{\mathrm T} \left(J^K\right)^{1} S^{1} S J^K \tilde{\mathbf{w}}(\mathbf{x}_q)_{q=1:n_q}\). Now, the term \(S^{1} S\) and then \(\left(J^K\right)^{1} J^K\) cancel, resulting in the final expression \(\mathbf{w}^K = S^{\mathrm T} \tilde{\mathbf{w}}(\mathbf{x}_q)_{q=1:n_q}\). This operation is implemented by MatrixFreeOperators::CellwiseInverseMassMatrix::transform_from_q_points_to_basis(). The name is derived from the fact that this projection is simply the multiplication by \(S^{\mathrm T}\), a basis change from the nodal basis in the points of the Gaussian quadrature to the given finite element basis. Note that we call FEEvaluation::set_dof_values() to write the result into the vector, overwriting previous content, rather than accumulating the results as typical in integration tasks – we can do this because every vector entry has contributions from only a single cell for discontinuous Galerkin discretizations.
The next function again repeats functionality also provided by the deal.II library, namely VectorTools::integrate_difference(). We here show the explicit code to highlight how the vectorization across several cells works and how to accumulate results via that interface: Recall that each lane of the vectorized array holds data from a different cell. By the loop over all cell batches that are owned by the current MPI process, we could then fill a VectorizedArray of results; to obtain a global sum, we would need to further go on and sum across the entries in the SIMD array. However, such a procedure is not stable as the SIMD array could in fact not hold valid data for all its lanes. This happens when the number of locally owned cells is not a multiple of the SIMD width. To avoid invalid data, we must explicitly skip those invalid lanes when accessing the data. While one could imagine that we could make it work by simply setting the empty lanes to zero (and thus, not contribute to a sum), the situation is more complicated than that: What if we were to compute a velocity out of the momentum? Then, we would need to divide by the density, which is zero – the result would consequently be NaN and contaminate the result. This trap is avoided by accumulating the results from the valid SIMD range as we loop through the cell batches, using the function MatrixFree::n_active_entries_per_cell_batch() to give us the number of lanes with valid data. It equals VectorizedArray::size() on most cells, but can be less on the last cell batch if the number of cells has a remainder compared to the SIMD width.
This final function of the EulerOperator class is used to estimate the transport speed, scaled by the mesh size, that is relevant for setting the time step size in the explicit time integrator. In the Euler equations, there are two speeds of transport, namely the convective velocity \(\mathbf{u}\) and the propagation of sound waves with sound speed \(c = \sqrt{\gamma p/\rho}\) relative to the medium moving at velocity \(\mathbf u\).
In the formula for the time step size, we are interested not by these absolute speeds, but by the amount of time it takes for information to cross a single cell. For information transported along with the medium, \(\mathbf u\) is scaled by the mesh size, so an estimate of the maximal velocity can be obtained by computing \(\J^{\mathrm T} \mathbf{u}\_\infty\), where \(J\) is the Jacobian of the transformation from real to the reference domain. Note that FEEvaluationBase::inverse_jacobian() returns the inverse and transpose Jacobian, representing the metric term from real to reference coordinates, so we do not need to transpose it again. We store this limit in the variable convective_limit
in the code below.
The sound propagation is isotropic, so we need to take mesh sizes in any direction into account. The appropriate mesh size scaling is then given by the minimal singular value of \(J\) or, equivalently, the maximal singular value of \(J^{1}\). Note that one could approximate this quantity by the minimal distance between vertices of a cell when ignoring curved cells. To get the maximal singular value of the Jacobian, the general strategy would be some LAPACK function. Since all we need here is an estimate, we can avoid the hassle of decomposing a tensor of VectorizedArray numbers into several matrices and go into an (expensive) eigenvalue function without vectorization, and instead use a few iterations (five in the code below) of the power method applied to \(J^{1}J^{\mathrm T}\). The speed of convergence of this method depends on the ratio of the largest to the next largest eigenvalue and the initial guess, which is the vector of all ones. This might suggest that we get slow convergence on cells close to a cube shape where all lengths are almost the same. However, this slow convergence means that the result will sit between the two largest singular values, which both are close to the maximal value anyway. In all other cases, convergence will be quick. Thus, we can merely hardcode 5 iterations here and be confident that the result is good.
Similarly to the previous function, we must make sure to accumulate speed only on the valid cells of a cell batch.
This class combines the EulerOperator class with the time integrator and the usual global data structures such as FiniteElement and DoFHandler, to actually run the simulations of the Euler problem.
The member variables are a triangulation, a finite element, a mapping (to create highorder curved surfaces, see e.g. step10), and a DoFHandler to describe the degrees of freedom. In addition, we keep an instance of the EulerOperator described above around, which will do all heavy lifting in terms of integrals, and some parameters for time integration like the current time or the time step size.
Furthermore, we use a PostProcessor instance to write some additional information to the output file, in similarity to what was done in step33. The interface of the DataPostprocessor class is intuitive, requiring us to provide information about what needs to be evaluated (typically only the values of the solution, except for the Schlieren plot that we only enable in 2D where it makes sense), and the names of what gets evaluated. Note that it would also be possible to extract most information by calculator tools within visualization programs such as ParaView, but it is so much more convenient to do it already when writing the output.
For the main evaluation of the field variables, we first check that the lengths of the arrays equal the expected values (the lengths 2*dim+4
or 2*dim+5
are derived from the sizes of the names we specify in the get_names() function below). Then we loop over all evaluation points and fill the respective information: First we fill the primal solution variables of density \(\rho\), momentum \(\rho \mathbf{u}\) and energy \(E\), then we compute the derived velocity \(\mathbf u\), the pressure \(p\), the speed of sound \(c=\sqrt{\gamma p / \rho}\), as well as the Schlieren plot showing \(s = \nabla \rho^2\) in case it is enabled. (See step69 for another example where we create a Schlieren plot.)
For the interpretation of quantities, we have scalar density, energy, pressure, speed of sound, and the Schlieren plot, and vectors for the momentum and the velocity.
With respect to the necessary update flags, we only need the values for all quantities but the Schlieren plot, which is based on the density gradient.
The constructor for this class is unsurprising: We set up a parallel triangulation based on the MPI_COMM_WORLD
communicator, a vector finite element with dim+2
components for density, momentum, and energy, a highorder mapping of the same degree as the underlying finite element, and initialize the time and time step to zero.
As a mesh, this tutorial program implements two options, depending on the global variable testcase
: For the analytical variant (testcase==0
), the domain is \((0, 10) \times (5, 5)\), with Dirichlet boundary conditions (inflow) all around the domain. For testcase==1
, we set the domain to a cylinder in a rectangular box, derived from the flow past cylinder testcase for incompressible viscous flow by Schäfer and Turek (1996). Here, we have a larger variety of boundaries. The inflow part at the left of the channel is given the inflow type, for which we choose a constant inflow profile, whereas we set a subsonic outflow at the right. For the boundary around the cylinder (boundary id equal to 2) as well as the channel walls (boundary id equal to 3) we use the wall boundary type, which is nonormal flow. Furthermore, for the 3D cylinder we also add a gravity force in vertical direction. Having the base mesh in place (including the manifolds set by GridGenerator::channel_with_cylinder()), we can then perform the specified number of global refinements, create the unknown numbering from the DoFHandler, and hand the DoFHandler and Mapping objects to the initialization of the EulerOperator.
In the following, we output some statistics about the problem. Because we often end up with quite large numbers of cells or degrees of freedom, we would like to print them with a comma to separate each set of three digits. This can be done via "locales", although the way this works is not particularly intuitive. step32 explains this in slightly more detail.
For output, we first let the Euler operator compute the errors of the numerical results. More precisely, we compute the error against the analytical result for the analytical solution case, whereas we compute the deviation against the background field with constant density and energy and constant velocity in \(x\) direction for the second test case.
The next step is to create output. This is similar to what is done in step33: We let the postprocessor defined above control most of the output, except for the primal field that we write directly. For the analytical solution test case, we also perform another projection of the analytical solution and print the difference between that field and the numerical solution. Once we have defined all quantities to be written, we build the patches for output. Similarly to step65, we create a highorder VTK output by setting the appropriate flag, which enables us to visualize fields of high polynomial degrees. Finally, we call the DataOutInterface::write_vtu_in_parallel()
function to write the result to the given file name. This function uses special MPI parallel write facilities, which are typically more optimized for parallel file systems than the standard library's std::ofstream
variants used in most other tutorial programs. A particularly nice feature of the write_vtu_in_parallel()
function is the fact that it can combine output from all MPI ranks into a single file, making it unnecessary to have a central record of all such files (namely, the "pvtu" file).
For parallel programs, it is often instructive to look at the partitioning of cells among processors. To this end, one can pass a vector of numbers to DataOut::add_data_vector() that contains as many entries as the current processor has active cells; these numbers should then be the rank of the processor that owns each of these cells. Such a vector could, for example, be obtained from GridTools::get_subdomain_association(). On the other hand, on each MPI process, DataOut will only read those entries that correspond to locally owned cells, and these of course all have the same value: namely, the rank of the current process. What is in the remaining entries of the vector doesn't actually matter, and so we can just get away with a cheap trick: We just fill all values of the vector we give to DataOut::add_data_vector() with the rank of the current MPI process. The key is that on each process, only the entries corresponding to the locally owned cells will be read, ignoring the (wrong) values in other entries. The fact that every process submits a vector in which the correct subset of entries is correct is all that is necessary.
The EulerProblem::run() function puts all pieces together. It starts off by calling the function that creates the mesh and sets up data structures, and then initializing the time integrator and the two temporary vectors of the lowstorage integrator. We call these vectors rk_register_1
and rk_register_2
, and use the first vector to represent the quantity \(\mathbf{r}_i\) and the second one for \(\mathbf{k}_i\) in the formulas for the Runge–Kutta scheme outlined in the introduction. Before we start the time loop, we compute the time step size by the EulerOperator::compute_cell_transport_speed()
function. For reasons of comparison, we compare the result obtained there with the minimal mesh size and print them to screen. For velocities and speeds of sound close to unity as in this tutorial program, the predicted effective mesh size will be close, but they could vary if scaling were different.
Now we are ready to start the time loop, which we run until the time has reached the desired end time. Every 5 time steps, we compute a new estimate for the time step – since the solution is nonlinear, it is most effective to adapt the value during the course of the simulation. In case the Courant number was chosen too aggressively, the simulation will typically blow up with time step NaN, so that is easy to detect here. One thing to note is that roundoff errors might propagate to the leading digits due to an interaction of slightly different time step selections that in turn lead to slightly different solutions. To decrease this sensitivity, it is common practice to round or truncate the time step size to a few digits, e.g. 3 in this case. In case the current time is near the prescribed 'tick' value for output (e.g. 0.02), we also write the output. After the end of the time loop, we summarize the computation by printing some statistics, which is mostly done by the TimerOutput::print_wall_time_statistics() function.
The main() function is not surprising and follows what was done in all previous MPI programs: As we run an MPI program, we need to call MPI_Init()
and MPI_Finalize()
, which we do through the Utilities::MPI::MPI_InitFinalize data structure. Note that we run the program only with MPI, and set the thread count to 1.
Running the program with the default settings on a machine with 40 processes produces the following output:
The program output shows that all errors are small. This is due to the fact that we use a relatively fine mesh of \(32^2\) cells with polynomials of degree 5 for a solution that is smooth. An interesting pattern shows for the time step size: whereas it is 0.0069 up to time 5, it increases to 0.0096 for later times. The step size increases once the vortex with some motion on top of the speed of sound (and thus faster propagation) leaves the computational domain between times 5 and 6.5. After that point, the flow is simply uniform in the same direction, and the maximum velocity of the gas is reduced compared to the previous state where the uniform velocity was overlaid by the vortex. Our time step formula recognizes this effect.
The final block of output shows detailed information about the timing of individual parts of the programs; it breaks this down by showing the time taken by the fastest and the slowest processor, and the average time – this is often useful in very large computations to find whether there are processors that are consistently overheated (and consequently are throttling their clock speed) or consistently slow for other reasons. The summary shows that 1283 time steps have been performed in 1.02 seconds (looking at the average time among all MPI processes), while the output of 11 files has taken additional 0.96 seconds. Broken down per time step and into the five Runge–Kutta stages, the compute time per evaluation is 0.16 milliseconds. This high performance is typical of matrixfree evaluators and a reason why explicit time integration is very competitive against implicit solvers, especially for largescale simulations. The breakdown of computational times at the end of the program run shows that the evaluation of integrals in \(\mathcal L_h\) contributes with around 0.92 seconds and the application of the inverse mass matrix with 0.06 seconds. Furthermore, the estimation of the transport speed for the time step size computation contributes with another 0.05 seconds of compute time.
If we use three more levels of global refinement and 9.4 million DoFs in total, the final statistics are as follows (for the modified Lax–Friedrichs flux, \(p=5\), and the same system of 40 cores of dualsocket Intel Xeon Gold 6230):
Per time step, the solver now takes 0.02 seconds, about 25 times as long as for the small problem with 147k unknowns. Given that the problem involves 64 times as many unknowns, the increase in computing time is not surprising. Since we also do 8 times as many time steps, the compute time should in theory increase by a factor of 512. The actual increase is 205 s / 1.02 s = 202. This is because the small problem size cannot fully utilize the 40 cores due to communication overhead. This becomes clear if we look into the details of the operations done per time step. The evaluation of the differential operator \(\mathcal L_h\) with nearest neighbor communication goes from 0.92 seconds to 127 seconds, i.e., it increases with a factor of 138. On the other hand, the cost for application of the inverse mass matrix and the vector updates, which do not need to communicate between the MPI processes at all, has increased by a factor of 1195. The increase is more than the theoretical factor of 512 because the operation is limited by the bandwidth from RAM memory for the larger size while for the smaller size, all vectors fit into the caches of the CPU. The numbers show that the mass matrix evaluation and vector update part consume almost 40% of the time spent by the Runge–Kutta stages – despite using a lowstorage Runge–Kutta integrator and merging of vector operations! And despite using overintegration for the \(\mathcal L_h\) operator. For simpler differential operators and more expensive time integrators, the proportion spent in the mass matrix and vector update part can also reach 70%. If we compute a throughput number in terms of DoFs processed per second and Runge–Kutta stage, we obtain
\[ \text{throughput} = \frac{n_\mathrm{time steps} n_\mathrm{stages} n_\mathrm{dofs}}{t_\mathrm{compute}} = \frac{10258 \cdot 5 \cdot 9.4\,\text{MDoFs}}{205s} = 2360\, \text{MDoFs/s} \]
This throughput number is very high, given that simply copying one vector to another one runs at only around 10,000 MDoFs/s.
If we go to the nextlarger size with 37.7 million DoFs, the overall simulation time is 2196 seconds, with 1978 seconds spent in the time stepping. The increase in run time is a factor of 9.3 for the L_h operator (1179 versus 127 seconds) and a factor of 10.3 for the inverse mass matrix and vector updates (797 vs 77.5 seconds). The reason for this nonoptimal increase in run time can be traced back to cache effects on the given hardware (with 40 MB of L2 cache and 55 MB of L3 cache): While not all of the relevant data fits into caches for 9.4 million DoFs (one vector takes 75 MB and we have three vectors plus some additional data in MatrixFree), there is capacity for one and a half vector nonetheless. Given that modern caches are more sophisticated than the naive leastrecentlyused strategy (where we would have little reuse as the data is used in a streaminglike fashion), we can assume that a sizeable fraction of data can indeed be delivered from caches for the 9.4 million DoFs case. For the larger case, even with optimal caching less than 10 percent of data would fit into caches, with an associated loss in performance.
For the modified Lax–Friedrichs flux and measuring the error in the momentum variable, we obtain the following convergence table (the rates are very similar for the density and energy variables):
p=2  p=3  p=5  

n_cells  n_dofs  error mom  rate  n_dofs  error mom  rate  n_dofs  error mom  rate 
16  2,304  1.373e01  
64  4,096  9.130e02  9,216  8.899e03  3.94  
256  9,216  5.577e02  16,384  7.381e03  3.64  36,864  2.082e04  5.42  
1024  36,864  4.724e03  3.56  65,536  3.072e04  4.59  147,456  2.625e06  6.31 
4096  147,456  6.205e04  2.92  262,144  1.880e05  4.03  589,824  3.268e08  6.33 
16,384  589,824  8.279e05  2.91  1,048,576  1.224e06  3.94  2,359,296  9.252e10  5.14 
65,536  2,359,296  1.105e05  2.91  4,194,304  7.871e08  3.96  9,437,184  1.369e10  2.77 
262,144  9,437,184  1.615e06  2.77  16,777,216  4.961e09  3.99  37,748,736  7.091e11  0.95 
If we switch to the HartenLaxvan Leer flux, the results are as follows:
p=2  p=3  p=5  

n_cells  n_dofs  error mom  rate  n_dofs  error mom  rate  n_dofs  error mom  rate 
16  2,304  1.339e01  
64  4,096  9.037e02  9,216  8.849e03  3.92  
256  9,216  4.204e02  16,384  9.143e03  3.31  36,864  2.501e04  5.14  
1024  36,864  4.913e03  3.09  65,536  3.257e04  4.81  147,456  3.260e06  6.26 
4096  147,456  7.862e04  2.64  262,144  1.588e05  4.36  589,824  2.953e08  6.79 
16,384  589,824  1.137e04  2.79  1,048,576  9.400e07  4.08  2,359,296  4.286e10  6.11 
65,536  2,359,296  1.476e05  2.95  4,194,304  5.799e08  4.02  9,437,184  2.789e11  3.94 
262,144  9,437,184  2.038e06  2.86  16,777,216  3.609e09  4.01  37,748,736  5.730e11  1.04 
The tables show that we get optimal \(\mathcal O\left(h^{p+1}\right)\) convergence rates for both numerical fluxes. The errors are slightly smaller for the Lax–Friedrichs flux for \(p=2\), but the picture is reversed for \(p=3\); in any case, the differences on this testcase are relatively small.
For \(p=5\), we reach the roundoff accuracy of \(10^{11}\) with both fluxes on the finest grids. Also note that the errors are absolute with a domain length of \(10^2\), so relative errors are below \(10^{12}\). The HLL flux is somewhat better for the highest degree, which is due to a slight inaccuracy of the Lax–Friedrichs flux: The Lax–Friedrichs flux sets a Dirichlet condition on the solution that leaves the domain, which results in a small artificial reflection, which is accentuated for the Lax–Friedrichs flux. Apart from that, we see that the influence of the numerical flux is minor, as the polynomial part inside elements is the main driver of the accucary. The limited influence of the flux also has consequences when trying to approach more challenging setups with the higherorder DG setup: Taking for example the parameters and grid of step33, we get oscillations (which in turn make density negative and make the solution explode) with both fluxes once the highmass part comes near the boundary, as opposed to the loworder finite volume case ( \(p=0\)). Thus, any case that leads to shocks in the solution necessitates some form of limiting or artificial dissipation. For another alternative, see the step69 tutorial program.
For the test case of the flow around a cylinder in a channel, we need to change the first code line to
This test case starts with a background field of a constant velocity of Mach number 0.31 and a constant initial density; the flow will have to go around an obstacle in the form of a cylinder. Since we impose a nopenetration condition on the cylinder walls, the flow that initially impinges headon onto to cylinder has to rearrange, which creates a big sound wave. The following pictures show the pressure at times 0.1, 0.25, 0.5, and 1.0 (top left to bottom right) for the 2D case with 5 levels of global refinement, using 102,400 cells with polynomial degree of 5 and 14.7 million degrees of freedom over all 4 solution variables. We clearly see the discontinuity that propagates slowly in the upstream direction and more quickly in downstream direction in the first snapshot at time 0.1. At time 0.25, the sound wave has reached the top and bottom walls and reflected back to the interior. From the different distances of the reflected waves from lower and upper walls we can see the slight asymmetry of the SchäferTurek test case represented by GridGenerator::channel_with_cylinder() with somewhat more space above the cylinder compared to below. At later times, the picture is more chaotic with many sound waves all over the place.
The next picture shows an elevation plot of the pressure at time 1.0 looking from the channel inlet towards the outlet at the same resolution – here, we can see the large number of reflections. In the figure, two types of waves are visible. The largeramplitude waves correspond to various reflections that happened as the initial discontinuity hit the walls, whereas the smallamplitude waves of size similar to the elements correspond to numerical artifacts. They have their origin in the finite resolution of the scheme and appear as the discontinuity travels through elements with highorder polynomials. This effect can be cured by increasing resolution. Apart from this effect, the rich wave structure is the result of the transport accuracy of the highorder DG method.
With 2 levels of global refinement with 1,600 cells, the mesh and its partitioning on 40 MPI processes looks as follows:
When we run the code with 4 levels of global refinements on 40 cores, we get the following output:
The norms shown here for the various quantities are the deviations \(\rho'\), \((\rho u)'\), and \(E'\) against the background field (namely, the initial condition). The distribution of run time is overall similar as in the previous test case. The only slight difference is the larger proportion of time spent in \(\mathcal L_h\) as compared to the inverse mass matrix and vector updates. This is because the geometry is deformed and the matrixfree framework needs to load additional arrays for the geometry from memory that are compressed in the affine mesh case.
Increasing the number of global refinements to 5, the output becomes:
The effect on performance is similar to the analytical test case – in theory, computation times should increase by a factor of 8, but we actually see an increase by a factor of 11 for the time steps (219.5 seconds versus 2450 seconds). This can be traced back to caches, with the small case mostly fitting in caches. An interesting effect, typical of programs with a mix of local communication (integrals \(\mathcal L_h\)) and global communication (computation of transport speed) with some load imbalance, can be observed by looking at the MPI ranks that encounter the minimal and maximal time of different phases, respectively. Rank 0 reports the fastest throughput for the "rk time stepping total" part. At the same time, it appears to be slowest for the "compute transport speed" part, almost a factor of 2 slower than the average and almost a factor of 4 compared to the faster rank. Since the latter involves global communication, we can attribute the slowness in this part to the fact that the local Runge–Kutta stages have advanced more quickly on this rank and need to wait until the other processors catch up. At this point, one can wonder about the reason for this imbalance: The number of cells is almost the same on all MPI processes. However, the matrixfree framework is faster on affine and Cartesian cells located towards the outlet of the channel, to which the lower MPI ranks are assigned. On the other hand, rank 32, which reports the highest run time for the Runga–Kutta stages, owns the curved cells near the cylinder, for which no data compression is possible. To improve throughput, we could assign different weights to different cell types when partitioning the parallel::distributed::Triangulation object, or even measure the run time for a few time steps and try to rebalance then.
The throughput per Runge–Kutta stage can be computed to 2085 MDoFs/s for the 14.7 million DoFs test case over the 346,000 Runge–Kutta stages, slightly slower than the Cartesian mesh throughput of 2360 MDoFs/s reported above.
Finally, if we add one additional refinement, we record the following output:
The "rk time stepping total" part corresponds to a throughput of 2010 MDoFs/s. The overall run time to perform 139k time steps is 20k seconds (5.7 hours) or 7 time steps per second – not so bad for having nearly 60 million unknowns. More throughput can be achieved by adding more cores to the computation.
Switching the channel test case to 3D with 3 global refinements, the output is
The physics are similar to the 2D case, with a slight motion in the z direction due to the gravitational force. The throughput per Runge–Kutta stage in this case is
\[ \text{throughput} = \frac{n_\mathrm{time steps} n_\mathrm{stages} n_\mathrm{dofs}}{t_\mathrm{compute}} = \frac{17723 \cdot 5 \cdot 221.2\,\text{M}}{15580s} = 1258\, \text{MDoFs/s}. \]
The throughput is lower than in 2D because the computation of the \(\mathcal L_h\) term is more expensive. This is due to overintegration with degree+2
points and the larger fraction of face integrals (worse volumetosurface ratio) with more expensive flux computations. If we only consider the inverse mass matrix and vector update part, we record a throughput of 4857 MDoFs/s for the 2D case of the isentropic vortex with 37.7 million unknowns, whereas the 3D case runs with 4535 MDoFs/s. The performance is similar because both cases are in fact limited by the memory bandwidth.
If we go to four levels of global refinement, we need to increase the number of processes to fit everything in memory – the computation needs around 350 GB of RAM memory in this case. Also, the time it takes to complete 35k time steps becomes more tolerable by adding additional resources. We therefore use 6 nodes with 40 cores each, resulting in a computation with 240 MPI processes:
This simulation had nearly 2 billion unknowns – quite a large computation indeed, and still only needed around 1.5 seconds per time step.
The code presented here straightforwardly extends to adaptive meshes, given appropriate indicators for setting the refinement flags. Largescale adaptivity of a similar solver in the context of the acoustic wave equation has been achieved by the exwave project. However, in the present context, the benefits of adaptivity are often limited to early times and effects close to the origin of sound waves, as the waves eventually reflect and diffract. This leads to steep gradients all over the place, similar to turbulent flow, and a more or less globally refined mesh.
Another topic that we did not discuss in the results section is a comparison of different time integration schemes. The program provides four variants of lowstorage Runga–Kutta integrators that each have slightly different accuracy and stability behavior. Among the schemes implemented here, the higherorder ones provide additional accuracy but come with slightly lower efficiency in terms of step size per stage before they violate the CFL condition. An interesting extension would be to compare the lowstorage variants proposed here with standard Runge–Kutta integrators or to use vector operations that are run separate from the mass matrix operation and compare performance.
As mentioned in the introduction, the modified Lax–Friedrichs flux and the HLL flux employed in this program are only two variants of a large body of numerical fluxes available in the literature on the Euler equations. One example is the HLLC flux (HartenLaxvan LeerContact) flux which adds the effect of rarefaction waves missing in the HLL flux, or the Roe flux. As mentioned in the introduction, the effect of numerical fluxes on highorder DG schemes is debatable (unlike for the case of loworder discretizations).
A related improvement to increase the stability of the solver is to also consider the spatial integral terms. A shortcoming in the rather naive implementation used above is the fact that the energy conservation of the original Euler equations (in the absence of shocks) only holds up to a discretization error. If the solution is underresolved, the discretization error can give rise to an increase in the numerical energy and eventually render the discretization unstable. This is because of the inexact numerical integration of the terms in the Euler equations, which both contain rational nonlinearities and higherdegree content from curved cells. A way out of this dilemma are socalled skewsymmetric formulations, see [56] for a simple variant. Skew symmetry means that switching the role of the solution \(\mathbf{w}\) and test functions \(\mathbf{v}\) in the weak form produces the exact negative of the original quantity, apart from some boundary terms. In the discrete setting, the challenge is to keep this skew symmetry also when the integrals are only computed approximately (in the continuous case, skewsymmetry is a consequence of integration by parts). Skewsymmetric numerical schemes balance spatial derivatives in the conservative form \((\nabla \mathbf v, \mathbf{F}(\mathbf w))_{K}\) with contributions in the convective form \((\mathbf v, \tilde{\mathbf{F}}(\mathbf w)\nabla \mathbf{w})_{K}\) for some \(\tilde{\mathbf{F}}\). The precise terms depend on the equation and the integration formula, and can in some cases by understood by special skewsymmetric finite difference schemes.
To get started, interested readers could take a look at https://github.com/kronbichler/advection_miniapp, where a skewsymmetric DG formulation is implemented with deal.II for a simple advection equation.
As mentioned in the introduction, the solution to the Euler equations develops shocks as the Mach number increases, which require additional mechanisms to stabilize the scheme, e.g. in the form of limiters. The main challenge besides actually implementing the limiter or artificial viscosity approach would be to loadbalance the computations, as the additional computations involved for limiting the oscillations in troubled cells would make them more expensive than the plain DG cells without limiting. Furthermore, additional numerical fluxes that better cope with the discontinuities would also be an option.
One ingredient also necessary for supersonic flows are appropriate boundary conditions. As opposed to the subsonic outflow boundaries discussed in the introduction and implemented in the program, all characteristics are outgoing for supersonic outflow boundaries, so we do not want to prescribe any external data,
\[ \mathbf{w}^+ = \mathbf{w}^ = \begin{pmatrix} \rho^\\ (\rho \mathbf u)^ \\ E^\end{pmatrix} \quad \text{(Neumann)}. \]
In the code, we would simply add the additional statement
in the local_apply_boundary_face()
function.
When the interest with an Euler solution is mostly in the propagation of sound waves, it often makes sense to linearize the Euler equations around a background state, i.e., a given density, velocity and energy (or pressure) field, and only compute the change against these fields. This is the setting of the wide field of aeroacoustics. Even though the resolution requirements are sometimes considerably reduced, implementation gets somewhat more complicated as the linearization gives rise to additional terms. From a code perspective, in the operator evaluation we also need to equip the code with the state to linearize against. This information can be provided either by analytical functions (that are evaluated in terms of the position of the quadrature points) or by a vector similar to the solution. Based on that vector, we would create an additional FEEvaluation object to read from it and provide the values of the field at quadrature points. If the background velocity is zero and the density is constant, the linearized Euler equations further simplify and can equivalently be written in the form of the acoustic wave equation.
A challenge in the context of sound propagation is often the definition of boundary conditions, as the computational domain needs to be of finite size, whereas the actual simulation often spans an infinite (or at least much larger) physical domain. Conventional Dirichlet or Neumann boundary conditions give rise to reflections of the sound waves that eventually propagate back to the region of interest and spoil the solution. Therefore, various variants of nonreflecting boundary conditions or sponge layers, often in the form of perfectly matched layers – where the solution is damped without reflection – are common.
The solver presented in this tutorial program can also be extended to the compressible Navier–Stokes equations by adding viscous terms, as described in [50]. To keep as much of the performance obtained here despite the additional cost of elliptic terms, e.g. via an interior penalty method, one can switch the basis from FE_DGQ to FE_DGQHermite like in the step59 tutorial program.
In this tutorial, we used facecentric loops. Here, cell and face integrals are treated in separate loops, resulting in multiple writing accesses into the result vector, which is relatively expensive on modern hardware since writing operations generally result also in an implicit read operation. Elementcentric loops, on the other hand, are processing a cell and in direct succession processing all its 2d faces. Although this kind of loop implies that fluxes have to be computed twice (for each side of an interior face), the fact that the result vector has to accessed only once might  and the fact that the resulting algorithm is free of raceconditions and as such perfectly suitable for shared memory  already give a performance boost. If you are interested in these advanced topics, you can take a look at step76 where we take the present tutorial and modify it so that we can use these features.