Optimized sparse polynomial chaos expansion with entropy regularization

Sparse polynomial chaos expansion (PCE) is widely used in engineering to quantitatively analyse the influence of uncertainty while alleviating the curse of dimensionality. However, current sparse PCE techniques focus on choosing features with the largest coefficients, which may ignore uncertainties propagated through high-order features. Hence, this paper proposes selecting the polynomial chaos basis based on information entropy, aiming to retain the advantages of existing sparse techniques while treating the entropy change as a measure of output uncertainty. A novel entropy-based optimization method is proposed to update state-of-the-art sparse PCE models. This work further develops an entropy-based synthetic sparse model with higher computational efficiency. Two benchmark functions and a computational fluid dynamics (CFD) experiment are used to compare the accuracy and efficiency of the proposed method against classical methods. The results show that entropy-based methods better capture the features of uncertainty propagation, improving accuracy and reducing sparsity while avoiding over-fitting.


Introduction
Due to the variety of uncertainties frequently involved in engineering applications, which may cause fluctuations in the performance of a system, it is necessary to comprehensively consider the impact of uncertain factors [1,2]. To deal with the problem of limited experimental resources, various surrogate models [3,4] have been proposed, aiming to construct a mathematical model that accurately mimics the behavior of the original problem with an affordable experimental design [5]. Popular surrogate techniques include the Kriging method [6-9], artificial neural networks [10-12], and the polynomial chaos expansion (PCE) method [13-16]. In this paper, we focus on the PCE model, which has been widely used in engineering to quantify uncertainties efficiently.
The basic idea of PCE is to expand an exact solution of a stochastic process in a polynomial basis [17-19]; the resulting model can generally be solved by ordinary least squares (OLS). Many successful applications of uncertainty quantification based on the PCE method have been reported [20-22]. However, the cost of constructing a PCE model increases exponentially with the dimension of the input parameters, i.e. the curse of dimensionality, which severely restricts practical applications of the model at the industrial level [23-25]. To solve this problem, many sparse algorithms have been developed in recent years. Typical examples of well-established methods are sparse regression [26,27] and compressive sampling [28,29].
A problem with OLS-based methods is the possibility of over-fitting, which can be avoided by regularization; regularization is also a practical way to seek a sparse representation [30]. Regularization is a typical model selection technique that adds a regularizer to the empirical risk. From the perspective of Bayesian estimation, the regularizer corresponds to the prior probability of the model [31].
There are many regularization methods, but ℓ0 regularization is required to obtain a truly sparse model. However, solving a model with ℓ0 regularization is NP-hard [32,33], so researchers usually adopt ℓ1 regularization when dealing with sparse problems [34]; the ℓ1 norm is the optimal convex approximation of the ℓ0 norm, and the resulting problem can be solved efficiently to obtain a sparse model.
Although scholars have conducted a considerable amount of research on sparse polynomial chaos expansions, most existing sparse methods are derived from the perspective of regression [35]. In these methods, researchers usually pay attention to one or two numerical characteristics of the distribution, which are important statistics for measuring uncertainty. However, the distribution is often complicated in practice, especially when there is noise in the original problem. Furthermore, the aforementioned methods optimize parameters using statistics such as the relative mean squared error (MSE) or the cross-validation error, which ignores the overall uncertainty propagated from the input variables.
To depict the full uncertainty of the output, we introduce information entropy as a measure of the evolution of the output distribution and use it to further optimize the sparse representation of a PCE model. Information entropy was first introduced by Shannon to measure microscopic uncertainty and multiplicity [36]. The idea was then extended to learning tasks to measure information changes across different statistical models. Jaynes [37] proposed the maximum entropy principle (MEP) as an optimization criterion. Grechuk et al. [38] studied the relationships among error, likelihood, and entropy in regression analysis, and found that the normal distribution can be recovered from the maximum entropy principle. Wang [39] introduced a maximum entropy penalty and incorporated the entropy-controlled framework into other conventional learning algorithms. Liang [40] proposed sparse subspace clustering with an entropy norm, using information entropy as the regularizer of the objective function. Researchers have further performed variable selection by evaluating entropy, extending its use as an important criterion for model selection [41,42]; these uses can be embedded in existing methods but do not form a framework.
In this work, we aim to use entropy to propagate input-output uncertainty while retaining the advantages of existing sparse PCE techniques. Hence, we propose a novel method in which entropy serves as an auxiliary penalty term. First, a general re-optimization model is proposed that chooses the features with the largest information entropy based on an existing optimized model. Although this strategy is easy to implement, it may ignore key features discarded in the first stage. A hybrid entropy-based synthetic method, embedded in several commonly used classical sparse criteria, is therefore also proposed. The novel regularization structure takes the value and the volatility of each feature into account simultaneously; the entropy term can be regarded as a trade-off between the predicted mean value and the uncertainty carried by the selected features. Both algorithms can be computed easily alongside existing sparse algorithms, which broadens their scope and availability. Experiments are performed to compare the two proposed algorithms with classical sparse methods, namely Orthogonal Matching Pursuit (OMP), Least Angle Regression (LARS), Subspace Pursuit (SP), and Bayesian Compressive Sensing (BCS). The results show that the general re-optimization method is simple to operate and universal, because it can optimize any sparse PCE model, while the hybrid entropy-based synthetic method is more targeted and is superior in high-dimensional complex applications.
The remainder of the paper is organized as follows. Section 2 introduces the fundamental theory of sparse PCE models. Section 3 proposes the theories and algorithms of the two optimization methods mentioned above. Section 4 demonstrates the performance of the proposed methods through numerical examples and comparisons. Finally, the conclusions are summarized in Section 5.

Sparse polynomial chaos expansion
The basics of PCE are elaborated briefly as follows. Let M be the original model and Y = M(ξ) be the output, where ξ is a d-dimensional random vector with probability density function (PDF) f(ξ). The finite-order PCE model expanded in the full polynomial space can be expressed as

Y ≈ M_PC(ξ) = Σ_{i=0}^{P-1} c_i Ψ_i(ξ),

where {Ψ_i(ξ)} is a set of polynomial basis functions truncated at the p-th order and P denotes the number of terms of the PCE. Generally, P varies with different truncation schemes; for example, when the total-degree space is chosen, P = (p + d)! / (p! d!), with p the maximum degree. Given a design of experiments (DoE) {ξ, Y}, where ξ = (ξ^(1), ξ^(2), …, ξ^(n)) ∈ R^{d×n} are the samples and Y ∈ R^n are the corresponding responses, the main effort of constructing a PCE model is solving the generalized linear system

Σ_{i=0}^{P-1} c_i Ψ_i(ξ^(j)) = y_j, j = 1, …, n,

which can be rewritten in matrix form as Ψc = Y, with Ψ_{ji} = Ψ_i(ξ^(j)). Although we can easily obtain the least-squares estimate ĉ = (Ψ^T Ψ)^{-1} Ψ^T Y, the number of terms P increases dramatically with p and d; hence, with limited DoE resources we must solve an underdetermined system, and the least-squares solution can be inaccurate and unstable. The sparse PCE method is proposed to solve this problem: it restores the complete model response almost exactly by selecting the small number of basis functions that dominate the system output. The ℓ0 norm is the most widely used regularization criterion, i.e. the number of non-zero coefficients is limited by solving the optimization problem

ĉ = arg min_c ||c||_0 subject to Ψc = Y. (5)

However, Eq. 5 is a non-convex optimization problem, which can be difficult to solve in practice due to its NP-hardness. A typical practical technique is to replace the ℓ0 norm with the ℓ1 norm [43], and the new objective function becomes

ĉ = arg min_c ||c||_1 subject to Ψc = Y. (6)
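As an illustration of the least-squares construction above, the following sketch fits a PCE by OLS (Python/NumPy; a 1-D Legendre basis stands in for a general multivariate one, and all names are illustrative, not the paper's implementation):

```python
import numpy as np
from numpy.polynomial import legendre

def fit_pce_ols(xi, y, p):
    # Legendre polynomials are orthogonal w.r.t. the uniform density on [-1, 1];
    # legvander builds the (n, p+1) design matrix of P_0..P_p evaluated at xi.
    Psi = legendre.legvander(xi, p)
    # Least-squares estimate c_hat = (Psi^T Psi)^{-1} Psi^T y
    c_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return Psi, c_hat

rng = np.random.default_rng(0)
xi = rng.uniform(-1.0, 1.0, 200)              # DoE: 200 uniform samples
y = np.sin(np.pi * xi) + 0.5 * xi**3          # toy model response
Psi, c = fit_pce_ols(xi, y, p=7)
y_hat = Psi @ c                               # surrogate prediction at the DoE
```

With n = 200 samples and only p + 1 = 8 unknowns the system is well determined here; the sparsity problem arises precisely when P grows faster than n.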

Error estimation and model evaluation
The optimization problem defined in Eq. 6 is usually solved by relaxing the equality constraint, i.e.

ĉ = arg min_c ||c||_1 subject to ||Ψc − Y||_2 ≤ ε,

where ε denotes the truncation error, determined by the measurement noise, which is a more natural parameterization choice. If ε is too large, the reconstructed PC expansion is not accurate enough; if ε is too small, the reconstructed PC expansion may over-fit. The leave-one-out (LOO) error is used in practice to measure the degree of over-fitting, and can be expressed as

E_LOO = (1/n) Σ_{i=1}^{n} ( (y_i − M_PC(ξ^(i))) / (1 − h_i) )^2,

where h_i denotes the i-th diagonal element of the hat matrix Ψ(Ψ^T Ψ)^{-1}Ψ^T. E_LOO can be computed cheaply once the least-squares solution with respect to the full design is available (see [27]).
We choose the PCE model with the smallest LOO error during the iteration of finding the best sparse solution.
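The hat-matrix shortcut for the LOO error can be sketched as follows (Python/NumPy; the names are illustrative, and normalizing by the sample variance is an assumption in the style of common PCE toolboxes):

```python
import numpy as np

def loo_error(Psi, y):
    # Hat matrix H = Psi (Psi^T Psi)^{-1} Psi^T ; h holds its diagonal entries.
    G = np.linalg.inv(Psi.T @ Psi)
    h = np.einsum('ij,jk,ik->i', Psi, G, Psi)
    c = G @ Psi.T @ y                       # least-squares coefficients
    resid = y - Psi @ c
    # E_LOO = (1/n) sum_i (resid_i / (1 - h_i))^2, normalized by Var(y)
    return np.mean((resid / (1.0 - h)) ** 2) / np.var(y)
```

For OLS, (y_i − ŷ_{-i}) = resid_i / (1 − h_i) holds exactly, so this single fit reproduces the n leave-one-out refits.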
When comparing the performance of different surrogate models, our interest lies in both precision and uncertainty. The relative MSE is a widely used measure of the precision of a surrogate, defined by

relative MSE = Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2, where ȳ = (1/n) Σ_{i=1}^{n} y_i.

As for uncertainty, we pay special attention to the probability density function (PDF) and the cumulative distribution function (CDF).

The concept of information entropy
The type of entropy addressed in this work is information entropy, first proposed by Shannon in 1948 [36]. The most general interpretation of entropy is as a measure of uncertainty, referring to the information contained in the system [44]. Specifically, the more information the system contains, the smaller the uncertainty and the corresponding entropy.
For a discrete random variable X = {x_1, …, x_n} with probabilities P(x_i), information entropy is defined as

H(X) = − Σ_{i=1}^{n} P(x_i) log P(x_i).

Similarly, for a continuous random variable with density p(y), the differential entropy is

h(Y) = − ∫ p(y) log p(y) dy.

Based on information entropy, when predicting a probability model of random variables, the main idea of the MEP is that among all candidate distributions, the one that maximizes entropy should be selected. In other words, the best probability distribution has maximal entropy, which leads to the optimization problem

max_p − ∫ p(y) log p(y) dy,

subject to the available constraints, where p(y) is the probability density function of the element y in Y. Hence, by combining the MEP with classical sparse PCE techniques, information entropy can be used as a sparse optimization criterion for the sparse PCE model, since it retains the largest amount of uncertainty while keeping the model sparse.
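As a concrete illustration of the discrete estimate used later in the paper, a histogram-based Shannon entropy of a Monte Carlo sample can be sketched as follows (Python/NumPy; the bin count is an arbitrary assumption):

```python
import numpy as np

def shannon_entropy(samples, bins=50):
    # Estimate H(X) = -sum_i p_i log p_i from a histogram of the sample.
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
H_flat = shannon_entropy(rng.uniform(0.0, 1.0, 10_000))  # near log(50)
H_const = shannon_entropy(np.full(10_000, 3.14))         # degenerate: 0
```

A constant (fully certain) sample has zero entropy, while a spread-out sample approaches the maximum log(bins), matching the interpretation in the text.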

Methodology
In this section, two adaptive methods are proposed. The first is an entropy-based re-optimization of the results produced by an existing sparse model, which can be thought of as a simple external plug-in. The second is a hybrid optimization that uses entropy as a regularizer; it is the main method of this paper and is more advantageous in high-dimensional problems.

Entropy-based re-optimization sparse PCE
As mentioned at the end of Section 2.3, the aim of this paper is to learn a sparse model while preserving the maximum uncertainty of the results. Thus, we can directly perform entropy-based secondary optimization from Eq. 12 on top of existing sparse algorithms (i.e., entropy-based re-optimization sparse PCE). Typically, the sparse PCE surrogate takes the form

Y_S = Σ_{i∈I} c_i Ψ_i(ξ),

where I is the sparse index set.
With a classical sparse PCE method, we obtain the model response Y_S. Additional model responses are produced by repeated sampling and can then be re-optimized based on entropy sorting. Our goal is to select the W terms with the largest entropy values. The objective during re-optimization is to find an index set J ⊆ I with |J| = W that maximizes

H(Y_S) ≈ Σ_{j=1}^{W} H(y_j), where y_j = Ψ_j c_j

is the contribution of the j-th retained term. Since calculating the differential entropy of Y_S directly is complicated, it is approximated by the Shannon entropy of Monte Carlo samples: if we draw m samples as a verification set, the Shannon entropy of y_j can be estimated from the empirical distribution of its m sampled values. The pseudo-code for entropy-based re-optimization sparse PCE is given in Algorithm 1. The main advantage of this method is that it can be easily implemented and extended to various sparse methods without modifying their internal code; in other words, it is universal and maneuverable, and can be considered an external plug-in. However, it requires longer computation time in high-dimensional cases because of the OLS step used in the re-optimization process.
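A minimal sketch of this re-optimization step follows (Python/NumPy; `entropy_fn`, the validation design `Psi_val`, and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ent_reoptimize(Psi_act, y, Psi_val, c_act, W, entropy_fn):
    # Entropy of each retained term's contribution y_j = Psi_j c_j,
    # evaluated on a validation (Monte Carlo) sample.
    H = np.array([entropy_fn(Psi_val[:, j] * c_act[j])
                  for j in range(Psi_act.shape[1])])
    keep = np.argsort(H)[::-1][:W]         # W terms with largest entropy
    # Refit the kept terms by ordinary least squares on the training design.
    c_new, *_ = np.linalg.lstsq(Psi_act[:, keep], y, rcond=None)
    return keep, c_new
```

Because the selection happens after a classical sparse fit, only terms already in the active set can survive; this is exactly the limitation that motivates the hybrid method in the next subsection.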

Hybrid entropy-based comprehensive sparse PCE
In Section 3.1, a universal entropy-based re-optimization sparse PCE method is proposed, which is essentially a subsequent optimization of the results trained by classical sparse PCE methods. However, this re-optimization works only with the features and coefficients retained by the sparse algorithm, so features that have already been discarded cannot be retrieved, even though they may strongly influence the overall uncertainty. As model complexity increases, more features are discarded by the classical sparse method, so the optimization effectiveness of Ent-PCE becomes limited. Therefore, information entropy should instead be considered at the moment features are selected. Several sparse algorithms are commonly used in the literature, some of which are available in open-source toolboxes. In this section, for these specific sparse PCE algorithms, a hybrid entropy-based comprehensive sparse PCE method is proposed.
In the PCE model, the unknown coefficients correspond to the polynomial chaos basis. Reducing the dimension of the coefficient vector can be seen as feature selection over the polynomial chaos basis. From the perspective of regression, we usually select the basis functions that have the greatest impact on the residual, as in the OMP method. However, the polynomial chaos basis also stores the probabilistic information of the predicted response: from the MEP perspective, a basis function with larger entropy contains more uncertainty information and is more important to the response.
To maximize the propagation of the input uncertainty into the response, we introduce the entropy of the basis functions to regularize the classical model. Hence, at each iteration the selection step is re-defined as follows. Find the index i that is closely related to the residual and has a large entropy value by solving

i* = arg max_i ( |Ψ_i^T r| + γ H(Ψ_i) ),

update the index set I = I ∪ {i*}, and compute the coefficients c* using only the active indices by least squares:

c* = arg min_c ||y − Ψ_I c||_2.

In each iteration, OMP selects the regressor most closely correlated with the current residual, while HEnt-OMP adds entropy as a penalty term on this basis: the regressor must not only be related to the residual but also have a large entropy value. Note that HEnt-PCE only needs to calculate the entropy of the basis functions once, at the beginning of feature selection, so the extra computational effort is almost negligible.
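The selection loop just described can be sketched as follows (Python/NumPy; the additive form of the entropy-penalized score, `entropy_fn`, and `gamma` are assumptions based on the description, not the authors' exact code):

```python
import numpy as np

def hent_omp(Psi, y, n_terms, gamma, entropy_fn):
    n, P = Psi.shape
    # Entropies of the basis functions are computed once, up front.
    H = np.array([entropy_fn(Psi[:, j]) for j in range(P)])
    I, r = [], y.copy()
    for _ in range(n_terms):
        # Score = correlation with the residual + gamma * entropy penalty.
        score = np.abs(Psi.T @ r) + gamma * H
        score[I] = -np.inf                  # do not reselect active indices
        I.append(int(np.argmax(score)))
        # Refit on the active set by least squares, then update the residual.
        c, *_ = np.linalg.lstsq(Psi[:, I], y, rcond=None)
        r = y - Psi[:, I] @ c
    return I, c
```

With gamma = 0 this reduces to plain OMP; increasing gamma biases selection toward high-entropy basis functions.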
It is worth noting that, in the methodology, the prefixes 'Ent-' and 'HEnt-' indicate the optimization routine, while the suffix 'PCE' stands for any sparse PCE algorithm; in the algorithms and experiments, the suffix is replaced by the specific classical sparse PCE algorithm being optimized. Compared with Algorithm 1, this method is more targeted and computes faster.
However, because feature-selection criteria differ among the classical methods, the selection of regularization parameters for each method also deserves attention.

Parameter optimization and selection
Before conducting experiments, we need to set some parameters for these algorithms. For example, in Algorithm 1 we are concerned with the effect of the parameter W on the final sparse result. Similarly, when using the hybrid entropy-based algorithms, γ is a parameter worthy of attention. The appropriate parameter choice differs among the various methods used in the entropy-based algorithms.
The selection of W is natural, because it can be set directly according to the desired size of the active basis. For the selection of γ, a small experimental design can be used in preliminary experiments before the formal experiment, so that γ reaches an appropriate value. The procedure is given as Algorithm 3, whose idea comes from the DIRECT algorithm [45,46]. The algorithm divides the search interval into k equal subintervals and evaluates the objective at the subinterval boundary points; it then takes the point with the lowest value as the new center and the subinterval length as the new interval length, establishing a new search space and repeating the process until the variation of the objective falls below a set tolerance ε.
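The shrinking-grid search just described can be sketched as follows (Python/NumPy; `eval_mse`, `k`, and the variance-based stopping rule are placeholders following the text, not the authors' implementation):

```python
import numpy as np

def select_gamma(eval_mse, gamma_lo, gamma_hi, k=8, eps=1e-6, max_iter=20):
    # eval_mse: callable gamma -> relative MSE of the surrogate built with gamma.
    for _ in range(max_iter):
        grid = np.linspace(gamma_lo, gamma_hi, k + 1)   # k equal subintervals
        errs = np.array([eval_mse(g) for g in grid])
        if np.var(errs) < eps:                 # stop once the grid is flat
            break
        best = grid[np.argmin(errs)]
        half = (gamma_hi - gamma_lo) / (2.0 * k)        # half a subinterval
        gamma_lo, gamma_hi = best - half, best + half   # recenter and shrink
    return grid[np.argmin(errs)]
```

Each pass shrinks the search interval to one subinterval centered on the current best point, so the bracket length decays geometrically.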

Experiment analysis
In the experiments, we analyze and compare the effectiveness of four classical sparse PCE algorithms (OMP [47], LARS [27], SP [48], and BCS [49]) and the two entropy-based optimized algorithms built on top of them.
As benchmarks, we consider the Ishigami function (d = 3), the OAKLEY & O'HAGAN function (d = 15), and a high-dimensional function (d = 54) to show the performance of the methods in low-dimensional and high-dimensional cases. Small random noise terms are added to the benchmark functions to better reflect practical conditions. Latin hypercube sampling (LHS) is used to generate sample points. Each experiment was repeated 50 times, and the mean values are reported to compare the performance of the methods. To further demonstrate the effectiveness of the proposed method, it is applied to a more realistic CFD application: the RAE2822 airfoil (d = 5). For all experiments, we use the general-purpose uncertainty quantification software UQLab [35].

Toy model
Before starting the formal experiments, we first consider a toy model to illustrate the problem addressed in this paper. The classical sparse methods consider only the values of the coefficients, not the impact of feature changes on uncertainty. We therefore constructed a function that combines smooth terms with a high-order feature:
f(x, y) = 6 sin(y) + 4x(cos(x) + 1) + 5x^6 + 10^{-10} · 9x · x^{22}. (18)

If classical methods such as OMP are used directly, the high-order features are discarded regardless of whether they contain important information. As shown in Table 1 and Fig. 1(a) and (b), optimizing with the two methods proposed in this paper brings the predicted probability distribution closer to the original distribution while keeping the relative MSE small. Meanwhile, Fig. 1(c) shows the retained basis functions; each label reads 'method (number of retained terms): index among all basis functions'. It can be seen that Ent-PCE sparsifies the result of OMP, so the index set of its selected features is a subset of the original result's index set, whereas HEnt-PCE accounts for information entropy at the time of feature selection, so some higher-order features originally discarded by OMP are retained.

Case 1: the Ishigami function
The Ishigami model is a non-linear, non-monotonic, smooth three-dimensional function:

f(x) = sin(x_1) + a sin^2(x_2) + b x_3^4 sin(x_1),

where x_1, x_2, and x_3 are three independent input random variables uniformly distributed on [−π, π], and typically a = 7 and b = 0.1. We apply the entropy-based re-optimization method and the hybrid entropy-based comprehensive method on top of the other sparse methods, with the results shown in Figs. 2, 3, 4, 5. In panels (a) and (b), the x-axis is the size of the DoE and the y-axis is the relative MSE and the level of sparsity (number of retained terms / number of terms in the complete PCE), respectively. Both optimization methods achieve a good degree of improvement in low dimensions: the relative MSE and the level of sparsity are both reduced compared with the classical method at the same DoE size. In other words, the entropy-based methods require fewer sample points to reach a given level of precision or sparsity. Furthermore, panel (c) shows the PDF using a DoE of 200 points; the optimized PDF is closer to the sample points, which is consistent with the theory above.
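For reference, the Ishigami function and a quick Monte Carlo check of its known mean E[f] = a/2 can be written as follows (Python/NumPy):

```python
import numpy as np

def ishigami(x, a=7.0, b=0.1):
    # x has shape (..., 3); inputs are i.i.d. uniform on [-pi, pi].
    x1, x2, x3 = x[..., 0], x[..., 1], x[..., 2]
    return np.sin(x1) + a * np.sin(x2) ** 2 + b * x3 ** 4 * np.sin(x1)

rng = np.random.default_rng(0)
xs = rng.uniform(-np.pi, np.pi, size=(200_000, 3))
mc_mean = ishigami(xs).mean()   # analytical mean is a/2 = 3.5
```

The odd sin(x_1) factors integrate to zero over the symmetric domain, leaving only the a sin^2(x_2) term's mean a/2, which makes this a convenient sanity check for any surrogate of the model.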

Case 2: the OAKLEY & O'HAGAN function
The OAKLEY & O'HAGAN function [50] is a high-dimensional function commonly used in uncertainty quantification and sensitivity analysis, expressed as

f(x) = a_1^T x + a_2^T sin(x) + a_3^T cos(x) + x^T M x.

The input random variables are independently distributed as x_i ∼ N(0, 1), i = 1, …, 15; a_1, a_2, a_3 are 1 × 15 vectors, M is a square matrix, and the data are taken directly from [50].
From the relative MSE results shown in panel (a) of Figs. 6, 7, 8, 9, both optimization methods improve on the original classical methods in the high-dimensional case. However, Ent-PCE yields only slight improvements over the existing results, while HEnt-PCE achieves better optimization. This is because Ent-PCE is a re-optimization method built on the classical sparse PCE: in the low-dimensional case, the features selected by the classical sparse algorithm may be sufficient to represent the overall features of the original model, but when the model dimension is high, the classical sparse method may discard high-dimensional features that affect the uncertainty, and these cannot be retrieved in the subsequent re-optimization. Therefore, HEnt-PCE performs much better than Ent-PCE in high-dimensional cases. Compared with HEnt-PCE, Ent-PCE controls sparsity better, as can be seen in panel (b). However, in higher-dimensional cases, Ent-PCE requires a large-scale OLS computation, which increases the computation time, so HEnt-PCE is preferred for higher-dimensional cases where strict sparsity is not required.

Case 3: the high-dimensional function
To demonstrate the performance of HEnt-PCE in higher-dimensional conditions, consider a high-dimensional analytical model from UQLab, with inputs X_i ∼ U([1, 2]) for i = 1, …, 19, and X_20 ∼ U([1, 3]).
Because the proposed methods further optimize an existing sparse PCE model, their performance is limited in high-dimensional cases. For the Ent-PCE models, since the high-order features have already been dumped by the sparse PCE model, there is little room for further optimization. As a result, we only compare HEnt-PCE with classical PCE, as shown in Figs. 10, 11, 12, 13. HEnt-PCE still outperforms the classical method in low-order settings.

Case 4: RAE2822 airfoil
The final application involves the RAE2822 airfoil, a supercritical airfoil, which is a challenging case used by various researchers [24,51] to test uncertainty quantification methods.
The reference values of the uncertainty factors in the test are shown in Table 2. All random variables are assumed to follow uniform distributions, with upper and lower bounds at the reference value ± 5%. A standard deviation is formed at each position because of the sensitivity of the RAE2822 airfoil model. The standard deviations of the classical sparse models are usually greater than the actual results, while the optimization can reduce the standard deviation and achieve better results by adjusting the parameters. However, the least-squares step used in Ent-PCE can make the standard deviation too small. Moreover, although the dimensionality is not high, one RAE2822 case requires building hundreds of PCE models (each position on the airfoil needs a model, the number of which depends on the grid density).

Conclusion
In this paper, an effective framework of sparse PCE is realized and verified. The main contribution of this paper is the development of two adaptive regression methods for optimizing sparse polynomial chaos expansion, Ent-PCE and HEnt-PCE. The former is essentially a follow-up optimization of results already trained by classical sparse methods, and can be easily implemented and extended to various sparse methods; the latter embeds a penalty term into any known sparse algorithm and allows important higher-order features that may be discarded by classical sparse methods to be retained.
The advantages of the proposed methods are twofold. First, the regularization approach of the classical sparse PCE model is retained, which improves accuracy and reduces sparsity while avoiding over-fitting. Second, by considering uncertainty propagation in the model, the distribution of the model input is passed through, so the output retains uncertainty to the greatest extent.
Furthermore, the selection of the regularization parameters is adapted to each method, because the standards for feature selection differ among classical sparse PCE models; to achieve the best optimization effect, a set of parameter selection rules was formed. For verification, several applications were considered. Two benchmark analysis functions were studied first. The results show that the two proposed algorithms both achieve a clear optimization effect. Among them, Ent-PCE is simpler and more universal in low-dimensional situations. In high-dimensional cases, since HEnt-PCE retains important higher-order features and does not need a secondary calculation, its results are more accurate and its calculation speed is faster. Additionally, the algorithms were analyzed and compared on the wind-tunnel application. The experimental results show that Ent-PCE has greater limitations but can be used independently and can easily optimize small-scale models, while HEnt-PCE is more efficient and achieves higher accuracy, making it an effective method worth promoting at the industrial level for establishing high-quality sparse PCE models.

Algorithm 2
An example of a hybrid entropy-based comprehensive sparse PCE algorithm based on the classical OMP method is Algorithm 2; the suffix indicates the classical sparse PCE algorithm being optimized.

Hybrid entropy-based orthogonal matching pursuit (HEnt-OMP)
Input: Ψ: a set of candidate basis functions; y: observation vector.
1: Initialize the feature index set I = ∅ and the residual vector r = y
2: Compute the entropy H(Ψ_i) of each candidate basis function
3: for each iteration i = 1, …, n do
4: Find the index i* that is closely related to the residual and has a large entropy value
5: Update the index set: I = I ∪ {i*}
6: Compute the coefficients using only the active indices by least squares:
7: c* = arg min_c ||y − Ψ_I c||_2
8: Update the residual: r = y − Ψ_I c*
9: end for
Output: index set I and coefficients c*

Algorithm 3
Parameter selection
Input: S: surrogate model with unknown γ; γ_k: the upper bound of γ
1: while the variance of the relative MSE is not less than ε do
2: Divide [γ_0, γ_k] into k equal subintervals of length l with boundary points γ_0, …, γ_k
3: for each i = 0, …, k do
4: Run the model S with γ_i and calculate the relative MSE e_i
5: end for
6: Find the index that minimizes the relative MSE: i = arg min_i e_i
7: Update the boundaries: γ_0 = γ_i − l/2, γ_k = γ_i + l/2
8: end while
Output: γ_i

Fig. 1 A two-dimensional toy model consisting of a combination of simple functions plus a high-order term with small coefficients. (a) The relative error of each y; (b) The probability density function of the entire response; (c) The coefficients corresponding to the retained basis functions with their indices

Fig. 2 Performance comparison of OMP, Ent-OMP and HEnt-OMP w.r.t. the Ishigami model. (a) The relationship between the relative MSE and the size of the DoE; (b) The relationship between the sparsity level of the methods and the size of the DoE; (c) Enlarged details of the reconstructed PDF

Fig. 7 Performance comparison of LARS, Ent-LARS and HEnt-LARS w.r.t. the OAKLEY & O'HAGAN function. (a) The relationship between the relative MSE and the size of the DoE; (b) The relationship between the sparsity level of the methods and the size of the DoE; (c) Enlarged details of the reconstructed PDF

Fig. 9 Performance comparison of BCS, Ent-BCS and HEnt-BCS w.r.t. the OAKLEY & O'HAGAN function. (a) The relationship between the relative MSE and the size of the DoE; (b) The relationship between the sparsity level of the methods and the size of the DoE; (c) Enlarged details of the reconstructed PDF

For the deterministic CFD solutions, the Spalart-Allmaras one-equation turbulence model is used for turbulence modeling, and the SU2 simulation software is used as a black-box solver. A 5th-order model is constructed with 20 samples using the different sparse PCE methods. The results (pressure coefficient) were obtained with the four classical sparse PCE methods and the two improved methods proposed above. The mean, the standard deviation, the CDF at one random point on the RAE2822 airfoil, and the relative error are shown in Fig. 14 (OMP), Fig. 15 (LARS), Fig. 16 (SP), and Fig. 17.

Fig. 14 Performance comparison of OMP, Ent-OMP and HEnt-OMP w.r.t. the RAE2822 airfoil

Table 1
Comparison of relative MSE and KL (Kullback-Leibler) divergence for the toy model

Table 2
Reference value of uncertainty factors in experiments