Transfer learning for deep neural network-based partial differential equations solving

Deep neural networks (DNNs) have recently shown great potential in solving partial differential equations (PDEs). The success of neural network-based surrogate models is attributed to their ability to learn a rich set of solution-related features. However, training DNNs usually requires many iterations to converge and a large amount of training data, which hinders the application of these models to complex physical contexts. To address this problem, we propose to apply the transfer learning approach to DNN-based PDE solving tasks. In our work, we create pairs of transfer experiments on Helmholtz and Navier-Stokes equations by constructing subtasks with different source terms and Reynolds numbers. We also conduct a series of experiments to investigate the degree of generality of the features between different equations. Our results demonstrate that, despite differences in the underlying PDE systems, the transfer methodology can lead to a significant improvement in the accuracy of the predicted solutions and achieve a maximum performance boost of 97.3% on widely used surrogate models.


Introduction
Neural networks (NNs) have been extensively studied as surrogate models in the field of physics simulations for many years [1,2]. Recent progress in deep learning offers a potential approach for the solution prediction of partial differential equations (PDEs) [3,4]. Based on the universal approximation properties of deep neural networks, pioneering works began to explore the possibility of building end-to-end solvers by composing hidden layers in a variety of network structures and activation functions [5][6][7][8]. These solvers consider the numerical simulation process as an unsupervised learning problem where the network model takes spatial and temporal coordinates as input and then predicts quantities of interest.
The physics-informed neural network (PINN) is one of the most commonly used DNN-based surrogate models [9,10]. During the optimization phase, PINN embeds the governing equations, as well as the initial/boundary conditions, in the loss function as penalty terms to guide the gradient descent direction. After suitable training, the network model is able to provide a nonlinear approximant for the underlying PDE system. However, existing PINN-based surrogate models for PDE solving suffer from some inherent drawbacks. On the one hand, these models may not reach the desired prediction accuracy with a limited amount of training data, and tend to yield inaccurate results, especially for complex nonlinear PDEs [11,12]. On the other hand, PINNs often require lengthy training to converge, which is computationally expensive on modern CPUs or GPUs [13]. Moreover, due to the lack of generalizability, the trained network models are only applicable to the problem at hand and need to be re-trained for new problems or equation settings. The extensive re-training overhead limits their usefulness in engineering and optimal design contexts.
Transfer learning can be a powerful tool to enable efficient convergence during training. Generally, the transfer methodology is to train a base network and then transfer its learned features to the target network [14,15]. By repurposing the knowledge and skills learned in previous tasks (domains), the network is able to produce a boost to prediction performance after fine-tuning to a new task (domain). Many studies have taken advantage of this fact to obtain state-of-the-art results in computer vision applications, such as image classification and object recognition [16][17][18]. However, the effect of transfer learning for DNN-based surrogate models has not been studied adequately.
In this paper, we aim not to optimize network designs, but rather to study the transfer ability of widely used surrogate models. We conduct a series of experiments on two PDE benchmarks using different models. Specifically, we first evaluate the transfer performance on PDEs with different equation settings. We also investigate the generality versus specificity of neurons in each layer of DNN-based surrogate models. Finally, we investigate the dependency of transfer results on the similarity of base and target tasks. The experimental results show that initializing with transferred features can improve prediction accuracy after fine-tuning on a new task, which could be a generally useful technique for improving the prediction performance of surrogate models.
The rest of the paper is organized as follows. In Section 2, we present a brief overview of DNN-based surrogate models and introduce the methodology of transfer learning. Then, we conduct the transfer learning experiments. The prediction results are discussed in Section 3. In Section 4, we conclude the paper with a summary.

Deep neural network-based surrogate model
With the advances in neural network theory and learning algorithms, there has been much interest in integrating deep neural networks into traditional physics contexts [2,19]. Recently, pioneering works began to solve nonlinear PDEs by exploiting the nonlinear approximation capability of deep neural networks. Raissi et al. [9,10] first introduced the physics-informed neural network (PINN) for inferring the latent solution u(x, t) to a PDE system of the general form:
$$\mathcal{D}[u(x, t)] = 0, \quad x \in \Omega,$$
where the spatial domain $\Omega \subset \mathbb{R}^d$ and $\mathcal{D}$ is the differential operator. The PDE system is subject to the boundary condition
$$\mathcal{B}[u(x, t)] = h(x, t), \quad x \in \partial\Omega,$$
where $\partial\Omega$ is the domain boundary, $\mathcal{B}$ is the differential operator for boundary conditions, and $h(x, t): \mathbb{R}^{d+1} \to \mathbb{R}$ is the given function.
In PINN, the solving process is formulated as a nonconvex optimization problem where an appropriate loss function is designed to optimize the predicted solution. Specifically, the governing equations, as well as the initial and boundary conditions, are embedded in the loss function as penalty terms to guide the gradient descent direction:
$$\mathcal{L} = \left\| \mathcal{D}[u(x_r, t)] \right\|_2^2 + \left\| \mathcal{B}[u(x_b, t)] - h(x_b, t) \right\|_2^2,$$
where $x_r \in \Omega$ and $x_b \in \partial\Omega$. By minimizing this composite loss function, the trained network model is able to encode the underlying physical laws and work as a function approximator to provide the prediction u(x, t). However, despite its simplicity, the original formulation of PINN often has difficulty satisfying all equation residuals, which leads to slow convergence or inaccurate predictions for many physical problems governed by complex nonlinear PDEs.
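The composite loss described above can be illustrated on a toy problem. The following is a minimal sketch, not the formulation used in the experiments: it assumes a one-dimensional operator D[u] = u''(x) + π² sin(πx) on (0, 1) with u(0) = u(1) = 0, replaces the network by a plain Python callable, and uses central finite differences as a stand-in for automatic differentiation. All function names are hypothetical.

```python
import math

def second_derivative(u, x, h=1e-3):
    """Central finite-difference estimate of u''(x) (stand-in for autodiff)."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)

def pde_residual(u, x):
    """Residual of the assumed toy operator D[u] = u''(x) + pi^2 sin(pi x)."""
    return second_derivative(u, x) + math.pi ** 2 * math.sin(math.pi * x)

def composite_loss(u, interior_points, boundary_points, h_boundary=lambda x: 0.0):
    """Mean squared PDE residual plus mean squared boundary mismatch."""
    loss_r = sum(pde_residual(u, x) ** 2 for x in interior_points) / len(interior_points)
    loss_b = sum((u(x) - h_boundary(x)) ** 2 for x in boundary_points) / len(boundary_points)
    return loss_r + loss_b

# Collocation points: x_r inside the domain, x_b on its boundary.
x_r = [(i + 0.5) / 20 for i in range(20)]
x_b = [0.0, 1.0]

exact = lambda x: math.sin(math.pi * x)   # satisfies both the PDE and the boundary condition
zero = lambda x: 0.0                      # satisfies only the boundary condition

loss_exact = composite_loss(exact, x_r, x_b)
loss_zero = composite_loss(zero, x_r, x_b)
```

The exact solution drives both penalty terms to (numerically) zero, while the zero function is penalized only through the PDE residual term. This illustrates how the two terms compete during training.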
To alleviate the unstable approximation of the original PINN, Wang et al. [11] studied the stiffness in the gradient flow dynamics of PINN and investigated the imbalance of the back-propagated gradients during gradient descent optimization. They proposed an adaptive learning rate algorithm along with a novel fully-connected neural architecture to address these deficiencies, resulting in two PINN-based variants: PINN-anneal and GP-PINN. They tested their network models across a range of problems in computational physics. The results show that their methods outperform the original PINN and improve the accuracy by a factor of 50-100x. Following this line of research, many other similar surrogate solvers have been developed, such as DGM [20], hp-VPINNs [12], and NSFnets [13]. These models provide a series of promising results in many scientific applications such as computational fluid dynamics, electromagnetism, solid-state physics, and quantum physics [4].

Transfer learning
In computer vision, modern deep neural networks tend to learn first-layer features that resemble either Gabor filters or color blobs, and last-layer features that depend greatly on the underlying task [21,22]. Based on this observation, transfer learning approaches are employed to transfer knowledge between the relevant base and target tasks to achieve efficient training convergence. As shown in Fig. 1, the usual transfer approach is to train a base network on a large dataset, and then transfer the learned features to a target network to be trained on a target dataset.
Yosinski et al. [14] experimentally studied the generality versus specificity of learned features in each network layer for different training tasks. Their results show that the common (first-layer) features are not specific to a particular image dataset or training set, but general in that they are applicable to different datasets. They also showed that the transfer process tends to work well if the transferred features are not specific to the base task, but general to both base and target tasks. Many studies in image and natural language processing have taken advantage of this phenomenon to obtain state-of-the-art results. Examples of transfer learning include coping with image data with different lighting or background in similar tasks [16], overcoming the deficit of training samples for natural language domains [23,24], and improving the classification or regression performance on unsupervised pseudo-tasks [15,17].
Fig. 1 Overview of the transfer learning approach. Instead of initializing the entire network randomly, the usual transfer approach is to train a base network on a large dataset and then copy (transfer) its first n layers to the first n layers of a target network. The remaining layers are randomly initialized and trained toward the target task.
Oquab et al. [21] studied the ability of a system to recognize and apply knowledge and skills learned in previous domains to novel domains. Their results collectively show that, by transferring image representations learned in previous domains, the pre-trained network produces a boost to prediction performance after fine-tuning to a new task, even with very different training objectives. Similar work was done by Sermanet et al. [25]. They showed that neural networks do indeed compute features that are fairly general, even across different data sources. The above results further emphasize the importance of studying the nature and extent of the generality of learned features in different applications.
Overall, transfer learning can be a useful technique for improving the classification or regression performance in neural network training. However, the transfer ability of DNN-based surrogate models has not yet been evaluated experimentally. Inspired by the tremendous success of transfer experiments in the areas of computer vision and image recognition, we apply the transfer learning approach to deep neural network-based surrogate models to investigate their transfer performance on different PDE systems.

Experimental results
In this study, we evaluate the performance of transfer learning on neural network-based PDE solving tasks. In Section 3.1, we train DNN-based surrogate models on the Helmholtz equation and investigate the transfer performance in cases with different source terms. In Section 3.2, we conduct the transfer learning experiments on the Navier-Stokes equation with different Reynolds numbers. Finally, the transfer ability between two different equations is evaluated in Section 3.3.
For all experiments, we employ three widely used network models for PDE solving: PINN, PINN with annealing (PINN-anneal), and GP-PINN [9][10][11]. We follow the network designs in these works and employ the original three-layer architectures in our solving tasks. Figure 2 shows an example of the PINN architecture. Because the underlying networks are relatively shallow, we do not consider the frozen-layer cases in this paper. In each run, we train the networks for 40000 epochs using the Adam optimizer in TensorFlow 1.15.0 [26]. The initial learning rate is set to 1e-03 and is multiplied by a decay rate of 0.95 every 1000 epochs. We use 128 random points per epoch as the training set and 10000 uniform points as the test set at the end of the training.
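The learning-rate schedule described above can be sketched in a few lines. This is an illustration only: it assumes a staircase decay (the rate drops in discrete steps every 1000 epochs rather than continuously), which the text does not specify, and the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay_rate=0.95, decay_steps=1000):
    """Staircase exponential decay (assumed): multiply by decay_rate every decay_steps epochs."""
    return initial_lr * decay_rate ** (epoch // decay_steps)

# Learning rate at the start, just before and after the first decay, and at the final epoch.
checkpoints = {epoch: learning_rate(epoch) for epoch in (0, 999, 1000, 40000)}
```

Under this assumption the learning rate at the end of the 40000 epochs is 1e-03 × 0.95^40, roughly an order of magnitude below its initial value.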

Helmholtz equation
We first employ a classical benchmark, the two-dimensional Helmholtz equation [27], to investigate the performance of the transfer methodology. The Helmholtz equation is one of the fundamental PDEs in many scientific fields such as acoustics, electromagnetism, and elastic mechanics. It takes the form:
$$\Delta u(x, y) + k^2 u(x, y) = q(x, y), \quad (x, y) \in \Omega,$$
$$u(x, y) = h(x, y), \quad (x, y) \in \partial\Omega,$$
where $\Delta$ is the Laplace operator, $\Omega = [0, 1] \times [0, 1]$, and q(x, y) is a source term given by:
$$q(x, y) = -(\alpha_1 \pi)^2 \sin(\alpha_1 \pi x) \sin(\alpha_2 \pi y) - (\alpha_2 \pi)^2 \sin(\alpha_1 \pi x) \sin(\alpha_2 \pi y) + k^2 \sin(\alpha_1 \pi x) \sin(\alpha_2 \pi y).$$
When k = 1, h(x, y) is computed by the analytical solution given by:
$$u(x, y) = \sin(\alpha_1 \pi x) \sin(\alpha_2 \pi y).$$
During the training, we construct seven cases with different source terms for the Helmholtz benchmarks: (1) α1 = 1, α2 = 4; (2) α1 = 1, α2 = 6; (3) α1 = 2, α2 = 3; (4) α1 = 2, α2 = 4; (5) α1 = 3, α2 = 5; (6) α1 = 4, α2 = 5; (7) α1 = 4, α2 = 6. To create the base and target subtasks, we set case 1 (α1 = 1, α2 = 4) as the base subtask and the other six cases as target subtasks. Specifically, we first train all the subtasks using random initialization and obtain the original prediction results of the six target subtasks. Then, we re-train the six target subtasks by repurposing the learned features of the base subtask to investigate the feature transfer ability on the Helmholtz benchmarks. When transferring to the other cases, we would expect the DNN-based solvers trained on top of the base subtask to perform better than those trained from random initialization, since the underlying solution spaces of the various settings are similar.
We summarize the prediction accuracy of different network models in Tables 1, 2 and 3. The prediction error is measured in the relative L2 norm, which is defined as:
$$\epsilon = \frac{\left\| u_{pred} - u_{ref} \right\|_2}{\left\| u_{ref} \right\|_2},$$
where $u_{ref}$ and $u_{pred}$ represent the reference/analytical solution and the predicted solution obtained by the surrogate models, respectively. The experimental results in Tables 1, 2 and 3 show that transferring the base network variables and then fine-tuning them yields better performance than training directly on the target subtask.
Compared with the original training, the transfer learning approaches achieve a maximum performance boost of 82.3% for PINN, 97.3% for PINN-anneal, and 55.6% for GP-PINN. Taking PINN-anneal as an example, the network gives a relatively poor prediction (an L2 error of 1.87e+00) in case 1. In contrast, the transfer model outperforms the original model by a large margin and achieves an L2 error of 4.99e-02.
To better compare the performance boost in each target subtask, we also visualize the results of the transfer learning experiments on the Helmholtz equation in Figs. 3 and 4. We can observe that repurposing the network variables of the base subtask is robust for physics-informed neural networks. Compared with the original training results depicted in the second column, we can clearly see the advantage of the transfer: the predicted solution is more accurate in all cases. This is evidence that network models provide a means to learn rich high-dimensional operator features that are transferable to a variety of similar DNN-based PDE solving tasks. Moreover, we add an example to investigate whether transfer learning is likely to lead to worse results when the prediction accuracy of the base task is poor. In this example, we take case 7 (Fig. 4c) as the base task and case 3 (Fig. 3b) as the target task. The experimental results demonstrate that transferring features even from a base task with poor performance can be better than using random features. Taking PINN as an example, the original network achieves an L2 error of 1.37e-01 in case 3, while it yields 8.46e-02 using transfer learning.
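The benchmark setup and the error metric can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the analytical solution to be u(x, y) = sin(α1πx) sin(α2πy), derives the source term from it, verifies with finite differences that the pair satisfies the Helmholtz equation, and evaluates the relative L2 error on a small grid. The perturbed "prediction" is synthetic and all function names are hypothetical.

```python
import math

def u_exact(x, y, a1, a2):
    """Assumed analytical solution u = sin(a1*pi*x) * sin(a2*pi*y)."""
    return math.sin(a1 * math.pi * x) * math.sin(a2 * math.pi * y)

def source_term(x, y, a1, a2, k=1.0):
    """Source term q = Laplacian(u) + k^2 * u for the assumed solution."""
    return (k ** 2 - (a1 * math.pi) ** 2 - (a2 * math.pi) ** 2) * u_exact(x, y, a1, a2)

def helmholtz_residual(x, y, a1, a2, k=1.0, h=1e-4):
    """Finite-difference residual of Laplacian(u) + k^2 * u - q at one point."""
    u = lambda s, t: u_exact(s, t, a1, a2)
    laplacian = (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h) - 4.0 * u(x, y)) / (h * h)
    return laplacian + k ** 2 * u(x, y) - source_term(x, y, a1, a2, k)

def relative_l2_error(u_pred, u_ref):
    """Relative L2 error ||u_pred - u_ref||_2 / ||u_ref||_2 over sampled points."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(u_pred, u_ref)))
    den = math.sqrt(sum(r ** 2 for r in u_ref))
    return num / den

# Case 1 (alpha_1 = 1, alpha_2 = 4) on an 11 x 11 uniform grid over [0, 1] x [0, 1].
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
u_ref = [u_exact(x, y, 1, 4) for x, y in grid]
u_pred = [1.01 * value for value in u_ref]   # synthetic prediction, 1% too large everywhere
err = relative_l2_error(u_pred, u_ref)
residual = helmholtz_residual(0.3, 0.7, 1, 4)
```

A prediction that is uniformly 1% too large yields a relative L2 error of about 1e-02, which gives a sense of scale for the errors reported in the tables.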
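The performance boost is not defined explicitly in the text. A plausible reading, assumed here, is the relative reduction of the L2 error, which reproduces the 97.3% quoted for PINN-anneal from the two errors given above. The function name is hypothetical.

```python
def performance_boost(original_error, transfer_error):
    """Relative error reduction in percent (assumed definition of the boost)."""
    return 100.0 * (original_error - transfer_error) / original_error

# PINN-anneal errors quoted in the text: 1.87e+00 (original) and 4.99e-02 (transfer).
boost_pinn_anneal = performance_boost(1.87e+00, 4.99e-02)   # about 97.3
```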

Navier-Stokes equation
We now analyze the transfer performance of different models on the two-dimensional lid-driven cavity benchmarks (see Fig. 5). The steady-state flow in this problem is governed by the Navier-Stokes equation [19], which is defined as:
$$\mathbf{u} \cdot \nabla \mathbf{u} + \nabla p - \frac{1}{Re} \Delta \mathbf{u} = 0 \quad \text{in } \Omega,$$
$$\nabla \cdot \mathbf{u} = 0 \quad \text{in } \Omega,$$
where $\Omega = [0, 1] \times [0, 1]$, Re is the Reynolds number of the flow, u(x, y) is a velocity vector field, p(x, y) is a scalar pressure field, and v(x, y) denotes the velocity component in the y direction. We create six subtasks with Reynolds numbers Re = 10, 100, 200, 300, 400, 500 to comprehensively study the transfer performance of surrogate models for solving the Navier-Stokes equations. Specifically, we set the second subtask (Re = 100) as the base subtask and the other five subtasks as target subtasks. During the training, we use the vorticity-velocity (VV) formulation [11,13] to construct the loss function and compare the prediction results with the reference solution obtained by OpenFOAM [28].
The original and transfer performance of PINN for the Navier-Stokes equations is shown in Table 4. For all Reynolds numbers, the network benefits from transfer learning. In all subtasks, the transfer results outperform the original results, with an average performance boost of 22.4%. Similar results can be seen in Tables 5 and 6. Taking GP-PINN (Table 6) as an example, the original network achieves an L2 error of 2.47e-01 for Re = 200, while it yields 1.42e-01 using transfer learning. When Re = 400, GP-PINN obtains an L2 error of 2.94e-01 in the transfer case, and the performance boost is 45.5%. Over all cases, the maximum performance boosts of PINN-anneal and GP-PINN are 48.1% and 50.0%, respectively.
In transfer learning tasks, the number of transfer layers has a significant impact on the prediction accuracy of surrogate models. Here, we conduct a series of experiments to evaluate the transferability of features across different subtasks. For this purpose, we test the performance boost when using transfer variables for the first n (n ∈ [1, 4]) layers for various Reynolds numbers and models. The experimental results are depicted in Fig. 6.
From Fig. 6, we find that initializing the underlying network with transferred features from almost any number of layers can produce a boost to the prediction accuracy of surrogate models after fine-tuning to a new Reynolds number. In most cases, transferring all the layers achieves the best performance. Although the maximum boost may occur when n = 1 or 3, copying all the layers of the base subtask to the target training can still lead to considerable gains.
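The layer-wise transfer procedure can be sketched as follows. This is a minimal illustration, not the experimental code: layers are plain dictionaries of weights, the layer sizes and the uniform initializer are hypothetical, and a randomly initialized network stands in for the trained base network. After the copy, all layers of the target network would be fine-tuned, since no layers are frozen in these experiments.

```python
import copy
import random

def init_layer(n_in, n_out, rng):
    """Randomly initialize one dense layer (uniform range chosen for illustration)."""
    return {
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)],
        "b": [0.0] * n_out,
    }

def init_network(layer_sizes, seed):
    """Build a list of dense layers from consecutive pairs of layer sizes."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def transfer_first_n(base_layers, layer_sizes, n, seed):
    """Copy the first n layers from the base network; the rest stay randomly initialized."""
    target = init_network(layer_sizes, seed)
    for i in range(n):
        target[i] = copy.deepcopy(base_layers[i])
    return target

# Hypothetical sizes: inputs (x, y), three hidden layers of width 50, one output.
layer_sizes = [2, 50, 50, 50, 1]
base = init_network(layer_sizes, seed=0)   # stands in for a trained base network
target = transfer_first_n(base, layer_sizes, n=2, seed=1)
```

The deep copy keeps the base network unchanged while the target network is fine-tuned, so the same base weights can be reused for several target subtasks.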

Transfer learning between different equations
In the last two sections, we conducted a series of transfer experiments by varying the Reynolds number or source terms of a PDE system. However, it is worth noting that there is a strong correlation between the above subtasks since the underlying solution spaces are similar. In order to study the transfer ability across dissimilar equations, we conduct experiments by transferring the features learned from the Navier-Stokes equation (Re = 100) to eight Helmholtz benchmarks. The eight target subtasks are: (1) α1 = 1, α2 = 4; (2) α1 = 1, α2 = 6; (3) α1 = 2, α2 = 3; (4) α1 = 2, α2 = 4; (5) α1 = 3, α2 = 5; (6) α1 = 4, α2 = 5; (7) α1 = 4, α2 = 6; (8) α1 = 5, α2 = 6. As before, we test the prediction error using three different network models; the experimental results are shown in Tables 7, 8 and 9. We can see that transfer learning between different equations can still yield performance gains. Compared to random, untrained weights, PINN achieves an average performance boost of 50.6%. For PINN-anneal, the transfer learning approach delivers relatively large performance gains in
To better analyze the transfer ability between different equations, we also plot the convergence of GP-PINN on cases 6 and 7 (see Fig. 7). From the loss curves, we can observe that applying transfer learning is robust for PDE solution prediction. During the training phase, the value of the loss terms decreases as the learning rate decays over the epochs. However, the transferred model does a better job of fitting the governing equation and boundary conditions on the Helmholtz benchmarks. Compared with training from random initialization, we can clearly see the advantage of transfer training: it helps accelerate training convergence and yields more accurate results.
Moreover, by comparing the prediction results obtained in Section 3.1, we find that the effectiveness of feature transfer declines as the base and target tasks become less similar. It is evident that the networks are able to achieve greater performance gains by using similar features (the same equation with different source terms) than by transferring variables obtained from another PDE system. However, transferring features even from different equations can be better than using random features.
Different choices of hyperparameters can significantly affect the transfer performance on target subtasks. To establish a training methodology for DNN-based surrogate solvers, we analyze the effects of hyperparameters on the prediction accuracy of GP-PINN. The initial learning rate and the decay rate are two of the most crucial hyperparameters for transfer learning. Together, they determine the step size that the optimization algorithm takes along the gradient direction. We test the effect of different learning rates and decay rates on the prediction accuracy for the Navier-Stokes equation (Re = 200) using GP-PINN.
The results in Fig. 8(a) show that an overly large step size may have trouble converging, while a small step size may get stuck in a local minimum and provide suboptimal solutions. In our cases, GP-PINN achieves the best predictive performance when the initial learning rate is 1e-03. As for the decay rate, the results suggest that a relatively small decay rate is required to achieve better performance, and the highest accuracy is achieved when the decay rate is 0.8.

Conclusion
Motivated by the tremendous success of transferred image representation in the areas of computer vision, we apply the transfer learning approach to neural network-based surrogate models to investigate the transfer ability on different PDE systems. Specifically, we analyze the transfer performance and show significant prediction improvements on the PDEs with different source terms and Reynolds numbers. We also quantify the degree of generality of features learned in each network layer. Moreover, we document that the transferability gap grows as the distance between the base and target tasks increases, but that transferring features even from distant tasks can be better than using random features. In future work, we will focus on applying the transfer methodology to more complex tasks in real-time analysis and optimization design applications. It is also interesting to study the feature transition from general to specific in more detail.