An Implementation of a Parallel Finite Element Method


Petr Gladkikh, January 2002, University of Avignon

Introduction

Parallelization of the Finite Element Method (FEM) is important because there is an existing demand for solving large-scale problems that are out of reach of single-processor computers. Many approaches to this problem are offered today. Each of them has its own strengths and weaknesses, and thus a particular area of reasonable application. One of the most important quality criteria of a parallelization method is speedup, which is usually defined as follows:

(1) S(n) = T(1) / T(n),

where

S(n) - speedup on an n-processor computer;

T(n) - execution time of the given program on an n-processor computer.

The "dream speedup" is S(n) = n; it is the highest performance boost that is possible to obtain on an n-processor system. The real S(n) is less than n due to synchronization and communication expenses between program processes running on different processor nodes.

Instead of S(n) we can consider the "efficiency of parallelization": E(n) = S(n) / n.


This characteristic determines how well the method works on large-scale computers with a big number of processors. For example, if a program runs in 100 s on one processor and in 25 s on eight, then S(8) = 4 and E(8) = 0.5.

Below we describe a parallelization method and its preliminary implementation, which is likely to give good speedup for non-linear problems with many degrees of freedom per node. Discretization of such problems leads to the necessity of building relatively large local matrices. For example, for the Navier-Stokes equation system of compressible flow modeling we have 20 degrees of freedom per element, and thus we need to calculate the values of a 20x20 matrix for each element (4 nodes per tetrahedral volume element and 5 degrees of freedom per node)1. Moreover, in this case we have a non-linear system, which in abstract form can be expressed as A(U) U = F, so we need to implement some kind of iterative process to obtain the solution at the next time step. These iterative processes rely on a linearization procedure for solving the non-linear equation and require recalculating A(U^k) for each iteration number k. In this case we find that most of the computation time is consumed by the procedure of recalculating A(U^k). More precisely, in a FEM implementation this means that we need to recalculate the global matrices, and this procedure can consume up to 90% of processor (CPU) time. Finally, if we can split this calculation into a set of relatively independent procedures, we can obtain good speedup for non-linear, multi-equation problems.

Let us give a formal description of this kind of parallelization.

Formal description

FEM formulation

Consider a non-stationary problem, and let us suppose we need to obtain an approximate solution of the following problem:

(2) ∂u/∂t − ∇·(μ∇u) = f in Ω,

where u(x, 0) = u₀(x) (initial conditions),

u = g on Γ_D (Dirichlet boundary conditions),

μ ∂u/∂n = q on Γ_N (Neumann boundary conditions),

Γ = Γ_D ∪ Γ_N is the boundary of Ω,

n is the external unit vector normal to Γ,

x are the spatial coordinates,

and the solution is sought in a function space H.

More precisely, using Galerkin's weighted residual method we formulate this task as follows: find a function u ∈ H which satisfies the following equation:

(3) ∫_Ω R(u) δu dΩ = 0 for every variation δu,

where R(u) = ∂u/∂t − ∇·(μ∇u) − f is the residue of (2);

δu is the first variation of u;

H is a closed space of all functions whose support is in Ω and which have the continuous partial derivatives required by (3). This choice of space provides correctness (solution uniqueness) of the problem formulation.

Doing so, we will find the closest approximation of the exact solution u in the chosen space. This means that the norm induced by the scalar product, i.e. the distance ‖U − u‖ between the approximate solution U and the exact solution u, will be minimal. Another interpretation of this fact is: "U is the orthogonal projection of u onto the approximation space".

Domain discretization. In order to be able to perform numerical calculations for the problem above, we need to substitute H with a finite-dimensional space H_h ⊂ H. Then we can express each function in H_h as a linear combination of its basis functions, and formulation (3) becomes:

(3') ∫_Ω R(U) φ_i dΩ = 0, i = 1, …, N,

where {φ_i} is a basis of H_h,

Ω_i is the support of φ_i, with Ω = ∪_i Ω_i.

In order to reduce the amount of necessary calculations it is preferable for {φ_i} to be an orthogonal system. In particular, this is fulfilled if the supports of different φ_i have intersections with zero measure. This is the finite element formulation, and below we consider only this case:

meas(Ω_i ∩ Ω_j) = 0, i ≠ j.

Assume that ∂U/∂t has some given value (a semi-implicit method will be used to advance the solution along t), and let us write the concrete formulation of problem (3') for equation (2) in the finite-dimensional space, considering that {φ_i} is an orthogonal, independent system:

(3'') ∫_Ω (∂U/∂t − ∇·(μ∇U) − f) φ_i dΩ = 0, i = 1, …, N.

Analytical discretization. Now, considering that all functions under the integral are in the same space H_h with basis {φ_i}, we express them through their coordinates in this very basis. To do so we use Green's theorem (it allows us to reduce the requirements imposed on the functions under the integral):

∫_Ω ∇·(μ∇U) φ_i dΩ = ∫_Γ μ (∂U/∂n) φ_i dΓ − ∫_Ω μ ∇U·∇φ_i dΩ,

and choosing φ_i = 0 on Γ_D, the boundary integral reduces to the Neumann part, so we can write

∫_Ω (∂U/∂t) φ_i dΩ + ∫_Ω μ ∇U·∇φ_i dΩ = ∫_Ω f φ_i dΩ + ∫_{Γ_N} q φ_i dΓ.

Now define the basis on an element as the linear Lagrange polynomial basis, where the i-th basis function equals one at the i-th corner of the element and zero at the other corners.

The desired function can be expressed element-wise as follows:

U(x, t) = Σ_j U_j(t) φ_j(x),

where U_j are the nodal values of the function; at the same time the U_j are the coordinates of U in the basis {φ_j}.

The time derivative of this function can be expressed as ∂U/∂t = Σ_j (dU_j/dt) φ_j. Making the assumption that the variation of U is in the same space as U, that is δU ∈ H_h, we can explicitly express integral (3'').

Substituting these expansions into (3'') gives

Σ_j (dU_j/dt) ∫_Ω φ_j φ_i dΩ + Σ_j U_j ∫_Ω μ ∇φ_j·∇φ_i dΩ = ∫_Ω f φ_i dΩ + ∫_{Γ_N} q φ_i dΓ.

Now consider the parts of the latter equation:

M_ij = ∫_Ω φ_i φ_j dΩ — this integral is the global mass matrix;

K_ij = ∫_Ω μ ∇φ_i·∇φ_j dΩ — this integral is the global stiffness matrix;

F_i = ∫_Ω f φ_i dΩ + ∫_{Γ_N} q φ_i dΓ — this integral is the global loading vector.

For numerical integration the traditional scheme is used: all necessary integrals are calculated on a reference (standard) element, and then a translation of coordinates is performed. For example, an approximate calculation of the stiffness matrix can be performed as follows:

K^e_ij ≈ Σ_k w_k μ(x(ξ_k)) ∇φ_i(ξ_k)·∇φ_j(ξ_k) |J(ξ_k)|,

where J = ∂x/∂ξ is the Jacobian of the coordinate transformation x(ξ);

x are the actual coordinates; ξ are the reference element coordinates; w_k are the quadrature weights.


Now, using the obtained discretization, we can formulate an approximate solution scheme along time. After FEM discretization we obtain the following differential system:

(4) M dU/dt + K U = F,

where M is the global mass matrix;

F is the global loading vector;

K is the global stiffness matrix.

Assume that U⁰ is obtained from the initial conditions. Using the derivative approximation in the form

dU/dt ≈ (U^{n+1} − U^n) / Δt,

we can obtain the following approximation of (4):

(5) U^{n+1} = U^n + Δt M^{-1} (F^n − K U^n).

This is an explicit scheme; here we assume that K and F are evaluated at the current time level n.

For a semi-implicit scheme, formula (5) changes so that K and F are evaluated at the (unknown) next time level:

(4') U^{n+1,k+1} = U^n + Δt M^{-1} (F^{n+1,k} − K^{n+1,k} U^{n+1,k}).

This is an iterative process (over k) which converges to the exact solution of the implicit equations.

For non-linear systems, in order to obtain acceptable precision, we need to implement more complex schemes (implicit, semi-implicit and/or schemes of higher order). One of these is described below in the "Semi-implicit scheme" section.

Mass matrix diagonalization

One of the key features of the implemented method is the use of the mass matrix in diagonal form. This simplifies the procedure of solving the SLE (system of linear equations) as much as possible. Usually discretization of the initial problem gives a non-diagonal mass matrix. There are several ways to make it diagonal; in the method being described the following approach is used.

For element-wise numerical integration the following usual formula is used:

(*) ∫_{Ω_e} g dΩ ≈ Σ_{i∈I_e} c_i g(ξ_i),

where Ω_e is an element of the domain;

c_i is a linear combination coefficient (quadrature weight);

ξ_i is a quadrature node;

I_e is the index set of the quadrature nodes of element Ω_e.

The solution is sought in the space of piecewise linear functions. The basis of this space is chosen as a set of Lagrange polynomials whose set of nodes is the same as the set of element nodes. Therefore, considering (*), we have

(**) ∫_{Ω_e} g dΩ ≈ Σ_{i∈I_e} c_i g(x_i),

where x_i is the coordinate of corner node i of element Ω_e.

Let us denote the basis of H_h as {φ_i}. Due to the manner in which it was chosen, this basis has the following property: (***) φ_i(x_j) = δ_ij.

Now consider how we can calculate the mass matrix integral on a local element. This integral has the form

M^e_ij = ∫_{Ω_e} ρ φ_i φ_j dΩ,

where ρ is a space-dependent coefficient (in complex cases it can be in matrix form).

We now have a function of degree 2 (or higher, depending on the degree of ρ) under the integral. And because {φ_i} is not an orthogonal basis, the matrix (φ_i φ_j) is non-diagonal, as well as its integral. But if we calculate this integral approximately using (**) and (***), at each corner node x_k we need to evaluate the function

ρ(x_k) φ_i(x_k) φ_j(x_k) = ρ(x_k) δ_ik δ_jk,

i.e. a square matrix whose only non-zero element is the k-th diagonal element, equal to ρ(x_k).

Doing so, we obtain a diagonal mass matrix, M^e_ii ≈ c_i ρ(x_i). The benefit of this matrix form is (as mentioned above) the extremely simple procedure of solving an SLE with this matrix. At the same time, the drawback is a less accurate integration result. In order to compensate for this and obtain the requested precision we eventually need a denser mesh, which in its turn increases calculation expenses.
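For illustration, here is a minimal sketch of this lumping for linear tetrahedra, where the nodal quadrature weights c_i of (**) equal V/4; all names here are illustrative rather than the solver's actual API:

#include <array>
#include <cmath>
#include <vector>

struct Node { double x, y, z; };

// Volume of a tetrahedron spanned by four corner nodes.
double tet_volume(const Node& a, const Node& b, const Node& c, const Node& d) {
    const double ux = b.x - a.x, uy = b.y - a.y, uz = b.z - a.z;
    const double vx = c.x - a.x, vy = c.y - a.y, vz = c.z - a.z;
    const double wx = d.x - a.x, wy = d.y - a.y, wz = d.z - a.z;
    const double det = ux * (vy * wz - vz * wy)
                     - uy * (vx * wz - vz * wx)
                     + uz * (vx * wy - vy * wx);
    return std::fabs(det) / 6.0;
}

// Nodal quadrature (**) plus the Lagrange property (***): each corner node
// of an element contributes only to its own diagonal entry of M.
void assemble_lumped_mass(const std::vector<Node>& nodes,
                          const std::vector<std::array<int, 4>>& tets,
                          double (*rho)(const Node&),   // space-dependent coefficient
                          std::vector<double>& mass_diag) {
    mass_diag.assign(nodes.size(), 0.0);
    for (const auto& t : tets) {
        const double c = tet_volume(nodes[t[0]], nodes[t[1]],
                                    nodes[t[2]], nodes[t[3]]) / 4.0;
        for (int i = 0; i < 4; ++i)
            mass_diag[t[i]] += c * rho(nodes[t[i]]);  // only diagonal entries
    }
}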

Global matrix decomposition

In a few words, the parallelization procedure is based on the additivity of the global matrix assembly procedure.

Now let us look at how we can decompose the building of the global mass (M) and stiffness (K) matrices.

Let us assume that we are searching for an approximate solution in a vector space L. Split L into a set of subspaces as follows:

L = L_1 ⊕ L_2 ⊕ … ⊕ L_n.

Then define the orthogonal projection operator P_S: L → S, where S is an arbitrary linear subspace of L. This operator is linear, so we can write the following:

(2) v = Σ_{k=1..n} P_{L_k} v for every v ∈ L.

Formally, in a space with a finite number of dimensions, we can write P_S in at least two ways: as a matrix multiplier or through the scalar product. The matrix representation has the form

P_S = diag(p_1, …, p_N), where p_i = 1 if φ_i ∈ S and p_i = 0 otherwise

— a diagonal matrix. In the form of the scalar product:

P_S v = Σ_{i: φ_i ∈ S} (v, φ_i) φ_i,

where (·, ·) is the scalar product of the space. For example, if we have a basis {φ_1, …, φ_N} of the whole space, we can select the linear span of any subset of the basis functions using this operator (in matrix form). In numerical computations this operator is never built explicitly.


We consider the special case when the mass matrix is diagonal. Considering (2) and (5), we can build the following decomposition of (4):

(6) M = Σ_k M_k, U^{n+1} = U^n + Δt M^{-1} Σ_k (F_k − K_k U^n),

where M_k, K_k, F_k are assembled only from the elements of the k-th subdomain. The calculation of each M_k and each F_k − K_k U^n can be performed independently, and the calculation of the sums above is an easy procedure because they are in fact sums of vectors (since M is diagonal, its non-zero part is a vector too).
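As an illustration of this additivity, here is a minimal sketch in which each process assembles its partial sums over its own subset of elements and the global sums of (6) are formed by vector reductions. The paper's implementation exchanges partial sums explicitly through a root process (see the Implementation section); MPI_Allreduce is used here only as the shortest equivalent, and all names are illustrative:

#include <mpi.h>
#include <vector>

// mass_k and r_k must arrive already assembled from the local elements:
// mass_k = diag(M_k), r_k = F_k - K_k U^n (zeros outside the local support).
void explicit_step(std::vector<double>& u, std::vector<double>& mass_k,
                   std::vector<double>& r_k, double dt) {
    const int n = (int)u.size();
    // M = sum_k M_k and R = sum_k (F_k - K_k U^n), summed in place:
    MPI_Allreduce(MPI_IN_PLACE, mass_k.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, r_k.data(),   n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // M is diagonal, so solving the SLE is a component-wise division.
    for (int i = 0; i < n; ++i)
        u[i] += dt * r_k[i] / mass_k[i];
}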

Semi-implicit scheme

To solve the differential system (4), the following semi-implicit scheme is used:

(6) U^{n+1,k+1} = U^n + Δt M^{-1} [(1 − θ)(F^n − K^n U^n) + θ(F^{n+1,k} − K^{n+1,k} U^{n+1,k})],

iterated over k until ‖U^{n+1,k+1} − U^{n+1,k}‖ < ε,

where θ is the upward parameter,

ε is the accuracy parameter.

Numerical tests of the method show that the following modification (6') of the above scheme usually requires a smaller number of iterations for calculating U^{n+1}:

(6') the iterations are started not from U^{n+1,0} = U^n but from a value linearly extrapolated from the two previous time steps.

In some tests this change reduces computation time more than 20 times. Possibly there are other extrapolation approaches which can give more precise predictions for the next time step.
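For illustration, here is a minimal sketch of the inner iterations of one time step, assuming the reconstructed form of scheme (6) above; residual(U) stands for F(U) − K(U)U, and all names and signatures are assumptions rather than the solver's actual interface:

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

void semi_implicit_step(Vec& U, const Vec& mass_diag, double dt,
                        double theta, double eps, int n_iter_max,
                        const std::function<Vec(const Vec&)>& residual) {
    const Vec Un = U;                 // solution at time level n
    const Vec r_expl = residual(Un);  // explicit part, computed once
    for (int k = 0; k < n_iter_max; ++k) {
        const Vec r_impl = residual(U);  // implicit part, lagged at iteration k
        double err = 0.0;
        for (std::size_t i = 0; i < U.size(); ++i) {
            const double u_new = Un[i] + dt * ((1.0 - theta) * r_expl[i]
                                             + theta * r_impl[i]) / mass_diag[i];
            err = std::max(err, std::fabs(u_new - U[i]));
            U[i] = u_new;
        }
        if (err < eps) break;         // accuracy parameter reached
    }
}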

Speedup estimate

Now let us estimate the possible speedup of the method. Execution time in our case can be expressed as

T(n) = T_a(n) + T_c(n) + T_s,

where T_a is the time of the assembling procedure,

T_c is the time of communications for sending partial matrices and receiving the current solution,

T_s is the time of solving (4),

n is the number of processors.

T_s consists of the time necessary for assembling the global problem data structures using a sequential algorithm. Namely, here we consider the time for assembling the global value of F − KU and the global value of the mass matrix. Here we assume that all processors send their partial sums to a single ("root") processor, which in turn calculates the global matrices and the solution for the next time step. The communication scheme in this case looks like this:

[Figure: "star" scheme — every processor sends its partial sums directly to the root]
Assume that T_a(n) = T_a(1)/n and that the sequential summation at the root gives T_c(n) = c·n. We have T(n) = T_a(1)/n + c·n + T_s. It is obvious that for large-scale computers only the communication and solution times give considerable contributions. Thus this method will have good scalability and speedup if these components are not too big.

Speedup and efficiency in this case are S(n) = T(1) / (T_a(1)/n + c·n + T_s) and E(n) = S(n)/n. In the extreme case n → ∞ the efficiency E(n) → 0, and thus this method has limited scalability.

In order to reduce communication overhead we can implement the summation of matrices using binary-tree schemes. This gives a logarithmic dependence of communication time on the number of processors, and in this case we have (approximately) the following speedup: S(n) ≈ T(1) / (T_a(1)/n + c·log₂ n + T_s).
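In MPI this binary-tree summation does not have to be hand-coded: the collective MPI_Reduce typically performs a logarithmic reduction tree internally. A sketch with illustrative buffer names:

#include <mpi.h>
#include <vector>

// Sum the partial residue vectors of all processes into the root in
// O(log n) communication stages; global must be preallocated to the
// same size (its contents are significant at the root only).
void tree_sum(const std::vector<double>& partial, std::vector<double>& global,
              int mpi_root_rank) {
    MPI_Reduce(partial.data(), global.data(), (int)partial.size(),
               MPI_DOUBLE, MPI_SUM, mpi_root_rank, MPI_COMM_WORLD);
}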

The communication scheme in this case is as follows:

[Figure: binary-tree summation communication scheme]
Note that there is no actual necessity to send messages from the i-th processor to itself.

Here is a graph of efficiency and speedup for certain fixed values of the parameters T_a(1), c, and T_s:

[Figure. Red line — efficiency for the sequential summation scheme; green line — speedup for the sequential summation scheme; blue line — speedup for the binary-tree summation scheme.]

Different values of the parameters T_a(1), c, T_s give different graphs, but the general conclusion stays the same: this method sustains satisfactory speedup only for computational systems with not more than a few tens of processors.

Now let us show how we can improve the scalability of the method. Let us divide the global matrices into approximately equal parts and distribute them among the processors. In this case, if a processor needs to contribute to the local part of a global matrix, it does not need to spend any time on messaging at all. The communication scheme in this case looks like this:

[Figure: each processor keeps and updates its own part of the global matrices]
Let us estimate the speedup and efficiency in this case. The best case is when we do not need to send messages between different processors at all. Thus we have

T(n) = (T_a(1) + T_s(1)) / n, S(n) = n, E(n) = 1.

In other words, we obtain the ideal speedup! But note that we can obtain this only when the modeling area is non-continuous, for example consisting of n independent parts which have no intersections.

The worst case is when each processor has to send messages to all other processors. The number of communication stages in this case is n − 1; they can be performed in the same manner as in the following example for a 3-node system:

[Figure: all-to-all exchange performed in stages]

Thus

T_c(n) = c·(n − 1),

and S(n) = T(1) / (T_a(1)/n + c·(n − 1) + T_s(1)/n).

It is not an exciting result. But note that the number of messages we need to send is determined by the topology of the respective matrix elements. This topology, in turn, is determined by the topology of the modeling space: if two elements have a common node, they contribute to the respective global matrix element. In other words, it is possible to establish an unambiguous mapping between the matrix being built and a graph adjacency matrix: each non-zero value in the former corresponds to a "true" value in the latter, and vice versa.

Since the modeling space topology determines the information dependency topology, we should distribute data among processors with respect to this fact. More precisely, if we can limit the maximal number of messages that a particular processor needs to send, we can improve the worst-case estimate for this method.

Let us define the "degree of a node" as the number of adjacent nodes according to the given topology.
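As an illustration, the following sketch (with hypothetical names) builds the information dependency graph of a partitioning directly from the shared-node rule described above; the degree of a node is then simply the size of its adjacency set:

#include <cstddef>
#include <set>
#include <vector>

struct Element { int nodes[4]; };  // linear tetrahedron

// Two processors depend on each other iff they own elements sharing a node.
std::vector<std::set<int>> dependency_graph(
        const std::vector<Element>& elems,
        const std::vector<int>& owner,   // owner[e] = processor of element e
        int n_nodes, int n_procs) {
    // processors touching each mesh node
    std::vector<std::set<int>> procs_of_node(n_nodes);
    for (std::size_t e = 0; e < elems.size(); ++e)
        for (int k = 0; k < 4; ++k)
            procs_of_node[elems[e].nodes[k]].insert(owner[e]);
    // an edge between every pair of processors sharing a node
    std::vector<std::set<int>> graph(n_procs);
    for (const auto& ps : procs_of_node)
        for (int a : ps)
            for (int b : ps)
                if (a != b) graph[a].insert(b);
    return graph;
}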

Consider, for example, the following partitioning of the area:

[Figure: partitioning of the modeling area]

The graph of information dependencies is as follows (nodes are processors, edges are dependencies):

[Figure: information dependency graph]

The maximal node degree in this graph is 2. If we apply this technique for a larger number of processors, this fact remains valid, so we can expect that the maximal number of messages a particular processor node needs to send is 2.

Let us look at a bit more complex regular sharing scheme:

[Figure: regular two-dimensional partitioning of the area]

The respective graph of information dependencies is as follows:

[Figure: information dependency graph of the regular partitioning]

In this case the maximal node degree, and thus the maximal number of messages, is not more than 8.

Moreover, for real modeling areas we can estimate the maximal number of information dependencies after discretization. Thus we can estimate T_c as follows:

T_c ≤ c·q,

where q is the maximal degree of a node in the information dependency graph.

Let us fix q and look at the speedup in these three cases (with the same values of the parameters T_a(1), c, T_s as before):

[Figure. Blue line — best case; red line — worst case; green line — most probable estimate.]

The efficiency for the most probable estimate, depending on the number of processors, is:

[Figure: efficiency vs. number of processors for the most probable estimate]

Note that it decreases very slowly, and even for two hundred processors it is greater than 0.3.

In the general case, the fewer synchronization operations we have, the better the scalability of the code we obtain. If the program has any non-zero fraction s of sequential code, then by Amdahl's law S(n) ≤ 1/s; for example, with s = 0.05 the speedup can never exceed 20, no matter how many processors are used. Therefore eliminating any global synchronization is an effective way to improve the scalability of the code. Fortunately, the last approach to the organization of the code permits reducing global synchronization as much as is possible for this numerical method.

More precisely, each processor should coordinate its own work only with its neighbors. On the other hand, this code organization is more complex than the single-root summation schemes.

Of course, it is possible to try to improve this method by combining the shared-matrices approach with binary-tree summation schemes in order to utilize the benefits of both. But it is difficult to design an effective algorithm that works well in the general case. Moreover, the benefits of binary-tree-like algorithms show up only with a big number of sum members, while in the shared-matrices approach this number is limited to q, which in real meshes is not more than a few tens (usually it is less than 10).

More comprehensive testing than has been performed so far is necessary to support or refute these considerations. One possible source of errors is that we have not considered the size of the communication messages: with an efficient implementation of this computational method, the more processors are involved in the computation, the smaller the part of the whole vector that has to be sent at once.

Time step optimization

As experience shows, for complex processes it is difficult to guess or estimate in advance a time step sufficient for computation with the desired accuracy. Moreover, to use a constant time step during the whole computation we would need to set it to the minimal estimated value Δt_min. Since such a small step is not needed all the time, we can optimize the computation by choosing an appropriate time step Δt for each modeling time point t. In some cases it is possible to build a priori a function that will adjust the time step, but only in some cases. A better way is to design an adaptation algorithm which adjusts Δt depending on some criteria.

All the tests below were performed solving a parabolic equation with constant boundary conditions and constant coefficients in the area. The process being modeled has the following feature: all changes in the area decay with time, and after some transitional process we see a constant solution. Usually, the smaller the changes in the area, the larger the time step sufficient for calculating the solution; it is so in this case too.

The parameters selected for monitoring are the ratio r = ε_last / ε, where ε_last is the precision obtained after the last inner iteration at the current time step (see the description of the semi-implicit scheme in the section "Formal description"), and the number of inner iterations (NIT) at the last time step.

Each inner iteration is followed by the procedure of final solution assembling, which requires a lot of communication. Further, given the total modeled time T, the number of time steps is defined from N = T / Δt. Thus, for a faster solution of the problem, we need both to reduce the overall number of inner iterations and to increase the time step. But these demands are inconsistent: the more inner iterations we allow, the bigger the Δt we can choose. The compromise depends on different factors, in particular on the time necessary for communications. The tests below were performed on an SMP system, and thus do not cover the cases where communication time is significant.

Now let us consider different types of adaptation algorithms.

Additive scheme with constant increasing/decreasing step.

After obtaining the solution at the current time step, calculate the new Δt according to the following rule:

(a) Δt_{n+1} = Δt_n ± d,

where the constant increment d is added if the required precision was reached and subtracted otherwise. The resulting behavior of Δt is shown below:

[Figure. Blue line — the initial Δt is chosen close to the optimal value. Red line — the initial Δt is too large; in this case we can see a decreasing period while the algorithm looks for a time step small enough to obtain the given precision.]

This algorithm is simple, but its adaptation to the current situation is too slow. As we can see in the case above, there are approximately 250 iterations before even the first time step produces a correct result.

Now let us compare two more types of adjusting functions. The first has polynomial form:

(p) Δt_{n+1} = Δt_n (1 + p(1 − r)),

where p is a scalar parameter and r is the precision ratio defined above.

Note that this function does not always give Δt_{n+1} > 0, and in the general case additional checks are needed that this condition is fulfilled.

The second is exponential:

(e) Δt_{n+1} = Δt_n exp(p(1 − r)).

These functions adapt much faster than scheme (a).

Here are the results of testing the above functions (p) and (e), with p = 0.1 for the exponential function and p = 0.5 for the polynomial one:

[Figure: Δt behavior for functions (p) and (e)]

Note that, due to a very inappropriate value of the initial time step, these functions change it to a very small value already at the first time iteration; that is why the values at the beginning of the computation are very small. In spite of this, these functions find optimal time step values faster than the first algorithm.

Let us look at these functions in a more comfortable situation (the initial time step is chosen close to the optimal value; all other parameters are the same):

[Figure: Δt behavior with a well-chosen initial time step]

Note that the total number of iterations in this case is less than in the first test.

Functions with NIT penalty

In the considerations above we made no distinction between the total number of time steps and the total number of inner iterations. In order to take communication expenses into account within the optimization process, we need to involve the number of inner iterations in the adaptation criteria. The usual way of building complex criteria functions is to write a linear combination of different criteria functions. Let us rewrite the above criteria functions as follows:

(p') Δt_{n+1} = Δt_n (1 + w₁ p(1 − r) + w₂ (NIT_opt − NIT)),

(e') Δt_{n+1} = Δt_n exp(w₁ p(1 − r) + w₂ (NIT_opt − NIT)),

where w₁, w₂ are the criteria weights,

NIT_opt is the optimal number of inner iterations.

As shown below, this extension of the criteria functions allows smoothing them and making them more predictable. This, of course, does not matter if we consider only computation time.

Let us choose certain values of the weights w₁, w₂ for the exponential function and consider the behavior of Δt in this case:

[Figure: Δt behavior for the exponential function with NIT penalty]

As mentioned above, these testing results represent only the numerical aspects, not computation time or resource consumption; they remain incomplete without testing on a relatively big system where communication time is significant.

Limit on the number of inner iterations.

A limit on the number of inner iterations is set in the implemented computation algorithm. This limit prevents the algorithm from inefficient calculations when the process either diverges or converges too slowly. Experimental results show that a good choice for most problems is about 10. In the tests described here the actual number of iterations was 3 ... 7.

For a small limit on the number of inner iterations, the behavior of Δt strongly depends on the chosen parameter p in both cases (polynomial and exponential).

In the case of the polynomial function there is an optimal value of the parameter p for which we have the minimal number of iterations; any deviation from it increases this number.

In the case of the exponential function, the greater the p we choose, the smaller the number of iterations we get. But for small p we can see big oscillations of Δt when the necessary value of Δt should change very little. In other words, we get unstable behavior.

Polynomial functions:

[Figure: Δt behavior for polynomial adjusting functions with different p]

Exponential functions:

[Figure: Δt behavior for exponential adjusting functions with different p]

In spite of this, the behavior of the adjusting functions can be stabilized using the NIT penalty. Below is a result of testing the polynomial function with a big adaptation parameter p = 0.20 and a small NIT limit of 3; two functions are shown, with and without the NIT penalty member.

[Figure: polynomial adjustment with and without the NIT penalty]

The final version of the time step adjustment procedure.

Due to the considerations above, the following version of the time step adjustment procedure was chosen:

1. Calculate a new value of Δt using the exponential form, same as (e'). This form was chosen because it has fast adaptation capabilities. The NIT penalty is used to prevent big variations of the time step when too small an iteration limit is set; its weight, however, was set to a small value of about 0.001.

2. Limit the rate of change of Δt using a scalar parameter D. This step prevents an overly fast reaction of the algorithm; in particular, it eliminates the fall-offs of the time step that can be observed in some of the previous diagrams. Parameter D is to be chosen of the order of a few tens (in the implemented program it is set to 50).

3.

4. Clamp Δt between minimal and maximal values, which are to be chosen from common-sense a priori considerations.
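A compact sketch of this procedure is given below, assuming an exponential update of form (e') and a simple multiplicative rate limiter for step 2; all formulas here are reconstructions and all parameter names are illustrative:

#include <algorithm>
#include <cmath>

// ratio = eps_last / eps; nit = inner iterations at the last time step.
double adjust_time_step(double dt, double ratio, int nit,
                        int nit_opt = 5, double p = 0.1, double w = 0.001,
                        double D = 50.0,
                        double dt_min = 1e-9, double dt_max = 1.0) {
    // 1. Exponential form (e') with a small NIT penalty weight w.
    double dt_new = dt * std::exp(p * (1.0 - ratio) + w * (nit_opt - nit));
    // 2. Rate limiter governed by D: forbid changing dt by more than a
    //    factor of (1 + 1/D) per step, eliminating sudden fall-offs.
    dt_new = std::clamp(dt_new, dt / (1.0 + 1.0 / D), dt * (1.0 + 1.0 / D));
    // 4. Clamp to a priori chosen bounds.
    return std::clamp(dt_new, dt_min, dt_max);
}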

Implementation

Now let us consider the structure of the program code which implements the first approach (with a single root node and simple summation).

The program code is based on an existing sequential version of the solver. The most important changes are in the iterative procedure that calculates U^{n+1}. The structure of this part of the code is shown below (this is C-like notation where all unimportant details are omitted and mathematical notation is used where pertinent):

do
{
    // apply boundary conditions to the current solution
    for (int i = 0; i < n_nodes; i++)
    {
        if (has_dirichlet_condition(i))
            U[i] = dirichlet_value(i);  // prescribed boundary value
        else
            ;                           // U[i] is updated from the assembled sums below
    }

    // assemble processor's part of matrices
    assemble_matrices();

    if (is_root())
    {
        // this branch is executed only in the root processor.
        // receive the other partial sums:
        for (int pe = 0; pe < mpi_size; pe++) // (*)
            if (pe != mpi_root_rank)
            {
                // receive mass diagonal:
                MPI_Recv(mass_diag_part, ..., pe, ...);
                // receive residue values:
                MPI_Recv(residue_part, ..., pe, ...);
                // add to sum
            }

        // calculate new U; note that M is diagonal,
        // so solving the SLE is a component-wise division
    }
    else
    {
        // this branch is executed only in non-root processors.
        // send partial sums to the root processor:
        // send mass diagonal:
        MPI_Send(mass_diag, ..., mpi_root_rank, ...);
        // send residue values:
        MPI_Send(residue, ..., mpi_root_rank, ...);
    }

    // root processor sends, other processors receive:
    MPI_Bcast(U, ..., mpi_root_rank, ...);
}
while ( (iter < n_iter_max) && (error > eps) );

Note on communication scheme

In the pseudo-listing above there is a cycle through all non-root processor numbers (*). In the most rigorous implementation, on the i-th execution of this cycle's body only messages from the i-th processor are received. But in a big multiprocessor system latencies can be unpredictable, and one way to improve this algorithm is to allow a looser synchronization. More concretely, in order to guarantee the correctness of the assembling procedure we only need to make sure that every received partial sum belongs to the current inner iteration.

Considering this, the following communication scheme is used in the implementation2. At the beginning of the cycle the tag counter is set to the same value in all of the program's processes; then it is incremented at each step. Doing so, we in fact define a set of communication layers, each of which corresponds to a particular tag value. Because we need to assemble two matrices, each time step is separated into two communication layers (one for the mass matrix and one for the residue vector). Thus the tag value is incremented twice per inner time step iteration.

Here is the pseudocode of this procedure:

tag := 0

do
{
    assemble_matrices();

    if (is_root())
    {
        // this branch is executed only in the root processor.
        // receive the other partial sums:
        for (int pe = 0; pe < mpi_size; pe++) // (*)
            if (pe != mpi_root_rank)
            {
                // here we specify both the source process and the source tag
                MPI_Recv(mass_diag_part, ..., pe, tag, ...);
                MPI_Recv(residue_part, ..., pe, tag + 1, ...);
                // add to sum
            }

        tag := tag + 2
    }
    else
    {
        // this branch is executed only in non-root processors.
        MPI_Send(mass_diag, ..., mpi_root_rank, tag, ...);
        MPI_Send(residue, ..., mpi_root_rank, tag + 1, ...);
        tag := tag + 2
    }

    MPI_Bcast(U, ..., mpi_root_rank, ...);
}
while ( (iter < n_iter_max) && (error > eps) );


This is the most straightforward improvement, but it is also possible to have only one communication layer per inner iteration. That variant is planned for the next versions of the program.
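One possible shape of that improvement, sketched with illustrative names: pack both partial vectors into a single message, so that a single tag value per iteration suffices:

#include <mpi.h>
#include <vector>

void send_packed(const std::vector<double>& mass_diag,
                 const std::vector<double>& residue,
                 int mpi_root_rank, int tag) {
    // concatenate the two partial vectors into one buffer
    std::vector<double> buf(mass_diag);
    buf.insert(buf.end(), residue.begin(), residue.end());
    MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE,
             mpi_root_rank, tag, MPI_COMM_WORLD);
    // the receiver splits the buffer back into two halves of known size
}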

Changes in the class hierarchy

In order to improve overall code quality, the class hierarchy of the initial program was reorganized to reduce unnecessary dependencies between different parts of the code. Here is the initial class hierarchy of the program (UML-like notation is used below):

[Figure: initial class hierarchy]

The main functions of the classes are commented at the right side of the diagram. After moving all variable declarations close to the code that uses them, the following ER (entity-relationship) diagram was obtained:

[Figure: ER diagram; dashed arrows mean a "uses" relation]

In order to simplify these relations, the inheritance diagram was changed as follows:

[Figure: reorganized class hierarchy]

This change required only two new methods in the class elementary_matrix. These methods are used for passing the coordinates of an element and the solution values necessary for the calculations.

Optimization of message size

The code presented above has the following inefficiency: each processor sends the whole partial sum vector, while only its non-zero elements actually need to be sent. They comprise only a part of the whole vector, and the larger the number of processors, the smaller this part. One straightforward optimization is to send (and store in local memory) only the necessary elements of the vectors — those which have non-zero values.

For sending only non-zero elements one can use any known method of representing sparse sets. For the presented realization the following method is chosen: a message contains (index, value) pairs instead of the whole vector. The receiving process can determine where to add the contributions using this information.

This method has the following feature: the total volume of data to be sent is larger than just the amount of the contribution values, but this volume is constant, independently of the number of processors. Moreover, the benefit of this approach is its universal applicability to any data sharing scheme. This is the method implemented in the program.

It is possible to optimize the method above for simpler sharing schemes. For example, if each processor has a continuous range of non-zero values, we can send a message in the form (first index, count, values…).
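A minimal sketch of this (index, value) message format, using MPI_Probe and MPI_Get_count on the receiving side; the struct and helper names are illustrative, and a homogeneous cluster is assumed so the pairs can be sent as raw bytes:

#include <mpi.h>
#include <vector>

struct Pair { int index; double value; };

// Pack and send only the non-zero entries of a partial sum vector.
void send_sparse(const std::vector<double>& partial, int root, int tag) {
    std::vector<Pair> pairs;
    for (int i = 0; i < (int)partial.size(); ++i)
        if (partial[i] != 0.0)
            pairs.push_back({i, partial[i]});
    MPI_Send(pairs.data(), (int)(pairs.size() * sizeof(Pair)), MPI_BYTE,
             root, tag, MPI_COMM_WORLD);
}

// Receive a sparse message and add its contributions in place.
void receive_and_add(std::vector<double>& global, int source, int tag) {
    MPI_Status st;
    MPI_Probe(source, tag, MPI_COMM_WORLD, &st);
    int bytes = 0;
    MPI_Get_count(&st, MPI_BYTE, &bytes);     // message size is not known a priori
    std::vector<Pair> pairs(bytes / sizeof(Pair));
    MPI_Recv(pairs.data(), bytes, MPI_BYTE, source, tag, MPI_COMM_WORLD, &st);
    for (const Pair& p : pairs)
        global[p.index] += p.value;           // the index tells where to add
}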

Optimization of calculations

Profiling tests show that calculation of the coordinate conversion Jacobian is a significant part of the total calculation expenses. For example, for the filtration problem it takes approximately 38%, and for the Navier-Stokes system 5.3%, of the whole computation time. The latter result is explained by the more complex procedure of local matrix calculation, which in turn takes a greater part of the total computation expenses.

At the same time, the Jacobian values depend only on the domain discretization geometry, which in the presented implementation of FEM is considered constant during the whole computation. Thus we can calculate the Jacobian values once and save them for the following integrations.

In fact we can save the products w_k |J|, where w_k is the integration weight (see the description of numerical integration above), because the values of the Jacobians are never used alone in the calculations.
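A sketch of this caching, with illustrative names: the products w_k |J| are computed once per element before time stepping and reused by every subsequent integration:

#include <vector>

struct ElementCache {
    std::vector<double> w_detJ;  // w_k * |J(xi_k)| for each quadrature point
};

std::vector<ElementCache> cache; // filled once, before time stepping starts

// Integrate a function over one element using the precomputed products;
// f_at_qpoint returns the integrand value at quadrature point k.
double integrate(int elem, double (*f_at_qpoint)(int elem, int k)) {
    double sum = 0.0;
    const ElementCache& c = cache[elem];
    for (int k = 0; k < (int)c.w_detJ.size(); ++k)
        sum += c.w_detJ[k] * f_at_qpoint(elem, k);  // no Jacobian recomputation
    return sum;
}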

Results

This section presents the results of preliminary testing of the presented method.

Simple test

These results were obtained while solving a small 3-dimensional problem with one DOF per node (a filtration problem) and with the Navier-Stokes system (5 DOF per node, compressible flow modeling). Tetrahedral elements were used for domain discretization. The test was performed on a 2-processor SMP system.

Speedup for these two cases is:

                            Degrees of freedom per element
    Number of processors          4               20
    1                             1               1
    2                             1.6             1.96



The test was performed using the following tools and environment:

    MPI implementation:  MPICH v.1.2.3
    C++ compiler:        gcc v.3.0.4 (with optimization option -O3)
    Operating system:    Linux RedHat 7.2, kernel 2.4.9-13
    Computing system:    2 x Pentium III 1 GHz SMP, 1 GB RAM

Note that for larger local matrices the speedup is better.





Profiling of the Navier-Stokes solver on the same SMP machine shows that calculation of the local matrix takes 84.5% of the total computation time.



Big test

Below are the speedup and efficiency for a 3-dimensional filtration problem (4 DOF per element):

[Figure: speedup and efficiency vs. number of processors]

The nature of the quick performance fall-off at 10 processors is unknown. The most probable reason is interference between the properties of the node interconnection and the properties of the communication protocol.


Here are the same results in numerical form (note that tests for 9, 11-19, and 21-24 processors were not performed; the respective data is a linear interpolation):

[Table: speedup and efficiency for each number of processors]

The test was performed using the following tools and environment:

    MPI implementation:  MPICH
    C++ compiler:        gcc v.3.0.4 (with optimization option -O3)
    Operating system:    Linux RedHat 7.1 (distributed with the AlinkA toolkit)
    Computing system:    each node is a 2 x Athlon 1.5 GHz SMP, 1 GB RAM
    Interconnection:     Gigabit Ethernet with a switch





Acknowledgements

The present work was performed in December 2001 - March 2002 at the University of Avignon (http://www.univ-avignon.fr/), under the supervision of prof. André Chambarel.

The large tests were performed at the Laboratoire de Modélisation et Simulation Numérique en Mécanique (http://www.l3m.univ-mrs.fr), Marseille.



Abbreviations

DOF - degree of freedom

UML - Unified Modeling Language

ER(D) - Entity-Relationship (Diagram)



Appendix 1

An example of a complicated non-linear equation system which can be efficiently solved on a multiprocessor system using the presented method.

This is the well-known Navier-Stokes system, a model of compressible fluid flow, here written in compact conservative form:

∂W/∂t + ∇·F(W) = ∇·G(W),

where W = (ρ, ρu, ρv, ρw, E)ᵀ is the vector of conservative variables (5 components, which gives the 5 degrees of freedom per node mentioned in the Introduction); F(W) is the convective flux and G(W) the viscous flux;

ρ - density;

p - pressure;

E - energy;

(u, v, w) - velocity components.

1 You can find this equation system in Appendix 1.

2 The MPI library allows identifying each send-receive communication using a number called a "tag". A particular receive procedure can be restricted to accept only messages with a particular tag.