

INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING, VOL. 32, 1205-1227 (1991)

A METHOD OF FINITE ELEMENT TEARING AND
INTERCONNECTING AND ITS PARALLEL SOLUTION
ALGORITHM

CHARBEL FARHAT

Department of Aerospace Engineering Sciences and Center for Space Structures and Controls, University of Colorado at
Boulder, Boulder, CO 80309-0429, U.S.A.

FRANCOIS-XAVIER ROUX

O.N.E.R.A. Groupe Calcul Parallele, 29 Av. de la Division Leclerc, BP 72, 92322 Chatillon Cedex, France

SUMMARY
A novel domain decomposition approach for the parallel finite element solution of equilibrium equations is
presented. The spatial domain is partitioned into a set of totally disconnected subdomains, each assigned to
an individual processor. Lagrange multipliers are introduced to enforce compatibility at the interface nodes.
In the static case, each floating subdomain induces a local singularity that is resolved in two phases. First,
the rigid body modes are eliminated in parallel from each local problem and a direct scheme is applied
concurrently to all subdomains in order to recover each partial local solution. Next, the contributions of
these modes are related to the Lagrange multipliers through an orthogonality condition. A parallel
conjugate projected gradient algorithm is developed for the solution of the coupled system of local rigid
modes components and Lagrange multipliers, which completes the solution of the problem. When
implemented on local memory multiprocessors, the proposed method of tearing and interconnecting
requires less interprocessor communication than the classical method of substructuring. It is also suitable
for parallel/vector computers with shared memory. Moreover, unlike parallel direct solvers, it exhibits a
degree of parallelism that is not limited by the bandwidth of the finite element system of equations.

1. INTRODUCTION
A number of methods based on domain decomposition procedures have been proposed in recent
years for the parallel solution of both static and dynamic finite element equations of equilibrium.
Most of these methods are derived from the popular substructuring technique. Typically, the
finite element domain is decomposed into a set of subdomains and each of these is assigned to an
individual processor. The solution of the local problems is trivially parallelized and usually a
direct method is preferred for this purpose. Parallel implementations of both a direct1 and an
iterative2 solution of the resulting interface problem have been reported in the literature. A
number of more original approaches have also been spurred by the advent of new parallel
processors. Ortiz and Nour-Omid3 have developed a family of unconditionally stable concurrent
procedures for transient finite element analysis and Farhat4 has designed a multigrid-like
algorithm for the massively parallel finite element solution of static problems. Both of these
developments relate to the divide and conquer paradigm but depart from classical substructuring.
In this paper, we present a parallel finite element computational method for the solution of
static problems that is also a departure from the classical method of substructures. The unique
feature about the proposed procedure is that it requires fewer interprocessor communications

0029-5981/91/141205-23$11.50    Received 28 February 1990


© 1991 by John Wiley & Sons, Ltd.    Revised 30 July 1990

than traditional domain decomposition algorithms, while it still offers the same amount of
parallelism. Roux5,6 has presented an early version of this work that is limited to a very special
class of problems where a finite element domain can be partitioned into a set of disconnected but
non-floating subdomains. Here, we generalize the method for arbitrary finite element problems
and arbitrary mesh partitions. We denote the resulting computational strategy by 'finite element
tearing and interconnecting' because of its resemblance with the very early work of Kron7 on
tearing methods for electric circuit models. In Section 2, we partition the finite element domain
into a set of totally disconnected subdomains and derive a computational strategy from a hybrid
variational principle where the inter-subdomain continuity constraint is removed by the in-
troduction of a Lagrange multiplier. An arbitrary mesh partition typically contains a set of
floating subdomains which induce local singularities. The handling of these singularities is treated
in Section 3. First, the rigid body modes are eliminated in parallel from each local problem and a
direct scheme is applied concurrently to all subdomains in order to recover each partial local
solution. Next, the contributions of these modes are related to the Lagrange multipliers through
an orthogonality condition. A parallel conjugate projected gradient algorithm is developed in
Section 4 for the solution of the coupled system of local rigid modes components and Lagrange
multipliers, which completes the solution of the problem. Section 5 deals with the preconditioning
of the interface problem in order to speed up the recovery of the Lagrange multipliers. Section 6
emphasizes the parallel characteristics of the proposed method and Section 7 contrasts it with the
method of substructures. Section 8 discusses some important issues related to the partitioning of a
given finite element mesh. Finally, Section 9 illustrates the method with structural examples on
the distributed memory hypercube iPSC (32 processors) and the shared memory parallel/vector
CRAY-2 system (4 processors), and Section 10 concludes the paper.

2. FINITE ELEMENT TEARING AND INTERCONNECTING


Here we present a domain decomposition based algorithm associated with a hybrid formulation
for the parallel finite element solution of the linear elastostatic problem. However, the method is
equally applicable to the finite element solution of any self-adjoint elliptic partial differential
equation. For the sake of clarity, we consider first the case of two subdomains, then generalize the
method for an arbitrary number of subdomains.
The variational form of the three-dimensional boundary-value problem to be solved goes as
follows. Given f and h, find the displacement function u which is a stationary point of the energy
functional

J(v) = (1/2) a(v, v) − (v, f) − (v, h)_Γ        (1)

where

a(u, v) = ∫_Ω c_ijkl ε_ij(u) ε_kl(v) dΩ,    (v, f) = ∫_Ω v_i f_i dΩ,    (v, h)_Γ = ∫_Γh v_i h_i dΓ

In the above, the indices i, j, k take the values 1 to 3, ε_ij(v) = (v_i,j + v_j,i)/2 and v_i,j denotes the
partial derivative of the ith component of v with respect to the jth spatial variable, c_ijkl are the
elastic coefficients, Ω denotes the volume of the elastostatic body, Γ its piecewise smooth
boundary and Γh the piece of Γ where the tractions h_i are prescribed.
FE TEARING AND INTERCONNECTING 1207

Figure 1. Decomposition in two subdomains

The tearing of Ω into Ω1 and Ω2 (Figure 1) replaces the above problem by a pair of subdomain
problems: find the displacement functions u1 and u2 which are stationary points of the energy
functionals J1(v1) and J2(v2) defined respectively on Ω1 and Ω2    (2)

and which satisfy on the interface boundary Γ_I the continuity constraint

u1 = u2 on Γ_I        (3)

Solving the two above variational problems (2) with the subsidiary continuity condition (3) is
equivalent to finding the saddle point of the Lagrangian

L(v1, v2, μ) = J1(v1) + J2(v2) + (v1 − v2, μ)_ΓI        (4)

where the saddle point (u1, u2, λ) satisfies

L(u1, u2, μ) ≤ L(u1, u2, λ) ≤ L(v1, v2, λ)        (5)

for any admissible v1, v2 and μ. Clearly, the left inequality in (5) implies that (u1 − u2, μ)_ΓI ≤
(u1 − u2, λ)_ΓI, which imposes (u1 − u2, μ)_ΓI = 0 for any admissible μ and therefore u1 = u2 on Γ_I.
The right inequality in (5) imposes J1(u1) + J2(u2) ≤ J1(v1) + J2(v2) for any pair of admissible
functions (v1, v2). This implies that, among all admissible pairs (v1, v2) which satisfy the
continuity condition (3), the pair (u1, u2) minimizes the sum of the energy functionals J1 and J2
defined respectively on Ω1 and Ω2. Therefore, u1 and u2 are the restriction of the solution u of the
non-partitioned problem (1) to respectively Ω1 and Ω2. Indeed, equations (4) and (5) correspond
to a hybrid variational principle where the inter-subdomain continuity constraint (3) is removed
by the introduction of a Lagrange multiplier (see, for example, Pian8).
If now the displacement fields u1 and u2 are expressed by suitable shape functions as

u1 = N1u1 and u2 = N2u2

and the continuity equation is enforced for the discrete problem, a standard Galerkin procedure
transforms the hybrid variational principle (4) into the following algebraic system:

K1u1 = f1 + B1^Tλ
K2u2 = f2 − B2^Tλ        (6)
B1u1 = B2u2

where Kj, uj and fj, j = 1, 2, are respectively the stiffness matrix, the displacement vector and the
prescribed force vector associated with the finite element discretization of Ωj. The vector of
Lagrange multipliers λ represents the interaction forces between the two subdomains Ω1 and Ω2
along their common boundary Γ_I. Within each subdomain Ωj, we denote the number of interior
nodal unknowns by n_j^i and the number of interface nodal unknowns by n_j^b. The total number of
interface nodal unknowns is denoted by n_I. Note that n_I = n_1^b = n_2^b in the particular case of two
subdomains. If the interior degrees of freedom are numbered first and the interface ones are
numbered last, each of the two connectivity matrices B1 and B2 takes the form

Bj = [0_j  I_j],    j = 1, 2

where 0_j is an n_j^b × n_j^i null matrix and I_j is the n_j^b × n_j^b identity matrix. The vector of Lagrange
multipliers λ is n_I long.
If both K1 and K2 are non-singular, equations (6) can be written as

(B1K1^-1B1^T + B2K2^-1B2^T)λ = B2K2^-1f2 − B1K1^-1f1
u1 = K1^-1(f1 + B1^Tλ)        (7)
u2 = K2^-1(f2 − B2^Tλ)

and the solution of (6) is obtained by solving the first of equations (7) for the Lagrange multipliers
λ, then substituting these in the second of (7) and backsolving for u1 and u2.
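To make the two-subdomain procedure concrete, the following sketch applies equations (6) and (7) to a small illustrative problem: a fixed-fixed bar of four unit springs torn at its centre node. The stiffnesses, loads and partitioning are invented for illustration and are not data from the paper.

```python
# Illustrative two-subdomain solve, equations (6)-(7): a fixed-fixed bar of
# four unit springs torn at the centre node. Omega_1 carries nodes 1,2 and
# Omega_2 carries nodes 2',3; node 2 = 2' is the single interface dof.

def solve2(A, b):
    """Solve a 2x2 system A x = b by Cramer's rule."""
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - b[1]*A[0][1]) / det,
            (A[0][0]*b[1] - A[1][0]*b[0]) / det]

K1 = [[2.0, -1.0], [-1.0, 1.0]]   # stiffness of Omega_1 (left end fixed)
K2 = [[1.0, -1.0], [-1.0, 2.0]]   # stiffness of Omega_2 (right end fixed)
f1, f2 = [0.0, 1.0], [0.0, 0.0]   # unit load at the interface node of Omega_1

# Boolean matrices B1 = [0 1], B2 = [1 0] extract the interface dof, so the
# interface operator F = B1 K1^-1 B1^T + B2 K2^-1 B2^T is a scalar here.
F = solve2(K1, [0.0, 1.0])[1] + solve2(K2, [1.0, 0.0])[0]
d = solve2(K2, f2)[0] - solve2(K1, f1)[1]   # B2 K2^-1 f2 - B1 K1^-1 f1

lam = d / F                               # first of equations (7)
u1 = solve2(K1, [f1[0], f1[1] + lam])     # u1 = K1^-1 (f1 + B1^T lam)
u2 = solve2(K2, [f2[0] - lam, f2[1]])     # u2 = K2^-1 (f2 - B2^T lam)
print(lam, u1, u2)                        # continuity: u1[1] == u2[0]
```

With these numbers the recovered displacements match the solution of the unpartitioned fixed-fixed bar, and the interface values of u1 and u2 coincide as condition (3) requires.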
For an arbitrary number of subdomains Ωj, the method goes as follows. First, the finite element
mesh is 'torn' into a set of totally disconnected meshes (Figure 2).
For each mesh, the stiffness matrix Kj and the vector of prescribed forces fj are formed. Next,
for each Ωj, a set of Boolean symbolic matrices Bj^k is set up to interconnect the mesh of Ωj with
those of its neighbours Ωk. In general, Bj^k is n_I × (n_j^i + n_j^b) and has the following pattern:

Bj^k = [ 01(j, k) ]
       [ Cj^k    ]
       [ 02(j, k) ]

Figure 2. Finite element tearing

where 01(j, k) is an m1(j, k) × (n_j^i + n_j^b) zero matrix, 02(j, k) is another m2(j, k) × (n_j^i + n_j^b) zero
matrix and Cj^k is an mI(j, k) × (n_j^i + n_j^b) connectivity matrix, mI(j, k) is the number of Lagrange
multipliers that interconnect Ωj with its neighbour Ωk, and m1(j, k) and m2(j, k) are two non-
negative integers which satisfy m1(j, k) + mI(j, k) + m2(j, k) = n_I. The connectivity matrix Cj^k
can be written as

Cj^k = [ 03(j, k)  Ij^k  04(j, k) ]

where 03(j, k) is an mI(j, k) × m3(j, k) zero matrix, Ij^k is the mI(j, k) × mI(j, k) identity matrix,
04(j, k) is another mI(j, k) × m4(j, k) zero matrix, and m3(j, k) and m4(j, k) are two non-negative
integers which verify m3(j, k) + mI(j, k) + m4(j, k) = n_j^i + n_j^b. If a_j and N_s denote respectively the
number of subdomains Ωk that are adjacent to Ωj and the total number of subdomains, the finite
element variational interpretation of the saddle-point problem (4) generates the following
algebraic system:
Kjuj = fj + Σ_{k=1}^{a_j} Bj^{kT}λ,    j = 1, N_s
        (8)
Bj^k uj = Bk^j uk,    j = 1, N_s and Ωk connected to Ωj


If Kj is non-singular for all j = 1, N_s, the solution procedure (7) can be extended to the case of
an arbitrary number of subdomains. However, the finite element tearing process described in this
section may produce some 'floating' subdomains Ωj which are characterized by a singular stiffness
matrix Kj. When this happens, the above solution algorithm (7) breaks down and a special
computational strategy is required to handle the local singularities.
We refer to the computational procedure presented herein as the method of finite element
tearing and interconnecting because of its resemblance with Kron's tearing method7 for electric
circuit models. We also note that the utility of Lagrange multipliers specifically for domain
decomposition has also been previously recognized by other investigators.9,10

3. HANDLING LOCAL SINGULARITIES


Again, we focus on the two-subdomain tearing. The extrapolation to N_s > 2 is straightforward.
For example, suppose that Ω corresponds to a cantilever beam and that Ω1 and Ω2 are the result
of a vertical partitioning (Figure 3).
In this case, K1 is positive definite and K2 is positive semi-definite since no boundary condition
is specified over Ω2. Therefore, the second of equations (6),

K2u2 = f2 − B2^Tλ        (9)

requires special attention. If the singular system (9) is consistent, a pseudo-inverse of K2 can be

Figure 3. Decomposition resulting in a singular subdomain

found, that is a matrix K2^+ which verifies K2K2^+K2 = K2, and the general solution of (9) is given
by

u2 = K2^+(f2 − B2^Tλ) + R2α        (10)

where R2 is an (n_2^i + n_2^b) × n_2^r rectangular matrix whose columns form a basis of the null space of
K2, and α is a vector of length n_2^r. Physically, R2 represents the rigid body modes of Ω2 and α
specifies a linear combination of these. Consequently, we have n_2^r ≤ 6 for three-dimensional
problems, and n_2^r ≤ 3 for two-dimensional problems. Substituting (10) into (7) leads to

(B1K1^-1B1^T + B2K2^+B2^T)λ = −B1K1^-1f1 + B2(K2^+f2 + R2α)
        (11)
u2 = K2^+(f2 − B2^Tλ) + R2α
It should be noted that:
(i) because Bj is a Boolean operator, the result of its application to a matrix or vector quantity
should be interpreted as an extraction process rather than a matrix-matrix or
matrix-vector product. For example, B2R2 is the restriction of the local rigid modes R2 of
Ω2 to the interface unknowns. In the sequel we adopt the notation

Bj Rj = Rj^I

(ii) the pseudo-inverse K2^+ does not need to be explicitly computed. For a given input vector
v, the output vector K2^+v and the rigid modes R2 can be obtained at almost the
same computational cost as the response vector K2^-1v, where K2 is non-singular (see
Appendix I);
(iii) system (11) is under-determined. Both λ and α need to be determined before u1 and u2 can
be found, but only three equations are available so far.
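Remark (i) can be sketched in a few lines: a Boolean matrix Bj is fully described by the list of local degrees of freedom lying on the interface, and its action (and that of its transpose) reduces to gather and scatter operations. The representation below is an illustrative choice, not the paper's data structure.

```python
# Sketch of remark (i): a Boolean connectivity matrix B_j never needs to be
# stored or multiplied explicitly; it is described by the indices of the
# local dofs that lie on the interface (illustrative representation).

def apply_B(iface, v):
    """B_j v : gather the interface entries of a local vector v."""
    return [v[i] for i in iface]

def apply_BT(iface, w, n_local):
    """B_j^T w : scatter an interface vector w into a zero local vector."""
    out = [0.0] * n_local
    for row, i in enumerate(iface):
        out[i] = w[row]
    return out

iface = [2, 3]                          # local dofs 2 and 3 sit on the interface
v = [10.0, 11.0, 12.0, 13.0]
print(apply_B(iface, v))                # extraction, no floating-point work
print(apply_BT(iface, [1.0, 2.0], 4))   # scatter back into a local vector
```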
Since K2 is symmetric, the singular equation (9) admits at least one solution if and only if the
right hand side (f2 − B2^Tλ) has no component in the null space of K2. This can be expressed as

R2^T(f2 − B2^Tλ) = 0        (12)

The above orthogonality condition provides the missing equation for the complete solution of
(11). Combining (11) and (12) yields, after some algebraic manipulations,

[ F_I        −B2R2 ] [ λ ]   [ B2K2^+f2 − B1K1^-1f1 ]
[ −(B2R2)^T    0   ] [ α ] = [ −R2^Tf2 ]        (13)

where

F_I = B1K1^-1B1^T + B2K2^+B2^T

Clearly, F_I is symmetric positive definite and B2R2 has full column rank. Therefore, the system of
equations in (λ, α) is symmetric and non-singular. It admits a unique solution (λ, α) which
uniquely determines u1 and u2.
n t that, since az; 6 6, systems (I 3) and (7) have almost the same size. For an
It is i ~ p o ~ ~toa note
arbitrary number ~ ~ s u b d o ~Na, ~ofnwhichs N , are floating, the additional number of equations
introduced by the ~ a ~ ~of~local i nsingularities
g is bounded by SN,. For large-scale problems and
relatively coarse mesh partitions, this number is a very small fraction of the size of the global
system. On the other hand, if a given tearing process does not result in any floating subdomain, u
is zero and the systems of equations (13) and (7) are identical.
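The following sketch works through equations (9)-(13) on a deliberately tiny example: Ω1 is a single unit spring fixed at its left end, Ω2 is a floating unit spring loaded at its free tip. All numbers, and the analytically written pseudo-inverse, are illustrative assumptions rather than data from the paper.

```python
# Illustrative instance of Section 3: Omega_1 is one unit spring fixed at
# its left node (a single interface dof); Omega_2 is a floating unit spring
# with a unit load at its free tip.

K1 = 1.0                          # Omega_1 stiffness (scalar subdomain)
K2 = [[1.0, -1.0], [-1.0, 1.0]]   # Omega_2: free-free spring, singular
R2 = [1.0, 1.0]                   # rigid body (translation) mode of Omega_2
f1, f2 = 0.0, [0.0, 1.0]
# For this particular 2x2 singular K2 the pseudo-inverse is K2/4, written
# analytically here; Appendix I of the paper obtains K2+ v by factorization.
K2p = [[0.25, -0.25], [-0.25, 0.25]]

# B1 = [1] and B2 = [1, 0] extract the interface dof.
F_I = 1.0/K1 + K2p[0][0]                       # B1 K1^-1 B1^T + B2 K2+ B2^T
d = (K2p[0][0]*f2[0] + K2p[0][1]*f2[1]) - f1/K1
B2R2 = R2[0]                                   # trace of the rigid mode on the interface
e = R2[0]*f2[0] + R2[1]*f2[1]                  # R2^T f2

# Coupled system (13): [ F_I  -B2R2 ; -B2R2  0 ] [lam; alf] = [d; -e]
lam = e / B2R2                                 # from the second block row
alf = (F_I*lam - d) / B2R2                     # from the first block row

u1 = (f1 + lam) / K1                           # K1^-1 (f1 + B1^T lam)
g = [f2[0] - lam, f2[1]]                       # f2 - B2^T lam
u2 = [K2p[0][0]*g[0] + K2p[0][1]*g[1] + R2[0]*alf,
      K2p[1][0]*g[0] + K2p[1][1]*g[1] + R2[1]*alf]   # equation (10)
print(lam, alf, u1, u2)                        # continuity: u1 == u2[0]
```

The recovered multiplier equals the reaction transmitted through the interface, and the rigid body amplitude α places the floating spring so that continuity (3) holds.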
Next, we present a numerical algorithm for the solution of (13).

4. A PARALLEL CONJUGATE PROJECTED GRADIENT ALGORITHM

We seek an efficient solution algorithm that does not require the explicit assembly of F_I. With
the notation R2^I = B2R2, system (13) takes the form

[ F_I       −R2^I ] [ λ ]   [ B2K2^+f2 − B1K1^-1f1 ]
[ −R2^{IT}     0  ] [ α ] = [ −R2^Tf2 ]        (14)

The solution to the above problem can be expressed as

α = (R2^{IT}F_I^-1R2^I)^-1 (R2^Tf2 − R2^{IT}F_I^-1(B2K2^+f2 − B1K1^-1f1))
λ = F_I^-1(B2K2^+f2 − B1K1^-1f1 + R2^Iα)        (15)

As written in (15), this solution procedure is not recommended because it requires either the
explicit inverse of F_I, or the nested solutions of two linear systems involving F_I and
R2^{IT}F_I^-1R2^I. It is noted by Fletcher11 that if two matrices S and Z are computed such that

R2^{IT}S = I    and    R2^{IT}Z = 0        (16)

an alternative representation of the solution to (14) is given by

λ = H(B2K2^+f2 − B1K1^-1f1) + T R2^Tf2
α = T^T(B1K1^-1f1 − B2K2^+f2) − U R2^Tf2        (17)

where

H = Z(Z^TF_IZ)^-1Z^T
T = S − HF_IS
U = S^TF_IHF_IS − S^TF_IS

which does not require the explicit assembly of F_I if a suitable iterative scheme is chosen for
solving all the temporary systems involving the quantity (Z^TF_IZ)^-1. Still, the above solution
procedure is not feasible because it requires the computation of S and Z, typically via a QR
factorization of some matrix involving R2^I, and the iterative solution of too many temporary
systems before λ and α can be obtained.
Clearly, the nature of F_I makes the solution of (14) inadequate by any technique which requires
this submatrix explicitly. This implies that a direct method or an iterative method of the SOR type
cannot be used. The only efficient method of solving (14) in the general sparse case is that of
conjugate gradients, because once K1 and K2 have been factorized, matrix-vector products of the
form F_Iv can be performed very efficiently using only forward and backward substitutions.
Unfortunately, the Lagrangian matrix

L = [ F_I       −R2^I ]
    [ −R2^{IT}     0  ]
is indefinite so that a straightforward conjugate gradient algorithm cannot be directly applied to
the solution of (14). However, the conjugate gradient iteration with the projected gradient (see, for
example, Gill and Murray12) can be used to obtain the sought-after solution. In order to
introduce the latter solution algorithm, we first note that solving (14) is equivalent to solving the
equality constraint problem:

minimize Φ(λ) = (1/2)λ^TF_Iλ + (B1K1^-1f1 − B2K2^+f2)^Tλ
subject to R2^{IT}λ = R2^Tf2        (18)

Since F_I is symmetric positive definite, a conjugate gradient algorithm is most suitable for
computing the unique solution to the unconstrained problem. Therefore, this algorithm will
converge to the solution to (18) if and only if it can be modified so that the constraint R2^{IT}λ = R2^Tf2
is satisfied at each iteration. This can be achieved by projecting all the search directions onto the
null space of R2^{IT}.
The result is a conjugate gradient algorithm with the projected gradient. It is of the form

Initialize
    Pick λ^(0) such that R2^{IT}λ^(0) = R2^Tf2

Iterate k = 1, 2, . . . until convergence        (20)

A fast scheme for finding a starting λ^(0) which satisfies the constraint R2^{IT}λ^(0) = R2^Tf2 is given in
Appendix II. Clearly, R2^{IT}d^(k) = 0 for all k ≥ 1. Therefore, R2^{IT}λ^(k) = R2^{IT}λ^(0), which indicates that
the approximate solution satisfies the linear equality constraint of problem (14) at each
iteration k. It is also important to note that, within each iteration, only one projection is
performed. This projection is relatively inexpensive since the only implicit computations that are
involved are associated with the matrix R2^{IT}R2^I which is at most 6 × 6. This matrix is factored once,
before the first iteration begins. Except for this small overhead, algorithm (20) above has the same
computational cost as the regular conjugate gradient method.
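Since the iteration body of algorithm (20) is not reproduced above, here is a hedged sketch of a textbook conjugate gradient with projected gradients for a problem of the form (18), on a small dense system with a single equality constraint; the data, and the dense representation of the operator, are illustrative assumptions, whereas the paper works with F_I, R2^I and per-subdomain solves.

```python
# Hedged sketch of algorithm (20): conjugate gradient with projected
# gradients for  min 0.5 lam^T F lam - d^T lam  subject to  g^T lam = e
# (one constraint, small dense F, made-up data).

def dot(a, b): return sum(x*y for x, y in zip(a, b))
def matvec(A, v): return [dot(row, v) for row in A]

F = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
d = [1.0, 2.0, 3.0]
g, e = [1.0, 1.0, 1.0], 0.6          # single equality constraint

def project(v):
    """P v = v - g (g^T g)^-1 g^T v : projection onto the null space of g^T."""
    c = dot(g, v) / dot(g, g)
    return [vi - c*gi for vi, gi in zip(v, g)]

lam = [e * gi / dot(g, g) for gi in g]   # feasible start: g^T lam = e
r = project([di - fi for di, fi in zip(d, matvec(F, lam))])
p = r[:]
for _ in range(len(F)):
    rr = dot(r, r)
    if rr < 1e-24:
        break                         # converged on the constraint subspace
    Fp = matvec(F, p)
    step = rr / dot(p, Fp)
    lam = [xi + step*pi for xi, pi in zip(lam, p)]
    r = project([ri - step*fi for ri, fi in zip(r, Fp)])
    p = [ri + (dot(r, r)/rr)*pi for ri, pi in zip(r, p)]

print(lam)   # g^T lam = e holds and the projected gradient vanishes
```

Every search direction is projected, so the iterates stay on the constraint surface, exactly as noted above for algorithm (20).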
After λ is found, the rigid body mode coefficients are computed as

α = (R2^{IT}R2^I)^-1 R2^{IT}(F_Iλ − B2K2^+f2 + B1K1^-1f1)

For an arbitrary number of subdomains N_s, of which N_f are floating, the equality constraint is

R^{IT}λ = [ R1^Tf1 . . . RNf^TfNf ]^T    with    R^I = [ R1^I . . . RNf^I ]

Only those columns of Rj^I which operate on Lagrange multipliers that are associated with Γ_I ∩ Ωj
are non-zero. The projection matrix is P = I − R^I(R^{IT}R^I)^-1R^{IT}, where R^{IT}R^I is generally
banded of dimension at most equal to 6N_f. The banded structure of P is determined by the
subdomain interconnectivity. If for practical reasons this banded structure is not exploited, the
number of three-dimensional floating subdomains should be kept as small as possible, say less
than thirty-two, which implies that the proposed computational method would be suitable only
for coarse and medium grain multiprocessors.

5. PRECONDITIONING THE INTERFACE PROBLEM


As in the case of the conjugate gradient method, the conjugate projected gradient algorithm is
most effective when applied to a preconditioned system of equations. It should be noted that,
even in the presence of floating subdomains, only F_I needs to be preconditioned and not the
global Lagrangian matrix L. In the case of two subdomains, F_I can be written in matrix form as

F_I = B1K1^-1B1^T + B2K2^-1B2^T        (21)
where Kj^-1, j = 1, 2, is replaced by Kj^+ if Ωj is a floating subdomain. The objective is to find an
approximate inverse F̄_I^-1 of F_I that: (a) does not need to be explicitly assembled (especially since
F_I is not explicitly assembled), and (b) is amenable to parallel computations. The matrix F̄_I is
then the preconditioner. Equation (21) above suggests the following choice for F̄_I^-1:

F̄_I^-1 = B1K1B1^T + B2K2B2^T        (22)

At each iteration k, the preconditioned conjugate projected gradient algorithm involves the
solution of an auxiliary system of the form

F̄_I z^(k) = r^(k)        (23)

where r^(k) is the residual at the kth iteration. The particular choice of F̄_I^-1 given in (22) offers the
advantage of solving (23) explicitly without the need for any intermediate factorization.
For computational efficiency, F̄_I^-1 is implemented as

F̄_I^-1 = K1^b + K2^b        (24)

where K1^b and K2^b are the traces of K1 and K2 on Γ_I. Clearly, with this choice for the
preconditioner, the auxiliary system (23) is 'cheap', easy to solve and perfectly parallelizable on
both local and shared memory parallel architectures.
Since we do not have a strong mathematical justification for this choice of the preconditioner,
we have conducted a set of numerical experiments to assess a priori its performance. A fixed-fixed
cylindrical panel was discretized with an N by M regular mesh and was modelled with 4-node
shell elements (Figure 4). All test cases used N_s = 2 and a vertical slicing.
Table I reports the condition numbers of the global stiffness matrix K, the subdomain stiffness
matrices K1 and K2, and the original and preconditioned interface flexibility-like matrices F_I and
F̄_I^-1F_I, for various values of N.
For this test problem, the condition number of the preconditioned interface problem is two orders
of magnitude lower than that of the global problem.
The extrapolation of (22) and (24) to N_s > 2 is straightforward. In order to reduce further the
number of preconditioned conjugate projected gradient iterations, the selective reorthogonalization
procedure developed by Roux and reported in Reference 13 is also utilized.
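A minimal sketch of the preconditioning step (23)-(24), under the assumption of two interface degrees of freedom with illustrative interface traces K1^b and K2^b: each subdomain applies its own trace, and the interface results are summed.

```python
# Sketch of the preconditioning step (23)-(24): with choice (22) the action
# z = (K1_b + K2_b) r is a sum of local interface-block matvecs and needs
# no factorization. The traces and residual below are illustrative.

def matvec(A, v):
    return [sum(a*x for a, x in zip(row, v)) for row in A]

K1_b = [[2.0, -1.0], [-1.0, 2.0]]   # trace of K1 on the interface
K2_b = [[1.0,  0.0], [ 0.0, 1.0]]   # trace of K2 on the interface
r = [0.5, -0.5]                      # residual at iteration k

# Each subdomain applies its own trace in parallel; summing the interface
# contributions is the only communication step.
z = [a + b for a, b in zip(matvec(K1_b, r), matvec(K2_b, r))]
print(z)
```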

6. PARALLEL CHARACTERISTICS OF THE PROPOSED METHOD
Like most domain decomposition based algorithms, the proposed method of finite element
tearing and interconnecting is perfectly suitable for parallel processing. If every subdomain Ωj is
assigned to an individual processor pj, all local finite element computations can be performed in
parallel. These include forming and assembling the stiffness matrix Kj and the forcing vector fj,
factoring Kj and eventually computing the rigid modes Rj, as well as backsolving for uj after λ
and α have been determined. The conjugate projected gradient algorithm described in Section 4 is

Figure 4. Cylindrical panel, N_s = 2



also amenable to parallel processing. For example, the matrix-vector product F_Is^(k) can be
computed in parallel by assigning to each processor pj the task of evaluating yj^(k) = BjKj^-1Bj^Ts^(k),
and exchanging yj^(k) with the processors assigned to neighbouring subdomains in order to
assemble the global result. Interprocessor communication is required only during the solution of
the interface problem (14) and takes place exclusively among neighbouring processors during the
assembly of the subdomain results.
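The neighbour counts discussed in this section can be checked by pure enumeration; the sketch below simply counts, for a subdomain in a regular 3D box partition, the face neighbours (the only ones the tearing method exchanges data with) against all face, edge and vertex neighbours (which substructuring couples).

```python
# Counting neighbours of an interior subcube in a regular 3D box partition:
# the tearing method communicates only across faces, substructuring couples
# every subdomain sharing a face, edge or vertex.

from itertools import product

offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
face = [o for o in offsets if sum(map(abs, o)) == 1]   # share a full face

print(len(face), len(offsets))   # 6 face neighbours vs 26 in total
```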
At this point, we stress that the parallel solution method developed herein requires inherently
less interprocessor communication than other domain decomposition based parallel algorithms.
As mentioned earlier, interprocessor communication within the proposed method occurs only
during the solution of the interface problem (14). The reader should trace back this problem as
well as the presence of the Lagrange multipliers to the integral quantity

(vi − vj, λ)_{ΓIij} = ∫_{ΓIij} λ(vi − vj) dΓ        (25)

where ΓIij is the interface between subdomains Ωi and Ωj. If ΓIij has a zero measure, then
(vi − vj, λ)_{ΓIij} = 0 and no exchange of information is needed between Ωi and Ωj. Therefore the
subdomains which interconnect along one edge in three-dimensional problems and those which
interconnect along one vertex in both two- and three-dimensional problems do not require any
interprocessor communication. This is unlike the parallel method of substructures, whether the
interface problem is solved with a direct scheme1 or with an iterative one.2 For a three-
dimensional regular mesh that is partitioned into subcubes, the proposed method of finite element
tearing and interconnecting requires that each subdomain communicates with at most six
neighbouring subdomains (since a cube has only six faces), while the parallel method of
substructures necessitates that each subdomain communicates with up to 26 neighbours (Figure
5). This communication characteristic makes the proposed parallel solution method very attract-
ive for a multiprocessor with a distributed memory such as a hypercube. Indeed, the advantages
of the method for this family of parallel processors are twofold: (a) the number of message
passings is dramatically reduced, which reduces the overhead due to communication start-up,
and (b) the complexity of the communication requirements is reduced so that an optimal mapping
of the processors onto the subdomains can be reached;14,15 therefore the elapsed time for a given
message is improved. Both enhancements (a) and (b) reduce the communication overhead of the
parallel solution algorithm in a synergistic manner. This algorithmic feature of the proposed
method is still desirable for shared memory multiprocessors because it eases the assembly process
during the interface solution and makes the latter more manageable. It is not, however, as critical
for the performance as it is for local memory multiprocessors.

7. TEARING VS. SUBSTRUCTURING


Another difference between the subdomain based parallel solution method developed in this
paper and the parallel method of substructures lies in the formulation of the interface problem.
For the method of substructures, the interface problem corresponds to a stiffness formulation.
For the two-subdomain decomposition it can be written as:
(KII − K1I^T K11^-1 K1I − K2I^T K22^-1 K2I) uI = fI − K1I^T K11^-1 f1 − K2I^T K22^-1 f2        (26)

where KII, K11 and K22 are the stiffness matrices associated respectively with the interface nodes
and the interior nodes of subdomains Ω1 and Ω2, and K1I and K2I are the coupling stiffnesses
between respectively Ω1 and Γ_I, and Ω2 and Γ_I (see, for example, Reference 1 for further details). A

Figure 5. Reduced interprocessor communication patterns for two- and three-dimensional regular mesh partitions

standard conjugate gradient algorithm may be used for solving (26). On the other hand, the
resulting interface problem for the method of finite element tearing and interconnecting corres-
ponds to a formulation that is homogeneous to a flexibility, even if it is not exactly a flexibility
formulation. For the two-subdomain decomposition, it can be written as in (14) and necessitates
the use of a conjugate projected gradient algorithm for finding the solution λ.
If K1 and K2 are partitioned into internal and boundary (interface) components and then are
injected into the first of equations (7), it can be easily shown that

B1K1^-1B1^T = (KII^(1) − K1I^T K11^-1 K1I)^-1
B2K2^-1B2^T = (KII^(2) − K2I^T K22^-1 K2I)^-1        (27)

where KII^(1) and KII^(2) denote respectively the contributions of the first and second subdomains to
KII. Equations (27) above establish the relationship between both approaches to domain
decomposition.
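The relation between the flexibility-like and stiffness-like operators can be verified numerically on a tiny example; the 3-dof stiffness matrix below is an illustrative choice, with two interior degrees of freedom numbered first and one interface degree of freedom last.

```python
# Tiny numerical check of the identity underlying (27): with interior dofs
# first and one interface dof last, B K^-1 B^T equals the inverse of the
# Schur complement K_bb - K_ib^T K_ii^-1 K_ib. Numbers are illustrative.

K_ii = [[2.0, -1.0], [-1.0, 2.0]]   # interior block
K_ib = [0.0, -1.0]                   # interior/interface coupling column
K_bb = 2.0                           # interface diagonal entry

# x = K_ii^-1 K_ib by Cramer's rule
det = K_ii[0][0]*K_ii[1][1] - K_ii[0][1]*K_ii[1][0]
x = [(K_ib[0]*K_ii[1][1] - K_ii[0][1]*K_ib[1]) / det,
     (K_ii[0][0]*K_ib[1] - K_ii[1][0]*K_ib[0]) / det]
schur = K_bb - (K_ib[0]*x[0] + K_ib[1]*x[1])

# (K^-1)_bb computed independently from the full 3x3 matrix by cofactors
K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
detK = (K[0][0]*(K[1][1]*K[2][2] - K[1][2]*K[2][1])
        - K[0][1]*(K[1][0]*K[2][2] - K[1][2]*K[2][0])
        + K[0][2]*(K[1][0]*K[2][1] - K[1][1]*K[2][0]))
Kinv_bb = (K[0][0]*K[1][1] - K[0][1]*K[1][0]) / detK   # cofactor / det

flex = 1.0 / schur
print(schur, flex, Kinv_bb)   # flex and Kinv_bb agree
```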
The computational implications of the differences between the two solution methods are as
follows:
• within each iteration, the solution process of problem (14) requires an additional com-
putational step which corresponds to the projection of the search direction onto the null
space of R^{IT};
• within each iteration, the solution process of problem (14) requires the evaluation of the
matrix-vector product BjKj^-1Bj^Td^(k), while the solution process of problem (26) requires the
evaluation of the matrix-vector product KjI^T Kjj^-1 KjI d^(k). Given that Bj is a Boolean matrix
and that its application to a matrix or a vector defines a floating-point-free extraction

process, each conjugate gradient iteration applied to (14) is less computationally intensive
than its counterpart that is applied to (26);
• since a conjugate gradient algorithm captures initially the high frequency mesh modes of a
problem, it can be expected to perform better on a flexibility-like matrix than on a stiffness
matrix because the high frequencies of the former are indeed the low frequencies of the
stiffness matrix, which are closer to the solution of the static problem.
In the light of the above remarks, it is reasonable to expect that for a given mesh partition:
• each conjugate projected gradient iteration that is applied to the solution of the interface
problem (14) which results from the method of finite element tearing and interconnecting will
not be slower, and may be even faster for large-scale problems and a small number of
interface nodes, than each conjugate gradient iteration applied to the solution of the
interface problem (26) which results from the method of substructures;
• the iterative solution of the interface problem associated with the tearing method will exhibit
a faster rate of convergence than the iterative solution of the interface problem resulting from
the conventional method of substructures.
Finally, it should be noted that domain decomposition methods in general exhibit a larger
degree of parallelism than parallel direct solvers. The efficiency of the latter is governed by the
bandwidth of the given finite element system of equations. If the bandwidth is not large enough,
interprocessor communication and/or process synchronization can dominate the work done in
parallel by each processor. This is true not only for multiprocessors with a message-passing
system, but also for super-vector-multiprocessors with a shared memory such as the CRAY
systems, where synchronization primitives are rather expensive. Therefore, the computational
method described in this paper should be seriously considered for large-scale problems with a
relatively small or medium bandwidth. These problems are typically encountered in the finite
element analysis of large space structures which are often elongated and include only a few
elements along one or two directions.16 The method is also recommended for problems where the
storage requirements of direct solvers cannot be met.

8. OPTIMAL MESH DECOMPOSITION


The computational method described in this paper requires that the given finite element mesh be
partitioned into as many submeshes as there are available processors. In this section, we establish
some guidelines for the design of an optimal mesh partition by analysing the effect of its structure
on the performance of the global solution algorithm.
From the numerical point of view, the proposed solution method is hybrid in the sense that it
combines a direct and an iterative scheme. The direct solver is applied to each subdomain
problem, the iterative one to the interface between these subdomains. If the mesh partition is such
that the bandwidth of each subdomain problem is of the same order as the bandwidth of the
global unpartitioned system of equations, the overall algorithm performs more operations than a
direct method applied to the global unpartitioned system, independently of how fast the interface
problem converges. The slicing of a parallelepiped along its largest dimension yields such a
partition (Figure 6). If, on the other hand, the same parallelepiped is partitioned such that the
bandwidth of each subdomain problem is much smaller than the bandwidth of the original finite
element system (Figure 7), and if the convergence of the interface problem is fast enough, the
method of finite element tearing and interconnecting may produce the solution with fewer
computations than a global direct solver.
1218 C. FARHAT A N D F.-X. ROUX

Figure 7. Boxwise partitioning of a parallelepiped
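The trade-off between the two slicings can be sketched with a crude operation-count model (our own illustration, not from the paper): a banded factorization of n equations with half-bandwidth b costs on the order of n b^2 operations, and the grid sizes below are arbitrary assumptions.

```python
# Crude cost model (illustrative only): compare the dominant factorization
# cost n*b^2 of a banded direct solver for two ways of slicing an
# nx x ny x nz grid into p subdomains. Interface iteration cost is ignored.

def band_cost(nx, ny, nz):
    """Rough flop count n*b^2 for a banded factorization of a grid
    numbered along its longest dimension (b ~ the smaller cross-section)."""
    dims = sorted([nx, ny, nz])
    n = nx * ny * nz          # number of grid points
    b = dims[0] * dims[1]     # half-bandwidth ~ smaller cross-section
    return n * b * b

def slab_partition_cost(nx, ny, nz, p):
    """Slicing along the largest dimension (assumed nx, divisible by p):
    each subdomain keeps the full cross-section, so the local bandwidth
    equals the global one."""
    return p * band_cost(nx // p, ny, nz)

def box_partition_cost(nx, ny, nz, p):
    """Boxwise partitioning (p assumed a perfect square px*px): each
    subdomain has a smaller cross-section, hence a smaller bandwidth."""
    px = int(p ** 0.5)
    return p * band_cost(nx // px, ny // px, nz)

nx, ny, nz, p = 64, 16, 16, 4
print(slab_partition_cost(nx, ny, nz, p))  # same total cost as the global factorization
print(box_partition_cost(nx, ny, nz, p))   # 4 times cheaper before interface cost
```

With these (hypothetical) dimensions the slab partition costs exactly as much as the global banded factorization, so any interface iterations make it strictly more expensive, while the boxwise partition cuts the subdomain factorization cost fourfold.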

Besides conditioning, there are two other factors which affect the convergence of the interface
problem (14) and which are directly related to the mesh partition: (a) the number of interface
nodes, and (b) the interconnectivity of the subdomains along their interface. It can be easily
checked that, within one iteration of the conjugate projected gradient algorithm, new information
that is issued from a subdomain Ωj reaches only those subdomains that interconnect with Ωj
along an edge or a plane. Therefore, the interface problem converges faster for a mesh partition
that is characterized by a larger effective interconnectivity bandwidth.
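This neighbour-to-neighbour propagation can be illustrated with a small sketch (ours; the two adjacency lists are hypothetical partitions of a beam): the number of iterations needed for a correction issued from one subdomain to reach every other one is the breadth-first distance across the subdomain connectivity graph.

```python
# Illustrative sketch (not from the paper): per iteration, a correction
# travels only from a subdomain to its direct neighbours, so the number of
# iterations for information to cross the partition is a BFS distance.

from collections import deque

def iterations_to_reach_all(adjacency, start):
    """Breadth-first distance from `start` to the farthest subdomain."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in adjacency[s]:
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return max(dist.values())

# 4 subdomains in a chain (vertical slicing): low interconnectivity
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# 4 subdomains all sharing interfaces (cross-slicing): high interconnectivity
cross = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}

print(iterations_to_reach_all(chain, 0))  # 3 iterations to traverse the chain
print(iterations_to_reach_all(cross, 0))  # 1 iteration reaches every subdomain
```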
The above observations suggest that an automatic finite element mesh decomposer that is
suitable for the computational method described herein should meet, or strike a balanced
compromise between, the following seven requirements:
(i) it should yield a set of subdomains where the bandwidth of each local problem is only a
fraction of the bandwidth of the global system of equations;
(ii) it should keep the number of interface nodes as small as possible in order to reduce the size
of the interface problem;
(iii) it should yield a set of subdomains with a relatively high interconnectivity bandwidth so
that within each iteration a new correction reaches as many subdomains as possible;
(iv) it should avoid producing subdomains with ‘bad’ aspect ratio (for example, elongated and
flat subdomains) in order to keep the local problems as well conditioned as possible;
(v) it should deliver as few as possible floating subdomains in order to keep the cost
associated with the projected gradients as low as possible;
(vi) it should yield a set of balanced subdomains in order to ensure that the overall
computational load will be as evenly distributed as possible among the processors;


(vii) it should be able to handle irregular geometry and arbitrary discretization in order to be
general purpose.
For some mesh topologies, it becomes very difficult to meet simultaneously requirements (i), (ii)
and (iv). In that case, priority should be given to the first two requirements. However, we have
found that, for many problems, the above requirements can be met, using for example a slightly
modified version of the general purpose finite element decomposer presented by Farhat in
Reference 17. Several decomposition examples are described in Section 9. The most challenging
problem that is yet to be resolved is the rational relationship between the mesh decomposition
and the interface conditioning.
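As an illustration only (this is not the decomposer of Reference 17), a recursive coordinate-bisection scheme in the spirit of requirements (iv) and (vi) can be sketched as follows; the element centroids and split rule are our own simplifying assumptions.

```python
# Hypothetical sketch of a coordinate-bisection mesh decomposer: recursively
# split the element centroids along the longest axis, so that subdomains
# stay balanced (requirement vi) and reasonably compact (requirement iv).

def bisect(elements, levels):
    """elements: list of (x, y) centroids; returns 2**levels balanced parts."""
    if levels == 0:
        return [elements]
    xs = [e[0] for e in elements]
    ys = [e[1] for e in elements]
    # split along the axis with the largest spread to avoid elongated parts
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    ordered = sorted(elements, key=lambda e: e[axis])
    half = len(ordered) // 2
    return (bisect(ordered[:half], levels - 1) +
            bisect(ordered[half:], levels - 1))

# 8 x 4 grid of element centroids split into 4 subdomains
mesh = [(i, j) for i in range(8) for j in range(4)]
parts = bisect(mesh, 2)
print([len(p) for p in parts])  # [8, 8, 8, 8]: perfectly balanced
```

This simple scheme addresses balance and aspect ratio but ignores interface size and floating subdomains, which is why a purpose-built decomposer is needed in practice.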

9. APPLICATIONS AND PERFORMANCE ASSESSMENT


We first illustrate the proposed parallel computational method with the static analysis on a 32
processor iPSC/2 hypercube of a three-dimensional mechanical joint subjected to internal
pressure loading. We report performance results which show that the parallel method of tearing
exhibits a better speed-up than the parallel method of conventional substructuring because it
consumes three times less interprocessor communication. Next, we apply our algorithm to the
large-scale finite element analysis on a 4 processor CRAY-2 of a three-dimensional cantilever
composite beam made of more than one hundred stiff carbon fibres bound by a nearly
incompressible elastomer matrix. We report and discuss in detail the measured performance
results for various mesh partitioning strategies. For that problem, the proposed solution method
outperforms the direct Cholesky factorization by a factor greater than three, even for
configurations that yield very ill-conditioned systems. In the following, NP, NE, NDF, Tmsg, Tp
and SP denote respectively the number of processors, the number of elements, the number of
degrees of freedom, the time elapsed in message-passing, the total parallel time and the overall
parallel speed-up.
The finite element discretization of the mechanical joint using 8 node brick elements is shown
in Figure 8. Two meshes are considered. The first one contains 5002 elements, 14 932 degrees of
freedom and is intended for a 16 processor cluster of the iPSC/2. The second mesh has 9912
elements, 29 654 degrees of freedom and is constructed for a 32 processor configuration of the
same hypercube. The mesh decompositions into 16 and 32 subdomains are carefully designed to
be topologically equivalent as much as possible to a checkerboard partitioning. Consequently,
many of the resulting subdomains are floating.
The interprocessor communication time per iteration, the total parallel execution time and the
overall parallel speed-up associated with the parallel method of tearing and the parallel method of
substructures are reported in Table II for both meshes. For all cases, a tolerance on the global
relative residuals is selected as the convergence criterion.
Table II. Performance results on iPSC/2. Mechanical joint-brick elements-16 and 32 subdomains

NP   NE    NDF    Tmsg/itr. subs.  Tmsg/itr. tearing  Tp subs.  Tp tearing  SP subs.  SP tearing
16   5002  14932  16.3 ms          5.2 ms             602 s     546 s       14.4      15.4
32   9912  29654  17.9 ms          5.4 ms             1103 s    917 s       24.0      28.8

Figure 8. Finite element discretization of a mechanical joint

For both cases, the parallel tearing and parallel substructuring algorithms achieve excellent
speed-up. This is generally true for all balanced algorithms that require message-passing only
between neighbouring processors. However, for this problem, the tearing algorithm is faster and
exhibits a 20 per cent higher speed-up than the conventional substructuring algorithm for which
the time elapsed in interprocessor communication is 3.31 times higher. Again, because it avoids
interprocessor communication along the edges and corners of the subdomains, the tearing
algorithm requires fewer message-passing startups which, in the case of short messages, are
known to account for the largest portion of the time elapsed in interprocessor communication on
the iPSC/2 (see, for example, the benchmarks of Bomans and Roose18). A performance comparison
with a parallel direct solver is not provided because of the lack of memory space to store in-core
the triangular factors of K.
Now that the parallel properties of the presented algorithm have been illustrated, we focus next
on example problems that illustrate its intrinsic properties and performance. We consider the
large-scale finite element static analysis of the pure bending of a set of beams made of similar
jointed composite ‘pencils’ (Figure 9). Each composite pencil contains one carbon fibre with its
elastomer matrix and is discretized in 51 vertical layers containing each 25 mesh points. The cross
section of the finite element mesh corresponding to a 16 pencil beam is shown in Figure 10.
The numerical results obtained on a 4 processor CRAY-2 for a 16 pencil beam with 48 000
degrees of freedom are reported in Figures 11 and 12. These correspond to two extreme mesh
decompositions, namely: (a) a horizontal cross-slicing into 4 subdomains, each containing 4
cantilever parallel pencils (D1), and (b) a vertical slicing into 4 subdomains, of which three are
floating (D2). Poisson's ratio for the elastomer is 0.49.
Figure 10. Cross section of the finite element mesh for a 16 pencil composite beam

For each decomposition case, three curves are reported which correspond to monitoring
convergence with three different measures: (A) the global force relative residual, (B) the
displacement relative variation, and (C) the interface relative residual. Clearly, decomposition D1
induces a faster convergence rate than decomposition D2. This result was expected since, within
each iteration of the iterative solution of the interface problem, information reaches all of the
subdomains in decomposition D1, while it reaches only half of them in decomposition D2.
Figure 11. Numerical results for decomposition D1 (preconditioned tearing method; curve A:
global force residual, curve B: global displacement variation, curve C: interface residual, plotted
against the number of iterations)

Another important result relates to the relative positioning of the three curves, independently
of the decomposition pattern. Note first that convergence with the global force relative
residual is harder to achieve than convergence with the displacement relative variation. This is
because the problem suffers from severe ill-conditioning owing to the near incompressibility of the
elastomer (Poisson's ratio = 0.49) and the elongated shape of the cantilever composite beam. Note
also that convergence with the interface relative residual is closer to convergence with the
displacement relative variation than it is to convergence with the global force relative residual.
This is because the interface problem is formulated in the functional space of the stresses, so that
its residuals correspond to a displacement increment.
Finally, the tearing method is compared for performance with a direct Cholesky algorithm. The
factorization routine is highly optimized for CRAY-like supercomputers: it is fully vectorized and
its memory traffic requirements are reduced via an unrolling procedure. In order to have a
meaningful comparison, we use the same factorization routine for the global Cholesky algorithm
and for factoring the subdomain stiffnesses in the tearing algorithm. The same bending problem
as previously is selected for that purpose. Several different mesh configurations which correspond
to different numbers of pencils are considered. Performance results on a CRAY-2 single processor
are reported in Table III. A tolerance on the global relative residuals is selected as the
convergence criterion.

Figure 12. Numerical results for decomposition D2 (curve A: global force residual, curve B:
global displacement variation, curve C: interface residual, plotted against the number of
iterations)
The above results demonstrate that, for sufficiently large problems, the tearing method can
outperform direct solvers. For the particular problem above, it runs up to 3.3 times faster than
Cholesky factorization and requires 5.2 times less memory space.

Table III. Performance results on CRAY-2. Composite beam-brick elements-direct vs. 4-subdomain tearing

Number of pencils    4         9         16
NDF                  13000     28000     48000

4-subdomain tearing method
NDF interface        2450      7350      14700
# iterations         130       210       300
CPU time             20 s      73 s      193 s
Memory size          1.6 m.w.  4.5 m.w.  9.5 m.w.

Global Cholesky factorization
CPU time             15 s      130 s     650 s
Memory size          3.6 m.w.  16 m.w.   50 m.w.

10. CLOSURE AND OVERVIEW OF SUBSEQUENT RESEARCH

A novel domain decomposition approach for the parallel finite element solution of equilibrium
equations is presented. The spatial domain is partitioned into a set of totally disconnected
subdomains, each assigned to an individual processor. Lagrange multipliers are introduced to
enforce compatibility at the interface nodes. In the static case, each floating subdomain induces a
local singularity that is resolved in two phases. First, the rigid body modes are eliminated in
parallel from each local problem and a direct scheme is applied concurrently to all subdomains in
order to recover each partial local solution. Next, the contributions of these modes are related to
the Lagrange multipliers through an orthogonality condition. A parallel conjugate projected
gradient algorithm is developed for the solution of the coupled system of local rigid modes
components and Lagrange multipliers, which completes the solution of the problem. When
implemented on local memory multiprocessors, this proposed method of tearing and
interconnecting requires less interprocessor communications than the classical method of
substructuring. It is also suitable for parallel/vector computers with shared memory. Large-scale
example applications are reported on the iPSC/2 and CRAY-2. Measured performance results
illustrate the advantages of the proposed method and demonstrate its potential to outperform the
classical method of substructures and parallel direct solvers.
It is our experience that domain decomposition methods are very sensitive to the mesh
partition. In this paper, we have outlined some guidelines for the practical decomposition of a
given finite element mesh. Subsequent research will focus on determining the relationship
between a pattern of decomposition and the resulting conditioning of each of the local problems
and the interface one. While several preconditioners for conventional domain decomposition
methods (Schur methods) are available in the literature, further research is needed to develop a
preconditioner for hybrid domain decomposition algorithms such as the tearing method de-
veloped herein.

ACKNOWLEDGEMENTS

The first author would like to thank Professor M. Geradin of the University of Liege, Belgium, for his
valuable comments. He also wishes to acknowledge partial support by the National Science
Foundation under Grant ASC-8717773, and partial support by the Air Force Office of Scientific
Research under Grant AFOSR-89-0422.

APPENDIX I: SOLVING A CONSISTENT SINGULAR SYSTEM Kj uj = fj


For completeness, we include in this Appendix a derivation of the solution of a consistent singular
system of equations. In this work, such a system arises in every floating subdomain Ωj and takes
the form

    Kj uj = fj                                                            (28)

where Kj is the (nj^p + nj^r) x (nj^p + nj^r) stiffness matrix associated with Ωj, and uj and fj are
the corresponding displacement and forcing vectors. If Ωj has nj^R rigid body modes, Kj is rank
nj^R deficient. Provided that fj is orthogonal to the null space of Kj, the singular system (28) is
consistent and admits a general solution of the form

    uj = Kj+ fj + Rj α                                                    (29)

where Kj+ is a pseudo-inverse of Kj (that is, Kj+ verifies Kj Kj+ Kj = Kj), Rj is a basis of the null
space of Kj (that is, Rj stores the nj^R rigid body modes of Ωj), and α is a vector of length nj^R
containing arbitrary real coefficients.
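The general solution (29) can be checked on the smallest possible example; the 2 x 2 singular matrix, its pseudo-inverse and its null-space vector below are toy values of our own, written out by hand.

```python
# Toy verification of equation (29): u = K+ f + alpha * R solves K u = f
# for every alpha, provided f is orthogonal to the null space of K.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

K = [[1.0, 1.0],
     [1.0, 1.0]]              # rank 1: one rigid mode

K_plus = [[1.0, 0.0],
          [0.0, 0.0]]         # a pseudo-inverse: K K+ K equals K
R = [-1.0, 1.0]               # null-space basis: K R = 0

f = [2.0, 2.0]                # orthogonal to R, so the system is consistent
u_particular = matvec(K_plus, f)

for alpha in (0.0, 1.0, -3.5):
    u = [up + alpha * r for up, r in zip(u_particular, R)]
    assert matvec(K, u) == f  # every member of the family (29) solves K u = f
print(u_particular)           # [2.0, 0.0]
```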

1. Computing the rigid body modes


Let the superscripts p and r denote respectively a principal and a redundant quantity. The
singular stiffness matrix Kj is partitioned as

    Kj = [ Kj^pp    Kj^pr ]
         [ Kj^prT   Kj^rr ]                                               (30)

where Kj^pp has full rank equal to nj^p + nj^r - nj^R. If Rj is defined as

    Rj = [ -Kj^pp^-1 Kj^pr ]
         [  I(nj^R)        ]                                              (31)

where I(nj^R) is the nj^R x nj^R identity matrix, then Rj satisfies

    Kj Rj = 0                                                             (32)

Moreover, I(nj^R) has full column rank and so does Rj. Therefore, the nj^R columns of Rj as
defined in (31) form a basis of the null space of Kj.
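A toy numerical check of the construction (30)-(32); the 3 x 3 matrix is ours, chosen so that the last equation is the sum of the first two (one redundant equation, one rigid mode).

```python
# Toy check of (30)-(32): build R = [-Kpp^-1 Kpr; I] from the partitioned
# singular matrix and verify that K R = 0.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

K = [[2.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [2.0, 1.0, 3.0]]    # row 3 = row 1 + row 2: rank 2, one rigid mode

Kpp_inv = [[0.5, 0.0],
           [0.0, 1.0]]   # inverse of the full-rank principal block Kpp
Kpr = [2.0, 1.0]

top = [-sum(Kpp_inv[i][j] * Kpr[j] for j in range(2)) for i in range(2)]
R = top + [1.0]          # equation (31): R = [-Kpp^-1 Kpr; I]

print(R)                                 # [-1.0, -1.0, 1.0]
assert matvec(K, R) == [0.0, 0.0, 0.0]   # equation (32): K R = 0
```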

2. Computing Kj+ fj

The partitioning of the singular matrix Kj defined in (30) implies that

    Kj^rr = Kj^prT Kj^pp^-1 Kj^pr

Using the above identity, it can be easily checked that the matrix Kj+ defined as

    Kj+ = [ Kj^pp^-1  0 ]
          [ 0         0 ]

is a pseudo-inverse of Kj. Therefore, a solution of the form Kj+ fj can also be written as

    Kj+ fj = [ Kj^pp^-1 fj^p ]
             [ 0             ]

In practice, Kj cannot be explicitly re-arranged as in (30). Rather, the following should be
implemented when Kj is stored in skyline form. A zero pivot that is encountered during the
factorization process of Kj corresponds to a redundant equation which needs to be labelled and
removed from the system. The zero pivot is set to one, the reduced column above it is copied into
an extra right hand side (this corresponds to a forward reduction with Kj^pr as right hand side)
and the coefficients in the skyline corresponding to that pivotal equation are set to zero. At the
end of the factorization process, the non-labelled equations define the full rank matrix Kj^pp. The
backward substitution is modified to operate also on the nj^R extra right hand sides in order to
recover the block -Kj^pp^-1 Kj^pr of Rj.
The above procedure for solving a consistent singular system of equations has almost the same
computational complexity as the solution of a non-singular one.
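The zero-pivot device can be sketched on a dense (rather than skyline) symmetric matrix. This simplified sketch of ours only detects the redundant equations and sets their pivots to one; it omits the extra right hand sides used to recover the rigid modes.

```python
# Simplified dense sketch of the zero-pivot handling described above:
# a vanishing pivot flags a redundant equation, which is labelled and
# neutralized by setting the pivot to one.

import copy

def factor_singular(K, tol=1e-10):
    """Symmetric Gaussian elimination that tolerates zero pivots.
    Returns (eliminated matrix, list of redundant equation indices)."""
    A = copy.deepcopy(K)
    n = len(A)
    redundant = []
    for k in range(n):
        if abs(A[k][k]) < tol:          # redundant equation found
            redundant.append(k)
            A[k][k] = 1.0               # set the zero pivot to one
            continue
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
    return A, redundant

K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, 1.0],
     [1.0, 1.0, 2.0]]   # toy matrix; see below for the singular variant used

# Singular toy matrix: row 3 = row 1 + row 2
K = [[2.0, -1.0, 1.0],
     [-1.0, 2.0, 1.0],
     [1.0, 1.0, 2.0]]

A, redundant = factor_singular(K)
print(redundant)        # [2]: the third equation is redundant
```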

APPENDIX II: STARTING LAGRANGE MULTIPLIER VECTOR


In this Appendix we present a fast scheme for generating a starting vector for the conjugate
projected gradient algorithm (19)-(20). We consider the general case of an arbitrary mesh
partition.
For each floating subdomain Ωj, the corresponding component of the starting vector has to
satisfy the equality constraint

    Rj^bT λj(0) = Rj^T fj                                                 (33)

where Rj is an (nj^p + nj^r) x nj^R full column rank matrix which stores the rigid body modes of
the floating subdomain Ωj, Rj^b is the restriction of Rj to the intersection of Ωj and the interface
ΓI, and fj is the vector of prescribed forces in Ωj. If λj(0) is written as

    λj(0) = Rj^b αj(0)                                                    (34)

then (33) becomes

    (Rj^bT Rj^b) αj(0) = Rj^T fj                                          (35)

which admits as solution

    αj(0) = (Rj^bT Rj^b)^-1 Rj^T fj                                       (36)

Therefore, a starting vector λj(0) which satisfies the constraint equation (33) is given by

    λj(0) = Rj^b (Rj^bT Rj^b)^-1 Rj^T fj                                  (37)

The matrix product (Rj^bT Rj^b) is only nj^R x nj^R, where nj^R is the number of rigid body
modes of the floating subdomain Ωj. Therefore, (Rj^bT Rj^b) is at most 3 x 3 in two-dimensional
problems and at most 6 x 6 in three-dimensional problems, and the evaluation of λj(0) according
to (37) requires little computational effort.
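A small numerical check of (34)-(37); the subdomain data below (one rigid mode, 4 degrees of freedom of which 2 lie on the interface) are invented for illustration.

```python
# Toy check of the starting Lagrange multiplier (37): lambda = Rb * alpha
# with alpha from (36) satisfies the equality constraint (33).

def matTvec(A, x):          # A^T x for a matrix A stored by rows
    return [sum(A[i][j] * x[i] for i in range(len(A)))
            for j in range(len(A[0]))]

# One floating subdomain with a single rigid mode (n_R = 1) and 4 dof,
# of which the last 2 lie on the interface.
R  = [[1.0], [1.0], [1.0], [1.0]]   # rigid translation mode (4 x 1)
Rb = [[1.0], [1.0]]                 # restriction of R to the 2 interface dof
f  = [0.5, -0.5, 2.0, 1.0]          # prescribed forces

rhs  = matTvec(R, f)                           # R^T f (here a 1-vector)
gram = matTvec(Rb, [row[0] for row in Rb])     # Rb^T Rb (here 1 x 1)
alpha = rhs[0] / gram[0]                       # equation (36)
lam = [row[0] * alpha for row in Rb]           # equation (37)

# Constraint (33): Rb^T lambda equals R^T f
assert abs(sum(r[0] * l for r, l in zip(Rb, lam)) - rhs[0]) < 1e-12
print(lam)                                     # [1.5, 1.5]
```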

REFERENCES
1. C. Farhat and E. Wilson, 'A new finite element concurrent computer program architecture', Int. j. numer. methods eng.,
24, 1771-1792 (1987).
2. B. Nour-Omid, A. Raefsky and G. Lyzenga, 'Solving finite element equations on concurrent computers', in A. K. Noor
(ed.), Parallel Computations and Their Impact on Mechanics, ASME, New York, 1987, pp. 209-228.
3. M. Ortiz and B. Nour-Omid, 'Unconditionally stable concurrent procedures for transient finite element analysis',
Comp. Methods Appl. Mech. Eng., 58, 151-174 (1986).
4. C. Farhat, 'A multigrid-like semi-iterative algorithm for the massively parallel solution of large scale finite element
systems', Multigrid Methods: Proc. Fourth Copper Mountain Conf. on Multigrid Methods, SIAM, Copper Mountain,
Colorado, 1989, pp. 171-180.
5. F. X. Roux, 'Test on parallel machines of a domain decomposition method for a composite three dimensional
structural analysis problem', Proc. Int. Conf. on Supercomputing, Saint Malo, France, 1988, pp. 273-283.
6. F. X. Roux, 'A parallel solver for the linear elasticity equations on a composite beam', in T. F. Chan et al. (eds.), Proc.
Second Int. Conf. on Domain Decomposition Methods, SIAM, Los Angeles, California, 1989.
7. G. Kron, 'A set of principles to interconnect the solutions of physical systems', J. Appl. Phys., 24, 965-980 (1953).
8. T. H. H. Pian, 'Finite element formulation by variational principles with relaxed continuity requirements', in A. K.
Aziz (ed.), The Mathematical Foundation of the Finite Element Method with Applications to Partial Differential
Equations, Part II, Academic Press, London, 1972, pp. 671-687.
9. Q. V. Dinh, R. Glowinski and J. Periaux, 'Solving elliptic problems by domain decomposition methods with
applications', in A. Schoenstadt (ed.), Elliptic Problem Solvers II, Academic Press, London, 1984.
10. M. R. Dorr, 'Domain decomposition via Lagrange multipliers', Report No. UCRL-98532, Lawrence Livermore
National Laboratory, 1988.
11. R. Fletcher, Practical Methods of Optimization, Vol. 2, Constrained Optimization, Wiley, New York, 1981, pp. 86-87.
12. P. E. Gill and W. Murray, in P. E. Gill and W. Murray (eds.), Numerical Methods for Constrained Optimization,
Academic Press, London, 1974, pp. 132-135.
13. F. X. Roux, 'Acceleration of the outer conjugate gradient by reorthogonalization for a domain decomposition method
for structural analysis problems', Proc. Third Int. Conf. on Supercomputing, Crete, Greece, 1989, pp. 471-477.
14. S. H. Bokhari, 'On the mapping problem', IEEE Trans. Comp., C-30, 207-214 (1981).
15. C. Farhat, 'On the mapping of massively parallel processors onto finite element graphs', Comp. Struct., 32, 347-354
(1989).
16. C. Farhat, 'Which parallel finite element algorithm for which architecture and which problem', in R. V. Grandhi et al.
(eds.), Computational Structural Mechanics and Multidisciplinary Optimization, AD-Vol. 16, ASME, New York, pp.
35-43.
17. C. Farhat, 'A simple and efficient automatic FEM domain decomposer', Comp. Struct., 28, 579-602 (1988).
18. L. Bomans and D. Roose, 'Benchmarking the iPSC/2', Report TW 114, Katholieke Universiteit Leuven, Department
of Computer Science, Belgium, 1988.
