1991 - [Farhat] - Feti Method
CHARBEL FARHAT
Department of Aerospace Engineering Sciences and Center for Space Structures and Controls, University of Colorado at
Boulder, Boulder, CO 80309-0429, U.S.A.
FRANCOIS-XAVIER ROUX
O.N.E.R.A., Groupe Calcul Parallèle, 29 Av. de la Division Leclerc, BP 72, 92322 Châtillon Cedex, France
SUMMARY
A novel domain decomposition approach for the parallel finite element solution of equilibrium equations is
presented. The spatial domain is partitioned into a set of totally disconnected subdomains, each assigned to
an individual processor. Lagrange multipliers are introduced to enforce compatibility at the interface nodes.
In the static case, each floating subdomain induces a local singularity that is resolved in two phases. First,
the rigid body modes are eliminated in parallel from each local problem and a direct scheme is applied
concurrently to all subdomains in order to recover each partial local solution. Next, the contributions of
these modes are related to the Lagrange multipliers through an orthogonality condition. A parallel
conjugate projected gradient algorithm is developed for the solution of the coupled system of local rigid
modes components and Lagrange multipliers, which completes the solution of the problem. When
implemented on local memory multiprocessors, this proposed method of tearing and interconnecting
requires fewer interprocessor communications than the classical method of substructuring. It is also suitable
for parallel/vector computers with shared memory. Moreover, unlike parallel direct solvers, it exhibits a
degree of parallelism that is not limited by the bandwidth of the finite element system of equations.
1. INTRODUCTION
A number of methods based on domain decomposition procedures have been proposed in recent
years for the parallel solution of both static and dynamic finite element equations of equilibrium.
Most of these methods are derived from the popular substructuring technique. Typically, the
finite element domain is decomposed into a set of subdomains and each of these is assigned to an
individual processor. The solution of the local problems is trivially parallelized and usually a
direct method is preferred for this purpose. Parallel implementations of both a direct¹ and an
iterative² solution of the resulting interface problem have been reported in the literature. A
number of more original approaches have also been spurred by the advent of new parallel
processors. Ortiz and Nour-Omid³ have developed a family of unconditionally stable concurrent
procedures for transient finite element analysis and Farhat⁴ has designed a multigrid-like
algorithm for the massively parallel finite element solution of static problems. Both of these
developments relate to the divide and conquer paradigm but depart from classical substructuring.
In this paper, we present a parallel finite element computational method for the solution of
static problems that is also a departure from the classical method of substructures. The unique
feature about the proposed procedure is that it requires fewer interprocessor communications
than traditional domain decomposition algorithms, while it still offers the same amount of
parallelism. Roux⁵,⁶ has presented an early version of this work that is limited to a very special
class of problems where a finite element domain can be partitioned into a set of disconnected but
non-floating subdomains. Here, we generalize the method for arbitrary finite element problems
and arbitrary mesh partitions. We denote the resulting computational strategy by 'finite element
tearing and interconnecting' because of its resemblance with the very early work of Kron⁷ on
tearing methods for electric circuit models. In Section 2, we partition the finite element domain
into a set of totally disconnected subdomains and derive a computational strategy from a hybrid
variational principle where the inter-subdomain continuity constraint is removed by the in-
troduction of a Lagrange multiplier. An arbitrary mesh partition typically contains a set of
floating subdomains which induce local singularities. The handling of these singularities is treated
in Section 3. First, the rigid body modes are eliminated in parallel from each local problem and a
direct scheme is applied concurrently to all subdomains in order to recover each partial local
solution. Next, the contributions of these modes are related to the Lagrange multipliers through
an orthogonality condition. A parallel conjugate projected gradient algorithm is developed in
Section 4 for the solution of the coupled system of local rigid modes components and Lagrange
multipliers, which completes the solution of the problem. Section 5 deals with the preconditioning
of the interface problem in order to speed up the recovery of the Lagrange multipliers. Section 6
emphasizes the parallel characteristics of the proposed method and Section 7 contrasts it with the
method of substructures. Section 8 discusses some important issues related to the partitioning of a
given finite element mesh. Finally, Section 9 illustrates the method with structural examples on
the distributed memory hypercube iPSC (32 processors) and the shared memory parallel/vector
CRAY-2 system (4 processors), and Section 10 concludes the paper.
In the above, the indices i, j, k, l take the values 1 to 3, ε_ij(v) = (v_i,j + v_j,i)/2 and v_i,j denotes the
partial derivative of the ith component of v with respect to the jth spatial variable, c_ijkl are the
elastic coefficients, Ω denotes the volume of the elastostatic body, Γ its piecewise smooth
boundary and Γ* the piece of Γ where the tractions h_i are prescribed.
FE TEARING AND INTERCONNECTING 1207
where
for any admissible v1, v2 and μ. Clearly, the left inequality in (5) implies that (u1 - u2, μ)_Γ_I ≤
(u1 - u2, λ)_Γ_I, which imposes (u1 - u2, μ)_Γ_I = 0 for any admissible μ and therefore u1 = u2 on Γ_I.
The right inequality in (5) imposes J1(u1) + J2(u2) ≤ J1(v1) + J2(v2) for any pair of admissible
functions (v1, v2). This implies that, among all admissible pairs (v1, v2) which satisfy the
continuity condition (3), the pair (u1, u2) minimizes the sum of the energy functionals J1 and J2
defined respectively on Ω1 and Ω2. Therefore, u1 and u2 are the restrictions of the solution u of the
non-partitioned problem (1) to Ω1 and Ω2 respectively. Indeed, equations (4) and (5) correspond
to a hybrid variational principle where the inter-subdomain continuity constraint (3) is removed
by the introduction of a Lagrange multiplier (see, for example, Pian⁸).
If now the displacement fields u1 and u2 are expressed by suitable shape functions as
u1 = N1 u1 and u2 = N2 u2
and the continuity equation is enforced for the discrete problem, a standard Galerkin procedure
transforms the hybrid variational principle (4) into the following algebraic system:
K1 u1 = f1 + B1^T λ
K2 u2 = f2 - B2^T λ          (6)
B1 u1 = B2 u2
where Kj, uj and fj, j = 1, 2, are respectively the stiffness matrix, the displacement vector and the
prescribed force vector associated with the finite element discretization of Ωj. The vector of
Lagrange multipliers λ represents the interaction forces between the two subdomains Ω1 and Ω2
along their common boundary Γ_I. Within each subdomain Ωj, we denote the number of interior
nodal unknowns by nj^i and the number of interface nodal unknowns by nj^b. The total number of
interface nodal unknowns is denoted by n_I. Note that n_I = n1^b = n2^b in the particular case of two
subdomains. If the interior degrees of freedom are numbered first and the interface ones are
numbered last, each of the two connectivity matrices B1 and B2 takes the form
Bj = [Oj  Ij],   j = 1, 2
where Oj is an nj^b x nj^i null matrix and Ij is the nj^b x nj^b identity matrix. The vector of Lagrange
multipliers λ is n_I long.
If both K1 and K2 are non-singular, equations (6) can be written as
(B1 K1^-1 B1^T + B2 K2^-1 B2^T) λ = B2 K2^-1 f2 - B1 K1^-1 f1          (7)
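For two non-singular subdomains, equations (6) and (7) can be exercised on a small model problem. The following sketch is a minimal numerical illustration, not the authors' implementation; the one-dimensional bar model, its load and the even load split at the torn node are assumptions chosen for the example.

```python
import numpy as np

# A bar fixed at both ends, torn at its middle node.
# Subdomain 1 owns dofs (u1, u2); subdomain 2 owns dofs (u2', u3).
K1 = np.array([[2.0, -1.0], [-1.0, 1.0]])   # non-singular: left end is fixed
K2 = np.array([[1.0, -1.0], [-1.0, 2.0]])   # non-singular: right end is fixed
f1 = np.array([1.0, 0.0])                    # unit load on subdomain 1's interior node
f2 = np.array([0.0, 0.0])

# Boolean connectivity matrices extracting the interface dof of each subdomain
B1 = np.array([[0.0, 1.0]])
B2 = np.array([[1.0, 0.0]])

# Dual interface problem (7): (B1 K1^-1 B1^T + B2 K2^-1 B2^T) lambda = rhs
K1inv, K2inv = np.linalg.inv(K1), np.linalg.inv(K2)
FI = B1 @ K1inv @ B1.T + B2 @ K2inv @ B2.T
rhs = B2 @ K2inv @ f2 - B1 @ K1inv @ f1
lam = np.linalg.solve(FI, rhs)

# Local solves of (6) recover the subdomain displacements
u1 = np.linalg.solve(K1, f1 + B1.T @ lam)
u2 = np.linalg.solve(K2, f2 - B2.T @ lam)
# the two torn copies of the middle dof coincide: B1 u1 == B2 u2
```

The recovered fields agree with a direct solve of the assembled three-dof system, and the interface compatibility B1 u1 = B2 u2 holds to machine precision.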
Bj^k = [O1(j, k)   Cj^k   O2(j, k)]
where O1(j, k) is an m1(j, k) x (nj^i + nj^b) zero matrix, O2(j, k) is another m2(j, k) x (nj^i + nj^b) zero
matrix and Cj^k is an m_I(j, k) x (nj^i + nj^b) connectivity matrix, m_I(j, k) is the number of Lagrange
multipliers that interconnect Ωj with its neighbour Ωk, and m1(j, k) and m2(j, k) are two non-
negative integers which satisfy m1(j, k) + m_I(j, k) + m2(j, k) = n_I. The connectivity matrix Cj^k
can be written as
Cj^k = [O3(j, k)   Ij^k   O4(j, k)]
where O3(j, k) is an m_I(j, k) x m3(j, k) zero matrix, Ij^k is the m_I(j, k) x m_I(j, k) identity matrix,
O4(j, k) is another m_I(j, k) x m4(j, k) zero matrix, and m3(j, k) and m4(j, k) are two non-negative
integers which verify m3(j, k) + m_I(j, k) + m4(j, k) = nj^i + nj^b. If aj and N_s denote respectively the
number of subdomains Ωk that are adjacent to Ωj and the total number of subdomains, the finite
element variational interpretation of the saddle-point problem (4) generates the following
algebraic system:
Kj uj = fj + Σ_{k=1}^{aj} Bj^{kT} λ,   j = 1, N_s          (8)
found, that is, a matrix K2^+ which verifies K2 K2^+ K2 = K2, and the general solution of (9) is given;
(ii) the pseudo-inverse K2^+ does not need to be explicitly computed. For a given input vector
v, the output vector K2^+ v and the rigid modes R2 can be obtained at almost the
same computational cost as the response vector K2^-1 v, where K2 is non-singular (see
Appendix I);
(iii) system (11) is under-determined. Both λ and α need to be determined before u1 and u2 can
be found, but only three equations are available so far.
Since K2 is symmetric, the singular equation (9) admits at least one solution if and only if the
right-hand side (f2 - B2^T λ) has no component in the null space of K2. This can be expressed as
R2^T (f2 - B2^T λ) = 0          (12)
The above orthogonality condition provides the missing equation for the complete solution of
(11). Combining (11) and (12) yields, after some algebraic manipulations,
[ F_I      -R_I ] [ λ ]   [ B2 K2^+ f2 - B1 K1^-1 f1 ]
[ -R_I^T     0  ] [ α ] = [ -R2^T f2                 ]          (13)
where
F_I = (B1 K1^-1 B1^T + B2 K2^+ B2^T)   and   R_I = B2 R2
Clearly, F_I is symmetric positive definite and R_I has full column rank. Therefore, the system of
equations in (λ, α) is symmetric and non-singular. It admits a unique solution (λ, α) which
uniquely determines u1 and u2.
It is important to note that, since α has at most six components, systems (13) and (7) have almost the same size. For an
arbitrary number of subdomains N_s, of which N_f are floating, the additional number of equations
introduced by the handling of local singularities is bounded by 6 N_f. For large-scale problems and
relatively coarse mesh partitions, this number is a very small fraction of the size of the global
system. On the other hand, if a given tearing process does not result in any floating subdomain, α
is zero and the systems of equations (13) and (7) are identical.
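The coupled system (13) can be illustrated on the smallest example that contains a floating subdomain. The sketch below is a numerical illustration only, not the paper's code; the four-node bar, its tip load and the use of numpy's pinv as one admissible pseudo-inverse K2^+ are assumptions made for the example.

```python
import numpy as np

# A bar fixed at the left end only, torn at its interior node.
# Subdomain 1 (dofs u1, u2) keeps the support; subdomain 2 (dofs u2', u3) floats.
K1 = np.array([[2.0, -1.0], [-1.0, 1.0]])
K2 = np.array([[1.0, -1.0], [-1.0, 1.0]])     # singular: one rigid (translation) mode
f1 = np.array([0.0, 0.0])
f2 = np.array([0.0, 1.0])                      # unit load at the free end
B1 = np.array([[0.0, 1.0]])
B2 = np.array([[1.0, 0.0]])

R2 = np.array([[1.0], [1.0]])                  # null space of K2 (rigid translation)
RI = B2 @ R2                                   # trace of the rigid mode on the interface
K1inv = np.linalg.inv(K1)
K2pinv = np.linalg.pinv(K2)                    # any pseudo-inverse K2+ would do

FI = B1 @ K1inv @ B1.T + B2 @ K2pinv @ B2.T
d = B2 @ K2pinv @ f2 - B1 @ K1inv @ f1

# Bordered system (13) in (lambda, alpha)
A = np.block([[FI, -RI], [-RI.T, np.zeros((1, 1))]])
b = np.concatenate([d, -R2.T @ f2])
lam, alpha = np.linalg.solve(A, b)

# Displacement recovery: local solve on subdomain 1, pseudo-inverse plus
# rigid body contribution on the floating subdomain 2
u1 = np.linalg.solve(K1, f1 + B1.T @ np.atleast_1d(lam))
u2 = K2pinv @ (f2 - B2.T @ np.atleast_1d(lam)) + (R2 * alpha).ravel()
```

The second block row of (13) forces the interface reaction carried by λ to self-equilibrate the floating subdomain, and the recovered fields match the fixed-free solution of the assembled bar.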
Next, we present a numerical algorithm for the solution of (13).
We seek an efficient solution algorithm that does not require the explicit assembly of F,.
The solution to the above problem can be expressed as
where
As written in (15), this solution procedure is not recommended because it requires either the
explicit computation of the inverse of F_I, or the nested solutions of two linear systems involving F_I and
R_I^T F_I^-1 R_I. It is noted by Fletcher¹¹ that if two matrices S and Z are computed such that
R_I^T S = I and R_I^T Z = 0, an alternative representation of the solution to (14) is given by
λ = H (B2 K2^+ f2 - B1 K1^-1 f1) + T R2^T f2
α = S^T (B1 K1^-1 f1 - B2 K2^+ f2) - U R2^T f2
where
H = Z (Z^T F_I Z)^-1 Z^T
T = S - H F_I S
U = S^T F_I H F_I S - S^T F_I S
which does not require the explicit assembly of F_I if a suitable iterative scheme is chosen for
solving all the temporary systems involving the quantity (Z^T F_I Z)^-1. Still, the above solution
procedure is not feasible because it requires the computation of S and Z, typically via a QR
factorization of some matrix involving R_I, and the iterative solution of too many temporary
systems before λ and α can be obtained.
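The null-space representation above can be checked numerically. In the sketch below, a verification exercise rather than a practical solver, S is taken as R_I (R_I^T R_I)^-1 so that R_I^T S = I, Z is an orthonormal basis of the null space of R_I^T obtained from an SVD, and the random problem data are assumptions of the test.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 2
A = rng.standard_normal((n, n))
F = A @ A.T + n * np.eye(n)          # symmetric positive definite stand-in for F_I
R = rng.standard_normal((n, m))      # full column rank stand-in for R_I
d = rng.standard_normal(n)           # unconstrained right-hand side of F lambda = d
b = rng.standard_normal(m)           # constraint right-hand side R^T lambda = b

S = R @ np.linalg.inv(R.T @ R)       # satisfies R^T S = I
_, _, Vt = np.linalg.svd(R.T)
Z = Vt[m:].T                          # orthonormal basis of the null space of R^T

H = Z @ np.linalg.inv(Z.T @ F @ Z) @ Z.T
T = S - H @ F @ S
lam = H @ d + T @ b                   # null-space representation of the solution

# Reference: solve the full saddle-point (KKT) system directly
KKT = np.block([[F, -R], [R.T, np.zeros((m, m))]])
ref = np.linalg.solve(KKT, np.concatenate([d, b]))
# lam agrees with ref[:n] to machine precision
```

This confirms that the multiplier can be obtained without inverting F_I itself, at the price of forming S and Z, which is exactly the cost the text objects to.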
Clearly, the nature of F_I makes the solution of (14) inadequate by any technique which requires
this submatrix explicitly. This implies that a direct method or an iterative method of the SOR type
cannot be used. The only efficient method of solving (14) in the general sparse case is that of
conjugate gradients, because once K1 and K2 have been factorized, matrix-vector products of the
form F_I v can be performed very efficiently using only forward and backward substitutions.
Unfortunately, the Lagrangian matrix
A = [ F_I      -R_I ]
    [ -R_I^T     0  ]
is indefinite so that a straightforward conjugate gradient algorithm cannot be directly applied to
the solution of (14). However, the conjugate gradient iteration with the projected gradient (see, for
example, Gill and Murray¹²) can be used to obtain the sought-after solution. In order to
introduce the latter solution algorithm, we first note that solving (14) is equivalent to solving the
equality constrained problem:
minimize Φ(λ) = (1/2) λ^T F_I λ + (B1 K1^-1 f1 - B2 K2^+ f2)^T λ
subject to R_I^T λ = R2^T f2          (18)
Since F_I is symmetric positive definite, a conjugate gradient algorithm is most suitable for
computing the unique solution to the unconstrained problem. Therefore, this algorithm will
converge to the solution of (18) if and only if it can be modified so that the constraint R_I^T λ = R2^T f2
is satisfied at each iteration. This can be achieved by projecting all the search directions onto the
null space of R_I^T.
The result is a conjugate gradient algorithm with the projected gradient. It is of the form
Initialize
Pick λ^(0) such that R_I^T λ^(0) = R2^T f2
A fast scheme for finding a starting λ^(0) which satisfies the constraint R_I^T λ^(0) = R2^T f2 is given in
Appendix II. Clearly, R_I^T d^(k) = 0 for all k ≥ 1. Therefore, R_I^T λ^(k) = R_I^T λ^(0), which indicates that
the approximate solution satisfies the linear equality constraint of problem (14) at each
iteration k. It is also important to note that, within each iteration, only one projection is
performed. This projection is relatively inexpensive since the only implicit computations that are
involved are associated with the matrix R_I^T R_I which is at most 6 x 6. This matrix is factored once,
before the first iteration begins. Except for this small overhead, algorithm (20) above has the same
computational cost as the regular conjugate gradient method.
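A compact sketch of the conjugate projected gradient iteration is given below. It is an illustration of the scheme, not a reproduction of algorithm (20): a dense projector is formed instead of reusing the factored R_I^T R_I, the iteration minimizes (1/2) λ^T F λ - d^T λ subject to R^T λ = b, and the random test problem is an assumption, not data from the paper.

```python
import numpy as np

def projected_cg(F, R, d, b, tol=1e-12, maxit=200):
    """Minimize 1/2 l^T F l - d^T l subject to R^T l = b, by conjugate
    gradients with every search direction projected onto null(R^T)."""
    RtR_inv = np.linalg.inv(R.T @ R)          # at most 6x6 per floating subdomain
    P = np.eye(len(d)) - R @ RtR_inv @ R.T    # projector onto null(R^T)
    lam = R @ RtR_inv @ b                     # feasible start: R^T lam = b
    r = d - F @ lam                           # residual of the unconstrained system
    w = P @ r                                 # projected gradient
    p = w.copy()
    for _ in range(maxit):
        if np.dot(r, w) < tol:
            break
        gamma = np.dot(r, w) / np.dot(p, F @ p)
        lam += gamma * p                      # stays feasible: p lies in null(R^T)
        r_new = r - gamma * (F @ p)
        w_new = P @ r_new
        beta = np.dot(r_new, w_new) / np.dot(r, w)
        p = w_new + beta * p
        r, w = r_new, w_new
    return lam

# Small random test problem (assumed data): compare with a direct KKT solve
rng = np.random.default_rng(1)
n, m = 10, 3
A = rng.standard_normal((n, n))
F = A @ A.T + n * np.eye(n)
R = rng.standard_normal((n, m))
d, b = rng.standard_normal(n), rng.standard_normal(m)
lam = projected_cg(F, R, d, b)
```

Because every direction p lies in the null space of R^T, the constraint holds at every iterate exactly as the text describes, while the recurrence is otherwise that of ordinary conjugate gradients.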
After λ is found, the rigid body mode coefficients are computed as
α = (R_I^T R_I)^-1 R_I^T (F_I λ - B2 K2^+ f2 + B1 K1^-1 f1)          (21)
For an arbitrary number of subdomains N_s, of which N_f are floating, the equality constraint is
R_I^T λ = [R1^T f1  ...  R_{N_f}^T f_{N_f}]^T,   with   R_I = [R_I^1  ...  R_I^{N_f}]
Only those columns of R_I^j which operate on Lagrange multipliers that are associated with Γ_I ∩ Ωj
are non-zero. The projection matrix is P = [I - R_I (R_I^T R_I)^-1 R_I^T], where R_I^T R_I is generally
banded of dimension at most equal to 6 N_f. The banded structure of P is determined by the
subdomain interconnectivity. If for practical reasons this banded structure is not exploited, the
number of three-dimensional floating subdomains should be kept as small as possible, say less
than thirty-two, which implies that the proposed computational method would be suitable only
for coarse and medium grain multiprocessors.
At each iteration k, the preconditioned conjugate projected gradient algorithm involves the
solution of an auxiliary system of the form
P_I z^(k) = r^(k)          (23)
where r^(k) is the residual at the kth iteration. The particular choice of P_I^-1 given in (22) offers the
advantage of solving (23) explicitly without the need for any intermediate factorization.
For computational efficiency, P_I^-1 is implemented as
P_I^-1 = K1^b + K2^b          (24)
where K1^b and K2^b are the traces of K1 and K2 on Γ_I. Clearly, with this choice for the
preconditioner, the auxiliary system (23) is 'cheap', easy to solve and perfectly parallelizable on
both local and shared memory parallel architectures.
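Applying (24) involves no factorization at all: each subdomain contributes the interface block of its stiffness matrix and the preconditioning step reduces to one matrix-vector product. A minimal sketch, in which the 2 x 2 subdomain matrices, the boundary index sets and the residual value are assumptions chosen for illustration:

```python
import numpy as np

# Subdomain stiffness matrices with interior dofs numbered first and the
# interface (boundary) dofs last; here each subdomain has one interface dof.
K1 = np.array([[2.0, -1.0], [-1.0, 1.0]])
K2 = np.array([[2.0, -1.0], [-1.0, 1.0]])
bd1, bd2 = [1], [1]                     # interface dof indices in each subdomain

# Traces of K1 and K2 on the interface, cf. (24)
K1b = K1[np.ix_(bd1, bd1)]
K2b = K2[np.ix_(bd2, bd2)]
PI_inv = K1b + K2b                      # the inverse preconditioner, held explicitly

r = np.array([0.3])                     # current interface residual r^(k)
z = PI_inv @ r                          # auxiliary solve (23): a product, no factorization
```

Each processor can form its own trace block locally, so the preconditioning step inherits the communication pattern of the interface assembly.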
Since we do not have a strong mathematical justification for this choice of the preconditioner,
we have conducted a set of numerical experiments to assess a priori its performance. A fixed-fixed
cylindrical panel was discretized with an N by M regular mesh and was modelled with 4 node
shell elements (Figure 4). All test cases used N_s = 2 and a vertical slicing.
Table I reports the condition numbers of the global stiffness matrix K, the subdomain stiffness
matrices K1 and K2, and the original and preconditioned interface flexibility-like matrices F_I and
P_I^-1 F_I, for various values of N.
For this test problem, the condition number of the preconditioned interface is two orders of
magnitude lower than that of the global problem.
The extrapolation of (22) and (24) to N_s > 2 is straightforward. In order to reduce further the
number of preconditioned conjugate projected gradient iterations, the selective reorthogonaliz-
ation procedure developed by Roux and reported in Reference 13 is also utilized.
also amenable to parallel processing. For example, the matrix-vector product F_I s^(k) can be
computed in parallel by assigning to each processor p_j the task of evaluating y_j^(k) = Bj Kj^+ Bj^T s^(k),
and exchanging y_j^(k) with the processors assigned to neighbouring subdomains in order to
assemble the global result. Interprocessor communication is required only during the solution of
the interface problem (14) and takes place exclusively among neighbouring processors during the
assembly of the subdomain results.
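The interface matrix-vector product described above can be emulated sequentially: each "processor" computes its local contribution, and the summation of the results is the only step that would require neighbour-to-neighbour communication. In the sketch below the two subdomain matrices are assumptions (both are taken non-singular so that a plain solve stands in for the factored local problem):

```python
import numpy as np

# Per-subdomain data: (stiffness matrix, Boolean interface extraction matrix).
subdomains = [
    (np.array([[2.0, -1.0], [-1.0, 1.0]]), np.array([[0.0, 1.0]])),
    (np.array([[1.0, -1.0], [-1.0, 2.0]]), np.array([[1.0, 0.0]])),
]

def FI_times(s):
    """y = sum_j Bj Kj^-1 Bj^T s; each term is computable on its own processor."""
    y = np.zeros_like(s)
    for K, B in subdomains:
        y += B @ np.linalg.solve(K, B.T @ s)   # local solve, then exchange and sum
    return y

s = np.array([1.0])
y = FI_times(s)
```

The loop body is exactly the per-processor work; in a distributed implementation only the partial vectors y_j^(k) travel between neighbouring processors.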
At this point, we stress that the parallel solution method developed herein inherently requires
less interprocessor communication than other domain decomposition based parallel algorithms.
As mentioned earlier, interprocessor communication within the proposed method occurs only
during the solution of the interface problem (14). The reader should trace back this problem, as
well as the presence of the Lagrange multipliers, to the integral quantity
(v_i - v_j, λ)_Γ_ij = ∫_Γ_ij λ (v_i - v_j) dΓ          (25)
where Γ_ij is the interface between subdomains Ω_i and Ω_j. If Γ_ij has a zero measure, then
(v_i - v_j, λ)_Γ_ij = 0 and no exchange of information is needed between Ω_i and Ω_j. Therefore the
subdomains which interconnect along one edge in three-dimensional problems and those which
interconnect along one vertex in both two- and three-dimensional problems do not require any
interprocessor communication. This is unlike the parallel method of substructures, whether the
interface problem is solved with a direct scheme¹ or with an iterative one.² For a three-
dimensional regular mesh that is partitioned into subcubes, the proposed method of finite element
tearing and interconnecting requires that each subdomain communicates with at most six
neighbouring subdomains (since a cube has only six faces), while the parallel method of
substructures necessitates that each subdomain communicates with up to 26 neighbours (Figure
5). This communication characteristic makes the proposed parallel solution method very attract-
ive for a multiprocessor with a distributed memory such as a hypercube. Indeed, the advantages
of the method for this family of parallel processors are twofold: (a) the number of message-
passings is dramatically reduced, which reduces the overhead due to communication start-up,
and (b) the complexity of the communication requirements is reduced so that an optimal mapping
of the processors onto the subdomains can be reached;¹⁴,¹⁵ therefore the elapsed time for a given
message is improved. Both enhancements (a) and (b) reduce the communication overhead of the
parallel solution algorithm in a synergistic manner. This algorithmic feature of the proposed
method is still desirable for shared memory multiprocessors because it eases the assembly process
during the interface solution and makes the latter more manageable. It is not, however, as critical
for the performance as it is for local memory multiprocessors.
Figure 5. Reduced interprocessor communication patterns for two- and three-dimensional regular mesh partitions
standard conjugate gradient algorithm may be used for solving (26). On the other hand, the
resulting interface problem for the method of finite element tearing and interconnecting corres-
ponds to a formulation that is homogeneous to a flexibility, even if it is not exactly a flexibility
formulation. For the two-subdomain decomposition, it can be written as in (14) and necessitates
the use of a conjugate projected gradient algorithm for finding the solution λ.
If K1 and K2 are partitioned into internal and boundary (interface) components and then are
injected into the first of equations (7), it can be easily shown that
where K_bb^(1) and K_bb^(2) denote respectively the contributions of the first and second subdomains to
K_bb. Equations (27) above establish the relationship between both approaches to domain
decomposition.
The computational implications of the differences between the two solution methods are as
follows:
• within each iteration, the solution process of problem (14) requires an additional com-
putational step which corresponds to the projection of the search direction onto the null
space of R_I^T;
• within each iteration, the solution process of problem (14) requires the evaluation of the
matrix-vector product Bj Kj^+ Bj^T d^(k), while the solution process of problem (26) requires the
evaluation of the matrix-vector product K_bi^j K_ii^{j-1} K_ib^j d^(k). Given that Bj is a Boolean matrix
and that its application to a matrix or a vector defines a floating-point-free extraction
process, each conjugate gradient iteration applied to (14) is less computationally intensive
than its counterpart that is applied to (26);
• since a conjugate gradient algorithm captures initially the high frequency mesh modes of a
problem, it can be expected to perform better on a flexibility-like matrix than on a stiffness
matrix, because the high frequencies of the former are indeed the low frequencies of the
stiffness matrix, which are closer to the solution of the static problem.
In the light of the above remarks, it is reasonable to expect that for a given mesh partition:
• each conjugate projected gradient iteration that is applied to the solution of the interface
problem (14) which results from the method of finite element tearing and interconnecting will
not be slower, and may be even faster for large-scale problems and a small number of
interface nodes, than each conjugate gradient iteration applied to the solution of the
interface problem (26) which results from the method of substructures;
• the iterative solution of the interface problem associated with the tearing method will exhibit
a faster rate of convergence than the iterative solution of the interface problem resulting from
the conventional method of substructures.
Finally, it should be noted that domain decomposition methods in general exhibit a larger
degree of parallelism than parallel direct solvers. The efficiency of the latter is governed by the
bandwidth of the given finite element system of equations. If the bandwidth is not large enough,
interprocessor communication and/or process synchronization can dominate the work done in
parallel by each processor. This is true not only for multiprocessors with a message-passing
system, but also for super-vector-multiprocessors with a shared memory such as the CRAY
systems, where synchronization primitives are rather expensive. Therefore, the computational
method described in this paper should be seriously considered for large-scale problems with a
relatively small or medium bandwidth. These problems are typically encountered in the finite
element analysis of large space structures which are often elongated and include only a few
elements along one or two directions.¹⁶ The method is also recommended for problems where the
storage requirements of direct solvers cannot be met.
Besides conditioning, there are two other factors which affect the convergence of the interface
problem (14) and which are directly related to the mesh partition: (a) the number of interface
nodes, and (b) the interconnectivity of the subdomains along their interface. It can be easily
checked that, within one iteration of the conjugate projected gradient algorithm, new information
that is issued from a subdomain Q j reaches only those subdomains that interconnect with Q j
along an edge or a plane. Therefore, the interface problem converges faster for a mesh partition
that is characterized by a larger effective interconnectivity bandwidth.
The above observations suggest that an automatic finite element mesh decomposer that is
suitable for the computational method described herein should meet or strike a balanced
compromise between the seven following requirements:
(i) it should yield a set of subdomains where the bandwidth of each local problem is only a
fraction of the bandwidth of the global system of equations;
(ii) it should keep the amount of interface nodes as small as possible in order to reduce the size
of the interface problem;
(iii) it should yield a set of subdomains with a relatively high interconnectivity bandwidth so
that within each iteration a new correction reaches as many subdomains as possible;
(iv) it should avoid producing subdomains with ‘bad’ aspect ratio (for example, elongated and
flat subdomains) in order to keep the local problems as well conditioned as possible;
(v) it should deliver as few as possible floating subdomains in order to keep the cost
associated with the projected gradients as low as possible;
(vi) it should yield a set of balanced subdomains in order to ensure that the overall
Table II. Performance results on iPSC/2. Mechanical joint, brick elements, 16 and 32 subdomains
NP   NE   NDF   T/itr. (subs.)   T/itr. (tearing)   T_p (subs.)   T_p (tearing)   S_p (subs.)   S_p (tearing)
between neighbouring processors. However, for this problem, the tearing algorithm is faster and
exhibits a 20 per cent higher speed-up than the conventional substructuring algorithm for which
the time elapsed in interprocessor communication is 3.31 times higher. Again, because it avoids
interprocessor communication along the edges and corners of the subdomains, the tearing
algorithm requires fewer message-passing startups which, in the case of short messages, are
known to account for the largest portion of the time elapsed in interprocessor communication on
the iPSC/2 (see, for example, the benchmarks of Bomans and Roose¹⁸). A performance comparison
with a parallel direct solver is not provided because of the lack of memory space to store in-core
the triangular factors of K.
Now that the parallel properties of the presented algorithm have been illustrated, we focus next
on example problems that illustrate its intrinsic properties and performance. We consider the
large-scale finite element static analysis of the pure bending of a set of beams made of similar
jointed composite ‘pencils’ (Figure 9). Each composite pencil contains one carbon fibre with its
elastomer matrix and is discretized in 51 vertical layers, each containing 25 mesh points. The cross
section of the finite element mesh corresponding to a 16 pencil beam is shown in Figure 10.
The numerical results obtained on a 4 processor CRAY-2 for a 16 pencil beam with 48 000
degrees of freedom are reported in Figures 11 and 12. These correspond to two extreme mesh
decompositions, namely: (a) a horizontal cross-slicing into 4 subdomains each containing 4
cantilever parallel pencils (D1), and (b) a vertical slicing into 4 subdomains, of which three are floating
(D2). Poisson’s ratio for the elastomer is 0.49.
For each decomposition case, three curves are reported which correspond to monitoring
convergence with three different measures: (A) the global force relative residual, (B) the displacement
relative variation, and (C) the interface relative residual. Clearly, decomposition D1 induces
a faster convergence rate than decomposition D2. We have predicted this result since, within each
iteration of the iterative solution of the interface problem, information reaches all of the
subdomains in decomposition D1, while it reaches only half of these in decomposition D2.

Figure 10. Cross section of the finite element mesh for a 16 pencil composite beam
Figure 11. Convergence of the interface iteration: (A) global force relative residual, (B) global displacement relative variation, (C) interface relative residual, versus the number of iterations
Another important result relates to the relative positioning of the three curves, independently
from the decomposition pattern. Note first that convergence with the global force relative
residual is harder to achieve than convergence with the displacement relative variation. This is
because the problem suffers from a severe ill-conditioning owing to the incompressibility of the
elastomer (Poisson's ratio = 0.49) and the elongated shape of the cantilever composite beam. Note
also that convergence with the interface relative residual is closer to convergence with the
displacement relative variation than it is to convergence with the global force relative residual.
This is because the interface problem is formulated in the functional space of the stresses, so that
its residuals correspond to a displacement increment.
Finally, the tearing method is compared for performance with a direct Cholesky algorithm. The
factorization routine is highly optimized for CRAY-like supercomputers: it is fully vectorized and
its memory traffic requirements are reduced via an unrolling procedure. In order to have a
meaningful comparison, we use the same factorization routine for the global Cholesky algorithm
and for factoring the subdomain stiffnesses in the tearing algorithm. The same bending problem
as previously is selected for that purpose. Several different mesh configurations which correspond
to different numbers of pencils are considered. Performance results on a CRAY-2 single processor
Figure 12. Convergence of the interface iteration: (A) global force relative residual, (B) global displacement relative variation, versus the number of iterations
are reported in Table III. A tolerance of on the global relative residuals is selected as a
convergence criterion.
The above results demonstrate that, for sufficiently large problems, the tearing method can
outperform direct solvers. For the particular problem above, it runs up to 3.3 times faster than
Cholesky factorization and requires 5.2 times less memory space.
Number of pencils    4        9        16
NDF                  13 000   28 000   48 000
gradient algorithm is developed for the solution of the coupled system of local rigid modes
components and Lagrange multipliers, which completes the solution of the problem. When
implemented on local memory multiprocessors, this proposed method of tearing and inter-
connecting requires fewer interprocessor communications than the classical method of sub-
structuring. It is also suitable for parallel/vector computers with shared memory. Large-scale
example applications are reported on the iPSC/2 and CRAY-2. Measured performance results
illustrate the advantages of the proposed method and demonstrate its potential to outperform the
classical method of substructures and parallel direct solvers.
It is our experience that domain decomposition methods are very sensitive to the mesh
partition. In this paper, we have outlined some guidelines for the practical decomposition of a
given finite element mesh. Subsequent research will focus on determining the relationship
between a pattern of decomposition and the resulting conditioning of each of the local problems
and the interface one. While several preconditioners for conventional domain decomposition
methods (Schur methods) are available in the literature, further research is needed to develop a
preconditioner for hybrid domain decomposition algorithms such as the tearing method
developed herein.
ACKNOWLEDGEMENTS
The first author would like to thank Prof. M. Geradin at the University of Liege, Belgium, for his
valuable comments. He also wishes to acknowledge partial support by the National Science
Foundation under Grant ASC-8717773, and partial support by the Air Force Office of Scientific
Research under Grant AFOSR-89-0422.
where K_j is the (n_j^i + n_j^b) × (n_j^i + n_j^b) stiffness matrix associated with Ω_j, and u_j and f_j are the corresponding displacement and forcing vectors. If Ω_j has n_j^r rigid body modes, K_j is rank n_j^r deficient. Provided that f_j is orthogonal to the null space of K_j, the singular system (28) is consistent and admits a general solution of the form

u_j = K_j^+ f_j + R_j α_j    (29)

where K_j^+ is a pseudo-inverse of K_j, that is, K_j^+ verifies K_j K_j^+ K_j = K_j; R_j is a basis of the null space of K_j, that is, R_j stores the n_j^r rigid body modes of Ω_j; and α_j is a vector of length n_j^r containing arbitrary real coefficients.
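The structure of the general solution (29) can be checked numerically. The following is a minimal NumPy sketch (not from the paper): the free-free stiffness matrix of a three-node bar serves as an illustrative stand-in for a floating subdomain stiffness, and any pseudo-inverse solution plus an arbitrary rigid body contribution is shown to satisfy the consistent singular system.

```python
import numpy as np

# Free-free 3-node bar stiffness: singular, with one translational rigid
# body mode. Illustrative stand-in for a floating subdomain stiffness K_j.
K = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
R = np.ones((3, 1))                  # null-space basis R_j (rigid translation)

# Self-equilibrated load, orthogonal to the null space, so (28) is consistent.
f = np.array([1., 0., -1.])
assert abs((R.T @ f).item()) < 1e-12

# Any K+ verifying K K+ K = K yields a particular solution K+ f;
# the Moore-Penrose inverse is used here for simplicity.
Kplus = np.linalg.pinv(K)
alpha = np.array([0.7])              # arbitrary rigid body amplitude alpha_j
u = Kplus @ f + (R @ alpha).ravel()  # general solution (29)

assert np.allclose(K @ u, f)         # u solves the singular system for any alpha
```

Changing `alpha` shifts the solution along the rigid body mode without affecting the residual, which is exactly the non-uniqueness that the orthogonality condition on the Lagrange multipliers later resolves.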
n_j^i + n_j^b − n_j^r. If R_j is defined as
2. Computing K_j^+ f_j
The partitioning of the singular matrix K_j defined in (30) implies that

K_j^{rr} = K_j^{rp} K_j^{pp−1} K_j^{pr}    (31)

Using the above identity, it can be easily checked that the matrix K_j^+ defined as

K_j^+ = [ K_j^{pp−1}   0 ]
        [ 0            0 ]    (32)

is a pseudo-inverse of K_j. Therefore, a solution of the form K_j^+ f_j can also be written as
backward substitution is modified to operate also on the n_j^r extra right-hand sides in order to recover u_j^p = −K_j^{pp−1} K_j^{pr} u_j^r.
The above procedure for solving a consistent singular system of equations has almost the same computational complexity as the solution of a non-singular one.
is an (n_j^i + n_j^b) × n_j^r full column rank matrix which stores the rigid body modes of the floating subdomain Ω_j, R_j^b is the restriction of R_j to the intersection of Ω_j and the interface Γ_I, and f_j is the vector of prescribed forces in Ω_j. If λ^(0) is written as

(34)

then (33) becomes

The matrix product (R_j^b)^T R_j^b is only n_j^r × n_j^r, where n_j^r is the number of rigid body modes of the floating subdomain Ω_j. Therefore, (R_j^b)^T R_j^b is at most 3 × 3 in two-dimensional problems and at most 6 × 6 in three-dimensional problems, and the evaluation of λ_j^(0) according to (37) requires little computational effort.
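The cost argument can be made concrete with a small sketch (the sizes below are illustrative assumptions, not taken from the paper): however many interface degrees of freedom R_j^b has, the Gram matrix (R_j^b)^T R_j^b that must be formed and factored is only n_j^r × n_j^r.

```python
import numpy as np

# Illustrative sizes (assumed): a 2-D floating subdomain with 8 interface
# dofs and n_r = 3 rigid body modes.
rng = np.random.default_rng(0)
n_b, n_r = 8, 3
Rb = rng.standard_normal((n_b, n_r))  # stand-in for the interface restriction of R_j
g = rng.standard_normal(n_b)          # stand-in for an interface right-hand side

# The Gram matrix is tiny: n_r x n_r (at most 3x3 in 2-D, 6x6 in 3-D).
G = Rb.T @ Rb
assert G.shape == (n_r, n_r)

# Solving with it costs O(n_r^3) flops, negligible next to the subdomain solves.
y = np.linalg.solve(G, Rb.T @ g)
assert np.allclose(G @ y, Rb.T @ g)
```

Forming G costs O(n_b · n_r^2) and factoring it O(n_r^3), both vanishingly small compared with one factorization or solve on a subdomain stiffness matrix.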
REFERENCES
1. C. Farhat and E. Wilson, ‘A new finite element concurrent computer program architecture’, Int. j. numer. methods eng.,
24, 1771-1792 (1987).
2. B. Nour-Omid, A. Raefsky and G. Lyzenga, ‘Solving finite element equations on concurrent computers’, in A. K. Noor (ed.), Parallel Computations and Their Impact on Mechanics, ASME, New York, 1987, pp. 209-228.
3. M. Ortiz and B. Nour-Omid, ‘Unconditionally stable concurrent procedures for transient finite element analysis’, Comp. Methods Appl. Mech. Eng., 58, 151-174 (1986).
4. C. Farhat, ‘A multigrid-like semi-iterative algorithm for the massively parallel solution of large-scale finite element systems’, Multigrid Methods: Proc. Fourth Copper Mountain Conf. on Multigrid Methods, SIAM, Copper Mountain, Colorado, 1989, pp. 171-180.
5. F. X. Roux, ‘Test on parallel machines of a domain decomposition method for a composite three dimensional structural analysis problem’, Proc. Int. Conf. on Supercomputing, Saint Malo, France, 1988, pp. 273-283.
6. F. X. Roux, ‘A parallel solver for the linear elasticity equations on a composite beam’, in T. F. Chan et al. (eds.), Proc. Second Int. Conf. on Domain Decomposition Methods, SIAM, Los Angeles, California, 1989.
7. G. Kron, ‘A set of principles to interconnect the solutions of physical systems’, J. Appl. Phys., 24, 965-980 (1953).
8. T. H. H. Pian, ‘Finite element formulation by variational principles with relaxed continuity requirements’, in A. K. Aziz (ed.), The Mathematical Foundation of the Finite Element Method with Applications to Partial Differential Equations, Part II, Academic Press, London, 1972, pp. 671-687.
9. Q. V. Dinh, R. Glowinski and J. Periaux, ‘Solving elliptic problems by domain decomposition methods with applications’, in A. Schoenstadt (ed.), Elliptic Problem Solvers II, Academic Press, London, 1984.
10. M. R. Dorr, ‘Domain decomposition via Lagrange multipliers’, Report No. UCRL-98532, Lawrence Livermore
National Laboratory, 1988.
11. R. Fletcher, Practical Methods of Optimization, Vol. 2, Constrained Optimization, Wiley, New York, 1981, pp. 86-87.
12. P. E. Gill and W. Murray, in P. E. Gill and W. Murray (eds.), Numerical Methods for Constrained Optimization, Academic Press, London, 1974, pp. 132-135.
13. F. X. Roux, ‘Acceleration of the outer conjugate gradient by reorthogonalization for a domain decomposition method for structural analysis problems’, Proc. Third Int. Conf. on Supercomputing, Crete, Greece, 1989, pp. 471-477.
14. S. H. Bokhari, ‘On the mapping problem’, IEEE Trans. Comp., C-30, 207-214 (1981).
15. C. Farhat, ‘On the mapping of massively parallel processors onto finite element graphs’, Comp. Struct., 32, 347-354 (1989).
16. C. Farhat, ‘Which parallel finite element algorithm for which architecture and which problem’, in R. V. Grandhi et al. (eds.), Computational Structural Mechanics and Multidisciplinary Optimization, AD-Vol. 16, ASME, New York, pp. 35-43.
17. C. Farhat, ‘A simple and efficient automatic FEM domain decomposer’, Comp. Struct., 28, 579-602 (1988).
18. L. Bomans and D. Roose, ‘Benchmarking the iPSC/2’, Report TW 114, Katholieke Universiteit Leuven, Department
of Computer Science, Belgium, 1988.