with the operators and therefore must do some work automatically. For these reasons,
computers are an indispensable tool in today’s science.
Modelling
In this context, a model is a simplified version of an actual phenomenon, process or object
used for study and experimentation. Such models can be physical (for example the well
known small models of buildings in architecture) or virtual/mathematical. Modelling is
used for this purpose not only in science, but also in industry, education or art. Scientific
computing usually deals with mathematical models of natural phenomena represented
in a computer. There are three basic types of models used in this field: simulation models,
mathematical models and optimization models.
Mathematical model
In the simpler cases, it’s possible to construct a closed-form equation which models the
process we are studying. Such equations have the advantage of being easy to evaluate
for any given parameter and they can also be easily studied from multiple viewpoints
or integrated into other models. For more complex systems, it’s usually not possible to
obtain such a representation.
For example, a model of the position s and velocity v of a freely falling object is

v(t) = g·t
s(t) = ½·g·t²

assuming a uniform gravitational field, no air resistance, zero initial speed and the origin of the
coordinate system placed at the position of the object at t = 0.
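As a small illustration (my own addition, not part of the lecture material), such a closed-form model can be evaluated directly for any time t; the value g = 9.81 m/s² and the function names are assumed here:

```python
# Minimal sketch: evaluating the closed-form free-fall model.
# Assumes g = 9.81 m/s^2, no resistance and zero initial speed.

G = 9.81  # gravitational acceleration in m/s^2

def velocity(t):
    """Velocity v(t) = g*t of a freely falling object."""
    return G * t

def position(t):
    """Distance fallen s(t) = 1/2*g*t^2."""
    return 0.5 * G * t ** 2

if __name__ == "__main__":
    for t in (0.0, 1.0, 2.0, 3.0):
        print(f"t = {t:.1f} s  v = {velocity(t):6.2f} m/s  s = {position(t):6.2f} m")
```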
There are also other kinds of mathematical models which are closer to simulation or
optimization.
Simulations
Simulations are one of the most important tools computers provide in this field, because
they are highly versatile and not too difficult to implement. To create a simulation, scientists
write a computer model of the natural process they want to study in which they describe
the simple behaviour of the actors in the model (such as atoms, stars, ants, . . .), specify how
many actors there are in the model and let the computer simulate their behaviour step by
step. This allows them to watch and study processes which are otherwise hard, expensive
or even impossible to watch in reality. There are many examples of such phenomena: the
"hard" group includes behaviour in the microworld (which is sometimes possible to
observe) and the behaviour of materials (gases, sands, . . .), whereas the "impossible" group
includes the evolution of the universe, collisions of galaxies and others.
Simulations are also used for prediction: for example, when a comet is approaching the Earth,
its trajectory is calculated by simulating the comet’s movement.
Simulations always use only a model of the studied process, which means that details not
important for our current point of view are not considered. This is necessary in order to be
able to actually create the model and run it on available computers in a reasonable time.
Nevertheless, there are always models big enough to make use of the biggest and fastest
computers. Currently, clusters of thousands of processors are being used for scientific
simulations.
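To show the step-by-step character of a simulation, here is a minimal sketch (again my own addition) that simulates the same falling object numerically with a simple Euler time-stepping scheme instead of the closed-form model; the time step DT is an assumption:

```python
# Minimal sketch: step-by-step (Euler) simulation of a falling object.
# Assumes g = 9.81 m/s^2 and a fixed time step; accuracy improves as DT shrinks.

G = 9.81   # gravitational acceleration in m/s^2
DT = 0.01  # time step in seconds

def simulate(t_end):
    """Advance velocity and position step by step until t_end."""
    t, v, s = 0.0, 0.0, 0.0
    while t < t_end:
        v += G * DT   # update velocity from acceleration
        s += v * DT   # update position from velocity
        t += DT
    return v, s

if __name__ == "__main__":
    v, s = simulate(3.0)
    print(f"after 3 s: v = {v:.2f} m/s, s = {s:.2f} m")  # close to the closed-form values
```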
Visualization
The human brain has an incredible ability to find patterns in what it sees. Therefore the
benefit of proper visualization of data is not to be underrated, and computers can help
a great deal here as well. Thanks to computer games, the field of computer graphics is well
developed and there is also cheap dedicated hardware available. Of course, the interesting
and sometimes beautiful results are also used for science popularization. Visualization
can also be applied on the fly to the state of a running simulation, which allows the operator
to check that it is running as expected, or even to steer it towards a particularly interesting
part of the problem that was not foreseen at the beginning.
Prediction
The field of Artificial Intelligence has developed several methods for machine learning.
Machine learning is a process in which the parameters of a mathematical model exe-
cuted by the computer are automatically tuned based on the input data. This phase is
called learning. If we are successful, the model’s parameters after learning reflect the
underlying pattern in the data, and the model can be used to predict unknown features of new data.
For example, we might have a dataset of 100 chemical compounds together with the
information on whether or not they are active against one specific target. We can
build an artificial neural network and train it with these compounds and their activity.
When training is done, the neural network can be used to predict the activity of new compounds
based on their similarity to the compounds "it knows". Another example is the use of
machine learning for automatic classification of insects into species based on the insect’s
features or its photo. This can help with routine classification and save a lot of tedious
work.
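A rough sketch of this workflow follows; the library (scikit-learn), the descriptors and all numeric values are my own invented illustration, not data from the lectures:

```python
# Minimal sketch: predicting compound activity from numeric descriptors.
# The descriptor values and labels below are made up purely for illustration.
from sklearn.neural_network import MLPClassifier

# Each compound is described by a few numeric features (e.g. molecular weight,
# logP, number of hydrogen-bond donors) -- hypothetical values.
X_train = [
    [320.0, 2.1, 1],
    [410.5, 3.4, 2],
    [150.2, 0.8, 0],
    [275.9, 1.9, 3],
]
y_train = [1, 1, 0, 0]  # 1 = active against the target, 0 = inactive

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Predict the activity of a previously unseen compound.
new_compound = [[300.0, 2.0, 2]]
print("predicted activity:", model.predict(new_compound)[0])
```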
Optimization
In the real world we often encounter situations in which we can choose among many alter-
natives and know how to calculate the profit of each one, but we do not know
which brings the highest profit, because there are simply too many of them. This is called
an optimization problem, and it occurs both in everyday life (in business, for instance)
and as a subproblem in other computing problems.
For example, finding the stable configuration of a molecule is an optimization problem,
because each atom’s position is a variable and we are able to (approximately) calculate
the energy potential of each configuration. There are, however, usually too many configu-
rations to be able to try them all out, so we need to find a smarter way of finding the best
one.
Over the last several decades, many approaches have been developed which try to tackle
this problem. Optimization problems can also be divided into several classes, some of
which are easier to solve than others.
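A small sketch of this idea: the "energy" below is a toy stand-in, not a real molecular potential, and a local optimizer from SciPy searches for the configuration with the lowest energy:

```python
# Minimal sketch: finding a low-energy configuration with a local optimizer.
# The "energy" here is a toy function of two coordinates, not a real potential.
from scipy.optimize import minimize

def energy(x):
    """Toy energy landscape: a simple bowl with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

result = minimize(energy, x0=[0.0, 0.0])  # start from an arbitrary guess
print("best configuration:", result.x)    # approximately [ 1. -2.]
print("energy at the minimum:", result.fun)
```

Real conformation searches have many more variables and many local minima, which is why smarter strategies than a single local search are needed.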
Data processing
In some fields, measuring devices can supply large amounts of data about some natural
phenomena and we need to process these data in order to understand them. This is yet
another application of computers in science. Whether we need to automate methods
which have been used for years or devise completely new methods, computers can help
significantly.
In the simpler cases, it’s just the ability to store and retrieve data based on queries about
any feature, or to correlate them automatically. For example, bioinformatics is the field of applying
computers to process biological data. It works with data such as DNA sequences, protein
sequences, gene expression or even medical publications. For biological sequences, meth-
ods have been developed which allow aligning and comparing them, finding common motifs
and, based on that, building phylogenetic trees (trees of ancestry) of species. By comparing
the sets of expressed genes in healthy and unhealthy (for example cancer-suffering)
patients, more can be learned about the specific disease. Protein folding, which means
determining the 3D structure of a protein based on its sequence of amino acids, is a vital
task for understanding its function in the organism and for drug discovery. Currently
computers are not powerful enough to solve this task without additional data, but methods
which use known folding of similar proteins are being successfully applied today. In
bioinformatics and medicine, the number of publications per year is very large and the
human body is terribly complex. Semantic methods, which have their origin in philosophy,
are being used to categorise and search the publications and the information about the
human body.
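To give a flavour of the sequence methods mentioned above, here is a simplified sketch of the classic Needleman-Wunsch global alignment score computed by dynamic programming; the scoring scheme (+1 match, -1 mismatch, -1 gap) is a deliberately simple assumption, real tools use more elaborate scoring:

```python
# Minimal sketch: global alignment score of two sequences (Needleman-Wunsch).
# Scoring: +1 for a match, -1 for a mismatch, -1 for a gap.

def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic-programming table and return the optimal score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap          # a prefix of `a` aligned against gaps
    for j in range(1, cols):
        table[0][j] = j * gap          # a prefix of `b` aligned against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]

if __name__ == "__main__":
    print(alignment_score("GATTACA", "GCATGCA"))  # a higher score means more similar sequences
```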
And while bioinformatics deals with data on the order of billions of DNA base-pairs,
there are even bigger challenges. The recently launched Large Hadron Collider employs
gigantic detectors which will observe the debris after particle collisions. The data from
this detector will be processed in the hope of finding proof of the existence of the so-called
Higgs boson. But the volume of data will be tremendous and cannot be processed even
by all the computers physicists have available at the moment, so it must first be cut down
by simple filters which discard the uninteresting parts. Even after this reduction, there
will be a lot of work left.
Measuring devices
Computers are ubiquitous nowadays, and that includes data-collecting devices as well,
especially those which need to be autonomous to some extent or must process the data
first.
For example, telescopes in Earth orbit must compress their data first, because connection
bandwidth and time are limited. And Mars rovers have to possess a certain degree of
autonomous behaviour, because the time it takes for a signal to travel the large distance
to an operator on Earth and back is long, and a response may arrive too late for a reaction.
Tools
Aside from specialised software developed for a specific task, there are some programs
which have a wider range of application. Many of them are free of charge or even open
source and some of them are commercial. We will give a short overview of the best-known
general-purpose programs.
Mathematical software
Nowadays computers can not only add and multiply numbers, but can also work with
symbolic mathematical objects such as equations, functions and so on. This means that
they can be used to speed up mathematical work which cannot be done with numbers alone and
to help avoid mistakes people could make. Programs which can manipulate symbolic
objects are called Computer Algebra Systems.
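As a small illustration of what a computer algebra system does, here is a sketch using the SymPy Python library; SymPy is not mentioned in the lectures and stands in here for any CAS:

```python
# Minimal sketch: symbolic manipulation with a computer algebra system (SymPy).
import sympy as sp

t, g = sp.symbols("t g", positive=True)

s = sp.Rational(1, 2) * g * t ** 2   # the free-fall distance from the earlier example
v = sp.diff(s, t)                    # symbolic differentiation gives the velocity g*t
print("v(t) =", v)

# Solve symbolically for the time needed to fall a height h.
h = sp.symbols("h", positive=True)
print("t =", sp.solve(sp.Eq(s, h), t))  # the symbolic solution, equivalent to sqrt(2*h/g)
```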
Another kind of mathematical software is more aimed at the numeric part, because after
all, that is the area where computers excel, and it is also very useful. Both kinds of programs
can usually present the objects they work with graphically, for example as a function
graph or a color plot of a matrix.
Some well-known mathematical software packages include:
R (open source), http://www.r-project.org/
Due to the memory and processor demands of many scientific computing tasks, dis-
tributed computing is an inseparable part of it. Faster and faster
supercomputers are being built, many of which are used for science.
In the past, specialised hardware used to be built for specific computing demands. This
is no longer the case for most systems today, because it’s more cost-efficient to use the same
hardware which is being produced en masse for the commercial sector. That means that
today’s supercomputers are built using the same computer architecture and the same
type of processors (only the fastest versions) as the ones in consumers’ computers. Most
of them are structured as clusters – more or less standard computers connected by a
network. Programs for clusters are written mostly in C, C++ or Fortran. One notable piece
of software for cluster computing is MPI, an open standard for message passing. Programs
running on a cluster consist of independently running processes which communicate by
passing messages over the network and MPI defines a standard interface and a set of
functions for this task, allowing the programmer to concentrate on the computation
rather than the technical details of network communication.
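A minimal sketch of this message-passing style is shown below, using the mpi4py Python binding rather than the C, C++ or Fortran mentioned above; the file name in the comment is only an example:

```python
# Minimal sketch of MPI-style message passing using the mpi4py binding.
# Run with e.g.: mpirun -n 4 python sum_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # the id of this process
size = comm.Get_size()   # the total number of processes

# Each process computes a partial result independently...
partial = sum(range(rank * 1000, (rank + 1) * 1000))

# ...and the partial results are combined by passing messages over the network.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("sum computed by", size, "processes:", total)
```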
For example, the fastest supercomputer at the moment is Jaguar [3], a Cray XT5 system,
which has more than 200,000 Opteron processor cores, the same type of processor that can be
found in high-end consumer desktop PCs and servers.
A significant new development in the field of high performance computing is the usage of
chips which were previously designed purely as game graphics accelerators. These chips
are specialized for graphics, but due to continuous high demand over more than 10 years,
they have developed quickly and today allow game developers to program the materials
(color, shininess, transparency, bumpiness) of game objects; in other words, the graph-
ics accelerators have become programmable. These accelerators are highly parallel, containing
hundreds of small processing cores. The cores are not universal and cannot work inde-
pendently as CPU cores can, but for some tasks such a card can bring a hundredfold
acceleration. Coupled with their relatively low prices, they are a perfect tool for certain high-
performance jobs. Programs using these accelerators must be specifically written for a
given architecture. There are currently two major interface/programming environments.
CUDA is a proprietary interface to Nvidia’s graphics cards and at present dominates
the field. The newer standard, OpenCL, is open and has already been embraced by both
AMD and Nvidia.
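A very small sketch of what such GPU programming looks like, using the PyCUDA binding (not mentioned in the lectures; the kernel itself is plain CUDA C), could be the following:

```python
# Minimal sketch: running a tiny CUDA kernel on the GPU via PyCUDA.
import numpy as np
import pycuda.autoinit                  # initialises the GPU context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a, float factor)
{
    // each of the many small cores handles one array element
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] *= factor;
}
""")
scale = mod.get_function("scale")

a = np.arange(256, dtype=np.float32)
scale(drv.InOut(a), np.float32(2.0), block=(256, 1, 1), grid=(1, 1))
print(a[:5])   # the array has been doubled on the graphics card
```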
Pharmacoinformatics
This section describes the topic of pharmacoinformatics. Two lectures in this course were
given by Gerhard Ecker, and I attended his course Computational Life Sciences
as well [4].
What is it?
Just as bioinformatics employs computers to process biological data, pharmacoinformatics
is the field of applying computers in pharmacy, specifically in drug design. Drug design is a
big business, but also a very tough one: it usually takes more than 10 years and millions
of € to bring a new drug to market, mostly due to the large amount of mandatory testing.
Computers are used in all stages of drug design, but mainly in the initial phases of finding
the best chemical compound. It needs to be highly active, must have no side effects and
should have minimal interactions with other drugs.
it can be generated by a computer just from its 2D structure (or a SMILES string). To
find the 3D conformation, we need to solve an optimization problem where the variables
are torsion angles on the bonds and the objective function is the system’s energy. This
problem is computationally tractable for small molecules. It is solved either using tradi-
tional optimization methods such as stochastic search, systematic search and evolutionary
algorithms, or by using known 3D structures of similar compounds.
Proteins are very large molecules, on the order of hundreds or thousands of atoms. That
means that finding their structure ab initio, from the ground up, is not possible. It can
be determined experimentally, using X-ray crystallography or NMR spectroscopy. But
experimental methods are sometimes rather expensive and difficult, and for some proteins
impossible to carry out. With computers, one way to determine a protein’s structure is
through homology modelling. This method calculates the 3D structure based on the known
structures of other proteins which have similar amino acid sequences.
Activity prediction
One of the problems pharmacy faces is finding the best chemical compound for a given
target. The number of possible chemical compounds is vast, because the atoms can
be combined in almost any way. Pharmacologists have developed methods such as
High Throughput Screening, which are capable of testing up to 10,000 compounds a
day, but even this is too slow and expensive. Predicting the activity of virtual compounds
using computers can therefore be very helpful. There are several ways of going about
that.
QSAR
QSAR stands for Quantitative Structure-Activity Relationships. As the name implies, it tries
to capture the relationship between a compound’s structure and its biological activity. More
specifically, it is a function of several molecular descriptors which gives the predicted
activity. To find the function, statistical methods such as regression analysis are used
on a set of known (compound, activity) pairs. It is important to follow several rules
to get meaningful results: the training set should be large enough with respect to the
number of descriptors, one should not extrapolate too far away from the training set, and
the biological system in which the activity of the training set is measured should be as simple
(and understandable) as possible.
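A rough sketch of this idea using ordinary linear regression; the descriptor values and activities below are invented, and real QSAR studies use curated datasets and validated descriptors:

```python
# Minimal QSAR-style sketch: linear regression from molecular descriptors
# to a measured activity. All numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = compounds, columns = descriptors (e.g. logP, molar mass, H-bond donors).
descriptors = np.array([
    [1.2, 250.0, 1],
    [2.8, 310.0, 2],
    [0.5, 180.0, 0],
    [3.1, 420.0, 3],
    [1.9, 275.0, 1],
])
activity = np.array([5.1, 6.8, 4.0, 7.4, 5.9])  # hypothetical activity values

model = LinearRegression().fit(descriptors, activity)
print("coefficients:", model.coef_)

# Predict the activity of a new virtual compound from its descriptors.
print("predicted activity:", model.predict([[2.0, 300.0, 2]])[0])
```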
Pharmacophore
Another approach is based on the insight that the precise 3D structure is not as important
for biological activity as "high level" features such as hydrophobicity, lipophilicity,
hydrogen bond donors/acceptors or ionizability, which arise from the structure. These
features together with their 3D placement can be used to describe a group of chemical
compounds which have the same properties. Then a pharmacophore model built from a
set of known active compounds can be used to find more active compounds in a database
of virtual chemical compounds.
Machine learning
input data is inserted into the first layer of neurons, propagated using the rules described
and the output from the last layer of the network is compared with the expected output.
The weights of the edges are then slightly adjusted so that the output comes closer to the expected one.
After going through the entire training set, the network should have the strongest weights
for the most commonly occurring patterns in the training set and we can start inserting
data for which we need a prediction.
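A very small sketch of this training loop follows; it uses a single sigmoid neuron written from scratch (not the network architecture from the lectures), and the tiny dataset is invented:

```python
# Minimal sketch: training a single sigmoid neuron by repeated weight adjustment.
# The tiny dataset below is invented; the label simply equals the first feature.
import math
import random

data = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias = 0.0
rate = 0.5  # learning rate: how strongly each example adjusts the weights

def forward(x):
    """Propagate an input through the neuron."""
    return 1.0 / (1.0 + math.exp(-(weights[0] * x[0] + weights[1] * x[1] + bias)))

for epoch in range(1000):               # go through the training set many times
    for x, expected in data:
        out = forward(x)
        error = expected - out          # compare with the expected output
        for i in range(2):              # nudge each weight towards a better output
            weights[i] += rate * error * out * (1 - out) * x[i]
        bias += rate * error * out * (1 - out)

print([round(forward(x), 2) for x, _ in data])  # outputs move towards [0, 1, 1, 0]
```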
Neural networks naturally also have some weaknesses. Some of them can be overcome
by using more sophisticated methods, some cannot. Most importantly, a neural
network does not explain why it gives the results it does; the information captured by the weights
cannot be easily interpreted. This is in contrast with some other methods, such as QSAR
or decision trees, in which the generalised pattern can be seen easily.
Other methods include random forests, a generalization of decision trees, support vector
machines based on separating the data points by a hyperplane, or clustering, based on
building spatially compact groups of data items.
Docking
Docking means performing a rather accurate simulation of the interaction between our
compound and the target protein. Naturally, the 3D structures of both participants need to
be known, as well as the location of the binding pocket on the protein, which can be determined
from known interactions of the protein with similar compounds. It is even possible to
create a cocrystal of the protein and a compound and find the 3D positions of both using X-ray
crystallography. But as we already mentioned, these experiments are not as fast and
cheap as "virtual experiments".
The problem of docking is an optimization problem. We can translate (move) the com-
pound in 3 dimensions, we can rotate it and sometimes we also need to twist or stretch
the bonds, because the lowest energy state in isolation may not be the same as the lowest
energy state in interaction with the protein. There are several ways of rating the
conformations, but molecular mechanics, which calculates the total energy of the system, is the
most commonly used one. Other approaches once again make use of known alignments.
Docking is a rather accurate method of estimating the binding affinity and can be used to
find hits in virtual databases or for optimization (finding a better compound than the one
we already have, one which is less toxic for example). Compared to other methods, this
one is rather computationally intensive.
number of TV sets:
f(x1, x2) = 20 · x1 + 10 · x2
The constraints can be expressed as follows:
x1, x2 ≥ 0
x1 ≤ 70
x2 ≤ 50
x1 + 2 · x2 ≤ 120
x1 + x2 ≤ 90
Solving the problem using OpenOffice.org Calc with x1, x2 constrained to integers gives
the following result:
x1 = 70
x2 = 20
for a total profit of $1600.
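The same problem can also be solved programmatically; a minimal sketch using SciPy's linear-programming routine (not the OpenOffice.org Calc solver used above) could look like this:

```python
# Minimal sketch: the TV-set production problem solved with scipy.optimize.linprog.
# linprog minimizes, so the profit 20*x1 + 10*x2 is negated.
from scipy.optimize import linprog

c = [-20, -10]                  # negated profit coefficients
A_ub = [[1, 0],                 # x1 <= 70
        [0, 1],                 # x2 <= 50
        [1, 2],                 # x1 + 2*x2 <= 120
        [1, 1]]                 # x1 + x2 <= 90
b_ub = [70, 50, 120, 90]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
x1, x2 = res.x
print(f"x1 = {x1:.0f}, x2 = {x2:.0f}, profit = {-res.fun:.0f}")  # expect 70, 20, 1600
```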
Critical evaluation
As a lecture intended to introduce freshmen to the topic of scientific computing, this
subject does its job well in my opinion. There are just a few minor problems that I would like
to point out.
The part about good programming practices in the lecture about Computational Chemistry
was somewhat out of place. Don’t get me wrong, this topic should definitely be taught at
universities, for the sake of anybody even remotely working with software, but it should
be put in the correct place. In a programming course, that is. As I noticed from the
extensive use of Excel, students of this subject are not expected to know programming at
all, so these notes are not relevant for them at this point.
Distributed/parallel computing is a more general topic which finds application in every
part of scientific computing, so maybe it could get a lecture of its own instead of being a
part of the Computational Chemistry lecture.
I think this subject should contain more practical work with scientific software such as
Mathematica, Matlab, R and also the free alternatives to these programs, possibly instead
of working so much with Excel. This would help students in the following courses such
as Optimierung :)
Aside from these notes, the topics were diverse and sometimes even colorful. Coupled
with practical hands-on exercises, the course was certainly an interesting one.
References
[1] Mathematica Solutions, http://www.wolfram.com/solutions/
[2] Data analysts captivated by R’s power, http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
[3] TOP500, from Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Top_500
[4] Gerhard Ecker, lectures Computational Life Sciences