Starting GCDKit



Geochemical Data Toolkit for Windows


written in R language

© 2000-2006
Vojtech Janousek, Czech Geological Survey;
Colin M. Farrow, University of Glasgow;
Vojtech Erban, Czech Geological Survey

http://www.gla.ac.uk/gcdkit
User’s Guide to version 2.1.1 (1 April 2006)

GCDkit is a system for handling and recalculation of whole-rock analyses from
igneous rocks. It is fully menu driven but, at the same time, can be used in an
interactive regime (power users only).
Using GCDkit, the data can be loaded, individual analyses grouped into coherent groups or
searched according to various criteria. They can be plotted into commonly used classification
diagrams (e.g. Harker plots, TAS, AFM, R1-R2) as well as a variety of user-defined plots
(binary and ternary plots, multiple plots, spiderdiagrams).
Moreover, some basic statistical methods are implemented (both for the whole data set and
individual groups) using a fraction of the power built in R language. Included are the
descriptive statistics, boxplots, histograms, principal component and cluster analysis.
GCDkit can also perform the most common normative calculations.
The system is easy to expand by means of so-called plugins, which provide a simple method
of adding new items to the menus of GCDkit. Three of them are included in the
current distribution, namely a module for interpretation of radiogenic isotope data
(calculating Sr and Nd initial compositions, Nd model ages, plotting isochrons etc.), a module
for estimating saturation temperatures of accessory phases (zircon, apatite and monazite)
and a module for calculating the magnitude of the REE tetrad effect.

TECHNICAL NOTES


• After double-clicking the icon, GCDkit opens the “R Console”, a text window serving for entry of
commands as well as display of textual output. In addition, one or more graphical windows
are typically opened during the session.
• Please note that the GCDkit menu appears to the right of the menus of the R system itself (see
the Figure on the preceding page).
• The interactions built into many plots (most commonly data point identification) can be
stopped from the menu that appears after pressing the right mouse button. If experiencing
problems, the computation/plotting can be interrupted anytime by hitting the Escape key or
from the menu (Misc|Stop current computation).
• More immediate response is obtained when print buffering (Misc|Buffered Output) is
disabled.
• The errors encountered while running the GCDkit system and displayed in the Console are
(mostly) not fatal. In most cases the command can simply be re-run (one can
press the up/down arrows in the Console to scroll through the command history),
entering/modifying the parameters that caused the crash. If the problem persists, record
the details and send the data file to us, so we can fix the bug.
• This GCDkit was built in R version 2.2.1. Due to the rapid development of the R project,
correct functioning with any other version cannot be guaranteed!
• GCDkit requires the R packages base, stats, methods, utils, graphics, MASS, grid, foreign
and lattice to be installed (which is normally the case).
• Also recommended is the package RODBC by Brian Ripley, which enables import from and export
to MS Excel, Access, dBase and other formats. It is included in our binary
distribution of GCDkit (i.e. that with the Windows installer), so one normally need not
worry about its presence. If need be, it can be installed via the menu Packages|Install
package(s) from CRAN (Internet connection required). Offline installation is also
possible: a zip file downloaded from the CRAN site (http://www.r-project.org) can be
installed via the menu Packages|Install package(s) from local zip files.
• The recommended systems for running GCDkit are Windows 2000/NT/XP, although
with some care (especially keeping the number of open graphical windows to a
necessary minimum) Win 95/98/ME would probably work as well. The point is that
under Windows 95/98/ME, R may become unstable, failing to redraw graphical
windows if too many of them are open. It is always a good idea to close the
unnecessary ones, for instance using the function graphics.off() or the corresponding item
from the menu GCDkit.
• PostScript output should be preferred to WMF (Windows metafiles) or copying via the
clipboard, as the latter two methods sometimes lead to distortions of the graphs. To
save a plot, simply right-click the graphical window and select the desired format.
• The use of special symbols or accented characters (such as in some East European
languages) is mostly okay but it may also sometimes cause unexpected problems. So
please be sensible, especially in the variable names.
• WARNING! DO NOT DELETE your default data directory or the file ‘.Rprofile’
therein. Otherwise the desktop shortcuts used to run GCDkit will stop working.

Quick guide to data files
GCDkit requires plain text data files, delimited by tabs, commas or semicolons (the delimiter
is recognised automatically, as is the decimal point/comma). See the file sazava.data in the
subdirectory Test data for an example of a valid GCDkit data file.
As a (less forgiving) alternative, the data can be pasted, via the clipboard, from any
Windows-based software, including Excel.
If library RODBC has been installed, the GCDkit first attempts to establish ODBC
connection to the selected file, and open it as a dBase III/IV (*.dbf), Excel (*.xls) or Access
(*.mdb) format. The DBF files are used to store data by other popular geochemical packages,
such as IgPet (Carr, 1995) or MinPet (Richard, 1995). With ODBC support, the GCDkit can
read plain text files, including the variants represented by the NewPet (*.ROC) (Clarke et al.,
1994) and PetroGraph (*.PEG) (Petrelli et al., 2005) files as well as the outputs from
WWW-based databases such as GEOROC (http://georoc.mpch-mainz.gwdg.de/georoc) and
PETDB (http://www.petdb.org).
For text files and spreadsheets, the first line should contain the names of the data columns
(except for the first column, which is automatically assumed to contain the sample names).
Hence the first line of the text file may (or may not) have one item less than the following ones.
Missing values (NA) are allowed anywhere in the body of the data file; negative or zero values,
as well as any of 'NA', 'N.A.', '-', 'b.d.', 'bd', 'b.d.l.' and 'bdl', are also treated as such. While
loading, the values #WHATEVER! (Excel error messages) and '< x' (used by some authors to indicate
values below the detection limit) are also replaced by 'NA', as are any non-numeric items found in a
data column with one of the names known to express a concentration of an element/oxide.
The data rows start with the sample name and do not have to be all of the same length (the rest
of the row is filled with 'NA' automatically).
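These conventions map closely onto what base R's read.table() does. The following is only a minimal sketch, not GCDkit's own loading routine: the sample data and the name na.codes are invented for illustration, and only the tab-delimited case is handled (no automatic detection of the delimiter or decimal mark).

```r
# Hypothetical tab-delimited data. The header line has one item fewer than
# the data rows, so read.table() takes the first column as sample names.
txt <- paste(
  "SiO2\tMgO\tRb",
  "Sa-1\t65.2\t1.8\t102",
  "Sa-2\t58.4\tb.d.\t0",
  "Sa-3\t71.0\t0.6",
  sep = "\n"
)

# Strings regarded as missing values (taken from the list in the text)
na.codes <- c("NA", "N.A.", "-", "b.d.", "bd", "b.d.l.", "bdl")

# fill = TRUE pads short rows (here Sa-3) with NA
dat <- read.table(text = txt, sep = "\t", header = TRUE,
                  na.strings = na.codes, fill = TRUE,
                  check.names = FALSE)

# Negative or zero values are also treated as missing
for (j in names(dat)) {
  dat[[j]][which(dat[[j]] <= 0)] <- NA
}

print(dat)
```

In this sketch the 'b.d.' entry, the zero Rb value and the missing last item of the third row all end up as NA.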
A column named "Symbol" (if any) is taken as containing plotting symbols. They can be
specified either as single-character strings or as numeric codes (an overview is given above
or can be obtained by invoking the menu item Data handling|Show available symbols).
Another column, named "Colour" or “Color” (if any; capitalization does not
matter), may contain codes (1–49) or English names for the plotting colours. There are 657 of
the latter; see the menu Data handling|Show available colours or type colours() into the Console.
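The colour names can be inspected directly from the Console. A small sketch using base R only; the vector wanted is an invented example of what a "Colour" column might contain.

```r
# Names of all colours known to R (colors() is an equivalent spelling)
all.colours <- colours()
length(all.colours)      # number of named colours
head(all.colours)        # the first few names

# Check whether hypothetical entries of a "Colour" column are valid names
wanted <- c("darkred", "navy", "notacolour")
wanted %in% all.colours
```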
If specifications of both the plotting symbols and colours are missing completely, and at
least one non-numeric variable is present, the user is asked whether the symbols and colours
should be assigned automatically, according to the levels of the selected label.
Otherwise default symbols (empty black circles) are used.
Note that names of variables are case sensitive in R. However, many of the fully upper-case
names of the oxides/elements are translated automatically, if necessary. Trailing spaces at
the end of the column names are also removed. However, note that, with some exceptions such as
importing from online databases, no duplicated column or sample names are allowed!
The correct syntax is checked upon loading. Apostrophes and other special characters
should be avoided in the sample/variable names.
The data files are practically freeform, i.e. no specified oxides/elements are required and no
exact order of these has to be adhered to. Analyses can contain as many numeric columns as
necessary; the names of oxides and trace elements are self-explanatory (e.g. "SiO2", "Fe2O3",
"Rb", "Nd"). Total iron, if given, should be expressed as ferrous oxide ('FeOt', 'FeOT' or
'FeO*'). Structurally bound water can be named 'H2O.PLUS', 'H2O.P', 'H2O+', 'H2OPLUS' or
'H2O_PLUS'.
Note that names of variables are case sensitive in R. However, any of the fully upper case
names of the oxides/elements that appear in the following list are translated automatically to
the appropriate capitalization:

SiO2, TiO2, Al2O3, Fe2O3, FeO, MnO, MgO, CaO, Na2O, FeOt, Fe2O3t, Li2O, mg#, Li,
Rb, Cs, Be, Sr, Ba, Sc, Y, Ti, Zr, Hf, V, Nb, Ta, Cr, Mo, W, Re, Ru, Os, Co, Rh, Ir, Ni, Pd,
Pt, Cu, Ag, Au, Zn, Cd, B, Ga, In, Tl, C, Ge, Sn, Pb, P, As, Sb, Bi, S, Se, Te, F, Cl, Br, I, At,
La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu.
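The translation of upper-case names can be pictured as a dictionary lookup. The helper fix.case below is hypothetical (it is not a GCDkit function) and uses only a shortened dictionary; it merely illustrates the idea under these assumptions.

```r
# A few entries of the dictionary listed above (shortened for the example)
dictionary <- c("SiO2", "TiO2", "Al2O3", "FeOt", "MgO", "Rb", "Sr", "Nd")

# Hypothetical helper: restore the standard capitalization of names written
# fully in upper case; any other name is left untouched.
fix.case <- function(x, dict = dictionary) {
  x <- sub(" +$", "", x)              # dispose of trailing spaces
  hit <- match(x, toupper(dict))      # position of the upper-case form, or NA
  ifelse(is.na(hit), x, dict[hit])
}

fix.case(c("SIO2", "AL2O3 ", "RB", "Locality", "MgO"))
```

Names that are not fully upper-case forms of dictionary entries, such as "Locality" or the already correct "MgO", pass through unchanged.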

Upon loading, all the completely empty columns are removed first. Any non-numeric items
found in a data column with one of the names listed in the above dictionary are replaced by
'NA' automatically. At the next stage all fully numeric data columns are stored in a numeric
data matrix WR. For any missing major- and minor-element data (SiO2, TiO2, Al2O3, Fe2O3,
FeO, MnO, MgO, CaO, Na2O, K2O, H2O.PLUS, CO2, P2O5, F, S), an empty (NA) column
is created automatically.
The remaining data columns, i.e. all those that are at least partly textual, are transferred into a data
frame labels, which also stores the codes of the plotting symbols and their colours.
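The split between numeric and textual columns can be sketched in a few lines of base R. The data frame dat is invented, and the default symbol and colour values are assumptions made for the example; the actual loading code of GCDkit does considerably more.

```r
# Hypothetical loaded data: two numeric and two textual columns
dat <- data.frame(
  SiO2     = c(65.2, 58.4, 71.0),
  MgO      = c(1.8, 3.9, 0.6),
  Locality = c("Sazava", "Sazava", "Pozary"),
  Rock     = c("tonalite", "diorite", "granite"),
  row.names = c("Sa-1", "Sa-2", "Po-1"),
  stringsAsFactors = FALSE
)

is.num <- sapply(dat, is.numeric)

# Fully numeric columns are stored in a numeric matrix ...
WR <- as.matrix(dat[, is.num, drop = FALSE])

# ... and the rest in a data frame of labels, here with default symbols/colours
labels <- dat[, !is.num, drop = FALSE]
labels$Symbol <- 1            # 1 = empty circle in base R graphics
labels$Colour <- "black"

WR
labels
```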
Subsequently, several extra parameters are calculated (if not already provided by the user),
namely concentrations of some elements from oxides (P, K, Ti, Ba, Sr) in ppm, total iron
expressed as ferrous oxide (FeOt), Shand’s indexes in molar % (A/NK, A/CNK) and
K2O/Na2O ratios. At the same time the analyses are recast to millications (data matrix milli)
and the major-element analyses to an anhydrous basis (data matrix anhydrous).
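Some of these derived parameters are simple stoichiometric recalculations. The sketch below shows one plausible way of obtaining them; the analyses are invented, the molecular weights are rounded, and the exact formulas used inside GCDkit are an assumption here, not taken from its code.

```r
# Hypothetical analyses in wt. %
WR <- cbind(
  SiO2  = c(65.2, 71.0),
  Al2O3 = c(16.1, 14.3),
  Fe2O3 = c(1.2, 0.5),
  FeO   = c(3.1, 1.4),
  CaO   = c(4.2, 1.6),
  Na2O  = c(3.6, 3.4),
  K2O   = c(2.4, 4.9)
)
rownames(WR) <- c("Sa-1", "Po-1")

# Molecular weights (g/mol), rounded
MW <- c(Al2O3 = 101.96, Fe2O3 = 159.69, FeO = 71.84,
        CaO = 56.08, Na2O = 61.98, K2O = 94.20)

# Total iron as FeO: Fe2O3 is converted by the factor 2 * FeO / Fe2O3
FeOt <- WR[, "FeO"] + WR[, "Fe2O3"] * 2 * MW[["FeO"]] / MW[["Fe2O3"]]

# Molar proportions of the oxides needed for Shand's indexes
mol <- sweep(WR[, names(MW)], 2, MW, "/")
A.NK  <- mol[, "Al2O3"] / (mol[, "Na2O"] + mol[, "K2O"])
A.CNK <- mol[, "Al2O3"] / (mol[, "CaO"] + mol[, "Na2O"] + mol[, "K2O"])

# K in ppm from K2O in wt. % (two K atoms per K2O, atomic weight 39.10)
K.ppm <- WR[, "K2O"] * 2 * 39.10 / MW[["K2O"]] * 1e4

round(cbind(FeOt, A.NK, A.CNK,
            "K2O/Na2O" = WR[, "K2O"] / WR[, "Na2O"], K.ppm), 2)
```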

Data editing and grouping


The GCDkit contains basic functions for data input and output, such as pasting, loading and
saving or union of two data files (either appending new samples or new data columns for
samples already present in the system). If need be, the data set can be modified in a simple
spreadsheet. Subsets of the data can be displayed/selected, data columns deleted or new ones
appended. In addition, there is a possibility of editing a single value (in the R jargon a level)
of the given label (a factor).
An important feature of data handling in GCDkit is grouping. Each of the samples can be
assigned to one of the groups that are subsequently utilised by some statistical and plotting
functions. Groups can be defined on the basis of any of the labels (locality, rock type…), a value
of a chosen numerical variable, position in a classification diagram (e.g., TAS) or cluster
analysis. The information concerning the current grouping is stored in a vector groups; the default
grouping after loading a new data file or selecting a subset is by plotting symbol.
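Two of these grouping criteria are easy to imitate in plain R. This is a sketch with invented data; the SiO2 limits and the group names are arbitrary choices for the example, and the menu functions of GCDkit are not used.

```r
WR <- cbind(SiO2 = c(49.5, 58.4, 65.2, 71.0, 74.3))
rownames(WR) <- c("G-1", "D-1", "T-1", "Gr-1", "Gr-2")
labels <- data.frame(Locality = c("A", "A", "B", "B", "B"),
                     row.names = rownames(WR), stringsAsFactors = FALSE)

# Grouping on the basis of an existing label
groups <- labels$Locality
table(groups)

# Grouping by ranges of a numeric variable (limits at 52 and 63 wt. % SiO2)
groups <- as.character(cut(WR[, "SiO2"], breaks = c(0, 52, 63, 100),
                           labels = c("basic", "intermediate", "acid")))
names(groups) <- rownames(WR)
groups

# The groups can then be used by statistics, e.g. mean SiO2 in each group
tapply(WR[, "SiO2"], groups, mean)
```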

Searching and subsetting in GCDkit


Great emphasis in GCDkit is placed on managing the data, in particular searching and selecting
subsets.
1. Searching is useful for temporary selections of samples that are, for instance, to be
displayed or plotted in a diagram. In this case, the data stored in memory are not
affected. The search dialog appears automatically within each relevant function
(plots, norms etc.).
2. Upon selection of a subset (via menu items Select subset by sample name or label,
Select subset by range, Select subset by Boolean), the data in memory are replaced by
their part (subset) fulfilling the given criteria. Still, the changes are temporary as each
time a file is loaded, a backup of the data is kept that can be restored anytime using the
menu Select the whole data set (connected to the function select.all). However, in that
case all changes made e.g. to plotting symbols, grouping etc. would be lost.
The searching and subsetting functions employ powerful regular expressions (see the R help
pages for regex). The syntax of the search patterns is dealt with in considerable detail in the
corresponding manual/help entries.
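The difference between searching and subsetting, and the role of regular expressions, can be illustrated in base R. The data and the object WR.backup are invented for the example; inside GCDkit the same operations are done through the menus and the function select.all.

```r
WR <- cbind(SiO2 = c(49.5, 58.4, 65.2, 71.0),
            MgO  = c(8.1, 3.9, 1.8, 0.6))
rownames(WR) <- c("Sa-1", "Sa-2", "Po-1", "Po-3")
labels <- data.frame(Rock = c("gabbro", "diorite", "tonalite", "granite"),
                     row.names = rownames(WR), stringsAsFactors = FALSE)

# Search by sample name: names starting with "Po"
by.name <- grepl("^Po", rownames(WR))

# Search by label: rock names ending in "ite"
by.label <- grepl("ite$", labels$Rock)

# Search by range and by a Boolean combination of conditions
by.range <- WR[, "SiO2"] >= 55 & WR[, "SiO2"] <= 70
by.bool  <- WR[, "SiO2"] > 60 | WR[, "MgO"] > 5

# A search is a temporary selection: WR itself is not modified
WR[by.range, , drop = FALSE]

# Subsetting replaces the data in memory; a backup allows restoring them
WR.backup <- WR
WR <- WR[by.name, , drop = FALSE]
WR <- WR.backup               # back to the whole data set
```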

Choosing numeric variable(s), calculation core routine


Many functions (e.g. binary plots, some statistics) require a single numeric variable to be
chosen first. The easiest way is to type in the name of the numerical column (e.g., ‘SiO2’) or
its sequence number (2 for the second column). However, it is not necessary to enter the name
in its entirety: a substring that appears somewhere in the column name, or another form of
regular expression, is sufficient.
If the result is ambiguous, the correct variable has to be picked manually from a drop-down
list of the multiple matches. Finally, an empty response invokes a list of all available variables.
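The matching rules described above can be imitated as follows. The function pick.variable is a hypothetical stand-in written for this guide, not the routine used by GCDkit; it simply returns all candidates instead of opening a drop-down list.

```r
# Hypothetical helper: resolve a typed response to column name(s)
pick.variable <- function(x, available) {
  if (!nzchar(x)) return(available)                     # empty: list everything
  if (grepl("^[0-9]+$", x)) return(available[as.integer(x)])  # sequence number
  if (x %in% available) return(x)                       # exact name
  grep(x, available, value = TRUE)                      # substring or regex
}

available <- c("SiO2", "TiO2", "Al2O3", "MgO", "Na2O", "K2O")

pick.variable("2", available)     # second column
pick.variable("Mg", available)    # unique substring
pick.variable("O2$", available)   # ambiguous: two matches to choose from
pick.variable("", available)      # all available variables
```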
As a useful alternative in many instances, the system allows a formula to be entered that can
involve any combination of names of existing numerical columns with constants, brackets,
arithmetic operators +-*/^ and R functions. The most useful of these are ‘sqrt’ (square root),
‘log’ (natural logarithm), ‘log10’ (common logarithm) and ‘exp’ (exponential function).
Also potentially useful are ‘min’, ‘max’ (extremes), ‘length’ (number of elements/cases),
‘sum’, ‘mean’, and ‘prod’ (product of the elements). In any case, nothing prevents
advanced users from writing their own functions that can be invoked here.
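A formula typed as text can be evaluated against the data columns with standard R tools. The helper calc below is hypothetical and only sketches the principle, assuming the data sit in a numeric matrix called WR; the data values are invented.

```r
WR <- cbind(SiO2 = c(58.4, 65.2, 71.0),
            Na2O = c(3.9, 3.6, 3.4),
            K2O  = c(1.7, 2.4, 4.9))
rownames(WR) <- c("Sa-2", "Sa-1", "Po-1")

# Hypothetical helper: evaluate a formula given as a character string,
# looking up the column names in the data matrix
calc <- function(formula, data = WR) {
  eval(parse(text = formula), envir = as.data.frame(data))
}

calc("Na2O + K2O")            # total alkalis
calc("log10(K2O / Na2O)")     # R functions can be used freely
calc("SiO2 - mean(SiO2)")     # deviation from the mean
```

Because the string is parsed as ordinary R code, any function visible in the session, including user-written ones, can appear in it.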
When multiple columns are to be selected (e.g. for Harker plots or correlation plots), the
easiest way is to type in their names directly, separated by commas. Alternatively, a
comma-delimited list of sequence numbers can be used, which may also contain ranges expressed by
colons. User-defined or built-in lists can also be employed, such as 'LILE', 'REE', 'major' and
'HFSE', or their combinations with the column names.
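The three ways of specifying several columns can be combined in one small parser. Again, pick.columns is a hypothetical illustration rather than GCDkit code, and the contents of the lists LILE and REE are shortened, assumed examples.

```r
available <- c("SiO2", "TiO2", "Al2O3", "FeOt", "MgO", "CaO", "Na2O", "K2O",
               "Rb", "Sr", "Ba", "La", "Ce", "Nd")

# Hypothetical predefined lists (shortened)
lists <- list(LILE = c("Rb", "Sr", "Ba"), REE = c("La", "Ce", "Nd"))

# Hypothetical helper: expand a comma-delimited response to column names
pick.columns <- function(x, available, lists) {
  items <- trimws(strsplit(x, ",")[[1]])
  out <- lapply(items, function(it) {
    if (grepl("^[0-9]+(:[0-9]+)?$", it)) {
      # a sequence number or a range such as 3:5
      rng <- as.integer(strsplit(it, ":")[[1]])
      available[seq(rng[1], rng[length(rng)])]
    } else if (it %in% names(lists)) {
      lists[[it]]                      # name of a predefined list
    } else {
      it                               # plain column name
    }
  })
  unique(unlist(out))
}

pick.columns("SiO2, MgO, K2O", available, lists)
pick.columns("1, 3:5", available, lists)
pick.columns("LILE, SiO2", available, lists)
```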

Calculations
All the calculation algorithms produce results that are temporarily stored in memory, in
the variable (a vector, matrix or list) results. Before they are overwritten by the next calculation
procedure, they can be saved or appended to the current dataset for further processing (e.g.,
CIPW normative albite, quartz and orthoclase to be plotted in the Ab–Q–Or ternary).
Most of the normative recalculation algorithms have been adopted from an earlier
QuickBasic program NORMAN (Janousek, 2001). R modules are currently available for
calculation of the CIPW norm, including the modification with biotite and hornblende
(Hutchison, 1974; 1975), Niggli Catanorm (Hutchison, 1974 and references therein), Niggli’s
cationic values (Niggli, 1948), multicationic parameters of the French authors (De La Roche
et al., 1980; Debon and Le Fort, 1983; 1988) and the improved Mesonorm for granitoid rocks
(Mielke and Winkler, 1979). As an optional outcome of the latter, a Q’–ANOR binary plot is
implemented, chemically approximating the modal QAPF classification of igneous
rocks (Streckeisen and Le Maitre, 1979).
The GCDkit provides a menu-driven interface to (a fraction of) statistical functions built
in R. These include simple descriptive statistics, histograms and boxplots (for the whole data
set or individual groups), correlation plots as well as more sophisticated methods of
multivariate statistics (cluster analysis and principal components).
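For orientation, the R functions behind such statistics can also be called directly from the Console. The sketch below uses invented data and standard functions of the stats package; which functions and settings the GCDkit menus actually use is not shown here.

```r
WR <- cbind(SiO2 = c(52.1, 54.3, 55.0, 69.8, 71.2, 73.5),
            MgO  = c(7.9, 6.5, 6.1, 1.2, 0.8, 0.4),
            K2O  = c(0.9, 1.2, 1.3, 4.1, 4.6, 5.0))
rownames(WR) <- paste0("S", 1:6)
groups <- rep(c("mafic", "felsic"), each = 3)

# Descriptive statistics for the whole data set and for individual groups
summary(WR)
aggregate(as.data.frame(WR), by = list(group = groups), FUN = mean)

# Correlation matrix
round(cor(WR), 2)

# Principal components on standardized variables
pca <- prcomp(WR, scale. = TRUE)
summary(pca)

# Hierarchical cluster analysis, cut into two clusters
hc <- hclust(dist(scale(WR)))
cl <- cutree(hc, k = 2)
cl
```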

Plotting
The plotting symbols and colours can be allocated, if need be independently of each other,
according to current grouping or any of the labels. If two distinct criteria for symbols and
colours are chosen, two legends are built.
GCDkit produces publication-quality plots in PostScript that can be easily imported
into other graphical or DTP packages for further editing (simply right-click the graphical
window and select the desired format). All the plots can be saved simultaneously to
PostScript (Save all graphics to PS) or PDF (Save all graphics to PDF).
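Power users can also write a plot to a PostScript or PDF file from the Console using the standard R graphics devices. A minimal sketch with invented data; the file names are arbitrary and the files are placed in a temporary directory.

```r
sio2 <- c(58.4, 65.2, 71.0)
mgo  <- c(3.9, 1.8, 0.6)

# Encapsulated PostScript: one plot per file, page size equal to plot size
ps.file <- file.path(tempdir(), "binary.ps")
postscript(ps.file, width = 6, height = 6, horizontal = FALSE,
           onefile = FALSE, paper = "special")
plot(sio2, mgo, xlab = "SiO2", ylab = "MgO", pch = 1)
dev.off()

# The same plot as PDF
pdf.file <- file.path(tempdir(), "binary.pdf")
pdf(pdf.file, width = 6, height = 6)
plot(sio2, mgo, xlab = "SiO2", ylab = "MgO", pch = 1)
dev.off()

file.exists(c(ps.file, pdf.file))
```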
The strength of GCDkit lies in the wealth of built-in plots, most of which (the
stand-alone, i.e. not multiple, diagrams) are defined as templates compatible with Figaro, a set of
graphical utilities for R.
Figaro provides means to create figure objects, which contain both the data and methods to
make changes to the figure (via the menu Plot editing). So, for example, the title can be
changed or the histogram fill colour altered and any changes are automatically made visible
on interactive devices. In addition one can zoom in and out of the data. Figaro objects
currently permit the editing of the text, font, size and colour of the main title, sub title and
axis labels; colour, size, and symbol of points; colour, line type, and width of lines. Thus
Figaro provides a degree of interactive editing before committing to hardcopy.

Future

Our long-term aim is to build a single, coherent and eventually platform-independent system
with high-level plotting capabilities for interpretation of whole-rock geochemical data that
would be straightforward to use by ordinary users but, at the same time, easily expandable by
the more demanding ones. In the near future, GCDkit should be expanded by functions for
direct and inverse modelling of the main petrogenetic processes (fractional crystallization,
AFC, binary mixing). These are ready but not yet stable enough to be released.

Enjoy! On behalf of the authors

Vojtech Janousek
[email protected]

Prague, 1 April 2006

http://www.gla.ac.uk/gcdkit
References

Carr, M. (1995). Program IgPet. Terra Softa, Somerset, New Jersey, U.S.A.
Clarke, D., Mengel, F., Coish, R. A. & Kosinowski, M. H. F. (1994). NewPet for DOS,
version 94.01.07. Department of Earth Sciences, Memorial University of
Newfoundland, Canada.
De La Roche, H., Leterrier, J., Grandclaude, P. & Marchal, M. (1980). A classification of
volcanic and plutonic rocks using R1R2-diagram and major element analyses – its
relationships with current nomenclature. Chemical Geology 29, 183–210.
Debon, F. & Le Fort, P. (1983). A chemical–mineralogical classification of common plutonic
rocks and associations. Transactions of the Royal Society of Edinburgh, Earth Sciences
73, 135–149.
Debon, F. & Le Fort, P. (1988). A cationic classification of common plutonic rocks and their
magmatic associations: principles, method, applications. Bulletin de Minéralogie 111,
493–510.
Hutchison, C. S. (1974). Laboratory Handbook of Petrographic Techniques. New York: John
Wiley & Sons.
Hutchison, C. S. (1975). The norm, its variations, their calculation and relationships.
Schweizerische mineralogische und petrografische Mitteilungen 55, 243–256.
Janoušek, V. (2001). Norman, a QuickBasic programme for petrochemical re-calculation of
whole-rock major-element analyses on IBM PC. Journal of the Czech Geological
Society 46, 9–13.
Mielke, P. & Winkler, H. G. F. (1979). Eine bessere Berechnung der Mesonorm für
granitische Gesteine. Neues Jahrbuch für Mineralogie, Monatshefte 471–480.
Niggli, P. (1948). Gesteine und Minerallagerstätten. Basel: Birkhäuser.
Petrelli, M., Poli, G., Perugini, D. & Peccerillo, A. (2005). PetroGraph: A new software to
visualize, model, and present geochemical data in igneous petrology. Geochemistry
Geophysics Geosystems 6, 1–15.
Richard, L. R. (1995). MinPet: Mineralogical and Petrological Data Processing System,
Version 2.02. MinPet Geological Software, Québec, Canada.
Streckeisen, A. & Le Maitre, R. W. (1979). A chemical approximation to the modal QAPF
classification of the igneous rocks. Neues Jahrbuch für Mineralogie, Abhandlungen 136,
169–206.
