En Tanagra Scilab Data Mining PDF
Ricco Rakotomalala
1 Topic
Data Mining with Scilab.
I have known the name "Scilab" for a long time (http://www.scilab.org/en). For me, it was a tool for numerical analysis, and it did not seem interesting in the context of statistical data processing and data mining. Recently, a mathematician colleague spoke to me about this tool. He was surprised by the low visibility of Scilab within the data mining community, given that it offers functionality quite similar to that of the R software. I confess that I did not know Scilab from this perspective. I decided to study it by setting a basic goal: is it possible to perform a predictive analysis process simply with Scilab? Namely: loading a data file (learning sample), building a predictive model, obtaining a description of its characteristics, loading a test sample, applying the model to this second set of data, building the confusion matrix and calculating the test error rate.
We will see in this tutorial that the whole task was completed easily. Scilab is perfectly able to carry out statistical treatments. Two small drawbacks appeared while getting started with Scilab: the library of statistical functions exists but is not as comprehensive as that of R, and its documentation is not very extensive at this time. However, I am very satisfied with this first experience. I discovered an excellent free tool, flexible and efficient, very easy to get started with, which turns out to be a credible alternative to R in the field of data mining.
2 Scilab
2.1 What is Scilab?
The objects provided by the statistical procedures have properties that we can use in
subsequent calculations.
http://help.scilab.org/docs/5.4.1/fr_FR/csvRead.html
7 January 2014
Page 1
An importing mechanism makes their installation and use easier (ATOMS: AuTomatic mOdules Management for Scilab - http://atoms.scilab.org/).
I feel like I am describing the specifications of R! At this stage, I think that, apart from learning the syntax of a new programming language, identifying and understanding the commands necessary for my analysis should not be too difficult.
2.2 Installation
For Windows, a setup file performs the whole installation. I downloaded the 5.4.1 version for Windows 64-bit. Other versions for other platforms are available.
2.3 Launching Scilab
(1) This is a file browser. (2) The command console enables us to enter instructions and visualize the results. (3) The objects available in memory are shown here. (4) The list of the last entered commands is available here. From a certain perspective, the organization of the workspace is similar to that of specialized editors for R such as RStudio.
For batch processing, it is better to use the code editor, SCINOTES. We click on the APPLICATIONS / SCINOTES menu. A new editor window appears. It enables us to handle Scilab program files (*.sce).
We use the help command to get help. E.g. help(csvRead) shows the description of the corresponding command.
2.4 Toolboxes
Scilab provides a mechanism for managing packages ('toolboxes'). It enables us to enrich its library of numerical methods. This kind of feature is, among others, one of the reasons for the great success of R in recent years. Scilab can achieve the same success. We nevertheless note the relative scarcity of modules for data mining and statistical computing available to date and, more annoying in my opinion, the almost total lack of detailed documentation about them.
The list of available toolboxes can be found on the ATOMS website.
We can also obtain the same list with the atomsGui() command.
A window for managing toolboxes appears. We observe for instance that the "NaN" and "libsvm" libraries for statistics and machine learning are available. We can use this tool to install a toolbox. We can also use the atomsInstall("toolbox name") and atomsLoad("toolbox name") commands. The installed toolboxes are automatically loaded at each startup of Scilab.
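For instance, installing and loading the NaN toolbox from the console can be sketched as follows (a sketch; it assumes an internet connection to reach the ATOMS repository, and the exact module identifier should be checked with atomsGui()):

```scilab
// install the toolbox from the ATOMS repository (one-time operation)
atomsInstall("nan")
// load it into the current session (done automatically at the next startups)
atomsLoad("nan")
```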
3.1 Dataset
We use the BREAST-CANCER-WISCONSIN2 dataset in this tutorial. There are 9 continuous descriptors and the class attribute CLASSE (2 values: BEGNIN and MALIGNANT, as spelled in the data file). We have subdivided the dataset into two parts: the train set with 399 instances (breast-train.txt) and the test set with 300 instances (breast-test.txt). The columns are separated by the tab character. We show below the first rows of the breast-train.txt data file.
3.2 Importing the data file
We load the data file into the DTrain matrix using the csvRead() command. The last parameter defines the data range within the file: from the 2nd row and the 1st column to the 400th row and the 10th column. For now, we consider that all the values are in the character format.
//loading the data file
DTrain=csvRead("breast-train.txt","\t",".","string",[],[],[2 1 400 10])
We obtain:
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
We cannot use the data in the character format. We must tell Scilab that the descriptors are numeric and that the class attribute (the 10th column) has 2 distinct values {1, 2}. For that, we write the recodage() function. We can reuse it for the test sample. The values of the descriptors are stored in a matrix; the values of the class attribute are stored in a vector.
//user-defined function for data coding
function [descriptors, target]=recodage(D)
    //transforming the values of the descriptors into a numeric type
    descriptors = evstr(D(:,1:9))
    //retrieving the last column (10th column)
    y = D(:,10)
    //coding the class attribute with 1 and 2
    n = size(y,1) //n: number of instances
    grp = zeros(n,1) //creating a vector of zeros
    for i=1:n
        if y(i)=="begnin" then
            grp(i) = 1
        else
            grp(i) = 2
        end
    end
    target = grp
endfunction
XTrain is the matrix of the descriptors; CTrain is a vector corresponding to the class attribute. The tabul() function computes the frequency distribution of the class attribute.
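The corresponding commands, shown as a screenshot in the original document, can be sketched as:

```scilab
// coding the learning sample with our recodage() function
[XTrain, CTrain] = recodage(DTrain)
// frequency distribution of the class attribute
disp(tabul(CTrain))
```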
3.3 Learning phase: linear discriminant analysis
You must install and load the 'NaN' toolbox to be able to continue. It will subsequently be loaded automatically each time you start Scilab. We want to use linear discriminant analysis (LD2) as a first step. The library provides several functions for predictive discriminant analysis. I confess I do not know the differences between them. Unfortunately, the Scilab help describes only the use of the procedures, not the underlying methods [cf. help(nan_train_sc)].
We set the following command.
//learning phase - LD2
modelLD2=nan_train_sc(XTrain,CTrain,'LD2')
disp(modelLD2)
modelLD2 is an object provided by the nan_train_sc() function. Various properties are available. To obtain the weights of the classification functions, we set:
disp(modelLD2.weights)
We do not obtain the same values as other tools such as SAS or Tanagra. Because the help file does not provide more information about the underlying method, we cannot really comment further here.
3.4 Applying the classifier on the test set
We must perform several operations to apply the classifier to the test set and evaluate its performance: (1) loading the test sample; (2) coding the values using the same process as for the learning sample; (3) creating the prediction column by applying the classifier to the test set; (4) computing the confusion matrix from the predicted and observed values of the class attribute; (5) calculating the error rate from the confusion matrix.
For the steps (1) and (2), we use the same commands as for the learning sample.
//loading the test sample
DTest=csvRead("breast-test.txt","\t",".","string",[],[],[2 1 301 10])
//coding
[XTest,CTest]=recodage(DTest)
disp(tabul(CTest))
The distribution of the target variable is very similar to that observed for the learning sample. This is, among other things, a simple way to verify that the subdivision of the dataset was carried out randomly.
We define a new function for the steps (3), (4) and (5).
//function for the evaluation of a classifier
//inputs: classifier (object from the NaN toolbox),
//        values of the descriptors (matrix),
//        values of the class attribute (vector)
function [MC, ERR_RATE]=test_classifier(classifier, descriptors, target)
    //prediction
    pred = nan_test_sc(classifier, descriptors)
    //confusion matrix
    MC = nan_confusionmat(target, pred.classlabel)
    //error rate
    ERR_RATE = 1.0 - sum(diag(MC))/sum(MC)
endfunction
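Applying this function to the LD2 model (the call itself appears as a screenshot in the original document) can be sketched as:

```scilab
// evaluation of the LD2 classifier on the test sample
[mcLD2, errLD2] = test_classifier(modelLD2, XTest, CTest)
disp(mcLD2)  // confusion matrix
disp(errLD2) // test error rate
```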
7 instances out of 300 are misclassified; the test error rate is 2.33%.
3.5 Naive Bayes classifier
We have implemented all the elements which allow us to apply the same process with another learning method. In this section, we build and evaluate the naive Bayes classifier ('NBC') provided by the "NaN" toolbox.
//learning phase - Naive Bayes Classifier
modelNBC=nan_train_sc(XTrain,CTrain,'NBC')
//evaluation
[mc,err_rate] = test_classifier(modelNBC,XTest,CTest)
3.6 Linear SVM with the libsvm toolbox
The famous LIBSVM3 library is also available for Scilab. We perform the train-test schema using a linear SVM classifier in this section. Of course, we must install and load the toolbox as a first step.
//libsvm toolbox - learning phase
//linear kernel SVM (-t 0)
modelLINSVM=libsvm_svmtrain(CTrain,XTrain,'-t 0')
//prediction on the test sample
[predLINSVM]=libsvm_svmpredict(CTest,XTest,modelLINSVM)
//confusion matrix
mc=libsvm_confmat(CTest,predLINSVM)
As output of the learning function [libsvm_svmtrain], Scilab provides the list of the support vectors (SVs) and the associated weights (sv_coef).
From the prediction column provided by [libsvm_svmpredict], we compute the confusion matrix using the predefined [libsvm_confmat] function. Here is the confusion matrix.
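If needed, the test error rate can be derived from this confusion matrix with the same formula used for the NaN toolbox (a sketch):

```scilab
// error rate = 1 - accuracy, computed from the confusion matrix mc
err_rate = 1.0 - sum(diag(mc))/sum(mc)
```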
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
We must change the current directory before running the command shown in the screenshot.
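Changing the current directory can be done with the cd() command; the path below is hypothetical:

```scilab
// point Scilab to the folder containing the data files (hypothetical path)
cd("D:\DataMining\Databases")
disp(pwd()) // check the current directory
```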
5 Conclusion
Scilab is fully suited to the data mining process. It provides all the tools for handling a dataset and carrying out a statistical analysis. In addition, if we are accustomed to R (or a similar programming language), learning the syntax of Scilab is easy.
Nevertheless, there are some weaknesses that we highlight below:
1. The toolboxes for statistical processing and data mining remain few in number, particularly compared with the packages for R (as of 2013/11/04, 4986 official packages were available for R: http://cran.univ-lyon1.fr/web/packages/index.html).
2. The toolboxes are not well documented. I have primarily used the source code of the macros to
learn how to use the functions described in this tutorial (e.g. nan_train_sc.sci).
3. The outputs of the functions (from "NaN" and "libsvm" at least) are not really well detailed. This is a real obstacle to their use.
4. The data management must be improved. For instance, a special data type for handling categorical variables (such as the factor type in R) is needed.
None of this is insurmountable. The number of libraries will grow, and the existing toolboxes will be enhanced over the years. As for the documentation, we can all contribute to its improvement.