Churn Prediction in Telecom Industry Using R: Manpreet Kaur, Dr. Prerna Mahajan
clustering, classification and regression form the four techniques used by data mining.

In data mining, new rules and patterns can be discovered by the system, which is known as discovery oriented, and the system can also check the user's hypothesis, which is called verification oriented. It helps in taking knowledge-driven decisions and in predicting the future trends of the business.

2.2. J48 Decision Tree Technique
J48 construction is like a flow-chart. A test applied on an attribute is denoted by an internal node, its outcome is denoted by a branch, and class labels are represented by leaf nodes. The process is divided into two levels. First, during tree construction, the root is divided recursively, based on the selection of an attribute for all training examples. Second, branches that reflect noise or outliers are identified and removed by tree pruning. Rules can be extracted from the tree. If-then statements are used to represent the knowledge. For each path from the root to a leaf, one rule is created.

Here we use J48 for the churn dataset. The attribute whose value has to be predicted is known as the dependent variable. Its value is decided by the values of other attributes. The attributes that predict the value of the dependent variable are known as independent variables.

2.3. Tool Used: A Revolution Analytics Tool - R
In the past few years, fast-emerging requirements from both academia and industry have helped the R programming language to emerge as one of the necessary tools for visualization, computational statistics and data science. R is most popular in the field of data science and is important in finance and analytics-driven companies.
R provides a very wide range of the statistical models, data manipulation tools and charts required by a modern-day scientist. One can easily use the best reviewed methods from leading researchers in the field of data science without any cost. It provides a large collection of graphical and statistical techniques, including modelling (linear and non-linear), statistical tests, time-series analysis, classification, clustering, etc.
R helps in representing complex data as beautiful and unique data visualizations. Evaluation of results in R is much easier, as we do not have to remember any clicks or steps; it is simply a programming language designed specifically for data analysis that also has the capability to mix and match models for best results.
As R is supported by a large community worldwide, solutions to errors and code are freely available. Its source code is written in C, Fortran and R. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. R is open source and can be extended easily, as individuals using it can contribute to its growth. Dynamic and static graphics are available through additional packages. R can easily deal with complex and large datasets.

The libraries and packages of R that are being used in this paper are: RWeka, ggplot2, rpart, rJava, class.

2.4. Related Literature
A churn customer is one who leaves the existing company and becomes a customer of a competitor company. The management that was assumed to determine the customer turnover is called churn management (Hadden, Tiwari, Roy and Ruta, 2007). Customer movement from one provider to another in the telecommunication industry is called customer churn, and the operator's process to retain profitable customers is called churn management (Berson, Smith & Thearling, 2000) [13].

2.5. Data Set Used
The attributes in our data are taken from the Orange Database.

Table I: Orange Dataset Attributes

S.No.  Attribute name
1      State
2      Account.Length
3      Area.Code
4      Phone
5      Int.l.Plan
6      VMail.Plan
7      VMail.Message
8      Day.Mins
9      Day.Calls
10     Day.Charge
11     Eve.Mins
12     Eve.Calls
13     Eve.Charge
14     Night.Mins
15     Night.Calls
16     Night.Charge
17     Intl.Mins
18     Intl.Calls
19     Intl.Charge
20     CustServ.Calls
21     Churn.

III. ALGORITHM AND LIBRARIES USED

3.1. J48 Algorithm
J48(formula, data, subset, control = Weka_control())
predict is a generic function for predictions from the results of model fitting functions.

3.1.1. Steps:
Step 1. A flow-chart-like tree structure is built. An internal node denotes a test on an attribute. The outcome of the test is represented by a branch. Class labels are represented by leaf nodes.
Step 2. Decision tree generation comprises two phases. Tree construction: at the start, the root contains all the training examples. Tree pruning: branches that reflect noise and outliers are identified and removed.
Step 3. The decision tree is used to classify an unknown sample. Attribute values of the sample are tested against the decision tree.
Step 4. When all samples for a given node belong to the same class, or there are no remaining attributes for further partitioning, the partitioning is stopped.
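The J48 steps of Section 3.1.1 (construction, classification of an unknown sample, and one if-then rule per root-to-leaf path) can be illustrated with a minimal sketch. It is not the paper's J48 run. It assumes rpart as a stand-in for J48 so that no Java setup is needed, and it uses a made-up data frame `toy` whose column names only mimic the Orange attributes.

```r
library(rpart)  # assumption: rpart used as a stand-in for RWeka::J48

# Hypothetical toy data (NOT the Orange dataset): churn depends only on service calls.
set.seed(1)
n <- 200
toy <- data.frame(
  CustServ.Calls = sample(0:9, n, replace = TRUE),
  Day.Mins       = round(runif(n, 50, 350), 1)
)
toy$Churn. <- factor(ifelse(toy$CustServ.Calls >= 4, "True.", "False."))

# Steps 1-2: tree construction starts with all training examples at the root.
fit <- rpart(Churn. ~ CustServ.Calls + Day.Mins, data = toy, method = "class")

# Step 3: classify an unknown sample by testing its attribute values against the tree.
unknown <- data.frame(CustServ.Calls = 6, Day.Mins = 120)
pred <- predict(fit, newdata = unknown, type = "class")

# Rule extraction: collect the split conditions along each root-to-leaf path.
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])
rules  <- path.rpart(fit, nodes = leaves, print.it = FALSE)
```

Each element of `rules` is the sequence of split conditions leading to one leaf, which can be read as the "if" part of a rule. The leaf's class is the "then" part.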
International Journal of Engineering and Technical Research (IJETR)
ISSN: 2321-0869, Volume-3, Issue-5, May 2015
3.1.2. Extracting Classification Rules from Trees

Table II: Description of Data Set Attribute
churn<-read.csv("C:\\Users\\Documents\\R\\win-library\\3.1\\RWeka\\R\\churn.csv", header=T)
> names(churn)
 [1] "State"          "Account.Length" "Area.Code"      "Phone"
 [5] "Int.l.Plan"     "VMail.Plan"     "VMail.Message"  "Day.Mins"
 [9] "Day.Calls"      "Day.Charge"     "Eve.Mins"       "Eve.Calls"
[13] "Eve.Charge"     "Night.Mins"     "Night.Calls"    "Night.Charge"
[17] "Intl.Mins"      "Intl.Calls"     "Intl.Charge"    "CustServ.Calls"
[21] "Churn."
4.3. Description of complete data Set
4.5. Decision Tree for Churn (using J48)
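library(RWeka)  # provides J48 (needs rJava)
m2 <- J48(`Churn.` ~ ., data = churn)  # predict Churn. from all other attributes
m2  # print the fitted tree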
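library(rpart)
# Classification tree on the call-count attributes
f <- rpart(Churn. ~ CustServ.Calls + Eve.Calls + Intl.Calls + Night.Calls + Day.Calls,
           method = "class", data = churn)
plot(f, uniform = TRUE, main = "Classification Tree for Churn")
# Label every node (all = TRUE) and show observation counts (use.n = TRUE)
text(f, use.n = TRUE, all = TRUE, cex = .7)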
Fig. 2 shows the classification tree built from all the call-count attributes in the churn dataset. Each decision is made on the basis of a number of calls, and the class is the churn factor, which has the values true and false.
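library(rpart)
# Classification tree on the charge attributes plus customer service calls
f <- rpart(Churn. ~ CustServ.Calls + Eve.Charge + Intl.Charge + Night.Charge + Day.Charge,
           method = "class", data = churn)
# Plot the cross-validated error against the complexity parameter cp
plotcp(f, lty = 4, col = "red")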
Fig. 3 shows the output of plotcp for the set of possible cost-complexity prunings of the tree, which form a nested set. rpart has already performed cross-validation for the geometric means of the intervals of cp values over which each pruning is optimal. The cptable in f stores the mean and standard deviation of the cross-validated prediction errors for each of these geometric means, and these are plotted by this function.
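The cptable described for Fig. 3 can also be used to choose a pruning directly. The following is a minimal sketch under stated assumptions: the data frame `toy` is made up (it is not the Orange dataset), and picking the cp with the lowest cross-validated error is one common convention, not necessarily the choice made in this paper.

```r
library(rpart)

# Hypothetical toy data with noisy labels, so that a large tree overfits.
set.seed(42)
n <- 500
toy <- data.frame(
  CustServ.Calls = sample(0:9, n, replace = TRUE),
  Day.Charge     = round(runif(n, 5, 60), 2)
)
p_churn <- ifelse(toy$CustServ.Calls >= 4, 0.8, 0.1)
toy$Churn. <- factor(ifelse(runif(n) < p_churn, "True.", "False."))

# Grow a deliberately large tree so there is something to prune.
f <- rpart(Churn. ~ CustServ.Calls + Day.Charge, data = toy, method = "class",
           control = rpart.control(cp = 0.001, minsplit = 10))

# cptable columns: CP, nsplit, rel error, xerror (cross-validated error), xstd.
cp_tab  <- f$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree associated with that cp value.
f_pruned <- prune(f, cp = best_cp)
```

Calling `plotcp(f)` on the unpruned tree draws the same xerror values, with xstd as error bars.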
Fig. 1 depicts the churn values from the table formed by predicting the churn parameter with the J48 decision tree.
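A table of predicted against actual churn values, of the kind summarized in Fig. 1, could be produced along the following lines. This is only a sketch: it assumes rpart in place of J48 to avoid the Java dependency, and the data frame `toy` and its train/test split are hypothetical.

```r
library(rpart)  # assumption: rpart used instead of RWeka::J48

# Hypothetical toy data (not the Orange dataset).
set.seed(7)
n <- 300
toy <- data.frame(
  CustServ.Calls = sample(0:9, n, replace = TRUE),
  Intl.Calls     = sample(0:20, n, replace = TRUE)
)
toy$Churn. <- factor(ifelse(toy$CustServ.Calls >= 4, "True.", "False."))

# Fit on the first 200 rows and predict the remaining 100.
train <- toy[1:200, ]
test  <- toy[201:300, ]
fit   <- rpart(Churn. ~ ., data = train, method = "class")
pred  <- predict(fit, newdata = test, type = "class")

# Cross-tabulate actual against predicted churn and compute the accuracy.
conf     <- table(Actual = test$Churn., Predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)
```

The diagonal of `conf` counts the correctly classified customers, and the off-diagonal cells count missed churners and false alarms.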
4.11. Line Charts
4.11.1. Line Chart for Day Calls and Customer Service Calls
Fig. 7 shows the line chart of Day calls and Customer Service calls, using call counts as the range and taking the Churn factor into account. The number of churns increases as the number of customer service calls increases.
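The trend noted for Fig. 7 can also be checked numerically by computing the share of churners for each number of customer service calls. The sketch below uses a small made-up data frame `toy`, not the Orange dataset. On the real data the same `tapply` call would be applied to `churn`, assuming the churn column holds values such as "True." and "False.".

```r
# Hypothetical toy data (not the Orange dataset).
toy <- data.frame(
  CustServ.Calls = c(0, 0, 1, 1, 1, 2, 4, 4, 5, 5),
  Churn.         = c("False.", "False.", "False.", "False.", "True.",
                     "False.", "True.", "False.", "True.", "True.")
)

# Share of churners for each number of customer service calls.
churn_rate <- tapply(toy$Churn. == "True.", toy$CustServ.Calls, mean)
```

The result is a named vector indexed by the number of calls, so a rising pattern in its values corresponds to the rising churn seen in the line chart.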
Fig. 9 shows the relationship between the number of night calls and the number of day calls. We observe that the points are relatively dense in the same area, whereas the third parameter, the Churn factor, represented by the color, shows that True churns are fewer in number. The line in the graph is the smoother, which depicts the trend followed by the data. Here it depicts relatively similar numbers of night calls and day calls.

Figure 9: Relativity in Night Calls and Day Calls with churn factor using smooth curve

Fig. 11 shows the relationship between the number of customer service calls and the number of day calls on the subset of the data. The third parameter is the Churn factor and the fourth is the Area Code. The facets represent the third and fourth parameters. We can observe the churns in a particular area code with respect to the number of day calls and customer service calls.

Fig. 12 shows the graph of day calls and the churn factor on the subset of the data. The third parameter, state, represented by the color, shows the churns in various states.
Fig. 13 shows the relationship between Night minutes and Area code. We see that no calls are made in area codes between 425 and 500.

Figure 14: Alpha Filter

4.12.8. qplot(Day.Calls, data = dsc, geom = "histogram", fill = Churn.)
Fig. 15 shows the histogram of Day Calls, in which the color is set using the Churn factor.

4.12.9. qplot(Night.Calls, data = dsc, geom = "histogram", fill = Int.l.Plan)
Fig. 16 shows the histogram of Night Calls, in which the color is set using the Churn factor.

V. CONCLUSION
The proposed research has used a data mining technique and R packages to predict churn customers on the benchmark Churn dataset available at http://www.sgi.com/tech/mlc/db/ and http://www.dataminingconsultant.com/data/churn.txt. It has evaluated the number of churns using the J48 tree classification technique. The R tool has represented the large churn dataset in the form of graphs, which depict the outcomes vividly and in a unique pattern visualization manner. The Churn factor is used in many functions to depict the various areas or scenarios in which the churn rate is high. The study finds a large deviation in the graph of churners when customer service calls are measured. The graphs are made taking the churn factor as the deciding parameter. The graphs represent the different ways of observing the number of churners in the dataset. Once the root area is recognized, steps can be taken by the telecom company to improve its services and retain its existing customers.

REFERENCES