Final - Report TU

Download as pdf or txt
Download as pdf or txt
You are on page 1of 96

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

Major Project Report on

CORPORATE DECISION SUPPORT SYSTEM FOR TELECOMMUNICATIONS

By:

Aashish Acharya (066 BCT 501)

Santosh Ghimire (066 BCT 533)

Sudip Kafle (066 BCT 539)

Supritam Raj Shrestha (066 BCT 542)

A PROJECT WAS SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

AND COMPUTER ENGINEERING IN PARTIAL FULLFILLMENT OF THE

REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

LALITPUR, NEPAL

August 26, 2013


ii

TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

Major Project Report on

CORPORATE DECISION SUPPORT SYSTEM FOR TELECOMMUNICATIONS

By:

Aashish Acharya (066 BCT 501)

Santosh Ghimire (066 BCT 533)

Sudip Kafle (066 BCT 539)

Supritam Raj Shrestha (066 BCT 542)

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

LALITPUR, NEPAL

August 26, 2013


iii

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommended to the Institute of
Engineering for acceptance, a project report entitled CORPORATE DECISION SUPPORT
SYSTEM FOR TELECOMMUNICATIONS submitted by Aashish Acharya, Santosh
Ghimire, Sudip Kafle and Supritam Raj Shrestha in partial fulfilment of the requirements
for the Bachelor’s degree in Computer Engineering.

_________________________________________________
Supervisor, Dr. Nanda Bikram Adhikari
Assistant Professor
Department of Electronics and Computer Engineering

_________________________________________________
Co-Supervisor, Er. Manish Subedi
Team Lead
Business Intelligence Department, Yomari Inc.

_________________________________________________
Internal Examiner, Dr. Arun Timilsina
Head
Department of Electronics and Computer Engineering, Pulchowk Campus

_________________________________________________
External Examiner, Er. Krishna Prasad Bhandari
Sr. Engineer
Nepal Telecom

Date of Approval: August 26, 2013


iv

COPYRIGHT

The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purpose may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that the recognition will
be given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying or publication or the other use of this report for financial gain
without approval of to the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:

Head
Department of Electronics and Computer Engineering

Pulchowk Campus, Institute of Engineering

Lalitpur,Nepal
v

ACKNOWLEDGEMENT

We extend our sincere gratitude to Dr. Arun Timalsina and Dr. Aman Shakya for providing
the necessary guidelines and support for understating the feasibility and technical aspects
for the project. We would like to hearty thank our supervisor Asst. Professor Dr. Nanda
Bikram Adhikari for helping us in the project in every ways. Also, our special thanks goes
to our co-supervisor Er. Manish Subedi and Yomari Inc. and for providing us internship
and guiding us through the project. We would also like to express our gratitude to Dr.
Sanjeeb Pd. Panday for guiding us and helping us work with documentation works and
refinement of our proposal. We would also like to thank Pulchowk Campus and
Department of Electronics and Computer Engineering for providing us the opportunity to
build up the project. Last but not the least; we would like to thank our friends and seniors
for helping us in the project. Thanks for making the difference.

Aashish Acharya (066 BCT 501)

Santosh Ghimire (066 BCT 533)

Sudip Kafle (066 BCT 539)

Supritam Raj Shrestha (066 BCT 542)


vi

SUMMARY

The telecommunication market today is competitive. In order to understand the customers


of the company and to create a better environment for the management of customer
relationships, many telecommunications companies use call detail analysis to achieve
competitive advantage. By understanding customers’ characteristics and behavior, it is
possible to successfully tailor services that meet customers’ needs. With the project
proposal entitled Corporate Decision Support System for Telecommunications (CDSST),
proposed to and accepted by the Department of Electronics and Computer Engineering,
IOE Pulchowk Campus, we are able to provide a new base of analysis through
Microstrategy reporting tools to analyze the customer behaviors and improve the
marketing campaigns of the telecom industry. Additionally, our project CDSST helps
detect loss by the analysis of the call data record generated by different switches used in
the communication network. We have successfully developed a new enterprise level data
model that can handle a huge amount of data using Teradata database. The final result of
this project is a data warehouse. We have been able to create various consistent and
fundamental data models for various dimensions. These dimensions act as the stage to
showcase the various forms of data relevant to the telecommunication industry. These data
have been interpreted as the facts. The data in our data marts are used extensively to dig
out the past and the present status of the corporate industry in terms of audience
penetration and market regime of the company. The result set generated by our system are
presented as a report and can be viewed in a dashboard. Further the source data has been
undertaken through a serious of steps of Data Driven Discovery methods. The mining of
data has enabled us to make future predictions of the trends in the telecommunication
industry. We have successfully created prediction and descriptive models of the data that
best describe and/or predict the behavior of the data. The prediction generated by our
system can also be viewed as a part of the dashboard, along with the graphs, charts and
figures.

Keywords: Call Detail Record (CDR), Data warehouse, Extract-Transform-Load (ETL),


Telecommunication, Churn Prediction, Classification and Regression Tree (CART),
Customer Segmentation, RFM Clustering.
vii

TABLE OF CONTENTS
ACKNOWLEDGEMENT ..................................................................................................... v

SUMMARY .......................................................................................................................... vi

LIST OF FIGURES ..............................................................................................................xii

LIST OF TABLES ...............................................................................................................xiv

LIST OF ABBREVIATIONS .............................................................................................. xv

1. INTRODUCTION ......................................................................................................... 1

1.1. Project Background ................................................................................................. 1

1.2. Project Overview ..................................................................................................... 3

1.3. Aims and Objectives ............................................................................................... 3

1.3.1. General Objectives ........................................................................................... 3

1.3.2. Specific Objectives .......................................................................................... 4

1.4. Scope ....................................................................................................................... 5

1.4.1. Problem Statement ........................................................................................... 5

1.4.2. Offered Solution:.............................................................................................. 5

1.4.3. Within Scope.................................................................................................... 5

1.4.4. Not in Scope..................................................................................................... 6

1.5. Organization of Report ............................................................................................ 6

2. Literature Review........................................................................................................... 8

2.1. Telecommunication Data Structure......................................................................... 8

2.2. Data Warehouse and ETL Process .......................................................................... 9

2.3. Data Mining and Knowledge Discovery ............................................................... 10

2.3.1. Customer Segmentation ................................................................................. 10

2.3.2. Churn Prediction ............................................................................................ 11

3. THEORITICAL BACKGROUND .............................................................................. 13


viii

3.1. Business Intelligence and Reporting ..................................................................... 13

3.2. Data Mining and Knowledge Discovery ............................................................... 13

3.2.1. Customer Segmentation ................................................................................. 13

3.2.2. Churn Prediction ............................................................................................ 14

4. TECHNICAL BACKGROUND.................................................................................. 16

4.1. Extraction, Transformation and Loading .............................................................. 16

4.1.1. Extraction ....................................................................................................... 16

4.1.2. Transformation............................................................................................... 16

4.1.3. Loading .......................................................................................................... 17

4.2. Business Intelligence and Reporting ..................................................................... 17

4.2.1. Creating Attributes, Facts and Metrics .......................................................... 18

4.2.2. Creating User Hierarchies .............................................................................. 19

4.2.3. Reports, Prompts and Filters .......................................................................... 20

4.2.4. Intelligent Cube, Caching and Dashboard Creation ...................................... 20

4.3. Customer Segmentation: RFM Approach ............................................................. 22

4.3.1. Other Approaches .......................................................................................... 22

4.3.2. Why RFM Approach? .................................................................................... 22

4.4. Churn Prediction ................................................................................................... 24

4.4.1. Use of decision trees in Churn prediction ...................................................... 25

5. SYSTEM ANALYSIS ................................................................................................. 27

5.1. Basic System Operation: ....................................................................................... 27

5.1.1. Stage 1............................................................................................................ 27

5.1.2. Stage 2............................................................................................................ 27

5.1.3. Stage 3............................................................................................................ 27

5.1.4. Stage 4............................................................................................................ 27


ix

5.2. Requirement Specification .................................................................................... 28

5.2.1. Functional Requirements: .............................................................................. 28

5.2.2. Non Functional Requirements ....................................................................... 29

5.3. Feasibility Assessment .......................................................................................... 29

5.3.1. Technical Feasibility ...................................................................................... 29

5.3.2. Economic Feasibility...................................................................................... 30

5.3.3. Operational Feasibility ................................................................................... 30

6. DETAILED DESIGN .................................................................................................. 33

6.1. Overview ............................................................................................................... 33

6.1.1. Data source block........................................................................................... 34

6.1.2. ETL block ...................................................................................................... 34

6.1.3. Data warehouse block .................................................................................... 34

6.1.4. Dashboard block ............................................................................................ 34

6.1.5. Data mining block .......................................................................................... 35

6.2. System Diagrams................................................................................................... 35

6.2.1. Component Diagram ...................................................................................... 35

6.2.2. Sequence Diagram ......................................................................................... 36

6.2.3. State Chart Diagram....................................................................................... 38

6.2.4. Activity Diagram............................................................................................ 39

6.2.5. Use Case Diagram.......................................................................................... 41

6.3. Warehouse Schema Diagram ................................................................................ 42

7. IMPLEMENTATION .................................................................................................. 43

7.1. System Design Speculation ................................................................................... 43

7.1.1. Time dimension.............................................................................................. 43

7.1.2. Date dimension .............................................................................................. 44


x

7.1.3. Location dimension........................................................................................ 44

7.1.4. Customer dimension ...................................................................................... 44

7.1.5. Service dimension .......................................................................................... 45

7.1.6. Fact table ........................................................................................................ 45

7.2. Detailed System Operation.................................................................................... 46

7.2.1. CDR data import ............................................................................................ 47

7.2.2. Staging table and loading ............................................................................... 47

7.2.3. Customer dimension check and id valuation ................................................. 50

7.2.4. Date dimension check and id valuation ......................................................... 50

7.2.5. Time dimension check and id valuation ........................................................ 51

7.2.6. Location dimension check and ID valuation.................................................. 51

7.3. Business Intelligence and Reporting ..................................................................... 52

7.3.1. OLAP Analysis: ............................................................................................. 52

7.3.2. Loss Analysis ................................................................................................. 55

7.4. Data Mining and Knowledge Discovery ............................................................... 55

7.4.1. Customer Segmentation using RFM Technique ............................................ 55

7.4.2. Customer Churn Prediction............................................................................ 57

7.5. Tools and Technologies for System Development ............................................... 57

7.5.1. Teradata Administrator 13.0: ......................................................................... 57

7.5.2. Teradata BTEQwin ........................................................................................ 58

7.5.3. Teradata Fastload ........................................................................................... 58

7.5.4. Python 2.7 ...................................................................................................... 59

7.5.5. Microsoft SQL Management Studio .............................................................. 59

7.5.6. SVN................................................................................................................ 59

7.5.7. Red Mine........................................................................................................ 60


xi

7.5.8. Microstrategy ................................................................................................. 60

8. RESULTS AND CONCLUSION ................................................................................ 61

8.1. Results ................................................................................................................... 61

8.1.1. Dashboard Home............................................................................................ 61

8.1.2. Call Analysis - OLAP Analysis ..................................................................... 62

8.1.3. Geographical Analysis – OLAP Analysis...................................................... 63

8.1.4. Datewise Analysis – OLAP Analysis ............................................................ 64

8.1.5. Timewise Analysis – OLAP Analysis ........................................................... 65

8.1.6. User Analysis – OLAP Analysis.................................................................... 66

8.1.7. Revenue Analysis – OLAP Analysis ............................................................. 67

8.1.8. Scatter Plot 3d – Customer Segmentation with RFM Algorithm .................. 69

8.1.9. RFM Values – Customer Segmentation with RFM Algorithm ..................... 70

8.1.10. Churn Prediction ........................................................................................ 71

8.1.11. Loss Analysis ............................................................................................. 72

8.2. Conclusion............................................................................................................. 73

8.3. Future Enhancement.............................................................................................. 73

Bibliography......................................................................................................................... 75

APPENDIX .......................................................................................................................... 78

Appendix A ...................................................................................................................... 78
xii

LIST OF FIGURES

Figure 4.1 Microstrategy project creation workflow ........................................................... 18

Figure 6.1 System Block Diagram of CDSST ..................................................................... 33

Figure 6.2 Component Diagram of CDSST......................................................................... 35

Figure 6.3 Sequence Diagram of CDSST ............................................................................ 36

Figure 6.4 Sequence Diagram of CDSST (Report Generation)........................................... 37

Figure 6.5 State Chart Diagram of CDSST ......................................................................... 38

Figure 6.6 Activity Diagram of CDSST .............................................................................. 39

Figure 6.7 Use Case diagram of CDSST ............................................................................. 41

Figure 6.8 Snowflake Schema Diagram of CDSST Warehouse .......................................... 42

Figure 7.1 Process flow of Project A .................................................................................. 46

Figure 8.1 Dashboard output – Home .................................................................................. 61

Figure 8.2 Dashboard Output - Call Analysis | OLAP Analysis ......................................... 62

Figure 8.3 Dashboard Output - Geographical Analysis | OLAP Analysis........................... 63

Figure 8.4 Dashboard Output - Datewise Analysis | OLAP Analysis ................................. 64

Figure 8.5 Dashboard Output - Timewise Analysis | OLAP Analysis ................................ 65

Figure 8.6 Dashboard Output - User Analysis | OLAP Analysis......................................... 66

Figure 8.7 Dashboard Output - Revenue Analysis | OLAP Analysis .................................. 67

Figure 8.8 Dashboard Output - Bubble Graph Revenue Analysis | OLAP Analysis........... 68

Figure 8.9 Dashboard Output - Scatter Plot 3d | Customer Segmentation with RFM ......... 69

Figure 8.10 Dashboard Output - RFM Values | Customer Segmentation with RFM .......... 70

Figure 8.11 Dashboard Output - Churn Prediction .............................................................. 71

Figure 8.12 Dashboard Output - Loss Analysis ................................................................... 72

Figure A.1 Schema for Time Dimension ............................................................................. 78


xiii

Figure A.2 Schema for Date Dimension .............................................................................. 78

Figure A.3 Schema for Location Dimension ....................................................................... 79

Figure A.4 Schema for Customer Dimension ...................................................................... 79

Figure A.5 Schema for Service Dimension ......................................................................... 80


xiv

LIST OF TABLES
Table 7.1 Staging table Structure ......................................................................................... 49

Table 7.2 Cluster details - Customer Segmentation............................................................. 56

Table A.1 Structure of Fact Table........................................................................................ 81


xv

LIST OF ABBREVIATIONS

ABC Artificial Bee Colony


AI Artificial Intelligence
BI Business Intelligence
BTEQ Basic Teradata Query
CART Classification And Regression Tree
CDR Call Detail Record/ Call Data Record
CDSST Corporate Decision Support System for Telecommunications
CRM Customer Relationship Management
CSH C-Shell
DSS Decision Support System
ER Entity Relationship
ETL Extract, Transform, Load
FCD Frequently Changing Dimension
GA Genetic Algorithm
GPRS General Packet Radio Service
GSM Global System for Mobile Communications
GUI Graphical User Interface
LDAP Light-weight Directory Access Protocol
OLAP Online Analytical Processing
OLTP Online Transaction Processing
PSO Particle Swarm Optimization
RFM Recency Frequency Monetary
SCD Slowly Changing Dimension
SCM Software Configuration Management
SOM Self-Organizing Map
SVN Sub-Version Network
TCSH Tab C- Shell
1

1. INTRODUCTION

1.1. Project Background

Mobile Communication is one of the technological fields that have seen rapid growth in
recent times. This growth is mainly influenced by technological breakthroughs in the field
of mobile technology, processing capability and communication. Furthermore, this
technology is easily available to the general public. The scenario is alike in both the
developed and developing countries. On this basis, it is speculated that currently in this
world the total number of users of this technology exceeds in billions.

The telecommunication market today is competitive. In the context of our country, there is
an ongoing battle between two major companies namely Nepal Telecom and Ncell for
being the prominent figure of mobile communication throughout the country. In order to
understand their customers and create a better environment for managing customer
relationships, many telecommunications companies use call detail analysis to achieve
competitive advantage. By understanding customers’ characteristics and behavior, it is
possible to successfully tailor services that meet customers’ needs [1]. This is known as the
personalization, and this concept is not restricted to the telecommunications industry. User
profiling can be valuable to semantic web, it is successfully applied in banking, financial
services and many other industries.

Telecommunications rating is the activity of determining the cost of a particular call. The
rating process involves converting call related data into a monetary-equivalent value. Call
related data is generated at various points in the network or measurements may be taken by
third party equipment such as network probes. One or more of the following parameters
are used in the rating system:

 Time property of the call (day of week, date, time of day)


 Amount of usage (Duration of call, amount of data, number of messages, number of
songs)
 Destination of the call (land line, overseas, etc.)
 Origin of call/ Location of the caller (for mobile networks)
2

 Premium charges (third party charges for premium content, cost of physical items
such as movie tickets)

Call data record is a record of a call set up and completion and its format varies among
telecom providers or programs. Over the last ten years the CDR has evolved to include
several new fields. As technology and new marketing opportunities have evolved, more
information has been added to the CDR. Today, CDR is the currency for the Telecom
industry [2]. A CDR contains a world of traffic analysis information including the
origination, termination, and routing of a call. You tell the switch what you want it to do
when these numbers are dialed, and the CDR will show how the switch actually routed the
call, correctly or not. Unlike network statistics, the CDR is authentic, real-time information
(Lee McDaniel).

The telecommunications industry has advantage over other sectors because it has a
voluminous amount of customer data [2]. Telecommunications companies collect various
demographic data (such as name, address, and age and gender information) and store call
detail records. Call detail records usually include information about originating and
terminating phone numbers, the date and time of the call and its duration. Demographic
and call detail data can be stored together in a data mart (known as the CDR data mart),
and transformed into information that can be analyzed using OLAP tools.

Although the CDR is the smallest piece of the telecom puzzle, the CDR is the building
block for the network and provides the detailed information carriers need to leverage
themselves appropriately in this competitive marketplace. Knowledge is power. In the
telecom industry, the CDR not only translates into currency, but also into power.
Intelligent telecom companies are recognizing the strategic value of efficient handling of
CDRs and are implementing advanced CDR applications such as Equinox’s ShowCDR,
ExtractCDR, and AccessCDR to maximize their resources [2]. Now more than ever,
carriers need to manage CDRs efficiently to compete in this competitive
telecommunications marketplace.
3

1.2. Project Overview

The sole purpose of our application “Corporate Decision Support System for
Telecommunications” is to help the company in making decisions and formulating
business policies based on analysis of data from CDR. This application assembles data
from various sources mainly CDR, integrates them into a separate data warehouse. Our
system is built based on the Kimball approach of data warehouse design. It is a bottom-up
approach of data warehousing which is based on excessive de-normalization of the
available data. In the bottom-up approach, data marts are first created to provide reporting
and analytical capabilities for specific business processes. The reports are a mixed
combination of graphs, pie charts, bar diagrams or plain text predictions. These records are
used to create the simple reports that are easy to understand for the non-technical
community. These reports generated by our system provide an insight to the company’s
past and present status in the consumer market. They show the activity of the company in
the past and the present and illustrate how much has been done to grow and spill over the
market. These results of analysis are arranged in the form of a dashboard. Also the data
content of our warehouse are processed using various mining algorithms to predict the
future of the company based on the data behavior in our warehouse. The end results of the
analysis are presented in a way that is easier to the non-technical people to navigate
around. Furthermore, the application also helps to draw conclusions that can be utilized for
improving business.

1.3. Aims and Objectives

1.3.1. General Objectives

The objective of this project was to deliver a product that will analyze the data associated
with any telecommunication company. The system contains a data warehouse built on top
of Teradata database. The warehouse is developed in a way efficient to manage millions of
records generated by the telecommunication switches. The data to be analyzed has been
transformed and loaded into the warehouse. Reports are generated with the help of
Microstrategy tool to analyze customer behavior and calling trends with the warehouse as
the source. Predictions have been made with the utilization of data mining algorithm for
4

making better strategies. The tool is capable of analyzing the data recorded by different
CDR switches which will help in making plans for loss prevention between the switches.

1.3.2. Specific Objectives

The higher level abstractions of the objectives of our project are described in the general
objectives section of the project proposal. On a finer level of granularity, CDSST has
fulfilled the following objectives:
 Loss Prevention by analysis of the call data record (CDR) generated by different
switch.
 Achieve new enterprise level data model to handle huge amount of data in Teradata
database.
 Provide a new base for OLAP analysis through Microstrategy reporting tool.
(analyze customers’ behavior and improve its marketing campaigns and pricing
strategies )
 Make predictions to support in the formulation of company’s business policies by
the mining of the CDR and customer data.
5

1.4. Scope

1.4.1. Problem Statement

Building a warehouse for the analysis of data for an enterprise is not a common practice in
Nepal. In fact this can contribute to the decision making process. Considering
telecommunication enterprise, all user activities are tracked in the CDR data. This data is
very rich and is potentially useful which is considered very confidential. Due to which no
significant researches regarding analysis of CDR data have been found. Considering the
confidentiality of the data, it is only used to determine the monetary value of calls but not
fully exploited to obtain benefits. CDR data can also contribute to analyze the loop hole of
the system. In the present context of our country, no relevant obtained benefits have been
seen from the analysis of CDR data. This CDR data can be used by the enterprise to
increase its marketing campaigns and pricing strategies. Different schemes can be
discovered from the analysis of CDR data which can benefit the enterprise. Mobile
penetration have been increasing very fast in our country and two big telecommunication
enterprises: Ncell and Nepal Telecom are mainly responsible for providing services to
them. Along with the increase in number of customer, telecommunication enterprise is not
able to meet their needs. This causes dissatisfaction of customer due to which they switch
from one network to another frequently. Also the problem gets stacked up. Rate of call
drop have been quite high due to which customers are not fully satisfied.

1.4.2. Offered Solution:

This project has deliver an application which allows users to view the inferences, trend,
predictions and so forth drawn out from the data in the data warehouse of our own. All
these information are visualized shown in an interactive form on a dashboard. This process
involved ETL process to build the warehouse which was accompanied by the creation of a
data model. There will always be a challenge of optimization since we will be dealing with
millions of data generated by switch (CDR). The project withholds a good potential which
will yield informative and interesting results

1.4.3. Within Scope

As previously stated, complete analysis of CDR data can deliver high end business
solutions. Some of the important results that lie within the scope of our project are as
follows:
6

 Prediction of calling trends using CDR analysis that can help in managing
resources to improve the quality of a call.
 Identification of customer network and solutions to business advertisement
campaign.
 Representation of telecom business reach in a spatial database.
 Recognizing strategies to prevent call drops and loss occurring in switches
throughout the network.
 Justification of customer migration from the industry.

1.4.4. Not in Scope

Telecom data base is huge and can contain a highly diversified set of fields. In this one
year long project of ours, we will focus ourselves only on the selected aspects of CDR data
and restricted domain of customer information. CDR analysis cannot provide all the
required business solutions. Following enlisted areas will not be included in our project,
although the enlisted field may seem to be closely related to what we are doing right now:

 Complete analysis of all the CDR records created by all the switches of the telecom
industry. The data would simply be too large to store and manage. (Literally
billions per week).
 Geographical analysis of call traffic throughout the entire network of the telecom is
simply out of the scope of this project. Complete geographical analysis and spatial
representation of call data throughout the country shall require a longer time span, a
greater work force and an industry grade effort.
 Optimization of data structure of CDR records and solution to the loss of call
requests among the switches in the telecom network cannot be simply implemented
as the analysis of CDR switcher would involve participation of the system engineer
and hardware programmer too.

1.5. Organization of Report

This chapter explains the project background, overview, aims and objectives, and scope of
the project.
7

Chapter 2 explains about the literature review, which gives the previous works done in the
field of telecommunication data structure, data warehouse and ETL process, customer
segmentation and churn prediction.

Chapter 3 provides the related theoretical backgrounds. Various methods of customer


segmentation, churn prediction and business intelligence reporting will be discussed.

Chapter 4 explains in detail about the technological backgrounds. The technical details of
business intelligence and reporting, algorithms and details of Customer segmentation and
churn prediction will be discussed in this chapter.

Chapter 5 presents system analysis which discusses about basic system operation,
requirement specification and feasibility assessment of the project.

Chapter 6 explains in detail about the system design and discusses about overview of the
system design, system diagrams and warehouse schema.

Chapter 7 explains about the implementation part of the system. Topics like system design
speculation, detailed system operation, implementation perspective of business intelligence
and reporting and data mining and knowledge discovery will be discussed.

Chapter 8 gives some explanation about the output of charts of CDSST dashboard along
with their snapshots and also conclusion and future enhancement of the project.
8

2. Literature Review

2.1. Telecommunication Data Structure

Every business enterprise wants to make better decisions with timely & accurate data.
Accurate, reliable, timely and consistent enterprise database management produces quality
data. Improving data quality benefits everyone in the organization. This helps them avoid
misleading and inaccurate conclusions and make better business decisions that improve
their bottom line. Teradata takes a step-by-step approach to data quality improvement that
focuses on the specific business issues to deliver tangible and measurable results [3].

More secure and trusted information can be achieved with Data Integration. Proper data
management provides business value - not only by giving them the ability to make faster,
better decisions, but also by increasing security and privacy and adhering to the data
governance policies. Teradata Integrated Data Management services can help enterprises
secure sensitive data that may be located in multiple data warehouses [3]. It will also help
establish consistent access rights, auditing capabilities and encryption to protect this vital
asset.

In the competitive telecom market, it is absolutely crucial to handle and manage CDRs in
an efficient fashion. The common denominator of CDR applications is that they allow
carriers to manage their business down to the smallest detail. Telecom companies initially
spend and invest a lot of resources into building their network and obtaining customers.
Unfortunately, these companies often neglect their currency, the CDR, which measures the
efficiencies of scale, and shows if revenue is being lost due to mismanagement of CDRs
[4]. Popular and useful CDR applications that have been developed so far include:

 Organizing CDR records based on the trunk group and/or billing number
 Giving customers electronic access to their CDRs
 Searching CDR files
 Relaying CDR records to billing systems
 Importing CDRs into spreadsheet or database applications
 Creating individualized call record reports on demand
9

Richard Harding, an Oracle Database Administrator with LDInet, classifies the focus of
analysis of CDR into following seven fields:

 Trunk group incoming


 Trunk group outgoing
 A Number
 B Number
 Date
 Time
 Duration.

All telecommunications service providers perform forecasting calculations to assist them in


planning their networks [5]. Accurate forecasting helps operators to make key investment
decisions relating to product development and introduction, advertising, pricing etc., well
in advance of product launch, which helps to ensure that the company will make a profit on
a new venture and that capital is invested wisely [6].

2.2. Data Warehouse and ETL Process

Data warehouse is the collection of tremendous amount of data. The data stored in data
warehouse is highly de-normalized, redundant and time variant. The data is stored
continuously from the infancy of any data collection mechanism. Extract-Transform-Load
(ETL) is a process in data warehousing responsible for pulling data out of the source
systems and placing it into a data warehouse. Data is extracted from an OLTP database
transformed to match the data warehouse schema and loaded into the data warehouse
database. Many data warehouses also incorporated data from non-OLTP systems such as
text files, legacy systems and spreadsheets; such data also requires extraction,
transformation and loading. Because ETL is an integral, ongoing and recurring part of a
data warehouse, it needs to be automated, well documented and easily changeable.

The principal reason why businesses need to create Data Warehouses is that their corporate
data assets are fragmented across multiple, disparate applications systems, running on
different technical platforms in different physical locations. This situation does not enable
good decision making [7]. Individual business units within an enterprise are designated as
"owners" of operational applications and databases. These "organizational silos"
10

sometimes don't understand the strategic importance of having well integrated, non-
redundant corporate data. Consequently, they frequently purchase or build operational
systems that do not integrate well with existing systems in the business.

Data Management issues have deteriorated in recent years as businesses deployed a


parallel set of e-business and ecommerce applications that don't integrate with existing
"full service" operational applications. Operational databases are normally "relational" -
not "dimensional". They are designed for operational, data entry purposes and are not well
suited for online queries and analytics.

Due to globalization, mergers and outsourcing trends, the need to integrate operational data
from external organizations has arisen. The sharing of customer and sales data among
business partners can, for example, increase business intelligence for all business partners.

The challenge for Data Warehousing is to be able to quickly consolidate, cleanse and
integrate data from multiple, disparate databases that run on different technical platforms
in different geographical locations.

2.3. Data Mining and Knowledge Discovery

2.3.1. Customer Segmentation

Although customer segmentation and market segmentation have many similarities, there
are some critical differences regarding their input variables for their clustering mechanisms
[8]. Market segmentation usually aims at acquiring new customers, and deals with the first
step of CRM (i.e. customer acquisition) using socio-demographic data, but, customer
segmentation is used at all steps of CRM using both socio-demographic and transactional
data. “We can imagine that customer cultivation and retention are more important than
customer acquisition, because lack of information on new customers makes it difficult to
select target customers and this will cause inefficient marketing efforts” [9].

Chan (2008) [10]has classified existing customer segmentation methods into methodology-
oriented and application-oriented approaches. Most of methodology-driven studies modify
some data clustering techniques, such as SOM, or use a combination of two or more data
mining techniques to achieve more accurate clusters or segments (e.g. Lee et al., 2004 [11];
Jonker et al., 2004 [12]; Huang et al., 2007 [13]; Kim and Ahn, 2008 [14]). “On the other
11

hand, application-oriented researches must search for the optimum method for solving
segmentation problems in specific applications” (Chan, 2008 [10]).They usually define and
create new variable for clustering procedure or use different variables in sequential
clustering steps (e.g. Hwang et al., 2004 [9]; Hsieh,2004 [15]; Chan, 2008 [10]; Cheng and
Chen, 2009 [16]).

Another important input variable for application-oriented customer segmentation is RFM.


For instance, Hsieh (2004) [15] used a self-organizing map neural network to identify
groups of customers based on repayment behavior and recency, frequency, and monetary
behavioral scoring predictors. He also classified bank customers into three major profitable
groups of customers. The resulting groups of customers were then profiled by customer’s
attributes determined by using an Apriori association rule inducer. Lately, McCarty and
Hastak, in 2007 investigated RFM, CHAID, and logistic regression as analytical methods
for direct marketing segmentation, using two different datasets. Finally, Cheng and Chen
in 2009 proposed a new procedure, joining quantitative value of RFM attributes and K -
means algorithm into rough set theory, to extract meaning rules.

Besides, a combination of above mentioned input variables has been utilized by


researchers too. For example, Chan in 2008 presented a novel approach that combines
customer targeting and customer segmentation for campaign strategies. This investigation
identified customer behavior using a RFM model and then used an LTV model to evaluate
proposed segmented customers.

2.3.2. Churn Prediction

Churn management is a field under operational CRM (Customer Relationship


Management) which refers to the process of keeping the most profitable customers in
subscription [17] and assessing the most effective way that an operator can react against
churn [18]. Churn Prediction mainly consists of two goals: (1) explain why customers
unsubscribe their subscription and move to a competitor, (2) predict which customers are
most likely to churn in the near future. Churn Prediction has increasingly received attention
from a large number of researchers over recent years. The most popular methods that have
been used for churn prediction are such as: decision tree, regression, Naive Bayes and
neural network.
12

(Hung, Yen, & Wang, 2006) employ a decision tree for churn prediction in the Taiwan
postpaid mobile market using several groups of features: customer demographics, billing
information, contract/service status, call detail records, and service change log [18].
Features are tested for significance using -test. According to the analysis, the features that
are significant to differentiate between churners and non-churners are: age, tenure, gender,
billing amount, number of overdue payment, in-net call duration, number of changes in
account information. Customers are then segmented with respect to the significant features.
Comparison of decision tree and artificial neural networks (ANN) verifies that ANN
performs better on this particular data set. To make sure that the models work well, the
authors track the models in one year from July 2001 to July 2002. The performances of all
models remain first steady but then drop dramatically after six months from 80-90% to 0-
10% of overall accuracy.

Baesens et al. (2004) used Bayesian network classifiers for identifying the slope of the
customer lifetime value (CLV) for long-life customers, but used simple linear regression
on the historical contributions of each customer to capture their individual life cycles [19].
The slope was then separated into either positive or negative classes to represent increased
or decreased spending. This variable was then used as the dependent variable in their
study. The CLV was also the focus of research by Rosset etal. (2003) who used the
Kaplan-Meier estimator to estimate the value, and Stahl et al.(2003) who link CLV to
company shareholder value. [20] [21]

Support vector machine (SVM) is another technology that is worth investigating for its
suitability for use with customer churn management; however a thorough search of the
literature has revealed little investigation into this method. The closest research to CRM
identified is the mining of customer credit scores (Lee et al., 2006). Auh and Johnson
(2004) use five firm-level variables named as customer satisfaction index scores (SAT),
customer loyalty index scores (LOY), year of data (YEAR),relative quality importance
(RQI) and the average ease of comparing quality differences(EQ), to create a general linear
model. [22] [23]
13

3. THEORITICAL BACKGROUND
The data collected in a telecommunication industry is huge and is continually evolving. It
is a never ceasing process. The huge amount of data, when harnessed properly, can give up
precious sets of information that can actually guide the business process in the grass root
level. To harness the goodness of the data collected by the interaction of the customers
with the telecommunication services provided by a telecommunication industry needs to be
properly collected and stored well. Storing data in a warehouse makes it efficient to
generate reports and find out the information at a higher level of abstraction.

3.1. Business Intelligence and Reporting

Decision support system for an enterprise should give its executives the ability to take
better business decisions from the insights on the warehouse repository provided by the
intuitive dashboards and reports. This is achieved with the help of business intelligence
tool provided by the Microstrategy platform. It helps to easily transform data into intuitive
dashboards and reports for greater analytical insights to make informed, data-driven
decisions. The MicroStrategy platform makes Business Intelligence faster, easier, and
more user-friendly. It enhances productivity, monitor trends and discover anomalies, and
forecast business opportunities. MicroStrategy provides a high performance, scalable
enterprise Business Intelligence platform capable of delivering deep insight with
interactive dashboards and superior analytics to large user populations through Web
browsers, mobile devices and Office applications.

3.2. Data Mining and Knowledge Discovery

3.2.1. Customer Segmentation

Customer Segmentation is the process of grouping the customers based on their purchase.
Data mining is useful in finding knowledge from huge amounts of data. The clustering
techniques in data mining can be used for the customer segmentation process so that it
clusters the customers in such a way that the customers in one group behave similar when
compared to the customers in the other group based on their transaction details. Customer
segmentation allows a company to target specific groups of customers effectively and
allocate marketing resources to best effect. According to an article by Jill Griffin for Cisco
Systems, traditional segmentation focuses on identifying customer groups based on
14

demographics and attributes such as attitude and psychological profiles. Value-based


segmentation, on the other hand, looks at groups of customers in terms of the revenue they
generate and the costs of establishing and maintaining relationships with them.

Customer Segmentation gives a quantifiable way to analyze the customer data and
distinguish the customers based on their purchase behavior [24]. In this way the customers
can be grouped into different categories for which the marketing people can employ
targeted marketing and thus retain the customers. Once the customers are segmented, rules
can be generated to describe the customers in each group based on their purchase behavior.
These rules can be used to classify the new customers to the appropriate group who have
similar purchase characteristics. In our case, the purchasing habit of customer indicates the
calling habit or the habit of using the services provided by the telecommunication
enterprise.

3.2.2. Churn Prediction

In the telecommunications industry, the broad definition of churn is the action that a
customer’s telecommunications service is canceled [9]. This includes both service-provider
initiated churn and customer initiated churn. Customer initiated churn is more complicated
and the reasons behind vary. Our model will only be prediction customer initiated churn. If
we are able to predict in advance, the attributes of customers whom we are going to lose in
near future one can take corrective action so that we can minimize this phenomenon.

Initially when the industry growth is high, churn may not be a problem. The focus at this
stage is on customer acquisition. For each customer churning out there may be many who
will be joining. As the industry matures, the churn rate rises and there will probably be a
time when for every one customer leaving only one new will be joining or there may not
be even one joining. When this stage is reached, it will affect the bottom line for the
company.

3.2.2.1. The benefits of Churn prediction

One of the important benefits of building a model for churn prediction will be to marketing
department / call center(s). The list of customers who are likely to churn in the coming
month will be made available to the above places so that further action can be taken. Data
analyzed from back end or Analytical Customer relationship management (CRM) System
will go to frontend or operational part of CRM closing the CRM loop. Marketing
15

department can plan giving customers discounts, other promotions/events, other products
of other sister companies wherever applicable.

3.2.2.2. Methodology for Churn prediction

For finding answers to the questions who and why is likely to churn, a classification of the
customers is needed. Churn prediction deals, therefore, with the identification of customers
likely to churn in the near future. The basis for this is historical data, containing
information about past churners. A comparison is made between these churners and
existing customers. As likely churners are identified customers for which the classification
suggests similarity to prior churners.
16

4. TECHNICAL BACKGROUND

4.1. Extraction, Transformation and Loading

Extraction, Transformation, and Loading (ETL) processes are responsible for the
operations taking place in the back stage of a data warehouse architecture. Data is extracted
from an OLTP (Online Transactional Processing) database, transformed to match the data
warehouse schema and loaded into the data warehouse database. Many data warehouses
also incorporate data from non-OLTP systems such as text files, legacy systems, and
spreadsheets; such data also requires extraction, transformation, and loading. ETL is not a
one-time event as new data is added to the Data Warehouse periodically – monthly, daily
or hourly.

4.1.1. Extraction

The extraction step is conceptually the simplest task of all, with the goal of identifying the
correct subset of source data that has to be submitted to the ETL workflow for further
processing. As with the rest of the ETL process, extraction also takes place at idle times of
the source system - typically at night. For extraction, there needs to be a logical data map
before the physical data can be transformed. The logical data map describes the
relationship between the extreme starting points and the extreme ending points of the ETL
system usually presented in a table or spreadsheet.

4.1.2. Transformation

Transformation is the main step where the ETL adds value. Actually the process changes
data and provides guidance whether data can be used for its intended purposes. It is
performed in the staging area. Data can actually have different issues which can be
classified into: schema-level problems, record-level problems or value-level problems. To
deal with such issues, the integration and transformation tasks involve a wide variety of
functions, such as normalizing, denormalizing, reformatting, recalculating, summarizing,
merging data from multiple sources,modifying key structures, adding an element of time,
identifying default values, supplying decision commands to choose between multiple
sources, and so forth.
17

4.1.3. Loading

The end of the source records' journey through the ETL workflow comes with their loading
to the appropriate table. Some data warehouses may overwrite existing information with
cumulative information; frequently, updating extracted data is done on a daily, weekly, or
monthly basis. Other data warehouses (or even other parts of the same data warehouse)
may add new data in a historical form at regular intervals - for example, hourly. Loading
can be classified to two types – loading dimensions and loading facts. In the process of
loading dimensions, the data loading module consists of all the steps required to administer
slowly changing dimensions (SCD) and write the dimension to disk as a physical table in
the proper dimensional format with correct primary keys, correct natural keys, and final
descriptive attributes. The most important thing about loading fact tables is that we first
need to load dimension tables and then according to the specification the fact tables. The
fact table is often located in the center of a star schema, surrounded by dimension tables. It
has two types of columns: those containing facts and other containing foreign keys to
dimension tables.

4.2. Business Intelligence and Reporting

The data source for Microstrategy was our CDSST warehouse created with Teradata. After
configuring Microstrategy to connect with the warehouse, the next step was to proceed
with the Microstrategy reporting and dashboard creation. Then tables were selected from
the warehouse to be used for reporting in the Microstrategy project which is the first task in
project creation workflow as shown in the figure below.
18

Figure 4.1 Microstrategy project creation workflow

The second task in the project creation workflow is to create the facts for the
MicroStrategy project. Facts are created based on our logical data model and mapped them
to columns in the data warehouse schema. The facts in the CDSST projects are cost, and
duration. These facts are then used to define metrics. As such, the fact schema object
serves as a bridge between fact values stored in the data warehouse and the metrics users
want to see on MicroStrategy reports. Facts point to the appropriate physical columns in
the data warehouse, while metrics perform aggregations on those columns.

4.2.1. Creating Attributes, Facts and Metrics

Facts are one of the essential elements within the business data model. They relate numeric
data values from the data warehouse to the MicroStrategy reporting environment. Facts
generally represent the answers to the business questions on which users want to report. In
the MicroStrategy environment, facts are schema objects created by and shared between
MicroStrategy users. The facts you create in MicroStrategy allow users to access data
stored in the data warehouse. Facts form the basis for metrics, which are used in the
majority of analyses and reports that users can create with MicroStrategy. Facts and
attributes are necessary to define projects. In a MicroStrategy project, facts are numeric
data and attributes are contextual data for the facts. For example in case of CDSST, you
want to analyze the amount of call duration at a certain area during January. In this case,
the amount of call duration represents the fact, and the area and month represent attributes.
Users can then use these facts and attributes as building blocks for metrics and reports.
19

Attributes provide the business model with a context in which to report on and analyze
facts. While knowing your company’s total sales is useful, knowing where and when the
sales took place provides the kind of analytical depth users require on a daily basis.

Metrics are MicroStrategy objects that represent business measures and key performance
indicators. They are calculations to be performed on data stored in the database and are
similar to formulas in spreadsheet software. Questions such as “What were the sales for the
eastern region during the fourth quarter?”, “Are inventory levels being consistently
replenished at the beginning of each week?”, or “How many employees are currently
working for our company?” can easily be answered by creating metrics. A metric
definition must contain a formula, which determines the data to be used and the
calculations to be performed on the data. A metric is categorized as one of the following
types based on the functions used in its formula:

 Simple metric: The formula of a simple metric is a mathematical expression based


on at least one group function, such as sum or average, which is applied to facts,
attributes, or other metrics. It can also contain non-group functions or arithmetic
operators, in addition to the required group function. An example of the formula of
a simple metric is Avg(Sum((Cost + Profit))) where Cost and Profit are facts. The
formula contains two group functions, Avg (which stands for average) and Sum.
 Compound metric: The formula of a compound metric is based on arithmetic
operators and non-group functions. Arithmetic operators are +, -, *, and /; non-
group functions are OLAP and scalar functions such as running sum or rank. The
operators and functions can be applied to facts, attributes, or metrics. An example
of the formula of a compound metric is RunningAvg(Cost) where Cost is a metric.
The formula contains a non-group function, RunningAvg (which stands for running
average). Another example is Sum(Cost) + Sum(Profit) where Cost and Profit are
metrics. The addition expression, denoted by the + operator, makes this a
compound metric.

4.2.2. Creating User Hierarchies

Hierarchies are groupings of attributes that can be displayed, either ordered or unordered,
to reflect the relationships that exist between the attributes in a project. Hierarchies are
used to group related attributes in practical business areas. For CDSST, there is a Date
20

hierarchy that consists of Day, Day of Week, Week, Month, Quarter, Semi-year and Year
attributes. And other hierarchies in CDSST includes Time hierarchy and the Location
hierarchy. Hierarchies as they exist in the MicroStrategy environment and provides
information on the two different types of hierarchies in MicroStrategy. These types of
hierarchies include the system hierarchy and the user hierarchy. The system hierarchy is
automatically created when you create a project and is maintained by the relationships that
exist among the project’s schema objects. The user hierarchy is a hierarchy which you
create specifically for your report designers. The mentioned hierarchies of Date, Time and
Location are all user hierarchy. These user hierarchies are used to define the browse and
drill relationships between attributes.

4.2.3. Reports, Prompts and Filters

Reports that display your business data are the focus and goal of business intelligence.
Reports allow you to gather business insight through data analysis. The results displayed in
any MicroStrategy report are often a starting point for further investigation. Each row of a
report represents business concepts i.e. attributes while columns represent financial
calculations that are important to business i.e. metrics. Reports can be viewed in
Microstrategy from different perspectives, grids, graphs, grid/graph and SQL view.

A filter specifies the conditions that the data must meet to be included in the report results.
Using a filter on a report narrows the data to consider only the information that is relevant
to answer your business question, since a report queries the database against all the data
stored in the data warehouse. Filters are helpful in clarifying large quantities of data and
only displaying subsets of that data, so reports show users what they really need to see.

The different types of prompts allow you to create a prompt for nearly every part of a
report. Remember that prompts can be used in many objects including reports, filters,
metrics, and custom groups, but all prompts require user interaction when the report is
executed, Filter definition prompts allow the user to define the filtering criteria, based on
the following objects: Attributes in a hierarchy, Attribute forms, Attribute element lists and
Metrics

4.2.4. Intelligent Cube, Caching and Dashboard Creation

Intelligent Cubes are In-memory copy of data, which is also known as Multi-Dimensional
Cubes (sets of data), that are present on MicroStrategy Intelligent Server memory. The data
21

in Intelligent Cubes can be accessed by different MicroStrategy objects like Reports and
Documents (dashboard). It allows us to create sets of data that supports multiple reports
that can answer variations to similar Business Queries. So, intelligent cubes represents an
OLAP cube that contains data cached on memory of the intelligent server so that every
time a report or a document is run, it can be easily fetched from the memory and need not
pull data from the warehouse executing the query grabbing the data. Implementing
intelligent cube makes the processing of document a lot faster as compared to other
methods. In case of CDSST system, an intelligent cube was created consisting of various
dimensions like date, time, service and location with metrics like count of service usage,
sum of cost, average call duration etc. Creation of cube eases the drilling capability of a
report. The cube was then used to create a dataset (report) which was used as the main
dataset for creating a document.

A MicroStrategy Report Services document contains objects representing data coming


from one or more reports, as well as positioning and formatting information. It is used to
format data from multiple reports in a single display of presentation quality. Upon creation
of a document, one can specify the data that appears, control the layout, formatting,
grouping, and subtotaling of data, and specify the position of page breaks. In addition, one
can insert pictures and draw borders in the document. Most of the data on a document is
from an underlying dataset. A dataset is a MicroStrategy report that defines the information
that the Intelligence Server retrieves from your data warehouse or cache.

Other document components that do not originate from the dataset, such as static text used
for a title or heading, page numbers, and images, are added by the document's designer and
are stored in the document's definition. A dashboard is a special type of document. A
dashboard is commonly only one page long, is intended to be viewed online, and usually
provides interactive features that let analysts change how they view the dashboard’s data,
as well as what data they are viewing. Multiple display modes are provided for a
document: design, express, interactive, editable and flash mode. While selectors like
dropdown, radio buttons, checkboxes etc and filters make the dashboard more interactive,
drilling capability is also provided in some modes. Drilling is one of the most powerful
data analysis functionalities in any reporting environment. Drilling lets users explore data
beyond the data immediately visible in a grid or a graph. One can drill through a grid to
analyze data that is closely related to the original data, or can expose an entirely different
22

object on a grid to see what the data looks like within a different context. The dashboard
screenshots for CDSST are included in the Appendix A1.

4.3. Customer Segmentation: RFM Approach

The Recency (R), Frequency (F) and Monetary (M) are the important attributes that
determine the behavior of the customer. RFM method is very effective for customer
segmentation [Jing Wu and Zheng Lin (2005)]. R means ‘recency’ which indicates the
time interval between the present and previous transaction date of a customer. F means
frequency which indicates the number of transactions that the customer has done in a
particular interval of time. M means monetary which indicates the total value of the
customer’s transaction amount. It has been proven that the values of R, F and M decide the
characteristics of the customer behavior [25].

4.3.1. Other Approaches

Customers can also be segmented using other Artificial Intelligence (AI) methods like Self
Organizing Map (SOM), Genetic Algorithm (GA), Particle Swarm Optimization (PSO)
and Artificial Bee Colony (ABC). GA is a part of evolutionary computational technique. It
is based on the process of natural evolution using operations like mutation and cross over.
The main disadvantage of GA is it is time consuming and it is difficult to make the
solution converge to an optimal solution and has stability issues. SOM is an unsupervised
neural network learning algorithm. Neural network has interconnected processing elements
and achieves input/output relationships. It is a trial and error method and suffers from long
training time [16]. PSO is an evolutionary computation technique developed by Kennedy
and Eberhart (1995). Like GA, PSO has a fitness evaluation function that takes the
particle’s position and assigns to it a fitness value. The objective is to optimize the fitness
function [Chui-Yu Chiu et.al. (2009)]. It has the same demerits as those of GA. ABC are
based on the behavior of honey-bees for food collection. A greedy selection mechanism is
employed to choose the candidate solution which is tedious [26].

4.3.2. Why RFM Approach?

An improved clustering algorithm for segmenting customers using RFM values is


attractive to use. However, the efficiency and accuracy of clustering algorithms is highly
dependent on the nature of the data records and the number and shape/property of clusters
to be formed. RFM algorithm also overcomes the disadvantages of k-means, single link
23

and complete link clustering algorithm. In k-means the output clusters varies according to
the initial centers of clusters. In single link and complete link, the number of clusters is
reduced only by one as the iteration proceeds. Thus, RFM clustering algorithm is better to
use for customer segmentation.

Algorithm

 Assign each object to its own cluster(singleton cluster)


 Calculate the distance from each object to all other objects using Manhattan
distance and store it in a distance matrix
 Find the minimum value in the distance matrix
 If the minimum value found in step2 occurs more than once for a pair of clusters in
the distance matrix then for all pair of clusters merge the two clusters
 Calculate the center for each cluster obtained in step 4 and assign the center value
to the objects within that cluster
 Repeat steps 2 to 5 until the required number of clusters are obtained

For implementing the segmentation with the values of R, F and M, these three values must
be normalized.

For normalizing R

 The data set is sorted in descending order of the R


 Divide the data set into five equal parts of 20% record in each
 Assign numbers 1,2,3,4,5 to first, second, third, fourth, fifth part of records
respectively

For normalizing F

 The data set is sorted in ascending order of the F


 Divide the data set into five equal parts of 20% record in each
 Assign numbers 1,2,3,4,5 to first, second, third, fourth, fifth part of records
respectively

For normalizing M

 The data set is sorted in ascending order of the M


 Divide the data set into five equal parts of 20% record in each
24

 Assign numbers 1,2,3,4,5 to first, second, third, fourth, fifth part of records
respectively

Assessing quality of customer segments

After normalization the values of R, F and M are from 1 to 5. Performance of clustering


algorithms in terms of quality are measured by four criteria- Mean Square Error(MSE),
Intra cluster distance, Inter cluster distance and Intra cluster distance divided by Inter
cluster distance. The MSE is defined as

𝑛 𝑘

∑ ∑ |(𝑥𝑖 − 𝑚𝑗)|2
𝑖=1 𝑗=1,𝑥𝑖 𝜖 𝑐𝑗

The intra cluster distance is the Manhattan distance between data within a cluster. It is
given by
𝑛 𝑛

∑ ∑ 𝑑 (𝑥𝑖, 𝑥𝑗)
𝑖 =1 𝑗=𝑖 +1

The inter cluster distance is the Manhattan distance between the center of the clusters. It is
given by

𝑘 𝑘

∑ ∑ 𝑑 (𝑚𝑖, 𝑚𝑗)
𝑖 =1 𝑗=𝑖 +1

Here, n indicates the number of objects, k indicates the number of clusters and mj indicates
the center of jth cluster. Hence, the objective is to

 Minimize MSE
 Minimize intra cluster distance
 Minimize inter cluster distance
 Minimize intra/inter cluster distance

4.4. Churn Prediction

To analyze the churn there are different techniques available. Some of the common models
used are: Regression analysis, Decision Trees, Logistic Regression and Neural networks.
25

Out of the above models, Decision Trees are a good choice since they provide the rules
that business users can understand. Other models like Neural Networks, Logistic
Regression reduce the understandability of the phenomenon even though they may be little
bit more accurate than decision trees. [9]

4.4.1. Use of decision trees in Churn prediction

Decision trees are the most commonly used tool for predictions and classification of future
events. The development of such trees is done in two major steps: building and pruning.
During the first phase the data set is partitioned recursively until most of the records in
each partition contain identical value. The second phase then removes some branches
which contain noisy data (those with the largest estimated error rate) [11].

CART, a Classification And Regression Tree is constructed by recursive splits of an


instance into subgroups until a specified criteria has been met. The tree grows until the
decrease of impurity falls below a user-defined threshold. Each node in a decision tree is a
test condition and the branching is based on the value of the attribute being tested. The tree
is representing a collection of multiple rule sets. When evaluating a customer data set the
classification is done by traversing through the tree until a leaf node is reached. The label
of this leaf node (Churner or Non Churner) is assigned to the customer record under
evaluation.

Algorithm:

Tree growing process:


Step 1: Start with root node (t = 1)
Step 2: Search for a split s* among the set if all possible candidates s that gives the purest
decrease in impurity.
Step 3: Split node 1 (t = 1) into two nodes (t = 2, t = 3) using the split s*.
Step 4: Repeat the split search process (t = 2, t = 3) as indicated in steps 1-3 until the tree
growing the tree growing rules are met.

Splitting Criteria and impurity measures:


At node t, the best split s is chosen to maximize a splitting criterion ∆i(s,t). When the
impurity measure for a node can be defined, the splitting criterion corresponds to a
decrease in impurity.
26

Gini Criterion:

The Gini impurity measure at a node t is defined as

The Gini splitting criterion is the decrease of impurity defined as

where pL and pR are probabilities of sending a case to the left child node tL and to the right
child node tR respectively. They are estimated as

and

Stopping Rules (CART algorithms)

Stopping rules control if the tree growing process should be stopped or not. The following
stopping rules are used:
 If a node becomes pure; that is, all cases in a node have identical values of the
dependent variable, the node will not be split.
 If all cases in a node have identical values for each predictor, the node will not be
split.
 If the current tree depth reaches the user-specified maximum tree depth limit value,
the tree growing process will stop.
 If the size of a node is less than the user-specified minimum node size value, the
node will not be split.
 If the split of a node results in a child node whose node size is less than the user-
specified minimum child node size value, the node will not be split.
 If for the best split s* of node t, the improvement ΔI(s*,t)=p(t)Δi(s*,t) is smaller
than the user-specified minimum improvement, the node will not be split.
27

5. SYSTEM ANALYSIS

5.1. Basic System Operation:

5.1.1. Stage 1

The system acquires data generated from various switches throughout the
telecommunication network. The data acquired are in form of Call Data Records (CDR).
CDR is a log file comprising of various information regarding each call. The CDR holds
detailed information regarding the call. This information is segmented into smaller flat
files. The initial stage involves acquiring the data related to the CDR from the respective
switches.

5.1.2. Stage 2

Once the necessary and relevant data are acquired from the CDR switches, the data are
transformed into a pre-determined data model suitable for storage in the warehouse. The
conversion from the flat files to the structured form is achieved through Extract, Transform
and Load Process abbreviated as ETL Process. ETL process is performed through scripting
using Korn Shell. After completing the conversion process, the data are then loaded to the
data warehouse for storage. Data marts, for the data warehouses to store the filtered data of
CDR, are created using Teradata.

5.1.3. Stage 3

The data alignment in the data marts enables the system to generate various inferences
regarding the past and the present of the telecommunication industry. In the next stage the
data from the staging tables are segregated and explicitly undertaken though the series of
stages of operation to facilitate Data Driven Discovery. The data input is mined rigorously
to generate the future values of various constraints of measurement of the
telecommunication activity.

5.1.4. Stage 4

With the records of the past and present activity of the company from the data warehouse
and with the future values of various constraints, the result set is populated into the screen
using an interactive UI. The GUI of the system is designed to address the technical
28

knowledge domain of the nontechnical class of people that are present in the industry. The
process of report generation largely involves making selection of different measures of the
user activity. The system is built in such a way that the back end coding requirement is
reduced to nil to the end user.

5.2. Requirement Specification

5.2.1. Functional Requirements:

The functional requirements of our system are categorized as Interface requirements,


business requirements, compliance requirements and security requirements. These
subheading completely illustrate what is required of the end user or the implementing
company to have a fully working form of our product.

Interface requirements:

1) Screen shall display a dashboard with tabbed paging.


2) Different tabs shall enable user to view reports based multiple perspectives of the
telecommunication business.
3) A separate login button shall be shown for administrator to log into the system.
4) Screen can print on-screen data to the printer.

Business Requirements:

1) Data must be gathered from different CDR switches of the telecommunication


2) Data must be available in a predefined machine readable format.
3) Clicking the Approve Button starts the loading process.

Regulatory/Compliance Requirements:

1) The system shall limit access to authorized users.


2) The warehouse shall be updated with new facts and dimensions periodically.

Security Requirements:
29

1) The manager can view reports but cannot load dimensions and facts.
2) The administrator can create predefined dimensions initially but cannot define his
own dimensions.

5.2.2. Non Functional Requirements

The non-functional requirements of our system are encompassed in the following points:

 Performance: The system shall have a quick response, processing and reporting
time. Loading of CDR data shall take no more than 1 minute for 100 thousand
records. Reports should be generated in real-time.
 Capacity and Scalability: The system shall be able to store huge amount of data
generated by different CDR switches in the order of gigabytes per day.
 Availability: The system shall be available during office-hours for reporting.
Maintenance and other operations which consume high processing power shall be
scheduled during off-hours.
 Recovery: The system shall perform regular backing up of data. Data lost during
crashes shall be easily recoverable.

5.3. Feasibility Assessment

Feasibility is a measure of how beneficial or practical the development of a system will be


to an organization or an enterprise. In other words, a feasibility study is an evaluation and
analysis of the potential of the proposed project which is based on extensive investigation
and research to give full comfort to the decisions makers. A system should have a
reasonable cost and should be technically and operationally feasible to be actually
implemented. The two criteria to judge feasibility are cost required and value to be
attained.
The feasibility of CDSST is analyzed under the following three headings:

5.3.1. Technical Feasibility

Technical feasibility is a measure of the practicality of a specific technical solution and the
availability of technical resources and expertise. The CDSST system in technical aspect is
practical as it can be actually realized in a telecommunication enterprise. The technology
used in the system like Teradata as database platform for the data warehouse is a robust
30

and efficient technology. Similarly, Microstrategy tool used for reporting is an enterprise
application that provides very customizable reporting. Python programming language used
for scripting and logical task is an efficient and robust language. Thus, the technology used
in this system is mature and we currently possess the necessary technology.

The technical expertise necessary for the system can be provided by us and the enterprise
implementing the system can contact the technical experts in a totally reasonable schedule
in case any technical problems arises. Thus, CDSST uses the technology that are currently
available and is technically feasible.

5.3.2. Economic Feasibility

Economic feasibility is a measure of the cost-effectiveness of a project or solution. This is


often called a cost-benefit analysis. The system that is developed should be inside a
reasonable cost that can be actually adopted by the clients. The major costs that are
incurred in CDSST would consist of development cost and operation/production costs.
Within the development cost, the cost of software used is remarkable. The Teradata
Software and Microstrategy tool is not free and one has to pay for the use. Other software
tools used in CDSST like Redmine and Python are totally free of cost. Since the client of
CDSST is a telecommunication enterprise, it can afford the costs incurred easily. The other
cost within development cost is training/documentation cost that is minimal in comparison
to the software costs. The other type of cost is operation/production costs. This cost
includes personnel costs and maintenance cost. These costs are generally high as human
resources are one of the most costly resource.

In a nutshell, developing a decision support system for a telecommunication enterprise is


not easy. For this task costs are incurred but are not that high so that a potential client
refuses to implement the system. Thus, CDSST is economically feasible.

5.3.3. Operational Feasibility

Operational feasibility is a measure of how well a specific solution will work in the
organization. It is also a measure of how people feel about the system/project. It also
analyzes the inside operations on how a deemed process will work, be implemented, and
dealing with change resistance and acceptance.
After analyzing the technical and economic feasibility studies, next would come the
operational analysis. CDSST would allow the executives and other employees of an
31

enterprise to view and analyze the basic trends in the services they provide based upon
various criteria. Besides, end users can also obtain important inferences about whether to
choose the service provider or not based on these inferences. In order to determine if the
CDSST would work, the operational feasibility study would follow this path based on six
elements:

 Process – Input and analysis from everyone in the telecommunication enterprise,


the new system CDSST will effect on the decision making process and the internal
business process of the enterprise. It will enhance the marketing and advertisement
campaigns and help more users join their system.
 Evaluation – Determinations from the process suggestions; will the system benefit
everyone? Who is left behind? Who feels threatened? CDSST indeed will benefit
the enterprise implementing the system as well as the end users that are part of the
enterprise.
 Implementation – Identify resources both inside and out that will work on the
system. How will the CDSST system construction interfere with current work? The
current decision making policy is based on the experience and knowledge of the
high level executives. Implementing CDSST will add on this part to the power of
data. The suggested decision extracted from the data will have a strong possibility
to take a positive change in the enterprise.
 Resistance – What areas and individuals will be most resistant? Developing a
change resistance plan would be suggested. Many people may suggest that the
current way of making decision is better than with the incorporated CDSST system.
They should be shown clearly how the system aids to enhance the enterprise to take
correct and advantageous decisions to take itself to a higher level.
 Strategies – How will the organization deal with the changed system
implementation? Do new processes or structures need to be reviewed or
implemented in order for the system to be effective? The first few periods will be
challenging for the enterprise to accommodate the system in the decision making
process. But, gradually practicing the system will turn it to an easy and enhanced
technique.
32

 Adapt & Review – How much time does the organization need to adapt to the new
system. How will it be reviewed and monitored? What will happen if through a
monitoring process, additional changes must be made? CDSST will be operated by
an administrator for source data management which includes loading of new
records for dimensions and facts. Next, the executives will view the inferences to
take proper and correct decisions. This takes some time in order to incorporate
CDSST in the decision making process completely. In the early stages, the
involvement will be less as the question of trust in the system arises. Gradually, the
involvement will rise because of the positive effects shown by the implementation
of the system.
33

6. DETAILED DESIGN

6.1. Overview

Figure 6.1 System Block Diagram of CDSST


34

The system is divided into five different blocks. The section of the system block diagram
included under the pair of braces are complete to suit the purpose and requirement of the
system. The functionality of different components of the system are described in the form
of various block as below:

6.1.1. Data source block

Data source is the block that consists of origin of all the data associated with our system.
Data source consists of both raw data and information available in different formats. The
main source of the data is the Call Detail Record information. Besides CDR, the sources
also include flat files, database etc. Data and information available in this block do not
have a common format.

6.1.2. ETL block

This block is involved with extracting, transforming and loading raw data with disparate
format into the data warehouse. This block contains of 3 different sub-blocks; each
involved with its individual task as described by the name. First sub-block is the extract
block and is responsible for pulling data out of the source systems. Second sub-block is the
transform block and is responsible for applying a series of rules or functions to the
extracted data from the source to derive the data for loading into the end target. Third sub-
block is the load block and is responsible for loading the data into the end target, which is
data warehouse.

6.1.3. Data warehouse block

The data warehouse block is the centralized repository that stores data from multiple
information sources. After ETL process, all the data are stored in warehouse in a common,
multidimensional data model for efficient querying and analysis. Data warehouse contains
a wide variety of data that presents the history and scenario of the business. The
information acquired from warehouse is useful in reporting, data analysis and data mining.

6.1.4. Dashboard block

Dashboard is a user interface that organizes and presents information in a way that is easy
to read. The dashboard summarizes all the business information in the form of reports that
consists various figures and graphs. Dashboard has important role as it provides strategic,
35

analytic, operational and informational support to the users in simplistic but sophisticated
manner.

6.1.5. Data mining block

Data mining is the computational process of discovering patterns in large set of


information. This information is available through the data warehouse. Data mining block
is responsible for analyzing data from different perspectives and summarizing it into useful
information. This information can be used for making business strategies and innovations.
It is also helpful for making inferences and predictions related to the business.

6.2. System Diagrams

Figure 6.2 Component Diagram of CDSST

6.2.1. Component Diagram

The above diagram shows the most basic components involved in the system. CDR Switch
collects information from all the call transactions occurring throughout the area and stores
this information in the telecom database. Telecom database is a repository consisting of all
36

the data regarding the transaction and also the telecom industry. This database is used as
the raw data source required by the application to construct warehouse. ETL tools are used
to extract the data from various raw data sources and load them into the staging tables of
the warehouse. Staging table successfully transforms the unstructured data into a structured
and organized format which is then loaded into the target table of the warehouse.

6.2.2. Sequence Diagram

Figure 6.3 Sequence Diagram of CDSST

The above diagram shows the sequence in which the system administrator operates the
application. The system uses secured login process to identify the administrator from other
normal users. Once logged in, system administrator is provided with privilege to modify
the contents of the warehouse. After receiving command from the administrator, the
application selects the raw data sources and then separates between dimensions and facts.
Once separation is performed, data are loaded into respective dimension and fact tables.
37

Figure 6.4 Sequence Diagram of CDSST (Report Generation)

The above diagram depicts the sequence of operation occurring during the generation of
report by the manager. This process requires no authentication. Once the application is
running, manager can select to generate report or to acquire aid in decision support. Based
on the selection, appropriate query is selected and performed on the warehouse and the
result is presented in a well-formed report.
38

6.2.3. State Chart Diagram

Figure 6.5 State Chart Diagram of CDSST

The above diagram shows the various states involved with our application. Once the
application initiates, the user is prompted with authorization. The logon process is essential
if the administrator wishes to work with the warehouse. After successful logon, the
administrator can select the data sources. The application decides whether there are new
dimensions data or not. If no new data regarding dimension is detected, facts are directly
loaded into warehouse, otherwise dimension tables are filled before loading facts. Once
dimensions and facts are loaded, the warehouse is updated.
39

6.2.4. Activity Diagram

Figure 6.6 Activity Diagram of CDSST

The above diagram is an activity diagram depicting activities involved with three major
components: system administrator, telecom manager and system/application. Application
can be initiated by either system admin or the manager by giving respective command. The
application corresponds in the way it suits and fulfils the need of the respective user and
40

hence performs designated activity. Activity involved with system admin is to update
warehouse whereas the activity involved with telecom manager is generating report.
41

6.2.5. Use Case Diagram

Figure 6.7 Use Case diagram of CDSST

The above diagram shows use cases for our system. Our system involves two actors: those
are system administrator and telecom manager. System administrator is responsible for
working with the data warehouse whereas telecom manager is simply a user and is able to
generate and view reports and predictions. Thus, based on their responsibilities, each user
are provided with different set of operational tools.
42

6.3. Warehouse Schema Diagram

Figure 6.8 Snowflake Schema Diagram of CDSST Warehouse


43

7. IMPLEMENTATION

7.1. System Design Speculation

This section of the report focuses on the more general and essential description of the
system and how different system components are put together to obtain various milestones
described earlier in the report. The schema diagrams in each of the following components
illustrates a logical connection between the various entities (tables) within each
component. Since warehouse in our project aims to preserve data redundancy to promote
the availability of the data, the actual table implementations in the project are devoid of
such key based connections. The schema for different dimensions and facts explained
below are attached in Appendix A1 section of the report.

7.1.1. Time dimension

Time dimension is most basic dimension comprising of second, minute and hour
dimension tables. Time dimension in the data marts enables us to manage records ‘online’
and process real time data to deliver real valued services. The need of Time dimension is to
represent the time-stamp associated with each call. Multiple calls are recorded by the CDR
switch in the matter of seconds and are loaded into the warehouse. Information held by
time dimension includes the start and ending of events occurring during calls. The call
related inferences can be deduced in terms of seconds, minutes and hour basis. We have
divided a day into 8 different smaller intervals. The records generated by the switches can
be analyzed across the references of these 8 smaller intervals to compute the pick hours of
the call and the least active durations of the telecommunication services. In doing so, we
roll up across the time dimensions tables to aggregate results and generate the reports to
suit the purpose. The methodology is simple and pretty straight forward. The
implementation of the snowflake schema facilitates such roll up operation and aggregation
of facts generate useful results. The entries of time dimension table are rather static or
technically, Slowly Changing Dimensions (SCDs). These entries can be loaded one time at
the initial phase of the warehouse ETL process and can work without a fault through the
rest of the session of analysis and reporting.
44

7.1.2. Date dimension

Date dimension is used in conjunction with time dimension in order to specify the exact
occurrence of the call. Once the time dimensions are correctly loaded into the data
warehouse, use of date dimensions allow us to visualize those real time data in the form of
a calendar. Main components of date dimension are day, month and year. Call detail
records of daily basis are recorded on the lowest level of the table hierarchy. Subsequent
inferences of the weekly, quarterly, semi-annual or annual basis are simple obtained by
roll-up process. A mere aggregation of the facts at lower level tables give a clear and valid
result for the facts of the upper level table in the hierarchy. Like Time dimension, Date
dimension stores static data regarding the dates for a specified period of time. These entries
too can be performed ‘on-time-load’ operation at the beginning of the session of analysis
and reporting.

7.1.3. Location dimension

Date and time dimensions allowed us to successfully visualize the occurrence of an event
onto a calendar. Since it was a real time data that we were concerned about, we required
some real time reference that could vouch of the occurrence of a real time event. The
correspondence was maintained by the development and use of location dimension.
Location Dimension is used for specifying the location of the calling parties. Location is
specified through area codes. The higher level tables contain the zone and region specific
data corresponding to the individual district obtained by the cross reference of area codes
and the district information. With this dimensions, data related to the demographic
distribution of the consumers and the visualization of spatiotemporal data at various levels
of abstraction can be achieved. Again, the whole operation would merely be a roll-up
procedure and subsequent aggregation.

7.1.4. Customer dimension

A real time event with a real world reference provided a fundamental basis for all the
analysis of events. Customer dimension serves as a real time variable on which all our
analysis can be based and the reports can be generated. These reports, due to the presence
of customer-related details add up to the relevancy and usefulness of the reports. Although,
location table partially served the purpose, the design of customer dimensions was the
demand of the context. Customer dimension is logically connected with various other
45

fields in the way shown in the schema above. Customer dimension is center to all analytics
in the data and is related to other dimensions in a manner that suits the purpose of the
analysis. The simultaneous correlation of the customer dimension with other dimensions
assist us in generating facts associated with the spatio-temporal events recorded by the
CDR switches.

7.1.5. Service dimension

Services dimension provide the functionality that could add up to various other system
functionalities and services. It helps us analyze the various aspect of interactions of a user
to a telecommunication industry and its services. It serves as the socket through which
association of a normal customer to the communication services provided the
telecommunication industry can be logically analyzed and successfully visualized. The
components of services dimension is direly affected by the customer domain, the
telecommunication company policy and the market demand, we acknowledge that our
dimensions are not sufficient. However, the schema above shows basic framework of
minimalistic services that could provide a self-sufficient basis for analysis of interaction of
a customer with the telecommunication industry.

7.1.6. Fact table

Fact table is the final table generated after the ETL process on various sources and
dimension tables. Fact tables are loaded in the data warehouse and are the source to
perform mining operations and acquiring useful information. Fact table comprises of all
the primary keys of various base dimension table along with the measurement facts. The
structure of the fact table presented above corresponds to the record-level details incoming
to the system from a given CDR switch. Initial report generating approaches will centered
on these entries and these initial reports are simply the results of SQL queries. On the
second part of the project, these results shall be project into a presentable Graphical User
Interface (GUI).

The fact table holds two main measurement metrics which are duration and cost. Duration
field provides information about the total time-span of the execution of a call whereas the
cost field provides information regarding the expenses incurred during the call. All the
other fields are simply foreign keys of dimension table that allows the analysis of
information held by the fact table with regards to various dimensional fields.
46

7.2. Detailed System Operation

The part of the project that has been successfully completed till the date of the mid-term
report submission includes a primordial model of the data warehouse with a base level fact
table. The base level fact table consists of essential facts: cost and duration, which have
been described in detail in the fact tables above. This section of report consists of the
detailed description of how a fact table is constructed by our system using the dimension
tables and how reports are generated using the Micro Strategy Tools. The overall summary
of the system process is illustrated in the chart below.

The individual component operation is described hereafter:

CDR Data Import

Staging Table and data loading

Customer dimension check and id valuation

Date dimension check and id valuation

Time dimension check and id valuation

Location dimension check and id valuation

Services dimension check and id valuation

Figure 7.1 Process flow of Project A


47

7.2.1. CDR data import

The data that needs to be imported to our system primarily consists of records and logs of
difference all activities. To facilitate the purpose of analysis and to cover a larger domain,
we have combined data contents of multiple switches, all anonymous and logically related
into a single flat file. Although the information domain of the logs and records obtained
from the switches is huge, since our analysis is focused and constricted to meet very few
objectives, we need the transformation of these high dimensional data into lower
dimension. The lower dimensions are identified and used in such a way that the changes
relevant to our objectives are well preserved and the analysis of sampled data is valid. The
objectives of our analysis are as follows:

 Achieve new enterprise level data model to handle huge amount of data in Teradata
database.
 Provide a new base for OLAP analysis through Microstrategy reporting tool.
(analyze customers’ behavior and improve its marketing campaigns and pricing
strategies )
 Make predictions to support in the formulation of company’s business policies by
the mining of the CDR and customer data.

7.2.2. Staging table and loading

As mentioned in the above writings, our major project analysis required only a fraction of
what was obtained from the CDR switches. Hence we needed to create a temporary table to
hold only the required data before the data went to processing. The required fields were
pulled out of the CDR log files and inserted into a temporary table called staging tables.
The structure of these tables is very much similar to the target tables. The only difference
is that the staging tables uses identifiers while the target tables use keys. All the target
tables are populated using their individual staging table. The key used in each of the target
tables are generated as using unique identifier generator; hence the keys in each target table
are auto-incremented for each entry of the staging table. After each staging table is
successfully loaded into the target table, the content of these tables are cleared for new data
to be loaded.
48

An example of the staging table is shown below:

The various fields in the staging table have their own significances in the process of the
collective analysis of the data.

 calling_number: The calling_number field is used to identify the calling party in


the process.
 called_number: The called_number field is used to identify the destination party in
the process.
 Area_code: The area_code field is used to inspect the call process and identify the
area from which the given call is originated.
 call_duration: The call_duration field is used to isolated the interval of time for
which a particular service is accessed via a connection network.
49

Table 7.1 Staging table Structure

 scheme_id: The scheme_id field is used to identify the particular type of service
used by a customer such as a voice call or a SMS service.
 session_origin_date/time: These fields mark the beginning of any service accessed
by the consumer with a precise date and time stamp associated with every service
access instance.
 session_end_dat/time: These fields mark the end of any service accessed by the
consumer win a precise date and time stamp associated with every service
termination instance.
 caller_prefix/called_prefix: These fields help us identify the type of service and the
exact scheme being used by a particular customer such as prepaid-to-prepaid-voice-
call service.
 roaming_flag: This field helps us identify if the user is being a part of the roaming
services provided by the communication enterprise.
 peak_flag: This field helps us identify if the user is using the service in the peak
hour (6:00 to 22:00) as provided by the telecommunications enterprise.
50

7.2.3. Customer dimension check and id valuation

Customer dimension table consist of all the records and details that are useful to tie a
telecommunication enterprise assigned number identification to the real world person.
These details include first name, last name, address, DOB, legal identification and gender
details. The address details of the customer in the customer dimension table correspond to
the permanent address of the customer. The temporary address of the customer are a part of
location table and is accessed through the use When a data entry from the staging table
needs to be analyzed, the calling_number and called_number field of the staging table are
cross referenced with the customer_dm table to aggregate the details of the respective
number in terms of geo-spatial variables. Upon a ‘hit’ on the table, the corresponding id,
which is the telecommunication enterprise assigned identification number, is extracted and
inserted into the fact table. The types of inferences that can be generated from this
reference are:

 Average call per period of a particular customer and the details of the expenses.
 Average call per locality, on the basis of permanent address of the customer.
 Graphical representation of customer behavior with respect to a particular
telecommunication services in a given time interval.
 Customer-customer communication pattern and link states between any two
random customers on the basis of telecommunication activity.

7.2.4. Date dimension check and id valuation

Date dimension consists of the date details from year level down to the day level for a user
specified date span. The initial analysis data set consists data from 2000-2030 AD. The
session_origin/end_date from the staging table is cross referenced against the date
dimension to aggregate other details out of the date dimension. Upon hit, the identification
number of the corresponding entry is added to the base level fact table. The types of
inferences that can be generated from the date dimension are as follows:

 The weekly/monthly/quarterly/annual call pattern for any given scheme in the


telecommunication company.
 Seasonal evaluation of service accesses and the customer traffic in the corporation.
51

 Dependency of customer in situations of unforeseen events and calamities.


 The chronological service access analysis and reporting.

7.2.5. Time dimension check and id valuation

Time dimension consists of the time stamp details from seconds’ level to hour level. The
time dimension has been hierarchically arranged. The lowest level table consist of a day
portioned into 86400 seconds. The service access/termination time stamp obtained from
the staging table is cross referenced across the entries of the dimension table to segregate
the id of the record that gets hit. This id is incorporated into the base level fact table. The
probable inferences that can be generated from time dimension are as follows:

 Pick/off hour call detection


 Graphical analysis of telecom service usage logs on a day basis.
 Events per second analysis of the user interaction with the telecommunication
services provided by the company.
 Call costs based on the peak/off hour tariffs.

7.2.6. Location dimension check and ID valuation

Analysis of data from individual CDR switches constrained our data domain to be a narrow
one. We had to access records from multiple switches to have presentable results. We built
location dimension consisting of data details down to the district levels. Since our main
priority was not to make a tower-wise analysis, we have included tower_max_number field
for fail safe. The location dimension on higher levers consist of zones and regions. Staging
table consists of the field area_code that consists of the area from which the
calling_number is accessing the telecommunication service. The area_code value is cross
checked with the entries of the location dimension table to find a record hit. As in the
previous dimension table lookup processes, upon the hit, the corresponding id is included
in the base level fact table. The possible inferences that can be dug out of the location
dimension table are as follows:

 District level call records analysis


52

 Location specific average call, total call, and call frequency analysis.
 Higher level data inferences related to location wise call activities.
 Traffic pressure per locality with/without chronological frame of reference.

7.3. Business Intelligence and Reporting

The CDR data along with other related data stored in the warehouse are presented in the
Microstrategy dashboard in the form of interactive reports. The dashboard is divided into
different layouts which are:

7.3.1. OLAP Analysis:

In this layout, data from the intelligent cube (OLAP cube) is taken and presented to create
different reports. The main benefit of OLAP analysis is that it provides the drilling
capabilities while viewing the reports as well as provides caching of data for faster
response in dashboard navigation. Under OLAP analysis, we categorized the analysis into
five categories.

7.3.1.1. Call Analysis:

In this layout, the call duration was analyzed in the basis of number of calls. What call
duration do most users prefer? This question is answered. Call duration is divided into four
range of values and number of calls made by customers for each duration range is
displayed in a bar graph. Besides, number of calls per scheme is also shown in another
report.

7.3.1.2. Geographical Analysis:

The CDR data used in our system was comprised of call records from all over the nation.
So, it was beneficial and important to perform analysis according to geographical
locations. This kind of analysis also aids to attracting attention to some specific areas
where improvement is necessary. To make the analysis more effective, a date slider is
included so that the user can perform the geographical analysis only on the selected date
interval. Under geographical analysis, three different reports are shown:

 Call by Area: This report is shown as a bar graph. In this report, how much calls are
generated according to call area is shown. One can select a development region for
53

analysis and see how different areas within the selected development region differ
or are similar in number of call records.
 Call by Region: A pie chart is displayed to depict the number of calls per region i.e.
Himalayan, Hill and Terai region.
 Call by Development Region: This bar graph displays an overview of number of
calls per development region.

7.3.1.3. Datewise Analysis:

We had the call record data of over a time span of a year in our warehouse. Since call
record data are massively generated (millions of records per day), we could only use the
record of a year. This was then gone through quarter wise analysis, month wise analysis
and analysis per day of week. This layout comprised of three reports:

 Call by Day of week: It is always important for a telecommunication enterprise to


know which day the maximum number of calls are generated and which day there
are least calls. This helps them to make some plan or policies so as to boost the low
number of calls. For this, a bar graph is displayed showing the number of calls per
days of week. Again, for making the analysis more interactive and giving the user
more flexibility, a date slider was included so the graph depicts the results as per
the selected date interval.
 Call by Quarter: From business point of view, quarterly reports are considered to be
the most important reports. It shows the number of calls in all four quarters that
helps the executives to compare the previous performance with that of the current.
This report was shown in the form of a pie chart.
 Call by Month: Monthly reports shows how previous months are in comparison to
the current month. This report was shown in the form of a bar graph.

7.3.1.4. Timewise Analysis:

It is always important for a corporate telecom house to know which time their customers
make the maximum number of calls. This analysis suggests a strong point in decision
making – encourage the off hour calls by bringing up a new scheme. A dropdown selector
is kept so as to narrow down the analysis to the specific area. For example: It may be
interest for the executives of an enterprise to know the least calling hours in the
54

Kathmandu Valley for its improvement. The analysis could be of course based on all areas
too. In this analysis, two different reports are presented:

 Call by Hour of Day: This report shows the number of calls made by customers
against the 24 hours of the day in a bar graph. It can also be drilled up to see the
number of calls against different time of day (A TOD consists of 3 hours interval).
 Call by Hour of Day and Day of week: This report gives an insight on how number
of calls varies with different days of week and hour of call. It is helpful for
comparison among different days. Eg: Sunday noon with Wednesday noon.

7.3.1.5. User Analysis:

In business it is said that Customers are gods. It is true because they are the source of all
the income of the business enterprise. If the customers are unhappy, then it is the business
enterprise who shall bear a loss. This scenario is not different in the telecommunication
enterprise. So, there lies keen interest for the executives of such enterprise to know their
customers in much detail. They need to analyze their customers from different aspects such
as geographical aspect, gender aspect and many others. In our CDSST system, we have a
customer base of 11,788 customers on which we perform the analysis. For this, we created
a user (customer) analysis layout. In this analysis, a user is required to select one or
multiple areas and based on the selection, gender wise report in the form of pie-chart and
gender and area wise report in the form of bar diagram are displayed.

7.3.1.6. Revenue Analysis:

Revenue is by far the most important thing for a business enterprise and
telecommunication enterprise is not an exception. Their objective is generally to earn
profits by giving quality service to customers. This makes the revenue analysis one of the
most important reports that executives want to see. In this case, we created the following
reports in revenue analysis.

 Bubble Graph: An overview of revenue is shown on a bubble graph where revenue


and number of calls are depicted with development region and quarter as the two
axes. Revenue is shown by color whereas the size of bubble shows the number of
calls in that region and specified quarter.
55

 Revenue by Quarter: This report is shown as a pie chart where the revenue of four
quarters are depicted.
 Revenue by Scheme: In this report, revenue by different schemes e.g. post to post
call, pre to pre call, pre to post call and post to pre call are shown in pie chart.
 Revenue by Development Region: Revenue according to development region are
shown in a bar diagram. Here a development region can be drilled down to see
which area constitutes most of the revenue of that region.

7.3.2. Loss Analysis

Call dropping is a serious issue in telecommunication industry. Many calls don’t get
connected because they can’t reach the receiving server. In this layout, comparison of
dropped calls and succeeded calls are shown on the basis of time of day. The analysis can
also be narrowed down to a specific area by the area selector.

7.4. Data Mining and Knowledge Discovery

For mining hidden information out of the call detail record data residing in the warehouse,
a powerful statistical and analytical tool was necessary. This tool was also required to be
robust enough to handle large voluminous data and suitable enough for integration with the
Teradata warehouse and Microstrategy reporting tool. Having all these features and above
all being open source, R was the most suitable tool for data analytics for us. Teradata and
Microstrategy provided packages named TeradataR and MicroStrategyR respectively for
communication with R. These packages played the intermediately role for bridging the gap
between R and warehouse & reporting tool.

7.4.1. Customer Segmentation using RFM Technique

The recency, frequency and monetary values of customers obtained from the warehouse
are first normalized to scale of 1 to 5, 1 being the best value and 5 being the worst value.
So, customers with all recency, frequency and monetary values equal to 1 are the best
customers. On these normalized value of attrubutes, clusters are created using RFM
technique. Normalization of customers’ RFM attributes caused many customers to have
identical values of these attributes. This seem to form a group of customers even before the
56

clustering is applied. The clustering algorithm was applied 10 times and from that a best
result of clusters was selected. The clustering algorithm generated five clusters with the
following number of customers in it out of our customer base of 11788 customers.

The center values of R, F and M for each cluster and withinss (within-cluster sum of
squares) are shown in the table below:

Cluster Size Color Center (R, F, M) Withinss

1 1779 Green 1.00 2.97 3.02 165.02

2 383 Blue 2.31 1.00 1.01 172.36

3 1951 Yellow 1.00 4.36 4.37 904.35

4 4276 Black 1.00 1.00 1.00 18.92

5 2611 Red 1.00 1.94 2.00 144.82

Table 7.2 Cluster details - Customer Segmentation

The totss (total sum of squares), tot.withinss (total within-cluster sum of squares, i.e.,
sum(withinss)) and betweenss (between-cluster sum of squares, i.e. totss-tot.withinss) for
the clustering result are 36176.54, 1405.462 and 34771.08 respectively. The selection
criteria for best clustering result was betweenss/totss and the selected cluster had the
highest value of 96.1%.

Each cluster is assigned a different color so that they can be distinct. The fourth cluster is
the largest cluster with about 36 % customers. This cluster mostly contains the best
customers with RFM values equal to 1. Customers with high recency i.e. those customers
who make a call after a long time are grouped by the second cluster which is the smallest
cluster. And remaining three clusters namely 5 th cluster, 1st cluster and 3rd cluster groups
customers with low monetary and frequency attributes respectively. The clusters are
plotted using 3d scatter plot in R and integrated in the Microstrategy dashboard.

In customer segmentation layout, the output of RFM clustering is shown in a scatter 3d


plot. Here, Recency, Frequency and Monetary are the three axes, size of bubble denotes the
number of customers with that attributes and color of bubble denotes cluster. Customers
57

that are near to origin are the best customers with farther customers being deviated from
best customers.

Also, customer count are shown in another panel according to Recency, Frequency and
Monetary attributes. This shows what type of customers are more in our system.

7.4.2. Customer Churn Prediction

Churn prediction has been implemented in CDSST using CART – a decision tree based
algorithm. The variables chosen for building the model are:

 customer's age
 customer's tenure
 average no. of calls per day
 average duration per call
 average cost per call
 customer's gender

A model was prepared for training and building a decision tree for the prediction. The
training set contained 5% churners. Based on the results, average calls per day played the
most important role in determining if the customer is churn followed by customer's tenure,
average duration and customer's age. Gender had a very little effect in the result.

The model was finally used for prediction of churn from the data in the warehouse. From
the prediction of customers for the year 2012, it is found that 7% customers are more likely
to churn in the upcoming month. The obtained results is shown in the main dashboard.

7.5. Tools and Technologies for System Development

7.5.1. Teradata Administrator 13.0:

Teradata Administrator provides a comprehensive Windows-based graphical interface to


the Teradata Database Data Dictionary for performing a multitude of database
administration tasks on the Teradata Database. Since we have implemented the warehouse
of Telecommunication system in the Teradata database, this tool provides us with various
database management functions. The Teradata database has been installed on the server of
Yomari Inc. and we access the database with the help of the client applications that include
58

Teradata Administrator 13.0 and other Teradata utilities. With this tool, we performed the
following functions:

 Create, Modify and Drop Databases


 Create Tables
 Drop or Rename Tables, Views or Macros
 Run an SQL query
 Display information about a Database
 Display information about a Table, View or Macro

7.5.2. Teradata BTEQwin

Basic TEradata Query (BTEQ) Window is a Unicode application that allows the user to
run multiple BTEQ sessions and to access different databases simultaneously. It provides
pulldown menus and dialogs that users use to send SQL statements to the Teradata
Database. With this tool, we can

 Run a BTEQ script using drag and drop functionality


 Scroll back the window display
 Save all BTEQ Window output
 Edit a BTEQ or a SQL statement
 Re-send a BTEQ or a SQL statement to the Teradata Database without retyping
For accessing the Teradata database, logon command was given as:

.logon 192.168.50.50/telecomdb,password

Using BTEQwin, we wrote the create table query and used its drag and drop functionality
to create the table.

7.5.3. Teradata Fastload

FastLoad is a command-driven utility to load large data into an empty table on a Teradata
Relational Database Management System (RDBMS). Fastload can load large amounts of
data to the Teradata database much faster than with BTEQ using simple insert statements.
Constraints on the use of fastload are

 the target table must be empty


 no access (read or write) is allowed on the target table while loading
59

 only a single table can be loaded in one job


We used this utility of Teradata to load the dimensions and facts of the data warehouse. It
can directly load the data from flat files to the Teradata database and for this project, we
had the records in .csv format. For loading the table, we need to create two error tables
which gives the errors on loading process if they occur.

7.5.4. Python 2.7

Python is a general purpose, high-level programming language which enhances code


readability, and its syntax allows users to code at fewer lines than other high-level
languages like C. It is most widely used as scripting language.

Python 2.7 version was used for writing scripts in the project to populate the flat files for
loading dimensions. Besides, it was also used to transform the data records from the raw
format to specific format intended for storing in our data warehouse.

7.5.5. Microsoft SQL Management Studio

Microsoft SQL Management Studio was used to draw the schema diagram of the
dimensions which shows the logical connection between the various hierarchy and other
accessory tables of a dimension.

7.5.6. SVN

TortoiseSVN is a Subversion client, implemented as a Microsoft Windows shell extension.


It is free software released under the GNU General Public License. In Windows Explorer,
besides showing context menu items for Subversion commands, it also provides icon
overlay that indicates the status of Subversion working copies. It's intuitive and easy to use,
since it doesn't require the Subversion command line client to run.

We used TortoiseSVN to create a repository of our project and share it among the project
members. It provided code, documents, data, diagrams, and research documents to be
managed efficiently through the process of commit and update.
60

7.5.7. Red Mine

Redmine is a free and open source flexible project management web application. It
includes a calendar and Gantt charts to aid visual representation of projects and their
deadlines. It handles multiple projects. Redmine provides integrated project management
features, issue tracking, and support for various version control systems. Being written
using the Ruby on Rails framework, it is cross-platform and cross-database. Some of the
main features of Redmine are:

 Multiple projects support


 Flexible role based access control, flexible issue tracking system
 Gantt chart and calendar
 News, documents & files management
 Feeds & email notifications
 Time tracking
 Custom fields for issues, time-entries, projects and users
 Issue creation via email
Red mine was used as a project management tool for the project CDSST. Tasks were
created and assigned to project members within a given deadline. The assignee has to
complete the task assigned within the given deadline and update the status of the task via
Red mine.

7.5.8. Microstrategy

MicroStrategy, Inc., is a business intelligence (BI) software vendor whose software enables
organizations to analyze vast amounts of data to make strategic business decisions. The
software delivers actionable information to business users via the web and mobile devices,
including Apple, Android and BlackBerry smartphones and tablets. The software includes
technologies to capitalize on the big data, mobile, cloud, and social media trends.
MicroStrategy Report Services generates dashboards, scorecards, and production reports.
61

8. RESULTS AND CONCLUSION

8.1. Results

As stated in the ‘Business Intelligence and Reporting’ chapter, we have been successful in
generating various forms of reports that give insights to the status of a telecommunication
industry from various perspectives. We have used pie-charts, bar graphs, scatter plots, and
bubble graphs to present the results in graphical form. The information are arranged in the
form of dashboard. We have a highly flexible dashboard that enables the users to use the
system in interactive or flash mode. The charts are displayed with beautiful animations in
the flash mode. The results can be viewed in the screen as well as can be easily exported to
various other formats such as pdf, html, excel and flash. The results corresponding to
various hierarchical levels created in the warehouse can be easily accessed and examined.
The output of the system can be viewed from laptops, ipads, iphones as well as android
devices. The snapshots of the results are given below:

8.1.1. Dashboard Home

Figure 8.1 Dashboard output – Home


62

Figure 8.1 is the snapshot of home of the dashboard of CDSST. It suggests the project
overview of CDSST as composed of OLAP analysis, Customer segmentation, Churn
Prediction and Loss analysis. These four main components of the project are structured in
four layouts which can be easily accessed from the tab heading of the dashboard.

8.1.2. Call Analysis - OLAP Analysis

Figure 8.2 Dashboard Output - Call Analysis | OLAP Analysis

OLAP Analysis is one of the most important part of dashboard for the user of CDSST i.e.
executives of telecommunication enterprise. The data in OLAP Cube is shown under
different categories. In Call analysis category, two reports are shown. The first report is a
bar diagram named Call by Schema which shows how number of calls varies between
prepaid and postpaid users of the telecommunication enterprise. This report suggests that
most calls happen among prepaid users.

The next report is also a bar diagram named Call by Duration in which the distribution of
call duration is grouped into six different range values and displayed its frequency. This
report suggests that short duration call are more preferred by the customers. Calls with
duration less than 30 second are the highest in number. Another important inference given
by this report is to encourage the customers to have long duration call by introducing some
schemes and giving discounts for long calls. This can boost the revenue well.
63

8.1.3. Geographical Analysis – OLAP Analysis

Figure 8.3 Dashboard Output - Geographical Analysis | OLAP Analysis

Geographical analysis consists of three reports which are driven by a date slider where user
can choose a date range. The first report is a bar diagram named Call by Area where user
can select a development region and see the number of calls in areas of that region. Areas
with low number of calls can be identified and can be improved by targeting customers of
such area. The next two reports are Call by Region (pie-chart) and Call by Development
Region (bar diagram). These reports can be drilled down to see which areas or zones
contributes the most and the least to a region’s number of calls. Information is shown in
the mouse tooltip about the hovered part of the report. The data can also be seen in grid by
changing the view to grid on right click option.
64

8.1.4. Datewise Analysis – OLAP Analysis

Figure 8.4 Dashboard Output - Datewise Analysis | OLAP Analysis

Datewise analysis shows the call pattern according to date ranges. The report Call by Day
of Week shows the seven days of week with number of calls on a bar diagram for the
selected date range from the date slider. This report suggests that weekends are the busier
than weekdays. Similarly, the next report is Call by Quarter which is a pie diagram. It
shows the number of calls in four different quarters of the year 2012. The fourth quarter
has the highest number of calls which is because Dashain and Tihar festival falls in this
quarter and the call density is high at festival time. The third report is Call by Month – bar
diagram which shows the call distribution according to various months.
65

8.1.5. Timewise Analysis – OLAP Analysis

Figure 8.5 Dashboard Output - Timewise Analysis | OLAP Analysis

In timewise analysis, the main objective is to find out which is the busiest and the low call
density time period. For a selected area, the number of calls for different hour of day is
shown in the Call by Hour of Day report in the form of a bar diagram. This suggests that
the evening hours ranging from 5:00 PM to 7:00 PM are the busiest. Special scheme can be
suggested for late night hours e.g. after 11 PM. The next report named Call by Hour of Day
and Day of week as the name suggests shows how number of call varies with time of day
in the days of week. This also provides a comparison of different days of week in terms of
number of calls also giving insight about the calls occurring in certain time of the day. This
report suggests that Friday evening time is busier than that of other weekdays.
66

8.1.6. User Analysis – OLAP Analysis

Figure 8.6 Dashboard Output - User Analysis | OLAP Analysis

In this layout, user can select multiple areas for user analysis. After selection, gender wise
analysis of user is presented. In the Call by Gender pie diagram, number of calls by male
and female for the selected areas are displayed. For some areas, the participation of female
in making call is very small which can be improved by targeting such areas and giving
priorities to females. In the next report, selected areas are compared in terms of gender and
number of calls made by them which shows the need of enhancement of plans in low call
density areas.
67

8.1.7. Revenue Analysis – OLAP Analysis

Figure 8.7 Dashboard Output - Revenue Analysis | OLAP Analysis

In revenue analysis, reports are shown in two different layouts named overview and chart.
In the chart layout, three different reports are shown. The first one is the Revenue by
Quarter which gives the performance of the enterprise in the four quarters in terms of
revenue. The fourth quarter containing most number of calls collects highest revenue.
Similarly, second report named Revenue by Scheme shows in a pie diagram how
interaction between prepaid and postpaid users contribute to the revenue. Postpaid service
having low number of users as compared to the prepaid service can be boosted by
introducing scheme oriented campaigns. Revenue by Development Region report is a bar
diagram showing contribution of five different development regions in the enterprise’s
revenue. Central development region contributes the most with eastern and western
development region following it.
68

Figure 8.8 Dashboard Output - Bubble Graph Revenue Analysis | OLAP Analysis

Figure 8.8 shows the overview of the status of the telecommunication enterprise on how
well they are in the revenue perspective. In this bubble graph, development region and
quarter are the two axes and revenue corresponding to a development region and quarter is
shown by the color of bubble. The size of bubble shows the number of calls. So, larger the
bubble, larger the number of calls. This graph shows that Central development region
contributes to the highest in terms of revenue and has the maximum number of calls.
Eastern and Western region are a little behind central region but are doing well. But Mid-
western and Far-western region are seriously backward. This suggests some special care
must be taken for these two regions. Some marketing campaigns should be run in these
regions and they should be given top priorities.
69

8.1.8. Scatter Plot 3d – Customer Segmentation with RFM Algorithm

Figure 8.9 Dashboard Output - Scatter Plot 3d | Customer Segmentation with RFM

Recency, Frequency and Monetary rank are taken as three axes and its values for different
customers are plotted. Customers falling in the same cluster are assigned same color.
Customers with all RFM values equal to 1 are the best customers. These customers form
the largest clusters shown in black sphere. Customers with low frequency and monetary
values are shown in three clusters with red, green and yellow colors. These types of
customers participate in calls more as a call receiver than caller. The cluster shown in blue
color groups such customers who has larger gap between last call and the previous call.
Such customers are lower in number.
70

8.1.9. RFM Values – Customer Segmentation with RFM Algorithm

Figure 8.10 Dashboard Output - RFM Values | Customer Segmentation with RFM

In this layout, number of customers with these three attributes of Recency, Frequency and
Monetary are shown. The three reports show number of customers according to rank of
these attributes. Number of customers with high call frequencies in a year is considerably
high. Similarly, number of customers spending large amount of money is also high as
compared to other customers. But, most of the customers make a second call within few
days of their first call as suggested by the Customer Recency bar diagram.
71

8.1.10. Churn Prediction

Figure 8.11 Dashboard Output - Churn Prediction

The output of the churn prediction performed with CART algorithm is shown in this
dashboard layout. Here, user can select a development region and see how are the
customers predicted to be churned are distributed in different districts. This report shows
that people of urban area are more likely to be churned. For example, in the Mid-western
development region, Dang contributes to the highest number of customers predicted to
churn out. This is because of factors mentioned in the implementation section of this report
as well as easy access to different service providers. The pie diagram alongside shows the
number of customers predicted to churn according to gender in the selected development
region.
72

8.1.11. Loss Analysis

Figure 8.12 Dashboard Output - Loss Analysis

Considerably high number of calls are unsuccessful to get connected to the destination
number because of various reasons. Call request from MSC server when reaches to the IN
server (prepaid) or Mediation server (postpaid), then call is successful and the two parties
are connected. The problem is that the call request arrives the MSC server but in the route
from the MSC server to IN or Mediation server, the call request is lost. This causes call
dropping. Such calls are recorded in the CDR records with call duration of 0 seconds.
These types of dropped calls are compared with successful calls on the basis of hour of day
at which the call was made. User can select some specific area or perform the analysis on
all areas. Generally, more calls are dropped at high call traffic times.
73

8.2. Conclusion

Corporate Decision Support System for Telecommunications has been completed as a


Major Project for Computer Engineering 066-070 Batch under the supervision of Dr.
Nanda Bikram Bhandari. The project’s journey starts from collection of raw data from
diverse sources in multiple formats (flat file, csv file, soft copy records etc.). The
modelling of the data for was based on Richard Harding’s model of telecommunication
data analysis. Since the primary data domain of the project, Call Detail Records, consisted
of very divers sets of information, it was required for us to strip down various information
area from a given CDR record to manage the project within the scope specified by the
Syllabus of the Course. Design and implementation of data warehouse immediately
followed the data modelling process. Rigorous ETL process enabled successful data
representation in the form of facts and dimensions. To facilitate the information
availability and to preserve the principle of data redundancy for the target data warehouse,
hierarchy of data was maintained across multiple levels. MicroStrategy tool was
implemented to extract reports from the data warehouse built in Teradata Database. Usage
of Microstrategy tools enabled us to present graphically rich reports in a more convenient
manner. Further, the data set was analyzed to see if any future projection of any form could
be made to facilitate the decision making process in the industry. The mining results
obtained from our implementation of data mining algorithms are verified to the results
obtained from the built in Knowledge Discovery Support available in Microstrategy tools
as well. This project of ours gives the blueprint of the implementation model of an
enterprise class decision support system for a telecommunication organization. The full on
applicability and 100% work performance can only be expected out of is system with a
massive restructuring and context based customization of the reporting formats and the
GUI. Further this project can be taken ahead to incorporate other important fields of CDR
data record and project the analysis results that are facilitated by larger decision
parameters.

8.3. Future Enhancement

There are several rooms for future enhancement of CDSST.


74

 The data analysis performed in this project consists of only a single year data. If the
data content could encompass the telecommunication records of several years,
multiple forms of analysis such as time series analysis, regression analysis could be
possible.
 The data model can be further used to analyze and detect frauds in the
telecommunication industry.
 The project performs the loss analysis for any CDR switch but does not suggest any
method to minimize or prevent the loss.
75

Bibliography

[1] C. D, B.-V. D and G. N., "A Call Detail Records Data Mart-Data Modelling and
OLAP Analysis;Computer Science and Information Systems," vol. 6, 2009, pp. 87-
110.

[2] C. Wright, Rating Matters Issue, Service Level Corp., 2001.

[3] Teradata Coorporation, "Integrated Data Management Services," 2012. [Online].


Available: http://www.teradata.com/services/Integrated-Data-Management/.

[4] L. S. Cauffman, J. N. Thompson and J. M. Cauffman, Billing System with Data


Indexing, 1994.

[5] R. E. Farr, "Telecommunications Traffic, Tariffs and Costs-An Introduction For


Managers," Peter Peregrinus, 1988.

[6] I. G. Kennedy, "Forecasting," University of Witwatersrand, 2003.

[7] "Datawarehouse.net," [Online]. Available: http://data-


warehouses.net/architecture/operational.html. [Accessed 22 8 2013].

[8] G. M. R. a. K. S. Namvar M., "A two phase clustering method for intelligent customer
segmentation," in Proceeding of First International Conference on Intelligent systems,
Modeling and Simulation, LIverpool, 2010.

[9] J. T. a. S. E. Hwang H., "An LTV model and customer segmentation based
oncustomer value: a case study on the wireless telecommunication industry," in Expert
Systemswith Applications, 2004, pp. 26,181-188.

[10 C. Chan, "Intelligent value-based customer segmentation method for


] campaignmanagement: A case study of automobile retailer," in Expert Systems with
76

Applications, 2008, pp. 34,2754-2762.

[11 K. Lee, "A cross-national market segmentation of online game industry using SOM,"
] in Expert Systems with Applications, 2004, pp. 27, 599-570.

[12 P. N. a. P. D. V. Jonker J. J., "Joint optimization of customer segmentationand


] marketing policy to maximize long-term profitability," in Expert Systems with
Applications, 2004, pp. 27, 159–168.

[13 J. J. T. G. H. a. O. C. S. Huang, "Marketing segmentation using support vector


] clustering," in Expert Systems with Applications, 2007, pp. 32, 13–317.

[14 K. Ahn, "A recommender system using GA K-means clustering in anonline shopping
] market," in Expert Systems with Applications, 2008, pp. 34, 1200–1209.

[15 Hsieh, "An integrated data mining and behavioral scoring model for analyzing bank
] customers," in Expert Systems with Applications, 2004, pp. 27, 623-633.

[16 C.-H. Cheng and Y.-S. Chen, "Classifying the segmentation of customer value via
] RFM model and RS theory," in Expert Systems with Applications, 2009, pp. pp.4176-
4184.

[17 S. Kentrias, "Customer relationship management: The SAS Perspective," 2001.


] [Online]. Available: www.crm2day.com. [Accessed 21 8 2013].

[18 D. C. Y. ,. H. Y. W. S. Y. Hung, "Applying data mining to telecom churn," in Expert


] Systems with Applications, 2006, pp. 31, 515-524.

[19 V. G. V. D. P. D. E.-P. BAESENS B., "Bayesian Network Classifiers for Identifying


] the Slope of the Customer Lifecycle of Long-Life Customers," European Journal of
operational Research, pp. 156, 508-523, 2004.

[20 S. N. E. E. U. &. V. N. ROSSET, "Customer lifetime value models for decision


] support," in Data mining and knowledge discovery, 2003, pp. 7,321-339.
77

[21 H. K. M. K. &. H. H. H. STAHL, "Linking Customer Lifetime Value With


] Shareholder Value," in Industrial Marketing Management, 2003, pp. 32, 267-279.

[22 T. C. C. C. Y. &. L. C. LEE, "Mining the customer credit using classification and
] regression tree and multivariate adaptive regression splines," in Computational
Statistics & Data Analysis, 2006, pp. 50, 1113-1130.

[23 S. &. J. M. D. AUH, "Compatibility Effects in Evaluations of Satisfaction and


] Loyalty," in Journal of Economic Psychology, 2004, pp. 26, 35-57.

[24 J. Wu and Z. Lin, "Research on Customer Segmentation Model by Clustering," 2005,


] pp. pp. 316-318..

[25 F. Newell, How to use one-to-one relationship marketing to be the leader in your
] industry, McGraw-Hill Publisher, 1997.

[26 C. Zhang, D. Ouyang and J. Ning, "An artificial bee colony approach for clustering,"
] in Expert Systems with Applications, 2010, pp. pp. 4761-4767.

[27 "abs.gov.au," Australiian Bureau of Statistics, [Online]. Available:


] http://www.abs.gov.au/websitedbs/D3310114.nsf/home/Time+Series+Analysis:+The+
Basics. [Accessed 5 8 2013].

[28 "itl.nist.gov," Engineering Statistics Handbook, [Online]. Available:


] http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm. [Accessed 5 8
2013].

[29 "www.statsoft.com," Stastisitica Data Analysis Software, [Online]. Available:


] http://www.statsoft.com/textbook/time-series-analysis/. [Accessed 5 8 2013].

[30 D. S. a. R. M. Boone, "Retail segmentation using artificial neural networks," in


] International Journal of Research in Marketing, 2002, pp. 19(3), 287-301.
78

APPENDIX
Appendix A

Schema for warehouse dimensions

Figure A.1 Schema for Time Dimension

Figure A.2 Schema for Date Dimension


79

Figure A.3 Schema for Location Dimension

Figure A.4 Schema for Customer Dimension


80

Figure A.5 Schema for Service Dimension


81

Table A.1 Structure of Fact Table

You might also like