TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
LALITPUR, NEPAL
The undersigned certify that they have read, and recommended to the Institute of
Engineering for acceptance, a project report entitled CORPORATE DECISION SUPPORT
SYSTEM FOR TELECOMMUNICATIONS submitted by Aashish Acharya, Santosh
Ghimire, Sudip Kafle and Supritam Raj Shrestha in partial fulfilment of the requirements
for the Bachelor’s degree in Computer Engineering.
_________________________________________________
Supervisor, Dr. Nanda Bikram Adhikari
Assistant Professor
Department of Electronics and Computer Engineering
_________________________________________________
Co-Supervisor, Er. Manish Subedi
Team Lead
Business Intelligence Department, Yomari Inc.
_________________________________________________
Internal Examiner, Dr. Arun Timilsina
Head
Department of Electronics and Computer Engineering, Pulchowk Campus
_________________________________________________
External Examiner, Er. Krishna Prasad Bhandari
Sr. Engineer
Nepal Telecom
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purposes may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that due recognition will
be given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying, publication, or any other use of this report for financial gain
without the approval of the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Lalitpur, Nepal
ACKNOWLEDGEMENT
We extend our sincere gratitude to Dr. Arun Timalsina and Dr. Aman Shakya for providing
the necessary guidelines and support for understanding the feasibility and technical aspects
of the project. We heartily thank our supervisor, Asst. Professor Dr. Nanda Bikram
Adhikari, for helping us with the project in every way. Our special thanks go to our
co-supervisor, Er. Manish Subedi, and to Yomari Inc. for providing us an internship and
guiding us through the project. We would also like to express our gratitude to Dr. Sanjeeb
Pd. Panday for guiding us through the documentation and the refinement of our proposal.
We also thank Pulchowk Campus and the Department of Electronics and Computer
Engineering for providing us the opportunity to build this project. Last but not least, we
would like to thank our friends and seniors for helping us with the project. Thanks for
making the difference.
SUMMARY
TABLE OF CONTENTS
ACKNOWLEDGEMENT ..................................................................................................... v
SUMMARY .......................................................................................................................... vi
1. INTRODUCTION ......................................................................................................... 1
2. LITERATURE REVIEW ............................................................................................. 8
4. TECHNICAL BACKGROUND.................................................................................. 16
4.1.2. Transformation............................................................................................... 16
7. IMPLEMENTATION .................................................................................................. 43
7.5.6. SVN................................................................................................................ 59
8.2. Conclusion............................................................................................................. 73
Bibliography......................................................................................................................... 75
APPENDIX .......................................................................................................................... 78
Appendix A ...................................................................................................................... 78
LIST OF FIGURES
Figure 8.8 Dashboard Output - Bubble Graph Revenue Analysis | OLAP Analysis........... 68
Figure 8.9 Dashboard Output - Scatter Plot 3d | Customer Segmentation with RFM ......... 69
Figure 8.10 Dashboard Output - RFM Values | Customer Segmentation with RFM .......... 70
LIST OF TABLES
Table 7.1 Staging table Structure ......................................................................................... 49
LIST OF ABBREVIATIONS
1. INTRODUCTION
Mobile communication is one of the technological fields that has seen rapid growth in
recent times. This growth is driven mainly by technological breakthroughs in mobile
technology, processing capability and communication, and by the fact that the technology
is readily available to the general public. The scenario is alike in both developed and
developing countries. On this basis, the total number of users of this technology worldwide
is estimated to run into the billions.
The telecommunication market today is competitive. In the context of our country, there is
an ongoing battle between two major companies, Nepal Telecom and Ncell, to be the
dominant provider of mobile communication throughout the country. In order to
understand their customers and create a better environment for managing customer
relationships, many telecommunications companies use call detail analysis to achieve
competitive advantage. By understanding customers’ characteristics and behavior, it is
possible to successfully tailor services that meet customers’ needs [1]. This is known as
personalization, and the concept is not restricted to the telecommunications industry. User
profiling is also valuable to the semantic web and has been successfully applied in banking,
financial services and many other industries.
Telecommunications rating is the activity of determining the cost of a particular call. The
rating process converts call-related data into a monetary-equivalent value. Call-related data
is generated at various points in the network, or measurements may be taken by third-party
equipment such as network probes. One or more of the following parameters are used in
the rating system:

- Premium charges (third-party charges for premium content, cost of physical items
  such as movie tickets)
A call detail record (CDR) is a record of call setup and completion, and its format varies
among telecom providers and programs. Over the last ten years the CDR has evolved to
include several new fields: as technology and new marketing opportunities have evolved,
more information has been added to the CDR. Today, the CDR is the currency of the
telecom industry [2]. A CDR contains a wealth of traffic analysis information, including
the origination, termination and routing of a call. You tell the switch what you want it to do
when these numbers are dialed, and the CDR will show how the switch actually routed the
call, correctly or not. Unlike network statistics, the CDR is authentic, real-time information
(Lee McDaniel).
The telecommunications industry has an advantage over other sectors because it holds a
voluminous amount of customer data [2]. Telecommunications companies collect various
demographic data (such as name, address, age and gender) and store call detail records.
Call detail records usually include the originating and terminating phone numbers, the date
and time of the call, and its duration. Demographic and call detail data can be stored
together in a data mart (known as the CDR data mart) and transformed into information
that can be analyzed using OLAP tools.
Although the CDR is the smallest piece of the telecom puzzle, it is the building block for
the network and provides the detailed information carriers need to leverage themselves
appropriately in this competitive marketplace. Knowledge is power: in the telecom
industry, the CDR translates not only into currency but also into power. Intelligent telecom
companies are recognizing the strategic value of efficient handling of CDRs and are
implementing advanced CDR applications such as Equinox’s ShowCDR, ExtractCDR, and
AccessCDR to maximize their resources [2]. Now more than ever, carriers need to manage
CDRs efficiently to compete in this competitive telecommunications marketplace.
The sole purpose of our application, “Corporate Decision Support System for
Telecommunications” (CDSST), is to help the company make decisions and formulate
business policies based on analysis of CDR data. The application assembles data from
various sources, mainly CDRs, and integrates them into a separate data warehouse. Our
system is built on the Kimball approach to data warehouse design: a bottom-up approach
based on extensive de-normalization of the available data, in which data marts are first
created to provide reporting and analytical capabilities for specific business processes. The
reports are a combination of graphs, pie charts, bar diagrams and plain-text predictions.
These records are used to create simple reports that are easy to understand for the
non-technical community. The reports generated by our system provide insight into the
company’s past and present status in the consumer market: they show the company’s past
and present activity and illustrate how much has been done to grow and expand across the
market. The results of the analysis are arranged in the form of a dashboard. The contents of
the warehouse are also processed with various mining algorithms to predict the future of
the company based on the behavior of the data in the warehouse. The end results of the
analysis are presented in a way that is easy for non-technical people to navigate.
Furthermore, the application helps draw conclusions that can be utilized for improving the
business.
The objective of this project was to deliver a product that analyzes the data associated with
any telecommunication company. The system contains a data warehouse built on top of the
Teradata database. The warehouse is designed to efficiently manage the millions of records
generated by the telecommunication switches. The data to be analyzed is transformed and
loaded into the warehouse. Reports are generated with the MicroStrategy tool, with the
warehouse as the source, to analyze customer behavior and calling trends. Predictions are
made using data mining algorithms to support better strategies. The tool is capable of
analyzing the data recorded by different CDR switches, which helps in making plans for
loss prevention between the switches.
The higher-level abstractions of the objectives of our project are described in the general
objectives section of the project proposal. At a finer level of granularity, CDSST has
fulfilled the following objectives:

- Loss prevention through analysis of the call data records (CDR) generated by
  different switches.
- A new enterprise-level data model to handle huge amounts of data in the Teradata
  database.
- A new base for OLAP analysis through the MicroStrategy reporting tool (analyzing
  customers’ behavior to improve marketing campaigns and pricing strategies).
- Predictions, from mining the CDR and customer data, to support the formulation of
  the company’s business policies.
1.4. Scope
Building a warehouse for the analysis of enterprise data is not a common practice in Nepal,
yet it can contribute substantially to the decision-making process. In a telecommunication
enterprise, all user activities are tracked in the CDR data. This data is rich and potentially
useful, but it is considered highly confidential; as a result, no significant research on the
analysis of CDR data has been found. Because of this confidentiality, the data is used only
to determine the monetary value of calls and is not fully exploited for other benefits. CDR
data can also help analyze loopholes in the system. In the present context of our country,
no significant benefits have been obtained from the analysis of CDR data. An enterprise
can use this data to improve its marketing campaigns and pricing strategies, and different
schemes that benefit the enterprise can be discovered from its analysis. Mobile penetration
has been increasing very fast in our country, with two big telecommunication enterprises,
Ncell and Nepal Telecom, mainly responsible for providing services. As the number of
customers increases, the telecommunication enterprises are not able to meet their needs.
This dissatisfies customers, who consequently switch from one network to another
frequently, and the problems compound. The rate of call drops has been quite high, so
customers are not fully satisfied.
This project has delivered an application which allows users to view the inferences, trends,
predictions and so forth drawn from the data in our own data warehouse. All this
information is visualized in an interactive form on a dashboard. The process involved an
ETL process to build the warehouse, accompanied by the creation of a data model. There is
always a challenge of optimization, since we deal with millions of records generated by the
switches (CDR). The project holds good potential to yield informative and interesting
results.
As previously stated, complete analysis of CDR data can deliver high-end business
solutions. Some of the important results that lie within the scope of our project are as
follows:

- Prediction of calling trends using CDR analysis, which can help in managing
  resources to improve call quality.
- Identification of customer networks and solutions for business advertisement
  campaigns.
- Representation of telecom business reach in a spatial database.
- Strategies to prevent call drops and losses occurring in switches throughout the
  network.
- Justification of customer migration from the industry.
The telecom database is huge and can contain a highly diversified set of fields. In this
one-year project, we focus only on selected aspects of CDR data and a restricted domain of
customer information. CDR analysis cannot provide all the required business solutions.
The following areas are not included in our project, although they may seem closely
related to what we are doing:

- Complete analysis of all the CDR records created by all the switches of the telecom
  industry; the data would simply be too large to store and manage (literally billions
  of records per week).
- Geographical analysis of call traffic throughout the entire network of the telecom.
  Complete geographical analysis and spatial representation of call data throughout
  the country would require a longer time span, a greater work force and an
  industry-grade effort.
- Optimization of the data structure of CDR records and solutions to the loss of call
  requests among the switches in the telecom network; analysis of the CDR switches
  would also involve the participation of system engineers and hardware
  programmers.
This chapter explains the project background, overview, aims and objectives, and scope of
the project.
Chapter 2 covers the literature review, which presents previous work in the fields of
telecommunication data structures, data warehousing and the ETL process, customer
segmentation, and churn prediction.

Chapter 4 explains the technological background in detail. The technical details of
business intelligence and reporting, and the algorithms and details of customer
segmentation and churn prediction, are discussed in this chapter.

Chapter 5 presents the system analysis, which discusses basic system operation,
requirement specification and the feasibility assessment of the project.

Chapter 6 explains the system design in detail and discusses the overview of the system
design, system diagrams and the warehouse schema.

Chapter 7 explains the implementation of the system. Topics such as system design
speculation, detailed system operation, and the implementation perspective of business
intelligence and reporting and of data mining and knowledge discovery are discussed.

Chapter 8 explains the output charts of the CDSST dashboard along with their snapshots,
and presents the conclusion and future enhancements of the project.
2. LITERATURE REVIEW
Every business enterprise wants to make better decisions with timely and accurate data.
Accurate, reliable, timely and consistent enterprise database management produces quality
data. Improving data quality benefits everyone in the organization: it helps them avoid
misleading and inaccurate conclusions and make better business decisions that improve
their bottom line. Teradata takes a step-by-step approach to data quality improvement that
focuses on specific business issues to deliver tangible and measurable results [3].
More secure and trusted information can be achieved with data integration. Proper data
management provides business value, not only by giving enterprises the ability to make
faster, better decisions, but also by increasing security and privacy and adhering to data
governance policies. Teradata Integrated Data Management services can help enterprises
secure sensitive data that may be located in multiple data warehouses [3], and help
establish consistent access rights, auditing capabilities and encryption to protect this vital
asset.
In the competitive telecom market, it is absolutely crucial to handle and manage CDRs
efficiently. The common denominator of CDR applications is that they allow carriers to
manage their business down to the smallest detail. Telecom companies initially invest a lot
of resources into building their network and obtaining customers. Unfortunately, these
companies often neglect their currency, the CDR, which measures the efficiencies of scale
and shows whether revenue is being lost due to mismanagement of CDRs [4]. Popular and
useful CDR applications that have been developed so far include:

- Organizing CDR records based on the trunk group and/or billing number
- Giving customers electronic access to their CDRs
- Searching CDR files
- Relaying CDR records to billing systems
- Importing CDRs into spreadsheet or database applications
- Creating individualized call record reports on demand
Richard Harding, an Oracle Database Administrator with LDInet, classifies the focus of
CDR analysis into the following seven fields:
A data warehouse is a collection of a tremendous amount of data. The data stored in a data
warehouse is highly de-normalized, redundant and time variant, and is accumulated
continuously from the infancy of the data collection mechanism. Extract-Transform-Load
(ETL) is the process in data warehousing responsible for pulling data out of the source
systems and placing it into the data warehouse. Data is extracted from an OLTP database,
transformed to match the data warehouse schema, and loaded into the data warehouse
database. Many data warehouses also incorporate data from non-OLTP systems such as
text files, legacy systems and spreadsheets; such data also requires extraction,
transformation and loading. Because ETL is an integral, ongoing and recurring part of a
data warehouse, it needs to be automated, well documented and easily changeable.
The principal reason businesses need to create data warehouses is that their corporate data
assets are fragmented across multiple, disparate application systems, running on different
technical platforms in different physical locations. This situation does not enable good
decision making [7]. Individual business units within an enterprise are designated as
"owners" of operational applications and databases. These "organizational silos"
sometimes don't understand the strategic importance of having well-integrated,
non-redundant corporate data. Consequently, they frequently purchase or build operational
systems that do not integrate well with existing systems in the business.
Due to globalization, mergers and outsourcing trends, the need to integrate operational data
from external organizations has arisen. The sharing of customer and sales data among
business partners can, for example, increase business intelligence for all business partners.
The challenge for Data Warehousing is to be able to quickly consolidate, cleanse and
integrate data from multiple, disparate databases that run on different technical platforms
in different geographical locations.
Although customer segmentation and market segmentation have many similarities, there
are some critical differences in the input variables for their clustering mechanisms [8].
Market segmentation usually aims at acquiring new customers and deals with the first step
of CRM (i.e. customer acquisition) using socio-demographic data, whereas customer
segmentation is used at all steps of CRM using both socio-demographic and transactional
data. “We can imagine that customer cultivation and retention are more important than
customer acquisition, because lack of information on new customers makes it difficult to
select target customers and this will cause inefficient marketing efforts” [9].
Chan (2008) [10] has classified existing customer segmentation methods into
methodology-oriented and application-oriented approaches. Most methodology-driven
studies modify some data clustering technique, such as SOM, or use a combination of two
or more data mining techniques to achieve more accurate clusters or segments (e.g. Lee et
al., 2004 [11]; Jonker et al., 2004 [12]; Huang et al., 2007 [13]; Kim and Ahn, 2008 [14]).
“On the other hand, application-oriented researches must search for the optimum method
for solving segmentation problems in specific applications” (Chan, 2008 [10]). They
usually define and create new variables for the clustering procedure or use different
variables in sequential clustering steps (e.g. Hwang et al., 2004 [9]; Hsieh, 2004 [15];
Chan, 2008 [10]; Cheng and Chen, 2009 [16]).
(Hung, Yen, & Wang, 2006) employ a decision tree for churn prediction in the Taiwan
postpaid mobile market using several groups of features: customer demographics, billing
information, contract/service status, call detail records, and service change log [18].
Features are tested for significance using -test. According to the analysis, the features that
are significant to differentiate between churners and non-churners are: age, tenure, gender,
billing amount, number of overdue payment, in-net call duration, number of changes in
account information. Customers are then segmented with respect to the significant features.
Comparison of decision tree and artificial neural networks (ANN) verifies that ANN
performs better on this particular data set. To make sure that the models work well, the
authors track the models in one year from July 2001 to July 2002. The performances of all
models remain first steady but then drop dramatically after six months from 80-90% to 0-
10% of overall accuracy.
Baesens et al. (2004) used Bayesian network classifiers for identifying the slope of the
customer lifetime value (CLV) for long-life customers, but used simple linear regression
on the historical contributions of each customer to capture their individual life cycles [19].
The slope was then separated into either positive or negative classes to represent increased
or decreased spending, and this variable was used as the dependent variable in their study.
The CLV was also the focus of research by Rosset et al. (2003), who used the
Kaplan-Meier estimator to estimate the value, and Stahl et al. (2003), who link CLV to
company shareholder value [20] [21].
The support vector machine (SVM) is another technology worth investigating for its
suitability for customer churn management; however, a thorough search of the literature
has revealed little investigation into this method. The closest research to CRM identified is
the mining of customer credit scores (Lee et al., 2006). Auh and Johnson (2004) use five
firm-level variables, namely customer satisfaction index scores (SAT), customer loyalty
index scores (LOY), year of data (YEAR), relative quality importance (RQI) and the
average ease of comparing quality differences (EQ), to create a general linear model
[22] [23].
3. THEORETICAL BACKGROUND
The data collected in a telecommunication industry is huge and continually evolving; its
growth is a never-ceasing process. This huge amount of data, when harnessed properly, can
yield precious information that can guide the business process at the grassroots level. To
harness its value, the data generated by customers' interactions with the services of a
telecommunication industry needs to be properly collected and stored. Storing data in a
warehouse makes it efficient to generate reports and extract information at a higher level of
abstraction.
A decision support system for an enterprise should give its executives the ability to make
better business decisions from insights on the warehouse repository, provided through
intuitive dashboards and reports. This is achieved with the business intelligence tool
provided by the MicroStrategy platform, which helps transform data into intuitive
dashboards and reports for greater analytical insight and informed, data-driven decisions.
The MicroStrategy platform makes business intelligence faster, easier and more
user-friendly: it enhances productivity, helps monitor trends and discover anomalies, and
forecasts business opportunities. MicroStrategy provides a high-performance, scalable
enterprise business intelligence platform capable of delivering deep insight with interactive
dashboards and superior analytics to large user populations through web browsers, mobile
devices and Office applications.
Customer segmentation is the process of grouping customers based on their purchases.
Data mining is useful in finding knowledge from huge amounts of data, and its clustering
techniques can be used for customer segmentation so that customers in one group behave
similarly when compared to customers in other groups, based on their transaction details.
Customer segmentation allows a company to target specific groups of customers
effectively and allocate marketing resources to best effect. According to an article by Jill
Griffin for Cisco Systems, traditional segmentation focuses on identifying customer groups
based on demographics and attributes such as attitude and psychological profiles.
Customer segmentation gives a quantifiable way to analyze customer data and distinguish
customers based on their purchase behavior [24]. In this way customers can be grouped
into different categories for which marketers can employ targeted marketing and thus
retain customers. Once the customers are segmented, rules can be generated to describe the
customers in each group based on their purchase behavior. These rules can then be used to
classify new customers with similar purchase characteristics into the appropriate group. In
our case, the purchasing habit of a customer is the calling habit, or the habit of using the
services provided by the telecommunication enterprise.
In the telecommunications industry, the broad definition of churn is the cancellation of a
customer’s telecommunications service [9]. This includes both service-provider-initiated
churn and customer-initiated churn. Customer-initiated churn is more complicated, and the
reasons behind it vary. Our model will only predict customer-initiated churn. If we can
predict in advance the attributes of customers we are going to lose in the near future,
corrective action can be taken to minimize the phenomenon. Initially, when industry
growth is high, churn may not be a problem: the focus at this stage is on customer
acquisition, and for each customer churning out there may be many joining. As the
industry matures, the churn rate rises and there will probably be a time when for every
customer leaving only one new customer joins, or perhaps none at all. When this stage is
reached, it affects the company’s bottom line.
One of the important benefits of building a model for churn prediction is to the marketing
department and call center(s). The list of customers who are likely to churn in the coming
month is made available to these departments so that further action can be taken. Data
analyzed in the back end, or analytical customer relationship management (CRM) system,
flows to the front end, or operational part of CRM, closing the CRM loop. The marketing
department can then plan to give customers discounts, other promotions or events, or
products of sister companies wherever applicable.
To find answers to the questions of who is likely to churn and why, a classification of the
customers is needed. Churn prediction deals, therefore, with the identification of customers
likely to churn in the near future. The basis for this is historical data containing
information about past churners. A comparison is made between these churners and
existing customers: likely churners are those existing customers for which the
classification suggests similarity to prior churners.
4. TECHNICAL BACKGROUND
Extraction, Transformation, and Loading (ETL) processes are responsible for the
operations taking place in the back stage of a data warehouse architecture. Data is extracted
from an OLTP (Online Transactional Processing) database, transformed to match the data
warehouse schema and loaded into the data warehouse database. Many data warehouses
also incorporate data from non-OLTP systems such as text files, legacy systems, and
spreadsheets; such data also requires extraction, transformation, and loading. ETL is not a
one-time event as new data is added to the Data Warehouse periodically – monthly, daily
or hourly.
4.1.1. Extraction
The extraction step is conceptually the simplest task of all, with the goal of identifying the
correct subset of source data that has to be submitted to the ETL workflow for further
processing. As with the rest of the ETL process, extraction also takes place at idle times of
the source system - typically at night. For extraction, there needs to be a logical data map
before the physical data can be transformed. The logical data map describes the
relationship between the extreme starting points and the extreme ending points of the ETL
system usually presented in a table or spreadsheet.
4.1.2. Transformation
Transformation is the main step where the ETL process adds value: it changes the data and
provides guidance on whether the data can be used for its intended purposes. It is
performed in the staging area. Data can have different issues, which can be classified into
schema-level problems, record-level problems and value-level problems. To deal with
such issues, the integration and transformation tasks involve a wide variety of functions,
such as normalizing, denormalizing, reformatting, recalculating, summarizing, merging
data from multiple sources, modifying key structures, adding an element of time,
identifying default values, supplying decision commands to choose between multiple
sources, and so forth.
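As an illustration of a few of these transformation functions, here is a minimal Python sketch; the project itself performed its ETL with Korn Shell scripts, and the column names and sample values below are hypothetical. It reformats text timestamps, recalculates a call duration, and supplies default values:

```python
import pandas as pd

# Hypothetical raw CDR extract; field names are illustrative, not the
# project's actual staging layout.
raw = pd.DataFrame({
    "caller":   ["9841000001", "9841000002", None],
    "start_ts": ["2013-01-05 08:12:44", "2013-01-05 08:13:02", "2013-01-05 09:40:10"],
    "end_ts":   ["2013-01-05 08:15:44", "2013-01-05 08:13:32", "2013-01-05 09:41:10"],
    "cost":     ["1.50", "0.25", None],
})

# Reformatting: parse text timestamps into proper datetime values.
raw["start_ts"] = pd.to_datetime(raw["start_ts"])
raw["end_ts"] = pd.to_datetime(raw["end_ts"])

# Recalculating: derive the call duration in seconds from start and end times.
raw["duration_sec"] = (raw["end_ts"] - raw["start_ts"]).dt.total_seconds()

# Identifying default values: value-level cleaning of missing fields.
raw["caller"] = raw["caller"].fillna("UNKNOWN")
raw["cost"] = pd.to_numeric(raw["cost"]).fillna(0.0)

print(raw)
```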
17
4.1.3. Loading
The end of the source records' journey through the ETL workflow comes with their loading
into the appropriate table. Some data warehouses may overwrite existing information with
cumulative information; frequently, updating extracted data is done on a daily, weekly, or
monthly basis. Other data warehouses (or even other parts of the same data warehouse)
may add new data in a historical form at regular intervals, for example hourly. Loading can
be classified into two types: loading dimensions and loading facts. In the process of
loading dimensions, the data loading module consists of all the steps required to administer
slowly changing dimensions (SCD) and write the dimension to disk as a physical table in
the proper dimensional format with correct primary keys, correct natural keys, and final
descriptive attributes. The most important thing about loading fact tables is that the
dimension tables must be loaded first, and then, according to the specification, the fact
tables. The fact table is often located at the center of a star schema, surrounded by
dimension tables. It has two types of columns: those containing facts and others containing
foreign keys to dimension tables.
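A minimal sketch of this load ordering, using SQLite in place of the Teradata warehouse and hypothetical table names, loads the dimension rows first so that every fact row can resolve its surrogate key:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the warehouse database
cur = con.cursor()

# Dimension table: surrogate key plus the natural key (phone number).
cur.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, phone TEXT UNIQUE)")
# Fact table: foreign key to the dimension plus the fact columns.
cur.execute("CREATE TABLE fact_call (customer_key INTEGER, duration_sec REAL, cost REAL)")

calls = [("9841000001", 180.0, 1.5), ("9841000002", 30.0, 0.25)]

# Step 1: load dimensions first.
for phone, _, _ in calls:
    cur.execute("INSERT OR IGNORE INTO dim_customer (phone) VALUES (?)", (phone,))

# Step 2: load facts, looking up the surrogate key for each record.
for phone, duration, cost in calls:
    key = cur.execute("SELECT customer_key FROM dim_customer WHERE phone = ?", (phone,)).fetchone()[0]
    cur.execute("INSERT INTO fact_call VALUES (?, ?, ?)", (key, duration, cost))

con.commit()
```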
The data source for MicroStrategy was our CDSST warehouse created with Teradata.
After configuring MicroStrategy to connect with the warehouse, the next step was to
proceed with MicroStrategy reporting and dashboard creation. Tables were then selected
from the warehouse to be used for reporting in the MicroStrategy project, which is the first
task in the project creation workflow as shown in the figure below.
The second task in the project creation workflow is to create the facts for the
MicroStrategy project. Facts are created based on our logical data model and mapped to
columns in the data warehouse schema. The facts in the CDSST project are cost and
duration. These facts are then used to define metrics. As such, the fact schema object
serves as a bridge between fact values stored in the data warehouse and the metrics users
want to see on MicroStrategy reports. Facts point to the appropriate physical columns in
the data warehouse, while metrics perform aggregations on those columns.
Facts are one of the essential elements within the business data model. They relate numeric
data values from the data warehouse to the MicroStrategy reporting environment and
generally represent the answers to the business questions on which users want to report. In
the MicroStrategy environment, facts are schema objects created by and shared between
MicroStrategy users. The facts you create in MicroStrategy allow users to access data
stored in the data warehouse and form the basis for metrics, which are used in the majority
of analyses and reports that users can create with MicroStrategy. Facts and attributes are
necessary to define projects: in a MicroStrategy project, facts are numeric data and
attributes are contextual data for the facts. For example, in the case of CDSST, you may
want to analyze the total call duration in a certain area during January. Here the call
duration represents the fact, while the area and the month represent attributes. Users can
then use these facts and attributes as building blocks for metrics and reports.
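To make the fact/attribute/metric distinction concrete, the sketch below (SQLite standing in for Teradata; the table and column names are hypothetical) expresses a "total call duration by area and month" metric as the kind of aggregate query a reporting tool issues over a fact column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified fact table: duration_sec is the fact; area and month are the
# attribute columns that give it context.
cur.execute("CREATE TABLE fact_call (area TEXT, month TEXT, duration_sec REAL)")
cur.executemany("INSERT INTO fact_call VALUES (?, ?, ?)", [
    ("Lalitpur", "Jan", 180.0), ("Lalitpur", "Jan", 60.0), ("Kathmandu", "Feb", 240.0),
])

# The metric is an aggregation over the fact, grouped by attributes.
for row in cur.execute(
    "SELECT area, month, SUM(duration_sec) FROM fact_call GROUP BY area, month"
):
    print(row)  # e.g. ('Kathmandu', 'Feb', 240.0), ('Lalitpur', 'Jan', 240.0)
```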
Attributes provide the business model with a context in which to report on and analyze
facts. While knowing your company’s total sales is useful, knowing where and when the
sales took place provides the kind of analytical depth users require on a daily basis.
Metrics are MicroStrategy objects that represent business measures and key performance
indicators. They are calculations to be performed on data stored in the database and are
similar to formulas in spreadsheet software. Questions such as “What were the sales for the
eastern region during the fourth quarter?”, “Are inventory levels being consistently
replenished at the beginning of each week?”, or “How many employees are currently
working for our company?” can easily be answered by creating metrics. A metric
definition must contain a formula, which determines the data to be used and the
calculations to be performed on the data. A metric is categorized as one of the following
types based on the functions used in its formula:
Hierarchies are groupings of attributes that can be displayed, either ordered or unordered,
to reflect the relationships that exist between the attributes in a project. Hierarchies are
used to group related attributes in practical business areas. For CDSST, there is a Date
hierarchy that consists of Day, Day of Week, Week, Month, Quarter, Semi-year and Year
attributes; the other hierarchies in CDSST include the Time hierarchy and the Location
hierarchy. There are two different types of hierarchies in the MicroStrategy environment:
the system hierarchy and the user hierarchy. The system hierarchy is automatically created
when you create a project and is maintained by the relationships that exist among the
project's schema objects. A user hierarchy is a hierarchy you create specifically for your
report designers. The mentioned hierarchies of Date, Time and Location are all user
hierarchies; these are used to define the browse and drill relationships between attributes.
Reports that display your business data are the focus and goal of business intelligence.
Reports allow you to gather business insight through data analysis, and the results
displayed in any MicroStrategy report are often a starting point for further investigation.
Each row of a report represents a business concept, i.e. an attribute, while the columns
represent financial calculations that are important to the business, i.e. metrics. Reports can
be viewed in MicroStrategy from different perspectives: grid, graph, grid/graph and SQL
view.
A filter specifies the conditions that data must meet to be included in the report results.
Using a filter on a report narrows the data to only the information relevant to answering
your business question, since a report otherwise queries the database against all the data
stored in the data warehouse. Filters are helpful for reducing large quantities of data and
displaying only subsets of that data, so reports show users what they really need to see.
The different types of prompts allow you to create a prompt for nearly every part of a
report. Prompts can be used in many objects, including reports, filters, metrics and custom
groups, but all prompts require user interaction when the report is executed. Filter
definition prompts allow the user to define the filtering criteria based on the following
objects: attributes in a hierarchy, attribute forms, attribute element lists and metrics.
Intelligent Cubes are in-memory copies of data, also known as multi-dimensional cubes
(sets of data), that reside in MicroStrategy Intelligence Server memory. The data in
Intelligent Cubes can be accessed by different MicroStrategy objects such as reports and
documents (dashboards). They allow us to create sets of data that support multiple reports
answering variations of similar business queries. An Intelligent Cube thus represents an
OLAP cube whose data is cached in the memory of the Intelligence Server, so that every
time a report or a document is run it can be fetched from memory rather than pulling data
from the warehouse by executing a query. Implementing an Intelligent Cube makes the
processing of a document much faster compared to other methods. For the CDSST system,
an Intelligent Cube was created consisting of dimensions like date, time, service and
location, with metrics like count of service usage, sum of cost, and average call duration.
Creating the cube also eases the drilling capability of a report. The cube was then used to
create a dataset (report) which served as the main dataset for creating a document.
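Conceptually, the cube is a pre-aggregated, in-memory result keyed by the dimension attributes. The rough Python sketch below illustrates the idea only; MicroStrategy builds and caches its cubes internally, and the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical fact rows at the grain of one service usage per record.
calls = pd.DataFrame({
    "date":     ["2013-01-05", "2013-01-05", "2013-01-06"],
    "service":  ["voice", "sms", "voice"],
    "location": ["Lalitpur", "Lalitpur", "Kathmandu"],
    "cost":     [1.5, 0.1, 2.0],
    "duration": [180.0, 0.0, 240.0],
})

# Aggregate once, keyed by the cube's dimensions; reports then read this
# in-memory result instead of re-querying the warehouse.
cube = calls.groupby(["date", "service", "location"]).agg(
    usage_count=("cost", "count"),      # count of service usage
    total_cost=("cost", "sum"),         # sum of cost
    avg_duration=("duration", "mean"),  # average call duration
)
print(cube)
```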
Other document components that do not originate from the dataset, such as static text used
for a title or heading, page numbers, and images, are added by the document's designer and
are stored in the document's definition. A dashboard is a special type of document: it is
commonly only one page long, is intended to be viewed online, and usually provides
interactive features that let analysts change how they view the dashboard's data, as well as
what data they are viewing. Multiple display modes are provided for a document: design,
express, interactive, editable and flash mode. While selectors like dropdowns, radio
buttons and checkboxes, together with filters, make the dashboard more interactive,
drilling capability is also provided in some modes. Drilling is one of the most powerful
data analysis functionalities in any reporting environment: it lets users explore data beyond
what is immediately visible in a grid or a graph. One can drill through a grid to analyze
data that is closely related to the original data, or expose an entirely different object on a
grid to see what the data looks like within a different context. The dashboard screenshots
for CDSST are included in Appendix A1.
Recency (R), Frequency (F) and Monetary (M) are the important attributes that determine
the behavior of a customer, and the RFM method is very effective for customer
segmentation [Jing Wu and Zheng Lin (2005)]. Recency indicates the time interval
between the present and the customer's previous transaction date. Frequency indicates the
number of transactions the customer has made in a particular interval of time. Monetary
indicates the total value of the customer's transaction amounts. It has been shown that the
values of R, F and M decide the characteristics of customer behavior [25].
Customers can also be segmented using other artificial intelligence (AI) methods such as
the Self-Organizing Map (SOM), Genetic Algorithm (GA), Particle Swarm Optimization
(PSO) and Artificial Bee Colony (ABC). GA is an evolutionary computational technique
based on the process of natural evolution, using operations like mutation and crossover. Its
main disadvantages are that it is time consuming, it is difficult to make the solution
converge to an optimal solution, and it has stability issues. SOM is an unsupervised neural
network learning algorithm; a neural network has interconnected processing elements and
learns input/output relationships, but it is a trial-and-error method and suffers from long
training times [16]. PSO is an evolutionary computation technique developed by Kennedy
and Eberhart (1995). Like GA, PSO has a fitness evaluation function that takes the
particle's position and assigns it a fitness value; the objective is to optimize the fitness
function [Chui-Yu Chiu et al. (2009)]. It has the same demerits as GA. ABC is based on
the behavior of honey bees collecting food; a greedy selection mechanism is employed to
choose the candidate solution, which is tedious [26].
RFM-based clustering can be compared with the k-means, single link and complete link
clustering algorithms. In k-means, the output clusters vary according to the initial cluster
centers. In single link and complete link, the number of clusters is reduced only by one as
each iteration proceeds. Thus, the RFM clustering algorithm is better to use for customer
segmentation.
Algorithm

To implement the segmentation, the values of R, F and M must first be normalized onto a
common 1-5 scale. For each of R, F and M:

- Sort the customer records by the attribute value.
- Divide the sorted records into five equal parts (quintiles).
- Assign the numbers 1, 2, 3, 4 and 5 to the first, second, third, fourth and fifth parts
  of the records respectively.
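A minimal Python sketch of this quintile scoring follows; the column names, sample values, and the convention that a score of 5 is best (so recency is reversed before scoring) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical per-customer aggregates derived from CDR data: days since the
# last call (recency), call count (frequency), total spend (monetary).
customers = pd.DataFrame({
    "customer": list("ABCDEFGHIJ"),
    "recency_days": [2, 40, 7, 90, 15, 1, 60, 25, 5, 33],
    "frequency": [120, 8, 60, 2, 30, 150, 5, 22, 80, 14],
    "monetary": [900.0, 50.0, 420.0, 10.0, 200.0, 1100.0, 35.0, 160.0, 600.0, 90.0],
})

# Divide each attribute into five equal parts and assign scores 1..5.
# Recency is negated so that more recent customers fall in the top quintile.
customers["R"] = pd.qcut(-customers["recency_days"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
customers["F"] = pd.qcut(customers["frequency"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
customers["M"] = pd.qcut(customers["monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

print(customers[["customer", "R", "F", "M"]])
```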
The mean squared error (MSE) of the clustering is given by

$$\mathrm{MSE} = \sum_{i=1}^{n} \; \sum_{j=1,\; x_i \in c_j}^{k} \left| x_i - m_j \right|^2$$
The intra-cluster distance is the Manhattan distance between data points within a cluster. It
is given by

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} d(x_i, x_j)$$
The inter-cluster distance is the Manhattan distance between the centers of the clusters. It
is given by

$$\sum_{i=1}^{k} \sum_{j=i+1}^{k} d(m_i, m_j)$$
Here, $n$ indicates the number of objects, $k$ the number of clusters, and $m_j$ the center
of the $j$-th cluster. Hence, the objectives are to:

- Minimize the MSE
- Minimize the intra-cluster distance
- Maximize the inter-cluster distance
- Minimize the ratio of intra-cluster to inter-cluster distance
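The small numpy sketch below computes these measures for a toy clustering. It is illustrative only, and it reads the intra-cluster sum as running over pairs of points that share a cluster, following the prose above:

```python
import numpy as np

def clustering_quality(points, labels, centers):
    # points: (n, d) array; labels: cluster index per point; centers: (k, d) array.
    n, k = len(points), len(centers)
    # MSE: squared distance of each point to its own cluster center.
    mse = sum(np.sum((points[i] - centers[labels[i]]) ** 2) for i in range(n))
    # Intra-cluster: Manhattan distance between pairs of points in the same cluster.
    intra = sum(
        np.sum(np.abs(points[i] - points[j]))
        for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]
    )
    # Inter-cluster: Manhattan distance between pairs of cluster centers.
    inter = sum(
        np.sum(np.abs(centers[i] - centers[j]))
        for i in range(k) for j in range(i + 1, k)
    )
    return mse, intra, inter, intra / inter

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels = [0, 0, 1, 1]
centers = np.array([[1.1, 0.9], [5.05, 4.95]])
print(clustering_quality(points, labels, centers))
```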
There are different techniques available to analyze churn. Some of the common models
used are regression analysis, decision trees, logistic regression and neural networks. Of
these, decision trees are a good choice since they provide rules that business users can
understand. Other models such as neural networks and logistic regression reduce the
understandability of the phenomenon, even though they may be slightly more accurate
than decision trees [9].
Decision trees are the most commonly used tool for prediction and classification of future
events. The development of such trees is done in two major steps: building and pruning.
During the first phase the data set is partitioned recursively until most of the records in
each partition contain identical values. The second phase then removes branches that
contain noisy data (those with the largest estimated error rate) [11].
Algorithm:

The tree is grown by recursively choosing, at each node $t$, the split $s$ that maximizes
the decrease in impurity.

Gini Criterion:

$$i(t) = 1 - \sum_{j} p^{2}(j \mid t), \qquad \Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$$

where $p_L$ and $p_R$ are the probabilities of sending a case to the left child node $t_L$
and to the right child node $t_R$ respectively. They are estimated as

$$p_L = \frac{p(t_L)}{p(t)} \quad \text{and} \quad p_R = \frac{p(t_R)}{p(t)}$$
Stopping rules control whether the tree-growing process should be stopped. The following
stopping rules are used; a sketch of a classifier configured with these controls follows the
list:

- If a node becomes pure, that is, all cases in a node have identical values of the
  dependent variable, the node will not be split.
- If all cases in a node have identical values for each predictor, the node will not be
  split.
- If the current tree depth reaches the user-specified maximum tree depth limit value,
  the tree growing process will stop.
- If the size of a node is less than the user-specified minimum node size value, the
  node will not be split.
- If the split of a node results in a child node whose node size is less than the
  user-specified minimum child node size value, the node will not be split.
- If for the best split s* of node t, the improvement ΔI(s*,t) = p(t)·Δi(s*,t) is smaller
  than the user-specified minimum improvement, the node will not be split.
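These stopping controls map directly onto the hyperparameters of common decision-tree libraries. Here is a minimal scikit-learn sketch (not the tool used in the project; the feature names, sample values and thresholds are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy churn data with features of the kind cited in the literature review:
# tenure (months), monthly billing amount, number of overdue payments.
X = np.array([[24, 500, 0], [3, 120, 2], [36, 800, 0], [2, 90, 3],
              [12, 300, 1], [48, 950, 0], [1, 60, 4], [18, 400, 1]])
y = np.array([0, 1, 0, 1, 0, 0, 1, 0])  # 1 = churned

clf = DecisionTreeClassifier(
    criterion="gini",            # the Gini criterion above
    max_depth=4,                 # maximum tree depth limit
    min_samples_split=2,         # minimum node size to attempt a split
    min_samples_leaf=1,          # minimum child node size
    min_impurity_decrease=0.01,  # minimum improvement for the best split
)
clf.fit(X, y)

# The learned rules are readable by business users.
print(export_text(clf, feature_names=["tenure", "billing", "overdue"]))
```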
5. SYSTEM ANALYSIS
5.1.1. Stage 1
The system acquires data generated by various switches throughout the telecommunication
network. The data is acquired in the form of Call Data Records (CDR). A CDR is a log file
comprising various pieces of information about each call; it holds detailed information
about the call, segmented into smaller flat files. The initial stage involves acquiring the
CDR data from the respective switches.
5.1.2. Stage 2
Once the necessary and relevant data are acquired from the CDR switches, the data are
transformed into a pre-determined data model suitable for storage in the warehouse. The
conversion from flat files to structured form is achieved through the Extract, Transform
and Load (ETL) process, performed through Korn Shell scripting. After the conversion, the
data are loaded into the data warehouse for storage. Data marts that store the filtered CDR
data for the warehouse are created using Teradata.
5.1.3. Stage 3
The data alignment in the data marts enables the system to generate various inferences
regarding the past and present of the telecommunication industry. In this stage the data
from the staging tables are segregated and taken through a series of operations to facilitate
data-driven discovery. The data input is mined rigorously to generate future values for
various measures of telecommunication activity.
5.1.4. Stage 4
With the records of the company's past and present activity from the data warehouse, and
with the future values of various measures, the result set is presented on screen through an
interactive UI. The GUI of the system is designed to suit the technical knowledge of the
non-technical class of people present in the industry. The process of report generation
largely involves selecting different measures of user activity. The system is built so that
the back-end coding required of the end user is reduced to nil.
Interface requirements:
Business Requirements:
Regulatory/Compliance Requirements:
Security Requirements:
1) The manager can view reports but cannot load dimensions and facts.
2) The administrator can create predefined dimensions initially but cannot define his
own dimensions.
The non-functional requirements of our system are encompassed in the following points:

- Performance: The system shall have quick response, processing and reporting
  times. Loading of CDR data shall take no more than 1 minute per 100 thousand
  records. Reports should be generated in real time.
- Capacity and Scalability: The system shall be able to store the huge amount of data
  generated by different CDR switches, on the order of gigabytes per day.
- Availability: The system shall be available during office hours for reporting.
  Maintenance and other operations which consume high processing power shall be
  scheduled during off-hours.
- Recovery: The system shall perform regular backups of data. Data lost during
  crashes shall be easily recoverable.
Technical feasibility is a measure of the practicality of a specific technical solution and of
the availability of technical resources and expertise. The CDSST system is technically
practical, as it can actually be realized in a telecommunication enterprise. The technology
used in the system, such as Teradata as the database platform for the data warehouse, is
robust and efficient. Similarly, the MicroStrategy tool used for reporting is an enterprise
application that provides highly customizable reporting, and the Python programming
language used for scripting and logical tasks is efficient and robust. Thus, the technology
used in this system is mature and we currently possess the necessary technology.
The technical expertise necessary for the system can be provided by us, and the enterprise
implementing the system can contact the technical experts on a reasonable schedule in
case any technical problem arises. Thus, CDSST uses technology that is currently available
and is technically feasible.
Operational feasibility is a measure of how well a specific solution will work in the
organization, and of how people feel about the system/project. It also analyzes how a
proposed process will work and be implemented, and how resistance to and acceptance of
change will be handled.
After the technical and economic feasibility studies comes the operational analysis.
CDSST would allow the executives and other employees of an enterprise to view and
analyze the basic trends in the services they provide based upon various criteria. End users
can also use these inferences to decide whether to choose the service provider. In order to
determine whether CDSST would work, the operational feasibility study would follow a
path based on six elements, including:

- Adapt & Review: How much time does the organization need to adapt to the new
  system? How will it be reviewed and monitored? What will happen if, through a
  monitoring process, additional changes must be made? CDSST will be operated by
  an administrator for source data management, which includes loading new records
  for dimensions and facts. The executives will then view the inferences to take
  proper and correct decisions. It takes some time to incorporate CDSST into the
  decision-making process completely: in the early stages, involvement will be low
  as questions of trust in the system arise; gradually, involvement will rise because of
  the positive effects shown by the implementation of the system.
6. DETAILED DESIGN
6.1. Overview
The system is divided into five different blocks. The sections of the system block diagram,
grouped under braces, are complete to suit the purpose and requirements of the system.
The functionality of the different components of the system is described block by block
below:
The data source block consists of the origins of all the data associated with our system. It
consists of both raw data and information available in different formats. The main source
of the data is the Call Detail Record information; besides CDRs, the sources also include
flat files, databases, etc. Data and information available in this block do not have a
common format.
This block is involved with extracting, transforming and loading raw data of disparate
formats into the data warehouse. It contains three sub-blocks, each with its individual task
as described by its name. The first sub-block is the extract block, responsible for pulling
data out of the source systems. The second is the transform block, responsible for applying
a series of rules or functions to the extracted data to derive the data for loading into the end
target. The third is the load block, responsible for loading the data into the end target, the
data warehouse.
The data warehouse block is the centralized repository that stores data from multiple
information sources. After the ETL process, all the data are stored in the warehouse in a
common, multidimensional data model for efficient querying and analysis. The data
warehouse contains a wide variety of data that presents the history and scenario of the
business. The information acquired from the warehouse is useful in reporting, data analysis
and data mining.
The dashboard is a user interface that organizes and presents information in a way that is
easy to read. It summarizes all the business information in the form of reports consisting of
various figures and graphs. The dashboard plays an important role as it provides strategic,
analytic, operational and informational support to the users in a simple but sophisticated
manner.
The above diagram shows the most basic components involved in the system. The CDR
switch collects information from all the call transactions occurring throughout the area and
stores this information in the telecom database. The telecom database is a repository of all
the data regarding the transactions and the telecom industry, and is used as the raw data
source from which the application constructs the warehouse. ETL tools extract the data
from the various raw data sources and load them into the staging tables of the warehouse.
The staging tables transform the unstructured data into a structured, organized format,
which is then loaded into the target tables of the warehouse.
The above diagram shows the sequence in which the system administrator operates the
application. The system uses a secured login process to distinguish the administrator from
other normal users. Once logged in, the system administrator is granted the privilege to
modify the contents of the warehouse. After receiving a command from the administrator,
the application selects the raw data sources and then separates dimensions from facts.
Once this separation is performed, data are loaded into the respective dimension and fact
tables.
The above diagram depicts the sequence of operations during report generation by the
manager. This process requires no authentication. Once the application is running, the
manager can choose to generate a report or to obtain aid in decision support. Based on the
selection, the appropriate query is performed on the warehouse and the result is presented
in a well-formed report.
The above diagram shows the various states involved in our application. Once the
application initiates, the user is prompted for authorization. The logon process is essential
if the administrator wishes to work with the warehouse. After successful logon, the
administrator can select the data sources. The application decides whether there are new
dimension data or not: if no new dimension data are detected, facts are directly loaded into
the warehouse; otherwise, dimension tables are filled before loading facts. Once
dimensions and facts are loaded, the warehouse is updated.
The above activity diagram depicts the activities involving three major components: the
system administrator, the telecom manager and the system/application. The application can
be initiated by either the system admin or the manager by giving the respective command.
The application responds in the way that suits and fulfils the needs of the respective user
and performs the designated activity. The activity involving the system admin is updating
the warehouse, whereas the activity involving the telecom manager is generating reports.
The above diagram shows the use cases for our system. Our system involves two actors: the system administrator and the telecom manager. The system administrator is responsible for working with the data warehouse, whereas the telecom manager is simply a user who can generate and view reports and predictions. Thus, based on their responsibilities, each user is provided with a different set of operational tools.
7. IMPLEMENTATION
This section of the report focuses on a more general and essential description of the system and how the different system components are put together to achieve the milestones described earlier in the report. The schema diagrams for each of the following components illustrate a logical connection between the various entities (tables) within each component. Since the warehouse in our project deliberately preserves data redundancy to promote data availability, the actual table implementations in the project are devoid of such key-based connections. The schemas for the dimensions and facts explained below are attached in Appendix A1 of the report.
The time dimension is the most basic dimension, comprising second, minute and hour dimension tables. The time dimension in the data marts enables us to manage records 'online' and process real-time data to deliver real-valued services. It represents the time stamp associated with each call: multiple calls are recorded by the CDR switch within a matter of seconds and are loaded into the warehouse. The information held by the time dimension includes the start and end of events occurring during calls, so call-related inferences can be deduced on a second, minute and hour basis. We have divided a day into 8 smaller intervals. The records generated by the switches can be analyzed across these 8 intervals to compute the peak hours of calling and the least active durations of the telecommunication service. In doing so, we roll up across the time dimension tables to aggregate results and generate reports to suit the purpose. The methodology is simple and straightforward: the snowflake schema facilitates such roll-up operations, and aggregation of the facts generates useful results. The entries of the time dimension table are rather static, or technically, Slowly Changing Dimensions (SCDs). These entries can be loaded once during the initial phase of the warehouse ETL process and work without fault through the rest of the analysis and reporting session.
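The report does not spell out the interval boundaries; assuming eight equal three-hour intervals (consistent with the time-of-day groupings described later in the reporting section), the roll-up used to locate peak hours can be sketched as:

    from collections import Counter

    def tod_interval(hour):
        # Map an hour of day (0-23) onto one of the 8 assumed three-hour
        # intervals: interval 0 covers 00:00-02:59, interval 7 covers 21:00-23:59.
        if not 0 <= hour <= 23:
            raise ValueError("hour must be in 0-23")
        return hour // 3

    def peak_interval(call_hours):
        # Roll up: count calls per interval and return the busiest one.
        counts = Counter(tod_interval(h) for h in call_hours)
        return counts.most_common(1)[0]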
The date dimension is used in conjunction with the time dimension to specify the exact occurrence of a call. Once the time dimensions are correctly loaded into the data warehouse, the date dimension allows us to visualize the real-time data in the form of a calendar. The main components of the date dimension are day, month and year. Call detail records on a daily basis are recorded at the lowest level of the table hierarchy; subsequent inferences on a weekly, quarterly, semi-annual or annual basis are simply obtained by roll-up. A mere aggregation of the facts in the lower-level tables gives a clear and valid result for the facts of the upper-level table in the hierarchy. Like the time dimension, the date dimension stores static data regarding the dates for a specified period of time. These entries, too, can be loaded in a one-time load operation at the beginning of the analysis and reporting session.
The date and time dimensions allowed us to successfully visualize the occurrence of an event on a calendar. Since we were concerned with real-time data, we required some real-world reference that could vouch for the occurrence of a real-time event. This correspondence was maintained by the development and use of the location dimension. The location dimension is used to specify the location of the calling parties. Location is specified through area codes. The higher-level tables contain the zone- and region-specific data corresponding to the individual district, obtained by cross-referencing area codes with district information. With this dimension, data related to the demographic distribution of the consumers and the visualization of spatio-temporal data at various levels of abstraction can be achieved. Again, the whole operation is merely a roll-up procedure and subsequent aggregation.
A real-time event with a real-world reference provided a fundamental basis for all the analysis of events. The customer dimension serves as the real-world variable on which all our analysis can be based and the reports can be generated; the presence of customer-related details adds to the relevancy and usefulness of those reports. Although the location table partially served this purpose, the context demanded a dedicated customer dimension. The customer dimension is logically connected with various other fields in the way shown in the schema above. It is central to all analytics on the data and is related to the other dimensions in a manner that suits the purpose of the analysis. The simultaneous correlation of the customer dimension with the other dimensions assists us in generating facts associated with the spatio-temporal events recorded by the CDR switches.
The services dimension provides functionality that adds to various other system functionalities and services. It helps us analyze the various aspects of a user's interactions with a telecommunication industry and its services, serving as the socket through which the association of a customer with the communication services provided by the telecommunication industry can be logically analyzed and visualized. Since the components of the services dimension are heavily affected by the customer domain, the telecommunication company's policy and market demand, we acknowledge that our dimensions are not exhaustive. However, the schema above shows a basic framework of minimal services that provides a self-sufficient basis for analyzing the interaction of a customer with the telecommunication industry.
The fact table is the final table generated after the ETL process over the various sources and dimension tables. Fact tables are loaded into the data warehouse and are the source on which mining operations are performed to acquire useful information. The fact table comprises the primary keys of the various base dimension tables along with the measurement facts. The structure of the fact table presented above corresponds to the record-level details entering the system from a given CDR switch. Initial report-generation approaches centered on these entries, and these initial reports are simply the results of SQL queries. In the second part of the project, these results shall be projected into a presentable Graphical User Interface (GUI).
The fact table holds two main measurement metrics: duration and cost. The duration field gives the total time span of a call, whereas the cost field gives the expenses incurred during the call. All the other fields are foreign keys to the dimension tables, which allow the information held by the fact table to be analyzed against the various dimensional fields.
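A minimal sketch of how such a fact row is assembled is given below, with plain dicts standing in for the dimension tables (mapping natural keys to surrogate ids); all field names are illustrative assumptions, not the warehouse's actual columns.

    def build_fact_row(staging_row, customer_dm, date_dm, time_dm, location_dm):
        # Foreign keys come from dimension lookups; duration and cost
        # are the two measures carried over from the staging record.
        return {
            'customer_key': customer_dm[staging_row['calling_number']],
            'date_key':     date_dm[staging_row['session_origin_date']],
            'time_key':     time_dm[staging_row['session_origin_time']],
            'location_key': location_dm[staging_row['area_code']],
            'duration':     staging_row['duration'],
            'cost':         staging_row['cost'],
        }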
The part of the project completed by the date of the mid-term report submission includes a primordial model of the data warehouse with a base-level fact table. The base-level fact table consists of the essential facts, cost and duration, which have been described in detail above. This section of the report describes in detail how a fact table is constructed by our system using the dimension tables and how reports are generated using the MicroStrategy tools. The overall summary of the system process is illustrated in the chart below.
The data that needs to be imported into our system primarily consists of records and logs of different call activities. To facilitate the analysis and to cover a larger domain, we have combined the data contents of multiple switches, all anonymized and logically related, into a single flat file. Although the information domain of the logs and records obtained from the switches is huge, our analysis is focused and restricted to a few objectives, so we need to transform these high-dimensional data into lower dimensions. The lower dimensions are identified and used in such a way that the changes relevant to our objectives are well preserved and the analysis of the sampled data remains valid. The objectives of our analysis are as follows:
Achieve a new enterprise-level data model to handle huge amounts of data in the Teradata database.
Provide a new base for OLAP analysis through the MicroStrategy reporting tool (analyze customers' behavior and improve marketing campaigns and pricing strategies).
Make predictions to support the formulation of the company's business policies by mining the CDR and customer data.
As mentioned above, our analysis required only a fraction of what was obtained from the CDR switches. Hence we needed temporary tables to hold only the required data before processing. The required fields were pulled out of the CDR log files and inserted into temporary tables called staging tables. The structure of these tables is very similar to that of the target tables; the only difference is that the staging tables use identifiers while the target tables use keys. All the target tables are populated from their individual staging tables. The key used in each target table is generated by a unique-identifier generator, so the keys are auto-incremented for each entry loaded from the staging table. After each staging table is successfully loaded into its target table, its contents are cleared for new data to be loaded.
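The staging-to-target step can be sketched as below, with Python lists standing in for the two tables; the key counter plays the role of the unique-identifier generator described above.

    def load_staging_into_target(staging, target, next_key):
        # Assign an auto-incremented surrogate key to each staging row,
        # append it to the target table, then clear the staging table.
        for row in staging:
            target.append(dict(row, key=next_key))
            next_key += 1
        staging.clear()           # staging is emptied for the next batch
        return next_key           # carry the counter into the next load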
The various fields in the staging table each have their own significance in the collective analysis of the data. A sketch of how the flag fields are derived follows the list.
scheme_id: identifies the particular type of service used by a customer, such as a voice call or an SMS service.
session_origin_date/time: mark the beginning of any service accessed by the consumer, with a precise date and time stamp associated with every service access instance.
session_end_date/time: mark the end of any service accessed by the consumer, with a precise date and time stamp associated with every service termination instance.
caller_prefix/called_prefix: help us identify the type of service and the exact scheme being used by a particular customer, such as a prepaid-to-prepaid voice-call service.
roaming_flag: indicates whether the user is using the roaming services provided by the communication enterprise.
peak_flag: indicates whether the user is using the service in the peak hours (6:00 to 22:00) defined by the telecommunications enterprise.
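A minimal sketch of deriving peak_flag from a staging record, assuming session_origin_time is stored as an 'HH:MM:SS' string (the actual storage format is not given in the report):

    PEAK_START, PEAK_END = 6, 22      # peak hours 6:00-22:00, as stated above

    def set_peak_flag(record):
        # Flag the record if its origin hour falls inside the peak window.
        hour = int(record['session_origin_time'].split(':')[0])
        record['peak_flag'] = 1 if PEAK_START <= hour < PEAK_END else 0
        return record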
The customer dimension table consists of all the records and details needed to tie a telecommunication-enterprise-assigned number to a real-world person. These details include first name, last name, address, date of birth, legal identification and gender. The address details in the customer dimension table correspond to the permanent address of the customer; the temporary address is part of the location table and is accessed through it. When a data entry from the staging table needs to be analyzed, the calling_number and called_number fields of the staging table are cross-referenced with the customer_dm table to aggregate the details of the respective number in terms of geo-spatial variables. Upon a 'hit' on the table, the corresponding id, which is the telecommunication-enterprise-assigned identification number, is extracted and inserted into the fact table. The types of inferences that can be generated from this reference are:
Average calls per period for a particular customer and the details of the expenses.
Average calls per locality, on the basis of the permanent address of the customer.
Graphical representation of customer behavior with respect to a particular telecommunication service in a given time interval.
Customer-to-customer communication patterns and link states between any two random customers on the basis of telecommunication activity.
The date dimension consists of date details from the year level down to the day level for a user-specified date span; the initial analysis data set consists of data from 2000 to 2030 AD. The session_origin/end_date from the staging table is cross-referenced against the date dimension to aggregate other details out of the date dimension. Upon a hit, the identification number of the corresponding entry is added to the base-level fact table. The inferences that can be generated from the date dimension include the weekly, quarterly, semi-annual and annual roll-ups described earlier.
The time dimension consists of time stamp details from the seconds level up to the hour level and has been arranged hierarchically; the lowest-level table consists of a day partitioned into 86,400 seconds. The service access/termination time stamp obtained from the staging table is cross-referenced against the entries of the dimension table to extract the id of the matching record, and this id is incorporated into the base-level fact table. The inferences that can be generated from the time dimension include the peak calling hours and least active durations discussed earlier.
Analysis of data from individual CDR switches constrained our data domain to a narrow one, so we had to access records from multiple switches to obtain presentable results. We built a location dimension consisting of data details down to the district level. Since tower-wise analysis was not our main priority, we included the tower_max_number field only as a fail-safe. At higher levels, the location dimension consists of zones and regions. The staging table contains the field area_code, which identifies the area from which the calling_number is accessing the telecommunication service. The area_code value is cross-checked against the entries of the location dimension table to find a record hit. As in the previous dimension-table lookup processes, upon a hit the corresponding id is included in the base-level fact table. The possible inferences that can be dug out of the location dimension table are as follows:
Location-specific average call, total call and call frequency analysis.
Higher-level data inferences related to location-wise call activities.
Traffic pressure per locality, with or without a chronological frame of reference.
The CDR data, along with other related data stored in the warehouse, are presented in the MicroStrategy dashboard in the form of interactive reports. The dashboard is divided into different layouts: OLAP analysis, customer segmentation, churn prediction and loss analysis.
In the OLAP analysis layout, data from the intelligent cube (OLAP cube) are taken and presented to create different reports. The main benefit of OLAP analysis is that it provides drilling capabilities while viewing the reports and caches data for faster response during dashboard navigation. Under OLAP analysis, we categorized the analysis into five categories.
In this layout, call duration is analyzed on the basis of the number of calls, answering the question: what call duration do most users prefer? Call duration is divided into four ranges, and the number of calls made by customers in each range is displayed in a bar graph. The number of calls per scheme is also shown in another report.
The CDR data used in our system comprised call records from all over the nation, so it was beneficial and important to perform analysis by geographical location. This kind of analysis also helps draw attention to specific areas where improvement is necessary. To make the analysis more effective, a date slider is included so that the user can restrict the geographical analysis to a selected date interval. Under geographical analysis, three different reports are shown:
Call by Area: This report is shown as a bar graph of how many calls are generated per call area. One can select a development region and see how the different areas within it differ or are similar in the number of call records.
Call by Region: A pie chart depicts the number of calls per region, i.e. the Himalayan, Hill and Terai regions.
Call by Development Region: This bar graph gives an overview of the number of calls per development region.
We had call record data spanning a year in our warehouse. Since call records are generated massively (millions of records per day), we could only use one year of records. These data then underwent quarter-wise, month-wise and day-of-week analysis. This layout comprised three reports: Call by Quarter, Call by Month and Call by Day of Week.
It is always important for a corporate telecom house to know at which times its customers make the maximum number of calls. This analysis suggests a strong point in decision making: encourage off-hour calls by introducing a new scheme. A dropdown selector narrows the analysis down to a specific area; for example, it may be of interest to the executives of an enterprise to know the least-calling hours in the Kathmandu Valley in order to improve them. The analysis can of course be based on all areas too. In this analysis, two different reports are presented:
Call by Hour of Day: This report shows the number of calls made by customers against the 24 hours of the day in a bar graph. It can also be drilled up to see the number of calls against different times of day (a TOD is a three-hour interval).
Call by Hour of Day and Day of Week: This report gives insight into how the number of calls varies with the day of the week and the hour of the call. It is helpful for comparisons among different days, e.g. Sunday noon versus Wednesday noon.
In business it is said that customers are gods. This is because they are the source of all the income of a business enterprise; if the customers are unhappy, it is the enterprise that bears the loss. This scenario is no different in the telecommunication business, so executives of such enterprises have a keen interest in knowing their customers in detail. They need to analyze their customers from different aspects, such as geography and gender. In our CDSST system, we have a customer base of 11,788 customers on which we perform the analysis. For this, we created a user (customer) analysis layout in which the user selects one or multiple areas; based on the selection, a gender-wise report in the form of a pie chart and a gender-and-area-wise report in the form of a bar diagram are displayed.
Revenue is by far the most important thing for a business enterprise, and telecommunication enterprises are no exception: their objective is generally to earn profits by giving quality service to customers. This makes revenue analysis one of the most important reports that executives want to see. We created the following reports in revenue analysis.
Revenue by Quarter: This report is shown as a pie chart in which the revenue of the four quarters is depicted.
Revenue by Scheme: In this report, revenue by scheme (post-to-post, pre-to-pre, pre-to-post and post-to-pre calls) is shown in a pie chart.
Revenue by Development Region: Revenue per development region is shown in a bar diagram. A development region can be drilled down to see which area contributes most of that region's revenue.
Call dropping is a serious issue in the telecommunication industry: many calls never get connected because they cannot reach the receiving server. In this layout, dropped calls and successful calls are compared on the basis of time of day. The analysis can also be narrowed down to a specific area using the area selector.
Mining hidden information out of the call detail record data residing in the warehouse required a powerful statistical and analytical tool, robust enough to handle voluminous data and suitable for integration with the Teradata warehouse and the MicroStrategy reporting tool. Having all these features and, above all, being open source, R was the most suitable data analytics tool for us. Teradata and MicroStrategy provide packages named TeradataR and MicroStrategyR, respectively, for communication with R. These packages played the intermediary role, bridging the gap between R and the warehouse and reporting tool.
The recency, frequency and monetary values of the customers obtained from the warehouse are first normalized to a scale of 1 to 5, 1 being the best value and 5 the worst. Thus, customers with recency, frequency and monetary values all equal to 1 are the best customers. On these normalized attribute values, clusters are created using the RFM technique. Normalizing the customers' RFM attributes caused many customers to have identical attribute values, which seemed to group customers even before the clustering was applied. The clustering algorithm was applied 10 times and the best clustering result was selected. The algorithm generated five clusters from our customer base of 11,788 customers.
The center values of R, F and M for each cluster, along with withinss (the within-cluster sum of squares), are shown in the table below.
The totss (total sum of squares), tot.withinss (total within-cluster sum of squares, i.e. sum(withinss)) and betweenss (between-cluster sum of squares, i.e. totss - tot.withinss) for the clustering result are 36176.54, 1405.462 and 34771.08, respectively. The selection criterion for the best clustering result was betweenss/totss, and the selected clustering had the highest value, 96.1%.
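The project ran this clustering in R; purely as an illustration, an equivalent sketch in Python with scikit-learn is given below. The rank-normalization rule and the placeholder data are assumptions; n_init=10 mirrors the ten runs from which the best result was kept, and inertia_ corresponds to tot.withinss.

    import numpy as np
    from sklearn.cluster import KMeans

    def rfm_rank(values, higher_is_better=False):
        # Normalize a raw R, F or M column to ranks 1-5 (1 = best) using
        # quintile boundaries; an assumed rule, not the report's exact one.
        qs = np.percentile(values, [20, 40, 60, 80])
        ranks = np.digitize(values, qs) + 1            # 1..5, low raw value -> 1
        return 6 - ranks if higher_is_better else ranks

    # rfm: (n_customers, 3) array of rank-normalized R, F, M values
    rfm = np.random.randint(1, 6, (11788, 3)).astype(float)  # placeholder data

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(rfm)
    totss = ((rfm - rfm.mean(axis=0)) ** 2).sum()
    betweenss = totss - km.inertia_                    # inertia_ = tot.withinss
    print("betweenss/totss = %.1f%%" % (100 * betweenss / totss))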
Each cluster is assigned a different color so that the clusters are distinguishable. The fourth cluster is the largest, with about 36% of customers; it mostly contains the best customers, with RFM values equal to 1. Customers with high recency, i.e. those who make a call only after a long gap, are grouped in the second cluster, which is the smallest. The remaining three clusters (the 5th, 1st and 3rd) group customers with low monetary and frequency attributes. The clusters are plotted using a 3D scatter plot in R and integrated into the MicroStrategy dashboard. Customers near the origin are the best customers, with farther customers deviating from them.
Customer counts are also shown in another panel according to the Recency, Frequency and Monetary attributes. This shows which types of customers are most numerous in our system.
Churn prediction has been implemented in CDSST using CART, a decision-tree-based algorithm. The variables chosen for building the model are:
customer's age
customer's tenure
average number of calls per day
average duration per call
average cost per call
customer's gender
A model was prepared by training and building a decision tree for the prediction; the training set contained 5% churners. Based on the results, average calls per day played the most important role in determining whether a customer will churn, followed by the customer's tenure, average duration per call and the customer's age. Gender had very little effect on the result.
The model was finally used to predict churn from the data in the warehouse. From the prediction for the customers of the year 2012, it was found that 7% of customers are likely to churn in the upcoming month. The obtained results are shown in the main dashboard.
7.5.1. Teradata Administrator
The warehouse was administered with Teradata Administrator 13.0 and other Teradata utilities, which we used for routine database administration tasks.
7.5.2. BTEQ
Basic Teradata Query (BTEQ) Window is a Unicode application that allows the user to run multiple BTEQ sessions and access different databases simultaneously. It provides pull-down menus and dialogs for sending SQL statements to the Teradata Database. With this tool, we can log on to the database, for example:
.logon 192.168.50.50/telecomdb,password
Using BTEQWin, we wrote the create-table queries and used its drag-and-drop functionality to create the tables.
7.5.3. FastLoad
FastLoad is a command-driven utility for loading large volumes of data into an empty table on a Teradata Relational Database Management System (RDBMS). FastLoad can load large amounts of data into the Teradata database much faster than BTEQ with simple insert statements. The constraints on the use of FastLoad are that the target table must be empty, the table cannot have secondary indexes, and only one table can be loaded per job.
7.5.4. Python
Python 2.7 was used for writing scripts to populate the flat files for loading the dimensions. It was also used to transform the data records from the raw format into the specific format intended for storage in our data warehouse.
7.5.5. Microsoft SQL Server Management Studio
Microsoft SQL Server Management Studio was used to draw the schema diagrams of the dimensions, which show the logical connections between the hierarchy tables and the other accessory tables of each dimension.
7.5.6. SVN
We used TortoiseSVN to create a repository for our project and share it among the project members. It allowed code, documents, data, diagrams and research material to be managed efficiently through the commit-and-update workflow.
7.5.7. Redmine
Redmine is a free and open-source, flexible project management web application. It includes a calendar and Gantt charts to aid the visual representation of projects and their deadlines, handles multiple projects, and provides integrated project management features, issue tracking, and support for various version control systems. Being written in the Ruby on Rails framework, it is cross-platform and cross-database.
7.5.8. Microstrategy
MicroStrategy, Inc. is a business intelligence (BI) software vendor whose software enables organizations to analyze vast amounts of data to make strategic business decisions. The software delivers actionable information to business users via the web and mobile devices, including Apple, Android and BlackBerry smartphones and tablets, and includes technologies to capitalize on the big data, mobile, cloud and social media trends. MicroStrategy Report Services generates dashboards, scorecards and production reports.
8.1. Results
As stated in the 'Business Intelligence and Reporting' chapter, we have been successful in generating various forms of reports that give insight into the status of a telecommunication industry from various perspectives. We have used pie charts, bar graphs, scatter plots and bubble graphs to present the results graphically, arranged in the form of a dashboard. The dashboard is highly flexible, letting users work in interactive or Flash mode; in Flash mode the charts are displayed with animations. The results can be viewed on screen and easily exported to various other formats such as PDF, HTML, Excel and Flash. The results corresponding to the various hierarchical levels created in the warehouse can be easily accessed and examined, and the output of the system can be viewed from laptops, iPads, iPhones and Android devices. Snapshots of the results are given below:
Figure 8.1 is a snapshot of the home of the CDSST dashboard. It presents the project overview of CDSST as composed of OLAP analysis, customer segmentation, churn prediction and loss analysis. These four main components of the project are structured in four layouts, which can be accessed easily from the tab headings of the dashboard.
OLAP analysis is one of the most important parts of the dashboard for the users of CDSST, i.e. the executives of a telecommunication enterprise. The data in the OLAP cube are shown under different categories. In the call analysis category, two reports are shown. The first is a bar diagram named Call by Schema, which shows how the number of calls varies between prepaid and postpaid users of the telecommunication enterprise; it suggests that most calls occur among prepaid users.
The next report is also a bar diagram, Call by Duration, in which the distribution of call duration is grouped into six ranges and the frequency of each range is displayed. This report suggests that customers prefer short-duration calls: calls shorter than 30 seconds are the most numerous. Another important inference from this report is that customers could be encouraged to make longer calls by introducing schemes and discounts for long calls, which could boost revenue considerably.
Geographical analysis consists of three reports driven by a date slider with which the user chooses a date range. The first report is a bar diagram named Call by Area, where the user selects a development region and sees the number of calls in the areas of that region; areas with a low number of calls can be identified and improved by targeting their customers. The next two reports are Call by Region (pie chart) and Call by Development Region (bar diagram). These reports can be drilled down to see which areas or zones contribute the most and the least to a region's number of calls. Information about the hovered part of a report is shown in a mouse tooltip, and the data can also be viewed as a grid by switching the view through the right-click option.
Date-wise analysis shows the call pattern across date ranges. The report Call by Day of Week shows the seven days of the week with the number of calls in a bar diagram for the date range selected on the date slider; it suggests that weekends are busier than weekdays. Similarly, the next report, Call by Quarter, is a pie diagram showing the number of calls in the four quarters of the year 2012. The fourth quarter has the highest number of calls because the Dashain and Tihar festivals fall in this quarter and call density is high at festival time. The third report is Call by Month, a bar diagram showing the call distribution across the months.
In time-wise analysis, the main objective is to find the busiest and the lowest call-density time periods. For a selected area, the number of calls for each hour of the day is shown in the Call by Hour of Day report in the form of a bar diagram. This suggests that the evening hours from 5:00 PM to 7:00 PM are the busiest; special schemes could be suggested for late-night hours, e.g. after 11 PM. The next report, Call by Hour of Day and Day of Week, shows, as its name suggests, how the number of calls varies with the time of day across the days of the week. It also provides a comparison of the different days of the week in terms of the number of calls, giving insight into the calls occurring at a certain time of day; this report suggests that Friday evening is busier than the evenings of other weekdays.
In this layout, the user can select multiple areas for user analysis. After selection, a gender-wise analysis of users is presented. In the Call by Gender pie diagram, the number of calls made by males and females in the selected areas is displayed. For some areas, female participation in making calls is very small, which could be improved by targeting such areas and giving priority to female customers. In the next report, the selected areas are compared in terms of gender and the number of calls made, showing the need to enhance plans in low call-density areas.
In revenue analysis, reports are shown in two different layouts named overview and chart. In the chart layout, three reports are shown. The first is Revenue by Quarter, which gives the performance of the enterprise in the four quarters in terms of revenue; the fourth quarter, which contains the most calls, collects the highest revenue. The second report, Revenue by Scheme, shows in a pie diagram how the interactions between prepaid and postpaid users contribute to revenue; the postpaid service, having fewer users than the prepaid service, could be boosted by scheme-oriented campaigns. Revenue by Development Region is a bar diagram showing the contribution of the five development regions to the enterprise's revenue; the central development region contributes the most, with the eastern and western development regions following.
Figure 8.8 Dashboard Output - Bubble Graph Revenue Analysis | OLAP Analysis
Figure 8.8 shows an overview of how well the telecommunication enterprise is doing from the revenue perspective. In this bubble graph, development region and quarter are the two axes, and the revenue corresponding to a development region and quarter is shown by the color of the bubble. The size of the bubble shows the number of calls: the larger the bubble, the larger the number of calls. The graph shows that the central development region contributes the most revenue and has the maximum number of calls. The eastern and western regions are a little behind the central region but are doing well, whereas the mid-western and far-western regions lag seriously behind. This suggests that special care must be taken with these two regions: marketing campaigns should be run there and they should be given top priority.
Figure 8.9 Dashboard Output - Scatter Plot 3d | Customer Segmentation with RFM
Recency, frequency and monetary rank are taken as the three axes, and their values for the different customers are plotted. Customers falling in the same cluster are assigned the same color. Customers with all RFM values equal to 1 are the best customers; these form the largest cluster, shown as black spheres. Customers with low frequency and monetary values are shown in three clusters colored red, green and yellow; these customers participate in calls more as receivers than as callers. The cluster shown in blue groups customers who have a large gap between their last call and the previous one; such customers are fewer in number.
Figure 8.10 Dashboard Output - RFM Values | Customer Segmentation with RFM
In this layout, the number of customers is shown according to the three attributes of recency, frequency and monetary value. The three reports show the number of customers according to the rank of each attribute. The number of customers with a high call frequency in a year is considerably high; similarly, the number of customers spending a large amount of money is high compared with other customers. Moreover, most customers make their next call within a few days of the previous one, as suggested by the Customer Recency bar diagram.
The output of the churn prediction performed with the CART algorithm is shown in this dashboard layout. Here, the user can select a development region and see how the customers predicted to churn are distributed across its districts. This report shows that people in urban areas are more likely to churn; for example, in the mid-western development region, Dang contributes the highest number of customers predicted to churn. This is because of the factors mentioned in the implementation section of this report as well as easy access to different service providers. The pie diagram alongside shows the number of customers predicted to churn by gender in the selected development region.
A considerably high number of calls fail to connect to the destination number for various reasons. When a call request from the MSC server reaches the IN server (prepaid) or the Mediation server (postpaid), the call is successful and the two parties are connected. The problem arises when the call request arrives at the MSC server but is lost on the route from the MSC server to the IN or Mediation server; this causes call dropping. Such calls are recorded in the CDR records with a call duration of 0 seconds. These dropped calls are compared with successful calls on the basis of the hour of day at which the call was made. The user can select a specific area or perform the analysis over all areas. Generally, more calls are dropped at times of high call traffic.
8.2. Conclusion
The data analysis performed in this project covers only a single year of data. If the data encompassed telecommunication records of several years, further forms of analysis such as time series analysis and regression analysis would be possible.
The data model can be further used to analyze and detect fraud in the telecommunication industry.
The project performs loss analysis for any CDR switch but does not suggest any method to minimize or prevent the loss.
APPENDIX
Appendix A