
Business Analytics & Text Mining Modeling Using Python


INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES


• This course is subsequent to my earlier courses in the Data Science area
– “Business Analytics & Data Mining Modeling Using R”
– “Business Analytics & Data Mining Modeling Using R Part II”
• In these two courses, we used numeric data for predictive
analytics
– Mainly ‘structured numeric data’ was processed using data mining
techniques
– Categorical variables were also processed using numeric codes


• Structured Numeric Data
– Uniform measurements are taken for all the observations in the sample
• In this course, we progress towards processing unstructured data
– Text is typically described as unstructured data
– We model prediction problems using unstructured text data


• Machine learning algorithms can be employed to model prediction problems using data which could be
– Structured numerical measurements or
– Unstructured text
• This is possible because
– Text and documents can be transformed into measured values
• In a tabular format, documents are placed on the row side and words on the column side; each cell indicates the ‘presence’ or ‘absence’ of that word in that document
– This leads to the common representation used in data mining techniques for numerical data


• Central themes in Text Mining and Data Mining are similar, with the following key differences
– Evaluation techniques
• Chronological order of publication
• Alternative measures of error
– Data are text and documents
• Specialized techniques may be preferred
– Techniques must be modified to work with high dimensional data
• Tens of thousands of words and documents


• In the related domains of ‘Natural Language Processing’ and ‘Search Engine Technology’
– Focus is on Linguistic techniques
• Essence of language understanding
– These domains are moving closer to the generic machine learning paradigm
• Learning from data, whether numerical or text
• The main theme in Text Mining is
– Empirical in nature
• Mine for recurring word patterns in large text collections, or large collections of
digital documents


• How is text mining different?
– A progression from applying analytics on large data to ‘big data’
– Nowadays, most data originate in digital form due to pervasive use of
computers
• For example, the following activities are now performed electronically
– Stock trading
– Writing a book
– Buying a product online
– Digital transactions (many paper-based transactions have been replaced by paperless digital
alternatives)


• Data Mining vs Text Mining
– Both are about finding valuable patterns in data
– Data mining domain
• In its maturity phase
– No significant development is expected
– Incremental development will continue
• No longer an emerging technology
• Techniques are highly developed
• Requires highly structured numeric data
– Involves extensive data preparation
• Lacks universal applicability


• Data Mining vs Text Mining
– Both are about learning from samples of past experience or examples
– Text mining domain
• An emerging area
• Works with large collections of documents
– Contents are readable and meaningful

– Numbers vs text
– Analytics tasks are formulated differently
• Even though many techniques are similar


• Structured data (for data mining)
– Requires data preparation involving data transformation steps
– Data collection effort might be based on careful prior design for
mining
– Measurements are well-defined and recorded uniformly for every
observation in the sample
– Types of variable measurements
• Continuous variables (Interval, ratio) and categorical variables (Nominal, ordinal)
– Finally, described in a highly structured tabular/matrix format


• Structured data (for data mining)
– A row in the tabular format is a complete example of past experience
– A column is one measurement taken uniformly for all the rows
– Creates a structured world for applications of data mining techniques
• We can operate in a typical mathematical fashion

• Unstructured Data (for text mining)
– Initial presentation is a variant of XML format
– Text is transformed into numerical data leading to tabular format used
in data mining


• Unstructured Data (for text mining)
– For text, a row represents a document (an example of prior experience)
– A column represents measurements taken to indicate the presence or absence of a word for all the rows
• Each row represents a document and each column a word
• Cells are filled with 1s & 0s


• Unstructured Data (for text mining)
– This is why techniques similar to data mining can be used in text
mining
• These techniques have been found to be very successful
• Without understanding specific properties of text such as
– The concepts of grammar or
– The meaning of words

– Example: A binary spreadsheet of words in documents


Company  Income  Job  Overseas
0        1       0    1
1        0       1    1
1        1       1    0
0        0       0    1
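
A binary spreadsheet like the one above could be produced along the following lines. This is only a minimal sketch: the four document texts are invented here so that they reproduce the rows shown, and the helper names `tokenize` and `binary_row` are hypothetical, not part of the course material.

```python
import re

# Dictionary taken from the column headers of the example spreadsheet.
dictionary = ["company", "income", "job", "overseas"]

# Invented documents (an assumption), written to match the four rows above.
documents = [
    "Income from overseas operations rose.",
    "The company posted a job overseas.",
    "Company income and job growth.",
    "Overseas travel is expensive.",
]

def tokenize(text):
    """Lowercase the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def binary_row(text, dictionary):
    """Return 1 where a dictionary word is present in the text, else 0."""
    words = tokenize(text)
    return [1 if term in words else 0 for term in dictionary]

# One row per document, one column per dictionary word.
matrix = [binary_row(doc, dictionary) for doc in documents]

for row in matrix:
    print(row)
```

Each row depends only on which dictionary words occur in the document, so no grammar or word meaning is involved.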


• Text Mining
– Words are attributes/predictors and documents are cases/records
– Together these form a sample of data that can feed our well-known
learning methods
– Machine learning techniques can be used to work with this format and
process large amounts of data
• Machine learning techniques
– Can be described as statistical techniques that work without prior knowledge about the data
– Unlike classical statistical techniques, they typically make no assumptions about the data


• Machine learning techniques
– For example, multiple linear regression assumes a linear relationship between Y (target variable) and Xs (predictors)
– Machine learning techniques make no such assumption; this lack of prior knowledge is counterbalanced by massive processing of data
• Finding patterns in word combinations that are recurring and predictive


• Understanding text characteristics
– Given a collection of documents
• Set of attributes will be the total set of ‘unique words’ in the collection
– Called the dictionary
– For thousands or even millions of documents
• Dictionary will converge to a smaller number of words
– Technical documents with alphanumeric terms may lead to very large
dictionaries
• Tabular layout can become too big in size to be practical
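
Building a dictionary from a collection can be sketched as below. The sample documents, the tokenization rule (lowercase runs of letters and digits), and the helper names `tokenize` and `build_dictionary` are assumptions made for illustration; real collections usually need more careful tokenization.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(documents):
    """Return the sorted list of unique words across all documents."""
    seen = set()
    for doc in documents:
        seen.update(tokenize(doc))
    return sorted(seen)

# Invented sample collection; the last document mimics a technical text
# with alphanumeric terms, each of which becomes its own dictionary entry.
documents = [
    "The company reported higher income.",
    "The company opened an overseas office.",
    "Part ab12 replaces part cd34.",
]

dictionary = build_dictionary(documents)

# Track how the dictionary grows as documents are added one by one.
sizes = [len(build_dictionary(documents[: n + 1])) for n in range(len(documents))]

print(dictionary)
print(sizes)
```

The number of columns in the tabular layout equals the dictionary size, so tracking `sizes` on a real collection is one way to judge whether the layout stays practical.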


• Text mining problems
– Information Retrieval
• Business Problem: Document matcher (online or device)
– Given a large collection of documents, finding relevant documents
– Analytics Component
» Task is to retrieve the relevant documents based on the best matches of input document with
the collection of documents
» New document is compared to all the other rows (documents), and the most similar rows and
their associated documents are the answers
• Similar to a search engine function
– A few words are presented, and these words are matched to others
– Best matches are presented as the responses
• Based on measuring similarity as in nearest-neighbor methods
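
The document matcher described above can be sketched as a nearest-neighbor lookup. The slides do not name a similarity measure, so cosine similarity over binary word vectors is used here as an assumption; the sample documents and the function names `cosine_similarity` and `retrieve` are likewise hypothetical.

```python
import math
import re

def tokenize(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity of two word sets treated as binary (0/1) vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def retrieve(query, documents, top_k=2):
    """Compare the query to every document and return the best matches.

    Returns a list of (document index, similarity) pairs, most similar first.
    """
    query_words = tokenize(query)
    scored = [
        (cosine_similarity(query_words, tokenize(doc)), index)
        for index, doc in enumerate(documents)
    ]
    # Sort by descending similarity; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, round(score, 3)) for score, index in scored[:top_k]]

# Invented collection (an assumption) for demonstration only.
documents = [
    "Company income rose on overseas sales.",
    "New job openings at the company.",
    "Overseas travel advice for students.",
]

results = retrieve("overseas income", documents)
print(results)
```

The query is handled exactly like a new row in the table: it is compared with every existing row, and the most similar rows identify the documents returned.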

Key References

• Fundamentals of Predictive Text Mining
– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
– By Wes McKinney (2017)

Thanks…
