Lec1 PDF

Business Analytics & Text Mining
Modeling Using Python

INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES
• This course follows on from my earlier courses in the Data
Science area
– “Business Analytics & Data Mining Modeling Using R”
– “Business Analytics & Data Mining Modeling Using R Part II”
• In these two courses, we used numeric data for predictive
analytics
– Mainly ‘structured numeric data’ was processed using data mining
techniques
– Categorical variables were also processed using numeric codes
• Structured Numeric Data
– Uniform measurements are taken for all the observations in the
sample
• In this course, we progress towards processing unstructured
data
– Text is typically described as unstructured data
– We model prediction problems using unstructured text data
• Machine learning algorithms can be employed to model
prediction problems using data which could be
– Structured numerical measurements or
– Unstructured text
• This is possible because
– Text and documents can be transformed into measured values
• In a tabular format with documents as rows and words as columns, each cell
indicates the ‘presence’ or ‘absence’ of a word in a document
– This leads to the common representation used in data mining techniques for numerical data
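The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the three documents are made up, and the whitespace tokenizer is a deliberately naive assumption.

```python
# Illustrative documents (made up for this sketch, not course data).
documents = [
    "the company reported higher income",
    "new job openings overseas",
    "the company offers a job overseas",
]


def tokenize(text):
    """Lowercase the text and split on whitespace (a naive tokenizer)."""
    return text.lower().split()


# Columns: the sorted unique words across all documents.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

# Rows: one per document; a cell is 1 if the word occurs in it, else 0.
matrix = []
for doc in documents:
    words_in_doc = set(tokenize(doc))
    matrix.append([1 if word in words_in_doc else 0 for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row is one document expressed as measured values, which is the same layout that data mining techniques expect for numerical data.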
• Central themes in Text Mining and Data Mining are similar,
with the following key differences
– Evaluation techniques
• Chronological order of publication
• Alternative measures of error
– Data are text and documents
• Specialized techniques may be preferred
– Techniques must be modified to work with high dimensional data
• Tens of thousands of words and documents
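The slides do not say which modifications are meant. One common adjustment for high-dimensional text data, offered here as an assumption, is to store only the positions of the 1s in each row instead of every cell. The dictionary size of 50,000 and the word positions below are hypothetical.

```python
def to_sparse(row):
    """Keep only the column indices whose cell is 1."""
    return {index for index, cell in enumerate(row) if cell == 1}


def to_dense(sparse_row, n_columns):
    """Rebuild the full 0/1 row from the stored indices."""
    return [1 if index in sparse_row else 0 for index in range(n_columns)]


# A document row over a hypothetical dictionary of 50,000 words,
# of which only three occur in the document.
dense_row = [0] * 50_000
for index in (12, 4_071, 38_996):
    dense_row[index] = 1

sparse_row = to_sparse(dense_row)
print(len(dense_row), "cells stored densely;", len(sparse_row), "stored sparsely")
```

Because a single document contains only a small fraction of the words in the collection, most cells are 0 and the sparse form is far smaller than the dense one.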
• In the related domains of ‘Natural Language Processing’ and
‘Search Engine Technology’
– Focus is on linguistic techniques
• Essence of language understanding
– Becoming closer to the generic machine learning paradigm
• Learning from data, whether numerical or text
• Main theme in Text Mining is
– Empirical in nature
• Mine for recurring word patterns in large text collections, or large collections of
digital documents
• How is text mining different?
– A progression from applying analytics on large data to ‘big data’
– Nowadays, most data originate in digital form due to the pervasive use of
computers
• For example, the following activities are now performed electronically
– Stock trading
– Writing a book
– Buying a product online
– Digital transactions (many paper-based transactions have been replaced by paperless digital
alternatives)
• Data Mining vs Text Mining
– Both are about finding valuable patterns in data
– Data mining domain
• In its maturity phase
– No significant development is expected
– Incremental development will continue
• No longer an emerging technology
• Techniques are highly developed
• Requires highly structured numeric data
– Involves extensive data preparation
• Lacks universal applicability
• Data Mining vs Text Mining
– Both are about learning from samples of past experience or examples
– Text mining domain
• An emerging area
• Works with large collections of documents
– Contents are readable and meaningful
– Numbers vs text
– Analytics tasks are formulated differently
• Even though many techniques are similar
• Structured data (for data mining)
– Requires data preparation involving data transformation steps
– Data collection effort might be based on careful prior design for
mining
– Measurements are well-defined and recorded uniformly for every
observation in the sample
– Types of variable measurements
• Continuous variables (Interval, ratio) and categorical variables (Nominal, ordinal)
– Finally, described in a highly structured tabular/matrix format
• Structured data (for data mining)
– A row in the tabular format is a complete example of past experience
– A column is one measurement taken uniformly for all the rows
– Creates a structured world for applications of data mining techniques
• We can operate in a typical mathematical fashion
• Unstructured Data (for text mining)
– Initial presentation is a variant of XML format
– Text is transformed into numerical data leading to tabular format used
in data mining
– For text, a row represents a document (an example of prior
experience)
– A column represents measurements taken to indicate the presence or
absence of a word for all the rows
• Each row represents a document and each column a word
• Cells are filled with 1s & 0s
– This is why techniques similar to data mining can be used in text
mining
• These techniques have been found to be very successful
• They succeed without understanding specific properties of text, such as
– The concepts of grammar or
– The meaning of words
– Example: A binary spreadsheet of words in documents
Company  Income  Job  Overseas
0        1       0    1
1        0       1    1
1        1       1    0
0        0       0    1
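As a small sketch of how to read the spreadsheet above, the snippet below stores the same four rows and recovers which words each document contains and how many documents contain each word. The variable names are illustrative choices, not part of the course material.

```python
# The binary spreadsheet from the slide: one row per document, one column per word.
words = ["Company", "Income", "Job", "Overseas"]
rows = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

# Words present in each document (cells equal to 1, read across a row).
words_per_document = [
    [word for word, cell in zip(words, row) if cell == 1] for row in rows
]

# Number of documents containing each word (sum down a column).
document_frequency = {
    word: sum(row[index] for row in rows) for index, word in enumerate(words)
}

for number, present in enumerate(words_per_document, start=1):
    print(f"Document {number}: {present}")
print(document_frequency)
```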
• Text Mining
– Words are attributes/predictors and documents are cases/records
– Together these form a sample of data that can feed our well-known
learning methods
– Machine learning techniques can be used to work with this format and
process large amounts of data
• Machine learning techniques
– Can be described as statistical techniques without prior knowledge
– Unlike statistical techniques, they typically do not make assumptions
about the data
• Machine learning techniques
– For example, multiple linear regression (a statistical technique) assumes a linear
relationship between Y (target variable) and Xs (predictors)
– In machine learning, the absence of such assumptions is counterbalanced by massive
processing of data
• Finding patterns in word combinations that are recurring and predictive
• Understanding text characteristics
– Given a collection of documents
• Set of attributes will be the total set of ‘unique words’ in the collection
– Called the dictionary
– For thousands or even millions of documents
• Dictionary will converge to a relatively small number of words
– Technical documents with alphanumeric terms may lead to very large
dictionaries
• Tabular layout can become too big in size to be practical
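The dictionary described above can be sketched as a set of unique words that grows as documents are added. The four documents below are made up for illustration and the tokenizer is a simple whitespace split; on such a toy collection the levelling-off only hints at the behaviour the slide describes for large collections.

```python
# Illustrative documents (made up for this sketch, not course data).
documents = [
    "the company reported income",
    "the company reported a job",
    "a job overseas",
    "the company reported income overseas",
]

dictionary = set()      # unique words seen so far
dictionary_sizes = []   # dictionary size after each document is added

for doc in documents:
    dictionary.update(doc.lower().split())
    dictionary_sizes.append(len(dictionary))

print(sorted(dictionary))
print(dictionary_sizes)
```

Later documents add fewer new words because most of their words are already in the dictionary.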
• Text mining problems
– Information Retrieval
• Business Problem: Document matcher (online or device)
– Given a large collection of documents, finding relevant documents
– Analytics Component
» The task is to retrieve the relevant documents based on the best matches of an input
document against the collection of documents
» A new document is compared to all the other rows (documents), and the most similar rows and
their associated documents are the answers
• Similar to a search engine function
– A few words are presented, and these words are matched to others
– Best matches are presented as the responses
• Based on measuring similarity as in nearest-neighbor methods
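The matching step can be sketched as a nearest-neighbor lookup over binary rows. The slides do not name a similarity measure, so Jaccard similarity (shared words divided by words present in either row) is an assumption here, as are the function names and the query. The collection reuses the rows of the binary spreadsheet shown earlier.

```python
def jaccard(row_a, row_b):
    """Shared words divided by words present in either row (0.0 if both are empty)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(row_a, row_b) if a == 1 or b == 1)
    return both / either if either else 0.0


def most_similar(query, collection, k=2):
    """Return (row index, similarity) for the k rows most similar to the query."""
    scored = [(jaccard(query, row), index) for index, row in enumerate(collection)]
    # Highest similarity first; ties are broken by the lower row index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored[:k]]


# Columns: Company, Income, Job, Overseas (rows as in the binary spreadsheet).
collection = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

# Hypothetical query containing the words "Company" and "Job".
query = [1, 0, 1, 0]
matches = most_similar(query, collection, k=2)
print(matches)
```

The same function covers both cases in the slides: a full input document or a few search words are turned into a row over the same columns and compared against every row in the collection.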
Key References
• Fundamentals of Predictive Text Mining
– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython
– By Wes McKinney (2017)
Thanks…
