Week 1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 54

CP610 Data Analysis

-Introduction & Get to know your data

Jiashu (Jessie) Zhao


At the Beginning of this course
• 9:00 – 11:50 Friday
• Location: DAWB 2-105
• Office Hour: Friday 11:50-12:30
• Office Location: N2076G

If you are contacting my or the TA by email, please


include [CP610B] in the email subject line.

2
• No Required Textbook
• References
• Data Mining: Concepts and Techniques, 2nd ed., Jiawei
Han and Micheline Kamber, Morgan Kaufmann, 2006.
• Jure Leskovec, Anand Rajaraman, and Jeffrey David
Ullman, Mining of Massive Datasets (2nd Edition),
Cambridge University Press, 2014.
• Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin
Kumar, Introduction to Data Mining (2nd Edition),
Pearson, 2018.
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data
Mining, MIT Press, 2001.

3
• Evaluation:
Assessment Weighting
Assignment 30%
Midterm 30%
Paper Critique 10%
Project 30%
Total 100%

• Paper Critique and project: You can work in a group


of 2, or work on it by yourself. You will not receive
extra marks if you work on it by yourself.

4
Paper critique
• Select a paper from recent 2 year’s papers on
related conferences and journals
• Submit a written critique with
• Introduction: paper’s information & preview of your
analysis
• Summary: paper’s main point & findings
• Critique: Strength and weakness of the paper (2-3) &
your opinion (supported by evidence from the paper)
• Conclusion: key points of your analysis & significance of
the research

5
Conferences and Journals

• KDD Conferences n Other related conferences


• ACM SIGKDD Int. Conf. on Knowledge
n DB conferences: ACM SIGMOD,
Discovery in Databases and Data
VLDB, ICDE, EDBT, ICDT, …
Mining (KDD)
n Web and IR conferences: WWW,
• SIAM Data Mining Conf. (SDM)
SIGIR, WSDM
• (IEEE) Int. Conf. on Data Mining
(ICDM) n ML conferences: ICML, NIPS
• European Conf. on Machine Learning n PR conferences: CVPR,
and Principles and practices of n Journals
Knowledge Discovery and Data n Data Mining and Knowledge
Mining (ECML-PKDD)
Discovery (DAMI or DMKD)
• Pacific-Asia Conf. on Knowledge
Discovery and Data Mining (PAKDD) n IEEE Trans. On Knowledge and
Data Eng. (TKDE)
• Int. Conf. on Web Search and Data
Mining (WSDM) n KDD Explorations
n ACM Trans. on KDD

6
Project
• Identify a research topic closely related to this
course
• Propose a novel approach – clarify how it is
different from existing approaches
• Experiments
• Datasets: Public Accessible (or upon approval)
• Experimental Results
• Validate your argument
• Prove the effectiveness of your approach
• Presentation (10 min)
• Submit a report
7
Report Format
• Template:
• https://www.ieee.org/conferences/publishing/template
s.html
• Paper Critique: 1-2 pages
• Project: 5 pages
Introduction Outline
• Why data analysis?
• What is data analysis?
• What kind of data?
• Data analysis functions and evaluation
• Applications of Data Analysis
• Major Challenges in Data Analysis

9
Why Data Analysis?

• The Explosive Growth of Data: from terabytes to petabytes


• Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
• Major sources of abundant data
• Business: E-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”— Data analysis

10
What is data analysis
• Data analysis is a process of
inspecting, cleansing, transforming and modeling
data with the goal of discovering useful
information, informing conclusions and supporting
decision-making.
Data Science

Data
Analysis
Data
Mining

11
Example: A Web Mining Framework

• Web mining usually involves


• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored into
knowledge-base

12
Example: Medical Data Mining
• Health care & medical data mining – often adopted
such a view in statistics and machine learning
• Preprocessing of the data (including feature
extraction and dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation

13
What Kinds of Data?
• Database-oriented data sets and applications
• Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications


• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web

• Big Data
14
Data Analysis Function: Generalization

• Information integration and data warehouse construction


• Data cleaning, transformation, integration, and
multidimensional data model
• Data cube technology
• Scalable methods for computing (i.e., materializing)
multidimensional aggregates
• OLAP (online analytical processing)
• Multidimensional concept description: Characterization and
discrimination
• Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet region
15
Data Analysis Function: Association and Correlation
Analysis
• Frequent patterns (or frequent itemsets)
• What items are frequently purchased together in your
Walmart?
• Association, correlation vs. causality
• A typical association rule
• Computer à Software [1%, 50%] (support, confidence)

• How to mine such patterns and rules efficiently in large


datasets?

16
Data Analysis Function: Classification

• Classification and label prediction


• Construct models (functions) based on some training examples
• Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
• Predict some unknown class labels
• Typical methods
• Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression, …
• Typical applications:
• Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
17
Classification Example
• Given existing data about customers and payments,
predict new applicant’s loan eligibility.

18
Data Analysis Function: Cluster Analysis

• Unsupervised learning (i.e., Class label is unknown)


• Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass
similarity
• Many methods and applications

19
Data Analysis Function: Outlier Analysis

• Outlier analysis
• Outlier: A data object that does not comply with the general behavior of
the data
• Noise or exception? ― One person’s garbage could be another person’s
treasure
• Methods: statistical tests, distance measures, density-based methods
• Useful in fraud detection, rare events analysis

20
Time and Ordering: Sequential Pattern, Trend and Evolution
Analysis

• Sequence, trend and evolution analysis


• Trend, time-series, and deviation analysis: e.g., regression and
value prediction
• Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
• Periodicity analysis
• Biological sequence analysis
• Approximate and consecutive motifs
• Similarity-based analysis
• Mining data streams
• Ordered, time-varying, potentially infinite, data streams
• Conversational Analysis

21
Structure and Network Analysis

• Graph mining
• Finding frequent subgraphs (e.g., chemical compounds), substructures
(web fragments)
• Information network analysis
• Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
• Multiple networks
• A person could be multiple information networks: friends, family,
classmates, …
• Links carry a lot of semantic information: Link mining
• Web mining
• Web is a big information network: from PageRank to Google
• Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …

22
Evaluation of Knowledge
• Are all mined knowledge interesting?
• One can mine tremendous amount of “patterns” and knowledge
• Some may fit only certain dimension space (time, location, …)
• Some may not be representative, may be transient, …

• Evaluation of mined knowledge


• Descriptive vs. predictive
• Typicality vs. novelty
• Accuracy
• Timeliness
• …

24
Application Areas
• Industry • Application
Finance Credit Card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record analysis
Transport promotion analysis
Consumer goods Value added data
Data Service providers Power usage analysis
… …

25
Data Analysis works with
Warehouse Data
• Data Warehousing
provides the Enterprise
with a memory

• Data Analysis provides


the Enterprise with
intelligence

26
Major Challenges in Data Analysis (1)

• Mining Methodology
• Mining various and new kinds of knowledge
• Data Analysis: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive analysis
• Incorporation of background knowledge
• Presentation and visualization of data analysis results

27
Major Challenges in Data Analysis (2)

• Efficiency and Scalability


• Efficiency and scalability of data analysis algorithms
• Parallel, distributed, stream, and incremental analysis methods
• Diversity of data types
• Handling complex types of data
• Analyzing dynamic, networked, and global data repositories
• Data analysis and society
• Social impacts of data analysis
• Privacy-preserving data analysis

28
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

29
Data Objects

• Data sets are made up of data objects.


• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects,
tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns ->attributes.

30
Attributes
• Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
• E.g., customer _ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled

31
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings

32
• Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚, calendar dates
• No true zero-point
• Ratio
• A ratio-scaled attribute is a numeric attribute with an inherent zero-
point.
• We can speak of values as being an order of magnitude larger than
the unit of measurement.
• e.g., $10 is twice as high as $5

33
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
34
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

35
Measuring the Central Tendency
1 n
• Mean (algebraic measure) (sample vs. population): x = å xi (1)
Note: n is sample size and N is population size. n i =1
n
Weighted arithmetic mean:
åw x

i i
• Trimmed mean: chopping extreme values
x= i =1
n
(2)
å wi
• Median: i =1

• Middle value if odd number of values, or average of the middle two


values otherwise
• Estimated by interpolation (for grouped data):

(3)
n / 2 - (å freq)l
Mode
median = L1 + ( ) width

freqmedian
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal

37
Symmetric vs.
Skewed Data
• Median, mean and mode of symmetric
symmetric, positively and
negatively skewed data

positively skewed negatively skewed

38
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR

• Variance and standard deviation (sample: s, population: σ)


• Variance: (algebraic, scalable computation)

1 n 1 n 2 1 n 1 n 1 n
s = 2
å
n - 1 i =1
( xi - x ) =
2
[å xi - (å xi ) 2 ]
n - 1 i =1 n i =1
s = å ( xi - µ ) 2 =
2

N i =1 N
å xi - µ 2
i =1
2

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)

39
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually

40
Histogram Analysis

40
• Histogram: Graph display of tabulated
35
frequencies, shown as bars
30
• It shows what proportion of cases fall
into each of several intervals 25
20
15
10
5
0
10000 30000 50000 70000 90000

41
Histograms Often Tell More than Boxplots

n The two histograms


shown in the left may
have the same boxplot
representation
n The same values
for: min, Q1,
median, Q3, max
n But they have rather
different data
distributions

42
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

43
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

44
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p é x11 ... x1f ... x1p ù
dimensions ê ú
ê ... ... ... ... ... ú
êx ... xif ... xip ú
ê i1 ú
ê ... ... ... ... ... ú
êx ... xnf ... xnp úú
• Dissimilarity matrix êë n1 û
• n data points, but
registers only the é 0 ù
distance ê d(2,1) 0 ú
ê ú
• A triangular matrix ê d(3,1) d ( 3,2) 0 ú
ê ú
ê : : : ú
êëd ( n,1) d ( n,2) ... ... 0úû

45
Proximity Measure for Nominal Attributes

• Method 1: Simple matching


• m: # of matches, p: total # of variables

d (i, j) = p -
p
m

• Method 2: Binary Encoding


• Creating a new binary attribute for each of the M nominal
states

46
Proximity Measure for Binary Attributes
A contingency table for binary data
Object j

Object i

• Distance measure for symmetric binary variables:

• Distance measure for asymmetric binary variables:

• Jaccard coefficient (similarity measure for asymmetric binary


variables): 47
Distance on Numeric Data: Minkowski Distance

• Minkowski distance: A popular distance measure

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) £ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
49
Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance


• E.g., the Hamming distance: the number of bits that are different
between two binary vectors

d (i, j) =| x - x | + | x - x | +...+ | x - x |
i1 j1 i2 j 2 ip jp
• h = 2: (L2 norm) Euclidean distance
d (i, j) = (| x - x |2 + | x - x |2 +...+ | x - x |2 )
i1 j1 i2 j 2 ip jp
• h ® ¥. “supremum” (Lmax norm, L¥ norm) distance.
• This is the maximum difference between any component
(attribute) of the vectors

50
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
x2 x4
L2 x1 x2 x3 x4
4 x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

2 x1
Supremum
L¥ x1 x2 x3 x4
x1 0
x2 3 0
x3 x3 2 5 0
0 2 4 x4 3 1 5 0
Proximity for Ordinal Variables

• Order is important, e.g., rank


• Can be treated like numeric data
• replace xif by their rank rif Î{1,..., M f }
• map the range of each variable onto [0, 1] by replacing i-th
object in the f-th variable by
rif -1
zif =
M f -1

• compute the dissimilarity using methods for numeric


variables

Example: {small, median, large} -> rank {1, 2, 3} ->numerical {0, 0.5, 1}
52
Attributes of Mixed Type

• A database may contain all attribute types


• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects

S pf = 1d ij( f ) dij( f )
d (i, j) =
S pf = 1d ij( f )
if xif = xjf =0 and f is asymmetric binary, otherwise
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and
• Treat zif as numeric zif =
r -1
if

M -1 f
53
Textual Data: Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(A, B) = (A• B) /||A|| ||B|| =

where • indicates vector dot product, ||d||: the length of vector

54
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

55
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Gain insight into the data by:
• Basic statistical data description: central tendency, dispersion, graphical
displays
• Data visualization: map data onto graphical primitives
• Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of research.

56
References
• Part of the slides are based on
• the slides provided by Jiawei Han, Micheline Kamber,
and Jian Pei. © 2012 Han, Kamber & Pei.
• the slides provide by Richong Zhang @
http://act.buaa.edu.cn

57

You might also like