
UNIT-V

Contents:
Data Visualization and Overall Perspective: Aggregation, Historical information,
Query Facility, OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP,
Data Mining interface, Security, Backup and Recovery, Tuning Data Warehouse,
Testing Data Warehouse. Warehousing applications and Recent Trends: Types
of Warehousing Applications, Web Mining, Spatial Mining and Temporal Mining
Data Mining Interface, Security,
Backup and Recovery
A data warehouse is like a library with many books; some of those books are locked away, and not everyone is allowed to read them.
Data Warehousing - Security

• The objective of a data warehouse is to make large amounts of
data easily accessible to the users, hence allowing the users to
extract information about the business as a whole.
• But we know that there could be some security restrictions
applied on the data that can be an obstacle for accessing the
information.
• If the analyst has a restricted view of data, then it is impossible
to capture a complete picture of the trends within the
business. If an analyst who studies the data can only see part of the
information because of these restrictions, it is like trying to solve a puzzle with missing pieces.
Without seeing the whole picture, it becomes very difficult to understand all the
patterns and trends happening in the business.
Security Requirements
• We should consider the following possibilities from the start, during the
design phase:
– Whether the new data sources will require new security and/or audit
restrictions to be implemented.
– Whether new users will be added who have restricted access to data that
is already generally available. In other words, if we add new users, should they
have limited access to certain data that others can see freely?

• The following activities get affected by security measures −
– User access − Who can see and use the data?
– Data load − How securely can we bring new data into the system?
– Data movement − How safely can we move data between different parts of the warehouse?
– Query generation − How do we ensure users can only run searches and queries on data they are allowed to access?
User Access
• We need to first classify the data and then classify the users on
the basis of the data they can access.
• Data Classification
• The following two approaches can be used to classify the data
– Data can be classified according to its sensitivity.
– Data can also be classified according to the job function.
• User classification
– The following approaches can be used to classify the users
– Users can be classified as per the hierarchy of users in an organization,
i.e., users can be classified by departments, sections, groups, and so on.
– Users can also be classified according to their role, with people grouped
across departments based on their role.
Auditing in a data warehouse is like keeping a detailed record of who is doing what with the data to ensure everything is secure. However, it’s
expensive and can slow down the system because it uses a lot of resources. That’s why it’s advised to turn off auditing when it’s not
absolutely necessary.

Audit Requirements
• Auditing is a subset of security and a costly activity. Auditing can
cause heavy overheads on the system.
• To complete an audit in time, we require more hardware;
therefore, it is recommended that wherever possible, auditing
should be switched off.
• Audit requirements can be categorized as follows −
– Connections − Who is logging into the system?
– Disconnections − Who is logging out of the system?
– Data access − Who is viewing or using the data?
– Data change − Who is making changes to the data?
In short, auditing helps track activities for security but should be used only when needed to avoid unnecessary costs and slowdowns.
Network Requirements
It ensures that data stays safe while it’s being transferred over the network.

• Network security is as important as other securities. We


cannot ignore the network security requirement.
• We need to consider the following issues −
– Is it necessary to encrypt data before transferring it to the data
warehouse?
– Are there restrictions on which network routes the data can take?
Network routes refer to the paths data takes when moving over a network. The question is:

Are there rules to make sure data only travels through safe and trusted routes? This helps avoid sending data through
unsafe or risky paths where it could be stolen or tampered with.
Data Movement
• Suppose we need to transfer some restricted data as a flat file
(a simple data file like a .csv or .txt) to be loaded.
• When the data is loaded into the data warehouse, the
following questions are raised −
– Where is the flat file stored? Where do you keep the file before loading it into the warehouse?
– Who has access to that disk space? Who is allowed to access the disk or storage where the file is saved?

– Do you backup encrypted or decrypted versions? Do you save the file as
encrypted (locked for safety) or decrypted (readable)?
– Do these backups need to be made to special tapes that are stored
separately? Do backups need to go on special tapes or devices, and should those be stored in a secure place?
– Who has access to these tapes? Who is allowed to access these backup tapes?
– Where is that temporary table to be held? If the data is temporarily stored
in a table before being fully loaded, where is that table stored?
– How do you make such a table visible? How do you make the temporary
table accessible only to those who need it, without exposing it to everyone?
Proper documentation is essential for audit and security requirements. It serves as proof and explanation of why certain security
measures are in place.
Documentation

• The audit and security requirements need to be properly


documented.
• This will be treated as a part of justification. This document
can contain all the information gathered from −
– Data classification How is the data categorized (e.g., sensitive, public, restricted)?
– User classification What access levels do different users have (e.g., admin, analyst, viewer)?
– Network requirements What are the rules for transferring data securely over the network?
– Data movement and storage requirements How is the data transferred, stored, and backed up
securely?

– All auditable actions − What activities (e.g., logins, data access, or changes) are tracked and recorded for auditing?
Security plays a big role in shaping how a data warehouse or application is built and maintained.

Impact of Security on Design

• Security affects the application code and the development


timescales.
• Security affects the following area −
– Application development
– Database design
– Testing

– Application development − Developers need to write extra code to make sure the
application is secure (e.g., encrypting data, limiting user access). This adds
complexity and time to the development process.
– Database design − The database needs to be structured in a way that ensures data
is stored securely, which might involve using encryption, access controls, and
special storage solutions.
– Testing − More time and effort are needed to test the security features of the
system, ensuring that the data is protected and the application works as expected
under secure conditions.
When planning for hardware backup, it's important to choose the right equipment because the speed of backing up and restoring data depends on
several factors:
Hardware Backup
– Hardware used − Different devices have different speeds for storing and
retrieving data.
– Connection type − How the hardware is connected to the network can affect
backup speed.
– Network bandwidth − The speed of the network can impact how quickly data is
transferred for backup.
– Backup software − The tools used for the backup process also play a role in how
efficiently it works.
– Server I/O speed − The speed at which the server can read and write data affects
the backup and restore times.
• It is important to decide which hardware to use for the backup.
• The speed of processing the backup and restore depends on the
hardware being used, how the hardware is connected, bandwidth
of the network, backup software, and the speed of server's I/O
system.
– Disk Backups − Using hard drives or other storage devices for backups, which are fast and easy to use.
– Tape Technology − Using tape drives to store backups, which are slower but often used for long-term storage.
• The tape choice can be categorized as follows −
– Tape media The actual tapes used for storage.
– Standalone tape drives A single device that reads and writes data to tapes.
– Tape stackers Devices that can hold multiple tapes to allow automatic loading and unloading.
– Tape silos Large systems that store and manage many tapes, often used for very large backups.
Disk Backups involve storing backup data on hard drives or disks instead of using tapes.

Disk backups are used because:
– Speed of initial backups: It’s faster to back up
data onto a disk compared to tape.
– Speed of restore: It’s also quicker to restore
data from a disk than from a tape.
• Methods of disk backups are −
– Disk-to-disk backups − This method involves saving backups directly onto another disk (instead of tape).
– Mirror breaking
Disk-to-Disk Backups
• Here backup is taken on the disk rather than on the tape. Disk-to-
disk backups are done for the following reasons −
– Speed of initial backups
– Speed of restore
Mirror Breaking
Mirrored disks: Two copies of the same data are stored on different disks, creating a "mirror" of the data. This
ensures that if one disk fails, the other has an exact copy, keeping the data safe.
When backup is needed: One of the mirrored disks can be temporarily separated, or "broken out," and used to create
a backup without affecting the normal operation of the system.
• The idea is to have disks mirrored for resilience during the
working day. When backup is required, one of the mirror sets
can be broken out. This technique is a variant of disk-to-disk
backups.
Backup software not only helps create backups but also manages and controls the entire backup strategy.
Software Backups
• There are software tools available that help in the backup process.
These software tools come as a package.
• These tools not only take backup, they can effectively manage and
control the backup strategies.

• The criteria for choosing the best software package are listed below
– How scalable is the product as tape drives are added?
– Does the package have a client-server option, or must it run on the database
server itself? Does the software need to run on the main database server, or can it work remotely from a client?
– Will it work in cluster and MPP environments? Can the software work in systems where multiple servers or processing units are involved?
– What degree of parallelism is required?
– What platforms are supported by the package? Does the software support the platforms (operating systems)
you are using?

– Does the package support easy access to information about tape contents?
– Is the package database aware?
– What tape drive and tape media are supported by the package?
Data Warehousing - OLAP
Introduction
An Online Analytical Processing (OLAP) server is based on the
multidimensional data model. It helps analyze data stored in multiple dimensions, like looking at data from
different angles (e.g., by time, location, or product).
It allows managers and analysts to get an insight into the
information through fast, consistent, and interactive
access to it.
Types of OLAP Servers
We have four types of OLAP servers −
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
4. Specialized SQL Servers
Relational OLAP (ROLAP) is a system that sits between the relational database (where data is stored) and the client tools (used by users to access and
analyze the data).
ROLAP uses a relational database to store and manage data in the data warehouse, which is structured in tables (like in typical databases).

Relational OLAP

• ROLAP servers are placed between the relational back-end
server and client front-end tools.
• To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
• ROLAP includes the following −
– Implementation of aggregation navigation logic, to quickly calculate summaries and totals (aggregations) of the data for analysis.
– Optimization for each DBMS back end. It customizes performance to work efficiently with the specific relational database system being used.
– Additional tools and services. It provides extra tools to enhance the user experience and help with data management.
Multidimensional OLAP (MOLAP) uses special data storage systems that organize data in a way that allows users to easily
view and analyze it from multiple perspectives, like looking at data by different categories or dimensions (e.g., time,
location, product).
Multidimensional OLAP

• MOLAP uses array-based multidimensional storage engines for
multidimensional views of data.
• With multidimensional data stores, the storage utilization may
be low if the data set is sparse.
• Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
• A MOLAP cube is built for fast information retrieval, and is
optimal for slicing and dicing operations.
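The two-level storage idea can be sketched as follows (a toy illustration, not a real MOLAP engine; dimensions and values are made up): dense parts of the cube sit in an array, while sparse parts are kept in a dictionary keyed by cell coordinates so empty cells cost nothing.

```python
# Toy illustration of dense vs. sparse cube storage in a MOLAP-like engine.
import numpy as np

shape = (4, 3, 2)                     # 4 quarters x 3 cities x 2 products

dense_block = np.zeros(shape)         # dense sub-cube: plain array, every cell stored
dense_block[0, 1, 1] = 120.0

sparse_block = {(3, 2, 0): 45.0}      # sparse sub-cube: only non-empty cells stored

def cell(t, c, p):
    """Look up a cell value, falling back to the sparse store."""
    value = dense_block[t, c, p]
    return value if value else sparse_block.get((t, c, p), 0.0)

print(cell(0, 1, 1), cell(3, 2, 0), cell(2, 0, 0))
```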
Hybrid OLAP
• Hybrid OLAP is a combination of both ROLAP and MOLAP.
• It offers the higher scalability of ROLAP and the faster computation of
MOLAP.
Scalability of ROLAP: HOLAP can handle large volumes of data because it uses relational databases (like ROLAP) that
are good at managing huge datasets.
Speed of MOLAP: It also offers fast data analysis and computation, thanks to the multidimensional storage of MOLAP.
• HOLAP servers allow storing large volumes of
detailed information.
• The aggregations are stored separately in a MOLAP store.
Aggregations in MOLAP: For faster calculations, the summarized (aggregated) data is stored in MOLAP format.


Specialized SQL Servers are designed to handle complex queries efficiently, particularly in data warehouses.

Specialized SQL Servers


Advanced query language support: These servers provide enhanced capabilities to execute SQL queries, especially for
more complex data structures like star and snowflake schemas (used in data warehousing).

• Specialized SQL servers provide advanced query language and


query processing support for SQL queries over star and
snowflake schemas in a read-only environment.
Read-only environment: These servers are typically used for querying and analyzing data without making changes to it,
ensuring that the data remains intact and consistent.
OLAP Operations

• OLAP servers are based on multidimensional view of data.


• Here is the list of OLAP operations −
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
Roll-up
For example, if you start with data organized by city (lower level), a roll-up operation will
aggregate this data to a higher level, like country (higher level), combining all cities
into one summary for each country. Instead of looking at individual cities, you'd see the
total data for the country.

• Roll-up performs aggregation on a data cube in any of the


following ways −
moving to a higher level in a hierarchy
– By climbing up a concept hierarchy for a dimension
– By dimension reduction
• Roll-up is performed by climbing up a concept hierarchy for the
dimension location.
• Initially the concept hierarchy was "street < city < province <
country".
• On rolling up, the data is aggregated by ascending the location
hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
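A small pandas sketch of this roll-up (the cities, countries and amounts are made-up figures): sales recorded by city are re-aggregated one level up, by country.

```python
# Roll-up sketch: climb the location hierarchy from city to country. Data is illustrative.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "amount":  [605, 825, 440, 1560],
})

by_city = sales.groupby(["country", "city"], as_index=False)["amount"].sum()
by_country = sales.groupby("country", as_index=False)["amount"].sum()   # roll-up
print(by_country)
```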
It allows you to explore data at a more detailed level by moving down in a hierarchy or adding new dimensions.
Drill-down
For example, if you start with data at the quarter level, drilling down will allow you to
break it into month-level data, giving you a more detailed view. You could even drill
down into days or other finer levels.

• Drill-down is the reverse operation of roll-up.


• It is performed by either of the following ways
– By stepping down a concept hierarchy for a dimension
– By introducing a new dimension.
– Drill-down is performed by stepping down a concept
hierarchy for the dimension time, i.e., moving from a
higher level to a lower level in a dimension hierarchy.
– Initially the concept hierarchy was "day < month < quarter <
year."
– On drilling down, the time dimension is descended from
the level of quarter to the level of month.
– When drill-down is performed, one or more dimensions
from the data cube are added.
– It navigates the data from less detailed data to highly
detailed data.
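The mirror image of the roll-up sketch, again with made-up figures: drilling down descends the time hierarchy from quarter to month, exposing more detailed rows instead of one summary row per quarter.

```python
# Drill-down sketch: descend the time hierarchy from quarter to month. Data is illustrative.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "amount":  [100, 120, 90, 110, 130, 95],
})

by_quarter = sales.groupby("quarter", as_index=False)["amount"].sum()              # summary view
by_month = sales.groupby(["quarter", "month"], as_index=False)["amount"].sum()     # drill-down
print(by_month)
```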
Slice is an operation that extracts a specific portion (or "slice") of a data cube based on a single value from one of the dimensions,
creating a new sub-cube.

Slice
• The slice operation selects one particular dimension from a
given cube and provides a new sub-cube.
– Here Slice is performed for the dimension "time" using the criterion
time = "Q1".
– It will form a new sub-cube by selecting one or more dimensions.
Slice helps you focus on a specific part of the data by isolating one dimension at a
time, like zooming in on data from a particular time period or category.
If you have a data cube with multiple dimensions (like time, location, and product), the slice
operation selects a specific value from one of these dimensions.
If you perform a slice on the time dimension with the criterion time = "Q1" (Quarter 1), it will create a
sub-cube that contains only the data for that specific time period (Q1), while the other dimensions
(like location and product) remain the same.
Dice
• Dice selects two or more dimensions from a given cube and
provides a new sub-cube.
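A minimal sketch of slice and dice with pandas (the dimension values and amounts are illustrative): a slice fixes one dimension to a single value, while a dice restricts two or more dimensions to chosen values.

```python
# Slice and dice sketch over a flattened cube. Data is made up for illustration.
import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Toronto", "Chicago", "Toronto", "Chicago"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":   [605, 825, 680, 512],
})

slice_q1 = cube[cube["time"] == "Q1"]                # slice: time = "Q1"
dice = cube[cube["time"].isin(["Q1", "Q2"]) &        # dice: restrict two dimensions
            cube["location"].isin(["Toronto"])]
print(slice_q1, dice, sep="\n\n")
```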
In simple terms, pivot helps you look at the same data from different angles by rotating the axes, making it easier to view and analyze the
data in alternative ways.
Pivot
Pivot, also known as rotation, is an operation that changes the orientation of the data in a data cube to provide a different
perspective or view of the data.
• The pivot operation is also known as rotation.
• It rotates the data axes in view in order to provide an
alternative presentation of data.
If you have a data cube with dimensions like time, location, and product, the pivot operation
would allow you to change which dimension is shown on the rows or columns.
You could rotate the cube to show the location on the rows and product on the columns
instead of the original arrangement.
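A short pandas sketch of the pivot (rotate) operation, with made-up numbers: the same values are shown first with items on the rows, then rotated so that locations are on the rows and items on the columns.

```python
# Pivot (rotate) sketch: swap which dimension appears on rows vs. columns.
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Chicago", "Chicago"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":   [605, 825, 440, 512],
})

view1 = cube.pivot_table(index="item", columns="location", values="amount")
view2 = cube.pivot_table(index="location", columns="item", values="amount")   # rotated view
print(view2)
```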
Data Warehousing – Tuning and Testing
Data Warehousing - Tuning
• A data warehouse keeps evolving and it is unpredictable what
query the user is going to post in the future. Therefore it
becomes more difficult to tune a data warehouse system.
• Tuning a data warehouse is a difficult procedure due to
following reasons −
– Data warehouse is dynamic; it never remains constant.
– It is very difficult to predict what query the user is going to post in the
future.
– Business requirements change with time.
– Users and their profiles keep changing.
– The user can switch from one group to another.
– The data load on the warehouse also changes with time.
Performance Assessment
• Here is a list of objective measures of performance −
– Average query response time
– Scan rates
– Time used per query
– Memory usage per process
– I/O throughput rates
• Following are the points to remember.
– It is necessary to specify the measures in service level agreement (SLA).
– It is of no use trying to tune response times if they are already better than those
required.
– It is essential to have realistic expectations while making performance
assessment.
– It is also essential that the users have feasible expectations.
– To hide the complexity of the system from the user, aggregations and views
should be used.
– It is also possible that the user can write a query you had not tuned for.
Data Load Tuning
There are various approaches of tuning data load that are
discussed below
– The very common approach is to insert data using the SQL
Layer
– The second approach is to bypass all these checks and
constraints and place the data directly into the
preformatted blocks.
– The third approach is that while loading the data into a
table that already contains data, we can maintain
indexes.
– The fourth approach says that to load the data in tables
that already contain data, drop the indexes & recreate
them when the data load is complete.
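As a small illustration of the fourth approach, here is a sketch using sqlite3 in Python (the table, index name and rows are illustrative assumptions): the index is dropped before the bulk load and rebuilt once afterwards, rather than being maintained row by row.

```python
# Load-tuning sketch: drop indexes, bulk-load, then recreate the index once.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_region ON sales(region)")

new_rows = [(i, "North" if i % 2 else "South", float(i)) for i in range(10000)]

conn.execute("DROP INDEX idx_region")                                # 1. drop index before the load
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)     # 2. bulk load the data
conn.execute("CREATE INDEX idx_region ON sales(region)")             # 3. recreate the index once
conn.commit()
```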
Integrity Checks
• Integrity checking highly affects the performance of the load.
Following are the points to remember −
• Integrity checks need to be limited because they require heavy
processing power.
• Integrity checks should be applied on the source system to
avoid degrading the performance of the data load.
Tuning Queries
• We have two kinds of queries in data warehouse −
– Fixed queries
– Ad hoc queries

• Fixed queries are well defined. Following are the
examples of fixed queries −
– Regular reports
– Canned queries
– Common aggregations
• Tuning the fixed queries in a data warehouse is the same as
in a relational database system.
Ad hoc Queries
• To understand ad hoc queries, it is important to know the ad hoc
users of the data warehouse.
• For each user or group of users, you need to know the following −
– The number of users in the group
– Whether they use ad hoc queries at regular intervals of time
– Whether they use ad hoc queries frequently
– Whether they use ad hoc queries occasionally at unknown intervals.
– The maximum size of query they tend to run
– The average size of query they tend to run
– Whether they require drill-down access to the base data
– The elapsed login time per day
– The peak time of daily usage
– The number of queries they run per peak hour
Points to Note
– It is important to track the user's profiles and identify the
queries that are run on a regular basis.
– It is also important that the tuning performed does not
adversely affect the performance.
– Identify similar and ad hoc queries that are frequently
run.
– If these queries are identified, then the database will
change and new indexes can be added for those queries.
– If these queries are identified, then new aggregations
can be created specifically for those queries that would
result in their efficient execution.
Data Warehousing - Testing
• Testing is very important for data warehouse
systems to make them work correctly and
efficiently.
• There are three basic levels of testing
performed on a data warehouse −
– Unit testing
– Integration testing
– System testing
Unit Testing
• In unit testing, each component is separately
tested.
• Each module, i.e., procedure, program, SQL
Script, Unix shell is tested.
• This test is performed by the developer.
Integration Testing
• In integration testing, the various modules of
the application are brought together and then
tested against a number of inputs.
• It is performed to test whether the various
components do well after integration.
System Testing
– In system testing, the whole data warehouse
application is tested together.
– The purpose of system testing is to check whether
the entire system works correctly together or not.
– System testing is performed by the testing team.
– Since the size of the whole data warehouse is very
large, it is usually possible to perform minimal
system testing before the test plan can be enacted.
Testing Backup Recovery
• Testing the backup recovery strategy is extremely
important. Here is the list of scenarios for which
this testing is needed −
– Media failure
– Loss or damage of table space or data file
– Loss or damage of redo log file
– Loss or damage of control file
– Instance failure
– Loss or damage of archive file
– Loss or damage of table
– Failure during data movement
Testing Operational Environment
• Security − A separate security document is required for security testing.
This document contains a list of disallowed operations and the tests
devised for each.
• Scheduler − Scheduling software is required to control the daily operations
of a data warehouse. It needs to be tested during system testing. The
scheduling software requires an interface with the data warehouse, which
will need the scheduler to control overnight processing and the
management of aggregations.
• Disk Configuration − Disk configuration also needs to be tested to identify
I/O bottlenecks. The test should be performed multiple times with
different settings.
• Management Tools. − It is required to test all the management tools during
system testing. Here is the list of tools that need to be tested.
– Event manager
– System manager
– Database manager
– Configuration manager
– Backup recovery manager
Testing the Database
• Testing the database manager and monitoring tools − To test
the database manager and the monitoring tools, they should be
used in the creation, running, and management of test database.
• Testing database features − Here is the list of features that we
have to test −
– Querying in parallel
– Create index in parallel
– Data load in parallel
• Testing database performance − Query execution plays a very
important role in data warehouse performance measures. There
are sets of fixed queries that need to be run regularly and they
should be tested.
Testing the Application

• All the managers should be integrated correctly and work in


order to ensure that the end-to-end load, index, aggregate and
queries work as per the expectations.

– Each function of each manager should work correctly


– It is also necessary to test the application over a period of time.
– Week-end and month-end tasks should also be tested.
– Overnight processing
– Query performance
Logistics of the Test
• The aim of system test is to test all of the following areas −

– Scheduling software
– Day-to-day operational procedures
– Backup recovery strategy
– Management and scheduling tools
– Overnight processing
– Query performance
Warehousing applications and
Recent Trends
Trends in Data Mining
• Data mining concepts are still evolving and here are the latest trends
that we get to see in this field −
– Application Exploration.
– Scalable and interactive data mining methods.
– Integration of data mining with database systems, data warehouse systems
and web database systems.
– Standardization of data mining query language.
– Visual data mining.
– New methods for mining complex types of data.
– Biological data mining.
– Data mining and software engineering.
– Web mining.
– Distributed data mining.
– Real time data mining.
– Multi database data mining.
– Privacy protection and information security in data mining.
Web Mining
• Web is a collection of inter-related files on one or more Web
servers.
• Web mining is the application of data mining techniques to
extract knowledge from Web data.
• Web data is :
– Web content – text, image, records, etc.
– Web structure – hyperlinks, tags, etc.
– Web usage – http logs, app server logs, etc.
Web Mining Taxonomy
Pre-processing Web Data
• Web Content:
– Extract “snippets” from a Web document that
represents the Web Document
• Web Structure
– Identifying interesting graph patterns or preprocessing
the whole web graph to come up with metrics such as
PageRank
• Web Usage
– User identification, session creation, robot detection and
filtering, and extracting usage path patterns
Web Content Mining
• Web Content Mining is the process of extracting useful
information from the contents of Web documents.
– Content data corresponds to the collection of facts a Web page was
designed to convey to the users.
– It may consist of text, images, audio, video, or structured records such
as lists and tables.
• Research activities in this field also involve using techniques
from other disciplines such as Information Retrieval (IR) and
natural language processing (NLP).
Pre-processing Content
• Preparation:
– Extract text from HTML.
– Perform Stemming.
– Remove Stop Words.
– Calculate Collection Wide Word Frequencies (DF).
– Calculate per Document Term Frequencies (TF).
• Vector Creation:
– Common Information Retrieval Technique.
– Each document (HTML page) is represented by a sparse vector of
term weights.
– TFIDF weighting is most common.
– Typically, additional weight is given to terms appearing as keywords
or in titles
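The preparation and vector-creation steps above can be sketched in a few lines of plain Python (toy documents, a naive tokenizer and stop-word list; stemming is omitted for brevity): stop words are removed, collection-wide DF and per-document TF are counted, and TF-IDF weights are produced.

```python
# Minimal TF-IDF sketch mirroring the preprocessing steps listed above.
import math
from collections import Counter

docs = ["data warehouse stores data", "web mining extracts knowledge from web data"]
stop_words = {"from", "the", "a"}

tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

df = Counter()                              # collection-wide document frequency (DF)
for terms in tokenized:
    df.update(set(terms))

n_docs = len(docs)
vectors = []
for terms in tokenized:
    tf = Counter(terms)                     # per-document term frequency (TF)
    vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})

print(vectors)                              # sparse vectors of TF-IDF term weights
```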
Common Mining Techniques
• The more basic and popular data mining
techniques include:
– Classification
– Clustering
– Associations
• The other significant ideas:
– Topic Identification, tracking and drift analysis
– Concept hierarchy creation
– Relevance of content.
Web Structure Mining
• The structure of a typical Web graph consists of Web pages as
nodes, and hyperlinks as edges connecting between two
related pages

• Web Structure Mining is the process of discovering
structure information from the Web
– This type of mining can be performed either at the (intra-page)
document level or at the (inter-page) hyperlink level
– The research at the hyperlink level is also called Hyperlink Analysis
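A classic metric computed by hyperlink-level analysis is PageRank; here is a tiny power-iteration sketch over a hand-made link graph (the pages and links are illustrative, and dangling pages are not handled).

```python
# Minimal PageRank sketch: repeatedly redistribute rank along hyperlinks.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}      # teleportation term
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += d * share                          # rank flows along each link
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))
```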
Motivation to study Hyperlink Structure

• Hyperlinks serve two main purposes.


– Pure Navigation.
– Point to pages with authority on the same topic as
the page containing the link.
• This can be used to retrieve useful information
from the web.
Web Structure Terminology
• Directed Path:
– A sequence of links, starting from p that can be followed to reach q.
• Shortest Path:
– Of all the paths between nodes p and q, which has the shortest length,
i.e. number of links on it.
• Diameter:
– The maximum of all the shortest paths between a pair of nodes p and
q, for all pairs of nodes p and q in the Web-graph.
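The terminology above can be illustrated with a short breadth-first-search sketch on a toy directed web graph (node names and links are made up): shortest path lengths are computed per pair, and the diameter is the largest of them (unreachable pairs are ignored here).

```python
# Shortest paths and diameter of a small directed web graph via BFS.
from collections import deque

def shortest_path_len(graph, start, goal):
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None                                   # no directed path from start to goal

graph = {"p": ["q", "r"], "q": ["s"], "r": ["s"], "s": []}
lengths = [shortest_path_len(graph, a, b)
           for a in graph for b in graph if a != b]
diameter = max(l for l in lengths if l is not None)
print(diameter)
```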
Web Usage Mining
Discovery of meaningful patterns from data generated by
client-server transactions on one or more Web localities

• Typical Sources of Data


– automatically generated data stored in server access logs,
referrer logs, agent logs, and client-side cookies
– user profiles
– meta data: page attributes, content attributes, usage data
Issues in Usage Data
• Session Identification
• Common Gateway Interface Data
• Caching
• Dynamic Pages
• Robot Detection and Filtering
• Transaction Identification
• Identify Unique Users
• Identify Unique User transaction
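To make the session-identification issue above concrete, here is a minimal sketch (the log layout and the 30-minute timeout are illustrative assumptions): requests from the same user separated by more than the timeout are split into separate sessions.

```python
# Sessionization sketch: group a user's page requests into timeout-bounded sessions.
from datetime import datetime, timedelta

log = [
    ("u1", "2024-01-01 10:00:00", "/home"),
    ("u1", "2024-01-01 10:10:00", "/products"),
    ("u1", "2024-01-01 11:30:00", "/home"),      # > 30 min gap, so a new session starts
    ("u2", "2024-01-01 10:05:00", "/home"),
]

timeout = timedelta(minutes=30)
sessions, last_seen = {}, {}
for user, ts, page in sorted(log):               # process each user's requests in time order
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    if user not in last_seen or t - last_seen[user] > timeout:
        sessions.setdefault(user, []).append([])  # open a new session for this user
    sessions[user][-1].append(page)
    last_seen[user] = t

print(sessions)
```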
Introduction to Spatial Data
Mining
What is spatial Data
• Spatial Data is also known as geospatial data or geographic
information.
• It is the data or information that identifies the geographic
location of features and boundaries on Earth, such as natural
or constructed features, oceans, and more.
• Spatial data is usually stored as coordinates and topology, and
is data that can be mapped.
• A common example of spatial data can be seen in a road map.
• A road map is a two-dimensional object that contains points,
lines, and polygons that can represent cities, roads, and
political boundaries such as states or provinces.
Non Spatial Data
• The non-spatial data are of number, character or
logical type.
• Here the data is typically non-spatial in nature, as it does not,
directly or indirectly, refer to any location.
• As another example, consider a table containing
population information for specific locations, say
cities, districts or provinces.
What is a Spatial Pattern ?
What is not a pattern?
• Random, haphazard, chance, stray, accidental, unexpected
• Without definite direction, trend, rule, method, design, aim,
purpose
• Accidental - without design, outside regular course of things
• Casual - absence of pre-arrangement, relatively
unimportant
• Fortuitous - What occurs without known cause
What is a Spatial Pattern ?
What is a Pattern?
• A frequent arrangement, configuration, composition,
regularity

• A rule, law, method, design, description

• A major direction, trend, prediction

• A significant surface irregularity or unevenness


What is Spatial Data Mining?
• Search for spatial patterns
• Non-trivial search - as “automated” as possible
—reduce human effort
• Interesting, useful and unexpected spatial
pattern
What is Spatial Data Mining?
Non-trivial Search
– Large (e.g. exponential) search space of plausible hypothesis
– Ex. Asiatic cholera : causes: water, food, air, insects, …; water
delivery mechanisms - numerous pumps, rivers, ponds, wells,
pipes, ...
Interesting
– Useful in certain application domain
– Ex. Shutting off identified Water pump => saved human life
Unexpected
– Pattern is not common knowledge
– May provide a new understanding of world
– Ex. Water pump - Cholera connection lead to the “germ” theory
What is NOT Spatial Data Mining?
Simple Querying of Spatial Data
– Find neighbors of Canada given names and boundaries of
all countries
– Find shortest path from Boston to Houston in a freeway
map (search space is not large, not exponential)
Testing a hypothesis via a primary data analysis
– Ex. Female chimpanzee territories are smaller than male
territories (search space is not large!)
– SDM: secondary data analysis to generate multiple
plausible hypotheses
What is NOT Spatial Data Mining?
Uninteresting or obvious patterns in spatial data
– Heavy rainfall in Minneapolis is correlated with heavy
rainfall in St. Paul, Given that the two cities are 10 miles
apart.
– Common knowledge: Nearby places have similar rainfall
Mining of non-spatial data
Diaper sales and beer sales are correlated in evenings
GPS product buyers are of 3 kinds:
• outdoors enthusiasts, farmers, technology enthusiasts
Why Learn about Spatial Data Mining?
Two basic reasons for new work
– Consideration of use in certain application domains
– Provide fundamental new understanding
Application domains
• Scale up secondary spatial (statistical) analysis to very large
datasets
– Describe/explain locations of human settlements in last 5000 years
– Find cancer clusters to locate hazardous environments
– Prepare land-use maps from satellite imagery
– Predict habitat suitable for endangered species
Find new spatial patterns
– Find groups of co-located geographic features
Why Learn about Spatial Data Mining?
New understanding of geographic processes for
Critical questions
– Ex. How is the health of planet Earth?
– Ex. Characterize effects of human activity on
environment and ecology
– Ex. Predict effect of El Nino on weather, and economy
Traditional approach: manually generate and test
hypothesis
• But, spatial data is growing too fast to analyze
manually
– Satellite imagery, GPS tracks, sensors on highways, …
Why Learn about Spatial Data Mining?
Number of possible geographic hypothesis too
large to explore manually
– Large number of geographic features and locations
– Number of interacting subsets of features grow
exponentially
– Ex. Find teleconnections between weather events across
ocean and land areas
SDM may reduce the set of plausible hypothesis
– Identify hypothesis supported by the data
– For further exploration using traditional statistical
methods
Spatial Data Mining
Spatial Patterns
– Hot spots, Clustering, trends, …
– Spatial outliers
– Location prediction
– Associations, co-locations
Primary Tasks
– Spatial Data Clustering Analysis
– Spatial Outlier Analysis
– Mining Spatial Association Rules
– Spatial Classification and Prediction
Example:
• Unusual warming of Pacific ocean (El Nino) affects
weather in USA…
Spatial Data Mining
• Spatial data mining follows along the same functions in
data mining, with the end objective to find patterns in
geography, meteorology, etc.
• The main difference: spatial autocorrelation
– the neighbors of a spatial object may have an influence on it
and therefore have to be considered as well
• Spatial attributes
– Topological
• adjacency or inclusion information
– Geometric
• position (longitude/latitude), area, perimeter, boundary polygon
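To illustrate the spatial-autocorrelation point above, here is a minimal sketch (coordinates, attribute values and the neighbourhood radius are illustrative): each object's value is compared with the average over its spatial neighbours, which is the basis of simple spatial-outlier detection.

```python
# Spatial-neighbour sketch: score each point by its deviation from nearby points.
points = {                      # id -> ((x, y) position, attribute value)
    "A": ((0.0, 0.0), 10.0),
    "B": ((1.0, 0.0), 11.0),
    "C": ((0.0, 1.0),  9.5),
    "D": ((0.5, 0.5), 30.0),    # unusually high compared with its neighbours
}

def neighbors(pid, radius=1.2):
    """Return ids of points within the given radius of point pid."""
    (x, y), _ = points[pid]
    return [q for q, ((qx, qy), _) in points.items()
            if q != pid and (x - qx) ** 2 + (y - qy) ** 2 <= radius ** 2]

for pid, (_, value) in points.items():
    nbrs = neighbors(pid)
    if nbrs:
        avg = sum(points[q][1] for q in nbrs) / len(nbrs)
        print(pid, "deviation from neighbours:", abs(value - avg))
```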
Temporal Data Mining
INTRODUCTION
Temporal Data Mining is a rapidly evolving area of
research that is at the intersection of several
disciplines including:
– Statistics (e.g., time series analysis)
– Temporal pattern recognition
– Temporal databases
– Optimisation
– Visualisation
– High-performance computing
– Parallel computing
DEFINITION OF TEMPORAL DATA MINING

• Temporal Data Mining is a single step in the process


of Knowledge Discovery in Temporal Databases.
• A Temporal Database stores tuples associated with time
attributes.
• It enumerates structures (temporal patterns or
models) over the temporal data.
• Any algorithm that enumerates temporal patterns
from, or fits models to, temporal data is a Temporal
Data Mining Algorithm.
Temporal Data Mining Tasks
Temporal data mining tasks include:
• Temporal data characterization and discrimination
• Temporal clustering analysis
• Temporal classification
• Temporal association rules
• Temporal pattern analysis
• Temporal prediction and trend analysis
Temporal Data Mining Tasks
A new temporal data model (supporting time
granularity and time-hierarchies) may need to
be developed based on:
– Temporal data structures
– Temporal semantics.
Temporal Data Mining Tasks
A new temporal data mining concept may need to be
developed based on the following issues:

– The task of temporal data mining can be seen as a problem


of extracting an interesting part of the logical theory of a
model
– The theory of a model may be formulated in a logical
formalism able to express q.
TEMPORAL DATA MINING TECHNIQUES

1. Classification in Temporal Data Mining


– The basic goal of temporal classification is to predict
temporally related fields in a temporal database based
on other fields.
– The problem in general is cast as determining the most
likely value of the temporal variable being predicted
given the other fields
– The inputs are the training data, in which the target variable is given for
each observation, and a set of assumptions representing
one’s prior knowledge of the problem.
– Temporal classification techniques are also related to the
difficult problem of density estimation.
TEMPORAL DATA MINING TECHNIQUES

2. Temporal Cluster Analysis:


– Temporal clustering according to similarity is a concept
which appears in many disciplines, so there are two basic
approaches to analyze it.
1. Measure of temporal similarity approach and
2. Temporal optimal partition approach
If the number of clusters is given, then clustering
techniques can be divided into three classes:
1. Metric-distance based technique
2. Model-based technique
3. Partition-based technique.
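A minimal sketch of a metric-distance-based step (the series and seed patterns are made up): each short time series is assigned to the nearest of two reference series using Euclidean distance, the simplest measure of temporal similarity.

```python
# Metric-distance-based temporal clustering sketch: nearest-seed assignment.
import math

series = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 2, 5],
    "s3": [10, 9, 8, 7],
    "s4": [9, 9, 8, 6],
}
seeds = {"rising": [1, 2, 3, 4], "falling": [10, 9, 8, 7]}   # reference patterns

def dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

clusters = {name: min(seeds, key=lambda s: dist(values, seeds[s]))
            for name, values in series.items()}
print(clusters)   # each series labelled with its nearest seed pattern
```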
