unit 3 dw
Contents:
Data Visualization and Overall Perspective: Aggregation, Historical information,
Query Facility, OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP,
Data Mining interface, Security, Backup and Recovery, Tuning Data Warehouse,
Testing Data Warehouse. Warehousing applications and Recent Trends: Types
of Warehousing Applications, Web Mining, Spatial Mining and Temporal Mining
Data Mining Interface, Security,
Backup and Recovery
A data warehouse is like a library that holds many books. Suppose some of those books are kept under lock so that not everyone can read them; warehouse security works the same way.
Data Warehousing - Security
Audit Requirements
• Auditing is a subset of security and a costly activity: it can
cause heavy overheads on the system.
• To complete an audit in time, we require more hardware and
therefore, it is recommended that wherever possible, auditing
should be switched off.
• Audit requirements can be categorized as follows −
– Connections: Who is logging into the system?
– Disconnections: Who is logging out of the system?
– Data access: Who is viewing or using the data?
– Data change: Who is making changes to the data?
In short, auditing helps track activities for security but should be used only when needed to avoid unnecessary costs and slowdowns.
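The four audit categories can be sketched as a small event log. This is only an illustration of the idea, not a real auditing facility: the names `AuditEvent` and `AuditLog`, the `enabled` flag, and the sample user and table names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class AuditEvent(Enum):
    """The four audit categories from the list above."""
    CONNECTION = "connection"        # who is logging in
    DISCONNECTION = "disconnection"  # who is logging out
    DATA_ACCESS = "data_access"      # who is viewing or using data
    DATA_CHANGE = "data_change"      # who is changing data


@dataclass
class AuditLog:
    """Hypothetical in-memory audit log that can be switched off."""
    enabled: bool = True
    records: list = field(default_factory=list)

    def record(self, user: str, event: AuditEvent, target: str = "") -> None:
        # When auditing is switched off, nothing is stored.
        if self.enabled:
            self.records.append((user, event, target))


log = AuditLog()
log.record("alice", AuditEvent.CONNECTION)
log.record("alice", AuditEvent.DATA_ACCESS, "sales_fact")
log.enabled = False  # auditing switched off
log.record("alice", AuditEvent.DATA_CHANGE, "sales_fact")  # not recorded
```

Only the first two events end up in `log.records`; the third is dropped because auditing was switched off before it occurred.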
Network Requirements
Network security ensures that data stays safe while it is being transferred over the network.
Are there rules to make sure data only travels through safe and trusted routes? This helps avoid sending data through
unsafe or risky paths where it could be stolen or tampered with.
Data Movement
(a simple data file like a .csv or .txt)
– Tape Technology: Using tape drives to store backups, which are slower but often used for long-term storage.
• The tape choice can be categorized as follows −
– Tape media: The actual tapes used for storage.
– Standalone tape drives: A single device that reads and writes data to tapes.
– Tape stackers: Devices that can hold multiple tapes to allow automatic loading and unloading.
– Tape silos: Large systems that store and manage many tapes, often used for very large backups.
Disk Backups involve storing backup data on hard drives or disks instead of using tapes.
Software Backups
• There are software tools available that help in the backup process.
These software tools come as a package.
• These tools not only take backup, they can effectively manage and
control the backup strategies.
• The criteria for choosing the best software package are listed below
– How scalable is the product as tape drives are added?
– Does the package have a client-server option, or must it run on the database
server itself? That is, does the software need to run on the main database server, or can it work remotely from a client machine?
– Will it work in cluster and MPP environments? That is, can the software work in systems where multiple servers or processing units are involved?
– What degree of parallelism is required?
– What platforms are supported by the package? Does the software support the platforms (operating systems)
you are using?
– Does the package support easy access to information about tape contents?
– Is the package database aware?
– What tape drive and tape media are supported by the package?
Data Warehousing - OLAP
Introduction
An Online Analytical Processing (OLAP) server is based on the
multidimensional data model. It helps analyze data stored in multiple dimensions, like looking at data from
different angles (e.g., by time, location, or product).
Relational OLAP
Hybrid OLAP
Speed of MOLAP: It also offers fast data analysis and computation, thanks to the multidimensional storage of MOLAP.
Aggregations in MOLAP: For faster calculations, the summarized (aggregated) data is stored in MOLAP format.
Slice
• The slice operation selects one particular dimension from a
given cube and provides a new sub-cube.
– Here Slice is performed for the dimension "time" using the criterion
time = "Q1".
– It will form a new sub-cube by selecting one or more dimensions.
Slice helps you focus on a specific part of the data by isolating one dimension at a time, like zooming in on data from a particular time period or category.
If you have a data cube with multiple dimensions (like time, location, and product), the slice operation selects a specific value from one of these dimensions.
If you perform a slice on the time dimension with the criterion time = "Q1" (Quarter 1), it will create a sub-cube that contains only the data for that specific time period (Q1), while the other dimensions (like location and product) remain the same.
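The slice with time = "Q1" can be sketched in a few lines of Python. This is a minimal illustration, assuming the cube is held as a dictionary keyed by (time, location, product); the cell values and the helper name `slice_cube` are made up for the example.

```python
# Hypothetical cube: (time, location, product) -> units sold.
cube = {
    ("Q1", "Delhi", "Mobile"): 605,
    ("Q1", "Delhi", "Modem"): 825,
    ("Q1", "Mumbai", "Mobile"): 400,
    ("Q2", "Delhi", "Mobile"): 680,
    ("Q2", "Mumbai", "Modem"): 952,
}


def slice_cube(cube, axis, value):
    """Fix one dimension (given by its position) to a single value.

    The fixed dimension is dropped from the key, so the result is a
    sub-cube over the remaining dimensions only.
    """
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


TIME = 0  # position of the time dimension in each key
q1 = slice_cube(cube, TIME, "Q1")
# q1 is keyed by (location, product) and holds only the Q1 cells.
```

The location and product dimensions are untouched: every (location, product) pair that had a Q1 cell in the original cube is still present in `q1`.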
Dice
• Dice selects two or more dimensions from a given cube and
provides a new sub-cube.
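A dice can be sketched the same way as a filter over two or more dimensions at once. This is an illustrative sketch only: the dictionary-based cube, its values, and the helper name `dice_cube` are assumptions made for the example.

```python
# Hypothetical cube: (time, location, product) -> units sold.
cube = {
    ("Q1", "Delhi", "Mobile"): 605,
    ("Q1", "Mumbai", "Mobile"): 400,
    ("Q2", "Delhi", "Modem"): 680,
    ("Q3", "Delhi", "Mobile"): 812,
}


def dice_cube(cube, criteria):
    """Keep the cells that satisfy the criteria on every listed dimension.

    `criteria` maps a dimension position to the set of allowed values.
    All dimensions stay in the key; only the cells are filtered.
    """
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[axis] in allowed for axis, allowed in criteria.items())
    }


TIME, LOCATION, PRODUCT = 0, 1, 2
# Dice on two dimensions: time in {Q1, Q2} and location = Delhi.
sub_cube = dice_cube(cube, {TIME: {"Q1", "Q2"}, LOCATION: {"Delhi"}})
```

Unlike the slice, the dice constrains several dimensions together, and each constraint may allow more than one value.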
Pivot
In simple terms, pivot helps you look at the same data from different angles by rotating the axes, making it easier to view and analyze the
data in alternative ways.
– Scheduling software
– Day-to-day operational procedures
– Backup recovery strategy
– Management and scheduling tools
– Overnight processing
– Query performance
Warehousing applications and
Recent Trends
Trends in Data Mining
• Data mining concepts are still evolving and here are the latest trends
that we get to see in this field −
– Application Exploration.
– Scalable and interactive data mining methods.
– Integration of data mining with database systems, data warehouse systems
and web database systems.
– Standardization of data mining query language.
– Visual data mining.
– New methods for mining complex types of data.
– Biological data mining.
– Data mining and software engineering.
– Web mining.
– Distributed data mining.
– Real time data mining.
– Multi database data mining.
– Privacy protection and information security in data mining.
Web Mining
• Web is a collection of inter-related files on one or more Web
servers.
• Web mining is the application of data mining techniques to
extract knowledge from Web data.
• Web data includes:
– Web content – text, image, records, etc.
– Web structure – hyperlinks, tags, etc.
– Web usage – http logs, app server logs, etc.
Web Mining Taxonomy
Pre-processing Web Data
• Web Content:
– Extract “snippets” from a Web document that
represent the Web document
• Web Structure
– Identifying interesting graph patterns or preprocessing
the whole web graph to come up with metrics such as
PageRank
• Web Usage
– User identification, session creation, robot detection and
filtering, and extracting usage path patterns
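The Web usage steps (user identification, robot filtering, session creation) can be sketched on a tiny parsed log. This is a rough sketch under stated assumptions: users are identified by the (IP, user agent) pair, robots are detected by a substring heuristic on the user agent, and a session ends after 30 minutes of inactivity. None of these choices come from the notes; the log entries are invented.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60                      # assumed inactivity gap, in seconds
ROBOT_MARKERS = ("bot", "crawler", "spider")   # assumed heuristic

# Hypothetical parsed log entries: (ip, user_agent, timestamp_seconds, url)
log = [
    ("10.0.0.1", "Mozilla/5.0", 0, "/home"),
    ("10.0.0.1", "Mozilla/5.0", 120, "/products"),
    ("10.0.0.9", "ExampleBot/1.0", 130, "/home"),
    ("10.0.0.1", "Mozilla/5.0", 4000, "/home"),
    ("10.0.0.2", "Mozilla/5.0", 4100, "/cart"),
]


def is_robot(user_agent):
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def build_sessions(entries):
    """Return a list of sessions; each session is the ordered list of URLs visited."""
    by_user = defaultdict(list)
    for ip, agent, ts, url in entries:
        if not is_robot(agent):                     # robot detection and filtering
            by_user[(ip, agent)].append((ts, url))  # user identification

    sessions = []
    for visits in by_user.values():
        visits.sort()
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append(current)            # gap too long: close the session
                current = []
            current.append(url)
            last_ts = ts
        sessions.append(current)
    return sessions


sessions = build_sessions(log)
```

Each resulting session is a usage path (an ordered list of URLs), which is the input that path-pattern extraction would work on.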
Web Content Mining
• Web Content Mining is the process of extracting useful
information from the contents of Web documents.
– Content data corresponds to the collection of facts a Web page was
designed to convey to the users.
– It may consist of text, images, audio, video, or structured records such
as lists and tables.
• Research activities in this field also involve using techniques
from other disciplines such as Information Retrieval (IR) and
natural language processing (NLP).
Pre-processing Content
• Preparation:
– Extract text from HTML.
– Perform Stemming.
– Remove Stop Words.
– Calculate Collection Wide Word Frequencies (DF).
– Calculate per Document Term Frequencies (TF).
• Vector Creation:
– Common Information Retrieval Technique.
– Each document (HTML page) is represented by a sparse vector of
term weights.
– TFIDF weighting is most common.
– Typically, additional weight is given to terms appearing as keywords
or in titles
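The preparation and vector-creation steps can be sketched in plain Python. This is a minimal sketch, assuming the HTML has already been stripped, using a tiny illustrative stop-word list, skipping stemming, and using raw term frequency times log(N / DF) as the TF-IDF weight (one common variant among several). The sample documents are invented.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # tiny illustrative list

# Hypothetical documents, already stripped of HTML.
docs = [
    "The data warehouse stores historical data",
    "Web mining extracts knowledge from web data",
    "The warehouse is tuned for query performance",
]


def tokenize(text):
    """Lowercase, split into words, and remove stop words (stemming is omitted)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


tokens = [tokenize(d) for d in docs]

# DF: number of documents in the collection that contain each term.
df = Counter(term for doc in tokens for term in set(doc))


def tfidf_vector(doc_tokens, df, n_docs):
    """Sparse vector {term: weight}, with weight = TF * log(N / DF)."""
    tf = Counter(doc_tokens)  # per-document term frequencies
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


vectors = [tfidf_vector(doc, df, len(docs)) for doc in tokens]
```

Each vector stores only the terms that occur in its document, which is what makes the representation sparse. Extra weight for title or keyword terms could be added by scaling the relevant entries, but that step is not shown here.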
Common Mining Techniques
• The more basic and popular data mining
techniques include:
– Classification
– Clustering
– Associations
• The other significant ideas:
– Topic Identification, tracking and drift analysis
– Concept hierarchy creation
– Relevance of content.
Web Structure Mining
• The structure of a typical Web graph consists of Web pages as
nodes, and hyperlinks as edges connecting two
related pages
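A metric such as PageRank can be computed on such a graph by power iteration. The sketch below is a simplified version, assuming a damping factor of 0.85, a fixed number of iterations, and a graph in which every page has at least one outgoing link (dangling pages are not handled). The four-page graph is invented for illustration.

```python
# Hypothetical Web graph: page (node) -> pages it links to (edges).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; assumes every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a base share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # Each page passes its damped rank equally to the pages it links to.
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank(links)
```

In this toy graph page C receives links from three pages and ends up with the highest rank, while page D, which no page links to, keeps only the base share.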