AD Theory 3
1. Top-down approach:
1. External Sources –
An external source is any source from which data is collected, irrespective of the type of
data. The data can be structured, semi-structured or unstructured.
2. Staging Area –
Since the data extracted from the external sources does not follow a particular format, it
needs to be validated before being loaded into the data warehouse. For this purpose, it is
recommended to use an ETL (Extract, Transform, Load) tool:
E (Extract): Data is extracted from the external sources.
T (Transform): Data is transformed into the standard format.
L (Load): The transformed data is loaded into the data warehouse.
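The staging step above can be sketched in a few lines of Python. This is a minimal, illustrative ETL sketch, not a real ETL tool: the record fields, the `clean_record` helper and the `warehouse` list are all invented names for illustration.

```python
# Minimal ETL sketch: records extracted from external sources arrive in
# mixed, messy formats; the staging step validates and standardizes them
# before they are loaded. All names here are illustrative.

raw_records = [                      # Extract: data pulled from external sources
    {"name": " Alice ", "amount": "100"},
    {"name": "bob", "amount": "250"},
    {"name": "", "amount": "x"},     # invalid row, should be rejected
]

def clean_record(rec):
    """Transform: trim, normalize case, cast types; reject bad rows."""
    name = rec["name"].strip().title()
    if not name or not rec["amount"].isdigit():
        return None
    return {"name": name, "amount": int(rec["amount"])}

warehouse = []                       # Load: standardized rows only
for rec in raw_records:
    cleaned = clean_record(rec)
    if cleaned is not None:
        warehouse.append(cleaned)

print(warehouse)
# [{'name': 'Alice', 'amount': 100}, {'name': 'Bob', 'amount': 250}]
```

The invalid third row is dropped during the transform step, which is the role the staging area plays before data reaches the warehouse.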
3. Data Warehouse –
After cleansing, the data is stored in the data warehouse as a central repository. The
warehouse actually stores the metadata, while the actual data is stored in the data marts.
Note that in this top-down approach the data warehouse stores the data in its purest form.
4. Data Marts –
A data mart is also part of the storage component. It stores the information of a particular
function of an organisation, handled by a single authority. There can be as many data
marts in an organisation as there are functions. We can also say that a data mart contains a
subset of the data stored in the data warehouse.
5. Data Mining –
Data mining is the practice of analysing the large volumes of data present in the data
warehouse. Data-mining algorithms are used to find the hidden patterns present in the
database or the data warehouse.
Inmon defines this approach as follows: the data warehouse is a central repository for the
complete organisation, and the data marts are created from it after the complete data
warehouse has been built.
Bottom-up approach:
1. First, the data is extracted from the external sources (as in the
top-down approach).
2. Then, the data goes through the staging area (as explained above) and is
loaded into data marts instead of the data warehouse. The data marts are
created first and provide reporting capability; each addresses a single
business area.
The operational database is the source of information for the data warehouse. It
includes the detailed information used to run the day-to-day operations of the business.
The data changes frequently as updates are made and reflects the current values of the
latest transactions.
Operational database management systems, also called OLTP (Online Transaction
Processing) databases, are used to manage dynamic data in real time.
Data warehouse systems serve users or knowledge workers for the purposes of data
analysis and decision-making. Such systems can organize and present information in
specific formats to accommodate the diverse needs of various users. These systems
are called Online Analytical Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the
goals of both these databases are different.
Operational (OLTP) systems                   Data warehouse (OLAP) systems
-------------------------------------------  -------------------------------------------
Data is mainly updated regularly             Data is non-volatile: new data may be
according to need.                           added regularly but, once added, is
                                             rarely changed.
Widely process-oriented.                     Widely subject-oriented.
Relational databases created for online      Designed for online analytical
transaction processing (OLTP).               processing (OLAP).
It is said to be a star schema because its physical model resembles a star shape, with a fact
table at its centre and the dimension tables at its periphery representing the star's
points. Below is an example to demonstrate the star schema:
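A minimal star schema can also be sketched in SQLite, with one central fact table referencing the dimension tables through foreign keys. All table names, columns and sample values here are invented for illustration.

```python
import sqlite3

# Star-schema sketch: fact_sales at the centre, dim_product and dim_date
# at the points. Names and sample data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date    VALUES (10, '2024-01-01'), (11, '2024-01-02');
INSERT INTO fact_sales  VALUES (1, 10, 3, 3000.0), (2, 10, 1, 200.0), (1, 11, 2, 2000.0);
""")

# A typical star query: one join from the fact to a dimension, then aggregate.
rows = con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchall()
print(dict(rows))   # {'Electronics': 5000.0, 'Furniture': 200.0}
```

Because every dimension is one join away from the fact table, star-schema queries stay simple: one join per dimension used.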
A factless fact table is a fact table that contains only foreign keys to dimension
tables and no measures (facts). Factless fact tables are used to establish
relationships between elements of different dimensions. They are also useful for
describing events and coverage; that is, such tables can record that something did,
or did not, happen. They often represent many-to-many relationships.
There are two types of factless fact tables:
1. Event Tracking Tables –
A factless fact table can be used to track events of interest to the organization. For
example, attendance at a cultural event can be tracked by creating a fact table that
contains the following foreign keys (i.e. links to dimension tables): event
identifier, speaker/entertainer identifier, participant identifier, event type and date.
This table can then be queried for information such as which cultural program or
program type is the most popular. The example below shows a factless fact
table that records each time a student attends a course; it can answer questions
such as which class has the maximum attendance, or what the average attendance
of a given course is. All such questions are based on COUNT(*) with GROUP BY, so
we can first count and then apply other aggregate functions such as Avg, Max and
Min.
(Figure: attendance fact table)
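The attendance example above can be sketched as follows. The table holds only foreign keys and no measures; the question "which course has the maximum attendance?" reduces to COUNT(*) with GROUP BY. Column names and sample rows are illustrative.

```python
import sqlite3

# Factless fact table sketch: fact_attendance has foreign keys only,
# no measure columns. Sample data is invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_attendance (
    student_id INTEGER,   -- FK to a student dimension
    course_id  INTEGER,   -- FK to a course dimension
    date_id    INTEGER    -- FK to a date dimension
);                        -- note: no measure columns at all
INSERT INTO fact_attendance VALUES
    (1, 101, 1), (2, 101, 1), (3, 101, 1),   -- three attendances in course 101
    (1, 102, 1), (2, 102, 2);                -- two attendances in course 102
""")

# Which course has the maximum attendance? Count rows per course.
top = con.execute("""
    SELECT course_id, COUNT(*) AS attendance
    FROM fact_attendance
    GROUP BY course_id
    ORDER BY attendance DESC
    LIMIT 1
""").fetchone()
print(top)   # (101, 3)
```

The event itself (a student attending a course) is the fact; its mere presence as a row is what gets counted.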
2. Coverage Tables –
The second type of factless fact table is called a coverage table by Ralph Kimball. It is
used to support negative-analysis reports. For example, to create a report showing that a
store did not sell a product for a certain period of time, you need a fact table that captures
all possible combinations; you can then find out what is missing.
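The "what did not sell" question can be sketched as follows: a coverage table lists every product a store could sell, and a LEFT JOIN against the actual sales fact exposes the missing combinations. All names and sample rows are invented for illustration.

```python
import sqlite3

# Coverage-table sketch: fact_coverage enumerates all possible
# store/product combinations; fact_sales records what actually sold.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_coverage (store_id INTEGER, product_id INTEGER);
CREATE TABLE fact_sales    (store_id INTEGER, product_id INTEGER, units INTEGER);
INSERT INTO fact_coverage VALUES (1, 10), (1, 20), (1, 30);  -- store 1 stocks 3 products
INSERT INTO fact_sales    VALUES (1, 10, 5), (1, 30, 2);     -- only two of them sold
""")

# Negative analysis: which stocked products have no matching sale?
unsold = con.execute("""
    SELECT c.product_id
    FROM fact_coverage c
    LEFT JOIN fact_sales s
           ON s.store_id = c.store_id AND s.product_id = c.product_id
    WHERE s.product_id IS NULL
""").fetchall()
print(unsold)   # [(20,)]
```

Without the coverage table there would be no row at all for product 20, so the absence could not be reported.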
Common examples of factless fact tables:
Visitors to the office.
Lists of people for web clicks.
Tracking student attendance or registration events.
Types of Keys:
1. Primary Key – Every row in a dimension table is identified by a
unique value, generally known as the primary key. The primary
key is a unique identifier that helps isolate each row in a dimension.
A table can have numerous records, but each record has only one
primary key.
2. Surrogate Key – These are keys generated by the system and they
generally do not have any built-in meaning. A surrogate key is UNIQUE
and SEQUENTIAL, since it is a consecutively created number for each
record in the table. It is MEANINGLESS, since it has no business
significance other than identifying each row. For example, if a
data warehouse contains information on 20,000 clients, the dimension
table will contain 20,000 surrogate keys, one for each client.
3. Foreign Key – In the fact table, the primary key of another dimension
table acts as a foreign key.
4. Alternate Key – It is also a unique value in the table and is generally
known as the secondary key of the table.
5. Composite Key – It consists of two or more attributes. For example, an
entity may have a clientID and an employeeCode together as its primary
key. Each of the attributes that make up the primary key is itself a
simple key, because each represents a unique reference when
identifying a client in one case and an employee in the other; together
they form a composite key.
6. Candidate Key – An entity type in a logical data model will have zero or
more candidate keys, also referred to simply as unique identifiers. For
instance, if we only deal with American citizens, then SSN is one
candidate key for the Person entity type, and the combination of name
and telephone number (assuming the combination is unique) is
potentially a second candidate key. Both of these keys are called
candidate keys because they are candidates to be chosen as the primary
key, an alternate key, or perhaps not a key at all within a physical data
model.
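The surrogate-key idea above can be sketched in a few lines: the warehouse assigns each new dimension member a meaningless sequential integer, independent of any natural (business) key. The `load_client` helper and the email-style natural keys are hypothetical.

```python
import itertools

# Surrogate-key sketch: a system-generated, sequential, meaningless
# integer per dimension row. Names are illustrative.
next_key = itertools.count(1)          # sequential counter, starts at 1
surrogate_of = {}                      # natural key -> surrogate key

def load_client(natural_key):
    """Assign a surrogate key the first time a client is seen."""
    if natural_key not in surrogate_of:
        surrogate_of[natural_key] = next(next_key)
    return surrogate_of[natural_key]

print(load_client("alice@example.com"))   # 1
print(load_client("bob@example.com"))     # 2
print(load_client("alice@example.com"))   # 1  (same client, same key)
```

The natural key (here, an email address) stays in the dimension row for lookups, while the fact table references only the surrogate key.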
The snowflake schema is a variant of the star schema. Here, the centralized fact
table is connected to multiple dimensions. In the snowflake schema, the dimensions
are present in normalized form in multiple related tables. The snowflake
structure materializes when the dimensions of a star schema are detailed and
highly structured, having several levels of relationship, and the child tables have
multiple parent tables. The snowflake effect affects only the dimension tables and
does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data
warehousing to represent data in a structured way that is optimized for querying
large amounts of data efficiently. In a snowflake schema, the dimension tables are
normalized into multiple related tables, creating a hierarchical or “snowflake”
structure.
In a snowflake schema, the fact table is still located at the center of the schema,
surrounded by the dimension tables. However, each dimension table is further
broken down into multiple related tables, creating a hierarchical structure that
resembles a snowflake.
For example, in a sales data warehouse, the product dimension table might be
normalized into multiple related tables, such as product category, product
subcategory and product details. Each of these tables would be related to the
product dimension table through a foreign-key relationship.
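The sales example above can be sketched as follows: the product dimension is normalized into a separate category table, so a query that touches the category name needs one more join than its star-schema equivalent. All names and sample values are illustrative.

```python
import sqlite3

# Snowflake sketch: dim_product no longer stores the category name;
# it references a normalized dim_category table instead.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);
INSERT INTO dim_category VALUES (1, 'Electronics');
INSERT INTO dim_product  VALUES (10, 'Laptop', 1), (11, 'Phone', 1);
INSERT INTO fact_sales   VALUES (10, 3000.0), (11, 800.0);
""")

# Two joins where a star schema would need only one:
rows = con.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON f.product_id  = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.category_name
""").fetchall()
print(rows)   # [('Electronics', 3800.0)]
```

The category name is now stored once rather than repeated per product, which is exactly the redundancy-versus-joins trade-off discussed below.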
The main difference between star schema and snowflake schema is that the dimension table of the
snowflake schema is maintained in the normalized form to reduce redundancy. The advantage here
is that such tables (normalized) are easy to maintain and save storage space. However, it also
means that more joins are needed to execute a query, which adversely impacts system
performance: the snowflake schema is more complex to query than a star schema, and the
extra table joins can result in slower query response times and higher resource usage in the
database. Additionally, the snowflake schema can be more difficult to understand and
maintain because of the increased complexity of the schema design.
The decision to use a snowflake schema versus a star schema in a data warehousing project will
depend on the specific requirements of the project and the trade-offs between query performance,
schema complexity, and data integrity.
Fact Constellation (Galaxy) Schema:
A fact constellation contains multiple fact tables that share dimension tables. The
dimension tables may consist of two or more sets of attributes that define information at
different grains, and the sets of attributes of the same dimension table may be populated
by different source systems.
Here, the pink-coloured dimension tables are the ones common to both star schemas,
and the green-coloured fact tables are the fact tables of their respective star schemas.
Example:
In the above demonstration:
Placement is a fact table having attributes: (Stud_roll, Company_id,
TPO_id) with facts: (Number of students eligible, Number of students
placed).
Workshop is a fact table having attributes: (Stud_roll, Institute_id,
TPO_id) with facts: (Number of students selected, Number of students
attended the workshop).
Company is a dimension table having attributes: (Company_id, Name,
Offer_package).
Student is a dimension table having attributes: (Student_roll, Name,
CGPA).
TPO is a dimension table having attributes: (TPO_id, Name, Age).
Training Institute is a dimension table having attributes: (Institute_id,
Name, Full_course_fee).
So, there are two fact tables, namely Placement and Workshop, which are part of
two different star schemas: one with fact table Placement and dimension tables
Company, Student and TPO, and another with fact table Workshop and dimension
tables Training Institute, Student and TPO. The two star schemas have two
dimension tables in common and hence form a fact constellation or galaxy schema.
Advantage: Provides a flexible schema.
Disadvantage: It is much more complex and hence, hard to implement and
maintain.
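The constellation described above can be sketched as follows. The Placement and Workshop fact tables share the Student and TPO dimensions, while Company and Training Institute each belong to one star only. Table and column names follow the example; the sample rows are invented for illustration.

```python
import sqlite3

# Fact-constellation sketch: two fact tables sharing two dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student   (stud_roll INTEGER PRIMARY KEY, name TEXT, cgpa REAL);
CREATE TABLE tpo       (tpo_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE company   (company_id INTEGER PRIMARY KEY, name TEXT, offer_package REAL);
CREATE TABLE institute (institute_id INTEGER PRIMARY KEY, name TEXT, full_course_fee REAL);

-- Two fact tables, each the centre of its own star:
CREATE TABLE placement (stud_roll INTEGER, company_id INTEGER, tpo_id INTEGER,
                        students_eligible INTEGER, students_placed INTEGER);
CREATE TABLE workshop  (stud_roll INTEGER, institute_id INTEGER, tpo_id INTEGER,
                        students_selected INTEGER, students_attended INTEGER);

INSERT INTO student   VALUES (1, 'Asha', 8.5);
INSERT INTO tpo       VALUES (7, 'Mr. Rao', 45);
INSERT INTO company   VALUES (3, 'Acme', 12.0);
INSERT INTO institute VALUES (5, 'TrainCo', 500.0);
INSERT INTO placement VALUES (1, 3, 7, 40, 25);
INSERT INTO workshop  VALUES (1, 5, 7, 30, 28);
""")

# Because Student is shared, the two facts can be analysed together:
row = con.execute("""
    SELECT s.name, p.students_placed, w.students_attended
    FROM placement p
    JOIN workshop  w ON w.stud_roll = p.stud_roll   -- shared dimension key
    JOIN student   s ON s.stud_roll = p.stud_roll
""").fetchone()
print(row)   # ('Asha', 25, 28)
```

Joining two fact tables through a shared dimension like this is what distinguishes a constellation from two independent star schemas, and it is also what makes the schema harder to maintain.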
What is a data lake?
A data lake is a central location that holds a large amount of data in its native,
raw format. Compared to a hierarchical data warehouse, which stores data in
files or folders, a data lake uses a flat architecture and object storage to store
the data. Object storage stores data with metadata tags and a unique
identifier, which makes it easier to locate and retrieve data across regions, and
improves performance. By leveraging inexpensive object storage and open
formats, data lakes enable many applications to take advantage of the data.
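The flat, object-storage model a data lake relies on can be sketched with a plain dictionary: every object is stored as raw bytes under a unique identifier, with metadata tags attached for lookup instead of a folder hierarchy. The `put_object` and `find_objects` helpers and all tag names are hypothetical, not any real object-store API.

```python
import uuid

# Data-lake sketch: flat object store keyed by unique IDs, with
# metadata tags for retrieval. All names are illustrative.
lake = {}   # object_id -> (raw bytes, metadata tags)

def put_object(data: bytes, **tags) -> str:
    """Store raw data as-is with metadata tags; return its unique ID."""
    object_id = str(uuid.uuid4())
    lake[object_id] = (data, tags)
    return object_id

def find_objects(**wanted):
    """Locate objects whose tags match, with no folder hierarchy."""
    return [oid for oid, (_, tags) in lake.items()
            if all(tags.get(k) == v for k, v in wanted.items())]

# Raw data of any format lands in its native form:
put_object(b'{"clicks": 42}', source="web", fmt="json", region="eu")
csv_id = put_object(b"id,amount\n1,100", source="pos", fmt="csv", region="us")

print(find_objects(region="us") == [csv_id])   # True
```

Note that the JSON and CSV payloads are stored untouched; only the tags are structured, which is what lets many downstream applications reuse the same raw data.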
Reliability issues
Without the proper tools in place, data lakes can suffer from data reliability issues
that make it difficult for data scientists and analysts to reason about the data. These
issues can stem from difficulty combining batch and streaming data, data
corruption and other factors.
Slow performance
As the size of the data in a data lake increases, the performance of traditional query
engines tends to degrade. Bottlenecks include metadata management, improper data
partitioning and other factors.