Unit I 1.1 Design Data Architecture and Manage The Data For Analysis

UNIT I
1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
Data architecture is composed of models, policies, rules or standards that govern which
data is collected, and how it is stored, arranged, integrated, and put to use in data systems
and in organizations. Data is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture.
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
datawarehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.
 Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules will help describe the
manner in which enterprise wishes to process their data.
• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (i.e. annual budgets, new productdevelopment).
The General Approach is based on designing the Architecture at three Levels of

Specification as shown below in figure 1.1
 The Logical Level
 The Physical Level
 The Implementation Level
Fig 1.1: Three levels architecture in data analytics.
The logical view/user's view, of a data analytics represents data in a format that is

meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. Logical level consists of data
requirements and process models which are processed using any data modelling techniques to
result in logical data model.
Physical level is created when we translate the top level design in physical tables in
the database. This model is created by the database architect, software architects, software
developers or database administrator. The input to this level from logical level and various
data modeling techniques are used here with input from software developers or database
administrator. These data modelling techniques are various formats of representation of data
such as relational data model, network model, hierarchical model, object oriented model,
Entity relationship model.
Implementation level contains details about modification and presentation of data through
the use of various data mining tools such as (R-studio, WEKA, Orange etc). Here each tool
has a specific feature how it works and different representation of viewing the same data.
These tools are very helpful to the user since it is user friendly and it does not require much
programming knowledge from the user.
Understand various sources of the Data
Data can be generated from two types of sources namely Primary and Secondary
Sources of Primary Data
The sources of generating primary data are -

 Observation Method
 Survey Method
 Experimental Method
Observation Method:
Fig 1.2: Data collections
An observation is a data collection method, by which you gather knowledge of the

researched phenomenon through making observations of the phenomena, as and when it
occurs. The main aim is to focus on observations of human behavior, the use of the
phenomenon and human interactions related to the phenomenon. We can also make
observations on verbal and nonverbal expressions. In making and documenting observations,
we need to clearly differentiate our own observations from the observations provided to us by
other people. The range of data storage genre found in Archives and Collections, is suitable
for documenting observations e.g. audio, visual, textual and digital including sub-genres
of note taking, audio recording and video recording.
There exist various observation practices, and our role as an observer may vary
according to the research approach. We make observations from either the outsider or insider
point of view in relation to the researched phenomenon and the observation technique can be
structured or unstructured. The degree of the outsider or insider points of view can be seen as
a movable point in a continuum between the extremes of outsider and insider. If you decide
to take the insider point of view, you will be a participant observer in situ and actively
participate in the observed situation or community. The activity of a Participant observer in
situ is called field work. This observation technique has traditionally belonged to the data
collection methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.
Experimental Designs
There are number of experimental designs that are used in carrying out and
experiment. However, Market researchers have used 4 experimental designs most frequently.
These are –
CRD - Completely Randomized Design
A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of receiving any
one treatment. For the CRD, any difference among experimental units receiving the same
treatment is considered as experimental error. Hence, CRD is appropriate only for
experiments with homogeneous experimental units, such as laboratory experiments, where
environmental effects are relatively easy to control. For field experiments, where there is
generally large variation among experimental plots in such environmental factors as soil, the
CRD is rarely used. CRD is mainly used in agricultural field.
Step 1. Determine the total number of experimental plots (n) as the product of the number of
treatments (t) and the number of replications (r); that is, n = rt. For our example, n = 5 x 4 =
20. Here, one pot with a single plant in it may be called a plot. In case the number of
replications is not the same for all the treatments, the total number of experimental pots is to
be obtained as the sum of the replications for each treatment. i.e.,
t
n= ∑ ri
i=1
where ri, is the number of times the ith treatment replicated

Step 2. Assign a plot number to each experimental plot in any convenient manner; for
example, consecutively from 1 to n.
Step 3. Assign the treatments to the experimental plots randomly using a table of random
numbers.
Example 1: Assume that a farmer wishes to perform the experiment to determine which of
his 3 fertilizers to use on 2800 tress. Assuming that farmer has a farm divided in to 3 terraces,
where those 2800 trees can be divided in the below format
Lower Terrace 1200

Middle Terrace 1000
Upper Terrace 600
Design a CRD for this experiment
Solution
Scenario 1
First we divide the 2800 trees in to random assignment of almost 3 equal parts
Random Assignment1: 933 trees
So for example random assignment1 we can assign fertilizer1, random assignment2 we can
assign fertilizer2, random assignment3 we can assign fertilizer3.
Scenario 2
2800 trees is divided into terrace as shown below
Total no of Terrace Random Fertilizer

trees assignment usage
Upper 200 fertilizer1
Terrace 200 fertilizer2
(600 trees) 200 fertilizer3
Middle 400 fertilizer1
2800 Trees Terrace 400 fertilizer2
Lower 333 fertilizer1
Terrace 333 fertilizer2
Thus the farmer will be able analyze and compare various fertilizer performance on different
terrace.
Example 2:
A company wishes to test 4 different types of tyre. The tyres lifetime as determined
from their threads are given. Where each tyre has been tried on 6 similar automobiles
assigned at random to their tyres. Determine whether there is a significant difference between
tyres at .05 level.
Tyre Automobile Automobil Automobile Automobile Automobile Automobile

s 1 e2 3 4 5 6
A 33 38 36 40 31 35
B 32 30 28 32 30 34
C 31 37 35 33 34 30
D 29 34 32 30 33 31
Solution:
Null Hypothesis: There is no difference between the tyres in their life time.
We choose a random value closest to the average of all values in the table and subtract that
for each tyre in the automobile, for example by choosing 35
Tyre Automobil Automobile Automobil Automobil Automobile Automobil Total

s e1 2 e3 e4 5 e6
A -2 3 1 5 -4 0 3
B -3 -5 -7 3 -5 -1 -24
C -4 2 0 -2 -1 -5 -10
D -6 -1 -3 -5 -2 -4 -21
T = Sum(X) = -52
T∗T
Correction factor = = 112.67
N
Square the values to find
Tyre Automobil Automobile Automobil Automobil Automobile Automobil Total

s e1 2 e3 e4 5 e6
A 4 9 1 25 16 0 0
B 9 25 9 49 25 1 0
C 16 4 0 4 1 25 0
D 36 1 9 25 4 16 0
T = Sum(X2) = 0
Total sum of squares (SST) = sum(X2) – Correlation factor

= 314 – 112.67
= 201.33
Sum of Squares between Treatments (SSTr) =
= ((3)2/6 +(24)2/6 +(10)2/6 +(21)2/6) – Correlation factor
= 75
Sum of Squares Error (SSE) SST – SSTr = 201.33 – 75 = 126.33

Now by using ANOVA (one way classification) Table, We calculate the F- Ratio.
F-Ratio:
The F ratio is the ratio of two mean square values. If the null hypothesis is true, you
expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation
among group means is more than you'd expect to see by chance
If the value of F-Ratio is closer to 1 it means that null hypothesis is true. If F-ratio is
greater than then we assume that the null hypothesis is false.
Source of Sum of Degrees of Mean of sum of F - Ratio

variation squares freedom squares
Between SSTr = 75 No of treatment – MSTr = SSTr/
treatments 1 Degrees of
= 4-1 =3 Freedom = 75/3
= 25
F-ratio =
MSTr/MSE =
25/6.3165 = 3.95
Within SSE = 126.33 No of values – no MSE =
treatments of treatment = 24 SSE/Degrees of
– 4 =20 Freedom
=126.33/20
=6.31
In this scenario the value of F-ratio is greater than 1.

Level of significance = 0.05 (given in question)
Degrees of Freedom = (3, 20)
Critical value = 3.10 (calculated from 5 percentage table)
F-Ratio > critical value (i.e) 3.95 > 3.10
Hence assumed null hypothesis is false. Which indicates there is life time difference between
tyres.
A randomized block design, the experimenter divides subjects into subgroups called
blocks, such that the variability within blocks is less than the variability between blocks.
Then, subjects within each block are randomly assigned to treatment conditions. Compared to
a completely randomized design, this design reduces variability within treatment conditions
and potential confounding, producing a better estimate of treatment effects.
The table below shows a randomized block design for a hypothetical medical experiment.
Gende Treatment
r Placebo Vaccine
Male 250 250
Female 250 250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are
randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250
men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women
get the vaccine.
It is known that men and women are physiologically different and react differently to
medication. This design ensures that each treatment condition has an equal proportion of men
and women. As a result, differences between treatment conditions cannot be attributed to
gender. This randomized block design removes gender as a potential source of variability and
as a potential confounding variable.
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a
balanced two-way classification scheme say for example - 4 X 4 arrangement. In this scheme
each letter from A to D occurs only once in each row and also only once in each column. The
balance arrangement, it may be noted that, will not get disturbed if any row gets changed with
the other.
A B C D
B C D A
C D A B
D A B C
The balance arrangement achieved in a Latin Square is its main strength. In this design, the
comparisons among treatments, will be free from both differences between rows and
columns. Thus the magnitude of error will be smaller than any other design.
FD - Factorial Designs - This design allows the experimenter to test two or more variables
simultaneously. It also measures interaction effects of the variables and analyzes the impacts
of each of the variables.
In a true experiment, randomization is essential so that the experimenter can infer cause and
effect without any bias.
1.2 Various Sources of Data
Sources of Secondary Data

While primary data can be collected through questionnaires, depth interview, focus
group interviews, case studies, experimentation and observation; The secondary data can be
obtained through
 Internal Sources - These are within theorganization

 External Sources - These are outside theorganization
 Internal Sources ofData
If available, internal secondary data may be obtained with less time, effort and money
than the external secondary data. In addition, they may also be more pertinent to the situation
at hand since they are from within the organization. The internal sources include
Accounting resources- This gives so much information which can be used by the marketing
researcher. They give information about internal factors.
Sales Force Report- It gives information about the sale of a product. The information
provided is of outside theorganization.
Internal Experts- These are people who are heading the various departments. They can give
an idea of how a particular thing isworking
Miscellaneous Reports- These are what information you are getting from operational reports.
If the data available within the organization are unsuitable or inadequate, the marketer should
extend the search to external secondary data sources.
External Sources of Data

External Sources are sources which are outside the company in a larger environment.
Collection of external data is more difficult because the data have much greater variety and
the sources are much more numerous.
External data can be divided into following classes.
Government Publications- Government sources provide an extremely rich pool of data for
the researchers. In addition, many of these data are available free of cost on internet websites.
There are number of government agencies generating data. These are:
Registrar General of India- It is an office which generates demographic data. It includes

details of gender, age, occupation etc.
Central Statistical Organization- This organization publishes the national accounts

statistics. It contains estimates of national income for several years, growth rate, and rate of
major economic activities. Annual survey of Industries is also published by the CSO. It gives
information about the total number of workers employed, production units, material used and
value added by themanufacturer.
Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided region-
wise and country-wise.
Ministry of Commerce and Industries- This ministry through the office of economic
advisor provides information on wholesale price index. These indices may be related to a
number of sectors like food, fuel, power, food grains etc. It also generates All India
Consumer Price Index numbers for industrial workers, urban, non-manual employees and
cultural labourers.
Planning Commission- It provides the basic statistics of Indian Economy.
Reserve Bank of India- This provides information on Banking Savings and investment. RBI
also prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled, white collared jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
Department of Economic Affairs- It conducts economic survey and it also generates

information on income, consumption, expenditure, investment, savings and foreign trade.
State Statistical Abstract- This gives information on various types of activities related to the
state like - commercial activities, education, occupation etc.
Non-Government Publications- These includes publications of various industrial and trade

associations, such as
The Indian Cotton Mill

Association Various
chambers of commerce
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media. Export Promotion Council.
Confederation of Indian Industries (CII)
Small Industries Development Board of India
Different Mills like - Woolen mills, Textile mills etc

The only disadvantage of the above sources is that the data may be biased. They are likely to
colour their negative points.
Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the information
suits the subscriber. These services are useful in television viewing, movement of consumer
goods etc. These syndicate services provide information data from both household as well as
institution.
In collecting data from household they use three approaches Survey- They conduct surveys
regarding - lifestyle, sociographic, general topics. Mail Diary Panel- It may be related to 2
fields - Purchase and Media.
Electronic Scanner Services- These are used to generate data on volume.

They collect data for Institutions from Whole sellers, Retailers, and Industrial Firms
Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are
changing and we need more of specific decision-making in the light of changing
environment. Also Syndicate services are able to provide information to the industries at a
low unit cost.
Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide
customized services which suits the requirement of each individual organization.
International Organization- These includes
The International Labour Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages and consumer prices
The Organization for Economic Co-operation and development (OECD) - It publishes data
on foreign trade, industry, food, transport, and science andtechnology.
The International Monetary Fund (IMA) - It publishes reports on national and

international foreign exchange regulations.
1.2.1 Comparison of sources of data
Based on various features (cost, data, process, source time etc.) various sources of
data can be compared as per table 1.
Table 1: Difference between primary data and secondary data.
Comparison Feature Primary data Secondary data

Meaning Data that is collected by a Data that is collected by
researcher. other people.
Data Real time data Past data.
Process Very involved Quick and easy
Source Surveys, interviews, or Books, journals, publications
experiments, questionnaire, etc..
interview etc..
Cost effectiveness Expensive Economical
Collection time Long Short
Specific Specific to researcher need May not be specific to
researcher need
Available Crude form Refined form
Accuracy and reliability More Less
1.3 Understanding Sources of Data from Sensor
Sensor data is the output of a device that detects and responds to some type of input
from the physical environment. The output may be used to provide information or input to
another system or to guide a process. Examples are as follows
 A photosensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
 Lidar, a laser-based method of detection, range finding and mapping, typically uses a
low-power, eye-safe pulsing laser working in conjunction with a camera.
 A charge-coupled device (CCD) stores and displays the data for an image in such a way
that each pixel is converted into an electrical charge, the intensity of which is related to a
color in the color spectrum.
 Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms.
 Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, powerline voltage, chemical
concentrations, pollutant levels and vital body functions.
1.4 Understanding Sources of Data from Signal
The simplest form of signal is a direct current (DC) that is switched on and off; this is
the principle by which the early telegraph worked. More complex signals consist of an
alternating-current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a
network. Data and signals can be either analog or digital. A signal is periodic if it consists
of a continuously repeating pattern.
1.5 Understanding Sources of Data from GPS
The Global Positioning System (GPS) is a space based navigation system that
provides location and time information in all weather conditions, anywhere on or near the
Earth where there is an unobstructed line of sight to four or more GPS satellites. The system
provides critical capabilities to military, civil, and commercial users around the world. The
United States government created the system, maintains it, and makes it freely accessible to
anyone with a GPS receiver.
1.6 Data Management
Data management is the development and execution of architectures, policies,

practices and procedures in order to manage the information lifecycle needs of an enterprise
in an effective manner.
1.7 Data Quality

Data quality refers to the quality of data. Data quality refers to the state of qualitative
or quantitative pieces of information. There are many definitions of data quality but data is
generally considered high quality if it is "fit for [its] intended uses in operations, decision
making and planning
The seven characteristics that define data quality are:

1. Accuracy and Precision
2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot
have any erroneous elements and must convey the correct message without being misleading.
This accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be off-
target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have
more serious consequences) and, therefore, justifiably worth higher levels of investment.

Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and nationality
are typically limited to a set of options and open answers are not permitted. Any answers
other than these would not be considered valid or legitimate based on the survey’s
requirement. This is the case for most data and must be carefully considered when
determining its quality. The people in each department in an organization understand what
data is valid or not to them, so the requirements must be leveraged when evaluating data
quality.

Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the
effort required, which also means it has to be collected at the right moment in time. Data
collected too soon or too late could misrepresent a situation and drive inaccurate
decisions.

Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate
data. Gaps in data collection lead to a partial view of the overall picture to be displayed.
Without a complete picture of how operations are running, uninformed actions will occur. It’s
important to understand the complete set of requirements that constitute a comprehensive set
of data to determine whether or not the requirements are being fulfilled.

Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.

Granularity and Uniqueness: The level of detail at which data is collected is important,
because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized
and manipulated collections of data could offer a different meaning than the data
implied at a lower level. An appropriate level of granularity must be defined to provide
sufficient uniqueness and distinctive properties to become visible. This is a requirement for
operations to function effectively.
Noisy data is meaningless data. The term has often been used as a synonym for
corrupt data. However, its meaning has expanded to include any data that cannot be
understood and interpreted correctly by machines, such as unstructured text.
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population. In a sense, this definition leaves it up to the analyst (or a
consensus process) to decide what will be considered abnormal.
Example of outlier test
2,4,4,5,7,9,11,11,13,14,41
(i)Arrange them in ascending order
(ii)Divide them into equi depths
2,4,4,5,7,|9|,11,11,13,14,41
MIN=2
MAX=41
Q1=4
Q3=13
Q3-Q1=13-4=9*1.5=13.5
Q1-13.5=4-13.5=-9.5
Q3+13.5=13+13.5=26.5
In statistics, missing data, or missing values, occur when no data value is stored for the
variable in an observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data. missing values can be replaced by
following techniques:
 Ignore the record with missing values.

 Replace the missing term with constant.
 Fill the missing value manually based on domain knowledge.
 Replace them with mean (if data is numeric) or frequent value (if data is
categorical)
 Use of modelling techniques such decision trees, baye`s algorithm, nearest
neighbor algorithm Etc.
Examples of imputations are listed below
Partial imputation
The expectation-maximization algorithm is an approach in which values of the statistics which
would be computed if a complete dataset were available are estimated (imputed), taking into
account the pattern of missing data. In this approach, values for individual missing data-items are
not usually imputed.
Partial deletion
Methods which involve reducing the data available to a dataset having no missing values include:
Listwise deletion/casewise deletion

Pairwise deletion
Full analysis
Methods which take full account of all information available, without the distortion resulting
from using imputed values as if they were actually observed:
The expectation-maximization algorithm
full information maximum likelihood estimation
Interpolation
In the mathematical field of numerical analysis, interpolation is a method of constructing new
data points within the range of a discrete set of known data points.
In computing, data deduplication is a specialized data compression technique for

eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are
intelligent (data) compression and single instance (data) storage.
1.8 Data Preprocessing
Data preprocessing is a data mining technique that involves transforming

raw data into an understandable format. Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues.
Data goes through a series of steps during preprocessing:
 Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Binning method is used to smoothing data or to handle noisy data. In this method,

the data is first sorted and then the sorted values are distributed into a number of
buckets or bins. As binning methods consult the neighborhood of values, they perform
local smoothing
Clustering: Detect and remove outliers
Combined computer and human inspection:
Detect suspicious value and check by human.
Regression:Smooth by fitting the data into regression functions.
Data Integration: Data with different representations are put together and conflicts within
the data are resolved. The sources may include database, files and data warehouse
One of the most well-known implementation of data integration is building an enterprise data
warehouse. The benefit of data warehouse enables a business to perform analyses based on
the data in the data warehouse.
There are mainly to two approaches for data integration
Tight Coupling: In tight coupling data is combined from different sources into a single
physical location through the process of Extraction, Transformation, Loading.
Loose Coupling :In loosely Coupling data only remains in the actual source databases. In this
approach an interface is provided that takes query from the user and transformsit in a waythe
source database can understand and then sends the query directly to the source databases to
obtain the results.
Data Transformation: Data is normalized, aggregated and generalized. In data

transformation, the data are transformed or consolidated into forms appropriate for mining
Data transformation can involve the following:
1.Smoothing:
It is a process that is used to remove noise from the dataset using some algorithms It allows
for highlighting important features present in the dataset. It helps in predicting the patterns.
When collecting data, it can be manipulated to eliminate or reduce any variance or any other
noise form.
The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for finding patterns that they
wouldn’t see otherwise.
Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used. Gathering accurate
data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or
business strategy of the product, pricing, operations, and marketing strategies.
Generalization:
It converts low-level data attributes to high-level data attributes using concept hierarchy. For
Example Age initially in Numerical form (22, 25) is converted into categorical value (young,
old).
For example, Categorical attributes, such as house addresses, may be generalized to higher-
level definitions, such as town or country.
Discretization:
It is a process of transforming continuous data into set of small intervals. Most Data Mining
activities in the real world require continuous attributes. Yet many of the existing data mining
frameworks are unable to handle these attributes.
Also, even if a data mining task can manage a continuous attribute, it can significantly
improve its efficiency by replacing a constant quality attribute with its discrete values.
For example, (1-10, 11-20) (age:- young, middle age, senior).
Normalization: Data normalization involves converting all data variable into a given range.
Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that: min_A is the minima and max_A is the maxima of an attribute, P
We Have the Formula:
.
Where v is the value you want to plot in the new range.
v’ is the new value you get after normalizing the old value.
Solved example:
Suppose the minimum and maximum value for an attribute profit(P) are Rs. 10, 000 and Rs.
100, 000. We want to plot the profit in the range [0, 1]. Using min-max normalization the
value of Rs. 20, 000 for attribute profit can be plotted to:
And hence, we get the value of v’ as 0.11
Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.It reduces the data size by collecting and then replacing the low-level concepts
(such as 43 for age) to high-level concepts (categorical variables such as middle age or
Senior).
For numeric data following techniques can be followed:
Data Reduction : Sampling

Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time
consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive
or time consuming.
Data Reduction : Types of Sampling
Simple Random Sampling :There is an equal probability of selecting any particular item
Sampling without replacement :As each item is selected, it is removed from the population
Sampling with replacement:Objects are not removed from the population as they are selected for the
sample.In sampling with replacement, the same object can be picked up more than once
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Data Discretization: Involves the reduction of a number of values of a continuous attribute
by dividing the range of attribute intervals.
Data Reduction Feature Subset Selection
Another way to reduce dimensionality of data
Remove duplication or all of the information contained in one or more other attributes
Example: purchase price of a product and the amount of sales tax paid.
Irrelevant features:contain no information that is useful for the data mining task at hand  Example:
students' ID is often irrelevant to the task of predicting students' GPA
What are hazards?

In relation to workplace safety and health, hazard can be defined as any source of potential harm
or danger to someone or any adverse health effect produced under certain condition.
A hazard can harm an individual or an organization. For example, hazard to an organization
include loss of property or equipment while hazard to an individual involve harm to health or
body.
A variety of sources can be potential source of hazard at workplace. These hazards include
practices or substances that may cause harm. Here are a few examples of potential hazards:
Material: Knife or sharp edged nails can cause cuts.

o Substance: Chemicals such as Benzene can cause fume suffocation.
Inflammable substances like petrol can cause fire.
Electrical energy: Naked wires or electrodes can result in electric shocks.
Condition: Wet floor can cause slippage. Working conditions in mines can cause health hazards.
o Gravitational energy: Objects falling on you can cause injury.
o Rotating or moving objects: Clothes entangled into ratting objects can cause serious harm.
Potential Sources of Hazards in an Organization

Here are some potential sources of hazards in an organization:
Using computers: Hazards include poor sitting postures or excessive duration of sitting in one
position. These hazards may result in pain and strain. Making same movement repetitively can
also cause muscle fatigue In addition, glare from the computer screen can be harmful to eyes.
Stretching up at regular intervals or doing some simple yoga in your seat only can mitigate such
hazards.
Handling office equipment: Improper handling of office equipment can result in injuries. For
example, sharp-edged equipment if not handled properly can cause cuts. Staff members should be
trained to handle equipment properly. Relevant manual should be made available by
administration on handling equipment.
Handling objects: Lifting or moving heavy items without proper procedure or techniques can be
a source of potential hazard. Always follow approved procedure and proper posture for lifting or
moving objects.
Stress at work: In today’s organization, you may encounter various stress causing hazards. Long
working hours can be stressful and so can be aggressive conflicts or arguments with colleagues.
Always look for ways for conflict resolution with colleagues. Have some relaxing hobbies for
stress against long working hours.
Working environment: Potential hazards may include poor ventilation, inappropriate height
chairs and tables, stiffness of furniture, poor lighting, staff unaware of emergency procedures, or
poor housekeeping. Hazards may also include physical or emotional intimidation, such as
bullying or ganging up against someone. Staff should be made aware of organization’s policies to
Key Points
fight against all the given hazards related to working environment.
General Evacuation Procedures
Each organization will has its own evacuation procedures as listed in its policies. An alert
employee, who is well-informed about evacuation procedures, can not only save him or herself,
but also helps others in case of emergencies. Therefore, you should be aware of these procedures
and follow them properly during an emergency evacuation. Read your organization’s policies to
know about the procedures endorsed by it. In addition, here are a few general evacuation steps that
will always be useful in such situations:
o Leave the premises immediately and start moving towards the
nearest emergency exit.
o Guide your customers to the emergency exits.
o If possible, assist any person with disability to move towards the emergency exit. However, do
not try to carry anyone unless you are trained to do so.
o Keep yourself light when evacuating the premises. You may carry your hand-held belongings,
such as bags or briefcase as you move towards the emergency exit. However, do not come back
into the building to pick up your belongings unless the area is declared safe.
o Do not use the escalators or elevators (lifts) to avoid overcrowding and getting trapped, in case
there is a power failure. Use the stairs instead.
o Go to the emergency assembly area. Check if any of your colleagues are missing and
immediately inform the personnel in charge of emergency evacuation or your supervisor.
o Do not go back to the building you have evacuated till you are informed by authorized
personnel that it is safe to go inside
Some of the common safety signs are given below.

Note down the labels for each sign.
Discuss and check the participants understanding of the various safety signs given in the picture
above.
Review: Safety Guidelines Checklist
1. Store all cleaning chemicals in tightly closed containers in separate cupboards.
2. Keep the kitchen clean and dry all the time.
3. Throw away rubbish daily.
4. Make sure all areas have proper lighting.
5. In case of any injury or fracture, do not move the person until he or she has received medical
attention.
6. Do not wear loose clothing or jewelry when working with machines. It may catch on moving
equipment and cause a serious injury.
7. Never distract the attention of people who are working near fire or with some machinery, tools or
equipment.
8. Where required, wear protective items, such as goggles, safety glasses, masks, gloves, hair nets,
etc.
9. Shut down all machines before leaving for the day.
10. Do not play with electrical controls or switches.
11. Do not operate machines or equipment until you have been properly trained and allowed to do so
by your supervisor.
12. Do not adjust, clean or oil moving machinery.
13. Stack all shelves in an orderly way.
14. Stack all boxes and crates properly.
15. Never leave dishrags, aprons and other clothing near any hot surface.
16. Repair torn wires or broken plugs before using any electrical equipment.
17. Do not use equipment if it smokes, sparks or looks unsafe.
18. Cover all food with a lid, plastic wrap or aluminium foil.
19. Do not smoke in “No Smoking” areas.
20. Report any unsafe condition or acts to your supervisor. These could include:

Unit I 1.1 Design Data Architecture and Manage The Data For Analysis

Uploaded by

Copyright:

Available Formats

Unit I 1.1 Design Data Architecture and Manage The Data For Analysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit I 1.1 Design Data Architecture and Manage The Data For Analysis

Uploaded by

Copyright:

Available Formats

UNIT I

• Data processing needs

The General Approach is based on designing the Architecture at three Levels of

Fig 1.1: Three levels architecture in data analytics.

The logical view/user's view, of a data analytics represents data in a format that is

The sources of generating primary data are -

Fig 1.2: Data collections

An observation is a data collection method, by which you gather knowledge of the

CRD - Completely Randomized Design

where ri, is the number of times the ith treatment replicated

Lower Terrace 1200

Design a CRD for this experiment

2800 trees is divided into terrace as shown below

Total no of Terrace Random Fertilizer

Tyre Automobile Automobil Automobile Automobile Automobile Automobile

Tyre Automobil Automobile Automobil Automobil Automobile Automobil Total

Tyre Automobil Automobile Automobil Automobil Automobile Automobil Total

Total sum of squares (SST) = sum(X2) – Correlation factor

Sum of Squares Error (SSE) SST – SSTr = 201.33 – 75 = 126.33

Source of Sum of Degrees of Mean of sum of F - Ratio

In this scenario the value of F-ratio is greater than 1.

1.2 Various Sources of Data

Sources of Secondary Data

 Internal Sources - These are within theorganization

External Sources of Data

External data can be divided into following classes.

Registrar General of India- It is an office which generates demographic data. It includes

Central Statistical Organization- This organization publishes the national accounts

Planning Commission- It provides the basic statistics of Indian Economy.

Department of Economic Affairs- It conducts economic survey and it also generates

Non-Government Publications- These includes publications of various industrial and trade

The Indian Cotton Mill

Different Mills like - Woolen mills, Textile mills etc

Electronic Scanner Services- These are used to generate data on volume.

The International Monetary Fund (IMA) - It publishes reports on national and

1.2.1 Comparison of sources of data

Comparison Feature Primary data Secondary data

1.3 Understanding Sources of Data from Sensor

1.4 Understanding Sources of Data from Signal

1.5 Understanding Sources of Data from GPS

1.6 Data Management

Data management is the development and execution of architectures, policies,

1.7 Data Quality

The seven characteristics that define data quality are:

 Ignore the record with missing values.

Listwise deletion/casewise deletion

In computing, data deduplication is a specialized data compression technique for

1.8 Data Preprocessing

Data preprocessing is a data mining technique that involves transforming

Data goes through a series of steps during preprocessing:

Binning method is used to smoothing data or to handle noisy data. In this method,

Data Transformation: Data is normalized, aggregated and generalized. In data

Data transformation can involve the following:

And hence, we get the value of v’ as 0.11

For numeric data following techniques can be followed:

Data Reduction : Sampling

What are hazards?

Material: Knife or sharp edged nails can cause cuts.

Electrical energy: Naked wires or electrodes can result in electric shocks.