FDS 2 Marks All Units For File


CS3352 – FOUNDATIONS OF DATA SCIENCE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING



2 MARKS

UNIT I
1. Define data science.
a. Data science is an evolutionary extension of statistics, capable of dealing with the
massive amounts of data produced today
2. List the facets of data?
a. Structured data
b. Unstructured data
c. Natural language data
d. Machine data
e. Graph-based data
f. Streaming data
3. List down the steps in data science process
a. Setting the research goal
b. Gathering data
c. Data preparation
d. Data exploration
e. Data Modeling
f. Presentation and automation
4. Give characteristics of big data
a. Volume—How much data is there?
b. Variety—How diverse are different types of data?
c. Velocity—At what speed is new data generated?
5. List out the official data repositories in which data can be stored
a. Databases
b. data marts
c. data warehouses
d. data lakes
6. What are the essential items to be included in the project charter?
a. A clear research goal
b. The project mission and context
c. How you’re going to perform your analysis
d. What resources you expect to use
e. Proof that it’s an achievable project, or proof of concepts
f. Deliverables and a measure of success
g. A timeline


7. Define graph-based or network data.


This is a Graph Theory based data. In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects. Graph or network data is, in short, data that
focuses on the relationship or adjacency of objects. The graph structures use nodes, edges,
and properties to represent and store graphical data.

8. Define Data exploration process.


Data exploration is concerned with building a deeper understanding of your data. We try to
understand how variables interact with each other, the distribution of the data, and whether
there are outliers.

9. What is Data preparation?


This is the process of gathering data from internal and external sources and making it
usable. It involves data cleansing, which removes false values and inconsistencies across
data sources; data integration, which enriches data sources by combining information from
multiple data sources; and data transformation, which ensures that the data is in a suitable
format for use in your models.

10. List the process involved in Data cleansing


a. Combining data
b. Physically impossible values
c. Missing values
d. Outliers
e. Spaces, typos, …
f. Errors against codebook

11. Define an outlier.


An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the
other observations

12. How can an outlier be found?


An outlier can be found using a plot of the minimum and maximum values, or a plot of the
distribution of a variable. When a normal (Gaussian) distribution is expected, observations
that lie far outside the bulk of the distribution are possible outliers.

13. How can a missing-value problem be handled?

a. Omit the values : Easy to perform, but you lose the information from an observation
b. Set value to null : Easy to perform, but not every modeling technique and/or
implementation can handle null values
c. Impute a static value such as 0 or the mean : Easy to perform, and you don’t lose
information from the other variables in the observation, but it can lead to false
estimations from a model
d. Impute a value from an estimated or theoretical distribution : Does not disturb
the model as much, but it is harder to execute and you make data assumptions
e. Model the value (nondependent) : Does not disturb the model too much, but it can
lead to too much confidence in the model, can artificially raise dependence among
the variables, is harder to execute, and you make data assumptions
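Strategies (a) and (c) above can be sketched in a few lines of Python; the toy readings and the use of None as the missing-value marker are illustrative assumptions:

```python
# Toy readings with missing entries marked as None.
data = [23.0, None, 19.5, 21.0, None, 24.5]

# (a) Omit the values: drop observations with a missing entry.
observed = [x for x in data if x is not None]

# (c) Impute a static value: replace missing entries with the mean
# of the observed values (easy, but can bias model estimates).
mean = sum(observed) / len(observed)
imputed = [x if x is not None else mean for x in data]

print(observed)  # [23.0, 19.5, 21.0, 24.5]
print(imputed)   # [23.0, 22.0, 19.5, 21.0, 22.0, 24.5]
```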

14. How can one combine data from different data sources?
a. Joining tables
b. Appending tables (set operators)
c. Creating views

15. How can two tables be joined?


Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table. The focus is on enriching a single
observation.

16. Define Appending a Table


Appending or stacking tables is effectively adding observations from one table to another
table

17. Define Views


A view behaves as if you’re working on a table, but this table is nothing but a virtual layer
that combines the tables for you


18. List down the steps involved in Data transformation?


a. Aggregating data
b. Extrapolating data
c. Derived measures
d. Creating dummies
e. Reducing number of variables

19. What is a histogram?


A histogram is a graph of the distribution of a variable. In a histogram a variable is
cut into discrete bins, and the number of occurrences in each bin is summed up
and shown in the graph.

20. What can be observed from a box plot?


A box plot offers an impression of the distribution within categories. It can show the
maximum, minimum, median, and other characterizing measures at the same time.
21. What does building a model consist of?
a. Selection of a modeling technique and variables to enter in the model
b. Execution of the model
c. Diagnosis and model comparison
22. Define predictor and target variables.
The target variable is the variable whose values are modeled and predicted by other
variables. A predictor variable is a variable whose values will be used to predict the value
of the target variable.
23. How do we check whether a model fits the data?
R-squared or adjusted R-squared is used. This measure indicates the amount of variation in
the data that is captured by the model. Rules of thumb exist; for models in business,
values above 0.85 are often considered good.
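The R-squared measure can be computed by hand as 1 minus the ratio of unexplained to total variation; the y_true and y_pred values below are made-up examples:

```python
# R-squared = 1 - SS_res / SS_tot (illustrative values).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.995 -> well above the 0.85 rule of thumb
```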
24. Differentiate R-squared and adjusted R-squared.
R-squared : the proportion of variation in the dependent variable that is explained by the
model.
Adjusted R-squared : a version of R-squared that also penalizes the number of predictor
variables in the model, so it increases only when a new predictor genuinely improves the
model.
25. What is the use of Predictor variables coefficient?
The coefficients describe the mathematical relationship between each independent variable
and the dependent variable.
26. What is the use of the p-value?
The p-values for the coefficients indicate whether these relationships are statistically
significant. If the p-value is lower than 0.05, the variable is considered significant; it means
there is only a 5% chance that the predictor doesn’t have any influence, i.e., a 5% chance
that the model draws a wrong conclusion about it.


27. What is the use of a confusion matrix?
A confusion matrix is a table that shows how many cases were correctly and wrongly
classified, by comparing the predicted classes against the actual classes. From it, measures
such as accuracy, precision, and recall can be derived.
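For a binary classifier, the four cells of a confusion matrix can be tallied directly; the label lists below are invented toy data:

```python
# Tallying a 2x2 confusion matrix for a binary classifier (toy labels).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```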

28. Give the mean square error.

Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and take the mean of the squared errors over all predictions.
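The measure above is a one-liner in Python; the truth and prediction values are illustrative:

```python
# Mean square error: average of squared prediction errors (toy values).
truth = [1.0, 2.0, 3.0, 4.0]
preds = [1.0, 2.0, 4.0, 4.0]

mse = sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth)
print(mse)  # 0.25
```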

UNIT–II

1. What are the three stages of the IDA process?

o Data preparation
o Data mining and rule finding
o Result validation and interpretation

2. What is linear regression?

Linear regression is an approach for modeling the relationship between a scalar dependent
variable y and one or more explanatory variables (or independent variables) denoted X. The
case of one explanatory variable is called simple linear regression.
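Simple linear regression can be fitted by least squares in plain Python; the (x, y) data points below are invented for illustration:

```python
# Least-squares fit of y = intercept + slope * x for one explanatory
# variable (simple linear regression); the data points are invented.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))  # 1.99 0.09
```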

3. Explain Bayesian inference.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to
update the probability for a hypothesis as more evidence or information becomes available.
Bayesian inference is an important technique in statistics, and especially in mathematical
statistics.
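One Bayes'-theorem update, P(H|E) = P(E|H)·P(H) / P(E), can be worked through numerically; the diagnostic-test probabilities below are illustrative assumptions:

```python
# One Bayesian update with illustrative diagnostic-test numbers.
prior = 0.01            # P(hypothesis): patient has the disease
p_e_given_h = 0.95      # P(evidence | hypothesis): test sensitivity
p_e_given_not_h = 0.05  # false-positive rate of the test

# Total probability of the evidence, then the posterior.
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e
print(round(posterior, 3))  # 0.161: the evidence raised 0.01 to about 0.16
```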

4. What is meant by rule induction?

Rule induction is an area of machine learning in which formal rules are extracted from a
set of observations. The rules extracted may represent a full scientific model of the data,
or merely represent local patterns in the data.

5. What are the two strategies in the Learn-One-Rule function?

o General to specific
o Specific to general

6. Write down the topologies of a neural network.

 Single layer
 Multilayer
 Recurrent
 Self-organized

7. What is meant by fuzzy logic?

More than data mining tasks such as prediction, classification, etc., fuzzy models can give
insight into the underlying system and can be automatically derived from the system’s
dataset. For achieving this, the technique used is a grid-based rule set.

8. Write a short note on fuzzy qualitative modeling.

Fuzzy modeling can be interpreted as a qualitative modeling scheme by which the system
behavior is qualitatively described using a natural language. A fuzzy qualitative model is a
generalized fuzzy model consisting of linguistic explanations about system behavior in the
framework of fuzzy logic, instead of mathematical equations with numerical values or
conventional logical formulas with logical symbols.

9. What are the steps for Bayesian data analysis?

 Setting up the prior distribution
 Setting up the posterior distribution
 Evaluating the fit of the model

10. Write short notes on the time series model.

A time series is a sequential set of data points, measured typically at successive times. It is
mathematically defined as a set of vectors x(t), t = 0, 1, 2, … where t represents the time
elapsed. The variable x(t) is treated as a random variable.

UNIT-III

1. What is the data stream model?

A data stream is a real-time, continuous and ordered sequence of items. It is not possible to
control the order in which the items arrive, nor is it feasible to locally store a stream in its
entirety in any memory device.

2. Define data stream mining.

Data stream mining is the process of extracting useful knowledge from continuous, rapid
data streams. Many traditional data mining algorithms can be recast to work with larger
datasets, but they cannot address the problem of a continuous supply of data.

3. Write a short note about sensor networks.

Sensor networks are a huge source of data occurring in streams. They are used in
numerous situations that require constant monitoring of several variables, based on which
important decisions are made. In many cases, alerts and alarms may be generated as a
response to the information received from a series of sensors.

4. What is meant by one-time queries?

One-time queries are queries that are evaluated once over a point-in-time snapshot of the
dataset, with the answer returned to the user.

Eg: A stock price checker may alert the user when a stock price crosses a particular price point.

5. Define biased reservoir sampling.

Biased reservoir sampling uses a bias function to regulate the sampling from the stream.
The bias gives a higher probability of selecting data points from recent parts of the stream
as compared to the distant past.

6. What is a Bloom filter?

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton
Howard Bloom in 1970, that is used to test whether an element is a member of a set.
False positive matches are possible but false negatives are not; thus a Bloom filter has a
100% recall rate.

7. List out the applications of RTAP.

o Financial services
o Government
o E-commerce sites

8. Draw a high-level architecture for RADAR.

9. What are the three layers of the Lambda architecture?

o Batch layer - for batch processing of all data.
o Speed layer - for real-time processing of streaming data.
o Serving layer - for responding to queries.

10. What is RTSA?

Real-time sentiment analysis (also known as opinion mining) refers to the use of natural
language processing, text analysis, and computational linguistics to identify and extract
subjective information in source materials.


Unit-IV

1. What is association rule mining?

The main purpose of discovering frequent itemsets from a large dataset is to discover a set
of if-then rules called association rules. The form of an association rule is I → j, where I is
a set of items (products) and j is a particular item.
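The support and confidence of a candidate rule I → j can be computed directly over a set of baskets; the basket data, antecedent, and consequent below are invented toy examples:

```python
# Support and confidence for a candidate rule I -> j over toy baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
I = {"milk", "bread"}   # antecedent: a set of items
j = "butter"            # consequent: a particular item

n = len(baskets)
support_I = sum(I <= b for b in baskets) / n              # fraction containing I
support_Ij = sum(I <= b and j in b for b in baskets) / n  # fraction containing I and j
confidence = support_Ij / support_I                       # P(j | I)
print(support_I, confidence)  # 0.5 0.5
```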

2. List any two algorithms for finding frequent itemsets.

o Apriori algorithm
o FP-Growth algorithm
o SON algorithm
o PCY algorithm

3. What is meant by the curse of dimensionality?

Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces,
often behave unintuitively. Two unexpected properties of these spaces are that random
points are almost always at about the same distance from each other, and random vectors
are almost always orthogonal.

4. Write the algorithm of Park-Chen-Yu (first pass).

FOR (each basket):
    FOR (each item in basket):
        add 1 to item's count;
    FOR (each pair of items):
        hash the pair to a bucket;
        add 1 to the count for that bucket;
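The first pass above can be rendered in Python: count single items exactly, and hash each pair into a small bucket table. The baskets and the bucket-table size are illustrative; real PCY would also keep a support threshold and a second pass:

```python
from collections import Counter
from itertools import combinations

baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"], ["a", "b", "c"]]
n_buckets = 11  # illustrative hash-table size

item_counts = Counter()
bucket_counts = [0] * n_buckets
for basket in baskets:
    for item in basket:                              # exact item counts
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):     # hash every pair
        bucket_counts[hash(pair) % n_buckets] += 1

print(dict(item_counts))   # each of a, b, c occurs 4 times
print(sum(bucket_counts))  # 9 pairs hashed in total
```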

5. Define Toivonen’s algorithm.

Toivonen’s algorithm makes only one full pass over the database, and thus produces exact
association rules in that single pass. The algorithm will give neither false negatives nor
false positives, but there is a small yet non-zero probability that it will fail to produce any
answer at all. Toivonen’s algorithm begins by selecting a small sample of the input dataset
and finding from it the candidate frequent itemsets.

6. List out some applications of clustering.

o Collaborative filtering
o Customer segmentation
o Data summarization
o Dynamic trend detection
o Multimedia data analysis
o Biological data analysis
o Social network analysis

7. What are the types of hierarchical clustering methods?

o Single-link clustering
o Complete-link clustering
o Average-link clustering
o Centroid-link clustering

8. Define CLIQUE.

CLIQUE is a subspace clustering algorithm that automatically finds subspaces with high-
density clustering in high-dimensional attribute spaces. CLIQUE is a simple grid-based
method for finding density-based clusters in subspaces. The procedure for this grid-based
clustering is relatively simple.

9. What is meant by the k-means algorithm?

The family of algorithms is of the point-assignment type and assumes a Euclidean space.
It is assumed that there are exactly k clusters for some known k. After picking k initial
cluster centroids, the points are considered one at a time and assigned to the closest
centroid.
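The point-assignment loop above can be sketched in one dimension with k = 2; the data points and initial centroids are illustrative toy choices:

```python
# Point-assignment k-means in one dimension with k = 2 (toy data).
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [1.0, 10.0]  # k initial cluster centroids

for _ in range(10):  # alternate assignment and centroid update
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)  # assign to the closest centroid
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 10.5]
```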

10. Draw the diagram for hierarchical clustering.


UNIT-V

1. What are the main goals of Hadoop?

o Scalable
o Fault tolerant
o Economical
o Handles hardware failures.

2. What is Hive?

Hive provides a warehouse structure for other Hadoop input sources and SQL-like access
for data in HDFS. Hive’s query language, HiveQL, compiles to MapReduce and also allows
user-defined functions (UDFs).

3. What are the responsibilities of the MapReduce framework?

o Provides overall coordination of execution.
o Selects nodes for running mappers.
o Starts and monitors mappers’ execution.
o Sorts and shuffles output of mappers.
o Chooses locations for reducers’ execution.
o Delivers the output of mappers to reducer nodes.
o Starts and monitors reducers’ execution.

4. What is a key-value store?

The key-value store uses a key to access a value. The key-value store has a schema-less
format. The key can be artificially generated or auto-generated, while the value can be a
string, JSON, a BLOB, etc. The key-value store uses a hash table with a unique key and a
pointer to a particular item of data.

5. What is visualization? What are the three major goals in visualization?

Visualization is the presentation or communication of data using interactive interfaces. It
has three major goals:

 Communicating/presenting the analysis results efficiently and effectively.
 As a tool for confirmatory analysis, that is, to examine the hypothesis, analyze and
confirm.
 Exploratory data analysis as an interactive and mostly undirected search for finding
structures and trends.

6. What is sharding?

Horizontal partitioning of a large database leads to partitioning of the rows of the database.
Each partition forms part of a shard, meaning a small part of the whole. Each part can be
located on a separate database server or any physical location.

7. What are massively parallel processing databases? List the types of NoSQL databases.


Massively parallel processing (MPP) databases were built on the concept of the relational data
warehouse but are designed to be much faster, to be more efficient, and to support reduced query times.
Types of NoSQL databases:
 Document stores
 Key-value stores
 Wide-column stores
 Graph stores

8. What are the prime elements of Hadoop?


Initially, the project had two key elements:
 Hadoop Distributed File System (HDFS): A system for storing data across multiple nodes
 MapReduce: A distributed processing engine that splits a large task into smaller ones that
can be run in parallel.
9. What are the specialized nodes in HDFS?
 NameNodes
 DataNodes

10. Define NameNode and DataNode.


 NameNode : coordinates where the data is stored, and maintains a map of
where each block of data is stored and where it is replicated.
 DataNodes : these are the servers where the data is stored at the direction of
the NameNode. It is common to have many DataNodes in a Hadoop cluster to
store the data. Data blocks are distributed across several nodes and often are
replicated three, four, or more times across nodes for redundancy.

11. What is YARN in Hadoop?

YARN was developed to take over resource negotiation and job/task tracking,
allowing MapReduce to be responsible only for data processing.

12. Give the Hadoop Ecosystem


 Apache Kafka
 Apache Spark
 Apache Storm and Apache Flink
 Lambda Architecture

13. What are the core functions of edge analytics?


 To perform analytics at the edge, data needs to be viewed as real-time flows.
 Whereas big data analytics is focused on large quantities of data at rest, edge
analytics continually processes streaming flows of data in motion
 Streaming analytics at the edge can be broken down into three simple stages:
 Raw input data
 Analytics processing unit (APU)
 Output streams
