FDS 2 Marks All Units For File
UNIT I
1. Define data science.
a. Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced.
2. List the facets of data.
a. Structured data
b. Unstructured data
c. Natural language data
d. Machine data
e. Graph-based data
f. Streaming data
3. List down the steps in data science process
a. Setting the research goal
b. Gathering data
c. Data preparation
d. Data exploration
e. Data Modeling
f. Presentation and automation
4. Give characteristics of big data
a. Volume—How much data is there?
b. Variety—How diverse are different types of data?
c. Velocity—At what speed is new data generated?
5. List out the official data repositories in which data can be stored.
a. Databases
b. Data marts
c. Data warehouses
d. Data lakes
6. What are the essential items to be available in the project charter?
a. A clear research goal
b. The project mission and context
c. How you’re going to perform your analysis
d. What resources you expect to use
e. Proof that it’s an achievable project, or proof of concepts
f. Deliverables and a measure of success
g. A timeline
CS3352 – FOUNDATIONS OF DATA SCIENCE
a. Omit the values. Advantage: easy to perform. Disadvantage: you lose the
information from an observation.
b. Set value to null. Advantage: easy to perform. Disadvantage: not every modeling
technique and/or implementation can handle null values.
c. Impute a static value such as 0 or the mean. Advantages: easy to perform; you
don’t lose information from the other variables in the observation. Disadvantage:
can lead to false estimations from a model.
d. Impute a value from an estimated or theoretical distribution. Advantage: does not
disturb the model as much. Disadvantages: harder to execute; you make data assumptions.
e. Modeling the value (nondependent). Advantage: does not disturb the model too much.
Disadvantages: can lead to too much confidence in the model; can artificially raise
dependence among the variables; harder to execute; you make data assumptions.
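Options a and c above can be sketched in a few lines of plain Python. This is only an illustration: the `ages` sample and the use of `None` as the missing-value marker are assumptions, not part of the notes.

```python
from statistics import mean

# Hypothetical sample: None marks a missing observation.
ages = [23, None, 31, 27, None, 39]

# Option a: omit the missing values (those observations are lost).
omitted = [a for a in ages if a is not None]

# Option c: impute a static value, here the mean of the observed values.
fill_value = mean(omitted)
imputed = [fill_value if a is None else a for a in ages]

print(omitted)
print(imputed)
```

Omitting shrinks the sample, while imputing keeps every row but pulls the distribution toward the fill value, which is the "false estimations" risk listed above.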
14. How can one combine data from different data sources?
a. Combining data
b. Set operators
c. Creating views
observation.
i. Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and add up the error of every prediction.
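The measure described above can be sketched as follows. The sample values are made up, and the sketch assumes the usual convention of dividing the summed squared errors by the number of predictions to get the mean.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    # Square each prediction error, then average the squared errors.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical truth values and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)
```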
UNIT II
1. What are the three stages of the IDA process?
o Data preparation
o Data mining and rule finding
o Result validation and interpretation
2. What is linear regression?
Linear regression is an approach for modeling the relationship between a scalar dependent
variable y and one or more explanatory variables (or independent variables) denoted X.
The case of one explanatory variable is called simple linear regression.
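For the single-variable case, the fitted line can be sketched with the ordinary least-squares formulas. The function name and the sample data are illustrative assumptions; the notes do not prescribe a fitting method.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```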
3. Explain Bayesian inference.
4. What is meant by rule induction?
Rule induction is an area of machine learning in which formal rules are extracted from
a set of observations. The rules extracted may represent a full scientific model of the data,
or merely represent local patterns in the data.
5. What are the two strategies in the Learn-One-Rule function?
o General to specific
o Specific to general
6. Write down the topologies of neural networks.
o Single layer
o Multilayer
o Recurrent
o Self-organized
7. What is meant by fuzzy logic?
Beyond data mining tasks such as prediction and classification, fuzzy models can give
insight into the underlying system and can be automatically derived from the system’s
dataset. The technique used to achieve this is the grid-based rule set.
8. Write a short note on fuzzy qualitative modeling.
9. What are the steps for Bayesian data analysis?
o Setting up the prior distribution
o Setting up the posterior distribution
o Evaluating the fit of the model
10. Write short notes on the time series model.
A time series is a sequential set of data points, measured typically at successive times.
It is mathematically defined as a set of vectors x(t), t = 0, 1, 2, … where t represents the time elapsed. The Vari
UNIT III
1. What is the data stream model?
A data stream is a real-time, continuous and ordered sequence of items. It is not possible to control
the order in which the items arrive, nor is it feasible to locally store a stream in its entirety in any
memory device.
2. Define data stream mining.
3. Write a short note about sensor networks.
Sensor networks are a huge source of data occurring in streams. They are used
in numerous situations that require constant monitoring of several variables, based on
which important decisions are made. In many cases, alerts and alarms may be generated as a
response to the information received from a series of sensors.
4. What is meant by one-time queries?
One-time queries are queries that are evaluated once over a point-in-time snapshot
of the dataset, with the answer returned to the user.
Eg: A stock price checker may alert the user when a stock price crosses a particular price point.
5. Define biased reservoir sampling.
Biased reservoir sampling uses a bias function to regulate the sampling from
the stream. The bias gives a higher probability of selecting data points from recent parts of
the stream as compared to the distant past.
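The notes do not say which bias function is meant. One commonly cited scheme (Aggarwal's memory-less exponential bias) can be sketched as below: every arriving point enters the reservoir, and with probability equal to the fraction of the reservoir already filled it replaces a randomly chosen existing point. The function name and parameters are illustrative assumptions.

```python
import random

def biased_reservoir_sample(stream, capacity, rng=None):
    """Keep at most `capacity` items, favouring recent items of the stream."""
    rng = rng or random.Random()
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if rng.random() < fill_fraction:
            # Replace a randomly chosen existing point with the new one.
            reservoir[rng.randrange(len(reservoir))] = item
        else:
            # Reservoir not full and the coin flip failed: just append.
            reservoir.append(item)
    return reservoir

sample = biased_reservoir_sample(range(1000), capacity=50, rng=random.Random(0))
print(len(sample), max(sample))
```

Because old points are overwritten at random on every arrival once the reservoir is full, their chance of surviving decays with age, which is what biases the sample toward the recent part of the stream.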
6. What is a Bloom filter?
7. List out the applications of RTAP.
o Financial services
o Government
o E-Commerce sites
8. Draw a high-level architecture for RADAR.
o Batch layer - for batch processing of all data.
o Speed layer - for real-time processing of streaming data.
o Serving layer - for responding to queries.
10. What is RTSA?
Real-Time Sentiment Analysis (also known as opinion mining) refers to the use of
natural language processing, text analysis and computational linguistics to identify and extract
subjective information in source materials.
UNIT IV
1. What is Association Rule Mining?
In association rule mining, the main purpose of discovering frequent itemsets from
a large dataset is to discover a set of if-then rules called association rules. The form of
an association rule is I→j, where I is a set of items (products) and j is a particular item.
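A rule of the form I→j is usually judged by its confidence, a standard measure that the notes above do not state: the fraction of baskets containing I that also contain j. A minimal sketch, with hypothetical baskets and helper names:

```python
# Hypothetical market baskets, each a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """Fraction of the baskets containing I that also contain j."""
    support_I = support(I, baskets)
    if support_I == 0:
        return 0.0  # the rule never applies in this data
    return support(I | {j}, baskets) / support_I

# Rule {bread} -> milk
conf = confidence({"bread"}, "milk", baskets)
print(conf)
```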
2. List any two algorithms for finding frequent itemsets.
o Apriori algorithm
o FP-Growth algorithm
o SON algorithm
o PCY algorithm
3. What is meant by the curse of dimensionality?
4. Write an algorithm of Park-Chen-Yu.
FOR (each basket):
    FOR (each item in basket):
        add 1 to item’s count;
    FOR (each pair of items):
        { hash the pair to a bucket;
          add 1 to the count for that bucket; }
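The pseudocode above can be sketched in Python as follows. The function name, the `num_buckets` parameter and the use of Python's built-in `hash` are assumptions made for illustration; any hash function that maps a pair to a bucket index would do.

```python
from collections import Counter
from itertools import combinations

def pcy_first_pass(baskets, num_buckets):
    """Count single items and hash every item pair into a bucket counter."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))  # canonical order so (a, b) == (a, b)
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            # String hashes vary between Python runs, so the bucket a pair
            # lands in may differ, but the counting logic is unchanged.
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts

# Hypothetical baskets.
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
item_counts, bucket_counts = pcy_first_pass(baskets, num_buckets=11)
print(item_counts, sum(bucket_counts))
```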
5. Define Toivonen’s algorithm.
Toivonen’s algorithm makes only one full pass over the database. The algorithm
thus produces exact association rules in one full pass over the database. The algorithm will
give neither false negatives nor false positives, but there is a small yet non-zero probability
that it will fail to produce any answer at all. Toivonen’s algorithm begins by selecting a small
sample of the input dataset and finding from it the candidate frequent itemsets.
6. List out some applications of clustering.
o Collaborative filtering
o Customer segmentation
o Data summarization
o Dynamic trend detection
o Multimedia data analysis
o Biological data analysis
o Social network analysis
7. What are the types of hierarchical clustering methods?
o Single-link clustering
o Complete-link clustering
o Average-link clustering
o Centroid-link clustering
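The four methods listed above differ only in how the distance between two clusters is measured. A minimal sketch using Euclidean distance; the helper names and the two sample clusters are assumptions for illustration:

```python
from itertools import product
from math import dist

def pairwise_distances(a, b):
    """Euclidean distance between every point of a and every point of b."""
    return [dist(p, q) for p, q in product(a, b)]

def single_link(a, b):
    return min(pairwise_distances(a, b))   # closest pair of points

def complete_link(a, b):
    return max(pairwise_distances(a, b))   # farthest pair of points

def average_link(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)                 # mean over all pairs

def centroid_link(a, b):
    def centroid(points):
        return tuple(sum(c) / len(points) for c in zip(*points))
    return dist(centroid(a), centroid(b))  # distance between cluster means

# Hypothetical clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(cluster_a, cluster_b), complete_link(cluster_a, cluster_b))
```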
8. Define CLIQUE.
CLIQUE is a subspace clustering algorithm that automatically finds subspaces with
high-density clustering in high-dimensional attribute spaces. CLIQUE is a simple grid-based
method for finding density-based clusters in subspaces. The procedure for this grid-based
clustering is relatively simple.
9. What is meant by the k-means algorithm?
10. Draw the diagram for hierarchical clustering.
UNIT V
1. What are the main goals of Hadoop?
o Scalable
o Fault tolerance
o Economical
o Handle hardware failures.
2. What is Hive?
Hive provides a warehouse structure for other Hadoop input sources and SQL-like access for
data in HDFS. Hive’s query language, HiveQL, compiles to MapReduce and also allows
user-defined functions (UDFs).
3. What are the responsibilities of the MapReduce framework?
o Provides overall coordination of execution.
o Selects nodes for running mappers.
o Starts and monitors the mappers’ execution.
o Sorts and shuffles the output of mappers.
o Chooses locations for the reducers’ execution.
o Delivers the output of mappers to the reducers’ nodes.
o Starts and monitors the reducers’ execution.
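The map, sort-and-shuffle and reduce steps in the list above can be imitated in a single process with a word count. This is only an in-memory sketch with made-up function names, not the Hadoop API; the real framework distributes these steps across nodes.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle_and_sort(mapped_pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

# Hypothetical input lines.
lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))
print(result)
```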
4. What is a key-value store?
The key-value store uses a key to access a value. The key-value store has a schema-less
format. The key can be artificially generated or auto-generated, while the value can be a
string, JSON, BLOB, etc. The key-value store uses a hash table with a unique key and a pointer
to a particular item of data.
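The idea above can be sketched with a Python `dict`, which is itself a hash table. The `put` and `get` helpers, the sample keys and the use of `uuid` for auto-generated keys are assumptions for illustration, not the interface of any particular product.

```python
import json
import uuid

store = {}  # hash table mapping each unique key to its value

def put(value, key=None):
    """Store a value under the given key, or under an auto-generated one."""
    if key is None:
        key = uuid.uuid4().hex
    store[key] = value
    return key

def get(key):
    """Return the value for the key, or None if the key is absent."""
    return store.get(key)

# Schema-less: values of different shapes live side by side.
user_key = put(json.dumps({"name": "Asha", "city": "Chennai"}), key="user:1")
blob_key = put(b"\x89PNG")  # binary value under an auto-generated key

print(json.loads(get(user_key))["name"])
```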
5. What is visualization? What are the three major goals in visualization?
Visualization is the presentation or communication of data using interactive interfaces.
It has three major goals:
o Communicating/presenting the analysis results efficiently and effectively.
o Serving as a tool for confirmatory analysis, that is, to examine the hypothesis, analyze
and confirm.
o Exploratory data analysis as an interactive and mostly undirected search for finding
structures and trends.
6. What is sharding?
store the data. Data blocks are distributed across several nodes and often are
replicated three, four, or more times across nodes for redundancy.
YARN was developed to take over the resource negotiation and job/task tracking,
allowing MapReduce to be responsible only for data processing.