Question Bank R
1. Dr. M. Kamala Kumari - Chairman
Dept of CSE, AKNU, RJY
2. Dr. P. Venkateswara Rao - Member
Dept of CSE, AKNU, RJY
3. Mr. M. Simhadri - Member
Lecturer, Aditya Degree College, Kakinada
4. Mr. B N S Gupta - Member
Lecturer, SVKP & Dr. K.S Raju Arts & Science College, Penugonda
Data Science is a fast-growing interdisciplinary field, focusing on the analysis of data to extract
knowledge and insight. This course will introduce students to the collection, preparation, analysis,
modelling and visualization of data, covering both conceptual and practical issues. Examples and
case studies from diverse fields will be presented, and hands-on use of statistical and data
manipulation software will be included.
Outcomes
i. Recognize the various disciplines that contribute to a successful data science effort.
ii. Understand the processes of data science: identifying the problem to be solved, data collection,
preparation, modelling, evaluation and visualization.
iv. Be able to identify the type of algorithm to apply based on the type of problem.
v. Be comfortable using commercial and open source tools such as the R/Python languages and
their associated libraries for data analytics and visualization.
Unit-I
Defining Data Science and Big data, Benefits and Uses, facets of Data, Data Science Process.
History and Overview of R, Getting Started with R, R Nuts and Bolts
Unit-II
The Data Science Process: Overview of the Data Science Process-Setting the research goal,
Retrieving Data, Data Preparation, Exploration, Modeling, data Presentation and Automation.
Getting Data in and out of R, Using readr package, Interfaces to the outside world.
Unit-III
Machine Learning: Understanding why data scientists use machine learning-What is machine
learning and why we should care about, Applications of machine learning in data science, Where it is
used in data science, The modeling process, Types of Machine Learning-Supervised and
Unsupervised.
Unit-IV
Handling large Data on a Single Computer: The problems we face when handling large data, General
Techniques for handling large volumes of data, General programming tips for dealing with large
datasets. Case study - Predicting malicious URLs (this can be implemented in R).
Unit-V
Subsetting R objects, Vectorised Operations, Managing Data Frames with the dplyr package, Control
structures, functions, Scoping rules of R, Coding Standards in R, Loop Functions, Debugging,
Simulation
References
1. Davy Cielen, Arno D. B. Meysman, Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016.
2. Roger D. Peng, “R Programming for Data Science”, Lean Publishing, 2015.
3. Nina Zumel, John Mount, “Practical Data Science with R”, Manning Publications, 2014.
4. Mark Gardener, “Beginning R - The Statistical Programming Language”, John Wiley &
Sons, Inc., 2012.
5. W. N. Venables, D. M. Smith and the R Core Team, “An Introduction to R”, 2013.
6. Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta, “Practical Data
Science Cookbook”, Packt Publishing Ltd., 2014.
Student Activity
Students should be able to create a database and read from and write to it, and transfer data to and
from CSV and other types of files.
Students should clean data and make it consistent for any sort of analysis in R.
Perform statistical analysis on a variety of data.
Perform appropriate statistical tests using R and visualize the outcome.
Continuous assessment:
Let the students be tested in the following questions from each unit
1. Define Data Science. Discuss any application as an example.
2. What are the main components of R? Explain the basic R commands.
3. Explain the phases in the Data Science Process.
4. What is machine learning? What are the differences between machine learning, artificial
intelligence and data science?
5. What are the general techniques to handle large volumes of data?
6. Develop any data visualisation application by creating data frames, applying operations on
them and using relevant packages.
BASICS OF R LAB
Part - B
Outcomes
Compare various conceptions of data mining as evidenced in both research and application.
Should be able to apply the type of techniques based on the problems considered
Unit-I
An idea on Data Warehouse, Data Mining - KDD versus Data Mining, Stages of the Data Mining
Process - Task Primitives, Data Mining Techniques – Data Mining Knowledge Representation.
Unit-II
Data mining query languages - Integration of a Data Mining System with a Data Warehouse - Issues,
Data pre-processing – Data Cleaning, Data transformation – Feature selection – Dimensionality
reduction.
Unit-III
Concept Description: Characterization and Comparison: What is Concept Description, Data
Generalization by Attribute-Oriented Induction (AOI), AOI for Data Characterization,
Efficient Implementation of AOI.
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Frequent Itemset Mining
Methods: Apriori method, generating Association Rules, Improving the Efficiency of Apriori,
Pattern-Growth Approach for mining Frequent Itemsets.
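The Apriori method named above can be sketched as follows. This is a minimal illustration in plain Python, assuming transactions are given as sets of item names and support is an absolute count; the function name `apriori` and the toy transactions are hypothetical and are not taken from the prescribed textbooks.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(transactions, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is counted against the data only if all of its smaller subsets were already found frequent.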
Unit-IV
Classification: Basic Concepts, Decision Tree Induction: Decision Tree Induction
Algorithm, Attribute Selection Measures, Tree Pruning. Bayes Classification Methods.
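One common attribute selection measure is information gain, the reduction in entropy obtained by splitting on an attribute. The sketch below is a hedged illustration in plain Python; the function names, the dictionary-based row layout and the toy weather-style data are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print("Attribute chosen for the root split:", best)
```

A decision tree induction algorithm would apply this selection recursively to each partition until the partitions are pure or a stopping rule is met.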
Unit-V
Classification by Back Propagation: Multilayer Feed-Forward Neural Network.
Support Vector Machines: Cases when the data are linearly separable and linearly inseparable.
Cluster Analysis: Cluster Analysis, Partitioning Methods, Hierarchical methods, Density-based
methods - DBSCAN.
References
1. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 3rd Edition,
Morgan Kaufmann Publishers, 2011.
2. Adelchi Azzalini, Bruno Scarpa, “Data Analysis and Data Mining”, 2nd Edition, Oxford University
Press Inc., 2012.
3. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining & OLAP”, 10th Edition,
Tata McGraw Hill Edition, 2007.
4. G.K. Gupta, “Introduction to Data Mining with Case Studies”, 1st Edition, Eastern Economy
Edition, PHI, 2006.
Student Activities
1. Students should be able to implement Data Mining algorithms provided the relevant
data
2. Given the data, students can visualize all statistical measures
3. Differentiate the types of mining problems and identify what type of algorithms are to be
implemented.
Continuous assessment:
Let the students be tested in the following questions from each unit
1. What are Data Mining and KDD? Where does Data Mining fit in the KDD process?
2. Describe all preprocessing methods.
3. Explain Data Description and the AOI algorithm.
4. Explain classification and write any decision tree induction algorithm.
5. Explain the concept of clustering and write any algorithm to form clusters.
DATA MINING USING R PROGRAMMING LAB
1. Get and clean data using swirl exercises. (Use the ‘swirl’ package: load the library and install
that topic from swirl.)
2. Visualize all statistical measures (mean, mode, median, range, interquartile range, etc.) using
histograms, boxplots and scatter plots.
3. Create a data frame with the following structure.
4. Create a data frame with 10 observations and 3 variables and add new rows and
columns to it using ‘rbind’ and ‘cbind’ function.
5. Create a function to discretize a numeric variable into 3 quantiles and label them as
low, medium, and high. Apply it to each attribute of any dataset to create a new
data frame, ‘discrete’, with categorical variables and the class label.
6. Create a simple scatter plot of any dataset using the ‘dplyr’ library. Use the
same data to indicate distribution densities using box-and-whisker plots.
9. Implement decision trees on any dataset using the ‘party’ and ‘rpart’ packages.
Outcomes
Understand and learn all the basic concepts of Python
Program data analysis methods in Python
Become familiar with Python programming environments
UNIT I
What is Data Analysis? Differences between Data Analysis and Analytics, What is Python, Why
Python for Data Analysis? What is a Library, Essential Python Libraries. Python Language Basics,
IPython and Jupyter Notebook.
UNIT II
Built-in Data Structures, Functions, Files and Operating System.
NumPy Basics: Arrays and Vectorized Computation, The NumPy ndarray, Universal Functions,
Array-Oriented Programming with Arrays, File Input and Output with Arrays, Linear Algebra,
Pseudorandom Number Generation.
UNIT III
Getting Started with Pandas: Introduction to Pandas Data Structures, Essential Functionality,
Summarizing and Computing Descriptive Statistics
Data Loading, Storage and File Formats: Reading and Writing Data in Text Format, Binary Data
Formats, Interacting with Web APIs, Interacting with Databases.
UNIT IV
Data Cleaning and Preparation: Handling Missing Data, Data Transformation, String Manipulation.
Data Wrangling: Join, Combine and Reshape: Hierarchical Indexing, Combining and Merging
Datasets, Reshaping and Pivoting.
UNIT V
Introduction to Modeling Libraries in Python: Interfacing between pandas and Model code, Creating
model descriptions with Patsy, Introduction to statsmodels.
Plotting and Visualization: A brief matplotlib API Primer, Plotting with Pandas and seaborn, Other
Python visualization tools.
Reference Books
1. Wes McKinney, “Python for Data Analysis”, O’Reilly Publications, Second Edition
2. Charles R. Severance, “Python for Everybody: Exploring Data Using Python 3”
3. John Zelle, Michael Smith, “Python Programming”, Second Edition, 2010
Student Activities
Take up any application which involves Python coding. Example case studies/simulators:
(https://knightlab.northwestern.edu/2014/06/05/five-mini-programming-projects-for-the-python-beginner/)
1. Dice Rolling Simulator
2. Guess the number
3. Text based adventure game
4. Hangman
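As a starting point for the first activity, a dice rolling simulator might look like the sketch below. It uses only the Python standard library; the function names, default parameters and the optional seed are assumptions chosen for illustration, not a prescribed solution.

```python
import random
from collections import Counter


def roll_dice(num_dice=2, sides=6, rng=random):
    """Roll num_dice dice with the given number of sides; return the face values."""
    return [rng.randint(1, sides) for _ in range(num_dice)]


def simulate(num_rolls, num_dice=2, sides=6, seed=None):
    """Roll the dice num_rolls times and count how often each total appears."""
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    return Counter(sum(roll_dice(num_dice, sides, rng)) for _ in range(num_rolls))


if __name__ == "__main__":
    totals = simulate(10_000, seed=42)
    for total in sorted(totals):
        print(f"{total:2d}: {totals[total]}")
```

Students can extend the sketch by plotting the counts as a bar chart or by comparing the observed frequencies with the theoretical probabilities.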
Continuous assessment:
Let the students be tested in the following questions from each unit
1. What is Data Analysis? List the differences between data analysis and data analytics.
2. What is Python? Explain Python basics.
3. Explain NumPy basics.
4. What is data loading? Explain Pandas data structures.
5. What is data cleaning? Explain the different phases in it.
6. Explain plotting and visualization in Python.
PYTHON PROGRAMMING LAB
PART - B
Answer Any FIVE Questions 5*10=50M
11. What is Python? Explain Python libraries.
12. What is data analysis? Why is Python used for data analysis?
13. Explain in detail arrays and their related concepts in Python.
14. Discuss file input and output in Python.
15. Explain storage and file formats in Python.
16. Discuss in detail Pandas in Python with a suitable example.
17. Explain string manipulation functions in Python.
18. Discuss combining and merging datasets in Python.
19. Describe plotting and visualization concepts in Python.
20. Explain modeling libraries in Python.
PAPER 4: BIG DATA ANALYTICS USING SPARK
OBJECTIVES
To Understand the Complete Architecture of Spark
To know the differences between Hadoop and Spark
To know the concepts of Spark Programming
OUTCOMES
Students will gain a sound knowledge of what Big Data is
Knowledge in Spark Eco System
Mapping of Data Analytics techniques in Spark
Application of Spark Programming to Analytics problems
UNIT - I
Introduction to Big Data: What is Big Data - Characteristics, Data in the Warehouse and Data in
Hadoop, Why is Big Data Important - When to consider a Big Data Solution, Applications.
Introduction to Hadoop: Hadoop - definition, Application development in Hadoop. The building
blocks of Hadoop: NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker.
UNIT-II
Introduction to Spark: What is Apache Spark, Why Spark when Hadoop is there, Spark Features,
Spark components, Spark program flow, Spark Eco System. Differences between implementation of
programs in Hadoop and Spark programming environments.
UNIT III
Spark Fundamentals - Using the spark-in-action VM, Using Spark Shell and writing first Spark program,
Basic RDD actions and transformations.
Spark SQL-Working with Data Frames, Using SQL Commands, Saving and loading DataFrame.
UNIT IV
Streaming in Spark- Writing spark streaming applications, Using external data sources, structured
streaming.
Spark MLlib-Introduction to Machine Learning. Definition of Machine Learning, Machine Learning
with Spark.
UNIT V
Graph Representation in MapReduce: Graph Processing with Spark, Spark GraphX, GraphX
features, GraphX Examples, Graph algorithms - Shortest Path Algorithm.
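The idea behind a single-source shortest path algorithm can be illustrated without a cluster. The sketch below is Dijkstra's algorithm in plain Python for non-negative edge weights; it is not the GraphX API, and the function name, the adjacency-dictionary layout and the toy graph are assumptions for illustration.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm: graph maps node -> {neighbour: non-negative weight}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


# Hypothetical toy graph, for illustration only.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
distances = shortest_paths(graph, "A")
print(distances)
```

Distributed graph engines typically compute the same distances by iterative message passing between vertices rather than with a central priority queue, so this sketch shows the result being computed, not the distributed execution model.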
REFERENCE BOOKS:
1. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data by Dirk
deRoos, Chris Eaton, George Lapis, Paul Zikopoulos, Tom Deutsch, 1st Edition, TMH, 2012.
2. Spark in Action, Petar Zecevic, Marko Bonaci, Manning Publications, 2016.
3. Learning Spark, Holden Karau, A. Konwinski, et al., O’Reilly Publications.
4. Hadoop in Action by Chuck Lam, Manning Publishers.
5. Hadoop: The Definitive Guide by Tom White, 3rd Edition, O’Reilly.
6. Mining of Massive Datasets, Anand Rajaraman, Jeffrey D. Ullman, Wiley Publications.
Student Activities
2. Data Preprocessing
Continuous assessment:
Let the students be tested in the following questions from each unit
1. What is Big Data? Explain its characteristics.
2. What is Spark? What are its advantages over Hadoop?
3. Explain Spark SQL
4. Explain Spark Streaming
5. Explain Shortest Path Algorithm.
SPARK PROGRAMMING LAB
PART - B
Answer any Five Questions 5*5=25M
OBJECTIVES
To know the importance of data Visualization in the world of Data Analytics and Prediction
To know the important libraries in Tableau
To get equipped with Tableau Tool
OUTCOMES
Students should be able to visualize data through seven stages of data analysis process
Should be able to do explanatory and hybrid types of data visualization
Should be able to understand various stages of visualizing data
UNIT I
Creating Visual Analytics with Tableau Desktop, Connecting to your data - How to connect to your
data, What are generated values? Knowing when to use a direct connection, Joining tables with
Tableau, Blending different data sources in a single worksheet.
UNIT II
Building your first Visualization - How Show Me works - Chart types, Text Tables, Maps, Bar charts, Line
charts, Area Fill charts and Pie charts, Scatter plot, Bullet graph, Gantt charts, Sorting data in Tableau,
Enhancing Views with filters, sets, groups and hierarchies.
UNIT III
Creating calculations to enhance your data- What is aggregation, what are calculated values and
table calculations, Using the calculation dialog box to create, Building formulas using table
calculations, Using table calculation functions
UNIT IV
Using maps to improve insights-Create a Standard Map View, Plotting your own locations on a
map, Replace Tableau’s standard maps, Shaping data to enable Point-to-Point mapping.
UNIT V
Developing an Ad hoc analysis environment - Generating new data with forecasts, Providing
self-service ad hoc analysis with parameters, Editing views in Tableau Server.
Reference Books
1. Tableau Your Data, Daniel G. Murray and the InterWorks BI team, Wiley Publications
2. Tableau Data Visualization Cookbook, Ashutosh Nandeshwar, Packt Publishing.
3. Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole
Nussbaumer Knaflic (2014)
4. ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (2009)
General Requirements
1. Dashboard size is 1250px wide by 750px tall.
2. Prefer using containers
3. The dashboard has a total of 5 containers (no more, no less)
4. The Filter Pane
5. Each filter has some padding
Charts Pane Requirement
1. All 3 charts must be in one vertical container
2. Do proper formatting
3. Each chart has some padding between them and other objects
4. Each chart has a grey border, slightly darker than the Pane background color.
5. The Pane under the Title has a border
Business Requirements
1. Show four filters- Category, Sub-Category, Region, and Segment. These filters should have only
relevant values.
2. The dashboard should have the title “Executive sales”
3. The first chart should have the title “YTS KPIs” and should show the following-
Total Discount
Overall Profit
Total Quantity and
Total Sales
4. The second graph should have the title “Sales” and should show monthly sales per year. Make
sure it is an area chart with proper formatting.
5. The third graph should have the title “Profit” and should show monthly profit per year. Make sure it
is an area chart with proper formatting.
Continuous assessment:
Let the students be tested in the following questions from each unit
1. What are generated values? Join tables using Tableau
2. Create any visualization charts using Chart types, Text Tables, Maps, bar chart, Line charts, Area
Fill charts and Pie charts, scatter plot etc.,
3. What is aggregation, what are calculated values and table calculations?
4. Using Standard Map View, Plot your own locations on a map
5. Develop an Adhoc analysis environment.
DATA VISUALIZATION LAB USING TABLEAU
1. Connect to data Sources
2. Create Univariate Charts
3. Create Bivariate and Multivariate charts
4. Create Maps
5. Calculate user-defined fields
6. Create a workbook data extract
7. Save a workbook on a Tableau server and web
8. Export images, data.
PAPER 5: DATA VISUALIZATION
MODEL PAPER
PART - A