DES-IBM Certificate in Data Science

Certificate in Data Science

Course outline for Collaborative Program


Program Audience: Graduate students, postgraduate students, and working
professionals

Course Mode: Lecture, Tutorial, Practical

Course Objectives: The art of uncovering insights and trends in data has been around
since ancient times. The ancient Egyptians used census data to increase efficiency in tax
collection, and they accurately predicted the flooding of the Nile every year. Since then,
people working with data have carved out a unique and distinct field for the work they
do: data science. In this course, we will meet some data science practitioners and
get an overview of what data science is today. All faculty members participating
in the program will be provided with relevant teaching aids after they complete the sessions.
The teaching aids consist of an instructor guide, case study presentation slides, and access
to the online IBM Business Analytics @ Campus portal.

Course Pre-requisite

• A familiarity with the basic concepts in programming will be useful for Data Science sessions

• Participants must have a working knowledge of an operating system such as Windows or Linux

• Familiarity with diverse data sets will be useful

Note: DES-IBM reserves the right to make changes to the course structure and content. Changes
will be communicated to the participants during the session.

Module 1: Python (20-25 Hrs)

• Introduction to Python
• Understanding operators, variables, and data types
• Conditional statements
• Looping constructs, functions
• Data structures: lists, dictionaries
• Understanding standard libraries in Python, reading a CSV file in Python
• Data frames and basic operations with data frames, indexing a data frame
• Libraries in Python: NumPy, SciPy, Matplotlib, Scikit-learn
• Web development frameworks: Django/Flask
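
The short sketch below illustrates the kind of exercise the topics above point to: reading CSV data, storing it in a dictionary, and using a loop with a conditional statement. It is only an illustration, not course material. The column names, the sample values, and the helper `read_scores` are hypothetical, and it uses the standard-library `csv` module rather than a data frame library so that it runs without extra installs.

```python
import csv
import io

# Hypothetical inline data standing in for a CSV file on disk.
SAMPLE_CSV = """name,score
Asha,82
Ravi,74
Meera,91
"""


def read_scores(source):
    """Return a dict mapping each name to an integer score."""
    reader = csv.DictReader(source)
    return {row["name"]: int(row["score"]) for row in reader}


scores = read_scores(io.StringIO(SAMPLE_CSV))

# Mean of all scores, computed from the dictionary values.
mean_score = sum(scores.values()) / len(scores)

# Loop with a conditional statement: collect names scoring above the mean.
above_mean = []
for name, score in scores.items():
    if score > mean_score:
        above_mean.append(name)

print(scores)
print(above_mean)
```

To read a real file instead, `read_scores` can be passed an open file object, for example one obtained with `open(path, newline="")`.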

Module 2: Basic Statistics and Statistical Inference (20-25 Hrs)

• Concept of statistics, population, sample, parameter and statistic, examples of the use of
statistics, data sources, representation of data, types of statistical analyses, sampling
methods, types of variables, measures of central tendency, statistical estimation: point
and interval, covariance, coefficient of correlation, formulae
• Permutations and combinations, probability concepts, types of probabilities,
collectively exhaustive event set, joint probability, Bayes' theorem, probability
distribution for a discrete random variable, probabilistic view on variance, covariance
• Distributions: Bernoulli trial, binomial distribution, Poisson distribution,
hypergeometric distribution, Student's t-distribution, chi-square distribution,
F-distribution, normal distribution, explanation of derivation of population parameter
through samples and central limit theorem, Z score
• Hypothesis and testing, single-parameter and two-parameter testing, one-sided and
two-sided testing, p-value, tests and test statistics and the logic behind them, problems on
hypothesis testing, diagnostic tests: goodness of fit, t-test, F-test and chi-square test,
contingency table, degrees of freedom, analysis of variance
• Regression and allied concepts, data transformation, Linear and Matrix algebra
concepts
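
As an illustration of the hypothesis-testing topics above, the following sketch computes a one-sample t statistic, t = (sample mean - mu0) / (s / sqrt(n)), and compares it with a two-sided critical value. It is a minimal example under stated assumptions: the sample values and the hypothesised mean `mu0` are made up, and the critical value 2.776 (5% level, 4 degrees of freedom) is read from a standard t table rather than computed.

```python
import math
import statistics

# Hypothetical sample of five measurements; the values are made up for illustration.
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
mu0 = 5.0  # hypothesised population mean (an assumption for this example)

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)

# One-sample t statistic and its degrees of freedom.
t_statistic = (sample_mean - mu0) / standard_error
degrees_of_freedom = n - 1

# Two-sided critical value at the 5% level for 4 degrees of freedom,
# taken from a standard t table.
T_CRITICAL = 2.776
reject_null = abs(t_statistic) > T_CRITICAL

print(round(t_statistic, 3), degrees_of_freedom, reject_null)
```

Here the statistic is about 1.414, which is below the critical value, so the null hypothesis that the population mean equals `mu0` is not rejected at the 5% level.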

Module 3: R Programming (20-25 Hrs)

• Introduction to RStudio, mathematical and logical operators in R, data types and data
structures, simple operations and programs, matrix operations
• Data frames, string operations, factors, handling categorical data, lists and list
operations
• Loops and conditional statements, the switch function and break statement, the apply family of functions
• Statistical problem solving in R, Visualizations in R
• Hands-on data manipulations: cleaning, sub-setting, sampling, data transformations and
allied data operations

Module 4: Machine Learning (25-30 Hrs)

• Supervised, Unsupervised and Reinforcement Learning, geometry (lines, curves and
3D spaces) and visualisation of algebraic concepts
• Regression as a concept, simple one variable regression line, coefficients of the line,
assumptions of linear regression, Gradient descent algorithm, cost function to find 'beta'
values and concept, local and global minima, concept of learning rate
• Matrix representation of problem, Gradient descent for multiple features, use of feature
scaling techniques in gradient descent, types of feature scaling, finding coefficients
analytically, normal equation and (matrix) non-invertibility
• Logistic regression model, matrix representation, general sigmoid function and
graphical representation, decision boundary (linear and non-linear), metrics for logistic
regression (accuracy, sensitivity, specificity and related concepts), receiver operating
characteristic (ROC) curve, use of the ROC curve to find the optimum decision boundary,
convexity and non-convexity of a group of points
• Optimization objective from logistic regression to support vector machines, large
margin classifier, concepts behind large margin classifications, kernels (concept, types
and graphical explanations), using SVM
• Decision trees and random forests: Concept, diagrammatic representation, random
forest as a voting committee of decision trees, parameter meaning and explanation.
• Naive Bayes: Venn diagrams, Naive Bayes algorithm, application and problems, Naive
Bayes learning, Bayesian inference, Retail basket analysis; Concept of boosting and
bagging
• Unsupervised learning methods/Clustering: K-means algorithm, optimization
objective, graphical representation, random initialization, choosing number of clusters
• Association rule mining, K-nearest neighbours algorithm.

• Control flow and Pandas: Write conditional constructs to tweak the execution of your
scripts and get to know the Pandas DataFrame: the key data structure for Data Science
in Python.
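
The gradient descent topics listed in this module (cost function, 'beta' values, learning rate) can be sketched for a one-variable regression line as follows. This is a minimal illustration, not the course's own implementation. The data are synthetic points on the line y = 2x + 1, and the function name `gradient_descent`, the learning rate, and the iteration count are assumptions chosen so that the example converges.

```python
# Synthetic data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y = beta0 + beta1 * x by minimising the mean squared error cost.

    The cost is J = (1 / (2m)) * sum((beta0 + beta1 * x - y) ** 2).
    """
    beta0, beta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction errors with the current coefficients.
        errors = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to beta0 and beta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Step against the gradient, scaled by the learning rate.
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1


beta0, beta1 = gradient_descent(xs, ys)
print(round(beta0, 4), round(beta1, 4))
```

With these settings the coefficients approach the intercept 1 and slope 2 of the generating line. A learning rate that is too large makes the updates diverge, which is one reason feature scaling and the choice of learning rate appear as separate topics.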

Module 5: Big Data and Data Analytics (24-30 Hrs)


• Hortonworks Data Platform (HDP), Apache Ambari, Hadoop and the Hadoop
Distributed File System, MapReduce and YARN, Apache Spark, Storing and Querying
data, ZooKeeper, Slider, and Knox, Loading data with Sqoop
• DataPlane Service, Stream Computing, Data Science essentials, Drew Conway's Venn
diagram (and those of others), The Scientific Process applied to Data Science, the steps
in running a Data Science project
• Languages used for Data Science (Python, R, Scala, Julia, ...), Survey of Data Science
Notebooks, Markdown language with notebooks, Resources for Data Science,
including GitHub, Jupyter Notebook, Essential packages: NumPy, SciPy, Pandas,
Scikit-learn, NLTK, BeautifulSoup.
• Data visualizations: matplotlib, ..., PixieDust, Using Jupyter "magic" commands
• Using Big SQL to access HDFS data, Creating Big SQL schemas and tables, Querying
Big SQL tables, Managing the Big SQL Server, Configuring Big SQL security
• Data federation with Big SQL, IBM Watson Studio, Analyzing data with Watson
Studio Prerequisites Skills
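
To give a feel for the MapReduce model listed in this module, the sketch below counts words with separate map, shuffle, and reduce steps in plain Python. It only mimics the programming model on a single machine and does not use Hadoop or any of its APIs. The function names and the input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


# Made-up input lines standing in for records stored on a distributed file system.
lines = ["big data big insights", "data science"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce steps run in parallel on different nodes and the framework handles the shuffle. Here all three steps run in sequence in one process.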
