Spark and PySpark Training
Duration: 10 Days
Module 1: Introduction to Big Data Hadoop and Spark
What is Big Data?
Big Data Customer Scenarios
Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
How Hadoop Solves the Big Data Problem
What is Hadoop?
Hadoop’s Key Characteristics
Hadoop Ecosystem and HDFS
Hadoop Core Components
Rack Awareness and Block Replication
YARN and Its Advantages
Hadoop Cluster and Its Architecture
Hadoop: Different Cluster Modes
Big Data Analytics with Batch & Real-Time Processing
Why is Spark Needed?
What is Spark?
How Spark Differs from Its Competitors
Spark at eBay
Spark’s Place in the Hadoop Ecosystem
Module 2: Introduction to Python for Apache Spark
Overview of Python
Different Applications where Python is Used
Values, Types, Variables
Operands and Expressions
Conditional Statements
Loops
Command Line Arguments
Writing to the Screen
Python File I/O Functions
Numbers
Strings and related operations
Tuples and related operations
Lists and related operations
Dictionaries and related operations
Sets and related operations
Module 3: Functions, OOPs, and Modules in Python
Functions
Function Parameters
Global Variables
Variable Scope and Returning Values
Lambda Functions
Object-Oriented Concepts
Standard Libraries
Modules Used in Python
The Import Statements
Module Search Path
Package Installation Ways
Module 4: Deep Dive into Apache Spark Framework
Spark Components & its Architecture
Spark Deployment Modes
Introduction to PySpark Shell
Submitting PySpark Job
Spark Web UI
Writing your first PySpark Job Using Jupyter Notebook
Data Ingestion using Sqoop
Module 5: Playing with Spark RDDs
Challenges in Existing Computing Methods
Probable Solution & How RDD Solves the Problem
What is RDD, Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs
Key-Value Pair RDDs
Other Pair RDDs, Two Pair RDDs
RDD Lineage
RDD Persistence
WordCount Program Using RDD Concepts
RDD Partitioning & How it Helps Achieve Parallelization
Passing Functions to Spark
Module 6: DataFrames and Spark SQL
Need for Spark SQL
What is Spark SQL
Spark SQL Architecture
SQL Context in Spark SQL
Schema RDDs
User Defined Functions
Data Frames & Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources
Spark-Hive Integration
Module 7: Machine Learning using Spark MLlib
Why Machine Learning
What is Machine Learning
Where Machine Learning is used
Different Types of Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML algorithms supported by MLlib
Module 8: Deep Dive into Spark MLlib
Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest
Unsupervised Learning: K-Means Clustering & How It Works with MLlib
Analysis of US Election Data using MLlib (K-Means)
Module 9: Understanding Apache Kafka and Apache Flume
Need for Kafka
What is Kafka
Core Concepts of Kafka
Kafka Architecture
Where is Kafka Used
Understanding the Components of Kafka Cluster
Configuring Kafka Cluster
Kafka Producer and Consumer Java API
Need for Apache Flume
What is Apache Flume
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration
Integrating Apache Flume and Apache Kafka
Module 10: Apache Spark Streaming
Drawbacks in Existing Computing Methods
Why Streaming is Necessary
What is Spark Streaming
Spark Streaming Features
Spark Streaming Workflow
How Uber Uses Streaming Data
Streaming Context & DStreams
Transformations on DStreams
Windowed Operators and Why They Are Useful
Important Windowed Operators
Slice, Window and ReduceByWindow Operators
Stateful Operators
Module 11: Apache Spark Streaming - Data Sources
Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source
Module 12: Spark GraphX
Introduction to Spark GraphX
Information about a Graph
GraphX Basic APIs and Operations
Spark GraphX Algorithms: PageRank, Personalized PageRank, Triangle Count, Shortest Paths, Connected Components, Strongly Connected Components, Label Propagation