PySpark Training

Spark and PySpark Training
Duration: 10 Days
Module 1: Introduction to Big Data Hadoop and Spark

What is Big Data?
Big Data Customer Scenarios
Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
How Hadoop Solves the Big Data Problem
What is Hadoop?
Hadoop’s Key Characteristics
Hadoop Ecosystem and HDFS
Hadoop Core Components
Rack Awareness and Block Replication
YARN and Its Advantages
Hadoop Cluster and its Architecture
Hadoop: Different Cluster Modes
Big Data Analytics with Batch & Real-Time Processing
Why is Spark Needed?
What is Spark?
How Spark Differs from Its Competitors
Spark at eBay
Spark’s Place in Hadoop Ecosystem
Module 2: Introduction to Python for Apache Spark

Overview of Python
Different Applications where Python is Used
Values, Types, Variables
Operands and Expressions
Conditional Statements
Loops
Command Line Arguments
Writing to the Screen
Python File I/O Functions
Numbers
Strings and related operations
Tuples and related operations
Lists and related operations
Dictionaries and related operations
Sets and related operations
Module 3: Functions, OOPs, and Modules in Python

Functions
Function Parameters
Global Variables
Variable Scope and Returning Values
Lambda Functions
Object-Oriented Concepts
Standard Libraries
Modules Used in Python
The Import Statements
Module Search Path
Package Installation Methods
Module 4: Deep Dive into Apache Spark Framework

Spark Components & Architecture
Spark Deployment Modes
Introduction to PySpark Shell
Submitting PySpark Job
Spark Web UI
Writing your first PySpark Job Using Jupyter Notebook
Data Ingestion using Sqoop
Module 5: Playing with Spark RDDs

Challenges in Existing Computing Methods
Probable Solution & How RDD Solves the Problem
What is an RDD: Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs
Key-Value Pair RDDs
Other Pair RDDs, Two Pair RDDs
RDD Lineage
RDD Persistence
WordCount Program Using RDD Concepts
RDD Partitioning & How it Helps Achieve Parallelization
Passing Functions to Spark
Module 6: DataFrames and Spark SQL

Need for Spark SQL
What is Spark SQL
Spark SQL Architecture
SQL Context in Spark SQL
Schema RDDs
User Defined Functions
DataFrames & Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources
Spark-Hive Integration
Module 7: Machine Learning using Spark MLlib

Why Machine Learning
What is Machine Learning
Where Machine Learning is Used
Different Types of Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML Algorithms Supported by MLlib
Module 8: Deep Dive into Spark MLlib

Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest
Unsupervised Learning: K-Means Clustering & How It Works with MLlib
Analysis of US Election Data using MLlib (K-Means)
Module 9: Understanding Apache Kafka and Apache Flume

Need for Kafka
What is Kafka
Core Concepts of Kafka
Kafka Architecture
Where is Kafka Used
Understanding the Components of Kafka Cluster
Configuring Kafka Cluster
Kafka Producer and Consumer Java API
Need for Apache Flume
What is Apache Flume
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration
Integrating Apache Flume and Apache Kafka
Module 10: Apache Spark Streaming - Processing Multiple Batches

Drawbacks in Existing Computing Methods
Why Streaming is Necessary
What is Spark Streaming
Spark Streaming Features
Spark Streaming Workflow
How Uber Uses Streaming Data
Streaming Context & DStreams
Transformations on DStreams
Windowed Operators and Why They Are Useful
Important Windowed Operators
Slice, Window and ReduceByWindow Operators
Stateful Operators
Module 11: Apache Spark Streaming - Data Sources

Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source
Module 12: Spark GraphX

Introduction to Spark GraphX
Information about a Graph
GraphX Basic APIs and Operations
Spark GraphX Algorithms: PageRank, Personalized PageRank, Triangle Count, Shortest Paths, Connected Components, Strongly Connected Components, Label Propagation
