Spark Cookbook
Ebook, 483 pages, 2 hours

About this ebook

About This Book
  • Become an expert at graph processing using GraphX
  • Use Apache Spark as your single big data compute platform and master its libraries
  • Learn with recipes that can be run on a single machine as well as on a production cluster of thousands of machines
Who This Book Is For

If you are a data engineer, an application developer, or a data scientist who would like to leverage the power of Apache Spark to get better insights from big data, then this is the book for you.

Language: English
Release date: Jul 27, 2015
ISBN: 9781783987078

    Book preview

    Spark Cookbook - Rishi Yadav

    Table of Contents

    Spark Cookbook

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why Subscribe?

    Free Access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Sections

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Conventions

    Reader feedback

    Customer support

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Getting Started with Apache Spark

    Introduction

    Installing Spark from binaries

    Getting ready

    How to do it...

    Building the Spark source code with Maven

    Getting ready

    How to do it...

    Launching Spark on Amazon EC2

    Getting ready

    How to do it...

    See also

    Deploying on a cluster in standalone mode

    Getting ready

    How to do it...

    How it works...

    See also

    Deploying on a cluster with Mesos

    How to do it...

    Deploying on a cluster with YARN

    Getting ready

    How to do it...

    How it works…

    Using Tachyon as an off-heap storage layer

    How to do it...

    See also

    2. Developing Applications with Spark

    Introduction

    Exploring the Spark shell

    How to do it...

    Developing Spark applications in Eclipse with Maven

    Getting ready

    How to do it...

    Developing Spark applications in Eclipse with SBT

    How to do it...

    Developing a Spark application in IntelliJ IDEA with Maven

    How to do it...

    Developing a Spark application in IntelliJ IDEA with SBT

    How to do it...

    3. External Data Sources

    Introduction

    Loading data from the local filesystem

    How to do it...

    Loading data from HDFS

    How to do it...

    There's more…

    Loading data from HDFS using a custom InputFormat

    How to do it...

    Loading data from Amazon S3

    How to do it...

    Loading data from Apache Cassandra

    How to do it...

    There's more...

    Merge strategies in sbt-assembly

    Loading data from relational databases

    Getting ready

    How to do it...

    How it works…

    4. Spark SQL

    Introduction

    Understanding the Catalyst optimizer

    How it works…

    Analysis

    Logical plan optimization

    Physical planning

    Code generation

    Creating HiveContext

    Getting ready

    How to do it...

    Inferring schema using case classes

    How to do it...

    Programmatically specifying the schema

    How to do it...

    How it works…

    Loading and saving data using the Parquet format

    How to do it...

    How it works…

    There's more…

    Loading and saving data using the JSON format

    How to do it...

    How it works…

    There's more…

    Loading and saving data from relational databases

    Getting ready

    How to do it...

    Loading and saving data from an arbitrary source

    How to do it...

    There's more…

    5. Spark Streaming

    Introduction

    Word count using Streaming

    How to do it...

    Streaming Twitter data

    How to do it...

    Streaming using Kafka

    Getting ready

    How to do it...

    There's more…

    6. Getting Started with Machine Learning Using MLlib

    Introduction

    Creating vectors

    How to do it…

    How it works...

    Creating a labeled point

    How to do it…

    Creating matrices

    How to do it…

    Calculating summary statistics

    How to do it…

    Calculating correlation

    Getting ready

    How to do it…

    Doing hypothesis testing

    How to do it…

    Creating machine learning pipelines using ML

    Getting ready

    How to do it…

    7. Supervised Learning with MLlib – Regression

    Introduction

    Using linear regression

    Getting ready

    How to do it…

    Understanding cost function

    Doing linear regression with lasso

    How to do it…

    Doing ridge regression

    How to do it…

    8. Supervised Learning with MLlib – Classification

    Introduction

    Doing classification using logistic regression

    Getting ready

    How to do it…

    Doing binary classification using SVM

    How to do it…

    Doing classification using decision trees

    Getting ready

    How to do it…

    How it works…

    Doing classification using Random Forests

    Getting ready

    How to do it…

    How it works…

    Doing classification using Gradient Boosted Trees

    Getting ready

    How to do it…

    Doing classification with Naïve Bayes

    Getting ready

    How to do it…

    9. Unsupervised Learning with MLlib

    Introduction

    Clustering using k-means

    Getting ready

    How to do it…

    Dimensionality reduction with principal component analysis

    Getting ready

    How to do it…

    Dimensionality reduction with singular value decomposition

    Getting ready

    How to do it…

    10. Recommender Systems

    Introduction

    Collaborative filtering using explicit feedback

    Getting ready

    How to do it…

    Collaborative filtering using implicit feedback

    Getting ready

    How to do it…

    How it works…

    There's more…

    11. Graph Processing Using GraphX

    Introduction

    Fundamental operations on graphs

    Getting ready

    How to do it…

    Using PageRank

    Getting ready

    How to do it…

    Finding connected components

    Getting ready

    How to do it…

    Performing neighborhood aggregation

    Getting ready

    How to do it…

    12. Optimizations and Performance Tuning

    Introduction

    Optimizing memory

    Using compression to improve performance

    Using serialization to improve performance

    How to do it…

    Optimizing garbage collection

    How to do it…

    Optimizing the level of parallelism

    How to do it…

    Understanding the future of optimization – project Tungsten

    Manual memory management by leverage application semantics

    Using algorithms and data structures

    Code generation

    Index

    Spark Cookbook


    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: July 2015

    Production reference: 1160715

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78398-706-1

    www.packtpub.com

    Cover image by: InfoObjects design team

    Credits

    Author

    Rishi Yadav

    Reviewers

    Thomas W. Dinsmore

    Cheng Lian

    Amir Sedighi

    Commissioning Editor

    Kunal Parikh

    Acquisition Editors

    Shaon Basu

    Neha Nagwekar

    Content Development Editor

    Ritika Singh

    Technical Editor

    Ankita Thakur

    Copy Editors

    Ameesha Smith-Green

    Swati Priya

    Project Coordinator

    Milton Dsouza

    Proofreader

    Safis Editing

    Indexer

    Mariammal Chettiyar

    Graphics

    Sheetal Aute

    Production Coordinator

    Nilesh R. Mohite

    Cover Work

    Nilesh R. Mohite

    About the Author

    Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He finished his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.

    About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data.

    InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 4 years in a row. InfoObjects was also named the #1 best place to work in the Bay Area in 2014 and 2015.

    Rishi is an open source contributor and active blogger.

    My special thanks go to my better half, Anjali, for putting up with the long, arduous hours that were added to my already swamped schedule; our 8-year-old son, Vedant, who tracked my progress on a daily basis; InfoObjects' CTO and my business partner, Sudhir Jangir, for leading the big data effort in the company; Helma Zargarian, Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and our internal review team, especially Arivoli Tirouvingadame, Lalit Shravage, and Sanjay Shroff, for helping with the review. I could not have written this book without your support. I would also like to thank Marcel Izumi for putting together amazing graphics.

    About the Reviewers

    Thomas W. Dinsmore is an independent consultant, offering product advisory services to analytic software vendors. To this role, he brings 30 years of experience, delivering analytics solutions to enterprises around the world. He uniquely combines hands-on analytics experience with the ability to lead analytic projects and interpret results.

    Thomas's previous experience includes roles with SAS, IBM, The Boston Consulting Group, PricewaterhouseCoopers, and Oliver Wyman.

    Thomas coauthored Modern Analytics Methodologies and Advanced Analytics Methodologies, published in 2014 by Pearson FT Press, and is under contract for a forthcoming book on business analytics from Apress. He publishes The Big Analytics Blog at www.thomaswdinsmore.com.

    I would like to thank the entire editorial and production team at Packt Publishing, who work tirelessly to bring out quality books to the public.

    Cheng Lian is a Chinese software engineer and Apache Spark committer from Databricks. His major technical interests include big data analytics, distributed systems, and functional programming languages.

    Cheng is also the translator of the Chinese edition of Erlang and OTP in Action and Concurrent Programming in Erlang (Part I).

    I would like to thank Yi Tian from AsiaInfo for helping me review some parts of Chapter 6, Getting Started with Machine Learning Using MLlib.

    Amir Sedighi is an experienced software engineer, a keen learner, and a creative problem solver. His experience spans a wide range of software development areas, including cross-platform development, big data processing and data streaming, information retrieval, and machine learning. He is a big data lecturer and expert, working in Iran. He holds bachelor's and master's degrees in software engineering. Amir is currently the CEO of Rayanesh Dadegan Ekbatan, the company he cofounded in 2013 after several years of designing and implementing distributed big data and data streaming solutions for private sector companies.

    I would like to thank the entire team at Packt Publishing, who work hard to bring awesomeness to the books and the readers' professional life.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why Subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free Access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

    Preface

    The success of Hadoop as a big data platform raised user expectations, both for solving different analytics challenges and for reducing latency. Various tools evolved over time, but when Apache Spark arrived, it provided a single runtime to address all these challenges. It eliminated the need to combine multiple tools, each with its own challenges and learning curve. By using memory for persistent storage as well as compute, Apache Spark eliminates the need to store intermediate data on disk and increases processing speed by up to 100 times. It also provides a single runtime, which addresses various analytics needs such as machine learning and real-time streaming using various libraries.

    This book covers the installation and configuration of Apache Spark and building solutions using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX libraries.

    Note

    For more information on this book's recipes, please visit infoobjects.com/spark-cookbook.

    What this book covers

    Chapter 1, Getting Started with Apache Spark, explains how to install Spark on various environments and cluster managers.

    Chapter 2, Developing Applications with Spark, talks about developing Spark applications in different IDEs and with different build tools.

    Chapter 3, External Data Sources, covers how to read from and write to various data sources.

    Chapter 4, Spark SQL, takes you through the Spark SQL module that helps you to access the Spark functionality using the SQL interface.

    Chapter 5, Spark Streaming, explores the Spark
