Clojure for Data Science
By Henry Garner
Table of Contents
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Statistics
Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics
The mean
Interpreting mathematical notation
The median
Variance
Quantiles
Binning data
Histograms
The normal distribution
The central limit theorem
Poincaré's baker
Generating distributions
Skewness
Quantile-quantile plots
Comparative visualizations
Box plots
Cumulative distribution functions
The importance of visualizations
Visualizing electorate data
Adding columns
Adding derived columns
Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations
Probability mass functions
Scatter plots
Scatter transparency
Summary
2. Inference
Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution
The distribution of daily means
The central limit theorem
Standard error
Samples and populations
Confidence intervals
Sample comparisons
Bias
Visualizing different populations
Hypothesis testing
Significance
Testing a new site design
Performing a z-test
Student's t-distribution
Degrees of freedom
The t-statistic
Performing the t-test
Two-tailed tests
One-sample t-test
Resampling
Testing multiple designs
Calculating sample means
Multiple comparisons
Introducing the simulation
Compile the simulation
The browser simulation
jStat
B1
Scalable Vector Graphics
Plotting probability densities
State and Reagent
Updating state
Binding the interface
Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size
Cohen's d
Summary
3. Correlation
About the data
Inspecting the data
Visualizing the data
The log-normal distribution
Visualizing correlation
Jittering
Covariance
Pearson's correlation
Sample r and population rho
Hypothesis testing
Confidence intervals
Regression
Linear equations
Residuals
Ordinary least squares
Slope and intercept
Interpretation
Visualization
Assumptions
Goodness-of-fit and R-square
Multiple linear regression
Matrices
Dimensions
Vectors
Construction
Addition and scalar multiplication
Matrix-vector multiplication
Matrix-matrix multiplication
Transposition
The identity matrix
Inversion
The normal equation
More features
Multiple R-squared
Adjusted R-squared
Incanter's linear model
The F-test of model significance
Categorical and dummy variables
Relative power
Collinearity
Multicollinearity
Prediction
The confidence interval of a prediction
Model scope
The final model
Summary
4. Classification
About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion
Estimation using bootstrapping
The binomial distribution
The standard error of a proportion formula
Significance testing proportions
Adjusting standard errors for large samples
Chi-squared multiple significance testing
Visualizing the categories
The chi-squared test
The chi-squared statistic
The chi-squared test
Classification with logistic regression
The sigmoid function
The logistic regression cost function
Parameter optimization with gradient descent
Gradient descent with Incanter
Convexity
Implementing logistic regression with Incanter
Creating a feature matrix
Evaluating the logistic regression classifier
The confusion matrix
The kappa statistic
Probability
Bayes theorem
Bayes theorem with multiple predictors
Naive Bayes classification
Implementing a naive Bayes classifier
Evaluating the naive Bayes classifier
Comparing the logistic regression and naive Bayes approaches
Decision trees
Information
Entropy
Information gain
Using information gain to identify the best predictor
Recursively building a decision tree
Using the decision tree for classification
Evaluating the decision tree classifier
Classification with clj-ml
Loading data with clj-ml
Building a decision tree in clj-ml
Bias and variance
Overfitting
Cross-validation
Addressing high bias
Ensemble learning and random forests
Bagging and boosting
Saving the classifier to a file
Summary
5. Big Data
Downloading the code and data
Inspecting the data
Counting the records
The reducers library
Parallel folds with reducers
Loading large files with iota
Creating a reducers processing pipeline
Curried reductions with reducers
Statistical folds with reducers
Associativity
Calculating the mean using fold
Calculating the variance using fold
Mathematical folds with Tesser
Calculating covariance with Tesser
Commutativity
Simple linear regression with Tesser
Calculating a correlation matrix
Multiple regression with gradient descent
The gradient descent update rule
The gradient descent learning rate
Feature scaling
Feature extraction
Creating a custom Tesser fold
Creating a matrix-sum fold
Calculating the total model error
Creating a matrix-mean fold
Applying a single step of gradient descent
Running iterative gradient descent
Scaling gradient descent with Hadoop
Gradient descent on Hadoop with Tesser and Parkour
Parkour distributed sources and sinks
Running a feature scale fold with Hadoop
Running gradient descent with Hadoop
Preparing our code for a Hadoop cluster
Building an uberjar
Submitting the uberjar to Hadoop
Stochastic gradient descent
Stochastic gradient descent with Parkour
Defining a mapper
Parkour shaping functions
Defining a reducer
Specifying Hadoop jobs with Parkour graph
Chaining mappers and reducers with Parkour graph
Summary
6. Clustering
Downloading the data
Extracting the data
Inspecting the data
Clustering text
Set-of-words and the Jaccard index
Tokenizing the Reuters files
Applying the Jaccard index to documents
The bag-of-words and Euclidean distance
Representing text as vectors
Creating a dictionary
Creating term frequency vectors
The vector space model and cosine distance
Removing stop words
Stemming
Clustering with k-means and Incanter
Clustering the Reuters documents
Better clustering with TF-IDF
Zipf's law
Calculating the TF-IDF weight
k-means clustering with TF-IDF
Better clustering with n-grams
Large-scale clustering with Mahout
Converting text documents to a sequence file
Using Parkour to create Mahout vectors
Creating distributed unique IDs
Distributed unique IDs with Hadoop
Sharing data with the distributed cache
Building Mahout vectors from input documents
Running k-means clustering with Mahout
Viewing k-means clustering results
Interpreting the clustered output
Cluster evaluation measures
Inter-cluster density
Intra-cluster density
Calculating the root mean square error with Parkour
Loading clustered points and centroids
Calculating the cluster RMSE
Determining optimal k with the elbow method
Determining optimal k with the Dunn index
Determining optimal k with the Davies-Bouldin index
The drawbacks of k-means
The Mahalanobis distance measure
The curse of dimensionality
Summary
7. Recommender Systems
Download the code and data
Inspect the data
Parse the data
Types of recommender systems
Collaborative filtering
Item-based and user-based recommenders
Slope One recommenders
Calculating the item differences
Making recommendations
Practical considerations for user and item recommenders
Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout
Evaluating distance measures
The Pearson correlation similarity
Spearman's rank similarity
Determining optimum neighborhood size
Information retrieval statistics
Precision
Recall
Mahout's information retrieval evaluator
F-measure and the harmonic mean
Fall-out
Normalized discounted cumulative gain
Plotting the information retrieval results
Recommendation with Boolean preferences
Implicit versus explicit feedback
Probabilistic methods for large sets
Testing set membership with Bloom filters
Jaccard similarity for large sets with MinHash
Reducing pair comparisons with locality-sensitive hashing
Bucketing signatures
Dimensionality reduction
Plotting the Iris dataset
Principal component analysis
Singular value decomposition
Large-scale machine learning with Apache Spark and MLlib
Loading data with Sparkling
Mapping data
Distributed datasets and tuples
Filtering data
Persistence and caching
Machine learning on Spark with MLlib
Movie recommendations with alternating least squares
ALS with Spark and MLlib
Making predictions with ALS
Evaluating ALS
Calculating the sum of squared errors
Summary
8. Network Analysis
Download the data
Inspecting the data
Visualizing graphs with Loom
Graph traversal with Loom
The seven bridges of Königsberg
Breadth-first and depth-first search
Finding the shortest path
Minimum spanning trees
Subgraphs and connected components
SCC and the bow-tie structure of the web
Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX
Creating RDGs with Glittering
Measuring graph density with triangle counting
GraphX partitioning strategies
Running the built-in triangle counting algorithm
Implement triangle counting with Glittering
Step one – collecting neighbor IDs
Steps two, three, and four – aggregate messages
Step five – dividing the counts
Running the custom triangle counting algorithm
The Pregel API
Connected components with the Pregel API
Step one – map vertices
Steps two and three – the message function
Step four – update the attributes
Step five – iterate to convergence
Running connected components
Calculating the size of the largest connected component
Detecting communities with label propagation
Step one – map vertices
Step two – send the vertex attribute
Step three – aggregate value
Step four – vertex function
Step five – set the maximum iterations count
Running label propagation
Measuring community influence using PageRank
The flow formulation
Implementing PageRank with Glittering
Sort by highest influence
Running PageRank to determine community influencers
Summary
9. Time Series
About the data
Loading the Longley data
Fitting curves with a linear model
Time series decomposition
Inspecting the airline data
Visualizing the airline data
Stationarity
De-trending and differencing
Discrete time models
Random walks
Autoregressive models
Determining autocorrelation in AR models
Moving-average models
Determining autocorrelation in MA models
Combining the AR and MA models
Calculating partial autocorrelation
Autocovariance
PACF with Durbin-Levinson recursion
Plotting partial autocorrelation
Determining ARMA model order with ACF and PACF
ACF and PACF of airline data
Removing seasonality with differencing
Maximum likelihood estimation
Calculating the likelihood
Estimating the maximum likelihood
Nelder-Mead optimization with Apache Commons Math
Identifying better models with Akaike Information Criterion
Time series forecasting
Forecasting with Monte Carlo simulation
Summary
10. Visualization
Download the code and data
Exploratory data visualization
Representing a two-dimensional histogram
Using Quil for visualization
Drawing to the sketch window
Quil's coordinate system
Plotting the grid
Specifying the fill color
Color and fill
Outputting an image file
Visualization for communication
Visualizing wealth distribution
Bringing data to life with Quil
Drawing bars of differing widths
Adding a title and axis labels
Improving the clarity with illustrations
Adding text to the bars
Incorporating additional data
Drawing complex shapes
Drawing curves
Plotting compound charts
Output to PDF
Summary
Index
Clojure for Data Science
Clojure for Data Science
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2015
Production reference: 1280815
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-718-0
www.packtpub.com
Credits
Author
Henry Garner
Reviewer
Dan Hammer
Commissioning Editor
Ashwin Nair
Acquisition Editor
Meeta Rajani
Content Development Editor
Shubhangi Dhamgaye
Technical Editor
Shivani Kiran Mistry
Copy Editor
Akshata Lobo
Project Coordinator
Harshal Ved
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Graphics
Nicholas Garner
Disha Haria
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Henry Garner is a graduate of the University of Oxford and an experienced developer, CTO, and coach.
He started his technical career at Britain's largest telecoms provider, BT, working with a traditional data warehouse infrastructure. As part of a small team for three years, he built sophisticated data models to derive insight from raw data and built web applications to present the results. These applications were used internally by senior executives and operatives to track both business and systems performance.
He then went on to co-found Likely, a social media analytics start-up. As the CTO, he set the technical direction, leading to the introduction of an event-based append-only data pipeline modeled after the Lambda architecture. He adopted Clojure in 2011 and led a hybrid team of programmers and data scientists, building content recommendation engines based on collaborative filtering and clustering techniques. He developed a syllabus and copresented a series of evening classes from Likely's offices for professional developers who wanted to learn Clojure.
Henry now works with growing businesses, consulting in both a development and technical leadership capacity. He presents regularly at seminars and Clojure meetups in and around London.
Acknowledgments
Thank you Shubhangi Dhamgaye, Meeta Rajani, Shivani Mistry, and the entire team at Packt for their help in bringing this project to fruition. Without you, this book would never have come to pass.
I'm grateful to Dan Hammer, my Packt reviewer, for his valuable perspective as a practicing data scientist, and to those other brave souls who patiently read through the very rough early (and not-so-early) drafts. Foremost among these are Éléonore Mayola, Paul Butcher, and Jeremy Hoyland. Your feedback was not always easy to hear, but it made the book so much better than it would otherwise have been.
Thank you to the wonderful team at MastodonC who tackled a pre-release version of this book in their company book club, especially Éléonore Mayola, Jase Bell, and Elise Huard. I'm grateful to Francine Bennett for her advice early on—which helped to shape the structure of the book—and also to Bruce Durling, Neale Swinnerton, and Chris Adams for their company during the otherwise lonely weekends spent writing in the office.
Thank you to my friends from the machine learning study group: Sam Joseph, Geoff Hogg, and Ben Taylor for reading the early drafts and providing feedback suitable for Clojure newcomers; and also to Luke Snape and Tom Coupland of the Bristol Clojurians for providing the opportunity to test the material out on its intended audience.
A heartfelt thanks to my dad, Nicholas, for interpreting my vague scribbles into the fantastic figures you see in this book, and to my mum, Jacqueline, and sister, Mary, for being such patient listeners in the times I felt like thinking aloud. Last, but by no means least, thank you to the Nuggets of Wynford Road, Russell and Wendy, for the tea and sympathy whenever it occasionally became a bit too much. I look forward to seeing much more of you both from now on.
About the Reviewer
Dan Hammer is a presidential innovation fellow working on Data Innovation initiatives at the NASA headquarters in the CTO's office. Dan is an economist and data scientist. He was the chief data scientist at the World Resources Institute, where he launched Global Forest Watch in partnership with Google, USAID, and many others. Dan is on leave from a PhD program at UC Berkeley, advised by Max Auffhammer and George Judge. He teaches mathematics at the San Quentin State Prison as a lead instructor with the Prison University Project. Dan graduated with high honors in economics and mathematics from Swarthmore College, where he was a language scholar. He spent a full year building and racing Polynesian outrigger canoes in the South Pacific as a Watson Fellow. He has also reviewed Learning R for Geospatial Analysis by Packt Publishing.
Thanks to my wonderful wife Emily for suffering through my terrible jokes.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
For Helen.
You provided support, encouragement, and welcome distraction in roughly equal measure.
Preface
A web search for "data science Venn diagram" returns numerous interpretations of the skills required to be an effective data scientist (it appears that data science commentators love Venn diagrams). Author and data scientist Drew Conway produced the prototypical diagram back in 2010, putting data science at the intersection of hacking skills, substantive expertise (that is, subject domain understanding), and mathematics and statistics knowledge. Between hacking skills and substantive expertise—those practicing without strong mathematics and statistics knowledge—lies the danger zone.
Five years on, as a growing number of developers seek to plug the data science skills shortage, there's more need than ever for statistical and mathematical education to help developers out of this danger zone. So, when Packt Publishing invited me to write a book on data science suitable for Clojure programmers, I gladly agreed. In addition to appreciating the need for such a book, I saw it as an opportunity to consolidate much of what I had learned as CTO of my own Clojure-based data analytics company. The result is the book I wish I had been able to read before starting out.
Clojure for Data Science aims to be much more than just a book of statistics for Clojure programmers. A large reason for the spread of data science into so many diverse areas is the enormous power of machine learning. Throughout the book, I'll show how to use pure Clojure functions and third-party libraries to construct machine learning models for the primary tasks of regression, classification, clustering, and recommendation.
Approaches that scale to very large datasets, so-called "big data", are of particular interest to data scientists, because they can reveal subtleties that are lost in smaller samples. This book shows how Clojure can be used to concisely express jobs to run on the Hadoop and Spark distributed computation frameworks, and how to incorporate machine learning through the use of both dedicated external libraries and general optimization techniques.
Above all, this book aims to foster an understanding not just of how to perform particular types of analysis, but of why such techniques work. In addition to providing practical knowledge (almost every concept in this book is expressed as a runnable example), I aim to explain the theory that will allow you to take a principle and apply it to related problems. I hope that this approach will enable you to effectively apply statistical thinking in diverse situations well into the future, whether or not you decide to pursue a career in data science.
What this book covers
Chapter 1, Statistics, introduces Incanter, Clojure's primary statistical computing library used throughout the book. With reference to the data from the elections in the United Kingdom and Russia, we demonstrate the use of summary statistics and the value of statistical distributions while showing a variety of comparative visualizations.
Chapter 2, Inference, covers the difference between samples and populations, and statistics and parameters. We introduce hypothesis testing as a formal means of determining whether the differences are significant in the context of A/B testing website designs. We also cover sample bias, effect size, and solutions to the problem of multiple testing.
Chapter 3, Correlation, shows how we can discover linear relationships between variables and use the relationship to make predictions about some variables given others. We implement linear regression—a machine learning algorithm—to predict the weights of Olympic swimmers given their heights, using only core Clojure functions. We then make our model more sophisticated using matrices and more data to improve its accuracy.
Chapter 4, Classification, describes how to implement several different types of machine learning algorithm (logistic regression, naive Bayes, C4.5, and random forests) to make predictions about the survival rates of passengers on the Titanic. We learn about another test for statistical significance that works for categories instead of continuous values, explain various issues you're likely to encounter while training machine learning models such as bias and overfitting, and demonstrate how to use the clj-ml machine learning library.
Chapter 5, Big Data, shows how Clojure can leverage the parallel capabilities in computers of all sizes using the reducers library, and how to scale up these techniques to clusters of machines on Hadoop with Tesser and Parkour. Using ZIP code level tax data from the IRS, we demonstrate how to perform statistical analysis and machine learning in a scalable way.
Chapter 6, Clustering, shows how to identify text documents that share similar subject matter using Hadoop and the Java machine learning library, Mahout. We describe a variety of techniques particular to text processing as well as more general concepts related to clustering. We also introduce some more advanced features of Parkour that can help get the best performance from your Hadoop jobs.
Chapter 7, Recommender Systems, covers a variety of different approaches to the challenge of recommendation. In addition to implementing a recommender with core Clojure functions, we tackle the ancillary challenge of dimensionality reduction by using principal component analysis and singular value decomposition, as well as probabilistic set compression using Bloom filters and the MinHash algorithm. Finally, we introduce the Sparkling and MLlib libraries for machine learning on the Spark distributed computation framework and use them to produce movie recommendations with alternating least squares.
Chapter 8, Network Analysis, shows a variety of ways of analyzing graph-structured data. We demonstrate the methods of traversal using the Loom library and then show how to use the Glittering and GraphX libraries with Spark to discover communities and influencers in social networks.
Chapter 9, Time Series, demonstrates how to fit curves to simple time series data. Using data on the monthly airline passenger counts, we show how to forecast future values for more complex series by training an autoregressive moving-average model. We do this by implementing a method of parameter optimization called maximum likelihood estimation with help from the Apache Commons Math library.
Chapter 10, Visualization, shows how the Clojure library Quil can be used to create custom visualizations for charts not provided by Incanter, and attractive graphics that can communicate findings clearly to your audience, whatever their background.
What you need for this book
The code for each chapter has been made available as a project on GitHub at https://github.com/clojuredatascience. The example code can be downloaded as a zip file from there, or cloned with the Git command-line tool. All of the book's examples can be compiled and run with the Leiningen build tool as described in Chapter 1, Statistics.
This book assumes that you're already able to compile and run Clojure code using Leiningen (http://leiningen.org/). Refer to Leiningen's website if you're not yet set up to do this.
In addition, the code for many of the sample chapters makes use of external datasets. Where possible, these have been included together with the sample code. Where this has not been possible, instructions for downloading the data have been provided in the sample code's README file. Bash scripts have also been provided with the relevant sample code to automate this process. These can be run directly by Linux and OS X users, as described in the relevant chapter, provided the curl, wget, tar, gzip, and unzip utilities are installed. Windows users may have to install a Linux emulator such as Cygwin (https://www.cygwin.com/) to run the scripts.
Who this book is for
This book is intended for intermediate and advanced Clojure programmers who want to build their statistical knowledge, apply machine learning algorithms, or process large amounts of data with Hadoop and Spark. Many aspiring data scientists will benefit from learning all of these skills, and Clojure for Data Science is intended to be read in order from the beginning to the end. Readers who approach the book in this way will find that each chapter builds on concepts introduced in the prior chapters.
If you're not already comfortable reading Clojure code, you're likely to find this book particularly challenging. Fortunately, there are now many excellent resources for learning Clojure and I do not attempt to replicate their work here. At the time of writing, Clojure for the Brave and True (http://www.braveclojure.com/) is a fantastic free resource for learning the language. Consult http://clojure.org/getting_started for links to many other books and online tutorials suitable for newcomers.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Each example is a function in the cljds.ch1.examples namespace that can be run."
A block of code is set as follows:
(defmulti load-data identity)
(defmethod load-data :uk [_]
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
(q/fill (fill-fn x y))
(q/rect x-pos y-pos x-scale y-scale))
(q/save "heatmap.png"))]
(q/sketch :setup setup :size size))
Any command-line input or output is written as follows:
lein run -e 1.1
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Each time the New Sample button is pressed, a pair of new samples from an exponential distribution with population means taken from the sliders are generated."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/Clojure_for_Data_Science_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Chapter 1. Statistics
Over the course of the following ten chapters of Clojure for Data Science, we'll attempt to discover a broadly linear path through the field of data science. In fact, we'll find as we go that the path is not quite so linear, and the attentive reader ought to notice many recurring themes along the way.
Descriptive statistics concern themselves with summarizing sequences of numbers and they'll appear, to some extent, in every chapter in this book. In this chapter, we'll build foundations for what's to come by implementing functions to calculate the mean, median, variance, and standard deviation of numerical sequences in Clojure. While doing so, we'll attempt to take the fear out of interpreting mathematical formulae.
As soon as we have more than one number to analyze, it becomes meaningful to ask how those numbers are distributed. You've probably already heard expressions such as "the long tail" and "the 80/20 rule". They concern the spread of numbers throughout a range. We demonstrate the value of distributions in this chapter and introduce the most useful of them all: the normal distribution.
The study of distributions is aided immensely by visualization, and for this we'll use the Clojure library Incanter. We'll show how Incanter can be used to load, transform, and visualize real data. We'll compare the results of two national elections—the 2010 United Kingdom general election and the 2011 Russian presidential election—and see how even basic analysis can provide evidence of potentially fraudulent activity.
Downloading the sample code
All of the book's sample code is available on Packt Publishing's website at http://www.packtpub.com/support or from GitHub at http://github.com/clojuredatascience. Each chapter's sample code is available in its own repository.
Note
The sample code for Chapter 1, Statistics can be downloaded from https://github.com/clojuredatascience/ch1-statistics.
Executable examples are provided regularly throughout all chapters, either to demonstrate the effect of code that has just been explained, or to demonstrate statistical principles that have been introduced. All example function names begin with ex- and are numbered sequentially throughout each chapter. So, the first runnable example of Chapter 1, Statistics is named ex-1-1, the second is named ex-1-2, and so on.
Running the examples
Each example is a function in the cljds.ch1.examples namespace that can be run in two ways—either from the REPL or on the command line with Leiningen. If you'd like to run the examples in the REPL, you can execute:
lein repl
on the command line. By default, the REPL will open in the examples namespace. Alternatively, to run a specific numbered example, you can execute:
lein run --example 1.1
or pass the single-letter equivalent:
lein run -e 1.1
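From the REPL, running an example is simply a matter of calling the corresponding function. As a minimal sketch (assuming the REPL has opened in the examples namespace, as described above):
;; Inside lein repl, each example is an ordinary function call:
(ex-1-1)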
We only assume basic command-line familiarity throughout this book. The ability to run Leiningen and shell scripts is all that's required.
Tip
If you become stuck at any point, refer to the book's wiki at http://wiki.clojuredatascience.com. The wiki will provide troubleshooting tips for known issues, including advice for running examples on a variety of platforms.
In fact, shell scripts are only used for fetching data from remote locations automatically. The book's wiki will also provide alternative instructions for those not wishing or unable to execute the shell scripts.
Downloading the data
The dataset for this chapter has been made available by the Complex Systems Research Group at the Medical University of Vienna. The analysis we'll be performing closely mirrors their research to determine the signals of systematic election fraud in the national elections of countries around the world.
Note
For more information about the research, and for links to download other datasets, visit the book's wiki or the research group's website at http://www.complex-systems.meduniwien.ac.at/elections/election.html.
Throughout this book we'll be making use of numerous datasets. Where possible, we've included the data with the example code. Where this hasn't been possible—either because of the size of the data or due to licensing constraints—we've included a script to download the data instead.
Chapter 1, Statistics is just such a chapter. If you've cloned the chapter's code and intend to follow the examples, download the data now by executing the following on the command line from within the project's directory:
script/download-data.sh
The script will download and decompress the sample data into the project's data directory.
Tip
If you have any difficulty running the download script or would like to follow manual instructions instead, visit the book's wiki at http://wiki.clojuredatascience.com for assistance.
We'll begin investigating the data in the next section.
Inspecting the data
Throughout this chapter, and for many other chapters in this book, we'll be using the Incanter library (http://incanter.org/) to load, manipulate, and display data.
Incanter is a modular suite of Clojure libraries that provides statistical computing and visualization capabilities. Modeled after the extremely popular R environment for data analysis, it brings together the power of Clojure, an interactive REPL, and a set of powerful abstractions for working with data.
Each module of Incanter focuses on a specific area of functionality. For example, incanter-stats contains a suite of related functions for analyzing data and producing summary statistics, while incanter-charts provides a large number of visualization capabilities. incanter-core provides the most fundamental and generally useful functions for transforming data.
Each module can be included separately in your own code. For access to stats, charts, and Excel features, you could include the following in your project.clj:
:dependencies [[incanter/incanter-core "1.5.5"]
               [incanter/incanter-stats "1.5.5"]
               [incanter/incanter-charts "1.5.5"]
               [incanter/incanter-excel "1.5.5"]
               ...]
If you don't mind including more libraries than you need, you can simply include the full Incanter distribution instead:
:dependencies [[incanter/incanter "1.5.5"]
               ...]
At Incanter's core is the concept of a dataset—a structure of rows and columns. If you have experience with relational databases, you can think of a dataset as a table. Each column in a dataset is named, and each row in the dataset has the same number of columns as every other. There are several ways to load data into an Incanter dataset, and which we use will depend on how our data is stored:
If our data is a text file (a CSV or tab-delimited file), we can use the read-dataset function from incanter-io
If our data is an Excel file (for example, an .xls or .xlsx file), we can use the read-xls function from incanter-excel
For any other data source (an external database, website, and so on), as long as we can get our data into a Clojure data structure, we can create a dataset with the dataset function in incanter-core, as sketched below
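As a brief illustration of this last approach, here is a minimal sketch of the dataset function; the column names and rows below are illustrative values, not data from this chapter:
;; Building an Incanter dataset directly from column names and a
;; sequence of rows:
(i/dataset ["year" "constituencies"]
           [[2005 646]
            [2010 650]])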
This chapter makes use of Excel data sources, so we'll be using read-xls. The function takes one required argument—the file to load—and an optional keyword argument specifying the sheet number or name. All of our examples have only one sheet, so we'll just provide the file argument as a string:
(ns cljds.ch1.data
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.excel :as xls]))
In general, we will not reproduce the namespace declarations from the example code. This is both for brevity and because the required namespaces can usually be inferred by the symbol used to reference them. For example, throughout this book we will always refer to clojure.java.io as io, incanter.core as i, and incanter.excel as xls wherever they are used.
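For reference, the optional sheet argument mentioned earlier is passed as a keyword option alongside the filename. The following is only a sketch—the :sheet keyword reflects Incanter's documented option, but since all of our files contain a single sheet, we never need it in this chapter:
;; Reading a numbered (or named) sheet from a multi-sheet workbook:
(xls/read-xls (str (io/resource "UK2010.xls")) :sheet 0)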
We'll be loading several data sources throughout this chapter, so we've created a multimethod called load-data in the cljds.ch1.data namespace:
(defmulti load-data identity)

(defmethod load-data :uk [_]
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))
In the preceding code, we define the load-data multimethod that dispatches on the identity of the first argument. We also define the implementation that will be called if the first argument is :uk. Thus, a call to (load-data :uk) will return an Incanter dataset containing the UK data. Later in the chapter, we'll define additional load-data implementations for other datasets.
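To make the dispatch mechanism concrete, a further dataset could be registered as follows. This is only a sketch: the :ru keyword and filename are hypothetical placeholders here, and the chapter defines its real implementations where they are needed:
;; A hypothetical second implementation: dispatching on :ru loads a
;; different file through the same load-data entry point.
(defmethod load-data :ru [_]
  (-> (io/resource "Russia2011.xls") ; placeholder filename
      (str)
      (xls/read-xls)))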
The first row of the UK2010.xls spreadsheet contains column names. Incanter's read-xls function will preserve these as the column names of the returned dataset. Let's begin our exploration of the data by inspecting them now—the col-names function in incanter.core returns the column names as a vector. In the following code (and throughout the book, where we use functions from the incanter.core namespace) we require it as i:
(defn ex-1-1 []
  (i/col-names (load-data :uk)))
As described in the Running the examples section earlier, functions beginning with ex- can be run on the command line with Leiningen like this:
lein run -e 1.1
The output of the preceding command should be the following Clojure vector:
["Press Association Reference" "Constituency Name" "Region" "Election Year"
 "Electorate" "Votes" "AC" "AD" "AGS" "APNI" "APP" "AWL" "AWP" "BB" "BCP"
 "Bean" "Best" "BGPV" "BIB" "BIC" "Blue" "BNP" "BP Elvis" "C28" "Cam Soc"
 "CG" "Ch M" "Ch P" "CIP" "CITY" "CNPG" "Comm" "Comm L" "Con" "Cor D"
 "CPA" "CSP" "CTDP" "CURE" "D Lab" "D Nat" "DDP" "DUP" "ED" "EIP" "EPA"
 "FAWG" "FDP" "FFR" "Grn" "GSOT" "Hum" "ICHC" "IEAC" "IFED" "ILEU"
 "Impact" "Ind1" "Ind2" "Ind3" "Ind4" "Ind5" "IPT" "ISGB" "ISQM" "IUK"
 "IVH" "IZB" "JAC" "Joy" "JP" "Lab" "Land" "LD" "Lib" "Libert" "LIND"
 "LLPB" "LTT" "MACI" "MCP" "MEDI" "MEP" "MIF" "MK" "MPEA" "MRLP" "MRP"
 "Nat Lib" "NCDV" "ND" "New" "NF" "NFP" "NICF" "Nobody" "NSPS" "PBP" "PC"
 "Pirate" "PNDP" "Poet" "PPBF" "PPE" "PPNV" "Reform" "Respect" "Rest"
 "RRG" "RTBP" "SACL" "Sci" "SDLP" "SEP" "SF" "SIG" "SJP" "SKGP" "SMA"
 "SMRA" "SNP" "Soc" "Soc Alt" "Soc Dem" "Soc Lab" "South" "Speaker" "SSP"
 "TF" "TOC" "Trust" "TUSC" "TUV" "UCUNF" "UKIP" "UPS" "UV" "VCCA" "Vote"
 "Wessex Reg" "WRP" "You" "Youth" "YRDPL"]
This is a very wide dataset. The first six columns in the data file are described as follows; subsequent columns break the number of votes down by party:
Press Association Reference: This is a number identifying the constituency (voting district, represented by one MP)
Constituency Name: This is the common name given to the voting district
Region: This is the geographic region of the UK where the constituency is based
Election Year: This is the year in which the election was held
Electorate: This is the total number of people eligible to vote in the constituency
Votes: This is the total number of votes cast
Whenever we're confronted with new data, it's important to take time to understand it. In the absence of detailed data definitions, one way we could do this is to begin by validating our assumptions about the data. For example, we expect that this dataset contains information about the 2010 election, so let's review the contents of the Election Year column.
Incanter provides the i/$ function (i, as before, signifying the incanter.core namespace) for selecting columns from a dataset. We'll encounter the function regularly throughout this chapter—it's Incanter's primary way of selecting columns from a variety of data representations and it provides several different arities. For now, we'll be providing just the name of the column we'd like to extract and the dataset from which to extract it:
(defn ex-1-2 []
  (i/$ "Election Year" (load-data :uk)))
;; (2010.0 2010.0 2010.0 2010.0 2010.0 ... 2010.0 2010.0 nil)
The years are returned as a single sequence of values. The output may be hard to interpret since the dataset contains so many rows. As we'd like to know which unique values the column contains, we can use the Clojure core function distinct. One of the advantages of using Incanter is that its useful data manipulation functions augment those that Clojure already provides, as shown in the following example:
(defn ex-1-3 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (distinct)))
;; (2010.0 nil)
The value 2010 goes a long way to confirming our expectations that this data is from 2010. The nil value is unexpected, though, and may indicate a problem with our data.
We don't yet know how many nils exist in the dataset and determining this could help us decide what to do next. A simple way of counting values such as this is to use the core library function frequencies, which returns a map of values to counts:
(defn ex-1-4 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (frequencies)))
;; {2010.0 650, nil 1}
In the preceding examples, we used Clojure's thread-last macro ->> to chain several functions together for legibility.
Tip
Along with Clojure's large core library of data manipulation functions, macros such as the one discussed earlier—including the thread-last macro ->>—are other great reasons for using Clojure to analyze data. Throughout this book, we'll see how Clojure can make even sophisticated analysis concise and comprehensible.
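To see exactly what the thread-last macro does, compare the threaded form of the previous example with its nested equivalent (a sketch for illustration only):
;; The threaded version passes each result as the last argument of the
;; next form...
(->> (load-data :uk)
     (i/$ "Election Year")
     (frequencies))

;; ...and is equivalent to this nested version:
(frequencies (i/$ "Election Year" (load-data :uk)))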
It wouldn't take us long to confirm that in 2010 the UK had 650 electoral districts, known as constituencies. Domain knowledge such as this is invaluable when sanity-checking new data. Thus, it's highly probable that the nil value is extraneous and can be removed. We'll see how to do this in the next section.
Data scrubbing
It is a commonly repeated statistic that at least 80 percent of a data scientist's work is data scrubbing. This is the process of detecting potentially corrupt or incorrect data and either correcting or filtering it out.
Note
Data scrubbing is one of the most important (and time-consuming) aspects of working with data. It's a key step to ensuring that subsequent analysis is performed on data that is valid, accurate, and consistent.
The nil value at the end of the election year column may indicate dirty data that ought to be removed. We've already seen that filtering columns of data can be accomplished with Incanter's i/$ function. For filtering rows of data we can use Incanter's i/query-dataset function.
We let Incanter know which rows we'd like it to filter by passing a Clojure map of column names and predicates. Only rows for which all predicates return true will be retained. For example, to select only the nil values from our dataset:
(-> (load-data :uk)
    (i/query-dataset {"Election Year" {:$eq nil}}))
If you know SQL, you'll notice this is very similar to a WHERE clause. In fact, Incanter also provides the i/$where function, an alias to i/query-dataset that reverses the order of the arguments.
The query is a map of column names to predicates and each predicate is itself a map of operator to operand. Complex queries can be constructed by specifying multiple columns and multiple operators together. Query operators include:
:$gt greater than
:$lt less than
:$gte greater than or equal to
:$lte less than or equal to
:$eq equal to
:$ne not equal to
:$in to test for membership of a collection
:$nin to test for non-membership of a collection
:$fn a predicate function that should return a true response for rows to keep
If none of the built-in operators suffice, the last operator provides the ability to pass a custom function instead.
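As a sketch of this last operator, the following query uses :$fn with a predicate applied to each row's value for the named column. The choice of predicate here is illustrative; it keeps only rows whose election year is a number, achieving the same effect as :$ne with nil:
;; Keep rows for which (number? value) returns true:
(->> (load-data :uk)
     (i/$where {"Election Year" {:$fn number?}}))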
We'll continue to use Clojure's thread-last macro to make the intent of the code a little clearer, and we'll return the row as a map of keys and values using the i/to-map function:
(defn ex-1-5 []
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$eq nil}})
       (i/to-map)))
;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}
Looking at the results carefully, it's apparent that all but one of the columns in this row are nil. In fact, a bit of further exploration confirms that this row is a summary total and ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne operator, returning only rows where the election year is not equal to nil:
(->> (load-data :uk)
     (i/$where {"Election Year" {:$ne nil}}))
The preceding function is one we'll almost always want to make sure we call in advance of using the data. One way of doing this is to add another implementation of our load-data multimethod, which also includes this filtering step:
(defmethod load-data :uk-scrubbed [_]
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$ne nil}})))
Now, with any code we write, we can choose whether to refer to the :uk or :uk-scrubbed datasets.
By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us—and future readers of our code—what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.
Descriptive statistics
Descriptive statistics are numbers that are used to summarize and describe data. In the next chapter, we'll turn our attention to a more sophisticated analysis, the so-called inferential statistics, but for now we'll limit ourselves to simply describing what we can observe about the data contained in the file.
To demonstrate what we mean, let's look at the Electorate column of the data. This column lists the total number of registered voters in each constituency:
(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))
;; 650
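The count agrees with the 650 constituencies we confirmed earlier. The simplest summary of a sequence of numbers such as this is the mean. As a minimal sketch using only core Clojure (its result for the electorate data is not shown here):
;; The arithmetic mean: the sum of the values divided by their count.
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

;; Applied to the sizes of the UK electorates:
(->> (load-data :uk-scrubbed)
     (i/$ "Electorate")
     (mean))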