Clojure for Data Science
Ebook, 1,134 pages, 9 hours

About this ebook

This book is aimed at developers who are already productive in Clojure but who are overwhelmed by the breadth and depth of understanding required to be effective in the field of data science. Whether you're tasked with delivering a specific analytics project or simply suspect that you could be deriving more value from your data, this book will inspire you with the opportunities—and inform you of the risks—that exist in data of all shapes and sizes.
Language: English
Release date: Sep 3, 2015
ISBN: 9781784397500

    Book preview

    Clojure for Data Science - Henry Garner

    Table of Contents

    Clojure for Data Science

    Credits

    About the Author

    Acknowledgments

    About the Reviewer

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Statistics

    Downloading the sample code

    Running the examples

    Downloading the data

    Inspecting the data

    Data scrubbing

    Descriptive statistics

    The mean

    Interpreting mathematical notation

    The median

    Variance

    Quantiles

    Binning data

    Histograms

    The normal distribution

    The central limit theorem

    Poincaré's baker

    Generating distributions

    Skewness

    Quantile-quantile plots

    Comparative visualizations

    Box plots

    Cumulative distribution functions

    The importance of visualizations

    Visualizing electorate data

    Adding columns

    Adding derived columns

    Comparative visualizations of electorate data

    Visualizing the Russian election data

    Comparative visualizations

    Probability mass functions

    Scatter plots

    Scatter transparency

    Summary

    2. Inference

    Introducing AcmeContent

    Download the sample code

    Load and inspect the data

    Visualizing the dwell times

    The exponential distribution

    The distribution of daily means

    The central limit theorem

    Standard error

    Samples and populations

    Confidence intervals

    Sample comparisons

    Bias

    Visualizing different populations

    Hypothesis testing

    Significance

    Testing a new site design

    Performing a z-test

    Student's t-distribution

    Degrees of freedom

    The t-statistic

    Performing the t-test

    Two-tailed tests

    One-sample t-test

    Resampling

    Testing multiple designs

    Calculating sample means

    Multiple comparisons

    Introducing the simulation

    Compile the simulation

    The browser simulation

    jStat

    B1

    Scalable Vector Graphics

    Plotting probability densities

    State and Reagent

    Updating state

    Binding the interface

    Simulating multiple tests

    The Bonferroni correction

    Analysis of variance

    The F-distribution

    The F-statistic

    The F-test

    Effect size

    Cohen's d

    Summary

    3. Correlation

    About the data

    Inspecting the data

    Visualizing the data

    The log-normal distribution

    Visualizing correlation

    Jittering

    Covariance

    Pearson's correlation

    Sample r and population rho

    Hypothesis testing

    Confidence intervals

    Regression

    Linear equations

    Residuals

    Ordinary least squares

    Slope and intercept

    Interpretation

    Visualization

    Assumptions

    Goodness-of-fit and R-square

    Multiple linear regression

    Matrices

    Dimensions

    Vectors

    Construction

    Addition and scalar multiplication

    Matrix-vector multiplication

    Matrix-matrix multiplication

    Transposition

    The identity matrix

    Inversion

    The normal equation

    More features

    Multiple R-squared

    Adjusted R-squared

    Incanter's linear model

    The F-test of model significance

    Categorical and dummy variables

    Relative power

    Collinearity

    Multicollinearity

    Prediction

    The confidence interval of a prediction

    Model scope

    The final model

    Summary

    4. Classification

    About the data

    Inspecting the data

    Comparisons with relative risk and odds

    The standard error of a proportion

    Estimation using bootstrapping

    The binomial distribution

    The standard error of a proportion formula

    Significance testing proportions

    Adjusting standard errors for large samples

    Chi-squared multiple significance testing

    Visualizing the categories

    The chi-squared test

    The chi-squared statistic

    The chi-squared test

    Classification with logistic regression

    The sigmoid function

    The logistic regression cost function

    Parameter optimization with gradient descent

    Gradient descent with Incanter

    Convexity

    Implementing logistic regression with Incanter

    Creating a feature matrix

    Evaluating the logistic regression classifier

    The confusion matrix

    The kappa statistic

    Probability

    Bayes theorem

    Bayes theorem with multiple predictors

    Naive Bayes classification

    Implementing a naive Bayes classifier

    Evaluating the naive Bayes classifier

    Comparing the logistic regression and naive Bayes approaches

    Decision trees

    Information

    Entropy

    Information gain

    Using information gain to identify the best predictor

    Recursively building a decision tree

    Using the decision tree for classification

    Evaluating the decision tree classifier

    Classification with clj-ml

    Loading data with clj-ml

    Building a decision tree in clj-ml

    Bias and variance

    Overfitting

    Cross-validation

    Addressing high bias

    Ensemble learning and random forests

    Bagging and boosting

    Saving the classifier to a file

    Summary

    5. Big Data

    Downloading the code and data

    Inspecting the data

    Counting the records

    The reducers library

    Parallel folds with reducers

    Loading large files with iota

    Creating a reducers processing pipeline

    Curried reductions with reducers

    Statistical folds with reducers

    Associativity

    Calculating the mean using fold

    Calculating the variance using fold

    Mathematical folds with Tesser

    Calculating covariance with Tesser

    Commutativity

    Simple linear regression with Tesser

    Calculating a correlation matrix

    Multiple regression with gradient descent

    The gradient descent update rule

    The gradient descent learning rate

    Feature scaling

    Feature extraction

    Creating a custom Tesser fold

    Creating a matrix-sum fold

    Calculating the total model error

    Creating a matrix-mean fold

    Applying a single step of gradient descent

    Running iterative gradient descent

    Scaling gradient descent with Hadoop

    Gradient descent on Hadoop with Tesser and Parkour

    Parkour distributed sources and sinks

    Running a feature scale fold with Hadoop

    Running gradient descent with Hadoop

    Preparing our code for a Hadoop cluster

    Building an uberjar

    Submitting the uberjar to Hadoop

    Stochastic gradient descent

    Stochastic gradient descent with Parkour

    Defining a mapper

    Parkour shaping functions

    Defining a reducer

    Specifying Hadoop jobs with Parkour graph

    Chaining mappers and reducers with Parkour graph

    Summary

    6. Clustering

    Downloading the data

    Extracting the data

    Inspecting the data

    Clustering text

    Set-of-words and the Jaccard index

    Tokenizing the Reuters files

    Applying the Jaccard index to documents

    The bag-of-words and Euclidean distance

    Representing text as vectors

    Creating a dictionary

    Creating term frequency vectors

    The vector space model and cosine distance

    Removing stop words

    Stemming

    Clustering with k-means and Incanter

    Clustering the Reuters documents

    Better clustering with TF-IDF

    Zipf's law

    Calculating the TF-IDF weight

    k-means clustering with TF-IDF

    Better clustering with n-grams

    Large-scale clustering with Mahout

    Converting text documents to a sequence file

    Using Parkour to create Mahout vectors

    Creating distributed unique IDs

    Distributed unique IDs with Hadoop

    Sharing data with the distributed cache

    Building Mahout vectors from input documents

    Running k-means clustering with Mahout

    Viewing k-means clustering results

    Interpreting the clustered output

    Cluster evaluation measures

    Inter-cluster density

    Intra-cluster density

    Calculating the root mean square error with Parkour

    Loading clustered points and centroids

    Calculating the cluster RMSE

    Determining optimal k with the elbow method

    Determining optimal k with the Dunn index

    Determining optimal k with the Davies-Bouldin index

    The drawbacks of k-means

    The Mahalanobis distance measure

    The curse of dimensionality

    Summary

    7. Recommender Systems

    Download the code and data

    Inspect the data

    Parse the data

    Types of recommender systems

    Collaborative filtering

    Item-based and user-based recommenders

    Slope One recommenders

    Calculating the item differences

    Making recommendations

    Practical considerations for user and item recommenders

    Building a user-based recommender with Mahout

    k-nearest neighbors

    Recommender evaluation with Mahout

    Evaluating distance measures

    The Pearson correlation similarity

    Spearman's rank similarity

    Determining optimum neighborhood size

    Information retrieval statistics

    Precision

    Recall

    Mahout's information retrieval evaluator

    F-measure and the harmonic mean

    Fall-out

    Normalized discounted cumulative gain

    Plotting the information retrieval results

    Recommendation with Boolean preferences

    Implicit versus explicit feedback

    Probabilistic methods for large sets

    Testing set membership with Bloom filters

    Jaccard similarity for large sets with MinHash

    Reducing pair comparisons with locality-sensitive hashing

    Bucketing signatures

    Dimensionality reduction

    Plotting the Iris dataset

    Principal component analysis

    Singular value decomposition

    Large-scale machine learning with Apache Spark and MLlib

    Loading data with Sparkling

    Mapping data

    Distributed datasets and tuples

    Filtering data

    Persistence and caching

    Machine learning on Spark with MLlib

    Movie recommendations with alternating least squares

    ALS with Spark and MLlib

    Making predictions with ALS

    Evaluating ALS

    Calculating the sum of squared errors

    Summary

    8. Network Analysis

    Download the data

    Inspecting the data

    Visualizing graphs with Loom

    Graph traversal with Loom

    The seven bridges of Königsberg

    Breadth-first and depth-first search

    Finding the shortest path

    Minimum spanning trees

    Subgraphs and connected components

    SCC and the bow-tie structure of the web

    Whole-graph analysis

    Scale-free networks

    Distributed graph computation with GraphX

    Creating RDGs with Glittering

    Measuring graph density with triangle counting

    GraphX partitioning strategies

    Running the built-in triangle counting algorithm

    Implement triangle counting with Glittering

    Step one – collecting neighbor IDs

    Steps two, three, and four – aggregate messages

    Step five – dividing the counts

    Running the custom triangle counting algorithm

    The Pregel API

    Connected components with the Pregel API

    Step one – map vertices

    Steps two and three – the message function

    Step four – update the attributes

    Step five – iterate to convergence

    Running connected components

    Calculating the size of the largest connected component

    Detecting communities with label propagation

    Step one – map vertices

    Step two – send the vertex attribute

    Step three – aggregate value

    Step four – vertex function

    Step five – set the maximum iterations count

    Running label propagation

    Measuring community influence using PageRank

    The flow formulation

    Implementing PageRank with Glittering

    Sort by highest influence

    Running PageRank to determine community influencers

    Summary

    9. Time Series

    About the data

    Loading the Longley data

    Fitting curves with a linear model

    Time series decomposition

    Inspecting the airline data

    Visualizing the airline data

    Stationarity

    De-trending and differencing

    Discrete time models

    Random walks

    Autoregressive models

    Determining autocorrelation in AR models

    Moving-average models

    Determining autocorrelation in MA models

    Combining the AR and MA models

    Calculating partial autocorrelation

    Autocovariance

    PACF with Durbin-Levinson recursion

    Plotting partial autocorrelation

    Determining ARMA model order with ACF and PACF

    ACF and PACF of airline data

    Removing seasonality with differencing

    Maximum likelihood estimation

    Calculating the likelihood

    Estimating the maximum likelihood

    Nelder-Mead optimization with Apache Commons Math

    Identifying better models with Akaike Information Criterion

    Time series forecasting

    Forecasting with Monte Carlo simulation

    Summary

    10. Visualization

    Download the code and data

    Exploratory data visualization

    Representing a two-dimensional histogram

    Using Quil for visualization

    Drawing to the sketch window

    Quil's coordinate system

    Plotting the grid

    Specifying the fill color

    Color and fill

    Outputting an image file

    Visualization for communication

    Visualizing wealth distribution

    Bringing data to life with Quil

    Drawing bars of differing widths

    Adding a title and axis labels

    Improving the clarity with illustrations

    Adding text to the bars

    Incorporating additional data

    Drawing complex shapes

    Drawing curves

    Plotting compound charts

    Output to PDF

    Summary

    Index

    Clojure for Data Science


    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: September 2015

    Production reference: 1280815

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78439-718-0

    www.packtpub.com

    Credits

    Author

    Henry Garner

    Reviewer

    Dan Hammer

    Commissioning Editor

    Ashwin Nair

    Acquisition Editor

    Meeta Rajani

    Content Development Editor

    Shubhangi Dhamgaye

    Technical Editor

    Shivani Kiran Mistry

    Copy Editor

    Akshata Lobo

    Project Coordinator

    Harshal Ved

    Proofreader

    Safis Editing

    Indexer

    Monica Ajmera Mehta

    Graphics

    Nicholas Garner

    Disha Haria

    Production Coordinator

    Arvindkumar Gupta

    Cover Work

    Arvindkumar Gupta

    About the Author

    Henry Garner is a graduate of the University of Oxford and an experienced developer, CTO, and coach.

    He started his technical career at Britain's largest telecoms provider, BT, working with a traditional data warehouse infrastructure. As part of a small team for 3 years, he built sophisticated data models to derive insight from raw data and used web applications to present the results. These applications were used internally by senior executives and operatives to track both business and systems performance.

    He then went on to co-found Likely, a social media analytics start-up. As the CTO, he set the technical direction, leading to the introduction of an event-based append-only data pipeline modeled after the Lambda architecture. He adopted Clojure in 2011 and led a hybrid team of programmers and data scientists, building content recommendation engines based on collaborative filtering and clustering techniques. He developed a syllabus and copresented a series of evening classes from Likely's offices for professional developers who wanted to learn Clojure.

    Henry now works with growing businesses, consulting in both a development and technical leadership capacity. He presents regularly at seminars and Clojure meetups in and around London.

    Acknowledgments

    Thank you Shubhangi Dhamgaye, Meeta Rajani, Shivani Mistry, and the entire team at Packt for their help in bringing this project to fruition. Without you, this book would never have come to pass.

    I'm grateful to Dan Hammer, my Packt reviewer, for his valuable perspective as a practicing data scientist, and to those other brave souls who patiently read through the very rough early (and not-so-early) drafts. Foremost among these are Éléonore Mayola, Paul Butcher, and Jeremy Hoyland. Your feedback was not always easy to hear, but it made the book so much better than it would otherwise have been.

    Thank you to the wonderful team at MastodonC who tackled a pre-release version of this book in their company book club, especially Éléonore Mayola, Jase Bell, and Elise Huard. I'm grateful to Francine Bennett for her advice early on—which helped to shape the structure of the book—and also to Bruce Durling, Neale Swinnerton, and Chris Adams for their company during the otherwise lonely weekends spent writing in the office.

    Thank you to my friends from the machine learning study group: Sam Joseph, Geoff Hogg, and Ben Taylor for reading the early drafts and providing feedback suitable for Clojure newcomers; and also to Luke Snape and Tom Coupland of the Bristol Clojurians for providing the opportunity to test the material out on its intended audience.

    A heartfelt thanks to my dad, Nicholas, for interpreting my vague scribbles into the fantastic figures you see in this book, and to my mum, Jacqueline, and sister, Mary, for being such patient listeners in the times I felt like thinking aloud. Last, but by no means least, thank you to the Nuggets of Wynford Road, Russell and Wendy, for the tea and sympathy whenever it occasionally became a bit too much. I look forward to seeing much more of you both from now on.

    About the Reviewer

    Dan Hammer is a presidential innovation fellow working on Data Innovation initiatives at the NASA headquarters in the CTO's office. Dan is an economist and data scientist. He was the chief data scientist at the World Resources Institute, where he launched Global Forest Watch in partnership with Google, USAID, and many others. Dan is on leave from a PhD program at UC Berkeley, where he is advised by Max Auffhammer and George Judge. He teaches mathematics at the San Quentin State Prison as a lead instructor with the Prison University Project. Dan graduated with high honors in economics and mathematics from Swarthmore College, where he was a language scholar. He spent a full year building and racing Polynesian outrigger canoes in the South Pacific as a Watson Fellow. He has also reviewed Learning R for Geospatial Analysis by Packt Publishing.

    Thanks to my wonderful wife Emily for suffering through my terrible jokes.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    For Helen.

    You provided support, encouragement, and welcome distraction in roughly equal measure.

    Preface

    A web search for "data science Venn diagram" returns numerous interpretations of the skills required to be an effective data scientist (it appears that data science commentators love Venn diagrams). Author and data scientist Drew Conway produced the prototypical diagram back in 2010, putting data science at the intersection of hacking skills, substantive expertise (that is, subject domain understanding), and mathematics and statistics knowledge. Between hacking skills and substantive expertise, where people practice without strong mathematics and statistics knowledge, lies the "danger zone".

    Five years on, as a growing number of developers seek to plug the data science skills shortage, there's more need than ever for statistical and mathematical education to help developers out of this danger zone. So, when Packt Publishing invited me to write a book on data science suitable for Clojure programmers, I gladly agreed. In addition to appreciating the need for such a book, I saw it as an opportunity to consolidate much of what I had learned as CTO of my own Clojure-based data analytics company. The result is the book I wish I had been able to read before starting out.

    Clojure for Data Science aims to be much more than just a book of statistics for Clojure programmers. A large reason for the spread of data science into so many diverse areas is the enormous power of machine learning. Throughout the book, I'll show how to use pure Clojure functions and third-party libraries to construct machine learning models for the primary tasks of regression, classification, clustering, and recommendation.

    Approaches that scale to very large datasets, so-called big data, are of particular interest to data scientists, because they can reveal subtleties that are lost in smaller samples. This book shows how Clojure can be used to concisely express jobs to run on the Hadoop and Spark distributed computation frameworks, and how to incorporate machine learning through the use of both dedicated external libraries and general optimization techniques.

    Above all, this book aims to foster an understanding not just of how to perform particular types of analysis, but of why such techniques work. In addition to providing practical knowledge (almost every concept in this book is expressed as a runnable example), I aim to explain the theory that will allow you to take a principle and apply it to related problems. I hope that this approach will enable you to effectively apply statistical thinking in diverse situations well into the future, whether or not you decide to pursue a career in data science.

    What this book covers

    Chapter 1, Statistics, introduces Incanter, Clojure's primary statistical computing library used throughout the book. With reference to the data from the elections in the United Kingdom and Russia, we demonstrate the use of summary statistics and the value of statistical distributions while showing a variety of comparative visualizations.

    Chapter 2, Inference, covers the difference between samples and populations, and between statistics and parameters. We introduce hypothesis testing as a formal means of determining whether the differences are significant in the context of A/B testing website designs. We also cover sample bias, effect size, and solutions to the problem of multiple testing.

    Chapter 3, Correlation, shows how we can discover linear relationships between variables and use the relationship to make predictions about some variables given others. We implement linear regression—a machine learning algorithm—to predict the weights of Olympic swimmers given their heights, using only core Clojure functions. We then make our model more sophisticated using matrices and more data to improve its accuracy.

    Chapter 4, Classification, describes how to implement several different types of machine learning algorithm (logistic regression, naive Bayes, C4.5, and random forests) to make predictions about the survival rates of passengers on the Titanic. We learn about another test for statistical significance that works for categories instead of continuous values, explain various issues you're likely to encounter while training machine learning models such as bias and overfitting, and demonstrate how to use the clj-ml machine learning library.

    Chapter 5, Big Data, shows how Clojure can leverage the parallel capabilities in computers of all sizes using the reducers library, and how to scale up these techniques to clusters of machines on Hadoop with Tesser and Parkour. Using ZIP code level tax data from the IRS, we demonstrate how to perform statistical analysis and machine learning in a scalable way.

    Chapter 6, Clustering, shows how to identify text documents that share similar subject matter using Hadoop and the Java machine learning library, Mahout. We describe a variety of techniques particular to text processing as well as more general concepts related to clustering. We also introduce some more advanced features of Parkour that can help get the best performance from your Hadoop jobs.

    Chapter 7, Recommender Systems, covers a variety of different approaches to the challenge of recommendation. In addition to implementing a recommender with core Clojure functions, we tackle the ancillary challenge of dimensionality reduction by using principal component analysis and singular value decomposition, as well as probabilistic set compression using Bloom filters and the MinHash algorithm. Finally, we introduce the Sparkling and MLlib libraries for machine learning on the Spark distributed computation framework and use them to produce movie recommendations with alternating least squares.

    Chapter 8, Network Analysis, shows a variety of ways of analyzing graph-structured data. We demonstrate the methods of traversal using the Loom library and then show how to use the Glittering and GraphX libraries with Spark to discover communities and influencers in social networks.

    Chapter 9, Time Series, demonstrates how to fit curves to simple time series data. Using data on the monthly airline passenger counts, we show how to forecast future values for more complex series by training an autoregressive moving-average model. We do this by implementing a method of parameter optimization called maximum likelihood estimation with help from the Apache Commons Math library.

    Chapter 10, Visualization, shows how the Clojure library Quil can be used to create custom visualizations for charts not provided by Incanter, and attractive graphics that can communicate findings clearly to your audience, whatever their background.

    What you need for this book

    The code for each chapter has been made available as a project on GitHub at https://github.com/clojuredatascience. The example code can be downloaded as a zip file from there, or cloned with the Git command-line tool. All of the book's examples can be compiled and run with the Leiningen build tool as described in Chapter 1, Statistics.

    This book assumes that you're already able to compile and run Clojure code using Leiningen (http://leiningen.org/). Refer to Leiningen's website if you're not yet set up to do this.

    In addition, the code for many of the sample chapters makes use of external datasets. Where possible, these have been included together with the sample code. Where this has not been possible, instructions for downloading the data have been provided in the sample code's README file. Bash scripts have also been provided with the relevant sample code to automate this process. These can be run directly by Linux and OS X users, as described in the relevant chapter, provided the curl, wget, tar, gzip, and unzip utilities are installed. Windows users may have to install a Linux emulator such as Cygwin (https://www.cygwin.com/) to run the scripts.

    Who this book is for

    This book is intended for intermediate and advanced Clojure programmers who want to build their statistical knowledge, apply machine learning algorithms, or process large amounts of data with Hadoop and Spark. Many aspiring data scientists will benefit from learning all of these skills, and Clojure for Data Science is intended to be read in order from the beginning to the end. Readers who approach the book in this way will find that each chapter builds on concepts introduced in the prior chapters.

    If you're not already comfortable reading Clojure code, you're likely to find this book particularly challenging. Fortunately, there are now many excellent resources for learning Clojure and I do not attempt to replicate their work here. At the time of writing, Clojure for the Brave and True (http://www.braveclojure.com/) is a fantastic free resource for learning the language. Consult http://clojure.org/getting_started for links to many other books and online tutorials suitable for newcomers.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Each example is a function in the cljds.ch1.examples namespace that can be run."

    A block of code is set as follows:

    (defmulti load-data identity)

    (defmethod load-data :uk [_]
      (-> (io/resource "UK2010.xls")
          (str)
          (xls/read-xls)))

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

        (q/fill (fill-fn x y))
        (q/rect x-pos y-pos x-scale y-scale))
        (q/save "heatmap.png"))]

        (q/sketch :setup setup :size size))

    Any command-line input or output is written as follows:

    lein run -e 1.1

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Each time the New Sample button is pressed, a pair of new samples from an exponential distribution with population means taken from the sliders are generated.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/Clojure_for_Data_Science_ColorImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <[email protected]> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

    Chapter 1. Statistics

    Over the course of the following ten chapters of Clojure for Data Science, we'll attempt to discover a broadly linear path through the field of data science. In fact, we'll find as we go that the path is not quite so linear, and the attentive reader ought to notice many recurring themes along the way.

    Descriptive statistics concern themselves with summarizing sequences of numbers and they'll appear, to some extent, in every chapter in this book. In this chapter, we'll build foundations for what's to come by implementing functions to calculate the mean, median, variance, and standard deviation of numerical sequences in Clojure. While doing so, we'll attempt to take the fear out of interpreting mathematical formulae.

    As soon as we have more than one number to analyze it becomes meaningful to ask how those numbers are distributed. You've probably already heard expressions such as long tail and the 80/20 rule. They concern the spread of numbers throughout a range. We demonstrate the value of distributions in this chapter and introduce the most useful of them all: the normal distribution.

    The study of distributions is aided immensely by visualization, and for this we'll use the Clojure library Incanter. We'll show how Incanter can be used to load, transform, and visualize real data. We'll compare the results of two national elections, the 2010 United Kingdom general election and the 2011 Russian legislative election, and see how even basic analysis can provide evidence of potentially fraudulent activity.

    Downloading the sample code

    All of the book's sample code is available on Packt Publishing's website at http://www.packtpub.com/support or from GitHub at http://github.com/clojuredatascience. Each chapter's sample code is available in its own repository.

    Note

    The sample code for Chapter 1, Statistics can be downloaded from https://github.com/clojuredatascience/ch1-statistics.

    Executable examples are provided regularly throughout all chapters, either to demonstrate the effect of code that has just been explained, or to demonstrate statistical principles that have been introduced. All example function names begin with ex- and are numbered sequentially throughout each chapter. So, the first runnable example of Chapter 1, Statistics is named ex-1-1, the second is named ex-1-2, and so on.

    Running the examples

    Each example is a function in the cljds.ch1.examples namespace that can be run in two ways—either from the REPL or on the command line with Leiningen. If you'd like to run the examples in the REPL, you can execute:

    lein repl

    on the command line. By default, the REPL will open in the examples namespace. Alternatively, to run a specific numbered example, you can execute:

    lein run --example 1.1

    or pass the single-letter equivalent:

    lein run -e 1.1

    We only assume basic command-line familiarity throughout this book. The ability to run Leiningen and shell scripts is all that's required.
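
    The naming convention makes it straightforward to map a command-line argument such as 1.1 onto a function name such as ex-1-1. The following is a minimal sketch of how such a lookup might work. The names example-fn-name and run-example, and the stand-in body of ex-1-1, are hypothetical and are not taken from the book's sample code, which may resolve examples differently.

```clojure
(require '[clojure.string :as str])

;; Remember the namespace in which the examples are defined.
(def examples-ns *ns*)

;; Hypothetical helper: turns an example number such as "1.1"
;; into the function name used by the naming convention, "ex-1-1".
(defn example-fn-name [example-number]
  (str "ex-" (str/replace example-number "." "-")))

;; Stand-in for a real example function.
(defn ex-1-1 []
  :ran-example-1-1)

;; Look the function up by name and call it, or fail with a clear message.
(defn run-example [example-number]
  (if-let [f (ns-resolve examples-ns
                         (symbol (example-fn-name example-number)))]
    (f)
    (throw (IllegalArgumentException.
            (str "No such example: " example-number)))))
```

    Calling (run-example "1.1") invokes ex-1-1, while an unknown number raises an exception rather than failing silently.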

    Tip

    If you become stuck at any point, refer to the book's wiki at http://wiki.clojuredatascience.com. The wiki will provide troubleshooting tips for known issues, including advice for running examples on a variety of platforms.

    In fact, shell scripts are only used for fetching data from remote locations automatically. The book's wiki will also provide alternative instructions for those not wishing or unable to execute the shell scripts.

    Downloading the data

    The dataset for this chapter has been made available by the Complex Systems Research Group at the Medical University of Vienna. The analysis we'll be performing closely mirrors their research to determine the signals of systematic election fraud in the national elections of countries around the world.

    Note

    For more information about the research, and for links to download other datasets, visit the book's wiki or the research group's website at http://www.complex-systems.meduniwien.ac.at/elections/election.html.

    Throughout this book we'll be making use of numerous datasets. Where possible, we've included the data with the example code. Where this hasn't been possible—either because of the size of the data or due to licensing constraints—we've included a script to download the data instead.

    Chapter 1, Statistics is just such a chapter. If you've cloned the chapter's code and intend to follow the examples, download the data now by executing the following on the command line from within the project's directory:

    script/download-data.sh

    The script will download and decompress the sample data into the project's data directory.

    Tip

    If you have any difficulty running the download script or would like to follow manual instructions instead, visit the book's wiki at http://wiki.clojuredatascience.com for assistance.

    We'll begin investigating the data in the next section.

    Inspecting the data

    Throughout this chapter, and for many other chapters in this book, we'll be using the Incanter library (http://incanter.org/) to load, manipulate, and display data.

    Incanter is a modular suite of Clojure libraries that provides statistical computing and visualization capabilities. Modeled after the extremely popular R environment for data analysis, it brings together the power of Clojure, an interactive REPL, and a set of powerful abstractions for working with data.

    Each module of Incanter focuses on a specific area of functionality. For example incanter-stats contains a suite of related functions for analyzing data and producing summary statistics, while incanter-charts provides a large number of visualization capabilities. incanter-core provides the most fundamental and generally useful functions for transforming data.

    Each module can be included separately in your own code. For access to stats, charts, and Excel features, you could include the following in your project.clj:

      :dependencies [[incanter/incanter-core "1.5.5"]
                     [incanter/incanter-stats "1.5.5"]
                     [incanter/incanter-charts "1.5.5"]
                     [incanter/incanter-excel "1.5.5"]
                     ...]

    If you don't mind including more libraries than you need, you can simply include the full Incanter distribution instead:

    :dependencies [[incanter/incanter "1.5.5"]
                   ...]

    At Incanter's core is the concept of a dataset: a structure of rows and columns. If you have experience with relational databases, you can think of a dataset as a table. Each column in a dataset is named, and each row in the dataset has the same number of columns as every other. There are several ways to load data into an Incanter dataset, and which we use will depend on how our data is stored:

    If our data is a text file (a CSV or tab-delimited file), we can use the read-dataset function from incanter-io

    If our data is an Excel file (for example, an .xls or .xlsx file), we can use the read-xls function from incanter-excel

    For any other data source (an external database, website, and so on), as long as we can get our data into a Clojure data structure we can create a dataset with the dataset function in incanter-core
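
    To make the idea of named columns and equal-length rows concrete, here is a small sketch in plain Clojure. It deliberately avoids Incanter so that it runs with Clojure alone. The constituency names and figures are made up, and the helpers rectangular? and rows->maps are hypothetical names introduced only for this illustration.

```clojure
;; Column names and rows as they might arrive from any data source.
;; The values below are invented for illustration.
(def column-names ["Constituency Name" "Electorate" "Votes"])

(def raw-rows
  [["Example North" 70000 45000]
   ["Example South" 65000 40000]])

;; A table is well formed when every row has one value per column.
(defn rectangular? [cols rows]
  (every? #(= (count cols) (count %)) rows))

;; Pair each value with its column name, giving one map per row.
(defn rows->maps [cols rows]
  (map #(zipmap cols %) rows))
```

    Once data is in a shape like this, it is an ordinary Clojure data structure of the kind the dataset function in incanter-core is described as accepting.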

    This chapter makes use of Excel data sources, so we'll be using read-xls. The function takes one required argument (the file to load) and an optional keyword argument specifying the sheet number or name. All of our examples have only one sheet, so we'll just provide the file argument as a string:

    (ns cljds.ch1.data
      (:require [clojure.java.io :as io]
                [incanter.core :as i]
                [incanter.excel :as xls]))

    In general, we will not reproduce the namespace declarations from the example code. This is both for brevity and because the required namespaces can usually be inferred by the symbol used to reference them. For example, throughout this book we will always refer to clojure.java.io as io, incanter.core as i, and incanter.excel as xls wherever they are used.

    We'll be loading several data sources throughout this chapter, so we've created a multimethod called load-data in the cljds.ch1.data namespace:

    (defmulti load-data identity)

    (defmethod load-data :uk [_]
      (-> (io/resource "UK2010.xls")
          (str)
          (xls/read-xls)))

    In the preceding code, we define the load-data multimethod that dispatches on the identity of the first argument. We also define the implementation that will be called if the first argument is :uk. Thus, a call to (load-data :uk) will return an Incanter dataset containing the UK data. Later in the chapter, we'll define additional load-data implementations for other datasets.
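
    For readers who would like to see the dispatch mechanism in isolation, the following sketch uses the same identity dispatch with in-memory stand-ins instead of spreadsheets. The multimethod name load-sample, the :toy key, and all of the values are hypothetical. The :default method is an addition for illustration and is not something the text says the sample code defines.

```clojure
;; Dispatch on the argument itself, as load-data does.
(defmulti load-sample identity)

;; In-memory stand-ins for real data files (made-up values).
(defmethod load-sample :uk [_]
  [{"Election Year" 2010.0 "Electorate" 70000}])

(defmethod load-sample :toy [_]
  [{"Election Year" 1999.0 "Electorate" 10}])

;; Fallback for keys with no implementation of their own.
(defmethod load-sample :default [source]
  (throw (IllegalArgumentException.
          (str "Unknown data source: " source))))
```

    New data sources can then be added by writing another defmethod, without touching the existing ones.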

    The first row of the UK2010.xls spreadsheet contains column names. Incanter's read-xls function will preserve these as the column names of the returned dataset. Let's begin our exploration of the data by inspecting them now—the col-names function in incanter.core returns the column names as a vector. In the following code (and throughout the book, where we use functions from the incanter.core namespace) we require it as i:

    (defn ex-1-1 []
      (i/col-names (load-data :uk)))

    As described in running the examples earlier, functions beginning with ex- can be run on the command line with Leiningen like this:

    lein run -e 1.1

    The output of the preceding command should be the following Clojure vector:

    [Press Association Reference Constituency Name Region Election Year Electorate Votes AC AD AGS APNI APP AWL AWP BB BCP Bean Best BGPV BIB BIC Blue BNP BP Elvis C28 Cam Soc CG Ch M Ch P CIP CITY CNPG Comm Comm L Con Cor D CPA CSP CTDP CURE D Lab D Nat DDP DUP ED EIP EPA FAWG FDP FFR Grn GSOT Hum ICHC IEAC IFED ILEU Impact Ind1 Ind2 Ind3 Ind4 Ind5 IPT ISGB ISQM IUK IVH IZB JAC Joy JP Lab Land LD Lib Libert LIND LLPB LTT MACI MCP MEDI MEP MIF MK MPEA MRLP MRP Nat Lib NCDV ND New NF NFP NICF Nobody NSPS PBP PC Pirate PNDP Poet PPBF PPE PPNV Reform Respect Rest RRG RTBP SACL Sci SDLP SEP SF SIG SJP SKGP SMA SMRA SNP Soc Soc Alt Soc Dem Soc Lab South Speaker SSP TF TOC Trust TUSC TUV UCUNF UKIP UPS UV VCCA Vote Wessex Reg WRP You Youth YRDPL]

    This is a very wide dataset. The first six columns in the data file are described as follows; subsequent columns break the number of votes down by party:

    Press Association Reference: This is a number identifying the constituency (voting district, represented by one MP)

    Constituency Name: This is the common name given to the voting district

    Region: This is the geographic region of the UK where the constituency is based

    Election Year: This is the year in which the election was held

    Electorate: This is the total number of people eligible to vote in the constituency

    Votes: This is the total number of votes cast

    Whenever we're confronted with new data, it's important to take time to understand it. In the absence of detailed data definitions, one way we could do this is to begin by validating our assumptions about the data. For example, we expect that this dataset contains information about the 2010 election so let's review the contents of the Election Year column.

    Incanter provides the i/$ function (i, as before, signifying the incanter.core namespace) for selecting columns from a dataset. We'll encounter the function regularly throughout this chapter—it's Incanter's primary way of selecting columns from a variety of data representations and it provides several different arities. For now, we'll be providing just the name of the column we'd like to extract and the dataset from which to extract it:

    (defn ex-1-2 []
      (i/$ "Election Year" (load-data :uk)))

    ;; (2010.0 2010.0 2010.0 2010.0 2010.0 ... 2010.0 2010.0 nil)

    The years are returned as a single sequence of values. The output may be hard to interpret since the dataset contains so many rows. As we'd like to know which unique values the column contains, we can use the Clojure core function distinct. One of the advantages of using Incanter is that its useful data manipulation functions augment those that Clojure already provides as shown in the following example:

    (defn ex-1-3 []
      (->> (load-data :uk)
           (i/$ "Election Year")
           (distinct)))

    ;; (2010.0 nil)

    The 2010 year goes a long way to confirming our expectations that this data is from 2010. The nil value is unexpected, though, and may indicate a problem with our data.

    We don't yet know how many nils exist in the dataset and determining this could help us decide what to do next. A simple way of counting values such as this is to use the core library function frequencies, which returns a map of values to counts:

    (defn ex-1-4 []
      (->> (load-data :uk)
           (i/$ "Election Year")
           (frequencies)))

    ;; {2010.0 650 nil 1}

    In the preceding examples, we used Clojure's thread-last macro ->> to chain several functions together for legibility.

    Tip

    Along with Clojure's large core library of data manipulation functions, macros such as the thread-last macro ->> are another great reason for using Clojure to analyze data. Throughout this book, we'll see how Clojure can make even sophisticated analysis concise and comprehensible.
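
    The effect of the thread-last macro can be seen without Incanter by comparing a nested call with its threaded equivalent. The sketch below uses a handful of made-up rows, represented as plain Clojure maps, in place of the real dataset.

```clojure
;; Made-up rows standing in for the dataset. The last row mimics
;; the unexpected nil value seen in the Election Year column.
(def rows
  [{"Election Year" 2010.0}
   {"Election Year" 2010.0}
   {"Election Year" 2010.0}
   {"Election Year" nil}])

;; Nested form: must be read from the inside out.
(def nested
  (frequencies (map #(get % "Election Year") rows)))

;; Thread-last form: each result becomes the last argument of the next call.
(def threaded
  (->> rows
       (map #(get % "Election Year"))
       (frequencies)))

;; The same pipeline shape works with distinct.
(def unique-years
  (->> rows
       (map #(get % "Election Year"))
       (distinct)))
```

    Both nested and threaded evaluate to the same map of values to counts. Only the reading order differs.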

    It wouldn't take us long to confirm that in 2010 the UK had 650 electoral districts, known as constituencies. Domain knowledge such as this is invaluable when sanity-checking new data. Thus, it's highly probable that the nil value is extraneous and can be removed. We'll see how to do this in the next section.

    Data scrubbing

    It is a commonly repeated statistic that at least 80 percent of a data scientist's work is data scrubbing. This is the process of detecting potentially corrupt or incorrect data and either correcting or filtering it out.

    Note

    Data scrubbing is one of the most important (and time-consuming) aspects of working with data. It's a key step to ensuring that subsequent analysis is performed on data that is valid, accurate, and consistent.

    The nil value at the end of the election year column may indicate dirty data that ought to be removed. We've already seen that filtering columns of data can be accomplished with Incanter's i/$ function. For filtering rows of data we can use Incanter's i/query-dataset function.

    We let Incanter know which rows we'd like it to filter by passing a Clojure map of column names and predicates. Only rows for which all predicates return true will be retained. For example, to select only the nil values from our dataset:

    (-> (load-data :uk)
        (i/query-dataset {"Election Year" {:$eq nil}}))

    If you know SQL, you'll notice this is very similar to a WHERE clause. In fact, Incanter also provides the i/$where function, an alias to i/query-dataset that reverses the order of the arguments.

    The query is a map of column names to predicates and each predicate is itself a map of operator to operand. Complex queries can be constructed by specifying multiple columns and multiple operators together. Query operators include:

    :$gt greater than

    :$lt less than

    :$gte greater than or equal to

    :$lte less than or equal to

    :$eq equal to

    :$ne not equal to

    :$in to test for membership of a collection

    :$nin to test for non-membership of a collection

    :$fn a predicate function that should return a true response for rows to keep

    If none of the built-in operators suffice, the last operator provides the ability to pass a custom function instead.
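
    To show how a map of column names to operator maps can act as a row filter, here is a simplified interpreter written in plain Clojure. It is a sketch of the idea only and is not Incanter's implementation. The names operators, matches?, and where are hypothetical, the rows are made up, and treating nil as failing every ordering comparison is an assumption made for this example.

```clojure
;; Ordering comparisons are treated as false for nil values (an assumption).
(defn nil-safe [cmp]
  (fn [operand v] (and (some? v) (cmp v operand))))

;; Each operator takes the operand from the query and the value from the row.
(def operators
  {:$gt  (nil-safe >)
   :$lt  (nil-safe <)
   :$gte (nil-safe >=)
   :$lte (nil-safe <=)
   :$eq  (fn [operand v] (= v operand))
   :$ne  (fn [operand v] (not= v operand))
   :$in  (fn [operand v] (contains? (set operand) v))
   :$nin (fn [operand v] (not (contains? (set operand) v)))
   :$fn  (fn [f v] (boolean (f v)))})

;; A row matches when every operator on every listed column holds.
(defn matches? [query row]
  (every? (fn [[column predicates]]
            (every? (fn [[op operand]]
                      ((operators op) operand (get row column)))
                    predicates))
          query))

;; Keep only the rows that satisfy the query.
(defn where [query rows]
  (filter #(matches? query %) rows))

;; Made-up rows, including an all-nil row like the one in the UK data.
(def rows
  [{"Constituency Name" "Example North" "Election Year" 2010.0 "Electorate" 70000}
   {"Constituency Name" "Example South" "Election Year" 2010.0 "Electorate" 55000}
   {"Constituency Name" nil "Election Year" nil "Electorate" nil}])
```

    With these definitions, (where {"Election Year" {:$ne nil} "Electorate" {:$gt 60000 :$lt 80000}} rows) keeps only the Example North row, which illustrates how several columns and several operators combine into one query.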

    We'll continue to use Clojure's thread-last macro to make the code intention a little clearer, and return the row as a map of keys and values using the i/to-map function:

    (defn ex-1-5 []
      (->> (load-data :uk)
           (i/$where {"Election Year" {:$eq nil}})
           (i/to-map)))

    ;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}

    Looking at the results carefully, it's apparent that all but one of the columns in this row are nil. In fact, a bit of further exploration confirms that the single non-nil value is a summary total and that the row ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne operator, returning only rows where the election year is not equal to nil:

    (->> (load-data :uk)
         (i/$where {"Election Year" {:$ne nil}}))

    The preceding function is one we'll almost always want to make sure we call in advance of using the data. One way of doing this is to add another implementation of our load-data multimethod, which also includes this filtering step:

    (defmethod load-data :uk-scrubbed [_]
      (->> (load-data :uk)
           (i/$where {"Election Year" {:$ne nil}})))

    Now, with any code we write, we can choose whether to refer to the :uk or :uk-scrubbed datasets.

    By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us—and future readers of our code—what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.
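
    The layering described here can be sketched with in-memory data. In the example below, the multimethod load-toy, its :raw and :scrubbed keys, and the vote counts are all hypothetical. The point is only that the scrubbed view is always derived from the untouched source.

```clojure
(defmulti load-toy identity)

;; "Raw" source: returned untouched, including the trailing summary row.
(defmethod load-toy :raw [_]
  [{"Election Year" 2010.0 "Votes" 100}
   {"Election Year" 2010.0 "Votes" 150}
   {"Election Year" nil    "Votes" 250}])

;; Scrubbed view: derived from the raw source on every call,
;; so the cleaning step is recorded in code rather than applied by hand.
(defmethod load-toy :scrubbed [_]
  (->> (load-toy :raw)
       (remove #(nil? (get % "Election Year")))))

;; Total votes across whichever view is requested.
(defn total-votes [source]
  (->> (load-toy source)
       (map #(get % "Votes"))
       (reduce +)))
```

    Summing the votes for :raw double-counts because of the summary row, whereas :scrubbed gives the intended total. Swapping in new source data would only require changing the :raw method.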

    Descriptive statistics

    Descriptive statistics are numbers that are used to summarize and describe data. In the next chapter, we'll turn our attention to a more sophisticated analysis, the so-called inferential statistics, but for now we'll limit ourselves to simply describing what we can observe about the data contained in the file.

    To demonstrate what we mean, let's look at the Electorate column of the data. This column lists the total number of registered voters in each constituency:

    (defn ex-1-6 []
      (->> (load-data :uk-scrubbed)
           (i/$ "Electorate")
           (count)))

    ;;
