Jupyter Notebook for Beginners: A Tutorial
https://www.dataquest.io/blog/jupyter-notebook-tutorial/
The Jupyter Notebook is an incredibly powerful tool for interactively developing and
presenting data science projects. This article will walk you through how to set up
Jupyter Notebooks on your local machine and how to start using it to do data
science projects.
First, though: what is a “notebook”? A notebook integrates code and its output into
a single document that combines visualizations, narrative text, mathematical
equations, and other rich media. This intuitive workflow promotes iterative and rapid development, making notebooks an increasingly popular choice at the heart of contemporary data science, analysis, and science at large.
Best of all, as part of the open source Project Jupyter, they are completely free.
The Jupyter project is the successor to the earlier IPython Notebook, which was first
published as a prototype in 2010. Although it is possible to use many different
programming languages within Jupyter Notebooks, this article will focus on Python
as it is the most common use case. (Among R users, R Studio tends to be a more
popular choice).
To get the most out of this tutorial you should be familiar with programming, specifically Python and pandas. That said, if you have experience with
another language, the Python in this article shouldn’t be too cryptic, and will still
help you get Jupyter Notebooks set up locally. Jupyter Notebooks can also act as a
flexible platform for getting to grips with pandas and even Python, as will become
apparent in this article.
We will:
- Cover the basics of installing Jupyter and creating your first notebook
- Delve deeper and learn all the important terminology
- Explore how easily notebooks can be shared and published online
Installation
The easiest way for a beginner to get started with Jupyter Notebooks is by
installing Anaconda. Anaconda is the most widely used Python distribution for data
science and comes pre-loaded with all the most popular libraries and tools. Some
of the biggest Python libraries wrapped up in Anaconda
include NumPy, pandas and Matplotlib, though the full list of 1,000+ is extensive. This
lets you hit the ground running in your own fully stocked data science workshop
without the hassle of managing countless installations or worrying about
dependencies and OS-specific (read: Windows-specific) installation issues.
To get Anaconda, simply:
- Download the latest version of Anaconda for Python 3.
- Install Anaconda by following the instructions on the download page and/or in the executable.
Running Jupyter
On Windows, you can run Jupyter via the shortcut Anaconda adds to your start
menu, which will open a new tab in your default web browser displaying the Jupyter dashboard.
This isn’t a notebook just yet, but don’t panic! There’s not much to it. This is the
Notebook Dashboard, specifically designed for managing your Jupyter Notebooks.
Think of it as the launchpad for exploring, editing and creating your notebooks.
Be aware that the dashboard will give you access only to the files and sub-folders
contained within Jupyter’s start-up directory; however, the start-up directory can be
changed. It is also possible to start the dashboard on any system via the command
prompt (or terminal on Unix systems) by entering the command jupyter
notebook; in this case, the current working directory will be the start-up directory.
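For example, if you keep your notebooks in a dedicated folder, a minimal terminal session might look like this (the directory path here is just a placeholder):

cd ~/projects/my-analysis
jupyter notebook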
The astute reader may have noticed that the URL for the dashboard is something
like http://localhost:8888/tree. Localhost is not a website, but indicates that
the content is being served from your local machine: your own computer. Jupyter’s
Notebooks and dashboard are web apps, and Jupyter starts up a local Python
server to serve these apps to your web browser, making it essentially platform
independent and opening the door to easier sharing on the web.
The dashboard’s interface is mostly self-explanatory — though we will come back to
it briefly later. So what are we waiting for? Browse to the folder in which you would
like to create your first notebook, click the “New” drop-down button in the top-right
and select “Python 3” (or the version of your choice).
Hey presto, here we are! Your first Jupyter Notebook will open in a new tab; each
notebook uses its own tab because you can open multiple notebooks
simultaneously. If you switch back to the dashboard, you will see the new
file Untitled.ipynb and you should see some green text that tells you your
notebook is running.
What is an ipynb File?
It will be useful to understand what this file really is. Each .ipynb file is a text file
that describes the contents of your notebook in a format called JSON. Each cell and
its contents, including image attachments that have been converted into strings of
text, is listed therein along with some metadata. You can edit this yourself — if you
know what you are doing! — by selecting “Edit > Edit Notebook Metadata” from the
menu bar in the notebook.
You can also view the contents of your notebook files by selecting “Edit” from the
controls on the dashboard, but the keyword here is "can"; there's no reason other
than curiosity to do so unless you really know what you are doing.
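If you're curious, here is a minimal sketch of how you could peek at that JSON yourself with plain Python; it assumes a notebook named Untitled.ipynb sits in the current directory:

import json

# Load the notebook file as ordinary JSON
with open('Untitled.ipynb') as f:
    nb = json.load(f)

# Top-level keys include 'cells', 'metadata', and the format version
print(list(nb.keys()))

# Each cell records its type ('code' or 'markdown') and its source text
print(nb['cells'][0]['cell_type'])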
Back in your notebook, let's test things out with a classic hello world example. Type print('Hello World!') into the first cell, then click the Run button in the toolbar above or press Ctrl + Enter. The result should look like this:
print('Hello World!')
Hello World!
When you run the cell, its output is displayed below, and the label to its left changes from In [ ] to In [1]. The output of a code cell also forms
part of the document, which is why you can see it in this article. You can always tell
the difference between code and Markdown cells because code cells have that label
on the left and Markdown cells do not.
The “In” part of the label is simply short for “Input,” while the label number indicates
when the cell was executed on the kernel — in this case the cell was executed first.
Run the cell again and the label will change to In [2] because now the cell was the
second to be run on the kernel. It will become clearer why this is so useful later on
when we take a closer look at kernels.
From the menu bar, click Insert and select Insert Cell Below to create a new code cell
underneath your first and try out the following code to see what happens. Do you
notice anything different?
import time
time.sleep(3)
This cell doesn’t produce any output, but it does take three seconds to execute.
Notice how Jupyter signifies that the cell is currently running by changing its label
to In [*].
In general, the output of a cell comes from any text data specifically printed during the cell's execution, as well as the value of the last line in the cell, be it a lone
variable, a function call, or something else. For example:
def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)

say_hello('Tim')
'Hello, Tim!'
You’ll find yourself using this almost constantly in your own projects, and we’ll see
more of it later on.
Keyboard Shortcuts
One final thing you may have observed when running your cells is that their border
turned blue, whereas it was green while you were editing. There is always one
“active” cell highlighted with a border whose color denotes its current mode, where
green means “edit mode” and blue is “command mode.”
So far we have seen how to run a cell with Ctrl + Enter, but there are plenty
more. Keyboard shortcuts are a very popular aspect of the Jupyter environment
because they facilitate a speedy cell-based workflow. Many of these are actions you
can carry out on the active cell when it’s in command mode.
Below, you’ll find a list of some of Jupyter’s keyboard shortcuts. You’re not expected
to pick them up immediately, but the list should give you a good idea of what’s
possible.
- Toggle between command and edit mode with Esc and Enter, respectively.
- Once in command mode:
  - Scroll up and down your cells with your Up and Down keys.
  - Press A or B to insert a new cell above or below the active cell.
  - M will transform the active cell to a Markdown cell.
  - Y will set the active cell to a code cell.
  - D + D (press D twice) will delete the active cell.
  - Z will undo cell deletion.
  - Hold Shift and press Up or Down to select multiple cells at once. With multiple cells selected, Shift + M will merge your selection.
- Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
- You can also click and Shift + Click in the margin to the left of your cells to select them.
Go ahead and try these out in your own notebook. Once you’ve had a play, create a
new Markdown cell and we’ll learn how to format the text in our notebooks.
Markdown
Markdown is a lightweight, easy to learn markup language for formatting plain text.
Its syntax has a one-to-one correspondence with HTML tags, so some prior
knowledge here would be helpful but is definitely not a prerequisite. Remember
that this article was written in a Jupyter notebook, so all of the narrative text and
images you have seen so far were achieved in Markdown. Let's cover the basics with
a quick example.
# This is a level 1 heading
Inline code uses single backticks: `foo()`, and code blocks use triple
backticks:
```
bar()
```
Kernels
Behind every notebook runs a kernel. When you run a code cell, that code is
executed within the kernel and any output is returned back to the cell to be
displayed. The kernel’s state persists over time and between cells — it pertains to
the document as a whole and not individual cells.
For example, if you import libraries or declare variables in one cell, they will be
available in another. In this way, you can think of a notebook document as being
somewhat comparable to a script file, except that it is multimedia. Let’s try this out
to get a feel for it. First, we’ll import a Python package and define a function.
import numpy as np

def square(x):
    return x * x
Once we’ve executed the cell above, we can reference np and square in any other
cell.
x = np.random.randint(1, 10)
y = square(x)
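Since that state persists, a third cell can use both of those names, even though they were defined elsewhere in the notebook:

print('%d squared is %d' % (x, y))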
Occasionally you'll want to reset things; the Kernel menu gives you several options:
- Restart: restarts the kernel, thus clearing all the variables etc. that were defined.
- Restart & Clear Output: same as above, but will also wipe the output displayed below your code cells.
- Restart & Run All: same as above, but will also run all your cells in order from first to last.
If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.
Choosing a Kernel
You may have noticed that Jupyter gives you the option to change the kernel, and in fact there are many different options to choose from. Back when you created a new
notebook from the dashboard by selecting a Python version, you were actually
choosing which kernel to use.
Not only are there kernels for different versions of Python, but also for over 100
languages including Java, C, and even Fortran. Data scientists may be particularly
interested in the kernels for R and Julia, as well as both imatlab and the Calysto
MATLAB Kernel for Matlab. The SoS kernel provides multi-language support within
a single notebook. Each kernel has its own installation instructions, but will likely
require you to run some commands on your computer.
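As a concrete illustration, registering an additional Python environment as its own kernel is typically done with the ipykernel package; this is a sketch, and the environment name "myenv" is just a placeholder:

pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"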
Example Analysis
Now that we've looked at what a Jupyter Notebook is, it's time to look at how they're used in practice, which should give you a clearer understanding of why they are so popular. We'll work with a data set of Fortune 500 companies spanning the years 1955 to 2005. Our goal is to find out how the profits of the largest companies in the US changed historically.
It's worth noting that everyone develops their own preferences and style, but the general principles still apply. You can follow along with this section in your own notebook if you wish, which gives you the scope to play around.
Note that closing the notebook tab in your browser will not “close” your notebook
in the way closing a document in a traditional application will. The notebook’s
kernel will continue to run in the background and needs to be shut down before it
is truly “closed” — though this is pretty handy if you accidentally close your tab or
browser! If the kernel is shut down, you can close the tab without worrying about
whether it is still running or not.
Once you've given your notebook a more descriptive name (click the Untitled title at the top of the page to rename it), open it back up and we'll get going.
Setup
It’s common to start off with a code cell specifically for imports and setup, so that if
you choose to add or change anything, you can simply edit and re-run the cell
without causing any side-effects.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
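Next we need the data itself. Assuming the Fortune 500 records have been saved alongside the notebook as a CSV named fortune500.csv (the filename is an assumption), a single cell reads it into pandas:

# Read the Fortune 500 data into a DataFrame
df = pd.read_csv('fortune500.csv')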
Investigating Our Data Set
Now we’re really rolling! Our notebook is safely saved and we’ve loaded our data
set df into the most-used pandas data structure, which is called a DataFrame and
basically looks like a table. What does ours look like?
df.head()
df.tail()
Looking good. We have the columns we need, and each row corresponds to a single
company in a single year.
Let’s just rename those columns so we can refer to them later.
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
Next, we need to explore our data set. Is it complete? Did pandas read it as
expected? Are any values missing?
len(df)
25500
Okay, that looks good — that’s 500 rows for every year from 1955 to 2005, inclusive.
Let’s check whether our data set has been imported as we would expect. A simple
check is to see if the data types (or dtypes) have been correctly interpreted.
df.dtypes
year int64
rank int64
company object
revenue float64
profit object
dtype: object
Uh oh. It looks like there’s something wrong with the profits column — we would
expect it to be a float64 like the revenue column. This indicates that it probably contains some non-numeric values, so let's take a look.
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()
(The output lists the first few matching rows; each has N.A. in the profit column.)
Just as we suspected! Some of the values are strings, which have been used to
indicate missing data. Are there any other values that have crept in?
set(df.profit[non_numberic_profits])
{'N.A.'}
That makes it easy to interpret, but what should we do? Well, that depends on how many values are missing.
len(df.profit[non_numberic_profits])
369
It’s a small fraction of our data set, though not completely inconsequential as it is
still around 1.5%. If rows containing N.A. are, roughly, uniformly distributed over
the years, the easiest solution would just be to remove them. So let’s have a quick
look at the distribution.
bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))
At a glance, we can see that the largest number of invalid values in any single year is fewer than 25,
and as there are 500 data points per year, removing these values would account for
less than 4% of the data for the worst years. Indeed, other than a surge around the
90s, most years have fewer than half the missing values of the peak. For our
purposes, let’s say this is acceptable and go ahead and remove these rows.
df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)
We should check that worked.
len(df)
25131
df.dtypes
year int64
rank int64
company object
revenue float64
profit float64
dtype: object
Great! We have finished our data set setup.
If you were going to present your notebook as a report, you could get rid of the
investigatory cells we created, which are included here as a demonstration of the
flow of working with notebooks, and merge relevant cells (see the Advanced
Functionality section below for more on this) to create a single data set setup cell.
This would mean that if we ever mess up our data set elsewhere, we can just rerun
the setup cell to restore it.
Plotting with matplotlib
Next, we can get to addressing the question at hand by plotting the average profit
by year. We might as well plot the revenue as well, so first we can define some
variables and a method to reduce our code.
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
Now let’s plot!
fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')
Wow, that looks like an exponential, but it’s got some huge dips. They must
correspond to the early 1990s recession and the dot-com bubble. It’s pretty
interesting to see that in the data. But how come profits recovered to even higher
levels post each recession?
Maybe the revenues can tell us more.
y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')
That adds another side to the story. Revenues were nowhere near as badly hit; that's some great accounting work from the finance departments.
With a little help from Stack Overflow, we can superimpose these plots with +/-
their standard deviations.
def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)

fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
That's staggering: the standard deviations are huge. Some Fortune 500 companies
make billions while others lose billions, and the risk has increased along with rising
profits over the years. Perhaps some companies perform better than others; are
the profits of the top 10% more or less volatile than the bottom 10%?
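Here is a minimal sketch of how one might begin probing that last question, assuming the column names we set earlier; reading the top and bottom 10% as ranks 1-50 and 451-500 within each year's 500 is just one reasonable interpretation:

# Split each year's list into its top and bottom 10% by rank
top10 = df.loc[df['rank'] <= 50]
bottom10 = df.loc[df['rank'] > 450]

# Compare the spread of profits within each group, year by year
top_std = top10.groupby('year').profit.std()
bottom_std = bottom10.groupby('year').profit.std()
print(top_std.mean(), bottom_std.mean())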
There are plenty of questions that we could look into next, and it’s easy to see how
the flow of working in a notebook matches one’s own thought process, so now it’s
time to draw this example to a close. This flow helped us to easily investigate our
data set in one place without context switching between applications, and our work
is immediately sharable and reproducible. If we wished to create a more concise
report for a particular audience, we could quickly refactor our work by merging cells
and removing intermediary code.
Note: The original version of this article used as_matrix() instead of .values in the
code snippet above. As of this writing, .as_matrix() still works, but it is slated for
removal in a future version of pandas, so we’ve replaced it with the future-
proof .values.
GitHub
With the number of public notebooks on GitHub exceeding 1.8 million by early
2018, it is surely the most popular independent platform for sharing Jupyter
projects with the world. GitHub has integrated support for rendering .ipynb files
directly both in repositories and gists on its website. If you aren’t already
aware, GitHub is a code hosting platform for version control and collaboration for
repositories created with Git. You’ll need an account to use their services, but
standard accounts are free.
Once you have a GitHub account, the easiest way to share a notebook on GitHub
doesn’t actually require Git at all. Since 2008, GitHub has provided its Gist service
for hosting and sharing code snippets, which each get their own repository. To share a notebook using Gists:
- Sign in and browse to gist.github.com.
- Open your .ipynb file in a text editor, select all, and copy the JSON inside.
- Paste the JSON into the gist.
- Give your gist a filename ending in .ipynb and click "Create secret gist" or "Create public gist."
If you created a public Gist, you will now be able to share its URL with anyone, and
others will be able to fork and clone your work.
Creating your own Git repository and sharing this on GitHub is beyond the scope of
this tutorial, but GitHub provides plenty of guides for you to get started on your
own.
An extra tip for those using git is to add an exception to your .gitignore for those
hidden .ipynb_checkpoints directories Jupyter creates, so as not to commit
checkpoint files unnecessarily to your repo.
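For example, a line like this in your .gitignore (a sketch of the usual pattern) keeps those checkpoint directories out of version control:

.ipynb_checkpoints/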
Nbviewer
Having grown to render hundreds of thousands of notebooks every week by 2015,
NBViewer is the most popular notebook renderer on the web. If you already have
somewhere to host your Jupyter Notebooks online, be it GitHub or elsewhere,
NBViewer will render your notebook and provide a shareable URL along with it.
Provided as a free service as part of Project Jupyter, it is available
at nbviewer.jupyter.org.
Initially developed before GitHub’s Jupyter Notebook integration, NBViewer allows
anyone to enter a URL, Gist ID, or GitHub username/repo/file and it will render the
notebook as a webpage. A Gist's ID is the unique identifier at the end of its URL; for example, the string of characters after the last forward slash in https://gist.github.com/username/50896401c23e0bf417e89cd57e89e1de.
If you enter a GitHub username or username/repo, you will see a minimal file
browser that lets you explore a user’s repos and their contents.
The URL NBViewer uses when displaying a notebook is constant, based on the URL of the notebook it is rendering, so you can share this with anyone and it will
work as long as the original files remain online — NBViewer doesn’t cache files for
very long.
Final Thoughts
Starting with the basics, we have come to grips with the natural workflow of Jupyter
Notebooks, delved into IPython’s more advanced features, and finally learned how
to share our work with friends, colleagues, and the world. And we accomplished all
this from a notebook itself!
It should be clear how notebooks promote a productive working experience by
reducing context switching and emulating a natural development of thoughts
during a project. The power of Jupyter Notebooks should also be evident, and we
covered plenty of leads to get you started exploring more advanced features in
your own projects.
If you’d like further inspiration for your own Notebooks, Jupyter has put together a
gallery of interesting Jupyter Notebooks that you may find helpful and the Nbviewer
homepage links to some really fancy examples of quality notebooks.