Scipy Lectures
Edited by
Gaël Varoquaux • Emmanuelle Gouillart • Olav Vahtras • Pierre de Buyl
Christopher Burns • Adrian Chauve • Robert Cimrman • Christophe Combelles
Ralf Gommers • André Espaze • Zbigniew Jędrzejewski-Szmek
Valentin Haenel • Michael Hartmann • Gert-Ludwig Ingold • Fabian Pedregosa
Didrik Pinte • Nicolas P. Rougier • Joris Van den Bossche • Pauli Virtanen

2 The Python language
2.1 First steps
2.2 Basic types
2.3 Control Flow
2.4 Defining functions
2.5 Reusing code: scripts and modules
2.6 Input and Output
2.7 Standard Library
2.8 Exception handling in Python
2.9 Object-oriented programming (OOP)
3.3 Some new features in Python 3
5 Matplotlib: plotting
5.1 Introduction
5.2 Simple plot
5.3 Figures, Subplots, Axes and Ticks
5.4 Other Types of Plots: examples and exercises
5.5 Beyond this tutorial
5.6 Quick references
6 Scipy : high-level scientific computing
6.1 File input/output: scipy.io
6.2 Special functions: scipy.special
6.3 Linear algebra operations: scipy.linalg
6.4 Interpolation: scipy.interpolate
6.5 Optimization and fit: scipy.optimize
6.6 Statistics and random numbers: scipy.stats
6.7 Numerical integration: scipy.integrate
6.8 Fast Fourier transforms: scipy.fftpack
6.9 Signal processing: scipy.signal
6.10 Image manipulation: scipy.ndimage
6.11 Summary exercises on scientific computing
6.12 Full code examples for the scipy chapter
7 Getting help and finding documentation

II Advanced topics
8 Advanced Python Constructs
8.1 Iterators, generator expressions and generators
8.2 Decorators
8.3 Context managers
9 Advanced NumPy
9.1 Life of ndarray
9.2 Universal functions
9.3 Interoperability features
9.4 Array siblings: chararray, maskedarray, matrix
9.5 Summary
9.6 Contributing to NumPy/Scipy
10 Debugging code
10.1 Avoiding bugs
10.2 Debugging workflow
10.3 Using the Python debugger
10.4 Debugging segmentation faults using gdb
11 Optimizing code
11.1 Optimization workflow
11.2 Profiling Python code
11.3 Making code go faster
11.4 Writing faster numerical code
12 Sparse Matrices in SciPy
12.1 Introduction
12.2 Storage Schemes
12.3 Linear System Solvers
12.4 Other Interesting Packages
13 Image manipulation and processing using Numpy and Scipy
13.1 Opening and writing to image files
13.2 Displaying images
13.3 Basic manipulations
13.4 Image filtering
13.5 Feature extraction
13.6 Measuring objects properties: ndimage.measurements
13.7 Full code examples
14 Mathematical optimization: finding minima of functions
14.1 Knowing your problem
14.2 A review of the different optimizers
14.3 Full code examples
14.4 Examples for the mathematical optimization chapter
14.5 Practical guide to optimization with scipy
14.6 Special case: non-linear least-squares
14.7 Optimization with constraints
14.8 Full code examples
14.9 Examples for the mathematical optimization chapter
15 Interfacing with C
15.1 Introduction
15.2 Python-C-Api
15.3 Ctypes
15.4 SWIG
15.5 Cython
15.6 Summary
15.7 Further Reading and References
15.8 Exercises

III Packages and applications
16 Statistics in Python
16.1 Data representation and interaction
16.2 Hypothesis testing: comparing two groups
16.3 Linear models, multiple factors, and analysis of variance
16.4 More visualization: seaborn for statistical exploration
16.5 Testing for interactions
16.6 Full code for the figures
16.7 Solutions to this chapter’s exercises
17 Sympy : Symbolic Mathematics in Python
17.1 First Steps with SymPy
17.2 Algebraic manipulations
17.3 Calculus
17.4 Equation solving
17.5 Linear Algebra
18 Scikit-image: image processing
18.1 Introduction and concepts
18.2 Input/output, data types and colorspaces
18.3 Image preprocessing / enhancement
18.4 Image segmentation
18.5 Measuring regions’ properties
18.6 Data visualization and interaction
18.7 Feature extraction for computer vision
18.8 Full code examples
18.9 Examples for the scikit-image chapter
19 Traits: building interactive dialogs
19.1 Introduction
19.2 Example
19.3 What are Traits
20 3D plotting with Mayavi
Scipy lecture notes, Edition 2022.1
Index
This part of the Scipy lecture notes is a self-contained introduction to everything that is needed to use
Python for science, from the language itself, to numerical computing or plotting.
Part I
• Universal: Python is a language used for many different problems. Learning Python avoids learning
new software for each new problem.
Compiled languages: C, C++, Fortran...
Pros
• Very fast. For heavy computations, it’s difficult to outperform these languages.
Cons
Python
Pros
• Very rich scientific computing libraries
• Well-thought-out language that lets you write very readable and well-structured code: we “code what we think”.
• Many libraries beyond scientific computing (web server, serial port access, etc.)
• Free and open-source software, widely used, with a vibrant community.
• A variety of powerful environments to work in, such as IPython, Spyder, Jupyter notebooks, PyCharm, Visual Studio Code
Cons
• Not all the algorithms that can be found in more specialized software or toolboxes.

1.2 The Scientific Python ecosystem
• Scipy : high-level numerical routines. Optimization, regression, interpolation, etc. http://www.scipy.org/
See also:
chapter on scipy
• Matplotlib : 2-D visualization, “publication-ready” plots http://matplotlib.org/
See also:
chapter on matplotlib
• scikit-learn for machine learning
and many more packages not documented in the Scipy lectures.
See also:
chapters on advanced topics
chapters on packages and applications

1.3 Before starting: Installing a working environment
Python comes in many flavors, and there are many ways to install it. However, we recommend installing a scientific-computing distribution, which comes with optimized versions of the scientific modules.

Warning: You should install Python 3
Python 2.7 has reached end of life and has not been maintained since January 1, 2020.
Working with Python 2.7 is at your own risk. Do not expect much support.
• Official announcement
• The end is nigh

Under Linux
If you have a recent distribution, most of the tools are probably packaged, and it is recommended to use your package manager.

Other systems
There are several fully-featured Scientific Python distributions:
• Anaconda
• EPD
• WinPython

1.4 The workflow: interactive environments and text editors
Interactive work to test and understand algorithms: In this section, we describe a workflow combining interactive work and consolidation.
Python is a general-purpose language. As such, there is no single blessed environment to work in, and no single way of using it. Although this makes it harder for beginners to find their way, it makes it possible for Python to be used for programs, in web servers, or on embedded devices.

1.4.1 Interactive work
We recommend working interactively with the IPython console, or its offspring, the Jupyter notebook. They are handy to explore and understand algorithms.

Under the notebook
Start ipython:

Getting help by using the ? operator after an object:

In [2]: print?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form: <built-in function print>
Namespace: Python builtin
Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.

See also:
• IPython user manual: https://ipython.readthedocs.io/en/stable/
• Jupyter Notebook QuickStart: http://jupyter.readthedocs.io/en/latest/content-quickstart.html

1.4.2 Elaboration of the work in an editor
As you move forward, it will be important to not only work interactively, but also to create and reuse Python files. For this, a powerful code editor will get you far. Here are several good easy-to-use editors:
• Spyder: integrates an IPython console, a debugger, a profiler...
• PyCharm: integrates an IPython console, notebooks, a debugger... (freely available, but commercial)
• Visual Studio Code: integrates a Python console, notebooks, a debugger...
• Atom
Some of these are shipped by the various scientific Python distributions, and you can find them in the menus.

As an exercise, create a file my_file.py in a code editor, and add the following lines:

s = 'Hello world'
print(s)

Now, you can run it in the IPython console or a notebook and explore the resulting variables:

In [1]: %run my_file.py
Hello world

In [2]: s
Out[2]: 'Hello world'

In [3]: %whos
Variable   Type    Data/Info
----------------------------
s          str     Hello world
While it is tempting to work only with scripts, that is, files full of instructions following each other, do plan to progressively evolve the script into a set of functions:
• A script is not reusable, functions are.
• Thinking in terms of functions helps break the problem into small blocks.

1.4.3 IPython and Jupyter Tips and Tricks
The user manuals contain a wealth of information. Here we give a quick introduction to four useful features: history, tab completion, magic functions, and aliases.

Command history
Like a UNIX shell, the IPython console supports command history. Type up and down to navigate previously typed commands:

In [1]: x = 10

In [2]: <UP>

In [2]: x = 10

Tab completion
Tab completion is a convenient way to explore the structure of any object you’re dealing with. Simply type object_name.<TAB> to view the object’s attributes. Besides Python objects and keywords, tab completion also works on file and directory names.

In [1]: x = 10

In [2]: x.<TAB>
x.bit_length   x.denominator  x.imag         x.real
x.conjugate    x.from_bytes   x.numerator    x.to_bytes

Magic functions
The console and the notebooks support so-called magic functions by prefixing a command with the % character. For example, the run and whos functions from the previous section are magic functions. Note that the setting automagic, which is enabled by default, allows you to omit the preceding % sign. Thus, you can just type the magic function and it will work.

Other useful magic functions are:
• %cd to change the current directory.

In [1]: cd /tmp
/tmp

• %cpaste allows you to paste code, especially code from websites which has been prefixed with the standard Python prompt (e.g. >>>) or with an IPython prompt (e.g. In [3]):

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:>>> for i in range(3):
:...     print(i)
:--
0
1
2

• %timeit allows you to time the execution of short snippets using the timeit module from the standard library:

In [3]: %timeit x = 10
10000000 loops, best of 3: 39 ns per loop

See also:
Chapter on optimizing code

• %debug allows you to enter post-mortem debugging. That is to say, if the code you try to execute raises an exception, using %debug will enter the debugger at the point where the exception was thrown.

In [4]: x === 10
File "<ipython-input-6-12fd421b5f28>", line 1
x === 10
    ^
SyntaxError: invalid syntax

In [5]: %debug
> /.../IPython/core/compilerop.py (87)ast_parse()
     86 and are passed to the built-in compile function."""
---> 87 return compile(source, filename, symbol, self.flags | PyCF_ONLY_AST, 1)
     88

ipdb>locals()
{'source': u'x === 10\n', 'symbol': 'exec', 'self':
<IPython.core.compilerop.CachingCompiler instance at 0x2ad8ef0>,
'filename': '<ipython-input-6-12fd421b5f28>'}

See also:
Chapter on debugging

Aliases
Furthermore, IPython ships with various aliases which emulate common UNIX command line tools such as ls to list files, cp to copy files and rm to remove files (a full list of aliases is shown when typing alias).

Getting help
• The built-in cheat-sheet is accessible via the %quickref magic function.
• A list of all available magic functions is shown when typing %magic.
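The %timeit magic described above relies on the timeit module from the standard library. Outside IPython, a rough equivalent can be sketched as follows; the number and repeat values are arbitrary choices for this illustration, whereas %timeit picks them automatically.

```python
import timeit

# Time a short statement; each of the 3 measurements runs it 1,000,000 times.
# (number and repeat are arbitrary here; %timeit chooses them for you.)
timings = timeit.repeat('x = 10', number=1_000_000, repeat=3)

# Report the best run, converted to nanoseconds per loop
best = min(timings) / 1_000_000
print('best of 3: %.1f ns per loop' % (best * 1e9))
```

Taking the minimum rather than the mean is a common choice, since slower runs mostly reflect interference from other processes.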
CHAPTER 2
The Python language

Authors: Chris Burns, Christophe Combelles, Emmanuelle Gouillart, Gaël Varoquaux

We introduce here the Python language. Only the bare minimum necessary for getting started with Numpy and Scipy is addressed here. To learn more about the language, consider going through the excellent tutorial https://docs.python.org/tutorial. Dedicated books are also available, such as Dive into Python 3.

Tip: Python is a programming language, as are C, Fortran, BASIC, PHP, etc. Some specific features of Python are as follows:
• an interpreted (as opposed to compiled) language. Contrary to e.g. C or Fortran, one does not compile Python code before executing it. In addition, Python can be used interactively: many Python interpreters are available, from which commands and scripts can be executed.
• free software released under an open-source license: Python can be used and distributed free of charge, even for building commercial software.
• multi-platform: Python is available for all major operating systems, Windows, Linux/Unix, MacOS X, most likely your mobile phone OS, etc.
• a very readable language with clear non-verbose syntax
• a language for which a large variety of high-quality packages are available for various applications, from web frameworks to scientific computing.
• a language very easy to interface with other languages, in particular C and C++.
• Some other features of the language are illustrated just below. For example, Python is an object-oriented language, with dynamic typing (the same variable can contain objects of different types during the course of a program).
See https://www.python.org/about/ for more information about distinguishing features of Python.

2.1 First steps
Start the Ipython shell (an enhanced interactive Python shell):
• by typing “ipython” from a Linux/Mac terminal, or from the Windows cmd shell,
• or by starting the program from a menu, e.g. the Anaconda Navigator, the Python(x,y) menu or the EPD menu if you have installed one of these scientific-Python suites.

Tip: If you don’t have Ipython installed on your computer, other Python shells are available, such as the plain Python shell started by typing “python” in a terminal, or the Idle interpreter. However, we advise using the Ipython shell because of its enhanced features, especially for interactive scientific computing.

Once you have started the interpreter, type

>>> print("Hello, world!")
Hello, world!

Tip: The message “Hello, world!” is then displayed. You just executed your first Python instruction, congratulations!

>>> a = 3
>>> b = 2*a
>>> type(b)
<class 'int'>
>>> print(b)
6
>>> a*b
18
>>> b = 'hello'
>>> type(b)
<class 'str'>
>>> b + b
'hellohello'
>>> 2*b
'hellohello'

Tip: Two variables a and b have been defined above. Note that one does not declare the type of a variable before assigning its value. In C, conversely, one should write:

int a = 3;
Booleans
>>> 3 > 4
False
>>> test = (3 > 4)
>>> test
False
>>> type(test)
<class 'bool'>

Tip: A Python shell can therefore replace your pocket calculator, with the basic arithmetic operations +, -, *, /, % (modulo) natively implemented

>>> 7 * 3.
21.0
>>> 2**10
1024

Note: The behaviour of the division operator has changed in Python 3: / always performs true division (3 / 2 gives 1.5), while // performs integer (floor) division (3 // 2 gives 1).

2.2.2 Containers

Tip: Python provides many efficient types of containers, in which collections of objects can be stored.

Lists

Tip: A list is an ordered collection of objects, that may have different types. For example:

>>> colors = ['red', 'blue', 'green', 'black', 'white']
>>> type(colors)
<class 'list'>

Indexing: accessing individual objects contained in the list:

>>> colors[2]
'green'

Counting from the end with negative indices:

>>> colors[-1]
'white'
>>> colors[-2]
'black'

Warning: Indexing starts at 0 (as in C), not at 1 (as in Fortran or Matlab)!

Slicing: obtaining sublists of regularly-spaced elements:

>>> colors
['red', 'blue', 'green', 'black', 'white']
>>> colors[2:4]
['green', 'black']

Warning: Note that colors[start:stop] contains the elements with indices i such that start <= i < stop (i ranging from start to stop-1). Therefore, colors[start:stop] has (stop - start) elements.

Slicing syntax: colors[start:stop:stride]

Lists are mutable objects and can be modified:

>>> colors = [3, -200, 'hello']
>>> colors
[3, -200, 'hello']
>>> colors[1], colors[2]
(-200, 'hello')

Tip: For collections of numerical data that all have the same type, it is often more efficient to use the array type provided by the numpy module. A NumPy array is a chunk of memory containing fixed-sized items. With NumPy arrays, operations on elements can be faster because elements are regularly spaced in memory and more operations are performed through specialized C functions instead of Python loops.

Tip: Python offers a large panel of functions to modify lists, or query them. Here are a few examples; for more details, see https://docs.python.org/tutorial/datastructures.html#more-on-lists

Add and remove elements:

>>> colors = ['red', 'blue', 'green', 'black', 'white']
>>> colors.append('pink')
>>> colors
['red', 'blue', 'green', 'black', 'white', 'pink']
>>> colors.pop() # removes and returns the last item
'pink'
>>> colors
['red', 'blue', 'green', 'black', 'white']
>>> colors.extend(['pink', 'purple']) # extend colors, in-place
>>> colors
['red', 'blue', 'green', 'black', 'white', 'pink', 'purple']
>>> colors = colors[:-2]
>>> colors
['red', 'blue', 'green', 'black', 'white']

Concatenate and repeat lists:
Tip: (Remember that negative indices correspond to counting from the right end.)
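The slicing syntax colors[start:stop:stride] and the negative-index convention can be combined. The following is a small sketch on the colors list used in this section; the variable names on the left-hand side are chosen for this illustration only.

```python
colors = ['red', 'blue', 'green', 'black', 'white']

# start and stop are optional: omitted bounds default to the ends of the list
first_three = colors[:3]         # indices 0, 1, 2
from_third = colors[3:]          # index 3 up to the end
every_other = colors[::2]        # stride of 2: indices 0, 2, 4
reversed_colors = colors[::-1]   # negative stride walks from the right end

print(first_three)       # ['red', 'blue', 'green']
print(from_third)        # ['black', 'white']
print(every_other)       # ['red', 'green', 'white']
print(reversed_colors)   # ['white', 'black', 'green', 'blue', 'red']
```

Each slice creates a new list, so the original colors list is left unchanged.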
Reminder: in Ipython: tab-completion (press tab)

In [28]: rcolors.<TAB>
rcolors.append   rcolors.index    rcolors.remove
rcolors.count    rcolors.insert   rcolors.reverse
rcolors.extend   rcolors.pop      rcolors.sort

In [1]: 'Hi, what's up?'
------------------------------------------------------------
File "<ipython console>", line 1
'Hi, what's up?'
          ^
SyntaxError: invalid syntax

This syntax error can be avoided by enclosing the string in double quotes instead of single quotes. Alternatively, one can prepend a backslash to the second single quote. Other uses of the backslash are, e.g., the newline character \n and the tab character \t.

Tip: Strings are collections like lists. Hence they can be indexed and sliced, using the same syntax and rules.

Indexing:

>>> a = "hello"
>>> a[0]
'h'
>>> a[1]
'e'

Tip: Accents and special characters can also be handled, as in Python 3 strings consist of Unicode characters.

A string is an immutable object and it is not possible to modify its contents. One may however create new strings from the original one.

Tip: Strings have many useful methods, such as a.replace as seen above. Remember the a. object-oriented notation and use tab completion or help(str) to search for new methods.

See also:
Python offers advanced possibilities for manipulating strings, looking for patterns or formatting. The interested reader is referred to https://docs.python.org/library/stdtypes.html#string-methods and https://docs.python.org/3/library/string.html#format-string-syntax

String formatting:

>>> 'An integer: %i ; a float: %f ; another string: %s' % (1, 0.1, 'string')  # with more values use tuple after %
>>> i = 102
>>> filename = 'processing_of_dataset_%d.txt' % i  # no need for tuples with just one value after %
>>> filename
'processing_of_dataset_102.txt'
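Besides the % operator shown above, Python 3 also offers the str.format method and, from Python 3.6 on, f-strings, both of which follow the format-string syntax linked above. The following sketch builds the same file name in the three styles; the variable names are illustrative.

```python
i = 102
value = 0.1

# Percent-style formatting, as used in the text above
old_style = 'processing_of_dataset_%d.txt' % i

# str.format: replacement fields in braces, filled by position
with_format = 'processing_of_dataset_{}.txt'.format(i)

# f-string (Python 3.6+): the expression inside the braces is evaluated in place
with_fstring = f'processing_of_dataset_{i}.txt'

# A format specification after the colon controls the rendering
rounded = f'a float: {value:.3f}'

print(with_fstring)   # processing_of_dataset_102.txt
print(rounded)        # a float: 0.100
```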
Dictionaries

Tip: A dictionary is basically an efficient table that maps keys to values. Since Python 3.7, a dictionary preserves the order in which its keys were inserted.

>>> tel = {'emmanuelle': 5752, 'sebastian': 5578}
>>> tel['francis'] = 5915
>>> tel
{'emmanuelle': 5752, 'sebastian': 5578, 'francis': 5915}
>>> tel['sebastian']
5578
>>> tel.keys()
dict_keys(['emmanuelle', 'sebastian', 'francis'])
>>> tel.values()
dict_values([5752, 5578, 5915])
>>> 'francis' in tel
True

1. an expression on the right hand side is evaluated, the corresponding object is created/obtained
2. a name on the left hand side is assigned, or bound, to the r.h.s. object

Things to note:
• a single object can have several names bound to it:

In [1]: a = [1, 2, 3]
In [2]: b = a
In [3]: a
Out[3]: [1, 2, 3]
In [4]: b
Out[4]: [1, 2, 3]
In [5]: a is b
Out[5]: True
In [6]: b[1] = 'hi!'
In [7]: a
Out[7]: [1, 'hi!', 3]
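Because assignment only binds a name to an existing object, an independent list has to be created explicitly. The sketch below contrasts binding a second name with making a shallow copy through list(); the name c is introduced here for illustration.

```python
a = [1, 2, 3]
b = a          # b is a second name for the same list object
c = list(a)    # c is a new list holding the same elements (a shallow copy)

b[1] = 'hi!'   # the change made through b is visible through a

print(a)               # [1, 'hi!', 3]
print(c)               # [1, 2, 3]
print(a is b, a is c)  # True False
```

The copy is shallow: the new list is a distinct object, but its elements are still shared with the original, which matters when the elements are themselves mutable.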
the indentation depth, go four spaces to the left with the Backspace key. Press the Enter key twice to leave the logical block.

>>> a = 10

>>> if a == 1:
...     print(1)
... elif a == 2:
...     print(2)
... else:
...     print('A lot')
A lot

Indentation is compulsory in scripts as well. As an exercise, re-type the previous lines with the same indentation in a script condition.py, and execute the script with run condition.py in Ipython.

2.3.2 for/range
Iterating with an index:

>>> for i in range(4):
...     print(i)
0
1
2
3

(continued from previous page)
...         continue
...     print(1. / element)
1.0
0.5
0.25

2.3.4 Conditional Expressions
if <OBJECT>
Evaluates to False:
• any number equal to zero (0, 0.0, 0+0j)
• an empty container (list, tuple, set, dictionary, ...)
• False, None
Evaluates to True:
• everything else

a == b Tests equality, with logics:

>>> 1 == 1.
True

a is b Tests identity: both sides are the same object:
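The truth-value rules and the difference between == and is can be checked directly. This is a minimal sketch; the lists a, b and c are made up for the illustration, and the empty string is included as one more empty container.

```python
# Truthiness: every object in this list evaluates to False in a condition
falsy = [0, 0.0, 0 + 0j, [], (), set(), {}, '', False, None]
print([bool(obj) for obj in falsy])

# Non-zero numbers and non-empty containers evaluate to True
print(bool(-1), bool([0]), bool('False'))   # True True True

# == compares values, `is` compares object identity
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a distinct object
c = a           # another name for the same object

print(a == b, a is b)   # True False
print(a == c, a is c)   # True True
```

Note that [0] and 'False' are true: what matters is that the container is non-empty, not what it contains.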
>>> message = "Hello how are you?"
>>> message.split() # returns a list
['Hello', 'how', 'are', 'you?']
>>> for word in message.split():
...     print(word)
...
Hello
how
are
you?

Tip: Few languages (in particular, languages for scientific computing) allow looping over anything but integers/indices. With Python it is possible to loop exactly over the objects of interest without bothering with indices you often don’t care about. This feature can often be used to make code more readable.

Warning: It is not safe to modify the sequence you are iterating over.

Keeping track of enumeration number
A common task is to iterate over a sequence while keeping track of the item number.
• We could use a while loop with a counter as above, or a for loop:

>>> words = ('cool', 'powerful', 'readable')
>>> for i in range(0, len(words)):
...     print((i, words[i]))
(0, 'cool')
(1, 'powerful')
(2, 'readable')

• But Python provides a built-in function, enumerate, for this:

>>> for index, item in enumerate(words):
...     print((index, item))
(0, 'cool')
(1, 'powerful')
(2, 'readable')

>>> d = {'a': 1, 'b': 1.2, 'c': 1j}
>>> for key, val in sorted(d.items()):
...     print('Key: %s has value: %s' % (key, val))
Key: a has value: 1
Key: b has value: 1.2
Key: c has value: 1j

Note: A dictionary iterates in insertion order (guaranteed since Python 3.7), which need not be sorted; here we use sorted(), which sorts on the keys.

>>> [i**2 for i in range(4)]
[0, 1, 4, 9]

Exercise
Compute the decimals of Pi using the Wallis formula:

$$\pi = 2 \prod_{i=1}^{\infty} \frac{4i^2}{4i^2 - 1}$$

2.4 Defining functions
2.4.1 Function definition

In [56]: def test():
   ....:     print('in test function')
   ....:
   ....:

In [57]: test()
in test function

Warning: Function blocks must be indented as other control-flow blocks.

2.4.2 Return statement
Functions can optionally return values.

In [6]: def disk_area(radius):
   ...:     return 3.14 * radius * radius
   ...:

In [8]: disk_area(1.5)
Out[8]: 7.0649999999999995

Note: Note the syntax to define a function:
• the def keyword;
• is followed by the function’s name, then
• the arguments of the function are given between parentheses followed by a colon.
• the function body;
• and return object for optionally returning values.
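The elements listed in the note above can be seen together in a short script. This sketch reuses the disk_area name from the text but swaps in math.pi for the constant 3.14; the describe function is hypothetical and only shows what happens when there is no return statement.

```python
import math


def disk_area(radius):
    """Return the area of a disk of the given radius."""
    # the body is indented one level below the def line
    return math.pi * radius ** 2


def describe(radius):
    # no return statement: calling this function yields None
    print('radius = %s' % radius)


area = disk_area(1.5)
print(area)

result = describe(1.5)
print(result)   # None
```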
In [125]: def double_it(x=bigx):
   .....:     return x * 2
   .....:

In [128]: double_it()
Out[128]: 20

Using a mutable type in a keyword argument (and modifying it inside the function body):

In [2]: def add_to_dict(args={'a': 1, 'b': 2}):
   ...:     for i in args.keys():
   ...:         args[i] += 1
   ...:     print(args)
   ...:

In [3]: add_to_dict
Out[3]: <function __main__.add_to_dict>

Keyword arguments are a very convenient feature for defining functions with a variable number of arguments, especially when default values are to be used in most calls to the function.

Tip: Can you modify the value of a variable inside a function? Most languages (C, Java, ...) distinguish "passing by value" and "passing by reference". In Python, such a distinction is somewhat artificial, and it is a bit subtle whether your variables are going to be modified or not. Fortunately, there exist clear rules.

Parameters to functions are references to objects, which are passed by value. When you pass a variable to a function, Python passes the reference to the object to which the variable refers (the value), not the variable itself.

If the value passed in a function is immutable, the function does not modify the caller's variable. If the value is mutable, the function may modify the caller's variable in-place:
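The rule about mutable and immutable arguments can be checked with a short sketch. The function name try_to_modify and the values used are illustrative assumptions. The second half shows the related pitfall of a mutable default argument: the default dictionary is created once, at function definition time, and is shared between calls.

```python
def try_to_modify(x, y, z):
    x = 23          # rebinds the local name only
    y.append(42)    # mutates the caller's list in place
    z = [99]        # rebinds the local name to a new list


a = 77      # immutable int
b = [99]    # mutable list
c = [28]    # mutable list
try_to_modify(a, b, c)


def add_to_dict(args={'a': 1, 'b': 2}):
    # The default dict is shared across calls, so changes accumulate.
    for key in args:
        args[key] += 1
    return dict(args)   # return a snapshot of the current state


first = add_to_dict()
second = add_to_dict()
```

After the call, a and c are unchanged while b has grown, and the second call to add_to_dict sees the increments made by the first.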
In [120]: x
Out[120]: 5

This works:

In [121]: def setx(y):
   .....:     global x
   .....:     x = y
   .....:     print('x is %d ' % x)
   .....:
   .....:

In [122]: setx(10)
x is 10

Note: Docstring guidelines

For the sake of standardization, the Docstring Conventions webpage documents the semantics and conventions associated with Python docstrings.

Also, the Numpy and Scipy modules have defined a precise standard for documenting scientific functions, that you may want to follow for your own functions, with a Parameters section, an Examples section, etc. See https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard

2.4.8 Functions are objects

Functions are first-class objects, which means they can be:
• assigned to a variable
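A brief sketch of what "first-class" means in practice. Besides being assigned to a variable, a function can also be stored in a container or passed as an argument to another function; the names twice, operations and apply_to_all are hypothetical and chosen only for this example.

```python
def double_it(x):
    """Return twice the value of x."""
    return x * 2


# A function can be bound to another name...
twice = double_it

# ...stored in a container...
operations = {'double': double_it, 'absolute': abs}


# ...and passed as an argument to another function.
def apply_to_all(func, values):
    return [func(v) for v in values]


doubled = apply_to_all(twice, [1, 2, 3])
```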
In [39]: va('three', x=1, y=2)
args is ('three',)
kwargs is {'x': 1, 'y': 2}

2.4.9 Methods

Methods are functions attached to objects. You've seen these in our examples on lists, dictionaries, strings, etc...

2.4.10 Exercises

Exercise: Fibonacci sequence

Write a function that displays the n first terms of the Fibonacci sequence, defined by:

$$\begin{cases} U_0 = 0 \\ U_1 = 1 \\ U_{n+2} = U_{n+1} + U_n \end{cases}$$

Exercise: Quicksort

Implement the quicksort algorithm, as defined by wikipedia

function quicksort(array)
    var list less, greater
    if length(array) < 2
        return array
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot + 1 then append x to less
        else append x to greater
    return concatenate(quicksort(less), pivot, quicksort(greater))

2.5.1 Scripts

Tip: Let us first write a script, that is a file with a sequence of instructions that are executed each time the script is called. Instructions may be e.g. copied-and-pasted from the interpreter (but take care to respect indentation rules!).

The extension for Python files is .py. Write or copy-and-paste the following lines in a file called test.py

message = "Hello how are you?"
for word in message.split():
    print(word)

Tip: Let us now execute the script interactively, that is inside the Ipython interpreter. This is maybe the most common use of scripts in scientific computing.

Note: in Ipython, the syntax to execute a script is %run script.py. For example,

In [1]: %run test.py
Hello
how
are
you?

In [2]: message
Out[2]: 'Hello how are you?'

The script has been executed. Moreover the variables defined in the script (such as message) are now available inside the interpreter's namespace.

Tip: Other interpreters also offer the possibility to execute scripts (e.g., execfile in the plain Python 2 interpreter; Python 3 removed execfile, and exec(open('test.py').read()) plays the same role).

It is also possible to execute this script as a standalone program, by executing the script inside a shell terminal (Linux/Mac console or cmd Windows console). For example, if we are in the same directory as the test.py file, we can execute this in a console:

$ python test.py
Hello
how
are
you?

Tip: Standalone scripts may also take command-line arguments

In file.py:

Warning: Don't implement option parsing yourself. Use a dedicated module such as argparse.

2.5.2 Importing objects from modules

In [1]: import os
Scipy lecture notes, Edition 2022.1
This is called the star import and please, do not use it:

• Makes the code harder to read and understand: where do symbols come from?
• Makes it impossible to guess the functionality by the context and the name (hint: os.name is the name of the OS), and to profit usefully from tab completion.
• Restricts the variable names you can use: os.name might override name, or vice versa.
• Creates possible name clashes between modules.
• Makes the code impossible to statically check for undefined symbols.

Tip: Modules are thus a good way to organize code in a hierarchical way. Actually, all the scientific computing tools we are going to use are modules:

>>> import numpy as np # data arrays
>>> np.linspace(0, 10, 6)
array([ 0.,  2.,  4.,  6.,  8., 10.])
>>> import scipy # scientific computing

2.5.3 Creating modules

Tip: If we want to write larger and better organized programs (compared to simple scripts), where some objects are defined (variables, functions, classes) and that we want to reuse several times, we have to create our own modules.

Let us create a module demo contained in the file demo.py:

Importing the module gives access to its objects, using the module.object syntax. Don't forget to put the module's name before the object's name, otherwise Python won't recognize the instruction.

Introspection

In [4]: demo?
Type: module
Base Class: <type 'module'>
String Form: <module 'demo' from 'demo.py'>
Namespace: Interactive
File: /home/varoquau/Projects/Python_talks/scipy_2009_tutorial/source/demo.py
Docstring:
A demo module.

In [5]: who
demo

In [6]: whos
Variable   Type      Data/Info
------------------------------
demo       module    <module 'demo' from 'demo.py'>

In [7]: dir(demo)
Out[7]:
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'c',
 'd',
 'print_a',
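The listing of demo.py itself is not reproduced in this section. A minimal sketch that is consistent with the introspection output (the docstring "A demo module." and the names c, d and print_a) might look as follows; the values of c and d, the function bodies and the presence of print_b are assumptions.

```python
"A demo module."


def print_b():
    "Prints b."
    print('b')


def print_a():
    "Prints a."
    print('a')


# Module-level variables become attributes of the module (demo.c, demo.d).
c = 2
d = 2
```

Saved as demo.py in the working directory, such a file could be loaded with import demo and its objects reached as demo.print_a(), demo.c, and so on.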
File demo2.py:

def print_b():
    "Prints b."
    print('b')

def print_a():
    "Prints a."
    print('a')

# print_b() runs on import
print_b()

if __name__ == '__main__':
    # print_a() is only executed when the module is run directly.
    print_a()

Importing it:

Modules must be located in the search path, therefore you can:
• write your own modules within directories already defined in the search path (e.g. $HOME/.local/lib/python2.7/dist-packages). You may use symbolic links (on Linux) to keep the code somewhere else.
• modify the environment variable PYTHONPATH to include the directories containing the user-defined modules.

Tip: On Linux/Unix, add the following line to a file read by the shell at startup (e.g. /etc/profile, .profile)

export PYTHONPATH=$PYTHONPATH:/home/emma/user_defined_modules

On Windows, http://support.microsoft.com/kb/310519 explains how to handle environment variables.
• or modify the sys.path variable itself within a Python script.

Tip:

import sys
new_path = '/home/emma/user_defined_modules'
if new_path not in sys.path:
    sys.path.append(new_path)

This method is not very robust, however, because it makes the code less portable (user-dependent path) and because you have to add the directory to your sys.path each time you want to import from a module in this directory.

See also:
See https://docs.python.org/tutorial/modules.html for more information about modules.

2.5.6 Packages

A directory that contains many modules is called a package. A package is a module with submodules (which can have submodules themselves, etc.). A special file called __init__.py (which may be empty) tells Python that the directory is a Python package, from which modules can be imported.

$ ls
cluster/        io/          README.txt@     stsci/
__config__.py@  LATEST.txt@  setup.py@       __svn_version__.py@
__config__.pyc  lib/         setup.pyc       __svn_version__.pyc
constants/      linalg/      setupscons.py@  THANKS.txt@
fftpack/        linsolve/    setupscons.pyc  TOCHANGE.txt@
__init__.py@    maxentropy/  signal/         version.py@
__init__.pyc    misc/        sparse/         version.pyc
INSTALL.txt@    ndimage/     spatial/        weave/
integrate/      odr/         special/
interpolate/    optimize/    stats/
$ cd ndimage
$ ls
doccer.py@   fourier.pyc   interpolation.py@  morphology.pyc   setup.pyc
doccer.pyc   info.py@      interpolation.pyc  _nd_image.so     setupscons.py@
filters.py@  info.pyc      measurements.py@   _ni_support.py@  setupscons.pyc
filters.pyc  __init__.py@  measurements.pyc   _ni_support.pyc  tests/
fourier.py@  __init__.pyc  morphology.py@     setup.py@

From Ipython:

In [1]: import scipy

In [2]: scipy.__file__
Out[2]: '/usr/lib/python2.6/dist-packages/scipy/__init__.pyc'

In [3]: import scipy.version

In [4]: scipy.version.version
Out[4]: '0.7.0'

In [5]: import scipy.ndimage.morphology

In [6]: from scipy.ndimage import morphology

In [17]: morphology.binary_dilation?
Type: function
Base Class: <type 'function'>
String Form: <function binary_dilation at 0x9bedd84>
Namespace: Interactive
File: /usr/lib/python2.6/dist-packages/scipy/ndimage/morphology.py
Definition: morphology.binary_dilation(input, structure=None,
iterations=1, mask=None, output=None, border_value=0, origin=0,
brute_force=False)
Docstring:
Multi-dimensional binary dilation with the given structure.

An output array can optionally be provided. The origin parameter
controls the placement of the filter. If no structuring element is
provided an element is generated with a squared connectivity equal
to one. The dilation operation is repeated iterations times. If
iterations is less than 1, the dilation is repeated until the
result does not change anymore. If a mask is given, only those
elements with a true value at the corresponding mask element are
modified at each iteration.

2.5.7 Good practices

• Use meaningful object names
• Indentation: no choice!

Tip: Indenting is compulsory in Python! Every command block following a colon bears an additional indentation level with respect to the previous line with a colon. One must therefore indent after def f(): or while:. At the end of such logical blocks, one decreases the indentation depth (and re-increases it if a new block is entered, etc.)

Strict respect of indentation is the price to pay for getting rid of { or ; characters that delineate logical blocks in other languages. Improper indentation leads to errors such as

------------------------------------------------------------
IndentationError: unexpected indent (test.py, line 2)

All this indentation business can be a bit confusing in the beginning. However, with the clear indentation, and in the absence of extra characters, the resulting code is very nice to read compared to other languages.

• Indentation depth: Inside your text editor, you may choose to indent with any positive number of spaces (1, 2, 3, 4, ...). However, it is considered good practice to indent with 4 spaces. You may configure your editor to map the Tab key to a 4-space indentation.
• Style guidelines

Long lines: you should not write very long lines that span over more than (e.g.) 80 characters. Long lines can be broken with the \ character

>>> long_line = "Here is a very very long line \
... that we break in two parts."

Spaces

Write well-spaced code: put whitespaces after commas, around arithmetic operators, etc.:

>>> a = 1 # yes
>>> a=1 # too cramped
A certain number of rules for writing "beautiful" code (and more importantly using the same conventions as anybody else!) are given in the Style Guide for Python Code.

Quick read
If you want to do a first quick pass through the Scipy lectures to learn the ecosystem, you can directly skip to the next chapter: NumPy: creating and manipulating numerical data.
The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to come back and finish this chapter later.

2.6 Input and Output

To be exhaustive, here is some information about input and output in Python. Since we will use the Numpy methods to read and write files, you may skip this chapter at first reading.

We write or read strings to/from files (other types must be converted to strings). To write in a file:

>>> f = open('workfile', 'w') # opens the workfile file
>>> type(f)
<type 'file'>
>>> f.write('This is a test \nand another test')
>>> f.close()

To read from a file

In [1]: f = open('workfile', 'r')

In [2]: s = f.read()

In [3]: print(s)
This is a test
and another test

In [4]: f.close()

See also:
For more details: https://docs.python.org/tutorial/inputoutput.html

• Append a file: a
• Read and Write: r+
• Binary mode: b
  – Note: Use for binary files, especially on Windows.

2.7 Standard Library

Note: Reference document for this section:
• The Python Standard Library documentation: https://docs.python.org/library/index.html
• Python Essential Reference, David Beazley, Addison-Wesley Professional

2.7.1 os module: operating system functionality

"A portable way of using operating system dependent functionality."

Directory and file manipulation

Current directory:

In [17]: os.getcwd()
Out[17]: '/Users/cburns/src/scipy2009/scipy_2009_tutorial/source'

List a directory:

In [31]: os.listdir(os.curdir)
Out[31]:
['.index.rst.swo',
 '.python_language.rst.swp',
 '.view_array.py.swp',
 '_static',
 '_templates',
 'basic_types.rst',
 'conf.py',
 'control_flow.rst',
 'debugging.rst',
 ...
In [47]: os.remove('junk.txt')

In [48]: 'junk.txt' in os.listdir(os.curdir)
Out[48]: False

os.path: path manipulations

os.path provides common operations on pathnames.

In [70]: fp = open('junk.txt', 'w')

In [71]: fp.close()

In [72]: a = os.path.abspath('junk.txt')

In [84]: os.path.exists('junk.txt')
Out[84]: True

In [86]: os.path.isfile('junk.txt')
Out[86]: True

In [87]: os.path.isdir('junk.txt')
Out[87]: False

In [88]: os.path.expanduser('~/local')
Out[88]: '/Users/cburns/local'

In [92]: os.path.join(os.path.expanduser('~'), 'local', 'bin')
Out[92]: '/Users/cburns/local/bin'

In [20]: import sh

In [20]: com = sh.ls()

In [21]: print(com)
basic_types.rst   exceptions.rst   oop.rst              standard_library.rst
control_flow.rst  first_steps.rst  python_language.rst
demo2.py          functions.rst    python-logo.png
demo.py           io.rst           reusing_code.rst

In [22]: print(com.exit_code)
0

In [23]: type(com)
Out[23]: sh.RunningCommand

Environment variables:

In [9]: import os

In [11]: os.environ.keys()
Out[11]:
['_',
 'FSLDIR',
 'TERM_PROGRAM_VERSION',
 'FSLREMOTECALL',
 'USER',
 'HOME',
 'PATH',
 'PS1',
 'SHELL',
 'EDITOR',
 'WORKON_HOME',
In [16]: os.getenv('PYTHONPATH')
Out[16]: '.:/Users/cburns/src/utils:/Users/cburns/src/nitools:
/Users/cburns/local/lib/python2.5/site-packages/:
/usr/local/lib/python2.5/site-packages/:
/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5'

In [19]: glob.glob('*.txt')
Out[19]: ['holy_grail.txt', 'junk.txt', 'newfile.txt']

2.7.4 sys module: system-specific information

System-specific information related to the Python interpreter.

• Which version of python are you running and where is it installed:

2.7.5 pickle: easy persistence

Useful to store arbitrary objects to a file. Not safe or fast!

In [1]: import pickle

2.8 Exception handling in Python

It is likely that you have raised Exceptions if you have typed all the previous commands of the tutorial. For example, you may have raised an exception if you entered a command with a typo.

Exceptions are raised by different kinds of errors arising when executing Python code. In your own code, you may also catch errors, or define custom error types. You may want to look at the descriptions of the built-in Exceptions when looking for the right exception type.

In [7]: l.foobar
---------------------------------------------------------------------------
AttributeError: 'list' object has no attribute 'foobar'

As you can see, there are different types of exceptions for different errors.

2.8.2 Catching exceptions

try/except

In [10]: while True:
   ....:     try:
   ....:         x = int(input('Please enter a number: '))
   ....:         break
   ....:     except ValueError:
   ....:         print('That was no valid number. Try again...')
   ....:
Please enter a number: a
That was no valid number. Try again...
Please enter a number: 1

In [9]: x
Out[9]: 1

2.8.3 Raising exceptions

• Capturing and reraising an exception:

In [15]: def filter_name(name):
   ....:     try:
   ....:         name = name.encode('ascii')
   ....:     except UnicodeError as e:
   ....:         if name == 'Gaël':
   ....:             print('OK, Gaël')
   ....:         else:
   ....:             raise e
   ....:     return name
   ....:

In [16]: filter_name('Gaël')
OK, Gaël
Out[16]: 'Ga\xc3\xabl'

In [17]: filter_name('Stéfan')
---------------------------------------------------------------------------
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
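The text mentions that you may define custom error types and re-raise exceptions you have caught. A minimal Python 3 sketch of both ideas follows; the class NegativeRadiusError and the helper safe_disk_area are hypothetical names introduced only for this example.

```python
class NegativeRadiusError(ValueError):
    """Hypothetical custom error raised for a negative radius."""


def disk_area(radius):
    if radius < 0:
        raise NegativeRadiusError('radius must be non-negative, got %r' % radius)
    return 3.14 * radius * radius


def safe_disk_area(radius):
    try:
        return disk_area(radius)
    except NegativeRadiusError as err:
        print('Invalid input:', err)
        raise   # re-raise the same exception for the caller


try:
    safe_disk_area(-1)
except ValueError as err:
    # The custom type derives from ValueError, so this handler catches it.
    caught = err
```

Deriving the custom type from a built-in exception keeps existing handlers for ValueError working, which is one common reason to subclass rather than start from Exception.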
Here is a small example: we create a Student class, which is an object gathering several custom functions (methods) and variables (attributes), we will be able to use:

>>> class Student(object):
...     def __init__(self, name):
...         self.name = name
...     def set_age(self, age):
...         self.age = age
...     def set_major(self, major):
...         self.major = major
...
>>> anna = Student('anna')
>>> anna.set_age(21)
>>> anna.set_major('physics')

In the previous example, the Student class has __init__, set_age and set_major methods. Its attributes are name, age and major. We can call these methods and attributes with the following notation: classinstance.method or classinstance.attribute. The __init__ constructor is a special method we call with: MyClass(init parameters if any).

Now, suppose we want to create a new class MasterStudent with the same methods and attributes as the previous one, but with an additional internship attribute. We won't copy the previous class, but inherit from it:
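A minimal sketch of such a subclass, restating Student so that the block is self-contained. The value given to internship and the instance name james are illustrative assumptions.

```python
class Student(object):
    def __init__(self, name):
        self.name = name

    def set_age(self, age):
        self.age = age

    def set_major(self, major):
        self.major = major


class MasterStudent(Student):
    # Only the new attribute is declared; everything else is inherited.
    internship = 'mandatory, from March to June'   # illustrative value


james = MasterStudent('james')
james.set_age(23)   # method inherited from Student
```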
The MasterStudent class inherited from the Student attributes and methods.

Thanks to classes and object-oriented programming, we can organize code with different classes corresponding to different objects we encounter (an Experiment class, an Image class, a Flow class, etc.), with their own methods and attributes. Then we can use inheritance to consider variations around a base class and re-use code. Ex : from a Flow base class, we can create derived StokesFlow, TurbulentFlow, PotentialFlow, etc.

3 Python 2 / 3

Two major versions of Python exist, Python 2 and Python 3. Python 3 is the only supported version since January 2020, but the two versions coexisted for about a decade of transition from Python 2 to Python 3. The transition has come to an end as most software libraries drop Python 2 support.

• Most scientific libraries have moved to Python 3. NumPy and many scientific software libraries dropped Python 2 support or will do so soon, see the Python 3 statement.

The SciPy Lecture Notes dropped Python 2 support in 2020. The release 2020.1 is almost entirely Python 2 compatible, so you may use it as a reference if necessary. Know that installing suitable packages will probably be challenging.

3.2 Breaking changes between Python 2 and Python 3

Python 3 differs from Python 2 in several ways. We list the most relevant ones for scientific users below.

3.2.1 Print function

The most visible change is that print is not a "statement" anymore but a function. Whereas in Python 2 you could write

>>> print 'hello, world'
hello, world

in Python 3 you must write

>>> print('hello, world')
hello, world

By making print() a function, one can pass arguments such as a file identifier where the output will be sent.

3.2.2 Division

In Python 2, the division of two integers with a single slash character results in floor-based integer division:

>>> 1/2
0

In Python 3, a single slash performs true division, and a double slash performs floor division:

>>> 1/2
0.5
>>> 1//2
0

3.3 Some new features in Python 3

• Since Python 3.6, there is a new string formatting method, the "f-string":

>>> name = 'SciPy'
>>> print(f"Hello, {name}!")
Hello, SciPy!

• In Python 2, range(N) returns a list. For large values of N (for a loop iterating many times), this implies the creation of a large list in memory even though it is not necessary. Python 2 provided the alternative xrange, that you will find in many scientific programs.
In Python 3, range() returns a dedicated type and does not allocate the memory for the corresponding list.

>>> type(range(8))
<class 'range'>
>>> range(8)
range(0, 8)

You can transform the output of range into a list if necessary:

>>> list(range(8))
[0, 1, 2, 3, 4, 5, 6, 7]
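The Python 3 behaviours discussed in this chapter (print as a function, true and floor division, f-strings, a lazy range) can be exercised together in a short sketch. The variable names and the in-memory buffer are assumptions made for illustration.

```python
import io
import sys

# print() accepts a target stream through its file argument.
buffer = io.StringIO()
print('hello, world', file=buffer)
written = buffer.getvalue()

true_div = 1 / 2     # true division
floor_div = 1 // 2   # floor division

# f-strings accept a format specification after the colon.
value = 3.14159
label = f"pi is about {value:.2f}"

# A range object is lazy: no billion-element list is allocated here.
big = range(10**9)
size_in_bytes = sys.getsizeof(big)
last = big[-1]       # range objects support indexing
```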
4 NumPy: creating and manipulating numerical data

• closer to hardware (efficiency)
• designed for scientific computation (convenience)
• Also known as array oriented computing

• values of an experiment/simulation at discrete time steps
• signal recorded by a measurement device, e.g. sound wave
• pixels of an image, grey-level or colour
• 3-D data measured at different X-Y-Z positions, e.g. MRI scan
• ...
    Create an array.
numpy.memmap
    Create a memory-map to an array stored in a *binary* file on disk.

In [6]: np.con*?
np.concatenate
np.conj
np.conjugate
np.convolve

Import conventions

The recommended convention to import numpy is:

>>> import numpy as np

• Create a simple two dimensional array. First, redo the examples from above. And then create your own: how about odd numbers counting backwards on the first row, and even numbers on the second?
• Use the functions len(), numpy.shape() on these arrays. How do they relate to each other? And to the ndim attribute of the arrays?

Functions for creating arrays

Tip: In practice, we rarely enter items one by one...

• Evenly spaced:
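For evenly spaced values, a short sketch using np.arange and np.linspace, both of which appear elsewhere in these notes. The particular bounds and sizes are illustrative.

```python
import numpy as np

a = np.arange(10)                          # integers 0 .. 9
b = np.arange(1, 9, 2)                     # start, end (exclusive), step
c = np.linspace(0, 1, 6)                   # start, end (inclusive), number of points
d = np.linspace(0, 1, 5, endpoint=False)   # same start, end excluded
```

np.arange is driven by a step, np.linspace by a number of points; the latter is usually the safer choice with floating-point bounds.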
Exercise: Creating arrays using functions

• Experiment with arange, linspace, ones, zeros, eye and diag.
• Create different kinds of arrays with random numbers.
• Try setting the seed before creating an array with random values.
• Look at the function np.empty. What does it do? When might this be useful?

4.1.3 Basic data types

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used:

>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')
>>> b = np.array([1., 2., 3.])
>>> b.dtype
dtype('float64')

The default data type is floating point:

>>> a = np.ones((3, 3))
>>> a.dtype
dtype('float64')

There are also other types:

Complex

>>> d = np.array([1+2j, 3+4j, 5+6*1j])
>>> d.dtype
dtype('complex128')

Bool

>>> e = np.array([True, False, False, True])
>>> e.dtype
dtype('bool')

Much more

• int32
• int64
• uint32
• uint64

4.1.4 Basic visualization

Now that we have our first data arrays, we are going to visualize them.

Start by launching IPython:

$ ipython # or ipython3 depending on your install

Or the notebook:

$ jupyter notebook

Once IPython has started, enable interactive plots:

>>> %matplotlib

>>> plt.plot(x, y) # line plot
>>> plt.show() # <-- shows the plot (not needed with interactive plots)

Or, if you have enabled interactive plots with %matplotlib:

>>> plt.plot(x, y) # line plot

• 1D plotting:

>>> x = np.linspace(0, 3, 20)
>>> y = np.linspace(0, 9, 20)
>>> plt.plot(x, y) # line plot
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(x, y, 'o') # dot plot
[<matplotlib.lines.Line2D object at ...>]
• 2D arrays (such as images):

>>> image = np.random.rand(30, 30)
>>> plt.imshow(image, cmap=plt.cm.hot)
<matplotlib.image.AxesImage object at ...>
>>> plt.colorbar()
<matplotlib.colorbar.Colorbar object at ...>

See also:
More in the: matplotlib chapter

Warning: Indices begin at 0, like other Python sequences (and C/C++). In contrast, in Fortran or Matlab, indices begin at 1.

>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

>>> a = np.diag(np.arange(3))
>>> a
array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])
>>> a[1, 1]
1
>>> a[2, 1] = 10 # third line, second column
>>> a
array([[ 0,  0,  0],
       [ 0,  1,  0],
       [ 0, 10,  2]])
>>> a[1]
array([0, 1, 0])

Note:
• In 2D, the first dimension corresponds to rows, the second to columns.
• for multidimensional a, a[0] is interpreted by taking all elements in the unspecified dimensions.

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[2:9:3] # [start:end:step]
array([2, 5, 8])
Exercise: Array creation

Create the following arrays (with correct data types):

[[1, 1, 1, 1],
 [1, 1, 1, 1],
 [1, 1, 1, 2],
 [1, 6, 1, 1]]

>>> np.may_share_memory(a, c)
False

This behavior can be surprising at first sight... but it saves both memory and time.

Worked example: Prime number sieve
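The body of the prime number sieve example is not reproduced in this section. One possible sketch of a sieve of Eratosthenes built on slice assignment is given below; the bound N_MAX and the variable names are assumptions. Because a slice is a view on the original array, assigning through it modifies is_prime in place.

```python
import numpy as np

N_MAX = 100   # hypothetical upper bound (exclusive)

# Start by assuming every number is prime, then cross out 0 and 1.
is_prime = np.ones((N_MAX,), dtype=bool)
is_prime[:2] = False

# Cross out the multiples of each prime j, starting at j*j.
for j in range(2, int(np.sqrt(N_MAX)) + 1):
    if is_prime[j]:
        is_prime[j * j::j] = False   # slice assignment acts on a view, in place

primes = np.nonzero(is_prime)[0]
```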
Indexing with a mask can be very useful to assign a new value to a sub-array:
>>> a[a % 3 == 0] = -1
>>> a
array([10, -1, 8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1, 7, 14])
>>> a = np.arange(10000)
>>> %timeit a + 1
10000 loops, best of 3: 24.3 us per loop
>>> l = range(10000)
>>> %timeit [i+1 for i in l]
1000 loops, best of 3: 861 us per loop

Logical operations:

>>> a = np.array([1, 1, 0, 0], dtype=bool)
>>> b = np.array([1, 0, 1, 0], dtype=bool)
>>> np.logical_or(a, b)
array([ True,  True,  True, False])
>>> np.logical_and(a, b)
array([ True, False, False, False])
Warning: Array multiplication is not matrix multiplication:
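A short sketch of the difference, using a 3x3 array of ones as an illustrative input: the * operator multiplies element by element, while the dot method (or the @ operator) computes the matrix product.

```python
import numpy as np

c = np.ones((3, 3))

elementwise = c * c     # element-by-element product: all entries stay 1
matrix = c.dot(c)       # matrix product: every entry is a sum of three 1s
same_matrix = c @ c     # the @ operator gives the same matrix product
```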
Transcendental functions:

>>> a = np.arange(5)
>>> np.sin(a)
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])
>>> np.log(a)
array([      -inf,  0.        ,  0.69314718,  1.09861229,  1.38629436])
>>> np.exp(a)
array([ 1.        ,  2.71828183,  7.3890561 , 20.08553692, 54.59815003])

Shape mismatches

>>> a = np.arange(4)
>>> a + np.array([1, 2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (4,) (2,)

4.2.2 Basic reductions

Computing sums

>>> x = np.array([1, 2, 3, 4])
>>> np.sum(x)
10
>>> x.sum()
10
>>> x.std() # full population standard dev.
0.82915619758884995

• Given there is a sum, what other function might you expect to see?
• What is the difference between sum and cumsum?

Worked Example: diffusion using a random walk algorithm

>>> positions = np.cumsum(steps, axis=1) # axis = 1: dimension of time
>>> sq_distance = positions**2

Plot the results:

>>> plt.figure(figsize=(4, 3))
<Figure size ... with 0 Axes>
>>> plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-')
[<matplotlib.lines.Line2D object at ...>, <matplotlib.lines.Line2D object at ...>]
>>> plt.xlabel(r"$t$")
Text(...'$t$')
>>> plt.ylabel(r"$\sqrt{\langle (\delta x)^2 \rangle}$")
Text(...'$\\sqrt{\\langle (\\delta x)^2 \\rangle}$')
>>> plt.tight_layout() # provide sufficient space for labels
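The plotting commands above use steps, t and mean_sq_distance, whose definitions are not shown in this section. A self-contained sketch of how they might be computed follows; the number of walkers, the number of time steps and the seeded generator are assumptions.

```python
import numpy as np

n_stories = 1000   # number of independent walkers (assumed value)
t_max = 200        # number of time steps (assumed value)
rng = np.random.default_rng(0)

t = np.arange(t_max)

# Each step is -1 or +1 with equal probability.
steps = 2 * rng.integers(0, 2, size=(n_stories, t_max)) - 1

positions = np.cumsum(steps, axis=1)   # axis=1: dimension of time
sq_distance = positions ** 2

# Average over walkers (axis=0) to get one value per time step.
mean_sq_distance = np.mean(sq_distance, axis=0)
```

The mean squared distance grows roughly linearly with time, which is why the plot compares its square root against np.sqrt(t).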
Let's verify:

Tip: Broadcasting seems a bit magical, but it is actually quite natural to use it when we want to solve a problem whose output data is an array with more dimensions than input data.

Let's construct an array of distances (in miles) between cities of Route 66: Chicago, Springfield, Saint-Louis, Tulsa, Oklahoma City, Amarillo, Santa Fe, Albuquerque, Flagstaff and Los Angeles.

>>> mileposts = np.array([0, 198, 303, 736, 871, 1175, 1475, 1544,
...                       1913, 2448])
>>> distance_array = np.abs(mileposts - mileposts[:, np.newaxis])
>>> distance_array
array([[   0,  198,  303,  736,  871, 1175, 1475, 1544, 1913, 2448],
       [ 198,    0,  105,  538,  673,  977, 1277, 1346, 1715, 2250],
       [ 303,  105,    0,  433,  568,  872, 1172, 1241, 1610, 2145],
       [ 736,  538,  433,    0,  135,  439,  739,  808, 1177, 1712],
       [ 871,  673,  568,  135,    0,  304,  604,  673, 1042, 1577],
       [1175,  977,  872,  439,  304,    0,  300,  369,  738, 1273],
       [1475, 1277, 1172,  739,  604,  300,    0,   69,  438,  973],
       [1544, 1346, 1241,  808,  673,  369,   69,    0,  369,  904],
       [1913, 1715, 1610, 1177, 1042,  738,  438,  369,    0,  535],
       [2448, 2250, 2145, 1712, 1577, 1273,  973,  904,  535,    0]])
Remark : the numpy.ogrid() function directly creates the vectors x and y of the previous example, with two "significant dimensions":

Tip: So, np.ogrid is very useful as soon as we have to handle computations on a grid. On the other hand, np.mgrid directly provides matrices full of indices for cases where we can't (or don't want to) benefit from broadcasting:

Reshaping

The inverse operation to flattening:

>>> a.shape
(2, 3)
>>> b = a.ravel()
>>> b = b.reshape((2, 3))
>>> b
array([[1, 2, 3],
       [4, 5, 6]])

Warning: ndarray.reshape may return a view (cf help(np.reshape)), or copy

Tip:

>>> b[0, 0] = 99
>>> a
array([[99,  2,  3],
       [ 4,  5,  6]])

Beware: reshape may also return a copy!:

>>> a = np.zeros((3, 2))
>>> b = a.T.reshape(3*2)
>>> b[0] = 9
>>> a
array([[0., 0.],
       [0., 0.],
       [0., 0.]])

To understand this you need to learn more about the memory layout of a numpy array.

Adding a dimension

Indexing with the np.newaxis object allows us to add an axis to an array (you have seen this already above in the broadcasting section):

Dimension shuffling

>>> a = np.arange(4*3*2).reshape(4, 3, 2)
>>> a.shape
(4, 3, 2)
>>> a[0, 2, 1]
5
>>> b = a.transpose(1, 2, 0)
>>> b.shape
(3, 2, 4)
>>> b[2, 1, 0]
5

Also creates a view:

>>> b[2, 1, 0] = -1
>>> a[0, 2, 1]
-1

Resizing

Size of an array can be changed with ndarray.resize:

>>> a = np.arange(4)
>>> a.resize((8,))
>>> a
array([0, 1, 2, 3, 0, 0, 0, 0])

However, it must not be referred to somewhere else:

>>> b = a
>>> a.resize((4,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot resize an array that has been referenced or is
referencing another array in this way. Use the resize function

Exercise: Shape manipulations

• Look at the docstring for reshape, especially the notes section which has some more information about copies and views.
• Use flatten as an alternative to ravel. What is the difference? (Hint: check which one returns a view and which a copy)
• Experiment with transpose for dimension shuffling.

In-place sort:

>>> a.sort(axis=1)
>>> a
array([[3, 4, 5],
       [1, 1, 2]])

Sorting with fancy indexing:

>>> a = np.array([4, 3, 1, 2])
>>> j = np.argsort(a)
>>> j
array([2, 3, 1, 0])
>>> a[j]
array([1, 2, 3, 4])

Finding minima and maxima:
• Know the shape of the array with array.shape, then use slicing to obtain different views of the array: array[::2], etc. Adjust the shape of the array using reshape or flatten it with ravel.
• Obtain a subset of the elements of an array and/or modify their values with masks

>>> a[a < 0] = 0

• Know miscellaneous operations on arrays, such as finding the mean or max (array.max(), array.mean()). No need to retain everything, but have the reflex to search in the documentation (online docs, help(), lookfor())!!
• For advanced use: master the indexing with arrays of integers, as well as broadcasting. Know more NumPy functions to handle various array operations.

Quick read
If you want to do a first quick pass through the Scipy lectures to learn the ecosystem, you can directly skip to the next chapter: Matplotlib: plotting.
The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to come back and finish this chapter, as well as to do some more exercises.

Section contents
• More data types
• Structured data types
• maskedarray: dealing with (propagation of) missing data

>>> a = np.array([1.2, 1.5, 1.6, 2.5, 3.5, 4.5])
>>> b = np.around(a)
>>> b # still floating-point
array([1., 2., 2., 2., 4., 4.])
>>> c = np.around(a).astype(int)
>>> c
array([1, 2, 2, 2, 4, 4])

Different data type sizes

Integers (signed):

int8    8 bits
int16   16 bits
int32   32 bits (same as int on 32-bit platform)
int64   64 bits (same as int on 64-bit platform)

>>> np.array([1], dtype=int).dtype
dtype('int64')
>>> np.iinfo(np.int32).max, 2**31 - 1
(2147483647, 2147483647)

Unsigned integers:

uint8   8 bits
uint16  16 bits
uint32  32 bits
uint64  64 bits

>>> np.iinfo(np.uint32).max, 2**32 - 1
(4294967295, 4294967295)
Python 2 has a specific type for ‘long’ integers, that cannot overflow, represented with an ‘L’ at the sensor_code (4-character string)
end. In Python 3, all integers are long, and thus cannot overflow. position (float)
>>> np.iinfo(np.int64).max, 2**63 - 1 value (float)
(9223372036854775807, 9223372036854775807L)
>>> samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
... ('position', float), ('value', float)])
Floating-point numbers: >>> samples.ndim
1
float16 16 bits >>> samples.shape
float32 32 bits (6,)
float64 64 bits (same as float) >>> samples.dtype.names
('sensor_code', 'position', 'value')
float96 96 bits, platform-dependent (same as np.longdouble)
float128 128 bits, platform-dependent (same as np.longdouble) >>> samples[:] = [('ALFA', 1, 0.37), ('BETA', 1, 0.11), ('TAU', 1, 0.13),
... ('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]
>>> np.finfo(np.float32).eps >>> samples
1.1920929e-07 array([('ALFA', 1.0, 0.37), ('BETA', 1.0, 0.11), ('TAU', 1.0, 0.13),
>>> np.finfo(np.float64).eps ('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11), ('TAU', 1.2, 0.13)],
2.2204460492503131e-16 dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
>>> np.float32(1e-8) + np.float32(1) == 1 Field access works by indexing with field names:
True
>>> np.float64(1e-8) + np.float64(1) == 1 >>> samples['sensor_code']
False array(['ALFA', 'BETA', 'TAU', 'ALFA', 'ALFA', 'TAU'],
dtype='|S4')
>>> samples['value']
Complex floating-point numbers:
array([0.37, 0.11, 0.13, 0.37, 0.11, 0.13])
>>> samples[0]
complex64 two 32-bit floats ('ALFA', 1.0, 0.37)
complex128 two 64-bit floats
complex192 two 96-bit floats, platform-dependent >>> samples[0]['sensor_code'] = 'TAU'
complex256 two 128-bit floats, platform-dependent >>> samples[0]
('TAU', 1.0, 0.37)
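Field access combines naturally with boolean masks and sorting. The sketch below rebuilds the samples array from scratch (so it does not include the modification of samples[0] made above); the names is_alfa, alfa_mean and by_position are hypothetical. Note that an 'S4' field holds bytes, hence the comparison with b'ALFA'.

```python
import numpy as np

samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
                                ('position', float), ('value', float)])
samples[:] = [('ALFA', 1, 0.37), ('BETA', 1, 0.11), ('TAU', 1, 0.13),
              ('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]

# Select all records coming from one sensor with a boolean mask
is_alfa = samples['sensor_code'] == b'ALFA'
alfa = samples[is_alfa]

# Reduce over a single field of the selected records
alfa_mean = alfa['value'].mean()

# Sort whole records according to one field
by_position = np.sort(samples, order='position')
```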
>>> t = np.linspace(0, 1, 200) # use a larger number of points for smoother plotting
>>> y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1]) >>> plt.plot(x, y, 'o', t, p(t), '-')
>>> x + y [<matplotlib.lines.Line2D object at ...>, <matplotlib.lines.Line2D object at ...>]
masked_array(data=[2, --, --, --],
mask=[False, True, True, True],
fill_value=999999)
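To make the propagation of masks explicit, here is a small self-contained sketch. The definition of x is an assumption (it is not shown in the snippet above), chosen so that x + y reproduces the output displayed.

```python
import numpy as np

# Assumed definition of x: second and fourth entries are masked
x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])
y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1])

# An entry of the sum is masked as soon as it is masked in x or in y
total = x + y

# Reductions ignore masked entries: only 1 and 3 contribute here
mean_x = x.mean()

# Replace masked entries by an explicit value to get a plain ndarray back
filled = total.filled(-1)
```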
While it is off topic in a chapter on numpy, let’s take a moment to recall good coding practices, which
really do pay off in the long run:
See http://numpy.org/doc/stable/reference/routines.polynomials.poly1d.html for more.
Example using polynomials in Chebyshev basis, for polynomials in range [-1, 1]:
4.4 Advanced operations
>>> x = np.linspace(-1, 1, 2000)
>>> y = np.cos(x) + 0.3*np.random.rand(2000)
Section contents >>> p = np.polynomial.Chebyshev.fit(x, y, 90)
4.4.1 Polynomials
NumPy also contains polynomials in different bases:
For example, $3x^2 + 2x - 1$:
>>> p = np.poly1d([3, 2, -1])
>>> p(0)
-1
>>> p.roots
array([-1. , 0.33333333])
>>> p.order
2
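Beyond evaluation and roots, a poly1d object also supports differentiation and integration. The following is a short sketch using the same polynomial; the variable names dp, ip and residuals are only illustrative.

```python
import numpy as np

p = np.poly1d([3, 2, -1])   # 3x^2 + 2x - 1

dp = p.deriv()              # derivative: 6x + 2
ip = p.integ()              # antiderivative with zero constant: x^3 + x^2 - x

# Evaluating the polynomial at its roots should give (numerically) zero
residuals = p(p.roots)
```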
Note: If you have a complicated text file, you can try:
• np.genfromtxt Other libraries:
• Using Python’s I/O functions and e.g. regexps for parsing (Python is quite well suited for this) >>> import imageio
>>> imageio.imsave('tiny_elephant.png', img[::6,::6])
>>> plt.imshow(plt.imread('tiny_elephant.png'), interpolation='nearest')
<matplotlib.image.AxesImage object at ...>
Reminder: Navigating the filesystem with IPython
Images
Using Matplotlib:
and generate a new array containing its 2nd and 4th rows.
2. Divide each column of the array:
elementwise with the array b = np.array([1., 5, 10, 15, 20]). (Hint: np.newaxis).
3. Harder one: Generate a 10 x 3 array of random numbers (in range [0,1]). For each row, pick the
number closest to 0.5.
• Use abs and argsort to find the column j closest for each row.
• Use fancy indexing to extract the numbers. (Hint: a[i,j] – the array i must contain the
row numbers corresponding to stuff in j.)
Here are a few images we will be able to obtain with our manipulations: use different colormaps, crop
Well-known (& more obscure) file formats the image, change some parts of the image.
• HDF5: h5py, PyTables
• NetCDF: scipy.io.netcdf_file, netcdf4-python, . . .
• Matlab: scipy.io.loadmat, scipy.io.savemat
• MatrixMarket: scipy.io.mmread, scipy.io.mmwrite
• IDL: scipy.io.readsav
. . . if somebody uses it, there’s probably also a Python library for it. • Let’s use the imshow function of matplotlib to display the image.
• We will now frame the face with a black locket. For this, we need to create a mask corresponding to the pixels we want to be black. The center of the face is around (660, 330), so
we defined the mask by this condition (y-300)**2 + (x-660)**2
4.5.1 Array manipulations
1. Form the 2-D array (without typing it in explicitly):
[[1, 6, 11],
[2, 7, 12], >>> sy, sx = face.shape
[3, 8, 13], >>> y, x = np.ogrid[0:sy, 0:sx] # x and y indices of pixels
[4, 9, 14], >>> y.shape, x.shape
[5, 10, 15]] ((768, 1), (1, 1024))
then we assign the value 0 to the pixels of the image corresponding to the mask. The syntax 4.5.4 Crude integral approximations
is extremely simple and intuitive: Write a function f(a, b, c) that returns $a^b - c$. Form a 24x12x6 array containing its values in parameter
>>> face[mask] = 0 ranges [0,1] x [0,1] x [0,1].
>>> plt.imshow(face) Approximate the 3-d integral
<matplotlib.image.AxesImage object at 0x...>
• Follow-up: copy all instructions of this exercise in a script called face_locket.py then
execute this script in IPython with %run face_locket.py.
$$\int_0^1 \int_0^1 \int_0^1 (a^b - c)\, da\, db\, dc$$
over this volume with the mean. The exact result is: $\ln 2 - \frac{1}{2} \approx 0.1931\ldots$; what is your relative error?
Change the circle to an ellipsoid.
(Hints: use elementwise operations and broadcasting. You can make np.ogrid give a number of points
4.5.3 Data statistics in given range with np.ogrid[0:1:20j].)
Reminder Python functions:
The data in populations.txt describes the populations of hares and lynxes (and carrots) in northern
Canada during 20 years: def f(a, b, c):
return some_result
>>> data = np.loadtxt('data/populations.txt')
>>> year, hares, lynxes, carrots = data.T # trick: columns to variables
Solution: Python source file
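One possible way to approach the crude integral approximation is sketched below. It is not the reference solution: it follows the hints (elementwise operations, broadcasting, np.ogrid with a complex step), and the names approx, exact and rel_err are hypothetical.

```python
import numpy as np

def f(a, b, c):
    # Elementwise a**b - c; broadcasting expands the open grids
    return a**b - c

# Open grids with 24, 12 and 6 points in [0, 1] along the three axes
a, b, c = np.ogrid[0:1:24j, 0:1:12j, 0:1:6j]
values = f(a, b, c)           # broadcast to shape (24, 12, 6)

# The volume is 1, so the mean of the samples approximates the integral
approx = values.mean()
exact = np.log(2) - 0.5
rel_err = abs(approx - exact) / exact
```

With such a coarse grid the relative error is expected to be at the percent level; refining the grid along a and b is the natural way to reduce it.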
>>> import matplotlib.pyplot as plt
>>> plt.axes([0.2, 0.1, 0.5, 0.8]) 4.5.5 Mandelbrot set
<matplotlib.axes...Axes object at ...>
>>> plt.plot(year, hares, year, lynxes, year, carrots)
[<matplotlib.lines.Line2D object at ...>, ...]
>>> plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5))
<matplotlib.legend.Legend object at ...>
1. The mean and std of the populations of each species for the years in the period. z = 0
for j in range(N_max):
2. Which year each species had the largest population.
z = z**2 + c
3. Which species has the largest population for each year. (Hint: argsort & fancy indexing of
np.array(['H', 'L', 'C'])) Point (x, y) belongs to the Mandelbrot set if |𝑧| < some_threshold.
4. Which years any of the populations is above 50000. (Hint: comparisons and np.any) Do this computation by:
5. The top 2 years for each species when they had the lowest populations. (Hint: argsort, fancy 1. Construct a grid of c = x + 1j*y values in range [-2, 1] x [-1.5, 1.5]
indexing)
2. Do the iteration
6. Compare (plot) the change in hare population (see help(np.gradient)) and the number of lynxes.
3. Form the 2-d boolean mask indicating which points are in the set
Check correlation (see help(np.corrcoef)).
4. Save the result to an image with:
2D plotting
>>> import matplotlib.pyplot as plt
>>> plt.imshow(mask.T, extent=[-2, 1, -1.5, 1.5]) Plot a basic 2D figure
<matplotlib.image.AxesImage object at ...>
>>> plt.gray()
>>> plt.savefig('mandelbrot.png')
np.random.seed(12) np.random.seed(0)
import numpy as np
red channel displayed in grey
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
original figure
plt.figure()
img = plt.imread('../../../data/elephant.png')
plt.imshow(img)
import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt # We create 1000 realizations with 200 steps each
from numpy import newaxis n_stories = 1000
t_max = 200
def compute_mandelbrot(N_max, some_threshold, nx, ny):
# A grid of c-values t = np.arange(t_max)
x = np.linspace(-2, 1, nx) # Steps can be -1 or 1 (note that randint excludes the upper limit)
y = np.linspace(-1.5, 1.5, ny) steps = 2 * np.random.randint(0, 1 + 1, (n_stories, t_max)) - 1
c = x[:,newaxis] + 1j*y[newaxis,:] # The time evolution of the position is obtained by successively
# summing up individual steps. This is done for each of the
# Mandelbrot iteration # realizations, i.e. along axis 1.
positions = np.cumsum(steps, axis=1)
z = c
# Determine the time evolution of the mean square distance.
# The code below overflows in many regions of the x-y grid, suppress sq_distance = positions**2
# warnings temporarily mean_sq_distance = np.mean(sq_distance, axis=0)
with np.errstate(over="ignore", invalid="ignore"):  # np.warnings is not available in recent NumPy releases
# Plot the distance d from the origin as a function of time and
for j in range(N_max): # compare with the theoretically expected result where d(t)
z = z**2 + c # grows as a square root of time t.
mandelbrot_set = (abs(z) < some_threshold) plt.figure(figsize=(4, 3))
plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-')
return mandelbrot_set plt.xlabel(r"$t$")
plt.ylabel(r"$\sqrt{\langle (\delta x)^2 \rangle}$")
mandelbrot_set = compute_mandelbrot(50, 50., 601, 401) plt.tight_layout()
plt.show()
plt.imshow(mandelbrot_set.T, extent=[-2, 1, -1.5, 1.5])
plt.gray()
plt.show() Total running time of the script: ( 0 minutes 0.057 seconds)
Tip: The Jupyter notebook and the IPython enhanced interactive Python shell are tuned for the scientific-
computing workflow in Python, in combination with Matplotlib:
5
For interactive matplotlib sessions, turn on the matplotlib mode
IPython console When using the IPython console, use:
In [1]: %matplotlib
CHAPTER Jupyter notebook In the notebook, insert, at the beginning of the notebook the
following magic:
%matplotlib inline
5.1.2 pyplot
Tip: pyplot provides a procedural interface to the matplotlib object-oriented plotting library. It is
Matplotlib: plotting modeled closely after Matlab™. Therefore, the majority of plotting commands in pyplot have Matlab™
analogs with similar arguments. Important commands are explained with interactive examples.
Thanks Tip: In this section, we want to draw the cosine and sine functions on the same plot. Starting from
the default settings, we’ll enrich the figure step by step to make it nicer.
Many thanks to Bill Wing and Christoph Deil for review and corrections.
First step is to get the data for the sine and cosine functions:
Tip: Matplotlib is probably the most used Python package for 2D-graphics. It provides both a quick Tip: You can also download each of the examples and run it using regular python, but you will lose
way to visualize data from Python and publication-quality figures in many formats. We are going to interactive data manipulation:
explore matplotlib in interactive mode covering most common cases.
$ python plot_exercise_1.py
Hint: Documentation
You can get source for each step by clicking on the corresponding figure. • Customizing matplotlib
In the script below, we’ve instantiated (and commented) all the figure settings that influence the appearance of the plot.
5.2.1 Plotting with default settings
Tip: The settings have been explicitly set to their default values, but now you can interactively play
with the values to explore their effect (see Line properties and Line styles below).
import numpy as np
import matplotlib.pyplot as plt
Tip: Matplotlib comes with a set of default settings that allow customizing all kinds of properties. You # Plot sine with a green continuous line of width 1 (pixels)
can control the defaults of almost every property in matplotlib: figure size and dpi, line width, color and plt.plot(X, S, color="green", linewidth=1.0, linestyle="-")
style, axes, axis and grid properties, text and font properties and so on.
# Set x limits
plt.xlim(-4.0, 4.0)
import numpy as np
import matplotlib.pyplot as plt # Set x ticks
plt.xticks(np.linspace(-4, 4, 9))
X = np.linspace(-np.pi, np.pi, 256)
C, S = np.cos(X), np.sin(X) # Set y limits
plt.ylim(-1.0, 1.0)
plt.plot(X, C)
plt.plot(X, S) # Set y ticks
plt.yticks(np.linspace(-1, 1, 5))
plt.show()
# Save figure using 72 dots per inch
# plt.savefig("exercise_2.png", dpi=72)
5.2.2 Instantiating defaults
# Show result on screen
plt.show()
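Defaults can also be changed temporarily rather than per call. The sketch below is an illustration under stated assumptions: it uses plt.rc_context to override two rcParams for a single figure, selects the non-interactive Agg backend so it runs without a display, and uses hypothetical variable names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import numpy as np

default_width = plt.rcParams["lines.linewidth"]

X = np.linspace(-np.pi, np.pi, 256)

# Inside the context, the overridden defaults apply to everything created
with plt.rc_context({"lines.linewidth": 2.5, "figure.figsize": (10, 6)}):
    fig, ax = plt.subplots()
    (line,) = ax.plot(X, np.cos(X))

# Outside the context, the previous defaults are restored
restored_width = plt.rcParams["lines.linewidth"]
```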
5.2.3 Changing colors and line widths (continued from previous page)
plt.ylim(C.min() * 1.1, C.max() * 1.1)
...
Hint: Documentation
• Controlling line properties
• Line2D API
Hint: Documentation
Tip: First step, we want to have the cosine in blue and the sine in red and a slightly thicker line for • xticks() command
both of them. We’ll also slightly alter the figure size to make it more horizontal.
• yticks() command
• Tick container
...
plt.figure(figsize=(10, 6), dpi=80) • Tick locating and formatting
plt.plot(X, C, color="blue", linewidth=2.5, linestyle="-")
plt.plot(X, S, color="red", linewidth=2.5, linestyle="-")
...
Tip: Current ticks are not ideal because they do not show the interesting values (±𝜋, ±𝜋/2)
for sine and cosine. We’ll change them such that they show only these values.
5.2.4 Setting limits
...
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
plt.yticks([-1, 0, +1])
...
Hint: Documentation
• xlim() command
• ylim() command
Tip: Current limits of the figure are a bit too tight and we want to make some space in order to clearly
Hint: Documentation
see all data points.
• Working with text
... • xticks() command
plt.xlim(X.min() * 1.1, X.max() * 1.1)
• yticks() command
Tip: Ticks are now properly placed but their label is not very explicit. We could guess that 3.142 is 𝜋
but it would be better to make it explicit. When we set tick values, we can also provide a corresponding
label in the second argument list. Note that we’ll use latex to allow for nice rendering of the label.
...
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi],
[r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
plt.yticks([-1, 0, +1],
[r'$-1$', r'$0$', r'$+1$']) Hint: Documentation
... • Legend guide
• legend() command
5.2.7 Moving spines • legend API
Tip: Let’s add a legend in the upper left corner. This only requires adding the keyword argument label
(that will be used in the legend box) to the plot commands.
...
plt.plot(X, C, color="blue", linewidth=2.5, linestyle="-", label="cosine")
plt.plot(X, S, color="red", linewidth=2.5, linestyle="-", label="sine")
plt.legend(loc='upper left')
...
Hint: Documentation
• spines API 5.2.9 Annotate some points
• Axis container
• Transformations tutorial
Tip: Spines are the lines connecting the axis tick marks and noting the boundaries of the data area.
They can be placed at arbitrary positions and until now, they were on the border of the axis. We’ll change
that since we want to have them in the middle. Since there are four of them (top/bottom/left/right),
we’ll discard the top and right by setting their color to none and we’ll move the bottom and left ones to
coordinate 0 in data space coordinates.
...
Hint: Documentation
ax = plt.gca() # gca stands for 'get current axes'
ax.spines['right'].set_color('none') • Annotating axis
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom') • annotate() command
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
...
Tip: Let’s annotate some interesting points using the annotate command. We chose the 2𝜋/3 value
and we want to annotate both the sine and the cosine. We’ll first draw a marker on the curve as well as
a straight dotted line. Then, we’ll use the annotate command to display some text with an arrow.
more control over the display using figure, subplot, and axes explicitly. While subplot positions the plots
...
in a regular grid, axes allows free placement within the figure. Both can be useful depending on your
t = 2 * np.pi / 3 intention. We’ve already worked with figures and subplots without explicitly calling them. When we
plt.plot([t, t], [0, np.cos(t)], color='blue', linewidth=2.5, linestyle="--") call plot, matplotlib calls gca() to get the current axes and gca in turn calls gcf() to get the current
plt.scatter([t, ], [np.cos(t), ], 50, color='blue') figure. If there is none it calls figure() to make one, strictly speaking, to make a subplot(111). Let’s
look at the details.
plt.annotate(r'$cos(\frac{2\pi}{3} )=-\frac{1} {2} $',
xy=(t, np.cos(t)), xycoords='data',
xytext=(-90, -50), textcoords='offset points', fontsize=16,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2")) 5.3.1 Figures
plt.plot([t, t],[0, np.sin(t)], color='red', linewidth=2.5, linestyle="--") Tip: A figure is the window in the GUI that has “Figure #” as title. Figures are numbered starting
plt.scatter([t, ],[np.sin(t), ], 50, color='red')
from 1 as opposed to the normal Python way starting from 0. This is clearly MATLAB-style. There are
plt.annotate(r'$sin(\frac{2\pi}{3} )=\frac{\sqrt{3} }{2} $',
several parameters that determine what the figure looks like:
xy=(t, np.sin(t)), xycoords='data',
xytext=(+10, +30), textcoords='offset points', fontsize=16,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2")) Argument Default Description
... num 1 number of figure
figsize figure.figsize figure size in inches (width, height)
dpi figure.dpi resolution in dots per inch
5.2.10 Devil is in the details facecolor figure.facecolor color of the drawing background
edgecolor figure.edgecolor color of edge around the drawing background
frameon True draw figure frame or not
Tip: The defaults can be specified in the resource file and will be used most of the time. Only the
number of the figure is frequently changed.
As with other objects, you can also set figure properties with setp or with the set_something methods.
When you work with the GUI you can close a figure by clicking on the x in the upper right corner. But
you can close a figure programmatically by calling close. Depending on the argument it closes (1) the
current figure (no argument), (2) a specific figure (figure number or figure instance as argument), or (3)
all figures ("all" as argument).
Hint: Documentation
• artist API plt.close(1) # Closes figure 1
• set_bbox() method
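The three ways of calling close described above can be sketched as follows. This is only an illustration: the Agg backend is assumed so that no window is needed, and plt.get_fignums() is used to inspect which figures remain open.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt

plt.close("all")            # start from a clean state

fig1 = plt.figure(1)
fig2 = plt.figure(2)
fig3 = plt.figure(3)        # figure 3 is now the current figure

plt.close()                 # (1) no argument: closes the current figure (3)
plt.close(1)                # (2) a specific figure, given by its number
open_after = plt.get_fignums()

plt.close(fig2)             # (2) a specific figure, given as an instance
plt.figure()
plt.figure()
plt.close("all")            # (3) all figures
remaining = plt.get_fignums()
```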
5.3.2 Subplots
Tip: The tick labels are now hardly visible because of the blue and red lines. We can make them Tip: With subplot you can arrange plots in a regular grid. You need to specify the number of rows and
bigger and we can also adjust their properties such that they’ll be rendered on a semi-transparent white columns and the number of the plot. Note that the gridspec command is a more powerful alternative.
background. This will allow us to see both the data and the labels.
...
for label in ax.get_xticklabels() + ax.get_yticklabels():
label.set_fontsize(16)
label.set_bbox(dict(facecolor='white', edgecolor='None', alpha=0.65))
...
Tip: So far we have used implicit figure and axes creation. This is handy for fast plots. We can have
5.3. Figures, Subplots, Axes and Ticks 106 5.3. Figures, Subplots, Axes and Ticks 107
Scipy lecture notes, Edition 2022.1 Scipy lecture notes, Edition 2022.1
5.3.4 Ticks
Well formatted ticks are an important part of publishing-ready figures. Matplotlib provides a totally
configurable system for ticks. There are tick locators to specify where ticks should appear and tick
formatters to give ticks the appearance you want. Major and minor ticks can be located and formatted
independently from each other. By default, minor ticks are not shown, i.e. there is only an empty list
for them because the minor locator is a NullLocator (see below).
Tick Locators
Tick locators control the positions of the ticks. They are set as follows:
ax = plt.gca()
ax.xaxis.set_major_locator(eval(locator))
All of these locators derive from the base class matplotlib.ticker.Locator. You can make your own
locator deriving from it. Handling dates as ticks can be especially tricky. Therefore, matplotlib provides
special locators in matplotlib.dates.
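As a concrete illustration of setting locators, the sketch below places major ticks every 2 units and minor ticks every 0.5 units using MultipleLocator from matplotlib.ticker. The data, the axis limits and the Agg backend are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator, NullLocator

fig, ax = plt.subplots()
ax.plot([0, 10], [0, 1])
ax.set_xlim(0, 10)

# On a fresh linear axis, the minor locator produces no ticks
default_minor_is_null = isinstance(ax.xaxis.get_minor_locator(), NullLocator)

# Major ticks on multiples of 2, minor ticks on multiples of 0.5
ax.xaxis.set_major_locator(MultipleLocator(2.0))
ax.xaxis.set_minor_locator(MultipleLocator(0.5))

major = np.asarray(ax.xaxis.get_majorticklocs())
```

The locator may return candidate positions slightly outside the view limits; only those inside the limits are drawn.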
Starting from the code below, try to reproduce the graphic taking
care of marker size, color and transparency.
n = 1024
X = np.random.normal(0,1,n)
Y = np.random.normal(0,1,n)
plt.scatter(X,Y)
Starting from the code below, try to reproduce the graphic taking
care of filled areas: Starting from the code below, try to reproduce the graphic by
adding labels for red bars.
Hint: You need to use the fill_between() command.
Hint: You need to take care of text alignment.
n = 256
X = np.linspace(-np.pi, np.pi, n) n = 12
Y = np.sin(2 * X) X = np.arange(n)
Y1 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)
plt.plot(X, Y + 1, color='blue', alpha=1.00) Y2 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)
plt.plot(X, Y - 1, color='blue', alpha=1.00)
plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
Click on the figure for solution. plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')
plt.ylim(-1.25, +1.25)
Starting from the code below, try to reproduce the graphic taking Starting from the code below, try to reproduce the graphic taking
care of the colormap (see Colormaps below). care of colors and slices size.
Hint: You need to use the clabel() command. Hint: You need to modify Z.
5.4.5 Imshow
Starting from the code below, try to reproduce the graphic taking
care of colors and orientations.
n = 8
Starting from the code below, try to reproduce the graphic taking X, Y = np.mgrid[0:n, 0:n]
care of colormap, image interpolation and origin. plt.quiver(X, Y)
n = 10
x = np.linspace(-3, 3, 4 * n)
y = np.linspace(-3, 3, 3 * n)
X, Y = np.meshgrid(x, y)
plt.imshow(f(X, Y))
Starting from the code below, try to reproduce the graphic. Hint: You need to use contourf()
Hint: You can use several subplots with different partition. from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
plt.subplot(2, 2, 1)
ax = Axes3D(fig)
plt.subplot(2, 2, 3)
X = np.arange(-4, 4, 0.25)
plt.subplot(2, 2, 4)
Y = np.arange(-4, 4, 0.25)
X, Y = np.meshgrid(X, Y)
Click on figure for solution. R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
5.4.10 Polar Axis
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='hot')
5.4.12 Text
The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to
come back and finish this chapter later. 5.5.3 Code documentation
The code is well documented and you can quickly access a specific command from within a python
session:
5.5 Beyond this tutorial >>> import matplotlib.pyplot as plt
>>> help(plt.plot)
Matplotlib benefits from extensive documentation as well as a large community of users and developers. Help on function plot in module matplotlib.pyplot:
Here are some links of interest:
plot(*args,...)
Plot y versus x as lines and/or markers.
5.5.1 Tutorials
• Pyplot tutorial Call signatures::
• Introduction
plot([x], y, [fmt],...data=None, **kwargs)
• Controlling line properties
plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
• Working with multiple figures and axes
...
• Working with text
• Image tutorial
• Startup commands 5.5.4 Galleries
• Importing image data into Numpy arrays
• Plotting numpy arrays as images The matplotlib gallery is also incredibly useful when you search how to render a given graphic. Each
• Text tutorial example comes with its source.
• Text introduction
• Basic text commands 5.5.5 Mailing lists
• Text properties and layout
• Writing mathematical expressions Finally, there is a user mailing list where you can ask for help and a developers mailing list that is more
• Text rendering With LaTeX technical.
• Annotating text
• Artist tutorial
• Introduction
5.6 Quick references
• Customizing your objects
Here is a set of tables that show main properties and styles.
• Object containers
• Figure container
• Axes container
• Axis containers
• Tick containers
• Path tutorial
• Introduction
• Bézier example
• Compound paths
• Transforms tutorial
• Introduction
• Data coordinates
• Axes coordinates
• Blended transformations
Pie chart
A simple pie chart example with matplotlib.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.axes([0.025, 0.025, 0.95, 0.95])
plt.pie(Z, explode=Z*.05, colors=['%f' % (i/float(n)) for i in range(n)])
plt.axis('equal')
plt.xticks([])
plt.yticks([])
Plotting a scatter of points
plt.show()
A simple example showing how to plot a scatter of points with matplotlib.
plt.show() plt.subplot(2, 3, 6)
plt.xticks([])
Total running time of the script: ( 0 minutes 0.013 seconds) plt.yticks([])
plt.show()
Note: Click here to download the full example code
Total running time of the script: ( 0 minutes 0.035 seconds)
Subplots
Show multiple subplots in matplotlib.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.subplot(2, 1, 1) plt.figure(figsize=(6, 4))
plt.xticks([]) plt.subplot(1, 2, 1)
plt.yticks([]) plt.xticks([])
plt.text(0.5, 0.5, 'subplot(2,1,1)', ha='center', va='center', plt.yticks([])
size=24, alpha=.5) plt.text(0.5, 0.5, 'subplot(1,2,1)', ha='center', va='center',
size=24, alpha=.5)
plt.subplot(2, 1, 2)
plt.xticks([]) plt.subplot(1, 2, 2)
plt.yticks([]) plt.xticks([])
plt.text(0.5, 0.5, 'subplot(2,1,2)', ha='center', va='center', plt.yticks([])
size=24, alpha=.5) plt.text(0.5, 0.5, 'subplot(1,2,2)', ha='center', va='center',
size=24, alpha=.5)
plt.tight_layout()
plt.show() plt.tight_layout()
plt.show()
plt.show() plt.xticks([])
plt.yticks([])
Total running time of the script: ( 0 minutes 0.122 seconds) plt.show()
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np matplotlib.rc('grid', color='black', linestyle='-', linewidth=1)
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,4),dpi=72)
def f(x,y): axes = fig.add_axes([0.01, 0.01, .98, 0.98], facecolor='.75')
return (1 - x / 2 + x**5 + y**3) * np.exp(-x**2 -y**2) X = np.linspace(0, 2, 40)
Y = np.sin(2 * np.pi * X)
n = 256 plt.plot(X, Y, lw=.05, c='b', antialiased=False)
x = np.linspace(-3, 3, n)
y = np.linspace(-3, 3, n) plt.xticks([])
X,Y = np.meshgrid(x, y) plt.yticks(np.arange(-1., 1., 0.2))
plt.grid()
plt.axes([0.025, 0.025, 0.95, 0.95]) ax = plt.gca()
plt.contourf(X, Y, f(X, Y), 8, alpha=.75, cmap=plt.cm.hot) plt.show()
C = plt.contour(X, Y, f(X, Y), 8, colors='black', linewidths=.5)
plt.clabel(C, inline=1, fontsize=10)
plt.xticks([])
plt.yticks([]) Note: Click here to download the full example code
plt.show()
n = 256 n = 12
X = np.linspace(-np.pi, np.pi, n) X = np.arange(n)
Y = np.sin(2 * X) Y1 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)
Y2 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)
plt.axes([0.025, 0.025, 0.95, 0.95])
plt.axes([0.025, 0.025, 0.95, 0.95])
plt.plot(X, Y + 1, color='blue', alpha=1.00) plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
plt.fill_between(X, 1, Y + 1, color='blue', alpha=.25) plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')
Subplot grid
An example showing the subplot grid in matplotlib.
Axes
This example shows various axes commands to position matplotlib axes.
plt.figure(figsize=(6, 4))
plt.subplot(2, 2, 1)
plt.xticks([])
plt.yticks([])
plt.text(0.5, 0.5, 'subplot(2,2,1)', ha='center', va='center',
size=20, alpha=.5)
plt.show() plt.show()
Grid 3D plotting
Displaying a grid on the axes in matplotlib. Demo 3D plotting with matplotlib and style the figure.
plt.figure(figsize=(6, 4))
G = gridspec.GridSpec(3, 3)
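Since only the first lines of this listing survive here, the following sketch shows one way a 3 x 3 GridSpec can be used to lay out axes that span several cells. The particular cell assignments and variable names are assumptions for illustration, not necessarily those of the original example; the Agg backend is selected so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(6, 4))
G = gridspec.GridSpec(3, 3)

# Slicing the GridSpec selects one or several cells of the 3 x 3 grid
ax_top = fig.add_subplot(G[0, :])       # whole first row
ax_mid = fig.add_subplot(G[1, :-1])     # second row, first two columns
ax_right = fig.add_subplot(G[1:, -1])   # last column, rows 2 and 3
ax_bl = fig.add_subplot(G[-1, 0])       # bottom-left cell
ax_bm = fig.add_subplot(G[-1, -2])      # bottom-middle cell

# Hide ticks on every axes, as in the other layout examples
for ax in fig.axes:
    ax.set_xticks([])
    ax.set_yticks([])
```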
Exercise 1
Solution of the exercise 1 with matplotlib.
import numpy as np
import matplotlib.pyplot as plt
eqs = []
eqs.append((r"$W^{3\beta}_{\delta_1 \rho_1 \sigma_2} = U^{3\beta}_{\delta_1 \rho_1} + \frac{1}
˓→{8 \pi 2} \int^{\alpha_2}_{\alpha_2} d \alpha^\prime_2 \left[\frac{ U^{2\beta}_{\delta_1␣
˓→"))
Exercise 7
Exercise 7 with matplotlib
import numpy as np
import matplotlib.pyplot as plt
# Create a new figure of size 8x6 inches, using 80 dots per inch
plt.figure(figsize=(8, 6), dpi=80)
Total running time of the script: ( 0 minutes 0.020 seconds) plt.ylim(C.min() * 1.1, C.max() * 1.1)
plt.yticks([-1, +1],
[r'$-1$', r'$+1$'])
plt.legend(loc='upper left')
Exercise 8 plt.show()
Exercise 8 with matplotlib. Total running time of the script: ( 0 minutes 0.023 seconds)
Exercise 9
Exercise 9 with matplotlib.
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5), dpi=80)
plt.subplot(111)
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
t = 2*np.pi/3
plt.plot([t, t], [0, np.cos(t)],
color='blue', linewidth=1.5, linestyle="--")
plt.scatter([t, ], [np.cos(t), ], 50, color='blue')
plt.annotate(r'$sin(\frac{2\pi}{3} )=\frac{\sqrt{3} }{2} $',
             xy=(t, np.sin(t)), xycoords='data',
             xytext=(+10, +30), textcoords='offset points', fontsize=16,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.plot([t, t], [0, np.sin(t)],
         color='red', linewidth=1.5, linestyle="--")
plt.scatter([t, ], [np.sin(t), ], 50, color='red')
plt.annotate(r'$cos(\frac{2\pi}{3} )=-\frac{1} {2} $', xy=(t, np.cos(t)),
             xycoords='data', xytext=(-90, -50), textcoords='offset points',
             fontsize=16,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5), dpi=80)
plt.subplot(111)
X = np.linspace(-np.pi, np.pi, 256)
C, S = np.cos(X), np.sin(X)
plt.plot(X, C, color="blue", linewidth=2.5, linestyle="-", label="cosine")
plt.plot(X, S, color="red", linewidth=2.5, linestyle="-", label="sine")
plt.legend(loc='upper left')
t = 2*np.pi/3
plt.plot([t, t], [0, np.cos(t)],
color='blue', linewidth=1.5, linestyle="--")
plt.scatter([t, ], [np.cos(t), ], 50, color='blue')
plt.annotate(r'$sin(\frac{2\pi}{3} )=\frac{\sqrt{3} }{2} $',
xy=(t, np.sin(t)), xycoords='data',
The colors matplotlib line plots
An example demoing the various colors taken by matplotlib’s plot.
Alpha: transparency
This example demonstrates using alpha for transparency.
import numpy as np
import matplotlib.pyplot as plt
size = 256,16
dpi = 72.0
figsize= size[0] / float(dpi), size[1] / float(dpi)
fig = plt.figure(figsize=figsize, dpi=dpi)
fig.patch.set_alpha(0)
plt.axes([0, 0, 1, 1], frameon=False)
plt.xlim(0, 11)
plt.xticks([])
plt.yticks([])
plt.show()
import numpy as np
import matplotlib.pyplot as plt
plt.rc('text', usetex=False)
a = np.outer(np.arange(0, 1, 0.01), np.ones(10))
plt.figure(figsize=(10, 5))
plt.subplots_adjust(top=0.8, bottom=0.05, left=0.01, right=0.99)
maps = [m for m in plt.cm.datad if not m.endswith("_r")]
maps.sort()
l = len(maps) + 1
Marker face color
Demo the marker face color of matplotlib’s markers.
import numpy as np
import matplotlib.pyplot as plt
size = 256, 16
dpi = 72.0
figsize = size[0] / float(dpi), size[1] / float(dpi)
fig = plt.figure(figsize=figsize, dpi=dpi)
fig.patch.set_alpha(0)
plt.axes([0, 0, 1, 1], frameon=False)
Dash join style
Example demoing the dash join style.
plt.xlim(0, 12)
plt.ylim(-1, 2)
plt.xticks([])
plt.yticks([])
Dash capstyle
Linestyles
Plot the different line styles.
import numpy as np
import matplotlib.pyplot as plt
linestyles = ['-', '--', ':', '-.', '.', ',', 'o', '^', 'v', '<', '>', 's',
              '+', 'x', 'd', '1', '2', '3', '4', 'h', 'p', '|', '_', 'D', 'H']
n_lines = len(linestyles)
size = 20 * n_lines, 300
dpi = 72.0
figsize = size[0] / float(dpi), size[1] / float(dpi)
fig = plt.figure(figsize=figsize, dpi=dpi)
plt.axes([0, 0.01, 1, .9], frameon=False)
for i, ls in enumerate(linestyles):
    linestyle(ls, i)
plt.xlim(-.2, .2 + .5*n_lines)
plt.xticks([])
plt.yticks([])
Markers
Show the different markers of matplotlib.
n_markers = len(markers)
size = 20 * n_markers, 300
dpi = 72.0
figsize = size[0] / float(dpi), size[1] / float(dpi)
fig = plt.figure(figsize=figsize, dpi=dpi)
plt.axes([0, 0.01, 1, .9], frameon=False)
for i, m in enumerate(markers):
    marker(m, i)
plt.xlim(-.2, .2 + .5 * n_markers)
plt.xticks([])
plt.yticks([])
plt.show()
import numpy as np
import matplotlib.pyplot as plt
def tickline():
    plt.xlim(0, 10), plt.ylim(-1, 1), plt.yticks([])
    ax = plt.gca()
    ax.spines['right'].set_color('none')
    ax.spines['left'].set_color('none')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.spines['bottom'].set_position(('data', 0))
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_minor_locator(plt.MultipleLocator(0.1))
    ax.plot(np.arange(11), np.zeros(11))
    return ax
locators = [
    'plt.NullLocator()',
    'plt.MultipleLocator(1.0)',
    'plt.FixedLocator([0, 2, 8, 9, 10])',
    'plt.IndexLocator(3, 1)',
    'plt.LinearLocator(5)',
    'plt.LogLocator(2, [1.0])',
    'plt.AutoLocator()',
]
n_locators = len(locators)
size = 512, 40 * n_locators
dpi = 72.0
import numpy as np
import matplotlib.pyplot as plt
N = 20
theta = np.arange(0.0, 2 * np.pi, 2 * np.pi / N)
radii = 10 * np.random.rand(N)
width = np.pi / 4 * np.random.rand(N)
bars = plt.bar(theta, radii, width=width, bottom=0.0)
for r, bar in zip(radii, bars):
    bar.set_facecolor(plt.cm.jet(r / 10.))
    bar.set_alpha(0.5)
plt.gca().set_xticklabels([])
plt.gca().set_yticklabels([])
plt.show()
3D plotting vignette
Demo 3D plotting with matplotlib and decorate the figure.
fig = plt.figure()
ax = Axes3D(fig)
X = np.arange(-4, 4, 0.25)
Y = np.arange(-4, 4, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X ** 2 + Y ** 2)
Z = np.sin(R)
plt.show()
import numpy as np
import matplotlib.pyplot as plt
n = 256
X = np.linspace(0, 2, n)
Y = np.sin(2 * np.pi * X)
ax = plt.subplot(2, 1, 1)
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.xticks([])
plt.yticks([])
plt.show()
import numpy as np
import matplotlib.pyplot as plt
n = 1024
X = np.random.normal(0, 1, n)
Y = np.random.normal(0, 1, n)
T = np.arctan2(Y,X)
plt.show()
import numpy as np
import matplotlib.pyplot as plt
n = 20
X = np.ones(n)
X[-1] *= 2
plt.pie(X, explode=X*.05, colors=['%f' % (i/float(n)) for i in range(n)])
fig = plt.gcf()
w, h = fig.get_figwidth(), fig.get_figheight()
r = h / float(w)
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5 * r, 1.5 * r)
plt.xticks([])
plt.yticks([])
plt.show()
plt.quiver(X, Y, U, V, R, alpha=.5)
plt.quiver(X, Y, U, V, edgecolor='k', facecolor='None', linewidth=.5)
plt.xlim(-1, n)
plt.xticks([])
plt.ylim(-1, n)
plt.yticks([])
plt.show()
import numpy as np
import matplotlib.pyplot as plt
n = 10
x = np.linspace(-3, 3, 8 * n)
plt.text(-0.05, 1.01, "\n\n Draw contour lines and filled contours ",
horizontalalignment='left',
verticalalignment='top',
size='large',
transform=plt.gca().transAxes)
plt.show()
Grid elaborate
An example displaying a grid on the axes and tweaking the layout.
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
n = 256
x = np.linspace(-3, 3, n)
y = np.linspace(-3, 3, n)
X, Y = np.meshgrid(x, y)
Text printing decorated
An example showing text printing and decorating the resulting figure.
eqs = []
eqs.append((r"$W^{3\beta}_{\delta_1 \rho_1 \sigma_2} = U^{3\beta}_{\delta_1 \rho_1} + \frac{1}{8 \pi 2} \int^{\alpha_2}_{\alpha_2} d \alpha^\prime_2 \left[\frac{ U^{2\beta}_{\delta_1 \rho_1} - \alpha^\prime_2U^{1\beta}_{\rho_1 \sigma_2} }{U^{0\beta}_{\rho_1 \sigma_2}}\right]$"))
eqs.append((r"$\int_{-\infty}^\infty e^{-x^2}dx=\sqrt{\pi}$"))
eqs.append((r"$E = mc^2 = \sqrt{{m_0}^2c^4 + p^2c^2}$"))
eqs.append((r"$F_G = G\frac{m_1m_2} {r^2}$"))
for i in range(24):
    index = np.random.randint(0, len(eqs))
    eq = eqs[index]
    size = np.random.uniform(12, 32)
    x, y = np.random.uniform(0, 1, 2)
    alpha = np.random.uniform(0.25, .75)
    plt.text(x, y, eq, ha='center', va='center', color="#11557c", alpha=alpha,
             transform=plt.gca().transAxes, fontsize=size, clip_on=True)
transform=plt.gca().transAxes))
plt.show()
6 Scipy
Authors: Gaël Varoquaux, Adrien Chauve, Andre Espaze, Emmanuelle Gouillart, Ralf Gommers
The scipy package contains various toolboxes dedicated to common issues in scientific computing.
Its different submodules correspond to different applications, such as interpolation, integration,
optimization, image processing, statistics, special functions, etc.
Tip: scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU
Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines
in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in
hand.
Before implementing a routine, it is worth checking whether the desired data processing is already
implemented in Scipy. As non-professional programmers, scientists often tend to re-invent the wheel, which
leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, Scipy’s routines
are optimized and tested, and should therefore be used when possible.
Chapters contents
>>> spec
array([14.88982544, 0.45294236, 0.29654967])
The original matrix can be re-composed by matrix multiplication of the outputs of svd with np.dot:
>>> sarr = np.diag(spec)
>>> svd_mat = uarr.dot(sarr).dot(vharr)
>>> np.allclose(svd_mat, arr)
True
SVD is commonly used in statistics and signal processing. Many other standard decompositions
(QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available in scipy.linalg.
6.5 Optimization and fit: scipy.optimize
Optimization is the problem of finding a numerical solution to a minimization or equality.
Tip: The scipy.optimize module provides algorithms for function minimization (scalar or
multi-dimensional), curve fitting and root finding.
>>> from scipy import optimize
6.5.1 Curve fitting
If we know that the data lies on a sine wave, but not the amplitudes or the period, we can find those
by least squares curve fitting. First we have to define the test function to fit, here a sine with unknown
amplitude and period:
>>> def test_func(x, a, b):
...     return a * np.sin(b * x)
5. Is the time offset for min and max temperatures the same within the fit accuracy?
solution
Let’s define the following function:
>>> def f(x):
...     return x**2 + 10*np.sin(x)
This function has a global minimum around -1.3 and a local minimum around 3.8.
Searching for a minimum can be done with scipy.optimize.minimize(): given a starting point x0, it
returns the location of the minimum that it has found:
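As a minimal sketch of the minimize() call described here, the snippet below runs the search on the function f from two starting points. The starting values 0 and 3 and the variable names are illustrative choices, not prescribed by the text.

```python
import numpy as np
from scipy import optimize


def f(x):
    # Test function with a global minimum near -1.3 and a local one near 3.8
    return x**2 + 10 * np.sin(x)


# Starting at 0, the default local optimizer slides into the global minimum
result = optimize.minimize(f, x0=0)
x_global = float(result.x[0])

# Starting at 3, a local optimizer stays in the basin of the local minimum
result_local = optimize.minimize(f, x0=3, method="L-BFGS-B")
x_local = float(result_local.x[0])

print(x_global, x_local)
```

The two results differ only because of the starting point, which is why the choice of x0 matters for functions with several minima.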
Methods: As the function is a smooth function, gradient-descent based methods are good options. The
L-BFGS algorithm is a good choice in general:
Global minimum: A possible issue with this approach is that, if the function has local minima, the
algorithm may find these local minima instead of the global minimum depending on the initial point x0:
If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort
to costlier global optimization. To find the global minimum, we use scipy.optimize.basinhopping()
(added in version 0.12.0 of Scipy). It combines a local optimizer with sampling of starting points:
>>> optimize.basinhopping(f, 0)
nfev: 1725
minimization_failures: 0
fun: -7.9458233756152845
x: array([-1.30644001])
message: ['requested number of basinhopping iterations completed successfully']
njev: 575
nit: 100
Note: scipy used to contain the routine anneal, it has been removed in SciPy 0.16.0.
Constraints: We can constrain the variable to the interval (0, 10) using the “bounds” argument:
A list of bounds
As minimize() works in general with x multidimensional, the “bounds” argument is a list of bounds
on each dimension.
>>> res = optimize.minimize(f, x0=1,
...                         bounds=((0, 10), ))
>>> res.x
array([0.])
Tip: What has happened? Why are we finding 0, which is not a minimum of our function?
Exercise: 2-D minimization
The six-hump camelback function
$f(x, y) = (4 - 2.1x^2 + \frac{x^4}{3})x^2 + xy + (4y^2 - 4)y^2$
has multiple global and local minima. Find the global minima of this function.
Hints:
• Variables can be restricted to $-2 < x < 2$ and $-1 < y < 1$.
• Use numpy.meshgrid() and matplotlib.pyplot.imshow() to find visually the regions.
• Use scipy.optimize.minimize(), optionally trying out several of its methods.
How many global minima are there, and what is the function value at those points? What
happens for an initial guess of $(x, y) = (0, 0)$?
solution
6.5.3 Finding the roots of a scalar function
To find a root, i.e. a point where $f(x) = 0$, of the function f above we can use scipy.optimize.root():
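For the root search on f with scipy.optimize.root(), a short sketch follows. The two initial guesses (1 and -2.5) are assumptions chosen so that each run lands on a different root of f; a plot of f suggests where to start.

```python
import numpy as np
from scipy import optimize


def f(x):
    return x**2 + 10 * np.sin(x)


# Each call returns only the root reached from its initial guess
root_a = optimize.root(f, x0=1)
root_b = optimize.root(f, x0=-2.5)

print(root_a.x, root_b.x)
```

Because root() is a local method, finding all roots requires several initial guesses.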
Note: scipy.optimize.root() also comes with a variety of algorithms, set via the “method” argument.
Now that we have found the minima and roots of f and used curve fitting on it, we put all those results
together in a single plot:
See also:
You can find all algorithms and functions with similar functionalities in the documentation of
scipy.optimize.
See the summary exercise on Non linear least squares curve fitting: application to point extraction in
topographical lidar data for another, more advanced example.
If we know that the random process belongs to a given family of random processes, such as normal
processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the
underlying distribution. Here we fit a normal process to the observed data:
>>> loc, std = stats.norm.fit(samples)
>>> loc
-0.045256707...
>>> std
0.9870331586...
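The maximum-likelihood fit of a normal distribution with stats.norm.fit() can be tried on synthetic data, as in the sketch below. The sample size, the seed and the use of numpy's default_rng are assumptions made for reproducibility; the original samples array is not reproduced here.

```python
import numpy as np
from scipy import stats

# Synthetic observations drawn from a standard normal process (assumed data)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Maximum-likelihood estimates of the location and scale parameters
loc, std = stats.norm.fit(samples)

# For a normal distribution these coincide with the sample mean and the
# (biased, ddof=0) sample standard deviation
print(loc, samples.mean())
print(std, samples.std())
```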
Extra: the distributions have many useful methods. Explore them by reading the docstring or by using
tab completion. Can you recover the shape parameter 1 by using the fit method on your random
variates?
>>> np.median(samples)
-0.0580280347...
Tip: Unlike the mean, the median is not sensitive to the tails of the distribution. It is “robust”.
The median is also the percentile 50, because 50% of the observations are below it:
>>> stats.scoreatpercentile(samples, 50)
-0.0580280347...
Similarly, we can calculate the percentile 90:
>>> stats.scoreatpercentile(samples, 90)
1.2315935511...
>>> a = np.random.normal(0, 1, size=100)
>>> b = np.random.normal(1, 1, size=10)
>>> stats.ttest_ind(a, b)
(array(-3.177574054...), 0.0019370639...)
See also:
The chapter on statistics introduces much more elaborate tools for statistical testing and statistical data
loading and visualization outside of scipy.
>>> from scipy.integrate import quad
>>> res, err = quad(np.sin, 0, np.pi/2)
>>> np.allclose(res, 1) # res is the result, it should be close to 1
True
>>> np.allclose(err, 1 - res) # err is an estimate of the error
True
Other integration schemes are available: scipy.integrate.fixed_quad(),
scipy.integrate.quadrature(), scipy.integrate.romberg()...
As an introduction, let us solve the ODE $dy/dt = -2y$ between $t = 0 \dots 4$, with the initial condition
$y(t = 0) = 1$. First the function computing the derivative of the position needs to be defined:
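A minimal sketch of this first-order ODE solved with scipy.integrate.odeint() is given below. The function name calc_derivative and the number of time points are illustrative assumptions; the comparison with exp(-2t) uses the known analytical solution of this equation.

```python
import numpy as np
from scipy.integrate import odeint


def calc_derivative(ypos, time):
    # Right-hand side of dy/dt = -2y
    return -2 * ypos


time_vec = np.linspace(0, 4, 40)
y = odeint(calc_derivative, y0=1, t=time_vec)

# odeint returns one column per state variable; compare with exp(-2t)
max_error = np.abs(y[:, 0] - np.exp(-2 * time_vec)).max()
print(max_error)
```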
Hence:
>>> eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
>>> omega = np.sqrt(kspring / mass)
The system is underdamped, as:
For odeint(), the 2nd order equation needs to be transformed into a system of two first-order equations
for the vector $Y = (y, y')$: the function computes the velocity and acceleration:
Integration of the system follows:
Tip: scipy.integrate.odeint() uses the LSODA (Livermore Solver for Ordinary Differential equations
with Automatic method switching for stiff and non-stiff problems), see the ODEPACK Fortran
library for more details.
See also:
Partial Differential Equations
There is no Partial Differential Equations (PDE) solver in Scipy. Some Python packages for solving
PDE’s are available, such as fipy or SfePy.
As an illustration, a (noisy) input signal (sig), and its FFT:
>>> from scipy import fftpack
>>> sig_fft = fftpack.fft(sig)
>>> freqs = fftpack.fftfreq(sig.size, d=time_step)
Signal FFT
As the signal comes from a real function, the Fourier transform is symmetric.
The peak signal frequency can be found with freqs[power.argmax()]
numpy.fft
Numpy also has an implementation of FFT (numpy.fft). However, the scipy one should be preferred,
as it uses more efficient underlying implementations.
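To make the FFT workflow of this section concrete (fftpack.fft, fftpack.fftfreq and the peak-frequency lookup), here is a sketch on a synthetic signal. The signal itself, a sine of period 5 s with added noise, and all parameter values are assumptions; the lookup is restricted to positive frequencies because the spectrum of a real signal is symmetric.

```python
import numpy as np
from scipy import fftpack

# Synthetic noisy signal: a sine of period 5 s sampled every 0.02 s (assumed)
rng = np.random.default_rng(0)
time_step = 0.02
period = 5.0
time_vec = np.arange(1000) * time_step
sig = (np.sin(2 * np.pi / period * time_vec)
       + 0.5 * rng.standard_normal(time_vec.size))

sig_fft = fftpack.fft(sig)
power = np.abs(sig_fft) ** 2
freqs = fftpack.fftfreq(sig.size, d=time_step)

# Keep only positive frequencies before looking for the peak
pos_mask = freqs > 0
peak_freq = freqs[pos_mask][power[pos_mask].argmax()]
print(peak_freq)
```

The recovered peak should sit at 1/period, the frequency of the underlying sine.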
Fully worked examples:
Crude periodicity finding, Gaussian image blur
1. Examine the provided image moonlanding.png, which is heavily contaminated with periodic
noise. In this exercise, we aim to clean up the noise using the Fast Fourier Transform.
2. Load the image using matplotlib.pyplot.imread().
3. Find and use the 2-D FFT function in scipy.fftpack, and plot the spectrum (Fourier transform)
of the image. Do you have any trouble visualising the spectrum? If so, why?
4. The spectrum consists of high and low frequency components. The noise is contained in the
high-frequency part of the spectrum, so set some of those components to zero (use array slicing).
5. Apply the inverse Fourier transform to see the resulting image.
Solution
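The filtering steps of this exercise can be sketched on a small synthetic image rather than on moonlanding.png. Everything about the test image below (its size, the low-frequency pattern, the periodic noise) and the keep_fraction value are assumptions made so that the snippet is self-contained; it is not the reference solution.

```python
import numpy as np
from scipy import fftpack

# Synthetic 64x64 image: smooth low-frequency content plus periodic noise
n = 64
x = np.arange(n) / n
clean = np.sin(2 * np.pi * x)[:, None] + np.cos(2 * np.pi * 2 * x)[None, :]
noise = 0.5 * np.sin(2 * np.pi * 20 * x)[None, :]
im = clean + noise

# 2-D FFT, then zero the high-frequency rows and columns by array slicing
im_fft = fftpack.fft2(im)
keep_fraction = 0.1
im_fft2 = im_fft.copy()
r, c = im_fft2.shape
im_fft2[int(r * keep_fraction):int(r * (1 - keep_fraction))] = 0
im_fft2[:, int(c * keep_fraction):int(c * (1 - keep_fraction))] = 0

# Inverse transform; the imaginary part is only numerical round-off here
im_new = fftpack.ifft2(im_fft2).real
print(np.abs(im - clean).max(), np.abs(im_new - clean).max())
```

Low frequencies sit in the corners of the unshifted spectrum, which is why the middle rows and columns are the ones set to zero.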
>>> plt.plot(t, x)
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(t[::4], x_resampled, 'ko')
[<matplotlib.lines.Line2D object at ...>]
Tip: Notice how on the side of the window the resampling is less accurate and has a rippling effect.
This resampling is different from the interpolation provided by scipy.interpolate as it only applies to
regularly sampled data.
>>> t = np.linspace(0, 5, 100)
>>> x = t + np.random.normal(size=100)
>>> from scipy import signal
>>> x_detrended = signal.detrend(x)
>>> plt.plot(t, x)
Filtering: For non-linear filtering, scipy.signal has filtering (median filter scipy.signal.medfilt(),
Wiener scipy.signal.wiener()), but we will discuss this in the image section.
Tip: scipy.signal also has a full-blown set of tools for the design of linear filters (finite and infinite
response filters), but this is out of the scope of this tutorial.
>>> from scipy import ndimage # Shift, rotate and zoom it
>>> shifted_face = ndimage.shift(face, (50, 50))
>>> shifted_face2 = ndimage.shift(face, (50, 50), mode='nearest')
>>> rotated_face = ndimage.rotate(face, 30)
>>> cropped_face = face[50:-50, 50:-50]
>>> zoomed_face = ndimage.zoom(face, 2)
>>> plt.subplot(151)
<matplotlib.axes._subplots.AxesSubplot object at 0x...>
>>> plt.axis('off')
(-0.5, 1023.5, 767.5, -0.5)
>>> # etc.
6.10.2 Image filtering
Generate a noisy face:
>>> from scipy import misc
>>> face = misc.face(gray=True)
>>> face = face[:512, -512:] # crop out square on right
>>> import numpy as np
>>> noisy_face = np.copy(face).astype(float)
>>> noisy_face += face.std() * 0.5 * np.random.standard_normal(face.shape)
Mathematical-morphology operations use a structuring element in order to modify geometrical structures.
Let us first generate a structuring element:
>>> el = ndimage.generate_binary_structure(2, 1)
>>> el
array([[False, True, False],
       [...True, True, True],
       [False, True, False]])
>>> el.astype(int)
array([[0, 1, 0],
       [1, 1, 1],
       [0, 1, 0]])
• Dilation scipy.ndimage.binary_dilation()
• Closing: scipy.ndimage.binary_closing()
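As a small illustration of these morphology operations, the sketch below applies erosion and dilation to a synthetic square using a cross-shaped structuring element from ndimage.generate_binary_structure(2, 1). The 7x7 test array is an assumption chosen so that the effect is easy to check by hand.

```python
import numpy as np
from scipy import ndimage

# Cross-shaped structuring element (4-connectivity)
el = ndimage.generate_binary_structure(2, 1)

# Synthetic binary image: a 5x5 square inside a 7x7 array
a = np.zeros((7, 7), dtype=int)
a[1:6, 1:6] = 1

# Erosion peels off the outer layer of the square,
# dilation grows it by one pixel along the four main directions
eroded = ndimage.binary_erosion(a, structure=el).astype(int)
dilated = ndimage.binary_dilation(a, structure=el).astype(int)

print(eroded.sum(), dilated.sum())
```

Opening and closing combine these two steps, in opposite orders, with the same structuring element.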
Exercise
6.10.4 Connected components and measurements on images
Let us first generate a nice synthetic binary image.
scipy.ndimage.label() assigns a different label to each connected component:
>>> labels, nb = ndimage.label(mask)
>>> nb
8
Now compute measurements on each connected component:
>>> areas = ndimage.sum(mask, labels, range(1, labels.max()+1))
>>> areas # The number of pixels in each connected component
array([190., 45., 424., 278., 459., 190., 549., 424.])
>>> maxima = ndimage.maximum(sig, labels, range(1, labels.max()+1))
>>> maxima # The maximum signal in each connected component
array([ 1.80238238, 1.13527605, 5.51954079, 2.49611818, 6.71673619,
        1.80238238, 16.76547217, 5.51954079])
Extract the 4th connected component, and crop the array around it:
>>> ndimage.find_objects(labels==4)
[(slice(30, 48, None), slice(30, 48, None))]
>>> sl = ndimage.find_objects(labels==4)
>>> from matplotlib import pyplot as plt
>>> plt.imshow(sig[sl[0]])
<matplotlib.image.AxesImage object at ...>
See the summary exercise on Image processing application: counting bubbles and unmolten grains for a
more advanced example.
6.11 Summary exercises on scientific computing
The summary exercises use mainly Numpy, Scipy and Matplotlib. They provide some real-life examples
of scientific computing with Python. Now that the basics of working with Numpy and Scipy have been
introduced, the interested user is invited to try these exercises.
Statistical approach
The annual maxima are supposed to fit a normal probability density function. However, such a function
is not going to be estimated because it gives a probability from a wind speed maximum. Finding the
maximum wind speed occurring every 50 years requires the opposite approach: the result needs to be
found from a defined probability. That is the role of the quantile function, and the goal of the exercise is to find
it. In the current model, it is supposed that the maximum wind speed occurring every 50 years is defined
as the upper 2% quantile.
By definition, the quantile function is the inverse of the cumulative distribution function. The latter
describes the probability distribution of an annual maximum. In the exercise, the cumulative probability
p_i for a given year i is defined as p_i = i/(N+1) with N = 21, the number of measured years. Thus
it will be possible to calculate the cumulative probability of every measured wind speed maximum. From
those experimental points, the scipy.interpolate module will be very useful for fitting the quantile function.
Finally, the 50-year maximum is going to be evaluated from the cumulative probability of the 2% quantile.
Computing the cumulative probabilities
The annual wind speed maxima have already been computed and saved in the numpy format in the file
examples/max-speeds.npy, thus they will be loaded by using numpy:
Following the cumulative probability definition p_i from the previous section, the corresponding values
will be:
Prediction with UnivariateSpline
In this section the quantile function will be estimated by using the UnivariateSpline class which can
represent a spline from points. The default behavior is to build a spline of degree 3 and points can
have different weights according to their reliability. Variants are InterpolatedUnivariateSpline and
LSQUnivariateSpline, for which error checking differs. In case a 2D spline is wanted,
the BivariateSpline class family is provided. All those classes for 1D and 2D splines use the
FITPACK Fortran subroutines, which is why lower-level access is available through the splrep and
splev functions for respectively representing and evaluating a spline. Moreover, interpolation functions
without the use of FITPACK parameters are also provided for simpler use (see interp1d, interp2d,
barycentric_interpolate and so on).
For the Sprogø maxima wind speeds, the UnivariateSpline will be used because a spline of degree 3
seems to correctly fit the data:
>>> from scipy.interpolate import UnivariateSpline
>>> quantile_func = UnivariateSpline(cprob, sorted_max_speeds)
The quantile function is now going to be evaluated from the full range of probabilities:
In the current model, the maximum wind speed occurring every 50 years is defined as the upper 2%
quantile. As a result, the cumulative probability value will be:
So the storm wind speed occurring every 50 years can be guessed by:
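The whole chain described in this exercise, from cumulative probabilities p_i = i/(N+1) to the 2% upper quantile, can be sketched as follows. The file examples/max-speeds.npy is not loaded here: the annual maxima are replaced by synthetic values, which is an assumption, so the resulting speed is illustrative only. Variable names such as fifty_prob and fifty_wind are also illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for the 21 measured annual maxima (assumed data, in m/s)
rng = np.random.default_rng(0)
years_nb = 21
max_speeds = rng.normal(loc=27.0, scale=3.0, size=years_nb)

# Cumulative probability p_i = i / (N + 1) for the sorted maxima
cprob = (np.arange(years_nb, dtype=np.float64) + 1) / (years_nb + 1)
sorted_max_speeds = np.sort(max_speeds)

# Quantile function: wind speed as a function of cumulative probability
quantile_func = UnivariateSpline(cprob, sorted_max_speeds)
nprob = np.linspace(0, 1, 100)
fitted_max_speeds = quantile_func(nprob)

# The 50-year maximum corresponds to the upper 2% quantile
fifty_prob = 1.0 - 0.02
fifty_wind = float(quantile_func(fifty_prob))
print(fifty_wind)
```

Note that 0.98 lies beyond the largest cumulative probability (21/22), so the spline is extrapolated slightly at that point.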
• The second step will be to use the Gumbel distribution on cumulative probabilities p_i defined as
-log( -log(p_i) ) for fitting a linear quantile function (remember that you can define the degree
of the UnivariateSpline). Plotting the annual maxima versus the Gumbel distribution should
give you the following figure.
• The last step will be to find 34.23 m/s for the maximum wind speed occurring every 50 years.
6.11.2 Non linear least squares curve fitting: application to point extraction in
topographical lidar data
The goal of this exercise is to fit a model to some data. The data used in this tutorial are lidar data
and are described in detail in the following introductory paragraph. If you’re impatient and want to
practice now, please skip it and go directly to Loading and visualization.
Introduction
Lidar systems are optical rangefinders that analyze properties of scattered light to measure distances.
Most of them emit a short light pulse towards a target and record the reflected signal. This signal
is then processed to extract the distance between the lidar system and the target.
Topographical lidar systems are such systems embedded in airborne platforms. They measure distances
between the platform and the Earth, so as to deliver information on the Earth’s topography (see [1] for
more details).
In this tutorial, the goal is to analyze the waveform recorded by the lidar system [2]. Such a signal
contains peaks whose center and amplitude make it possible to compute the position and some characteristics of
the hit target. When the footprint of the laser beam is around 1 m on the Earth surface, the beam can
hit multiple targets during the two-way propagation (for example the ground and the top of a tree or
building). The sum of the contributions of each target hit by the laser beam then produces a complex
signal with multiple peaks, each one containing information about one target.
One state of the art method to extract information from these data is to decompose them in a sum of
Gaussian functions where each function represents the contribution of a target hit by the laser beam.
Therefore, we use the scipy.optimize module to fit a waveform to one or a sum of Gaussian functions.
As shown below, this waveform is a 80-bin-length signal with a single peak with an amplitude of
approximately 30 in the 15 nanosecond bin. Additionally, the base level of noise is approximately 3. These
values can be used in the initial solution.
[1] Mallet, C. and Bretar, F. Full-Waveform Topographic Lidar: State-of-the-Art. ISPRS Journal of Photogrammetry
Model
where
• coeffs[0] is 𝐵 (noise)
• coeffs[1] is 𝐴 (amplitude)
• coeffs[2] is 𝜇 (center)
• coeffs[3] is 𝜎 (width)
Initial solution
Fit
scipy.optimize.leastsq minimizes the sum of squares of the function given as an argument. Basically,
the function to minimize is the residuals (the difference between the data and the model):
>>> from scipy.optimize import leastsq
>>> t = np.arange(len(waveform_1))
>>> x, flag = leastsq(residuals, x0, args=(waveform_1, t))
>>> print(x)
[ 2.70363341 27.82020742 15.47924562 3.05636228]
And visualize the solution:
Remark: from scipy v0.8 and above, you should rather use scipy.optimize.curve_fit() which takes
the model and the data as arguments, so you don’t need to define the residuals any more.
• When we want to detect very small peaks in the signal, or when the initial guess is too far from a
good solution, the result given by the algorithm is often not satisfying. Adding constraints to the
parameters of the model makes it possible to overcome such limitations. An example of a priori knowledge
we can add is the sign of our variables (which are all positive).
• See the solution.
• Further exercise: compare the result of scipy.optimize.leastsq() and what you can get with
scipy.optimize.fmin_slsqp() when adding boundary constraints.
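The leastsq and curve_fit approaches mentioned in this section can be compared on a synthetic waveform, as sketched below. The model form B + A exp(-((t - mu) / sigma)^2) is an assumption consistent with the four coefficients listed (noise, amplitude, center, width); the functions model, residuals and model_cf, the synthetic data and the initial guess are likewise illustrative, since the original waveform file is not used here.

```python
import numpy as np
from scipy.optimize import leastsq, curve_fit


def model(t, coeffs):
    # Assumed Gaussian model: coeffs = (B noise, A amplitude, mu center, sigma width)
    return coeffs[0] + coeffs[1] * np.exp(-((t - coeffs[2]) / coeffs[3]) ** 2)


def residuals(coeffs, y, t):
    # Difference between the data and the model, as required by leastsq
    return y - model(t, coeffs)


def model_cf(t, B, A, mu, sigma):
    # Same model with separate parameters, as required by curve_fit
    return model(t, (B, A, mu, sigma))


# Synthetic 80-bin waveform with one peak (assumed data)
rng = np.random.default_rng(0)
t = np.arange(80)
true_coeffs = np.array([3.0, 30.0, 15.0, 3.0])
waveform = model(t, true_coeffs) + 0.2 * rng.standard_normal(t.size)

x0 = np.array([3, 30, 15, 1], dtype=float)
x_lsq, flag = leastsq(residuals, x0, args=(waveform, t))
x_cf, cov = curve_fit(model_cf, t, waveform, p0=x0)
print(x_lsq)
print(x_cf)
```

With curve_fit the residuals function is no longer needed, and the returned covariance matrix gives a rough estimate of the parameter uncertainties.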
4. Using the histogram of the filtered image, determine thresholds that define masks for sand
pixels, glass pixels and bubble pixels. Other option (homework): write a function that determines
automatically the thresholds from the minima of the histogram.
5. Display an image in which the three phases are colored with three different colors.
7. Assign labels to all bubbles and sand grains, and remove from the sand mask grains that are
smaller than 10 pixels. To do so, use ndimage.sum or np.bincount to compute the grain sizes.
8. Compute the mean size of bubbles.
6.11.4 Example of solution for the image processing exercise: unmolten grains in
glass
1. Open the image file MV_HFV_012.jpg and display it. Browse through the keyword arguments
in the docstring of imshow to display the image with the “right” orientation (origin in the bottom
left corner, and not the upper left corner as for standard arrays).
>>> dat = plt.imread('data/MV_HFV_012.jpg')
2. Crop the image to remove the lower panel with measure information.
>>> dat = dat[:-60]
3. Slightly filter the image with a median filter in order to refine its histogram. Check how the
histogram changes.
>>> filtdat = ndimage.median_filter(dat, size=(7,7))
>>> hi_dat = np.histogram(dat, bins=np.arange(256))
>>> hi_filtdat = np.histogram(filtdat, bins=np.arange(256))
import numpy as np
import matplotlib.pyplot as plt
def f(x):
    return x**2 + 10*np.sin(x)
7. Attribute labels to all bubbles and sand grains, and remove from the sand mask grains that are
smaller than 10 pixels. To do so, use ndimage.sum or np.bincount to compute the grain sizes.
>>> sand_labels, sand_nb = ndimage.label(sand_op)
>>> sand_areas = np.array(ndimage.sum(sand_op, sand_labels, np.arange(sand_labels.max()+1)))
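The np.bincount alternative mentioned in step 7 can be sketched on a tiny synthetic mask. The array contents, the 10-pixel threshold variable min_size and the names mask and cleaned are illustrative assumptions; the exercise itself works on the sand mask of the glass image.

```python
import numpy as np
from scipy import ndimage

# Small synthetic mask (hypothetical data): one 4x4 blob and one lone pixel
mask = np.zeros((8, 8), dtype=bool)
mask[1:5, 1:5] = True      # 16 pixels, kept
mask[6, 6] = True          # 1 pixel, to be removed

labels, nb = ndimage.label(mask)

# bincount counts the pixels of each label; index 0 is the background
areas = np.bincount(labels.ravel())

# Flag the labels whose area is below the threshold, never the background
min_size = 10
too_small = areas < min_size
too_small[0] = False

# Index the per-label flags with the label image to get a pixel mask
cleaned = mask.copy()
cleaned[too_small[labels]] = False
print(nb, areas, cleaned.sum())
```

Indexing the flag array with the label image avoids an explicit Python loop over the grains.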
6.11. Summary exercises on scientific computing 220 6.12. Full code examples for the scipy chapter 221
Out:
fun: -7.945823375615215
hess_inv: array([[0.08589237]])
jac: array([-1.1920929e-06])
message: 'Optimization terminated successfully.'
nfev: 18
nit: 5
njev: 6
status: 0
success: True
x: array([-1.30644012])
Out:
fun: array([-7.94582338])
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([-1.42108547e-06])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 12
nit: 5
status: 0
success: True
x: array([-1.30644013])
plt.figure(figsize=(4, 3))
plt.plot(time_vec, yvec)
plt.xlabel('t: Time')
plt.ylabel('y: Position')
plt.tight_layout()
import numpy as np
from matplotlib import pyplot as plt
import numpy as np
from scipy.integrate import odeint
from matplotlib import pyplot as plt
mass = 0.5 # kg
kspring = 4 # N/m
cviscous = 0.4 # N s/m

eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
omega = np.sqrt(kspring / mass)

def calc_deri(yvec, time, eps, omega):
    # y'' = -2*eps*omega*y' - omega**2*y, since cviscous/mass = 2*eps*omega
    return (yvec[1], -2.0 * eps * omega * yvec[1] - omega **2 * yvec[0])

time_vec = np.linspace(0, 10, 100)
yinit = (1, 0)
yarr = odeint(calc_deri, yinit, time_vec, args=(eps, omega))

plt.figure(figsize=(4, 3))
plt.plot(time_vec, yarr[:, 0], label='y')
plt.plot(time_vec, yarr[:, 1], label="y'")
plt.legend(loc='best')
plt.show()

import numpy as np
# Sample from a normal distribution using numpy's random number generator
samples = np.random.normal(size=10000)

# Compute a histogram of the sample
bins = np.linspace(-5, 5, 30)
histogram, bins = np.histogram(samples, bins=bins, density=True)

bin_centers = 0.5*(bins[1:] + bins[:-1])

# Compute the PDF on the bin centers from scipy distribution object
from scipy import stats
pdf = stats.norm.pdf(bin_centers)

from matplotlib import pyplot as plt
plt.figure(figsize=(6, 4))
plt.plot(bin_centers, histogram, label="Histogram of samples")
plt.plot(bin_centers, pdf, label="PDF")
plt.legend()
plt.show()
# And plot it
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)
plt.show()
# Generate data
import numpy as np
np.random.seed(0)
measured_time = np.linspace(0, 1, 10)
noise = 1e-1 * (np.random.random(10)*2 - 1)
measures = np.sin(2 * np.pi * measured_time) + noise
plt.subplot(143)
plt.imshow(opened_mask, cmap=plt.cm.gray)
plt.axis('off')
plt.title('opened_mask')
plt.subplot(144)
plt.imshow(closed_mask, cmap=plt.cm.gray)
plt.title('closed_mask')
plt.axis('off')
plt.subplots_adjust(wspace=.05, left=.01, bottom=.01, right=.99, top=.99)
plt.show()

plt.subplot(155)
plt.imshow(zoomed_face, cmap=plt.cm.gray)
plt.axis('off')
plt.subplots_adjust(wspace=.05, left=.01, bottom=.01, right=.99, top=.99)
plt.show()
np.random.seed(0)
x, y = np.indices((100, 100))
sig = np.sin(2*np.pi*x/50.) * np.sin(2*np.pi*y/50.) * (1+x*y/50.**2)**2
mask = sig > 1
plt.figure(figsize=(7, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(sig)
plt.axis('off')
plt.title('sig')
plt.subplot(1, 2, 2)
plt.imshow(mask, cmap=plt.cm.gray)
plt.axis('off')
plt.title('mask')
plt.subplots_adjust(wspace=.05, left=.01, bottom=.01, right=.99, top=.9)
Extract the 4th connected component, and crop the array around it
sl = ndimage.find_objects(labels==4)
plt.figure(figsize=(3.5, 3.5))
plt.imshow(sig[sl[0]])
plt.title('Cropped connected component')
plt.axis('off')
plt.show()
plt.figure(figsize=(3.5, 3.5))
plt.imshow(labels)
plt.title('label')
plt.axis('off')
Find minima
# Global optimization
grid = (-10, 10, 0.1)
xmin_global = optimize.brute(f, (grid, ))
print("Global minima found %s " % xmin_global)
# Constrain optimization
xmin_local = optimize.fminbound(f, 0, 10)
print("Local minimum found %s " % xmin_local)
Out:
Root finding
import numpy as np
noisy_face = np.copy(face).astype(float)
noisy_face += face.std() * 0.5 * np.random.standard_normal(face.shape)
blurred_face = ndimage.gaussian_filter(noisy_face, sigma=3)
median_face = ndimage.median_filter(noisy_face, size=5)
wiener_face = signal.wiener(noisy_face, (5, 5))
plt.figure(figsize=(12, 3.5))
plt.subplot(141)
plt.imshow(noisy_face, cmap=plt.cm.gray)
plt.axis('off')
plt.title('noisy')
plt.subplot(142)
plt.imshow(blurred_face, cmap=plt.cm.gray)
plt.axis('off')
plt.title('Gaussian filter')
plt.subplot(143)
plt.imshow(median_face, cmap=plt.cm.gray)
plt.axis('off')
plt.title('median filter')
plt.subplot(144)
plt.imshow(wiener_face, cmap=plt.cm.gray)
plt.title('Wiener filter')
plt.axis('off')
plt.show()
A 3D surface plot of the function
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(xg, yg, sixhump([xg, yg]), rstride=1, cstride=1,
                       cmap=plt.cm.jet, linewidth=0, antialiased=False)
from scipy import optimize
time_step = 0.02
period = 5.
plt.figure(figsize=(6, 5))
plt.plot(time_vec, sig, label='Original signal')
# Check that it does indeed correspond to the frequency that we generate
# the signal with
np.allclose(peak_freq, 1./period)
plt.legend(loc='best')
Note: This is actually a bad way of creating a filter: such a brutal cut-off in frequency space does not
control distortion of the signal.
Filters should be created using the scipy filter design code.
plt.show()
Crude periodicity finding
Discover the periods in evolution of animal populations (../../../../data/populations.txt)
Load the data
import numpy as np
data = np.loadtxt('../../../../data/populations.txt')
years = data[:, 0]
populations = data[:, 1:]

from scipy import fftpack
import matplotlib.pyplot as plt
ft_populations = fftpack.fft(populations, axis=0)
frequencies = fftpack.fftfreq(populations.shape[0], years[1] - years[0])
periods = 1 / frequencies

plt.figure()
plt.plot(periods, abs(ft_populations) * 1e-3, 'o')
plt.xlim(0, 22)
plt.xlabel('Period')
plt.ylabel(r'Power ($\cdot10^3$)')
plt.show()
There’s probably a period of around 10 years (obvious from the plot), but for this crude a method,
there’s not enough data to say much more.
Fitting it to a periodic function
from scipy import optimize
def yearly_temps(times, avg, ampl, time_offset):
    return (avg
            + ampl * np.cos((times + time_offset) * 2 * np.pi / times.max()))
Simple image blur by convolution with a Gaussian kernel
Blur an image (../../../../data/elephant.png) using a Gaussian kernel.
Convolution is easy to perform with FFT: convolving two signals boils down to multiplying their FFTs
(and performing an inverse FFT).
import numpy as np
from scipy import fftpack
import matplotlib.pyplot as plt
The original image
# read image
img = plt.imread('../../../../data/elephant.png')
plt.figure()
plt.imshow(img)
# First a 1-D Gaussian
t = np.linspace(-10, 10, 30)
bump = np.exp(-0.1*t**2)
bump /= np.trapz(bump) # normalize the integral to 1
# make a 2-D kernel out of it
kernel = bump[:, np.newaxis] * bump[np.newaxis, :]
Implement convolution via FFT
# Padded fourier transform, with the same shape as the image
# We use :func:`scipy.fftpack.fft2` to have a 2D FFT
kernel_ft = fftpack.fft2(kernel, shape=img.shape[:2], axes=(0, 1))
# convolve
img_ft = fftpack.fft2(img, axes=(0, 1))
# the 'newaxis' is to match to color direction
img2_ft = kernel_ft[:, :, np.newaxis] * img_ft
img2 = fftpack.ifft2(img2_ft, axes=(0, 1)).real
# clip values to range
img2 = np.clip(img2, 0, 1)
# plot output
plt.figure()
plt.imshow(img2)
Further exercise (only if you are familiar with this stuff):
A “wrapped border” appears in the upper left and top edges of the image. This is because the padding
is not done correctly, and does not take the kernel size into account (so the convolution “flows out of
bounds of the image”). Try to remove this artifact.
A function to do it: scipy.signal.fftconvolve()
The above exercise was only for didactic reasons: there exists a function in scipy that will do
this for us, and probably do a better job: scipy.signal.fftconvolve()
Note that we still have a decay to zero at the border of the image. Using
scipy.ndimage.gaussian_filter() would get rid of this artifact.
plt.show()
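The wrap-around artifact can be reproduced in one dimension, which keeps the sketch short. It uses numpy.fft rather than scipy.fftpack, and the random signal and the 15-point kernel are hypothetical data chosen for illustration. Padding both inputs to len(signal) + len(kernel) - 1 turns the circular convolution computed by the FFT into a linear one.

```python
import numpy as np

rng = np.random.default_rng(0)
signal_1d = rng.random(50)                       # hypothetical 1-D "image row"
kernel_1d = np.exp(-0.1 * np.linspace(-10, 10, 15) ** 2)
kernel_1d /= kernel_1d.sum()

# Padding the kernel only to the signal length gives a circular convolution:
# the tail of the result wraps around onto the first samples.
circular = np.fft.irfft(
    np.fft.rfft(signal_1d) * np.fft.rfft(kernel_1d, n=50), n=50)

# Padding both inputs to len(signal) + len(kernel) - 1 leaves room for the
# full linear convolution, so nothing wraps around.
n = signal_1d.size + kernel_1d.size - 1
linear = np.fft.irfft(
    np.fft.rfft(signal_1d, n=n) * np.fft.rfft(kernel_1d, n=n), n=n)

# Reference: direct linear convolution
direct = np.convolve(signal_1d, kernel_1d)
print(np.allclose(linear, direct))
```

Only the first len(kernel) - 1 samples of the circular result are contaminated, which is why the artifact shows up along two edges of the image.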
import numpy as np
import matplotlib.pyplot as plt
im = plt.imread('../../../../data/moonlanding.png').astype(float)
plt.figure()
plt.imshow(im, plt.cm.gray)
plt.title('Original image')
from scipy import fftpack
im_fft = fftpack.fft2(im)
# Show the results
def plot_spectrum(im_fft):
    from matplotlib.colors import LogNorm
    # A logarithmic colormap
    plt.imshow(np.abs(im_fft), norm=LogNorm(vmin=5))
    plt.colorbar()

plt.figure()
plot_spectrum(im_fft)
plt.title('Fourier transform')

# In the lines following, we'll make a copy of the original spectrum and
# truncate coefficients.
# Define the fraction of coefficients (in each direction) we keep
keep_fraction = 0.1
# Call ff a copy of the original transform. Numpy arrays have a copy
# method for this purpose.
im_fft2 = im_fft.copy()
# Set r and c to be the number of rows and columns of the array.
r, c = im_fft2.shape
# Set to zero all rows with indices between r*keep_fraction and
# r*(1-keep_fraction):
im_fft2[int(r*keep_fraction):int(r*(1-keep_fraction))] = 0
# Similarly with the columns:
im_fft2[:, int(c*keep_fraction):int(c*(1-keep_fraction))] = 0
plt.figure()
plot_spectrum(im_fft2)
plt.title('Filtered Spectrum')
# Reconstruct the denoised image from the filtered spectrum, keep only the
# real part for display.
im_new = fftpack.ifft2(im_fft2).real
plt.figure()
plt.imshow(im_new, plt.cm.gray)
plt.title('Reconstructed Image')
Implementing filtering directly with FFTs is tricky and time consuming. We can use the
Gaussian filter from scipy.ndimage
from scipy import ndimage
im_blur = ndimage.gaussian_filter(im, 4)
plt.figure()
plt.imshow(im_blur, plt.cm.gray)
plt.title('Blurred image')
plt.show()
See also:
References to go further
• Some chapters of the advanced and the packages and applications parts of the scipy lectures
• The scipy cookbook
CHAPTER 7
Getting help and finding documentation
Author: Emmanuelle Gouillart
Rather than knowing all functions in Numpy and Scipy, it is important to be able to quickly find
information throughout the documentation and the available help. Here are some ways to get information:
• In Ipython, the help function opens the docstring of the function. Only type the beginning of the
function’s name and use tab completion to display the matching functions.
In [204]: help np.v
np.vander np.vdot np.version np.void0 np.vstack
np.var np.vectorize np.void np.vsplit
In Ipython it is not possible to open a separate window for help and documentation; however
one can always open a second Ipython shell just to display help and docstrings. . .
Tutorials on various topics as well as the complete API with all docstrings are found on this website.
• Numpy’s and Scipy’s documentation is enriched and updated on a regular basis by users on a
wiki http://numpy.org/doc/stable/. As a result, some docstrings are clearer or more detailed on
the wiki, and you may want to read directly the documentation on the wiki instead of the official
documentation website. Note that anyone can create an account on the wiki and write better
documentation; this is an easy way to contribute to an open-source project and improve the tools
you are using!
• The SciPy Cookbook https://scipy-cookbook.readthedocs.io gives recipes for many frequently
encountered problems, such as fitting data points, solving ODE, etc.
• Matplotlib’s website http://matplotlib.org/ features a very nice gallery with a large number of
plots, each of which shows both the source code and the resulting plot. This is very useful for
learning by example. More standard documentation is also available.
Finally, two more “technical” possibilities are useful as well:
• In Ipython, the magic function %psearch searches for objects matching patterns. This is useful if,
for example, one does not know the exact name of a function.
In [45]: numpy.lookfor('convolution')
Search results for 'convolution'
--------------------------------
numpy.convolve
    Returns the discrete, linear convolution of two one-dimensional
    sequences.
numpy.bartlett
    Return the Bartlett window.
numpy.correlate
    Discrete, linear correlation of two 1-dimensional sequences.
In [46]: numpy.lookfor('remove', module='os')
Search results for 'remove'
---------------------------
os.remove
    remove(path)
os.removedirs
    removedirs(path)
os.rmdir
    rmdir(path)
os.unlink
    unlink(path)
os.walk
    Directory tree generator.
• If everything listed above fails (and Google doesn’t have the answer). . . don’t despair! Write to the
mailing-list suited to your problem: you should have a quick answer if you describe your problem
well. Experts on scientific python often give very enlightening explanations on the mailing-list.
– Numpy discussion ([email protected]): all about numpy arrays, manipulating
them, indexing questions, etc.
– SciPy Users List ([email protected]): scientific computing with Python, high-level data
processing, in particular with the scipy package.
– [email protected] for plotting with matplotlib.
Part II
Advanced topics
This part of the Scipy lecture notes is dedicated to advanced usage. It strives to educate the proficient
Python coder to be an expert and tackles various specific topics.
CHAPTER 8
Author: Zbigniew Jędrzejewski-Szmek
This section covers some features of the Python language which can be considered advanced, in
the sense that not every language has them, and also in the sense that they are more useful in more
complicated programs or libraries, but not in the sense of being particularly specialized, or particularly
complicated.
It is important to underline that this chapter is purely about the language itself: about features
supported through special syntax complemented by functionality of the Python stdlib, which could not
be implemented through clever external modules.
The process of developing the Python programming language, its syntax, is very transparent; proposed
changes are evaluated from various angles and discussed via Python Enhancement Proposals (PEPs).
As a result, features described in this chapter were added after it was shown that they indeed solve real
problems and that their use is as simple as possible.
Chapter contents
– Replacing or tweaking the original object
– A while-loop removing decorator
– A plugin registration system
• Context managers
– Catching exceptions
– Using generators to define context managers
Duplication of effort is wasteful, and replacing the various home-grown approaches with a standard
feature usually ends up making things more readable, and interoperable as well.
Guido van Rossum, Adding Optional Static Typing to Python
An iterator is an object adhering to the iterator protocol; basically this means that it has a next
method (__next__ in Python 3), which, when called, returns the next item in the sequence, and when
there’s nothing to return, raises the StopIteration exception.
An iterator object allows looping just once. It holds the state (position) of a single iteration, or from the
other side, each loop over a sequence requires a single iterator object. This means that we can iterate
over the same sequence more than once concurrently. Separating the iteration logic from the sequence
allows us to have more than one way of iteration.
Calling the __iter__ method on a container to create an iterator object is the most straightforward way
to get hold of an iterator. The iter function does that for us, saving a few keystrokes.
>>> nums = [1, 2, 3] # note that ... varies: these are different objects
>>> iter(nums)
<...iterator object at ...>
>>> nums.__iter__()
<...iterator object at ...>
>>> nums.__reversed__()
<...reverseiterator object at ...>
When used in a loop, StopIteration is swallowed and causes the loop to finish. But with explicit
invocation, we can see that once the iterator is exhausted, accessing it raises an exception. When next is called, the function is executed until the first yield. Each encountered yield statement
gives a value that becomes the return value of next. After executing the yield statement, the execution of
Using the for..in loop also uses the __iter__ method. This allows us to transparently start the iteration
this function is suspended.
over a sequence. But if we already have the iterator, we want to be able to use it in a for loop in
the same way. In order to achieve this, iterators in addition to next are also required to have a method >>> def f():
called __iter__ which returns the iterator (self). ... yield 1
... yield 2
Support for iteration is pervasive in Python: all sequences and unordered containers in the standard
>>> f()
library allow this. The concept is also stretched to other things: e.g. file objects support iteration over <generator object f at 0x...>
lines. >>> gen = f()
>>> next(gen)
>>> f = open('/etc/fstab')
1
>>> f is f.__iter__()
>>> next(gen)
True
2
>>> next(gen)
The file is an iterator itself and its __iter__ method doesn’t create a separate object: only a single Traceback (most recent call last):
thread of sequential access is allowed. File "<stdin>", line 1, in <module>
StopIteration
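The iterator protocol can be sketched with two small classes. The names Countdown and CountdownIterator are hypothetical, and the method is spelled __next__ as in Python 3. The container hands out a fresh iterator on every __iter__ call, while the iterator returns itself and can be consumed only once.

```python
class Countdown:
    """Container (hypothetical name) that can be iterated over many times."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Each call returns a fresh iterator holding its own position
        return CountdownIterator(self.start)


class CountdownIterator:
    """Iterator: holds the state of a single pass over the sequence."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        # Iterators return themselves, so they can be used in a for loop
        return self

    def __next__(self):
        # Python 3 spelling of the ``next`` method
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


container = Countdown(3)
# Two concurrent loops over the same container: each gets its own iterator
pairs = [(a, b) for a in container for b in container]

it = iter(container)
first_pass = list(it)    # consumes the iterator
second_pass = list(it)   # already exhausted, nothing left
print(first_pass, second_pass, iter(it) is it)
```

Separating the container from its iterator is what allows the nested loops above to run over the same object without interfering.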
8.1.2 Generator expressions
Let’s go over the life of the single invocation of the generator function.
A second way in which iterator objects are created is through generator expressions, the basis for list
comprehensions. To increase clarity, a generator expression must always be enclosed in parentheses >>> def f():
or an expression. If round parentheses are used, then a generator iterator is created. If square ... print("-- start --")
brackets are used, the process is short-circuited and we get a list. ... yield 3
... print("-- middle --")
>>> (i for i in nums) ... yield 4
<generator object <genexpr> at 0x...> ... print("-- finished --")
>>> [i for i in nums] >>> gen = f()
[1, 2, 3] >>> next(gen)
>>> list(i for i in nums) -- start --
[1, 2, 3] 3
>>> next(gen)
The list comprehension syntax also extends to dictionary and set comprehensions. A set is cre- -- middle --
4
ated when the generator expression is enclosed in curly braces. A dict is created when the generator
>>> next(gen)
expression contains “pairs” of the form key:value: -- finished --
>>> {i for i in range(3)} Traceback (most recent call last):
{0, 1, 2} ...
>>> {i:i**2 for i in range(3)} StopIteration
{0: 0, 1: 1, 2: 4}
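A short sketch of the difference between the forms in section 8.1.2, assuming Python 3; all variable names are illustrative. The list is built at once, the generator expression produces its values lazily and only once, and the loop variable does not leak out of the comprehension.

```python
nums = [1, 2, 3]

squares_list = [n * n for n in nums]      # list: built immediately
squares_gen = (n * n for n in nums)       # generator expression: lazy
parities = {n % 2 for n in nums}          # set comprehension
squares_dict = {n: n * n for n in nums}   # dict comprehension

# A generator expression that is the only argument of a call
# needs no extra pair of parentheses
total = sum(n * n for n in nums)

# The generator yields its values only once
first = list(squares_gen)
second = list(squares_gen)

# In Python 3 the loop variable of a comprehension does not leak
try:
    n
except NameError:
    leaked = False
else:
    leaked = True

print(squares_list, first, second, total, leaked)
```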
One gotcha should be mentioned: in old Pythons the index variable (i) would leak, and in versions >=
3 this is fixed.
Contrary to a normal function, where executing f() would immediately cause the first print to be
executed, gen is assigned without executing any statements in the function body. Only when gen.next()
is invoked by next, the statements up to the first yield are executed. The second next prints
-- middle -- and execution halts on the second yield. The third next prints -- finished -- and
falls off the end of the function. Since no yield was reached, an exception is raised.
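The same life cycle can be observed on a generator that keeps state between calls. The following is a minimal sketch with a hypothetical fibonacci generator: each generator object carries its own local variables, and resuming one does not affect the other.

```python
import itertools


def fibonacci():
    """Endless generator; its state lives in the local variables a and b."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


gen1 = fibonacci()   # nothing in the body has run yet
gen2 = fibonacci()   # independent generator object with its own state

first_five = list(itertools.islice(gen1, 5))
# gen1 resumes where it was suspended, gen2 starts from the beginning
next_from_gen1 = next(gen1)
next_from_gen2 = next(gen2)
print(first_five, next_from_gen1, next_from_gen2)
```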
8.1.3 Generators
Generators
A generator is a function that produces a sequence of results instead of a single value.
David Beazley, A Curious Course on Coroutines and Concurrency
A third way to create iterator objects is to call a generator function. A generator is a function containing
the keyword yield. It must be noted that the mere presence of this keyword completely changes the
nature of the function: this yield statement doesn’t have to be invoked, or even reachable, but causes
the function to be marked as a generator. When a normal function is called, the instructions contained in
the body start to be executed. When a generator is called, the execution stops before the first instruction
in the body. An invocation of a generator function creates a generator object, adhering to the iterator
protocol. As with normal function invocations, concurrent and recursive invocations are allowed.
What happens with the function after a yield, when the control passes to the caller? The state of
each generator is stored in the generator object. From the point of view of the generator function, it
looks almost as if it was running in a separate thread, but this is just an illusion: execution is strictly
single-threaded, but the interpreter keeps and restores the state in between the requests for the next
value.
Why are generators useful? As noted in the parts about iterators, a generator function is just a different
way to create an iterator object. Everything that can be done with yield statements could also be
done with next methods. Nevertheless, using a function and having the interpreter perform its magic
to create an iterator has advantages. A function can be much shorter than the definition of a class with
the required next and __iter__ methods. What is more important, it is easier for the author of the
generator to understand the state which is kept in local variables, as opposed to instance attributes,
which have to be used to pass data between consecutive invocations of next on an iterator object.
A broader question is why are iterators useful? When an iterator is used to power a loop, the loop
becomes very simple. The code to initialise the state, to decide if the loop is finished, and to find the
8.1. Iterators, generator expressions and generators 266 8.1. Iterators, generator expressions and generators 267
next value is extracted into a separate place. This highlights the body of the loop — the interesting (continued from previous page)
part. In addition, it is possible to reuse the iterator code in other places. --yielding 1--
1
8.1.4 Bidirectional communication >>> it.throw(IndexError)
--yield raised IndexError()--
Each yield statement causes a value to be passed to the caller. This is the reason for the introduction --yielding 2--
of generators by PEP 255 (implemented in Python 2.2). But communication in the reverse direction is 2
also useful. One obvious way would be some external state, either a global variable or a shared mutable >>> it.close()
object. Direct communication is possible thanks to PEP 342 (implemented in 2.5). It is achieved --closing--
by turning the previously boring yield statement into an expression. When the generator resumes
execution after a yield statement, the caller can call a method on the generator object to either pass a
value into the generator, which then is returned by the yield statement, or a different method to inject
an exception into the generator.
next or __next__?
In Python 2.x, the iterator method to retrieve the next value is called next. It is invoked implicitly
through the global function next, which means that it should be called __next__. Just like the global
function iter calls __iter__. This inconsistency is corrected in Python 3.x, where it.next becomes
it.__next__. For other generator methods (send and throw) the situation is more complicated,
because they are not called implicitly by the interpreter. Nevertheless, there’s a proposed syntax
extension to allow continue to take an argument which will be passed to send of the loop’s iterator.
If this extension is accepted, it’s likely that gen.send will become gen.__send__. The last of generator
methods, close, is pretty obviously named incorrectly, because it is already invoked implicitly.
The first of the new methods is send(value), which is similar to next(), but passes value into the
generator to be used for the value of the yield expression. In fact, g.next() and g.send(None) are
equivalent.
The second of the new methods is throw(type, value=None, traceback=None) which is equivalent to:
raise type, value, traceback
at the point of the yield statement.
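The send() and throw() methods described above can be tried on a small coroutine-style generator. This is a sketch under assumptions: the running_average generator and its reset-on-ValueError behaviour are invented for illustration and are not part of the lecture notes.

```python
def running_average():
    """Hypothetical example: receives numbers through send() and yields
    the average of the values seen so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        try:
            value = yield average   # value is whatever the caller sends
        except ValueError:
            # Exception injected with throw(): reset the statistics
            total, count, average = 0.0, 0, None
        else:
            total += value
            count += 1
            average = total / count


avg = running_average()
next(avg)                     # run to the first yield, same as avg.send(None)
results = [avg.send(10), avg.send(20), avg.send(30)]
after_reset = avg.throw(ValueError)   # raised at the yield, handled inside
after_restart = avg.send(4)
avg.close()                   # raises GeneratorExit inside the generator
print(results, after_reset, after_restart)
```

The first next() call is needed because a value can only be sent to a generator that is already suspended at a yield.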
Unlike raise (which immediately raises an exception from the current execution point), throw() first
resumes the generator, and only then raises the exception. The word throw was picked because it
is suggestive of putting the exception in another location, and is associated with exceptions in other
languages.
What happens when an exception is raised inside the generator? It can be either raised explicitly or
when executing some statements or it can be injected at the point of a yield statement by means of
the throw() method. In either case, such an exception propagates in the standard manner: it can
be intercepted by an except or finally clause, or otherwise it causes the execution of the generator
function to be aborted and propagates in the caller.
For completeness’ sake, it’s worth mentioning that generator iterators also have a close() method,
which can be used to force a generator that would otherwise be able to provide more values to finish
immediately. It allows the generator __del__ method to destroy objects holding the state of generator.
8.1.5 Chaining generators
Note: This section describes PEP 380, which was implemented in Python 3.3.
Let’s say we are writing a generator and we want to yield a number of values generated by a second
generator, a subgenerator. If yielding of values is the only concern, this can be performed without
much difficulty using a loop such as
subgen = some_other_generator()
for v in subgen:
    yield v
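The difference between an explicit delegation loop and the yield from syntax of PEP 380 (Python 3.3 and later) shows up as soon as the caller uses send(). In the sketch below, the accumulator subgenerator and both delegating functions are hypothetical names used only for illustration.

```python
def accumulator():
    """Hypothetical subgenerator: adds up the values sent to it."""
    total = 0
    while True:
        received = yield total
        if received is not None:
            total += received


def delegating_loop():
    # Explicit loop: values are passed on, but send() never reaches
    # the subgenerator, which is always resumed with None
    for v in accumulator():
        yield v


def delegating_yield_from():
    # yield from forwards send(), throw() and close() to the subgenerator
    yield from accumulator()


loop_gen = delegating_loop()
next(loop_gen)
loop_values = [loop_gen.send(5), loop_gen.send(7)]   # sent values are lost

yf_gen = delegating_yield_from()
next(yf_gen)
yf_values = [yf_gen.send(5), yf_gen.send(7)]   # sent values are accumulated

loop_gen.close()
yf_gen.close()
print(loop_values, yf_values)
```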
Let’s define a generator which just prints what is passed in through send and throw. However, if the subgenerator is to interact properly with the caller in the case of calls to send(), throw()
and close(), things become considerably more difficult. The yield statement has to be guarded by a
>>> import itertools
try..except..finally structure similar to the one defined in the previous section to “debug” the generator
>>> def g():
... print('--start--') function. Such code is provided in PEP 380#id13, here it suffices to say that new syntax to properly
... for i in itertools.count(): yield from a subgenerator was introduced in Python 3.3:
... print('--yielding %i --' % i)
yield from some_other_generator()
... try:
... ans = yield i
... except GeneratorExit: This behaves like the explicit loop above, repeatedly yielding values from some_other_generator until
... print('--closing--') it is exhausted, but also forwards send, throw and close to the subgenerator.
... raise
... except Exception as e:
... print('--yield raised %r --' % e) 8.2 Decorators
... else:
... print('--yield returned %s --' % ans)
Summary
>>> it = g()
>>> next(it) This amazing feature appeared in the language almost apologetically and with concern that it might
--start-- not be that useful.
--yielding 0--
0 Bruce Eckel — An Introduction to Python Decorators
>>> it.send(11)
--yield returned 11--
Since functions and classes are objects, they can be passed around. Since they are mutable objects, they
(continues on next page)
8.1. Iterators, generator expressions and generators 268 8.2. Decorators 269
Scipy lecture notes, Edition 2022.1 Scipy lecture notes, Edition 2022.1
can be modified. The act of altering a function or class object after it has been constructed but before
it is bound to its name is called decorating.
There are two things hiding behind the name “decorator”: one is the function which does the work of
decorating, i.e. performs the real work, and the other one is the expression adhering to the decorator
syntax, i.e. an at-symbol and the name of the decorating function.
A function can be decorated by using the decorator syntax for functions:

    @decorator           # (2)
    def function():      # (1)
        pass

• A function is defined in the standard way (1).
• An expression starting with @ placed before the function definition is the decorator (2). The part
after @ must be a simple expression, usually this is just the name of a function or class. This part is
evaluated first, and after the function defined below is ready, the decorator is called with the newly
defined function object as the single argument. The value returned by the decorator is attached to
the original name of the function.
Decorators can be applied to functions and to classes. For classes the semantics are identical: the
original class definition is used as an argument to call the decorator and whatever is returned is assigned
under the original name.
Before the decorator syntax was implemented (PEP 318), it was possible to achieve the same effect by
assigning the function or class object to a temporary variable, then invoking the decorator explicitly,
and then assigning the return value to the name of the function. This sounds like more typing, and it
is; in addition, the name of the decorated function, doubling as a temporary variable, must be used at
least three times, which is prone to errors. Nevertheless, the example above is equivalent to:

    def function():                  # (1)
        pass
    function = decorator(function)   # (2)

Decorators can be stacked: the order of application is bottom-to-top, or inside-out. The semantics
are such that the originally defined function is used as an argument for the first decorator, whatever is
returned by the first decorator is used as an argument for the second decorator, . . . , and whatever is
returned by the last decorator is attached under the name of the original function.
The decorator syntax was chosen for its readability. Since the decorator is specified before the header
of the function, it is obvious that it is not a part of the function body and it is clear that it can only
operate on the whole function. Because the expression is prefixed with @ it stands out and is hard to
miss (“in your face”, according to the PEP :) ). When more than one decorator is applied, each one is
placed on a separate line in an easy to read way.

8.2.1 Replacing or tweaking the original object
Decorators can either return the same function or class object or they can return a completely different
object. In the first case, the decorator can exploit the fact that function and class objects are mutable
and add attributes, e.g. add a docstring to a class. A decorator might do something useful even without
modifying the object, for example register the decorated class in a global registry. In the second case,
virtually anything is possible: when something different is substituted for the original function or class,
the new object can be completely different. Nevertheless, such behaviour is not the purpose of decorators:
they are intended to tweak the decorated object, not do something unpredictable. Therefore, when a
function is “decorated” by replacing it with a different function, the new function usually calls the original
function, after doing some preparatory work. Likewise, when a class is “decorated” by replacing it with
a new class, the new class is usually derived from the original class. When the purpose of the decorator
is to do something “every time”, like to log every call to a decorated function, only the second type of
decorators can be used. On the other hand, if the first type is sufficient, it is better to use it, because it
is simpler.

8.2.2 Decorators implemented as classes and as functions
The only requirement on decorators is that they can be called with a single argument. This means that
decorators can be implemented as normal functions, or as classes with a __call__ method, or in theory,
even as lambda functions.
Let’s compare the function and class approaches. The decorator expression (the part after @) can be
either just a name, or a call. The bare-name approach is nice (less to type, looks cleaner, etc.), but is
only possible when no arguments are needed to customise the decorator. Decorators written as functions
can be used in those two cases:

>>> def simple_decorator(function):
...     print("doing decoration")
...     return function
>>> @simple_decorator
... def function():
...     print("inside function")
doing decoration
>>> function()
inside function

>>> def decorator_with_arguments(arg):
...     print("defining the decorator")
...     def _decorator(function):
...         # in this inner function, arg is available too
...         print("doing decoration, %r" % arg)
...         return function
...     return _decorator
>>> @decorator_with_arguments("abc")
... def function():
...     print("inside function")
defining the decorator
doing decoration, 'abc'
>>> function()
inside function

The two trivial decorators above fall into the category of decorators which return the original function.
If they were to return a new function, an extra level of nesting would be required. In the worst case,
three levels of nested functions.

>>> def replacing_decorator_with_args(arg):
...     print("defining the decorator")
...     def _decorator(function):
...         # in this inner function, arg is available too
...         print("doing decoration, %r" % arg)
...         def _wrapper(*args, **kwargs):
...             print("inside wrapper, %r %r" % (args, kwargs))
...             return function(*args, **kwargs)
...         return _wrapper
...     return _decorator
>>> @replacing_decorator_with_args("abc")
... def function(*args, **kwargs):
...     print("inside function, %r %r" % (args, kwargs))
...     return 14
defining the decorator
doing decoration, 'abc'
>>> function(11, 12)
inside wrapper, (11, 12) {}
inside function, (11, 12) {}
14

The _wrapper function is defined to accept all positional and keyword arguments. In general we cannot
know what arguments the decorated function is supposed to accept, so the wrapper function just passes
everything to the wrapped function. One unfortunate consequence is that the apparent argument list is
misleading.
Compared to decorators defined as functions, complex decorators defined as classes are simpler. When
an object is created, the __init__ method is only allowed to return None, and the type of the created
object cannot be changed. This means that when a decorator is defined as a class, it doesn’t make
much sense to use the argument-less form: the final decorated object would just be an instance of the
decorating class, returned by the constructor call, which is not very useful. Therefore it’s enough to
discuss class-based decorators where arguments are given in the decorator expression and the decorator
__init__ method is used for decorator construction.

>>> class decorator_class(object):
...     def __init__(self, arg):
...         # this method is called in the decorator expression
...         print("in decorator init, %s" % arg)
...         self.arg = arg
...     def __call__(self, function):
...         # this method is called to do the job
...         print("in decorator call, %s" % self.arg)
...         return function
>>> deco_instance = decorator_class('foo')
in decorator init, foo
>>> @deco_instance
... def function(*args, **kwargs):
...     print("in function, %s %s" % (args, kwargs))
in decorator call, foo
>>> function()
in function, () {}

Contrary to normal rules (PEP 8), decorators written as classes behave more like functions and therefore
their name often starts with a lowercase letter.
In reality, it doesn’t make much sense to create a new class just to have a decorator which returns the
original function. Objects are supposed to hold state, and such decorators are more useful when the
decorator returns a new object.
A decorator like this can do pretty much anything, since it can modify the original function object and
mangle the arguments, call the original function or not, and afterwards mangle the return value.

8.2.3 Copying the docstring and other attributes of the original function
When a new function is returned by the decorator to replace the original function, an unfortunate
consequence is that the original function name, the original docstring, and the original argument list are
lost. Those attributes of the original function can partially be “transplanted” to the new function
by setting __doc__ (the docstring), __module__ and __name__ (the full name of the function), and
__annotations__ (extra information about arguments and the return value of the function available in
Python 3). This can be done automatically by using functools.update_wrapper.

functools.update_wrapper(wrapper, wrapped)
    “Update a wrapper function to look like the wrapped function.”

>>> import functools
>>> def replacing_decorator_with_args(arg):
...     print("defining the decorator")
...     def _decorator(function):
...         print("doing decoration, %r" % arg)
...         def _wrapper(*args, **kwargs):
...             print("inside wrapper, %r %r" % (args, kwargs))
...             return function(*args, **kwargs)
...         return functools.update_wrapper(_wrapper, function)
...     return _decorator
>>> @replacing_decorator_with_args("abc")
... def function():
...     "extensive documentation"
...     print("inside function")
...     return 14
defining the decorator
doing decoration, 'abc'
>>> function
<function function at 0x...>
>>> print(function.__doc__)
extensive documentation

Class methods are still accessible through the class’ namespace, so they don’t pollute the module’s
namespace. Class methods can be used to provide alternative constructors:
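As a minimal sketch of an alternative constructor built with classmethod: the class Temperature, its subclass and the method from_celsius are hypothetical names chosen for illustration, not taken from these notes.

```python
class Temperature(object):
    """Hypothetical example class: stores a temperature in kelvin."""

    def __init__(self, kelvin):
        self.kelvin = kelvin

    @classmethod
    def from_celsius(cls, celsius):
        # cls is the class the method was called on, so a subclass
        # gets an instance of its own type back
        return cls(celsius + 273.15)


class LabTemperature(Temperature):
    pass


t = Temperature.from_celsius(25.0)
lab = LabTemperature.from_celsius(0.0)
print(type(t).__name__, type(lab).__name__)
```

Because the constructor lives in the class namespace, it is inherited by subclasses and does not add another name to the module.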
The way this works is that the property decorator replaces the getter method with a property
object. This object in turn has three methods, getter, setter, and deleter, which can be used
as decorators. Their job is to set the getter, setter and deleter of the property object (stored as
attributes fget, fset, and fdel). The getter can be set like in the example above, when creating
the object. When defining the setter, we already have the property object under area, and we add
the setter to it by using the setter method. All this happens when we are creating the class.
Afterwards, when an instance of the class has been created, the property object is special. When the
interpreter executes attribute access, assignment, or deletion, the job is delegated to the methods
of the property object.

    >>> @deprecated() # doctest: +SKIP
    ... def f():
    ...     pass
    >>> f() # doctest: +SKIP
    f is deprecated
    """
    def __call__(self, func):
        self.func = func
        self.count = 0
        return self._wrapper
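The class fragment above is cut off before the _wrapper method. A self-contained sketch of such a class-based decorator could look as follows; the body of _wrapper (print a notice on the first call only, then delegate) is an assumption made for illustration.

```python
class deprecated(object):
    """Sketch: print a notice the first time the decorated function is called."""

    def __call__(self, func):
        # called once, when the decorator is applied
        self.func = func
        self.count = 0
        return self._wrapper

    def _wrapper(self, *args, **kwargs):
        # assumed behaviour: count the calls, notify on the first one only
        self.count += 1
        if self.count == 1:
            print("%s is deprecated" % self.func.__name__)
        return self.func(*args, **kwargs)


@deprecated()
def f():
    return 14


first = f()
second = f()
```

The decorated name is now a bound method of the decorator instance, so the call counter stays reachable as f.__self__.count. The price is that f no longer carries the original __name__ and __doc__.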
This is fine, as long as the body of the loop is fairly compact. Once it becomes more complicated, as
often happens in real code, this becomes pretty unreadable. We could simplify this by using yield
statements, but then the user would have to explicitly call list(find_answers()).
We can define a decorator which constructs the list for us:

    def vectorized(generator_func):
        def wrapper(*args, **kwargs):
            return list(generator_func(*args, **kwargs))
        return functools.update_wrapper(wrapper, generator_func)

Our function then becomes:

• Bruce Eckel
– Decorators I: Introduction to Python Decorators
– Python Decorators II: Decorator Arguments
– Python Decorators III: A Decorator-Based Build System

8.3 Context managers
A context manager is an object with __enter__ and __exit__ methods which can be used in the with
statement:
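As a minimal sketch, here is such an object used in a with statement, followed by roughly the same steps written out by hand. The class managed_resource is a hypothetical name, and the hand-written version is simplified: it does not pass exception details to __exit__.

```python
class managed_resource(object):
    """Hypothetical context manager that records what happens to it."""

    def __init__(self):
        self.log = []

    def __enter__(self):
        self.log.append("enter")
        return self                 # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append("exit")
        return False                # do not swallow exceptions


with managed_resource() as res:
    res.log.append("body")

# Roughly what the with statement above does, spelled out by hand
manager = managed_resource()
var = manager.__enter__()
try:
    var.log.append("body")
finally:
    manager.__exit__(None, None, None)

print(res.log)
```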
    finally:
        manager.__exit__()

In other words, the context manager protocol defined in PEP 343 permits the extraction of the boring
part of a try..except..finally structure into a separate class leaving only the interesting do_something
block.
1. The __enter__ method is called first. It can return a value which will be assigned to var. The
as-part is optional: if it isn’t present, the value returned by __enter__ is simply ignored.
2. The block of code underneath with is executed. Just like with try clauses, it can either execute
successfully to the end, or it can break, continue or return, or it can throw an exception. Either
way, after the block is finished, the __exit__ method is called. If an exception was thrown,
the information about the exception is passed to __exit__, which is described below in the next
subsection. In the normal case, exceptions can be ignored, just like in a finally clause, and will
be rethrown after __exit__ is finished.
Let’s say we want to make sure that a file is closed immediately after we are done writing to it:

>>> class closing(object):
...     def __init__(self, obj):
...         self.obj = obj
...     def __enter__(self):
...         return self.obj
...     def __exit__(self, *args):
...         self.obj.close()
>>> with closing(open('/tmp/file', 'w')) as f:
...     f.write('the contents\n')

Here we have made sure that f.close() is called when the with block is exited. Since closing files is
such a common operation, the support for this is already present in the file class. It has an __exit__
method which calls close and can be used as a context manager itself:

>>> with open('/tmp/file', 'a') as f:
...     f.write('more contents\n')

The common use for try..finally is releasing resources. Various different cases are implemented
similarly: in the __enter__ phase the resource is acquired, in the __exit__ phase it is released, and the
exception, if thrown, is propagated. As with files, there’s often a natural operation to perform after the
object has been used and it is most convenient to have the support built in. With each release, Python
provides support in more places:
• all file-like objects:
– file → automatically closed
– fileinput, tempfile (py >= 3.2)
– bz2.BZ2File, gzip.GzipFile, tarfile.TarFile, zipfile.ZipFile
– ftplib, nntplib → close connection (py >= 3.2 or 3.3)
• locks
– multiprocessing.RLock → lock and unlock
– multiprocessing.Semaphore
– memoryview → automatically release (py >= 3.2 and 2.7)
• decimal.localcontext → modify precision of computations temporarily
• _winreg.PyHKEY → open and close hive key
• warnings.catch_warnings → kill warnings temporarily
• contextlib.closing → the same as the example above, call close
• parallel programming
– concurrent.futures.ThreadPoolExecutor → invoke in parallel then kill thread pool (py >= 3.2)
– concurrent.futures.ProcessPoolExecutor → invoke in parallel then kill process pool (py >= 3.2)
– nogil → solve the GIL problem temporarily (cython only :( )

8.3.1 Catching exceptions
When an exception is thrown in the with-block, it is passed as arguments to __exit__. Three arguments
are used, the same as returned by sys.exc_info(): type, value, traceback. When no exception is thrown,
None is used for all three arguments. The context manager can “swallow” the exception by returning a
true value from __exit__. Exceptions can be easily ignored, because if __exit__ doesn’t use return
and just falls off the end, None is returned, a false value, and therefore the exception is rethrown after
__exit__ is finished.
The ability to catch exceptions opens interesting possibilities. A classic example comes from unit tests:
we want to make sure that some code throws the right kind of exception:

    class assert_raises(object):
        # based on pytest and unittest.TestCase
        def __init__(self, type):
            self.type = type
        def __enter__(self):
            pass
        def __exit__(self, type, value, traceback):
            if type is None:
                raise AssertionError('exception expected')
            if issubclass(type, self.type):
                return True # swallow the expected exception
            raise AssertionError('wrong exception type')

    with assert_raises(KeyError):
        {}['foo']

8.3.2 Using generators to define context managers
When discussing generators, it was said that we prefer generators to iterators implemented as classes
because they are shorter, sweeter, and the state is stored as local, not instance, variables. On the other
hand, as described in Bidirectional communication, the flow of data between the generator and its caller
can be bidirectional. This includes exceptions, which can be thrown into the generator. We would like to
implement context managers as special generator functions. In fact, the generator protocol was designed
to support this use case.

    @contextlib.contextmanager
    def some_generator(<arguments>):
        <setup>
        try:
            yield <value>
        finally:
            <cleanup>

The contextlib.contextmanager helper takes a generator and turns it into a context manager. The
generator has to obey some rules which are enforced by the wrapper function; most importantly, it
must yield exactly once. The part before the yield is executed from __enter__, the block of code
protected by the context manager is executed when the generator is suspended in yield, and the rest
is executed in __exit__. If an exception is thrown, the interpreter hands it to the wrapper through
__exit__ arguments, and the wrapper function then throws it at the point of the yield statement.
Through the use of generators, the context manager is shorter and simpler.
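To make the order of events concrete, here is a small sketch of a generator-based context manager that records each phase in a list. The names traced and events are hypothetical; only contextlib.contextmanager is a standard library function.

```python
import contextlib

events = []


@contextlib.contextmanager
def traced(name):
    events.append("enter " + name)          # runs from __enter__
    try:
        yield name.upper()                  # value bound by "as"
    except KeyError:
        # the exception raised in the with-block is re-raised at the yield;
        # catching it here makes the context manager swallow it
        events.append("caught KeyError")
    finally:
        events.append("exit " + name)       # runs from __exit__


with traced("demo") as value:
    events.append("body " + value)
    {}["missing"]                           # raises KeyError inside the block

print(events)
```

The with-block ends normally here because the generator handles the KeyError and then finishes. Any other exception type would propagate to the caller after the finally clause has run.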
Let’s rewrite the closing example as a generator:

    @contextlib.contextmanager
    def closing(obj):
        try:
            yield obj
        finally:
            obj.close()

        else:
            raise AssertionError('exception expected')

9 Advanced NumPy
Prerequisites
• NumPy
• Cython
• Pillow (Python imaging library, used in a couple of examples)
Chapter contents
• Life of ndarray
– It’s. . .
– Block of memory
– Data types
– Indexing scheme: strides
– Findings in dissection
• Universal functions
– What they are?
– Exercise: building an ufunc from scratch
– Solution: building an ufunc from scratch
– Generalized ufuncs
• Interoperability features
– Sharing multidimensional, typed data
– The old buffer protocol
– The old buffer protocol

typedef struct PyArrayObject {
    PyObject_HEAD

    /* Block of memory */
    char *data;
Tip: In this section, numpy will be imported as follows:
>>> import numpy as np

Memory address of the data:
>>> x.__array_interface__['data'][0]
64803824

>>> np.dtype(int).type
<class 'numpy.int64'>
>>> np.dtype(int).itemsize
8
>>> np.dtype(int).byteorder
'='

Example: reading .wav files
The .wav file header:

'uint32'), 24), 'bits_per_sample': (dtype('uint16'), 34), 'chunk_size': (dtype('uint32'), 4),
 'fmt_size': (dtype('uint32'), 16), 'data_size': (dtype('uint32'), 40), 'audio_fmt': (dtype(
>>> wav_header_dtype.fields['format']
(dtype('S4'), 8)
• The first element is the sub-dtype in the structured data, corresponding to the name format
• The second one is its offset (in bytes) from the beginning of the item
• on assignment – ...
Casting
• Casting in arithmetic, in a nutshell:
– only type (not value!) of operands matters
– largest “safe” type able to represent both is picked
– scalars can “lose” to arrays in some situations

0x01 0x02 || 0x03 0x04

Note: little-endian: least significant byte is on the left in memory

2. Create a new view:
>>> y = x.view("<i4")
>>> y
array([67305985], dtype=int32)
>>> 0x04030201
67305985

0x01 0x02 0x03 0x04

Note:
• .view() makes views, does not copy (or alter) the memory block
• only changes the dtype (and adjusts array shape):
>>> x[1] = 5
>>> y
array([328193], dtype=int32)
>>> y.base is x
True

Warning: Another array taking exactly 4 bytes of memory:
>>> y = np.array([[1, 3], [2, 4]], dtype=np.uint8).transpose()
>>> x = y.copy()
>>> x
array([[1, 2],
       [3, 4]], dtype=uint8)
>>> y
array([[1, 2],
       [3, 4]], dtype=uint8)
>>> x.view(np.int16)
array([[ 513],
       [1027]], dtype=int16)
>>> 0x0201, 0x0403
(513, 1027)
>>> y.view(np.int16)
array([[ 769, 1026]], dtype=int16)
• What happened?
• . . . we need to look into what x[0,1] actually means
>>> 0x0301, 0x0402
(769, 1026)

Note: The Python built-in bytes returns bytes in C-order by default which can cause confusion when
trying to inspect memory layout. We use numpy.ndarray.tobytes() with order=A instead, which
preserves the C or F ordering of the bytes in memory.
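The re-interpretation examples above can be collected into one self-contained sketch. The variable names are illustrative, the explicit little-endian type strings ("<i4", "<i2") are used so the values do not depend on the machine, and the try/except reflects an assumption that newer NumPy releases may refuse to re-view an array whose last axis is not contiguous.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=np.uint8)
as_int32 = x.view("<i4")        # same 4 bytes read as one little-endian int32
as_int16 = x.view("<i2")        # same 4 bytes read as two little-endian int16

# A transposed array keeps its memory block; .copy() lays the data out in C order
y = np.array([[1, 3], [2, 4]], dtype=np.uint8).transpose()
c_copy = y.copy()

print(y.tobytes("A"))           # bytes as they sit in memory for y
print(c_copy.tobytes("A"))      # bytes as they sit in memory for the copy

try:
    y_as_int16 = y.view("<i2")  # older NumPy pairs the bytes 01 03 and 02 04
except ValueError:
    y_as_int16 = None           # assumption: newer NumPy rejects this view
```

Inspecting tobytes("A") first is a quick way to predict what a view with a wider dtype will contain.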
• Need to jump 2 bytes to find the next row
• Need to jump 4 bytes to find the next column
• Similarly to higher dimensions:
– C: last dimensions vary fastest (= smaller strides)
– F: first dimensions vary fastest

    $\mathrm{shape} = (d_1, d_2, \ldots, d_n)$
    $\mathrm{strides} = (s_1, s_2, \ldots, s_n)$
    $s_j^C = d_{j+1} d_{j+2} \cdots d_n \times \mathrm{itemsize}$
    $s_j^F = d_1 d_2 \cdots d_{j-1} \times \mathrm{itemsize}$

Note: Now we can understand the behavior of .view():
>>> y = np.array([[1, 3], [2, 4]], dtype=np.uint8).transpose()
>>> x = y.copy()
Transposition does not affect the memory layout of the data, only strides
>>> x.strides
(2, 1)
>>> y.strides
(1, 2)
>>> x.tobytes('A')
b'\x01\x02\x03\x04'
>>> y.tobytes('A')
b'\x01\x03\x02\x04'
• the results are different when interpreted as 2 of int16
• .copy() creates new arrays in the C order (by default)

>>> x = np.zeros((10, 10, 10), dtype=float)
>>> x.strides
(800, 80, 8)
>>> x[::2,::3,::4].strides
(1600, 240, 32)
• Similarly, transposes never make copies (it just swaps strides):
>>> x = np.zeros((10, 10, 10), dtype=float)
>>> x.strides
(800, 80, 8)
>>> x.T.strides
(8, 80, 800)
But: not all reshaping operations can be represented by playing with strides:
>>> a = np.arange(6, dtype=np.int8).reshape(3, 2)
>>> b = a.T
>>> b.strides
(1, 2)
So far, so good. However:
>>> bytes(a.data)
b'\x00\x01\x02\x03\x04\x05'
>>> b
array([[0, 2, 4],
       [1, 3, 5]], dtype=int8)
>>> c = b.reshape(3*2)
>>> c
array([0, 2, 4, 1, 3, 5], dtype=int8)
Here, there is no way to represent the array c given one stride and the block of memory for a. Therefore,
the reshape operation needs to make a copy here.
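The stride formulas and the copy made by reshape can be checked with a short sketch. The helper functions c_strides and f_strides are hypothetical names written for this illustration; np.shares_memory is an existing NumPy function.

```python
import numpy as np


def c_strides(shape, itemsize):
    """C order: the stride of axis j is the product of all later dimensions times itemsize."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def f_strides(shape, itemsize):
    """F order: the stride of axis j is the product of all earlier dimensions times itemsize."""
    strides = []
    step = itemsize
    for dim in shape:
        strides.append(step)
        step *= dim
    return tuple(strides)


x = np.zeros((10, 10, 10), dtype=np.float64)
a = np.arange(6, dtype=np.int8).reshape(3, 2)
b = a.T                 # same memory block, swapped strides
c = b.reshape(3 * 2)    # no single stride walks a's block in this order

print(c_strides(x.shape, x.itemsize), f_strides(x.shape, x.itemsize))
print(np.shares_memory(a, b), np.shares_memory(a, c))
```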
Example: fake dimensions with strides
Stride manipulation
>>> from numpy.lib.stride_tricks import as_strided
>>> help(as_strided)
as_strided(x, shape=None, strides=None)
   Make an ndarray from the given array with the given shape and strides

Note: In-place operations with views
Prior to NumPy version 1.13, in-place operations with views could result in incorrect results for large
arrays. Since version 1.13, NumPy includes checks for memory overlap to guarantee that results are
consistent with the non in-place version (e.g. a = a + a.T produces the same result as a += a.T).
Note however that this may result in the data being copied (as if using a += a.T.copy()), ultimately
resulting in more memory being used than might otherwise be expected for in-place operations!

>>> x = np.arange(5*5*5*5).reshape(5, 5, 5, 5)
>>> s = 0
>>> for i in range(5):
...     for j in range(5):
...         s += x[j, i, j, i]
Solution

In [2]: y = np.zeros((20000*67,))[::67]
In [3]: x.shape, y.shape
((20000,), (20000,))
In [4]: %timeit x.sum()
100000 loops, best of 3: 0.180 ms per loop
In [5]: %timeit y.sum()
100000 loops, best of 3: 2.34 ms per loop

9.1.5 Findings in dissection

9.2 Universal functions
9.2.1 What they are?
• Ufunc performs an elementwise operation on all elements of an array.
Examples:
np.add, np.subtract, scipy.special.*, ...
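As a small sketch of what "elementwise" means in practice, the snippet below inspects the existing ufunc np.add and wraps a scalar Python function with np.frompyfunc. The function mandel_step is a hypothetical name for one step of z*z + c; ufuncs built this way return object arrays and run at Python speed, so they illustrate the interface rather than the performance.

```python
import numpy as np

# np.add is a ufunc: two inputs, one output, applied element by element
total = np.add(np.arange(3), 10)


def mandel_step(z, c):
    # one step of the iteration z <- z*z + c, written for scalars
    return z * z + c


# np.frompyfunc(func, nin, nout) wraps a scalar function as a ufunc
mandel_step_ufunc = np.frompyfunc(mandel_step, 2, 1)
stepped = mandel_step_ufunc(np.array([0.0, 1.0, 2.0]), 1.0).astype(float)

print(total, stepped)
```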
Parts of an Ufunc
1. Provided by user

    char types[3]

    types[0] = NPY_BYTE   /* type of first input arg */
    types[1] = NPY_BYTE   /* type of second input arg */
    types[2] = NPY_BYTE   /* type of the output arg */

    PyObject *python_ufunc = PyUFunc_FromFuncAndData(
        ufunc_loop,
        NULL,
        types,
        1, /* ntypes */
        2, /* num_inputs */
        1, /* num_outputs */
        identity_element,
        name,
        docstring,
        unused)

• A ufunc can also support multiple different input-output type combinations.

Making it easier
3. ufunc_loop is of very generic form, and NumPy provides pre-made ones

    PyUFunc_f_f     float elementwise_func(float input_1)
    PyUFunc_ff_f    float elementwise_func(float input_1, float input_2)
    PyUFunc_d_d     double elementwise_func(double input_1)
    PyUFunc_dd_d    double elementwise_func(double input_1, double input_2)
    PyUFunc_D_D     elementwise_func(npy_cdouble *input, npy_cdouble* output)
    PyUFunc_DD_D    elementwise_func(npy_cdouble *in1, npy_cdouble *in2, npy_cdouble* out)

• Only elementwise_func needs to be supplied
• . . . except when your elementwise function is not in one of the above forms

9.2.2 Exercise: building an ufunc from scratch
The Mandelbrot fractal is defined by the iteration

    $z \leftarrow z^2 + c$

where $c = x + iy$ is a complex number. This iteration is repeated: if $z$ stays finite no matter how long
the iteration runs, $c$ belongs to the Mandelbrot set.
• Make ufunc called mandel(z0, c) that computes:

    #
    # Fix the parts marked by TODO
    #

    #
    # Compile this file by (Cython >= 0.12 required because of the complex vars)
    #
    #    cython mandel.pyx
    #    python setup.py build_ext -i
    #
    # and try it out with, in this directory,
    #
    #    >>> import mandel
    #    >>> mandel.mandel(0, 1 + 2j)
    #
    #

    # The elementwise function
    # ------------------------

    cdef void mandel_single_point(double complex *z_in,
                                  double complex *c_in,
                                  double complex *z_out) nogil:
        #
        # The Mandelbrot iteration
        #

        #
        # Some points of note:
        #
        # - It's *NOT* allowed to call any Python functions here.
        #
        #   The Ufunc loop runs with the Python Global Interpreter Lock released.
        #   Hence, the ``nogil``.
        #
        # - And so all local variables must be declared with ``cdef``
        #
        # - Note also that this function receives *pointers* to the data
        #

        cdef double complex z = z_in[0]
        cdef double complex c = c_in[0]
        cdef int k  # the integer we use in the for loop

        #
        # TODO: write the Mandelbrot iteration for one point here,
        #       as you would write it in Python.
        #

    from numpy cimport (
        PyUFunc_f_f_As_d_d,
        PyUFunc_d_d,
        PyUFunc_f_f,
        PyUFunc_g_g,
        PyUFunc_F_F_As_D_D,
        PyUFunc_F_F,
        PyUFunc_D_D,
        PyUFunc_G_G,
        PyUFunc_ff_f_As_dd_d,
        PyUFunc_ff_f,
        PyUFunc_dd_d,
        PyUFunc_gg_g,
        PyUFunc_FF_F_As_DD_D,
        PyUFunc_DD_D,
        PyUFunc_FF_F,
        PyUFunc_GG_G)

    # This thing is passed as the ``data`` parameter for the generic
    # PyUFunc_* loop, to let it know which function it should call.
    elementwise_funcs[0] = <void*>mandel_single_point

    # Construct the ufunc:

    mandel = PyUFunc_FromFuncAndData(
        loop_func,
        elementwise_funcs,
        input_output_types,
        1, # number of supported input types
        TODO, # number of input args
        TODO, # number of output args
        0, # `identity` element, never mind this
        "mandel", # function name
        "mandel(z, c) -> computes z*z + c", # docstring
        0 # unused
        )
9.2.3 Solution: building an ufunc from scratch

    # The elementwise function
    # ------------------------

    cdef void mandel_single_point(double complex *z_in,
                                  double complex *c_in,
                                  double complex *z_out) nogil:
        #
        # The Mandelbrot iteration
        #

        #
        # Some points of note:
        #
        # - It's *NOT* allowed to call any Python functions here.
        #
        #   The Ufunc loop runs with the Python Global Interpreter Lock released.
        #   Hence, the ``nogil``.
        #
        # - And so all local variables must be declared with ``cdef``
        #
        # - Note also that this function receives *pointers* to the data;
        #   the "traditional" solution to passing complex variables around
        #

        cdef double complex z = z_in[0]
        cdef double complex c = c_in[0]
        cdef int k  # the integer we use in the for loop

        # Straightforward iteration

        for k in range(100):
            z = z*z + c
            if z.real**2 + z.imag**2 > 1000:
                break

        # Return the answer for this point
        z_out[0] = z


    # Boilerplate Cython definitions
    #
    # Pulls definitions from the Numpy C headers.
    # -------------------------------------------

    from numpy cimport import_array, import_ufunc
    from numpy cimport (PyUFunc_FromFuncAndData,
                        PyUFuncGenericFunction)
    from numpy cimport NPY_CDOUBLE
    from numpy cimport PyUFunc_DD_D

    import_array()
    import_ufunc()

    cdef PyUFuncGenericFunction loop_func[1]
    cdef char input_output_types[3]
    cdef void *elementwise_funcs[1]

    loop_func[0] = PyUFunc_DD_D

    input_output_types[0] = NPY_CDOUBLE
    input_output_types[1] = NPY_CDOUBLE
    input_output_types[2] = NPY_CDOUBLE

    elementwise_funcs[0] = <void*>mandel_single_point

    mandel = PyUFunc_FromFuncAndData(
        loop_func,
        elementwise_funcs,
        input_output_types,
        1, # number of supported input types
        2, # number of input args
        1, # number of output args
        0, # `identity` element, never mind this
        "mandel", # function name
        "mandel(z, c) -> computes iterated z*z + c", # docstring
        0 # unused
        )

    """
    Plot Mandelbrot
    ================

    Plot the Mandelbrot set.

    """

    import numpy as np
    import mandel
    x = np.linspace(-1.7, 0.6, 1000)
    y = np.linspace(-1.4, 1.4, 1000)
    c = x[None,:] + 1j*y[:,None]
    z = mandel.mandel(c, c)

    import matplotlib.pyplot as plt
    plt.imshow(abs(z)**2 < 1000, extent=[-1.7, 0.6, -1.4, 1.4])
    plt.gray()
    plt.show()
Several accepted input types
E.g. supporting both single- and double-precision versions

    cdef void mandel_single_point(double complex *z_in,
                                  double complex *c_in,
                                  double complex *z_out) nogil:
        ...

    cdef void mandel_single_point_singleprec(float complex *z_in,
                                             float complex *c_in,
                                             float complex *z_out) nogil:
        ...

output and input can be arrays with a fixed number of dimensions
For example, matrix trace (sum of diag elements):

Matrix product:

    input_1 shape = (m, n)
    input_2 shape = (n, p)
    output shape = (m, p)

    (m, n), (n, p) -> (m, p)

• This is called the “signature” of the generalized ufunc
• The dimensions on which the g-ufunc acts, are “core dimensions”

Status in NumPy

    int input_1_stride_m = steps[3];  /* strides for the core dimensions */
    int input_1_stride_n = steps[4];  /* are added after the non-core */
    int input_2_strides_n = steps[5]; /* steps */
    int input_2_strides_p = steps[6];
    int output_strides_n = steps[7];
    int output_strides_p = steps[8];

    int m = dimensions[1]; /* core dimensions are added after */
    int n = dimensions[2]; /* the main dimension; order as in */
    int p = dimensions[3]; /* signature */

    int i;

    for (i = 0; i < dimensions[0]; ++i) {
        matmul_for_strided_matrices(input_1, input_2, output,
                                    strides for each array...);

        input_1 += steps[0];
        input_2 += steps[1];
        output += steps[2];
    }
}

9.3 Interoperability features
9.3.1 Sharing multidimensional, typed data
Suppose you
1. Write a library that handles (multidimensional) binary data,
2. Want to make it easy to manipulate the data with NumPy, or whatever other library,
3. . . . but would not like to have NumPy as a dependency.
Currently, 3 solutions:

Q: Check what happens if data is now modified, and img saved again.

9.3.3 The old buffer protocol

    """
    From buffer
    ============

    Show how to exchange data between numpy and a library that only knows
    the buffer interface.
    """

    import numpy as np
    from PIL import Image

    # Let's make a sample image, RGBA format

    x = np.zeros((200, 200, 4), dtype=np.uint8)

    x[:,:,0] = 254 # red
    x[:,:,3] = 255 # opaque

    data = x.view(np.int32) # Check that you understand why this is OK!

    img = Image.frombuffer("RGBA", (200, 200), data)
    img.save('test.png')

    #
    # Modify the original data, and save again.
    #
    # It turns out that PIL, which knows next to nothing about Numpy,
    # happily shares the same data.
    #
Note: .view() has a second meaning: it can make an ndarray an instance of a specialized ndarray
subclass
::
Warning: Not all NumPy functions respect masks, for instance np.dot, so check the return types.
>>> from PIL import Image
>>> img = Image.open('data/test.png')
>>> img.__array_interface__ The masked_array returns a view to the original array:
{'data': ...,
>>> mx[1] = 9
'shape': (200, 200, 4),
>>> x
'typestr': '|u1'}
array([ 1, 9, 3, -99, 5])
>>> x = np.asarray(img)
>>> x.shape
(200, 200, 4)
The mask
You can modify the mask by assigning:
Note: A more C-friendly variant of the array interface is also defined.
>>> mx[1] = np.ma.masked
>>> mx
masked_array(data=[1, --, 3, --, 5],
mask=[False, True, False, True, False],
fill_value=999999)
9.3. Interoperability features 306 9.4. Array siblings: chararray, maskedarray, matrix 307
Scipy lecture notes, Edition 2022.1 Scipy lecture notes, Edition 2022.1
The masked entries can be filled with a given value to get a usual array back:
>>> x2 = mx.filled(-1)
>>> x2
array([ 1, 9, 3, -1, 5])
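The masking and filling steps above can be put together in one small, self-contained sketch. The array values are made up for illustration (-99 plays the role of an invalid measurement); only `np.ma.masked_array`, `mean` and `filled` are used.

```python
import numpy as np

# Hypothetical data: -99 marks an invalid measurement
x = np.array([1, 2, 3, -99, 5])

# Mask the invalid entry; statistics then ignore it
mx = np.ma.masked_array(x, mask=[False, False, False, True, False])
mean_valid = mx.mean()          # mean of 1, 2, 3 and 5

# Replace masked entries by a fill value to get a plain ndarray back
x2 = mx.filled(-1)

print(mean_valid)   # 2.75
print(x2)           # [ 1  2  3 -1  5]
```

The filled array is an ordinary ndarray again, so later code no longer knows which entries were invalid; keep the masked version around if that information matters.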
Canadian rangers were distracted when counting hares and lynxes in 1903-1910 and 1917-1918, and
got the numbers wrong. (Carrot farmers stayed alert, though.) Compute the mean populations
over time, ignoring the invalid numbers.
>>> data = np.loadtxt('data/populations.txt')
>>> populations = np.ma.masked_array(data[:,1:])
>>> year = data[:, 0]
>>> bad_years = (((year >= 1903) & (year <= 1910))
...              | ((year >= 1917) & (year <= 1918)))
>>> # '&' means 'and' and '|' means 'or'
>>> populations[bad_years, 0] = np.ma.masked
>>> populations[bad_years, 1] = np.ma.masked
>>> populations.mean(axis=0)
masked_array(data=[40472.72727272727, 18627.272727272728, 42400.0],
             mask=[False, False, False],
       fill_value=1e+20)
>>> populations.std(axis=0)
masked_array(data=[21087.656489006717, 15625.799814240254, 3322.5062255844787],
             mask=[False, False, False],
       fill_value=1e+20)

9.5 Summary
• Anatomy of the ndarray: data, dtype, strides.
• Universal functions: elementwise operations, how to make new ones
• Ndarray subclasses
• Various buffer interfaces for integration with other tools
• Recent additions: PEP 3118, generalized ufuncs

9.6 Contributing to NumPy/Scipy
Get this tutorial: http://www.euroscipy.org/talk/882
9.6.1 Why
• “There’s a bug?”
• “I’d like to help! What can I do?”

9.6.2 Reporting bugs
• Bug tracker (prefer this)
– https://github.com/numpy/numpy/issues
– https://github.com/scipy/scipy/issues
– Click the “Register” link to get an account
• Mailing lists ( scipy.org/Mailing_Lists )
– If you’re unsure
– No replies in a week or so? Just file a bug ticket.
Good bug report
Title: numpy.random.permutation fails for non-integer arguments

I'm trying to generate random permutations, using numpy.random.permutation

When calling numpy.random.permutation with non-integer arguments
it fails with a cryptic error message::

>>> np.random.permutation(12)
array([11, 5, 8, 4, 6, 1, 9, 3, 7, 2, 10, 0])
>>> np.random.permutation(12.) #doctest: +SKIP
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 3311, in mtrand.RandomState.permutation
File "mtrand.pyx", line 3254, in mtrand.RandomState.shuffle
TypeError: len() of unsized object

This also happens with long arguments, and so
np.random.permutation(X.shape[0]) where X is an array fails on 64
bit windows (where shape is a tuple of longs).

It would be great if it could cast to integer or at least raise a
proper error for non-integer types.

I'm using NumPy 1.4.1, built from the official tarball, on Windows
64 with Visual studio 2008, on Python.org 64-bit Python.

0. What are you trying to do?
1. Small code snippet reproducing the bug (if possible)
• What actually happens
• What you’d expect
2. Platform (Windows / Linux / OSX, 32/64 bits, x86/PPC, . . . )
3. Version of NumPy/Scipy
>>> print(np.__version__)
1...
Check that the following is what you expect
In case you have old/broken NumPy installations lying around.
If unsure, try to remove existing NumPy installations, and reinstall. . .

9.6.3 Contributing to documentation
1. Documentation editor
• http://docs.scipy.org/doc/numpy
• Registration
– Register an account
– Subscribe to scipy-dev mailing list (subscribers-only)
– Problem with mailing lists: you get mail
∗ But: you can turn mail delivery off
∗ “change your subscription options”, at the bottom of
http://mail.python.org/mailman/listinfo/scipy-dev
– Send a mail @ scipy-dev mailing list; ask for activation:
To: [email protected]

Hi,

I'd like to edit NumPy/Scipy docstrings. My account is XXXXX

Cheers,
N. N.
• Check the style guide:
– http://docs.scipy.org/doc/numpy/
– Don’t be intimidated; to fix a small thing, just fix it
• Edit
2. Edit sources and send patches (as for bugs)
3. Complain on the mailing list

9.6.4 Contributing features
The contribution of features is documented on https://docs.scipy.org/doc/numpy/dev/

9.6.5 How to help, in general
• Bug fixes always welcome!
– What irks you most
– Browse the tracker
• Documentation work
– API docs: improvements to docstrings
∗ Know some Scipy module well?
– User guide
– numpy-discussion list
– scipy-dev list

CHAPTER 10
Debugging code
Prerequisites
• Numpy
• IPython
• nosetests
• pyflakes
• gdb for the C-debugging part.
Chapter contents
• Avoiding bugs
– Coding best practices to avoid getting in trouble
– pyflakes: fast static analysis
• Debugging workflow
• Using the Python debugger
– Invoking the debugger
– Debugger commands and interaction
• Debugging segmentation faults using gdb

10.1 Avoiding bugs
10.1.1 Coding best practices to avoid getting in trouble
Brian Kernighan
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re
as clever as you can be when you write it, how will you ever debug it?”
• We all write buggy code. Accept it. Deal with it.
• Write your code with testing and debugging in mind.
• Keep It Simple, Stupid (KISS).
– What is the simplest thing that could possibly work?
• Don’t Repeat Yourself (DRY).
– Every piece of knowledge must have a single, unambiguous, authoritative representation within
a system.
– Constants, algorithms, etc. . .
• Try to limit interdependencies of your code. (Loose Coupling)
• Give your variables, functions and modules meaningful names (not mathematics names)

10.1.2 pyflakes: fast static analysis
There are several static analysis tools in Python; to name a few:
• pylint
• pychecker
• pyflakes
• flake8
Here we focus on pyflakes, which is the simplest tool.
• Fast, simple
• Detects syntax errors, missing imports, typos on names.
Another good recommendation is the flake8 tool, which is a combination of pyflakes and pep8. Thus, in
addition to the types of errors that pyflakes catches, flake8 detects violations of the recommendations in
the PEP8 style guide.
Integrating pyflakes (or flake8) in your editor or IDE is highly recommended; it does yield productivity
gains.
Running pyflakes on the current edited file
You can bind a key to run pyflakes in the current buffer.
• In kate Menu: 'settings -> configure kate'
– In plugins enable 'external tools'
– In 'external Tools', add pyflakes:
kdialog --title "pyflakes %filename" --msgbox "$(pyflakes %filename)"
• In TextMate
Menu: TextMate -> Preferences -> Advanced -> Shell variables, add a shell variable:
TM_PYCHECKER = /Library/Frameworks/Python.framework/Versions/Current/bin/pyflakes
Then Ctrl-Shift-V is bound to a pyflakes report
• In vim In your .vimrc (binds F5 to pyflakes):
autocmd FileType python let &mp = 'echo "*** running % ***" ; pyflakes %'
autocmd FileType tex,mp,rst,python imap <Esc>[15~ <C-O>:make!^M
autocmd FileType tex,mp,rst,python map <Esc>[15~ :make!^M
autocmd FileType tex,mp,rst,python set autowrite
• In emacs In your .emacs (binds F5 to pyflakes):
(defun pyflakes-thisfile () (interactive)
  (compile (format "pyflakes %s " (buffer-file-name)))
)

(define-minor-mode pyflakes-mode
  "Toggle pyflakes mode.
  With no argument, this command toggles the mode.
  Non-null prefix argument turns on the mode.
  Null prefix argument turns off the mode."
  ;; The initial value.
  nil
  ;; The indicator for the mode line.
  " Pyflakes"
  ;; The minor mode bindings.
  '( ([f5] . pyflakes-thisfile) )
)

(add-hook 'python-mode-hook (lambda () (pyflakes-mode t)))
A type-as-go spell-checker like integration
• In vim
– Use the pyflakes.vim plugin:
1. download the zip file from http://www.vim.org/scripts/script.php?script_id=2441
2. extract the files in ~/.vim/ftplugin/python
3. make sure your vimrc has filetype plugin indent on
– Alternatively: use the syntastic plugin. This can be configured to use flake8 too and also
handles on-the-fly checking for many other languages.
Yes, print statements do work as a debugging tool. However, to inspect the runtime state, it is often more
efficient to use the debugger.

10.3 Using the Python debugger
The python debugger, pdb: https://docs.python.org/library/pdb.html, allows you to inspect your code
interactively.
Specifically it allows you to:
• View the source code.
• Walk up and down the call stack.

ipdb> list
      1 """Small snippet to raise an IndexError."""
      2
      3 def index_error():
      4     lst = list('foobar')
----> 5     print lst[len(lst)]
      6
      7 if __name__ == '__main__':
      8     index_error()
      9
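As a rough, non-interactive illustration of what "walking the call stack" means, the sketch below reproduces the index_error snippet (written for Python 3) and inspects the traceback with the standard traceback module. This is not how pdb is normally used; it only shows the information a post-mortem debugger works from. In an interactive session one would call pdb.post_mortem(exc.__traceback__) instead.

```python
import traceback


def index_error():
    lst = list('foobar')
    print(lst[len(lst)])   # off by one: the last valid index is len(lst) - 1


try:
    index_error()
except IndexError as exc:
    # exc.__traceback__ is what pdb.post_mortem() would receive
    frames = traceback.extract_tb(exc.__traceback__)
    for frame in frames:                 # outermost call first
        print(frame.name, frame.lineno)
    failing = frames[-1]                 # innermost frame: where it failed
```

The last entry of the extracted stack is the frame in which the exception was raised, which is where a post-mortem debugger session starts.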
In [3]:
• Continue execution to next breakpoint with c(ont(inue)):
We can turn these warnings into exceptions, which enables us to do post-mortem debugging on them,
and find our problem more quickly:
In [3]: np.seterr(all='raise')
Out[3]: {'divide': 'print', 'invalid': 'print', 'over': 'print', 'under': 'ignore'}
In [4]: %run wiener_filtering.py
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
/home/esc/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
176 else:
177 filename = fname
--> 178 __builtin__.execfile(filename, *where)

/home/esc/physique-cuso-python-2013/scipy-lecture-notes/advanced/debugging/wiener_filtering.py in <module>()
55 pl.matshow(noisy_face[cut], cmap=pl.cm.gray)
56
---> 57 denoised_face = iterated_wiener(noisy_face)
58 pl.matshow(denoised_face[cut], cmap=pl.cm.gray)
59

/home/esc/physique-cuso-python-2013/scipy-lecture-notes/advanced/debugging/wiener_filtering.py in iterated_wiener(noisy_img, size)

• The Visual Studio Code integrated development environment includes a debugging mode.
• The Mu editor is a simple Python editor that includes a debugging mode.

10.3.2 Debugger commands and interaction
l(list)     Lists the code at the current position
u(p)        Walk up the call stack
d(own)      Walk down the call stack
n(ext)      Execute the next line (does not go down in new functions)
s(tep)      Execute the next statement (goes down in new functions)
bt          Print the call stack
a           Print the local variables
!command    Execute the given Python command (as opposed to pdb commands)

Warning: Debugger commands are not Python code
You cannot name the variables the way you want. For instance, you cannot override the variables
in the current frame with the same name: use different names than your local variables when
typing code in the debugger.
Program received signal SIGSEGV, Segmentation fault.
_strided_byte_copy (dst=0x8537478 "\360\343G", outstrides=4, src=
0x86c0690 <Address 0x86c0690 out of bounds>, instrides=32, N=3,
elsize=4)
at numpy/core/src/multiarray/ctors.c:365
365 _FAST_MOVE(Int32);
(gdb)
We get a segfault, and gdb captures it for post-mortem debugging in the C level stack (not the Python
call stack). We can debug the C call stack using gdb’s commands:
(gdb) up
#1 0x004af4f5 in _copy_from_same_shape (dest=<value optimized out>,
src=<value optimized out>, myfunc=0x496780 <_strided_byte_copy>,
swap=0)
at numpy/core/src/multiarray/ctors.c:748
748 myfunc(dit->dataptr, dest->strides[maxaxis],
As you can see, right now, we are in the C code of numpy. We would like to know what is the Python
code that triggers this segfault, so we go up the stack until we hit the Python execution loop:
(gdb) up
#8 0x080ddd23 in call_function (f=
Frame 0x85371ec, for file /home/varoquau/usr/lib/python2.6/site-packages/numpy/core/arrayprint.py, line 156, in _leading_trailing (a=<numpy.ndarray at remote 0x85371b0>, _nc=<module at remote 0xb7f93a64>), throwflag=0)
at ../Python/ceval.c:3750
3750 ../Python/ceval.c: No such file or directory.
in ../Python/ceval.c
(gdb) up
#9 PyEval_EvalFrameEx (f=
Frame 0x85371ec, for file /home/varoquau/usr/lib/python2.6/site-packages/numpy/core/arrayprint.py, line 156, in _leading_trailing (a=<numpy.ndarray at remote 0x85371b0>, _nc=
at ../Python/ceval.c:2412
2412 in ../Python/ceval.c
(gdb)
Once we are in the Python execution loop, we can use our special Python helper function. For instance
we can find the corresponding Python code:
(gdb) pyframe
/home/varoquau/usr/lib/python2.6/site-packages/numpy/core/arrayprint.py (158): _leading_trailing
(gdb)
This is numpy code, we need to go up until we find code that we have written:
(gdb) up
...
(gdb) up
#34 0x080dc97a in PyEval_EvalFrameEx (f=
Frame 0x82f064c, for file segfault.py, line 11, in print_big_array (small_array=<numpy.ndarray at remote 0x853ecf0>, big_array=<numpy.ndarray at remote 0x853ed20>), throwflag=0) at ../Python/ceval.c:1630

The corresponding code is:
def make_big_array(small_array):
    big_array = stride_tricks.as_strided(small_array,
                                         shape=(2e6, 2e6), strides=(32, 32))
    return big_array

def print_big_array(small_array):
    big_array = make_big_array(small_array)
Thus the segfault happens when printing big_array[-10:]. The reason is simply that big_array has
been allocated with its end outside the program memory.

Note: For a list of Python-specific commands defined in the gdbinit, read the source of this file.

Wrap up exercise
The following script is well documented and hopefully legible. It seeks to answer a problem of actual
interest for numerical computing, but it does not work. . . Can you debug it?
Python source code: to_debug.py
CHAPTER 11
Optimizing code
“Premature optimization is the root of all evil”
Author: Gaël Varoquaux
This chapter deals with strategies to make Python code go faster.
Prerequisites
• line_profiler
Chapters contents
• Optimization workflow
• Profiling Python code
– Timeit
– Profiler
– Line-profiler
• Making code go faster
– Algorithmic optimization
∗ Example of the SVD

1. Make it work: write the code in a simple, legible way.
2. Make it work reliably: write automated test cases, make really sure that your algorithm is right
and that if you break it, the tests will capture the breakage.
3. Optimize the code by profiling simple use-cases to find the bottlenecks and speeding up these
bottlenecks, finding a better algorithm or implementation. Keep in mind that a trade-off should
be found between profiling on a realistic example and the simplicity and speed of execution of the
code. For efficient work, it is best to work with profiling runs lasting around 10s.
No optimization without measuring!

11.2.1 Timeit
In IPython, use timeit (https://docs.python.org/library/timeit.html) to time elementary operations:
In [2]: a = np.arange(1000)
In [3]: %timeit a ** 2
100000 loops, best of 3: 5.73 us per loop
In [4]: %timeit a ** 2.1
1000 loops, best of 3: 154 us per loop
In [5]: %timeit a * a
100000 loops, best of 3: 5.56 us per loop
Use this to guide your choice between strategies.
Note: For long running calls, use %time instead of %timeit; it is less precise but faster

11.2.2 Profiler
Useful when you have a large program to profile, for example the following file:
# For this example to run, you also need the 'ica.py' file
import numpy as np
from scipy import linalg
from ica import fastica
def test():
    data = np.random.random((5000, 100))
    u, s, v = linalg.svd(data)
    pca = np.dot(u[:, :10].T, data)
    results = fastica(pca.T, whiten=False)

if __name__ == '__main__':
    test()

Note: This is a combination of two unsupervised learning techniques, principal component analysis
(PCA) and independent component analysis (ICA). PCA is a technique for dimensionality reduction,
i.e. an algorithm to explain the observed variance in your data using fewer dimensions. ICA is a source
separation technique, for example to unmix multiple signals that have been recorded through multiple
sensors. Doing a PCA first and then an ICA can be useful if you have more sensors than signals. For
more information see: the FastICA example from scikit-learn.
To run it, you also need to download the ica module. In IPython we can time the script:

Clearly the svd (in decomp.py) is what takes most of our time, a.k.a. the bottleneck. We have to find a
way to make this step go faster, or to avoid this step (algorithmic optimization). Spending time on the
rest of the code is useless.

Profiling outside of IPython, running ``cProfile``
Similar profiling can be done outside of IPython, simply calling the built-in Python profilers cProfile
and profile.
$ python -m cProfile -o demo.prof demo.py
Using the -o switch will output the profiler results to the file demo.prof to view with an external tool.
This can be useful if you wish to process the profiler output with a visualization tool.

11.2.3 Line-profiler
The profiler tells us which function takes most of the time, but not where it is called.
For this, we use the line_profiler: in the source file, we decorate a few functions that we want to inspect
with @profile (no need to import it)
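Coming back to the function-level profiler: cProfile can also be driven from within Python and its results inspected with the standard pstats module, instead of writing them to a file with -o. The sketch below is only an illustration under stated assumptions: slow_sum and test are made-up stand-ins for a real workload (they are not the PCA/ICA script), so that the example runs without the ica module.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop, standing in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def test():
    return [slow_sum(20000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
test()
profiler.disable()

# Sort by cumulative time and keep the 5 most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the callers that dominate the run at the top, which is usually the quickest way to spot the bottleneck.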
For a high-level view of the problem, a good understanding of the maths behind the algorithm helps.
However, it is not uncommon to find simple changes, like moving computation or memory allocation
outside a for loop, that bring in big gains.
We can then use this insight to optimize the previous code:
def test():
    data = np.random.random((5000, 100))
    u, s, v = linalg.svd(data, full_matrices=False)
    pca = np.dot(u[:, :10].T, data)
    results = fastica(pca.T, whiten=False)

Computational linear algebra
For certain algorithms, many of the bottlenecks will be linear algebra computations. In this case,
using the right function to solve the right problem is key. For instance, an eigenvalue problem with
a symmetric matrix is easier to solve than with a general matrix. Also, most often, you can avoid
inverting a matrix and use a less costly (and more numerically stable) operation.
Know your computational linear algebra. When in doubt, explore scipy.linalg, and use %timeit to
try out different alternatives on your data.

note: we need global a in the timeit so that it works, as it is assigning to a, and thus considers it
as a local variable.
• Be easy on the memory: use views, and not copies
Copying big arrays is as costly as making simple numerical operations on them:
In [1]: a = np.zeros(int(1e7))
In [4]: c.strides
Out[4]: (80000, 8)
This is the reason why Fortran ordering or C ordering may make a big difference on operations:
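A small sketch of the memory-layout effect, assuming a 1000 x 1000 float64 array (the size is arbitrary). It prints the strides of a C-ordered and a Fortran-ordered array, then times access along a row and along a column with the standard timeit module. The timings depend on the machine, so no values are quoted here.

```python
import timeit

import numpy as np

a = np.zeros((1000, 1000))          # C order: rows are contiguous
f = np.asfortranarray(a)            # Fortran order: columns are contiguous

print(a.strides)   # (8000, 8): moving along a row steps 8 bytes
print(f.strides)   # (8, 8000): moving along a column steps 8 bytes

# Time reading one row versus one column of the C-ordered array.
t_row = timeit.timeit(lambda: a[0, :].sum(), number=1000)
t_col = timeit.timeit(lambda: a[:, 0].sum(), number=1000)
print(f"row: {t_row:.4f} s, column: {t_col:.4f} s")
```

Elements that are far apart in memory (large stride) are typically slower to traverse, which is why the same operation can cost more along one axis than along the other.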
1 loops, best of 3: 194 ms per loop
In [8]: c = np.ascontiguousarray(a.T)
Note that copying the data to work around this effect may not be worth it:
In [10]: %timeit c = np.ascontiguousarray(a.T)
10 loops, best of 3: 106 ms per loop
Using numexpr can be useful to automatically optimize code for such effects.
• Use compiled code
The last resort, once you are sure that all the high-level optimizations have been explored, is to
transfer the hot spots, i.e. the few lines or functions in which most of the time is spent, to compiled
code. For compiled code, the preferred option is to use Cython: it is easy to transform existing
Python code into compiled code, and with a good use of the numpy support yields efficient code on
numpy arrays, for instance by unrolling loops.
Warning: For all the above: profile and time your choices. Don’t base your optimization on
theoretical considerations.

CHAPTER 12
Sparse Matrices in SciPy
– each row is a Python list (sorted) of column indices of non-zero elements
– rows stored in a NumPy array (dtype=object)
– non-zero values data stored analogously
• efficient for constructing sparse matrices incrementally
• constructor accepts:
– dense matrix (array)
– sparse matrix
– shape tuple (create empty matrix)
• flexible slicing, changing sparsity structure is efficient
• slow arithmetics, slow column slicing due to being row-based
• use:
– when sparsity pattern is not known a priori or changes
– example: reading a sparse matrix from a text file

Examples
• create an empty LIL matrix:
>>> mtx = sparse.lil_matrix((4, 5))
• prepare random data:
>>> from numpy.random import rand
>>> data = np.round(rand(2, 3))
>>> data
array([[1., 1., 1.],
       [1., 0., 1.]])

>>> mtx = sparse.lil_matrix([[0, 1, 2, 0], [3, 0, 1, 0], [1, 0, 0, 1]])
>>> mtx.todense()
matrix([[0, 1, 2, 0],
        [3, 0, 1, 0],
        [1, 0, 0, 1]]...)
>>> print(mtx)
(0, 1) 1
(0, 2) 2
(1, 0) 3
(1, 2) 1
(2, 0) 1
(2, 3) 1
>>> mtx[:2, :]
<2x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in LInked List format>
>>> mtx[:2, :].todense()
matrix([[0, 1, 2, 0],
        [3, 0, 1, 0]]...)
>>> mtx[1:2, [0,2]].todense()
matrix([[3, 1]]...)
>>> mtx.todense()
matrix([[0, 1, 2, 0],
        [3, 0, 1, 0],
        [1, 0, 0, 1]]...)

Dictionary of Keys Format (DOK)
• subclass of Python dict
– keys are (row, column) index tuples (no duplicate entries allowed)
– values are corresponding non-zero values
• efficient for constructing sparse matrices incrementally
• constructor accepts:
• flexible slicing, changing sparsity structure is efficient
• can be efficiently converted to a coo_matrix once constructed
• slow arithmetics (for loops with dict.items())
• use:
– when sparsity pattern is not known a priori or changes

Examples
• create a DOK matrix element by element:
>>> mtx = sparse.dok_matrix((5, 5), dtype=np.float64)
>>> mtx
<5x5 sparse matrix of type '<... 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
>>> for ir in range(5):
...     for ic in range(5):
...         mtx[ir, ic] = 1.0 * (ir != ic)
>>> mtx
<5x5 sparse matrix of type '<... 'numpy.float64'>'
with 20 stored elements in Dictionary Of Keys format>
>>> mtx.todense()
matrix([[0., 1., 1., 1., 1.],
        [1., 0., 1., 1., 1.],
        [1., 1., 0., 1., 1.],
        [1., 1., 1., 0., 1.],
        [1., 1., 1., 1., 0.]])
• slicing and indexing:
>>> mtx[1, 1]
0.0
>>> mtx[1, 1:3]
<1x2 sparse matrix of type '<... 'numpy.float64'>'
with 1 stored elements in Dictionary Of Keys format>
>>> mtx[1, 1:3].todense()
matrix([[0., 1.]])
>>> mtx[[2,1], 1:3].todense()
matrix([[1., 0.],
        [0., 1.]])

Coordinate Format (COO)
• very fast conversion to and from CSR/CSC formats
• fast matrix * vector (sparsetools)
• fast and easy item-wise operations
– manipulate data array directly (fast NumPy machinery)
• no slicing, no arithmetics (directly)
• use:
– facilitates fast conversion among sparse formats
– when converting to other format (usually CSR or CSC), duplicate entries are summed
together
∗ facilitates efficient construction of finite element matrices

Examples
• create empty COO matrix:
>>> mtx = sparse.coo_matrix((3, 4), dtype=np.int8)
>>> mtx.todense()
matrix([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=int8)
• create using (data, ij) tuple:
>>> row = np.array([0, 3, 1, 0])
>>> col = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> mtx = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
>>> mtx
<4x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
>>> mtx.todense()
matrix([[4, 0, 9, 0],
        [0, 7, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 5]])
• duplicate entries are summed together:
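A minimal sketch of the duplicate-summing behaviour, with made-up index arrays in which the position (0, 0) appears three times and (1, 1) twice:

```python
import numpy as np
from scipy import sparse

# (0, 0) appears three times and (1, 1) twice
row = np.array([0, 0, 1, 3, 1, 0, 0])
col = np.array([0, 2, 1, 3, 1, 0, 0])
data = np.array([1, 1, 1, 1, 1, 1, 1])

mtx = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
dense = mtx.toarray()   # duplicates are summed during conversion
print(dense)
# [[3 0 1 0]
#  [0 2 0 0]
#  [0 0 0 0]
#  [0 0 0 1]]
```

This is the property that makes COO convenient for assembling finite element matrices: contributions to the same entry can simply be appended and are accumulated on conversion.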
mtx = sps.lil_matrix((1000, 1000), dtype=np.float64)
mtx[0, :100] = rand(100)
mtx[1, 100:200] = mtx[0, :100]
mtx.setdiag(rand(1000))

plt.clf()
plt.spy(mtx, marker='.', markersize=2)
plt.show()

mtx = mtx.tocsr()
rhs = rand(1000)

• common interface for performing matrix vector products
• useful abstraction that enables using dense and sparse matrices within the solvers, as well as
matrix-free solutions
• has shape and matvec() (+ some optional parameters)
• example:
>>> import numpy as np
>>> from scipy.sparse.linalg import LinearOperator
>>> def mv(v):
...     return np.array([2*v[0], 3*v[1]])
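A self-contained version of the LinearOperator idea, as a sketch: the operator below applies diag(2, 3) to a vector without ever storing the matrix. The function name mv follows the example above; the test vector is arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator


def mv(v):
    # Action of diag(2, 3) on a vector, without storing the matrix
    return np.array([2 * v[0], 3 * v[1]])


A = LinearOperator((2, 2), matvec=mv)
y = A.matvec(np.ones(2))
print(A.shape, y)   # (2, 2) [2. 3.]
```

Iterative solvers in scipy.sparse.linalg only need this matvec action, which is why dense arrays, sparse matrices and matrix-free operators can be used interchangeably with them.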
import scipy
from scipy.sparse.linalg import lobpcg
import matplotlib.pyplot as plt
N = 100
K = 9
A = poisson((N,N), format='csr')
# preconditioner based on ml
M = ml.aspreconditioner()
– https://github.com/pyamg/pyamg
• Pysparse
– own sparse matrix classes
– matrix and eigenvalue problem solvers
– http://pysparse.sourceforge.net/

CHAPTER 13
Chapters contents
13.1 Opening and writing to image files
Writing an array to a file:
from scipy import misc
import imageio
f = misc.face()
imageio.imsave('face.png', f) # uses the Image module (PIL)

dtype is uint8 for 8-bit images (0-255)
Opening raw files (camera, 3-D images)
>>> face.tofile('face.raw') # Create raw file
>>> face_from_raw = np.fromfile('face.raw', dtype=np.uint8)
>>> face_from_raw.shape
(2359296,)
>>> face_from_raw.shape = (768, 1024, 3)
(data are read from the file, and not loaded into memory)
Working on a list of image files
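A minimal sketch of processing a list of files, under some assumptions: the files are small synthetic raw files written to a temporary directory (the image_NN.raw names and the 4 x 6 shape are made up), so that only NumPy and the standard library are needed. With real images the same loop would call an image reader instead of np.fromfile.

```python
import glob
import os
import tempfile

import numpy as np

shape = (4, 6)   # assumed image shape; a raw file does not store it

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a few small synthetic 8-bit "images" as raw files
    for i in range(3):
        im = np.full(shape, i, dtype=np.uint8)
        im.tofile(os.path.join(tmpdir, 'image_%02d.raw' % i))

    # Collect the files in a reproducible order and read them back
    filelist = sorted(glob.glob(os.path.join(tmpdir, 'image_*.raw')))
    images = [np.fromfile(name, dtype=np.uint8).reshape(shape)
              for name in filelist]

print(len(images), images[2].shape, images[2][0, 0])   # 3 (4, 6) 2
```

Sorting the result of glob matters: the order in which the file system returns names is not guaranteed.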
For smooth intensity variations, use interpolation='bilinear'. For fine inspection of intensity variations,
use interpolation='nearest':
>>> face = misc.face(gray=True)
>>> face[0, 40]
127
>>> # Slicing
>>> face[10:13, 20:23]
array([[141, 153, 145],
       [133, 134, 125],
       [ 96, 92, 94]], dtype=uint8)
>>> face[100:120] = 255
>>>
>>> lx, ly = face.shape
>>> X, Y = np.ogrid[0:lx, 0:ly]
>>> mask = (X - lx / 2) ** 2 + (Y - ly / 2) ** 2 > lx * ly / 4
>>> # Masks
>>> face[mask] = 0
>>> # Fancy indexing
>>> face[range(400), range(400)] = 255

np.histogram
Exercise
• Increase the contrast of the image by changing its minimum and maximum values. Optional:
use scipy.stats.scoreatpercentile (read the docstring!) to saturate 5% of the darkest pixels
and 5% of the lightest pixels.
• Save the array to two different file formats (png, jpg, tiff)
13.3.2 Geometrical transformations
>>> face = misc.face(gray=True)
>>> lx, ly = face.shape
>>> # Cropping
>>> crop_face = face[lx // 4: - lx // 4, ly // 4: - ly // 4]
>>> # up <-> down flip
>>> flip_ud_face = np.flipud(face)
>>> # rotation
>>> rotate_face = ndimage.rotate(face, 45)
>>> rotate_face_noreshape = ndimage.rotate(face, 45, reshape=False)
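The same transformations can be tried without downloading the face image. The sketch below uses a small synthetic array (a bright rectangle, chosen arbitrarily) and prints the resulting shapes, which makes the effect of reshape=False visible.

```python
import numpy as np
from scipy import ndimage

# Synthetic 40x60 "image": a bright rectangle on a dark background
img = np.zeros((40, 60))
img[10:30, 20:40] = 1.0

lx, ly = img.shape
crop = img[lx // 4: -lx // 4, ly // 4: -ly // 4]         # central half
flipped = np.flipud(img)                                 # up <-> down
rotated = ndimage.rotate(img, 45)                        # canvas grows
rotated_same = ndimage.rotate(img, 45, reshape=False)    # canvas kept

print(crop.shape, flipped.shape, rotated.shape, rotated_same.shape)
```

With reshape=False the output keeps the input shape, so the corners of the rotated image are cut off; with the default the canvas is enlarged to hold the whole rotated image.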
13.4.1 Blurring/smoothing
Gaussian filter from scipy.ndimage:
>>> from scipy import misc
>>> face = misc.face(gray=True)
>>> blurred_face = ndimage.gaussian_filter(face, sigma=3)
>>> very_blurred = ndimage.gaussian_filter(face, sigma=5)
Uniform filter
>>> local_mean = ndimage.uniform_filter(face, size=11)

13.4.2 Sharpening
13.4.3 Denoising
Noisy face:
>>> from scipy import misc
>>> f = misc.face(gray=True)
>>> f = f[230:290, 220:320]
>>> noisy = f + 0.4 * f.std() * np.random.random(f.shape)
A Gaussian filter smooths the noise out. . . and the edges as well:
>>> gauss_denoised = ndimage.gaussian_filter(noisy, 2)
Most local linear isotropic filters blur the image (ndimage.uniform_filter)
A median filter preserves the edges better:
Median filter: better result for straight boundaries (low curvature):
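To see the difference between the Gaussian and the median filter at an edge, here is a small sketch on a synthetic, noise-free white square (an assumption made to keep the result deterministic; the text above uses a noisy crop of the face image).

```python
import numpy as np
from scipy import ndimage

# Noise-free synthetic image: a white square with sharp edges
square = np.zeros((20, 20))
square[5:15, 5:15] = 1.0

gauss = ndimage.gaussian_filter(square, sigma=2)
med = ndimage.median_filter(square, size=3)

# The Gaussian filter creates intermediate grey levels along the edges;
# the median filter only returns values already present in its window.
print(np.unique(med))
print(((gauss > 0.01) & (gauss < 0.99)).sum(), "blurred pixels")
```

The median-filtered image still contains only 0 and 1, so straight edges stay sharp, although the corners of the square are rounded off. The Gaussian-filtered image has a band of intermediate values around the whole boundary.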
>>> el = ndimage.generate_binary_structure(2, 1)
>>> el
array([[False, True, False],
       [ True, True, True],
       [False, True, False]])
>>> el.astype(int)
array([[0, 1, 0],
       [1, 1, 1],
       [0, 1, 0]])
Dilation: maximum filter:
Opening: erosion + dilation:
>>> a = np.zeros((5,5), dtype=int)
>>> a[1:4, 1:4] = 1; a[4, 4] = 1
>>> a
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 1]])
>>> # Opening removes small objects
>>> ndimage.binary_opening(a, structure=np.ones((3,3))).astype(int)
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]])
>>> # Opening can also smooth corners
>>> ndimage.binary_opening(a).astype(int)
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0]])
Many other mathematical morphology operations: hit and miss transform, tophat, etc.

Use a gradient operator (Sobel) to find high intensity variations:
>>> sx = ndimage.sobel(im, axis=0, mode='constant')
>>> sy = ndimage.sobel(im, axis=1, mode='constant')
>>> sob = np.hypot(sx, sy)

13.5.2 Segmentation
• Histogram-based segmentation (no spatial information)
>>> binary_img = img > 0.5
Use mathematical morphology to clean up the result:
>>> # Remove small white regions
>>> open_img = ndimage.binary_opening(binary_img)
>>> # Remove small black hole
>>> close_img = ndimage.binary_closing(open_img)
Exercise
Check how a first denoising step (e.g. with a median filter) modifies the histogram, and check that
the resulting histogram-based segmentation is more accurate.
See also:
More advanced segmentation algorithms are found in the scikit-image: see Scikit-image: image processing.
See also:
Other Scientific Packages provide algorithms that can be useful for image processing. In this example,
we use the spectral clustering function of the scikit-learn in order to segment glued objects.
>>> l = 100
>>> x, y = np.indices((l, l))
>>> circle1 = (x - center1[0])**2 + (y - center1[1])**2 < radius1**2
>>> circle2 = (x - center2[0])**2 + (y - center2[1])**2 < radius2**2
>>> circle3 = (x - center3[0])**2 + (y - center3[1])**2 < radius3**2
>>> circle4 = (x - center4[0])**2 + (y - center4[1])**2 < radius4**2
>>> # 4 circles
>>> img = circle1 + circle2 + circle3 + circle4
>>> mask = img.astype(bool)
>>> img = img.astype(float)

13.6 Measuring objects properties: ndimage.measurements
Synthetic data:
>>> n = 10
>>> l = 256
>>> im = np.zeros((l, l))
>>> points = l*np.random.random((2, n**2))
>>> im[(points[0]).astype(int), (points[1]).astype(int)] = 1
>>> im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
>>> mask = im > im.mean()
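Once a binary mask is available, its objects can be labelled and measured. The sketch below uses a small deterministic mask (two rectangles, chosen for illustration) instead of the random one above, so that the results are predictable; ndimage.label, ndimage.sum and ndimage.center_of_mass are the only calls used.

```python
import numpy as np
from scipy import ndimage

# Deterministic stand-in for the random mask: two separate rectangles
mask = np.zeros((10, 10), dtype=bool)
mask[1:4, 1:4] = True      # 3 x 3 object
mask[6:8, 5:9] = True      # 2 x 4 object

# Label connected components, then measure each labelled object
label_im, nb_labels = ndimage.label(mask)
index = range(1, nb_labels + 1)
sizes = ndimage.sum(mask.astype(int), label_im, index)
centers = ndimage.center_of_mass(mask.astype(int), label_im, index)

print(nb_labels)   # 2
print(sizes)       # object areas in pixels: 9 and 8
print(centers)     # one (row, column) pair per object
```

Label 0 is the background, which is why the index starts at 1.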
• Other measures
Correlation function, Fourier/wavelet spectrum, etc.
One example with mathematical morphology: granulometry
>>> def disk_structure(n):
... struct = np.zeros((2 * n + 1, 2 * n + 1))
... x, y = np.indices((2 * n + 1, 2 * n + 1))
... mask = (x - n)**2 + (y - n)**2 <= n**2
... struct[mask] = 1
... return struct.astype(bool)
...
>>>
>>> def granulometry(data, sizes=None):
... s = max(data.shape)
... if sizes is None:
import scipy.misc
import matplotlib.pyplot as plt
f = scipy.misc.face(gray=True)
plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.imshow(f[320:340, 510:530], cmap=plt.cm.gray)
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(f[320:340, 510:530], cmap=plt.cm.gray, interpolation='nearest')
plt.axis('off')
import numpy as np
import scipy
import scipy.misc
import matplotlib.pyplot as plt

face = scipy.misc.face(gray=True)
face[10:13, 20:23]
face[100:120] = 255
lx, ly = face.shape
X, Y = np.ogrid[0:lx, 0:ly]
mask = (X - lx/2)**2 + (Y - ly/2)**2 > lx*ly/4
face[mask] = 0
face[range(400), range(400)] = 255

plt.figure(figsize=(3, 3))
plt.axes([0, 0, 1, 1])
plt.imshow(face, cmap=plt.cm.gray)
plt.axis('off')

plt.show()

13.8.4 Radial mean
This example shows how to do a radial mean with scipy.ndimage.
import numpy as np
import scipy
import scipy.misc
from scipy import ndimage
import matplotlib.pyplot as plt

f = scipy.misc.face(gray=True)
sx, sy = f.shape
X, Y = np.ogrid[0:sx, 0:sy]

# distance of each pixel from the image center
r = np.hypot(X - sx/2, Y - sy/2)
rbin = (20* r/r.max()).astype(int)
radial_mean = ndimage.mean(f, labels=rbin, index=np.arange(1, rbin.max() +1))

plt.figure(figsize=(5, 5))
plt.axes([0, 0, 1, 1])
plt.imshow(rbin, cmap=plt.cm.nipy_spectral)
plt.axis('off')
plt.show()
13.8.5 Plot the block mean of an image
An example showing how to use broadcasting to plot the mean of blocks of an image.
13.8.6 Display a Raccoon Face
An example that displays a raccoon face with matplotlib.
import scipy.misc
import matplotlib.pyplot as plt
f = scipy.misc.face(gray=True)
plt.figure(figsize=(10, 3.6))
plt.subplot(131)
plt.imshow(f, cmap=plt.cm.gray)
plt.subplot(132)
plt.imshow(f, cmap=plt.cm.gray, vmin=30, vmax=200)
plt.axis('off')
plt.subplot(133)
plt.imshow(f, cmap=plt.cm.gray)
plt.contour(f, [50, 200])
plt.axis('off')
plt.subplots_adjust(wspace=0, hspace=0., top=0.99, bottom=0.01, left=0.05,
                    right=0.99)
plt.show()
import numpy as np
import scipy.misc
from scipy import ndimage
import matplotlib.pyplot as plt
plt.show()
import numpy as np
import scipy
import scipy.misc
from scipy import ndimage
import matplotlib.pyplot as plt
f = scipy.misc.face(gray=True)
f = f[230:290, 220:320]
noisy = f + 0.4*f.std()*np.random.random(f.shape)
gauss_denoised = ndimage.gaussian_filter(noisy, 2)
med_denoised = ndimage.median_filter(noisy, 3)
plt.figure(figsize=(12,2.8))
plt.subplot(131)
plt.imshow(noisy, cmap=plt.cm.gray, vmin=40, vmax=220)

import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
square = np.zeros((32, 32))
square[10:-10, 10:-10] = 1
np.random.seed(2)
This example shows how to extract the bounding box of the largest object
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
l = 256
im = np.zeros((l, l))
points = l*np.random.random((2, n**2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
mask = im > im.mean()
label_im, nb_labels = ndimage.label(mask)
# Find the largest connected component
sizes = ndimage.sum(mask, label_im, range(nb_labels + 1))
mask_size = sizes < 1000
remove_pixel = mask_size[label_im]
label_im[remove_pixel] = 0
labels = np.unique(label_im)

import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
l = 256
im = np.zeros((l, l))
points = l*np.random.random((2, n**2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
mask = im > im.mean()
label_im, nb_labels = ndimage.label(mask)
sizes = ndimage.sum(mask, label_im, range(nb_labels + 1))
plt.subplots_adjust(wspace=0.01, hspace=0.01, top=1, bottom=0, left=0, right=1)
plt.show()
import numpy as np
import scipy.misc
from scipy import ndimage
import matplotlib.pyplot as plt
face = scipy.misc.face(gray=True)
lx, ly = face.shape
# Cropping
crop_face = face[lx//4:-lx//4, ly//4:-ly//4]
# up <-> down flip
flip_ud_face = np.flipud(face)
# rotation
rotate_face = ndimage.rotate(face, 45)
rotate_face_noreshape = ndimage.rotate(face, 45, reshape=False)
plt.figure(figsize=(12.5, 2.5))
plt.subplot(151)
plt.imshow(face, cmap=plt.cm.gray)
plt.axis('off')
plt.subplot(152)
plt.imshow(crop_face, cmap=plt.cm.gray)

import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
im = np.zeros((20, 20))
im[5:-5, 5:-5] = 1
im = ndimage.distance_transform_bf(im)
im_noise = im + 0.2*np.random.randn(*im.shape)
im_med = ndimage.median_filter(im_noise, 3)
plt.figure(figsize=(16, 5))
plt.subplot(141)
plt.imshow(im, interpolation='nearest')
plt.axis('off')
plt.title('Original image', fontsize=20)
plt.subplot(142)
plt.imshow(im_noise, interpolation='nearest', vmin=0, vmax=5)
plt.axis('off')
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
l = 256
im = np.zeros((l, l))
points = l*np.random.random((2, n**2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
mask = (im > im.mean()).astype(float)
mask += 0.1 * im
img = mask + 0.2*np.random.randn(*mask.shape)
hist, bin_edges = np.histogram(img, bins=60)

import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
im = np.zeros((256, 256))
im[64:-64, 64:-64] = 1
im = ndimage.rotate(im, 15, mode='constant')
im = ndimage.gaussian_filter(im, 8)
sx = ndimage.sobel(im, axis=0, mode='constant')
sy = ndimage.sobel(im, axis=1, mode='constant')
sob = np.hypot(sx, sy)
plt.figure(figsize=(16, 5))
plt.subplot(141)
import numpy as np
import scipy
import scipy.misc
import matplotlib.pyplot as plt
try:
    from skimage.restoration import denoise_tv_chambolle
except ImportError:
    # skimage < 0.12
    from skimage.filters import denoise_tv_chambolle

import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
im = np.zeros((64, 64))
np.random.seed(2)
x, y = (63*np.random.random((2, 8))).astype(int)
im[x, y] = np.arange(8)
bigger_points = ndimage.grey_dilation(im, size=(5, 5), structure=np.ones((5, 5)))
plt.subplots_adjust(wspace=0.02, hspace=0.3, top=1, bottom=0.1, left=0, right=1)
plt.show()
13.8.20 Cleaning segmentation with mathematical morphology
An example showing how to clean segmentation with mathematical morphology: removing small regions and holes.
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
l = 256
im = np.zeros((l, l))
points = l*np.random.random((2, n**2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
13.8.23 Granulometry
This example performs a simple granulometry analysis.
def granulometry(data, sizes=None):
    s = max(data.shape)
    if sizes is None:
        # Integer division: range() needs integer bounds in Python 3
        sizes = range(1, s//2, 2)
    granulo = [ndimage.binary_opening(data,
               structure=disk_structure(n)).sum() for n in sizes]
    return granulo

np.random.seed(1)
n = 10
l = 256
im = np.zeros((l, l))
points = l*np.random.random((2, n**2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = ndimage.gaussian_filter(im, sigma=l/(4.*n))
mask = im > im.mean()
granulo = granulometry(mask, sizes=np.arange(2, 19, 4))
plt.figure(figsize=(6, 2.2))

radius1, radius2, radius3, radius4 = 16, 14, 15, 14
circle1 = (x - center1[0])**2 + (y - center1[1])**2 < radius1**2
circle2 = (x - center2[0])**2 + (y - center2[1])**2 < radius2**2
circle3 = (x - center3[0])**2 + (y - center3[1])**2 < radius3**2
circle4 = (x - center4[0])**2 + (y - center4[1])**2 < radius4**2
# 4 circles
img = circle1 + circle2 + circle3 + circle4
mask = img.astype(bool)
img = img.astype(float)
img += 1 + 0.2*np.random.randn(*img.shape)
# Convert the image into a graph with the value of the gradient on the
# edges.
graph = image.img_to_graph(img, mask=mask)
# Take a decreasing function of the gradient: we make it only weakly
# dependent on the gradient, so the segmentation is close to a Voronoi
graph.data = np.exp(-graph.data / graph.data.std())
See also:
More on image-processing:
• The chapter on Scikit-image
• Other, more powerful and complete modules: OpenCV (Python bindings), CellProfiler, ITK with Python bindings
CHAPTER 14
Mathematical optimization: finding minima of functions
Authors: Gaël Varoquaux
Mathematical optimization deals with the problem of numerically finding minima (or maxima or zeros) of a function. In this context, the function is called the cost function, the objective function, or the energy.
Here, we are interested in using scipy.optimize for black-box optimization: we do not rely on the mathematical expression of the function that we are optimizing. Note that this expression can often be used for more efficient, non black-box, optimization.
Prerequisites
• Numpy
• Scipy
• Matplotlib
See also:
References
Mathematical optimization is very . . . mathematical. If you want performance, it really pays to read
the books:
• Convex Optimization by Boyd and Vandenberghe (pdf available free online).
• Numerical Optimization, by Nocedal and Wright. Detailed reference on gradient descent methods.
• Practical Methods of Optimization by Fletcher: good at hand-waving explanations.
– Choosing a method
– Making your optimizer faster
– Computing gradients
– Synthetic exercises
• Special case: non-linear least-squares
– Minimizing the norm of a vector function
– Curve fitting
• Optimization with constraints
– Box bounds
– General constraints
• Full code examples
• Examples for the mathematical optimization chapter
Optimizing convex functions is easy. Optimizing non-convex functions can be very hard.
Note: It can be proven that for a convex function a local minimum is also a global minimum. Then, in some sense, the minimum is unique.
14.1.2 Smooth and non-smooth problems
Optimizing smooth functions is easier (true in the context of black-box optimization, otherwise Linear Programming is an example of methods which deal very efficiently with piece-wise linear functions).
14.1.3 Noisy versus exact cost functions
Noisy gradients
Many optimization methods rely on gradients of the objective function. If the gradient function is not given, the gradients are computed numerically, which induces errors. In such a situation, even if the objective function is not noisy, a gradient-based optimization may be a noisy optimization.
14.1.4 Constraints
(continued from previous page)
>>> result.success # check if solver was successful
True
>>> x_min = result.x
>>> x_min
0.699999999...
>>> x_min - 0.7
-2.16...e-10
Brent’s method on a non-convex function: note that the fact that the optimizer avoided the local minimum is a matter of luck.
Note: You can use different solvers using the parameter method.
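As a minimal sketch of switching solvers through the method parameter, the snippet below minimizes a 1D function with scipy.optimize.minimize_scalar(), first with the default Brent method and then with the bounded method. The function f (a Gaussian well centred on 0.7) and the interval (0, 1) are assumptions chosen for illustration:

```python
import numpy as np
from scipy import optimize

def f(x):
    # Assumed test function: a single smooth minimum at x = 0.7
    return -np.exp(-(x - 0.7)**2)

# Default solver: Brent's method, no bounds needed
result = optimize.minimize_scalar(f)

# Same problem restricted to an interval, using the "bounded" solver
bounded = optimize.minimize_scalar(f, bounds=(0, 1), method="bounded")

print(result.success, result.x, bounded.x)
```

Both calls return a result object; checking its success attribute before using x is a cheap safeguard.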
We can see that very anisotropic (ill-conditioned) functions are harder to optimize.
Take home message: conditioning number and preconditioning
If you know natural scaling for your variables, prescale them so that they behave similarly. This is related to preconditioning.
Also, it clearly can be advantageous to take bigger steps. This is done in gradient descent code using a line search.
A well-conditioned quadratic function.
An ill-conditioned quadratic function.
An ill-conditioned non-quadratic function.
An ill-conditioned very non-quadratic function.
The more a function looks like a quadratic function (elliptic iso-curves), the easier it is to optimize.
Conjugate gradient descent
The gradient descent algorithms above are toys not to be used on real problems.
As can be seen from the above experiments, one of the problems of the simple gradient descent algorithms is that they tend to oscillate across a valley, each time following the direction of the gradient, which makes them cross the valley. The conjugate gradient solves this problem by adding a friction term: each step depends on the two last values of the gradient and sharp turns are reduced.
Table 3: Conjugate gradient descent
An ill-conditioned non-quadratic function.
scipy provides scipy.optimize.minimize() to find the minimum of scalar functions of one or more variables. The simple conjugate gradient method can be used by setting the parameter method to CG
>>> def f(x): # The rosenbrock function
...     return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2
>>> optimize.minimize(f, [2, -1], method="CG")
     fun: 1.6...e-11
     jac: array([-6.15...e-06, 2.53...e-07])
 message: ...'Optimization terminated successfully.'
    nfev: 108
     nit: 13
    njev: 27
  status: 0
 success: True
       x: array([0.99999..., 0.99998...])
Gradient methods need the Jacobian (gradient) of the function. They can compute it numerically, but will perform better if you can pass them the gradient:
>>> def jacobian(x):
...     return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))
>>> optimize.minimize(f, [2, 1], method="CG", jac=jacobian)
     fun: 2.957...e-14
     jac: array([ 7.1825...e-07, -2.9903...e-07])
 message: 'Optimization terminated successfully.'
    nfev: 16
     nit: 8
    njev: 16
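The effect of supplying the gradient can be checked directly by comparing the number of function evaluations. The sketch below reuses the f and jacobian defined above as a standalone script; starting both runs from the same point [2, -1] is a choice made here for a like-for-like comparison:

```python
import numpy as np
from scipy import optimize

def f(x):   # The rosenbrock function used in this chapter
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2),
                     2*(x[1] - x[0]**2)))

# Gradient estimated by finite differences: extra calls to f per gradient
res_numeric = optimize.minimize(f, [2, -1], method="CG")

# Analytical gradient: f is only called by the line search
res_analytic = optimize.minimize(f, [2, -1], method="CG", jac=jacobian)

print("nfev without jac:", res_numeric.nfev)
print("nfev with jac:   ", res_analytic.nfev)
```

The exact counts depend on the scipy version, so no particular numbers are claimed here; the run with jac should need noticeably fewer calls to f.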
Note: At very high-dimension, the inversion of the Hessian can be costly and unstable (large scale > 250).
In scipy, you can use the Newton method by setting method to Newton-CG in scipy.optimize.minimize(). Here, CG refers to the fact that an internal inversion of the Hessian is performed by conjugate gradient
>>> def f(x): # The rosenbrock function
...     return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2
>>> def jacobian(x):
...     return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))
>>> optimize.minimize(f, [2,-1], method="Newton-CG", jac=jacobian)
     fun: 1.5...e-15
     jac: array([ 1.0575...e-07, -7.4832...e-08])
 message: ...'Optimization terminated successfully.'
    nfev: 11
    nhev: 0
     nit: 10
    njev: 52
  status: 0
 success: True
       x: array([0.99999..., 0.99999...])
Note that compared to a conjugate gradient (above), Newton’s method has required fewer function evaluations, but more gradient evaluations, as it uses them to approximate the Hessian. Let’s compute the Hessian and pass it to the algorithm:
14.4.1 Noisy optimization problem
Draws a figure explaining noisy vs non-noisy optimization
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
x = np.linspace(-5, 5, 101)
x_ = np.linspace(-5, 5, 31)
def f(x):
    return -np.exp(-x**2)
# A smooth function
plt.figure(1, figsize=(3, 2.5))
plt.clf()
14.4.2 Smooth vs non-smooth
Draws a figure to explain smooth versus non smooth optimization.
plt.plot(x, np.sqrt(.2 + x**2), linewidth=2)
plt.text(-1, 0, '$f$', size=20)
plt.ylim(bottom=-.2)
plt.axis('off')
plt.tight_layout()
# A non-smooth function
plt.figure(2, figsize=(3, 2.5))
plt.clf()
plt.plot(x, np.abs(x), linewidth=2)
plt.text(-1, 0, '$f$', size=20)
plt.ylim(bottom=-.2)
plt.axis('off')
plt.tight_layout()
plt.show()
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
np.random.seed(0)
# Fit the model: the parameters omega and phi can be found in the
# `params` vector
params, params_cov = optimize.curve_fit(f, x, y)
# plot the data and the fitted curve
t = np.linspace(0, 3, 1000)
plt.figure(1)
plt.clf()
plt.plot(x, y, 'bx')
plt.plot(t, f(t, *params), 'r-')
plt.show()

x = np.linspace(-1, 2)
plt.figure(1, figsize=(3, 2.5))
plt.clf()
# A convex function
plt.plot(x, x**2, linewidth=2)
plt.text(-.7, -.6**2, '$f$', size=20)
# The tangent in one point
plt.plot(x, 2*x - 1)
plt.plot(1, 1, 'k+')
plt.text(.3, -.75, "Tangent to $f$", size=15)
plt.text(1, 1 - .5, 'C', size=15)
# Convexity as barycenter
plt.plot([.35, 1.85], [.35**2, 1.85**2])
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
def f(x):
    return np.exp(-1/(.01*x[0]**2 + x[1]**2))
# A well-conditioned version of f:
def g(x):
    return f([10*x[0], x[1]])
plt.figure(0)
plt.clf()
t = np.linspace(-1.1, 1.1, 100)
plt.plot(t, f([0, t]))
14.4.6 Optimization with constraints
An example showing how to do optimization with general constraints using SLSQP and cobyla.
def f(x):
    # Store the list of function calls
    accumulator.append(x)
    return np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)
def constraint(x):
    return np.atleast_1d(1.5 - np.sum(np.abs(x)))
accumulated = np.array(accumulator)
plt.plot(accumulated[:, 0], accumulated[:, 1])
plt.show()
x, y = np.mgrid[-2.03:4.2:.04, -1.6:3.2:.04]
Out:
Converged at 6
Converged at 23
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
x = np.linspace(-1, 3, 100)
x_0 = np.exp(-1)
def f(x):
    return (x - x_0)**2 + epsilon*np.exp(-5*(x - .5 - x_0)**2)
# A convex function
plt.plot(x, f(x), linewidth=2)
this_x = result.x
all_x.append(this_x)
all_y.append(f(this_x))
if iter < 6:
def f(x):
    # Store the list of function calls
    accumulator.append(x)
    return np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)
# We don't use the gradient, as with the gradient, L-BFGS is too fast,
# and finds the optimum without showing us a pretty path
def f_prime(x):
    # Gradient of f: the second component depends on x[1], not x[0]
    r = np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)
    return np.array(((x[0] - 3)/r, (x[1] - 2)/r))
accumulated = np.array(accumulator)
plt.plot(accumulated[:, 0], accumulated[:, 1])
plt.show()
plt.xticks(np.arange(n_methods), method_names)
plt.yticks(())
plt.tight_layout()
plt.show()
t0 = time.time()
x_l_bfgs = optimize.minimize(f, K[0], method="L-BFGS-B").x
print('     L-BFGS: time %.2fs, x error %.2f, f error %.2f' % (time.time() - t0,
    np.sqrt(np.sum((x_l_bfgs - x_ref)**2)), f(x_l_bfgs) - f_ref))
t0 = time.time()
x_bfgs = optimize.minimize(f, K[0], jac=f_prime, method="BFGS").x
print("  BFGS w f': time %.2fs, x error %.2f, f error %.2f" % (
    time.time() - t0, np.sqrt(np.sum((x_bfgs - x_ref)**2)),
    f(x_bfgs) - f_ref))
t0 = time.time()
x_l_bfgs = optimize.minimize(f, K[0], jac=f_prime, method="L-BFGS-B").x
print("L-BFGS w f': time %.2fs, x error %.2f, f error %.2f" % (
    time.time() - t0, np.sqrt(np.sum((x_l_bfgs - x_ref)**2)),
    f(x_l_bfgs) - f_ref))
t0 = time.time()
x_newton = optimize.minimize(f, K[0], jac=f_prime, hess=hessian, method="Newton-CG").x
print("     Newton: time %.2fs, x error %.2f, f error %.2f" % (
    time.time() - t0, np.sqrt(np.sum((x_newton - x_ref)**2)),
    f(x_newton) - f_ref))
Out:
     Powell: time 0.27s
       BFGS: time 0.75s, x error 0.02, f error -0.02
     L-BFGS: time 0.06s, x error 0.02, f error -0.02
  BFGS w f': time 0.08s, x error 0.02, f error -0.02
L-BFGS w f': time 0.00s, x error 0.02, f error -0.02
     Newton: time 0.01s, x error 0.02, f error -0.02
Total running time of the script: ( 0 minutes 1.490 seconds)
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import sys, os
sys.path.append(os.path.abspath('helper'))
from cost_functions import mk_quad, mk_gauss, rosenbrock,\
rosenbrock_prime, rosenbrock_hessian, LoggingFunction,\
CountingFunction
A gradient descent algorithm. Do not use it: it is a toy; use scipy’s optimize.fmin_cg instead.
def gradient_descent(x0, f, f_prime, hessian=None, adaptative=False):
    x_i, y_i = x0
    all_x_i = list()
    all_y_i = list()
    all_f_i = list()
    for i in range(1, 100):
        all_x_i.append(x_i)
        all_y_i.append(y_i)
        all_f_i.append(f([x_i, y_i]))
        dx_i, dy_i = f_prime(np.asarray([x_i, y_i]))
        if adaptative:
            # Compute a step size using a line_search to satisfy the Wolfe
            # conditions
            step = optimize.line_search(f, f_prime,
                                np.r_[x_i, y_i], -np.r_[dx_i, dy_i],
                                np.r_[dx_i, dy_i], c2=.05)
            step = step[0]
            if step is None:
                step = 0
        else:
            step = 1
        x_i += - step*dx_i
        y_i += - step*dy_i
        if np.abs(all_f_i[-1]) < 1e-16:
            break
    return all_x_i, all_y_i, all_f_i

def gradient_descent_adaptative(x0, f, f_prime, hessian=None):
    return gradient_descent(x0, f, f_prime, adaptative=True)

def conjugate_gradient(x0, f, f_prime, hessian=None):
    all_x_i = [x0[0]]
    all_y_i = [x0[1]]
    all_f_i = [f(x0)]
    def store(X):
        x, y = X
        all_x_i.append(x)
        all_y_i.append(y)
        all_f_i.append(f(X))

def bfgs(x0, f, f_prime, hessian=None):
    all_x_i = [x0[0]]
    all_y_i = [x0[1]]
    all_f_i = [f(x0)]
    def store(X):
        x, y = X
        all_x_i.append(x)
        all_y_i.append(y)
        all_f_i.append(f(X))
    optimize.minimize(f, x0, method="BFGS", jac=f_prime, callback=store, options={"gtol": 1e-12})
    return all_x_i, all_y_i, all_f_i

def powell(x0, f, f_prime, hessian=None):
    all_x_i = [x0[0]]
    all_y_i = [x0[1]]
    all_f_i = [f(x0)]
    def store(X):
        x, y = X
        all_x_i.append(x)
        all_y_i.append(y)
        all_f_i.append(f(X))
    optimize.minimize(f, x0, method="Powell", callback=store, options={"ftol": 1e-12})
    return all_x_i, all_y_i, all_f_i

def nelder_mead(x0, f, f_prime, hessian=None):
    all_x_i = [x0[0]]
    all_y_i = [x0[1]]
    all_f_i = [f(x0)]
    def store(X):
        x, y = X
        all_x_i.append(x)
        all_y_i.append(y)
        all_f_i.append(f(X))
    optimize.minimize(f, x0, method="Nelder-Mead", callback=store, options={"ftol": 1e-12})
    return all_x_i, all_y_i, all_f_i

Run different optimizers on these problems
levels = dict()
L-BFGS: Limited-memory BFGS sits between BFGS and conjugate gradient: in very high dimensions (> 250) the Hessian matrix is too costly to compute and invert. L-BFGS keeps a low-rank version. In addition, box bounds are also supported by L-BFGS-B:
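A minimal sketch of L-BFGS-B with and without box bounds, using the Rosenbrock-like f and jacobian from this chapter. The starting point [2, 2] and the bound x[0] >= 1.5 are assumptions picked so that the bound is active at the solution:

```python
import numpy as np
from scipy import optimize

def f(x):   # The rosenbrock function used in this chapter
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2),
                     2*(x[1] - x[0]**2)))

# Unconstrained run: converges to the global minimum near (1, 1)
free = optimize.minimize(f, [2, 2], method="L-BFGS-B", jac=jacobian)

# Box bounds: 1.5 <= x[0] <= 3, x[1] unbounded (None means no bound)
boxed = optimize.minimize(f, [2, 2], method="L-BFGS-B", jac=jacobian,
                          bounds=((1.5, 3), (None, None)))

print(free.x)    # close to [1, 1]
print(boxed.x)   # x[0] sits on the lower bound, x[1] close to x[0]**2
```

With the bound in place the minimizer ends up on the edge of the box, at x[0] = 1.5 and x[1] = 1.5**2 = 2.25, since the second term of f can still be driven to zero.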
An ill-conditioned non-
quadratic function:
Warning: A very common source of optimization not converging well is human error in the computation of the gradient. You can use scipy.optimize.check_grad() to check that your gradient is correct. It returns the norm of the difference between the gradient given, and a gradient computed numerically:
>>> optimize.check_grad(f, jacobian, [2, -1])
2.384185791015625e-07
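To see what check_grad() reports when the gradient is wrong, the sketch below compares the correct jacobian with a deliberately broken one. The function buggy_jacobian is hypothetical and exists only for this illustration (it flips the sign of the second component):

```python
import numpy as np
from scipy import optimize

def f(x):   # The rosenbrock function used in this chapter
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2),
                     2*(x[1] - x[0]**2)))

def buggy_jacobian(x):
    # Hypothetical mistake: sign error on the derivative w.r.t. x[1]
    g = jacobian(x)
    g[1] = -g[1]
    return g

x0 = np.array([2.0, -1.0])
err_ok = optimize.check_grad(f, jacobian, x0)
err_bad = optimize.check_grad(f, buggy_jacobian, x0)
print(err_ok, err_bad)
```

A correct gradient gives an error at the level of finite-difference noise, while the sign error shows up as an error of order 1 or more, which is easy to spot.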
BFGS needs more function calls, and gives a less precise result.
Note: leastsq is interesting compared to BFGS only if the dimensionality of the output vector is large, and larger than the number of parameters to optimize.
Warning: If the function is linear, this is a linear-algebra problem, and should be solved with scipy.linalg.lstsq().
(continued from previous page)
    nfev: 9
     nit: 2
  status: 0
 success: True
       x: array([1.5, 1.5])
>>> optimize.curve_fit(f, x, y)
(array([1.5185..., 0.92665...]), array([[ 0.00037..., -0.00056...],
       [-0.0005..., 0.00123...]]))
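A self-contained sketch of such a fit is given below. The model f, the true parameters (omega = 1.5, phi = 1), the noise level and the random seed are assumptions made for this illustration; curve_fit returns the fitted parameters and their covariance matrix, whose diagonal gives the parameter variances:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)

def f(t, omega, phi):
    # Assumed model: a cosine with unknown frequency and phase
    return np.cos(omega * t + phi)

# Synthetic noisy data generated from omega = 1.5, phi = 1
x = np.linspace(0, 3, 50)
y = f(x, 1.5, 1) + 0.1 * rng.normal(size=50)

params, params_cov = optimize.curve_fit(f, x, y)

# One-sigma uncertainties on omega and phi
perr = np.sqrt(np.diag(params_cov))
print(params, perr)
```

Note that curve_fit starts from an initial guess of 1 for every parameter unless p0 is given, which matters for oscillating models like this one.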
Do the same with omega = 3. What is the difficulty?
... constraints:
>>> def constraint(x):
...     return np.atleast_1d(1.5 - np.sum(np.abs(x)))
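A minimal sketch of how such a constraint function can be passed to scipy.optimize.minimize() with the SLSQP solver. The objective f (distance to the point (3, 2)) is the one used in the constraint example of this chapter; the starting point [0, 0] is an assumption:

```python
import numpy as np
from scipy import optimize

def f(x):
    # Distance to the point (3, 2), which lies outside the feasible region
    return np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)

def constraint(x):
    # Feasible when the returned value is >= 0, i.e. |x0| + |x1| <= 1.5
    return np.atleast_1d(1.5 - np.sum(np.abs(x)))

result = optimize.minimize(f, np.array([0.0, 0.0]), method="SLSQP",
                           constraints={"fun": constraint, "type": "ineq"})
print(result.x)
```

An inequality constraint of type "ineq" means the function must stay non-negative; the solution lies on the boundary of the region, at the point of the line x0 + x1 = 1.5 closest to (3, 2).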
Lagrange multipliers
If you are ready to do a bit of math, many constrained optimization problems can be converted to
non-constrained optimization problems using a mathematical trick known as Lagrange multipliers.
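As a small worked illustration (not part of the original text), consider minimizing the squared distance to the point (3, 2) subject to the equality constraint x + y = 1.5. This treats the constraint |x| + |y| <= 1.5 as active in the first quadrant, an assumption made here so that the constraint is smooth:

```latex
\begin{aligned}
L(x, y, \lambda) &= (x - 3)^2 + (y - 2)^2 + \lambda\,(x + y - 1.5) \\
\partial_x L = 0 &\;\Rightarrow\; 2(x - 3) + \lambda = 0 \\
\partial_y L = 0 &\;\Rightarrow\; 2(y - 2) + \lambda = 0 \\
\partial_\lambda L = 0 &\;\Rightarrow\; x + y = 1.5 \\
&\;\Rightarrow\; x = y + 1,\quad x = 1.25,\quad y = 0.25,\quad \lambda = 3.5
\end{aligned}
```

The constrained problem has become a system of three equations in three unknowns, with no constraint left to handle explicitly.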
14.8 Full code examples
14.9 Examples for the mathematical optimization chapter
CHAPTER 15
Interfacing with C
Chapters contents
• Introduction
• Python-C-Api
• Ctypes
• SWIG
• Cython
• Summary
• Further Reading and References
• Exercises
15.1 Introduction
This chapter covers the following techniques:
• Python-C-Api
• All of these techniques may crash (segmentation fault) the Python interpreter, which is (usually) due to bugs in the C code.
• All the examples have been done on Linux, they should be possible on other operating systems.
• You will need a C compiler for most of the examples.
    /* parse the input, from python float to c double */
    if (!PyArg_ParseTuple(args, "d", &value))
        return NULL;
    /* if the above function returns false (0), an appropriate Python exception
     * will have been set, and the function simply returns NULL
     */
(continued from previous page)
    -1,
    CosMethods
};
PyMODINIT_FUNC
PyInit_cos_module(void)
{
    return PyModule_Create(&cModPyDem);
}
#else
/* module initialization */
/* Python version 2 */
PyMODINIT_FUNC
initcos_module(void)
{
    (void) Py_InitModule("cos_module", CosMethods);
}
#endif
As you can see, there is much boilerplate, both to «massage» the arguments and return types into place and for the module initialisation. Although some of this is amortised as the extension grows, the boilerplate required for each function remains.
The standard python build system distutils supports compiling C-extensions from a setup.py, which is rather convenient:
from distutils.core import setup, Extension
# define the extension module
cos_module = Extension('cos_module', sources=['cos_module.c'])
# run the setup
setup(ext_modules=[cos_module])
This can be compiled:
$ ls
build/ cos_module.c cos_module.so setup.py
The file cos_module.so contains the compiled extension, which we can now load in the IPython interpreter:
Note: In Python 3, the filename for compiled modules includes metadata on the Python interpreter (see PEP 3149) and is thus longer. The import statement is not affected by this.
In [1]: import cos_module
In [2]: cos_module?
Type: module
String Form:<module 'cos_module' from 'cos_module.so'>
File: /home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/python_c_api/cos_module.so
Docstring: <no docstring>
In [3]: dir(cos_module)
Out[3]: ['__doc__', '__file__', '__name__', '__package__', 'cos_func']
In [4]: cos_module.cos_func(1.0)
Out[4]: 0.5403023058681398
In [5]: cos_module.cos_func(0.0)
Out[5]: 1.0
In [6]: cos_module.cos_func(3.14159265359)
Out[6]: -1.0
Now let’s see how robust this is:
In [10]: cos_module.cos_func('foo')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-11bee483665d> in <module>()
----> 1 cos_module.cos_func('foo')
TypeError: a float is required
#include <Python.h>
#include <numpy/arrayobject.h>
#include <math.h>
    /* Set up and create the iterator */
    iterator_flags = (NPY_ITER_ZEROSIZE_OK |
                      /*
                       * Enable buffering in case the input is not well behaved
                       * (non-native byte order or not aligned),
                       * disabling may speed up some cases when it is known to
                       * be unnecessary.
                       */
                      NPY_ITER_BUFFERED |
                      /* Manually handle innermost iteration for speed: */
                      NPY_ITER_EXTERNAL_LOOP |
                      NPY_ITER_GROWINNER);

    op_flags[0] = (NPY_ITER_READONLY |
                   /*
                    * Required that the arrays are well behaved, since the cos
                    * call below requires this.
                    */
                   NPY_ITER_NBO |
                   NPY_ITER_ALIGNED);

    /* Ask the iterator to allocate an array to write the output to */
    op_flags[1] = NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE;

    /*
     * Ensure the iteration has the correct type, could be checked
     * specifically here.
     */
    op_dtypes[0] = PyArray_DescrFromType(NPY_DOUBLE);
    op_dtypes[1] = op_dtypes[0];

    /* Create the numpy iterator object: */
    iter = NpyIter_MultiNew(2, arrays, iterator_flags,
                            /* Use input order for output and iteration */
                            NPY_KEEPORDER,
                            /* Allow only byte-swapping of input */
                            NPY_EQUIV_CASTING, op_flags, op_dtypes);
    Py_DECREF(op_dtypes[0]); /* The second one is identical. */

    if (iter == NULL)
        return NULL;

    iternext = NpyIter_GetIterNext(iter, NULL);

    /* The location of the data pointer which the iterator may update */
    char **dataptr = NpyIter_GetDataPtrArray(iter);
    /* The location of the stride which the iterator may update */
    npy_intp *strideptr = NpyIter_GetInnerStrideArray(iter);
    /* The location of the inner loop size which the iterator may update */
    npy_intp *innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);

    /* iterate over the arrays */
    do {
        npy_intp stride = strideptr[0];
        npy_intp count = *innersizeptr;
        /* out is always contiguous, so use double */
        double *out = (double *)dataptr[1];
        char *in = dataptr[0];

        /* The output is allocated and guaranteed contiguous (out++ works): */
        assert(strideptr[1] == sizeof(double));

        /*
         * For optimization it can make sense to add a check for
         * stride == sizeof(double) to allow the compiler to optimize for that.
         */
        while (count--) {
            *out = cos(*(double *)in);
            out++;
            in += stride;
        }
    } while (iternext(iter));

    /* Clean up and return the result */
    NpyIter_Deallocate(iter);
    return ret;
}

/* define functions in module */
static PyMethodDef CosMethods[] =
{
    {"cos_func_np", cos_func_np, METH_VARARGS,
     "evaluate the cosine on a numpy array"},
    {NULL, NULL, 0, NULL}
};

# if PY_MAJOR_VERSION >= 3
/* module initialization */
/* Python version 3*/
static struct PyModuleDef cModPyDem = {
    PyModuleDef_HEAD_INIT,
    "cos_module", "Some documentation",
    -1,
    CosMethods
};
PyMODINIT_FUNC PyInit_cos_module_np(void) {
    PyObject *module;
    module = PyModule_Create(&cModPyDem);
    if(module==NULL) return NULL;
    /* IMPORTANT: this must be called */
    import_array();
    if (PyErr_Occurred()) return NULL;
    return module;
}

# else
/* module initialization */
/* Python version 2 */
PyMODINIT_FUNC initcos_module_np(void) {
    PyObject *module;
    module = Py_InitModule("cos_module_np", CosMethods);
    if(module==NULL) return;
    /* IMPORTANT: this must be called */
    import_array();
    return;
}
# endif

To compile this we can use distutils again. However, we need to be sure to include the Numpy headers
by using numpy.get_include().

from distutils.core import setup, Extension
import numpy

# The function is OK with `x` not having any elements:
x_empty = np.array([], dtype=np.float64)
y_empty = cos_module_np.cos_func_np(x_empty)
assert np.array_equal(y_empty, np.array([], dtype=np.float64))

# The function can handle arbitrary dimensions and non-contiguous data.
# `x_2d` contains the same values, but has a different shape.
# Note: `x_2d.flags` shows it is not contiguous and `x_2d.ravel() == x`
x_2d = x.repeat(2)[::2].reshape(-1, 3)
y_2d = cos_module_np.cos_func_np(x_2d)
# When reshaped back, the same result is given:
assert np.array_equal(y_2d.ravel(), y)

# The function handles incorrect byte-order fine:
x_not_native_byteorder = x.astype(x.dtype.newbyteorder())
y_not_native_byteorder = cos_module_np.cos_func_np(x_not_native_byteorder)
assert np.array_equal(y_not_native_byteorder, y)

# The function fails if the data type is incorrect:
x_incorrect_dtype = x.astype(np.float32)
try:
    cos_module_np.cos_func_np(x_incorrect_dtype)
    assert 0, "This cannot be reached."
except TypeError:
    # A TypeError will be raised, this can be changed by changing the
    # casting rule.
    pass

And this should result in the following figure:

""" Example of wrapping cos function from math.h using ctypes. """

import ctypes

# find and load the library
# set the argument type
libm.cos.argtypes = [ctypes.c_double]
# set the return type
libm.cos.restype = ctypes.c_double

In [6]: cos_module.cos_func(3.14159265359)
Out[6]: -1.0

As with the previous example, this code is somewhat robust, although the error message is not quite as
helpful, since it does not tell us what the type should be.

12 def cos_func(arg):
13 ''' Wrapper for cos from math.h '''
---> 14 return libm.cos(arg)

ArgumentError: argument 1: <type 'exceptions.TypeError'>: wrong type
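The ctypes wrapper discussed here can be sketched end to end as follows. This is a minimal sketch, not necessarily the exact module used in these notes: it assumes a POSIX system where `find_library('m')` locates the C math library (if it returns None, `CDLL(None)` falls back to symbols already loaded in the process), and `cos_func_checked` is a hypothetical helper added only to show how the unhelpful `ArgumentError` could be turned into a clearer `TypeError`.

```python
""" Sketch: wrapping cos from the C math library with ctypes. """
import ctypes
from ctypes.util import find_library

# find and load the library (assumption: POSIX; find_library may return
# None, in which case CDLL(None) uses the symbols of the running process)
libm = ctypes.CDLL(find_library('m'))

# set the argument type
libm.cos.argtypes = [ctypes.c_double]
# set the return type
libm.cos.restype = ctypes.c_double


def cos_func(arg):
    ''' Wrapper for cos from math.h '''
    return libm.cos(arg)


def cos_func_checked(arg):
    ''' Hypothetical variant that reports the expected type on bad input. '''
    try:
        return libm.cos(arg)
    except ctypes.ArgumentError:
        raise TypeError("cos_func expects a float, got %s"
                        % type(arg).__name__)


print(cos_func(0.0))
```

Setting `argtypes` is what makes ctypes reject a string argument instead of passing garbage to the C function; without it the call would not be checked at all.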
Docstring: <no docstring>

In [3]: dir(cos_module)
Out[3]:
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'cos_func',
 'ctypes',
 'find_library',
 'libm']

/* Compute the cosine of each element in in_array, storing the result in
 * out_array. */
void cos_doubles(double * in_array, double * out_array, int size){
    int i;
    for(i=0;i<size;i++){
        out_array[i] = cos(in_array[i]);
    }
}

And since the library is pure C, we can’t use distutils to compile it, but must use a combination of
make and gcc:

.PHONY : clean
clean :
	-rm -vf libcos_doubles.so cos_doubles.o cos_doubles.pyc
We can then compile this (on Linux) into the shared library libcos_doubles.so:
$ ls
cos_doubles.c cos_doubles.h cos_doubles.py makefile test_cos_doubles.py
$ make
gcc -c -fPIC cos_doubles.c -o cos_doubles.o
gcc -shared -Wl,-soname,libcos_doubles.so -o libcos_doubles.so cos_doubles.o
$ ls
cos_doubles.c cos_doubles.o libcos_doubles.so* test_cos_doubles.py
cos_doubles.h cos_doubles.py makefile

Now we can proceed to wrap this library via ctypes with direct support for (certain kinds of) Numpy
arrays:

""" Example of wrapping a C library function that accepts a C double array as
    input using the numpy.ctypeslib. """
import numpy as np
import numpy.ctypeslib as npct
from ctypes import c_int

15.4 SWIG

SWIG, the Simplified Wrapper Interface Generator, is a software development tool that connects programs
written in C and C++ with a variety of high-level programming languages, including Python.
The important thing with SWIG is that it can autogenerate the wrapper code for you. While this is an
advantage in terms of development time, it can also be a burden. The generated files tend to be quite
large and may not be very human-readable, and the multiple levels of indirection which result from the
wrapping process may be a bit tricky to understand.

Note: The autogenerated C code uses the Python-C-API.
$ ls
cos_module.c cos_module.h cos_module.i setup.py
$ python setup.py build_ext --inplace
running build_ext
building '_cos_module' extension
swigging cos_module.i to cos_module_wrap.c
swig -python -o cos_module_wrap.c cos_module.i
creating build
creating build/temp.linux-x86_64-2.7
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/include/python2.7 -c cos_module.c -o build/temp.linux-x86_64-2.7/cos_module.o
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/include/python2.7 -c cos_module_wrap.c -o build/temp.linux-x86_64-2.7/cos_module_wrap.o
gcc -pthread -shared build/temp.linux-x86_64-2.7/cos_module.o build/temp.linux-x86_64-2.7/cos_module_wrap.o -L/home/esc/anaconda/lib -lpython2.7 -o /home/esc/git-working/scipy-lecture-

We can now load and execute the cos_module as we have done in the previous examples:

In [1]: import cos_module

In [2]: cos_module?
Type: module
String Form:<module 'cos_module' from 'cos_module.py'>
File: /home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/swig/cos_module.py

Again we test for robustness, and we see that we get a better error message (although, strictly speaking,
in Python there is no double type):

In [7]: cos_module.cos_func('foo')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-11bee483665d> in <module>()
----> 1 cos_module.cos_func('foo')

TypeError: in method 'cos_func', argument 1 of type 'double'

15.4.2 Numpy Support

Numpy provides support for SWIG with the numpy.i file. This interface file defines various so-called
typemaps which support conversion between Numpy arrays and C-Arrays. In the following example we
will take a quick look at how such typemaps work in practice.

/* Compute the cosine of each element in in_array, storing the result in
 * out_array. */
void cos_doubles(double * in_array, double * out_array, int size){
    int i;
    for(i=0;i<size;i++){
        out_array[i] = cos(in_array[i]);
    }
}

This is wrapped as cos_doubles_func using the following SWIG interface file:
    /* the resulting C file should be built as a python extension */
    # define SWIG_FILE_WITH_INIT
    /* Includes the header in the wrapper code */
%}
%}

/* typemaps for the two arrays, the second will be modified in-place */
%apply (double* IN_ARRAY1, int DIM1) {(double * in_array, int size_in)}
%apply (double* INPLACE_ARRAY1, int DIM1) {(double * out_array, int size_out)}

%inline %{
/* takes as input two numpy arrays */
void cos_doubles_func(double * in_array, int size_in, double * out_array, int size_out) {
    /* calls the original function, providing only the size of the first */
    cos_doubles(in_array, out_array, size_in);
}
%}

• To use the Numpy typemaps, we need to include the numpy.i file.
• Observe the call to import_array() which we encountered already in the Numpy-C-API example.
• Since the type maps only support the signature ARRAY, SIZE we need to wrap the cos_doubles
  as cos_doubles_func which takes two arrays including sizes as input.
• As opposed to the simple SWIG example, we don’t include the cos_doubles.h header. There
  is nothing there that we wish to expose to Python since we expose the functionality through
  cos_doubles_func.

And, as before, we can use distutils to wrap this:

setup(ext_modules=[Extension("_cos_doubles",
      sources=["cos_doubles.c", "cos_doubles.i"],
      include_dirs=[numpy.get_include()])])

$ ls
cos_doubles.c cos_doubles.h cos_doubles.i numpy.i setup.py test_cos_doubles.py
$ python setup.py build_ext -i
running build_ext
building '_cos_doubles' extension
swigging cos_doubles.i to cos_doubles_wrap.c
swig -python -o cos_doubles_wrap.c cos_doubles.i
cos_doubles.i:24: Warning(490): Fragment 'NumPy_Backward_Compatibility' not found.
cos_doubles.i:24: Warning(490): Fragment 'NumPy_Backward_Compatibility' not found.
cos_doubles.i:24: Warning(490): Fragment 'NumPy_Backward_Compatibility' not found.
˓→anaconda/include/python2.7 -c cos_doubles.c -o build/temp.linux-x86_64-2.7/cos_doubles.o
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/esc/
˓→wrap.o
from cos_doubles_wrap.c:2706:
/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include/numpy/npy_deprecated_api.h:11:2: warning: #warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
gcc -pthread -shared build/temp.linux-x86_64-2.7/cos_doubles.o build/temp.linux-x86_64-2.7/cos_doubles_wrap.o -L/home/esc/anaconda/lib -lpython2.7 -o /home/esc/git-working/scipy-lecture-
$ ls
build/ cos_doubles.h cos_doubles.py cos_doubles_wrap.c setup.py
cos_doubles.c cos_doubles.i _cos_doubles.so* numpy.i test_cos_doubles.py

And, as before, we convince ourselves that it worked:

import numpy as np
import matplotlib.pyplot as plt
import cos_doubles

x = np.arange(0, 2 * np.pi, 0.1)
y = np.empty_like(x)

cos_doubles.cos_doubles_func(x, y)
plt.plot(x, y)
plt.show()

15.5 Cython

Cython is both a Python-like language for writing C-extensions and an advanced compiler for this
language. The Cython language is a superset of Python, which comes with additional constructs that
allow you to call C functions and annotate variables and class attributes with C types. In this sense one
could also call it a Python with types.

In addition to the basic use case of wrapping native code, Cython supports an additional use-case,
namely interactive optimization. Basically, one starts out with a pure-Python script and incrementally
adds Cython types to the bottleneck code to optimize only those code paths that really matter.

In this sense it is quite similar to SWIG, since the code can be autogenerated, but in a sense it is also quite
similar to ctypes since the wrapping code can (almost) be written in Python.

While other solutions that autogenerate code can be quite difficult to debug (for example SWIG), Cython
comes with an extension to the GNU debugger that helps debug Python, Cython and C code.

Advantages

• Python-like language for writing C-extensions
• Autogenerated code
• Supports incremental optimization
• Includes a GNU debugger extension
• Support for C++ (since version 0.13)

Again we can use the standard distutils module, but this time we need some additional pieces from
Cython.Distutils:

from distutils.core import setup, Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("cos_module", ["cos_module.pyx"])]
)

$ cd advanced/interfacing_with_c/cython
$ ls
cos_module.pyx setup.py
$ python setup.py build_ext --inplace
running build_ext
cythoning cos_module.pyx to cos_module.c
building 'cos_module' extension
creating build
creating build/temp.linux-x86_64-2.7
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/include/python2.7 -c cos_module.c -o build/temp.linux-x86_64-2.7/cos_module.o
gcc -pthread -shared build/temp.linux-x86_64-2.7/cos_module.o -L/home/esc/anaconda/lib -lpython2.7 -o /home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/cython/cos_module.so

And running it:

In [1]: import cos_module

In [2]: cos_module?
Type: module
String Form:<module 'cos_module' from 'cos_module.so'>
File: /home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/cython/cos_module.so

And, testing a little for robustness, we can see that we get good error messages:

In [7]: cos_module.cos_func('foo')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-11bee483665d> in <module>()
----> 1 cos_module.cos_func('foo')

/home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/cython/cos_module.so in cos_module.cos_func (cos_module.c:506)()

Additionally, it is worth noting that Cython ships with complete declarations for the C math library,
which simplifies the code above to become:

""" Simpler example of wrapping cos function from math.h using Cython. """

from libc.math cimport cos
15.5.2 Numpy Support

Cython has support for Numpy via the numpy.pxd file which allows you to add the Numpy array type
to your Cython code. I.e. like specifying that variable i is of type int, you can specify that variable
a is of type numpy.ndarray with a given dtype. Also, certain optimizations such as turning off bounds checking
are supported. Look at the corresponding section in the Cython documentation. In case you want to
pass Numpy arrays as C arrays to your Cython wrapped C functions, there is a section about this in the
Cython documentation.

In the following example, we will show how to wrap the familiar cos_doubles function using Cython.

void cos_doubles(double * in_array, double * out_array, int size);

/* Compute the cosine of each element in in_array, storing the result in
 * out_array. */
void cos_doubles(double * in_array, double * out_array, int size){
    int i;
    for(i=0;i<size;i++){
        out_array[i] = cos(in_array[i]);
    }
}

This is wrapped as cos_doubles_func using the following Cython code:

""" Example of wrapping a C function that takes C double arrays as input using
    the Numpy declarations from Cython """

# cimport the Cython declarations for numpy
cimport numpy as np

# if you want to use the Numpy-C-API from Cython
# (not strictly necessary for this example, but good practice)
np.import_array()

# cdefine the signature of our c function
cdef extern from "cos_doubles.h":
    void cos_doubles (double * in_array, double * out_array, int size)

# create the wrapper code, with numpy type annotations
def cos_doubles_func(np.ndarray[double, ndim=1, mode="c"] in_array not None,
                     np.ndarray[double, ndim=1, mode="c"] out_array not None):
    cos_doubles(<double*> np.PyArray_DATA(in_array),
                <double*> np.PyArray_DATA(out_array),
                in_array.shape[0])

setup(
    cmdclass={'build_ext': build_ext},

• As with the previous compiled Numpy examples, we need the include_dirs option.

$ ls
cos_doubles.c cos_doubles.h _cos_doubles.pyx setup.py test_cos_doubles.py
$ python setup.py build_ext -i
running build_ext
cythoning _cos_doubles.pyx to _cos_doubles.c
building 'cos_doubles' extension
creating build
creating build/temp.linux-x86_64-2.7
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/esc/anaconda/include/python2.7 -c _cos_doubles.c -o build/temp.linux-x86_64-2.7/_cos_doubles.o
from /home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17,
from /home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:15,
from _cos_doubles.c:253:
/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include/numpy/npy_deprecated_api.h:11:2: warning: #warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include/numpy/__ufunc_api.h:236: warning: ‘_import_umath’ defined but not used
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/esc/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/esc/anaconda/include/python2.7 -c cos_doubles.c -o build/temp.linux-x86_64-2.7/cos_doubles.o
gcc -pthread -shared build/temp.linux-x86_64-2.7/_cos_doubles.o build/temp.linux-x86_64-2.7/cos_doubles.o -L/home/esc/anaconda/lib -lpython2.7 -o /home/esc/git-working/scipy-lecture-notes/advanced/interfacing_with_c/cython_numpy/cos_doubles.so
$ ls
build/ _cos_doubles.c cos_doubles.c cos_doubles.h _cos_doubles.pyx cos_doubles.so* setup.py test_cos_doubles.py

And, as before, we convince ourselves that it worked:

import numpy as np
import matplotlib.pyplot as plt
import cos_doubles

x = np.arange(0, 2 * np.pi, 0.1)
y = np.empty_like(x)

cos_doubles.cos_doubles_func(x, y)
plt.plot(x, y)
plt.show()
15.6 Summary

In this section four different techniques for interfacing with native code have been presented. The table
below roughly summarizes some of the aspects of the techniques.

x              Part of CPython   Compiled   Autogenerated   Numpy Support
Python-C-API   True              True       False           True
Ctypes         True              False      False           True
Swig           False             True       True            True
Cython         False             True       True            True

Of the four presented techniques, Cython is the most modern and advanced. In particular, the ability
to optimize code incrementally by adding types to your Python code is unique.

15.7 Further Reading and References

• Gaël Varoquaux’s blog post about avoiding data copies provides some insight on how to handle
  memory management cleverly. If you ever run into issues with large datasets, this is a reference to
  come back to for some inspiration.

15.8 Exercises

Since this is a brand new section, the exercises are considered more as pointers as to what to look at
next, so pick the ones that you find more interesting. If you have good ideas for exercises, please let us
know!

1. Download the source code for each example and compile and run them on your machine.
2. Make trivial changes to each example and convince yourself that this works. (E.g. change cos for
   sin.)
3. Most of the examples, especially the ones involving Numpy, may still be fragile and respond badly
   to input errors. Look for ways to crash the examples, figure out what the problem is and devise a
   potential solution. Here are some ideas:
   (a) Numerical overflow.
   (b) Input and output arrays that have different lengths.
   (c) Multidimensional array.
   (d) Empty array.
   (e) Arrays with non-double types.
4. Use the %timeit IPython magic to measure the execution time of the various solutions.

15.8.1 Python-C-API

1. Modify the Numpy example such that the function takes two input arguments, where the second
   is the preallocated output array, making it similar to the other Numpy examples.
2. Modify the example such that the function only takes a single input array and modifies this in
   place.
3. Try to fix the example to use the new Numpy iterator protocol. If you manage to obtain a working
   solution, please submit a pull-request on github.
4. You may have noticed that the Numpy-C-API example is the only Numpy example that does not
   wrap cos_doubles but instead applies the cos function directly to the elements of the Numpy
   array. Does this have any advantages over the other techniques?
5. Can you wrap cos_doubles using only the Numpy-C-API? You may need to ensure that the arrays
   have the correct type, are one dimensional and contiguous in memory.

15.8.2 Ctypes

1. Modify the Numpy example such that cos_doubles_func handles the preallocation for you, thus
   making it more like the Numpy-C-API example.

15.8.3 SWIG

1. Look at the code that SWIG autogenerates. How much of it do you understand?
2. Modify the Numpy example such that cos_doubles_func handles the preallocation for you, thus
   making it more like the Numpy-C-API example.
3. Modify the cos_doubles C function so that it returns an allocated array. Can you wrap this using
   SWIG typemaps? If not, why not? Is there a workaround for this specific situation? (Hint: you
   know the size of the output array, so it may be possible to construct a Numpy array from the
   returned double *.)

15.8.4 Cython

1. Look at the code that Cython autogenerates. Take a closer look at some of the comments that
   Cython inserts. What do you see?
2. Look at the section Working with Numpy from the Cython documentation to learn how to
   incrementally optimize a pure python script that uses Numpy.
3. Modify the Numpy example such that cos_doubles_func handles the preallocation for you, thus
   making it more like the Numpy-C-API example.
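For the %timeit exercise above, a comparable measurement can be made outside IPython with the standard-library timeit module. The sketch below times math.cos only as a stand-in; to compare the wrappers from this chapter, the setup string would import the compiled module instead (for example cos_module, assuming you have built it).

```python
import timeit

# Time 100000 calls of math.cos, repeated 5 times; the minimum is the
# least disturbed by other activity on the machine.
number = 100000
timings = timeit.repeat("cos(1.0)", setup="from math import cos",
                        repeat=5, number=number)
best = min(timings)
print("best of 5: %.3f us per call" % (best / number * 1e6))
```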
This part of the Scipy lecture notes is dedicated to various scientific packages useful for extended needs.
Part III
Scipy lecture notes, Edition 2022.1
16
Contents
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932

Tip: Why Python for statistics?

Reading from a CSV file: Using the above CSV file that gives observations of brain size and weight
and IQ (Willerman et al. 1991), the data are a mixture of numerical and categorical values:

>>> import pandas
>>> data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
>>> data
   Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0           1  Female   133  132  124   118.0    64.5     816932
1           2    Male   140  150  124     NaN    72.5    1001121
2           3    Male   139  123  150   143.0    73.3    1038437

Manipulating data

data is a pandas.DataFrame, that resembles R’s dataframe:

>>> data.shape # 40 rows and 8 columns
(40, 8)

>>> data.columns # It has columns
Index([u'Unnamed: 0', u'Gender', u'FSIQ', u'VIQ', u'PIQ', u'Weight', u'Height', u'MRI_Count'], dtype='object')

Note: For a quick view on a large dataframe, use its describe method: pandas.DataFrame.describe().

Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D ‘series’, e.g. arrays
or lists. If we have 3 numpy arrays:
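A minimal sketch of building a dataframe from three arrays follows. The array names and values (`t`, `sin_t`, `cos_t`) are illustrative assumptions; any 1D arrays of equal length would do.

```python
import numpy as np
import pandas

# Three 1D arrays of the same length (illustrative values)
t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)

# Each dictionary key becomes a column name, each array a column
df = pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})
print(df.shape)
print(df.head())
```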
groupby_gender is a powerful object that exposes many operations on the resulting group of dataframes:
>>> groupby_gender.mean()
Unnamed: 0 FSIQ VIQ PIQ Weight Height MRI_Count
Gender
Female 19.65 111.9 109.45 110.45 137.200000 65.765000 862654.6
Male 21.35 115.0 115.25 111.60 166.444444 71.431579 954855.4
Tip: Use tab-completion on groupby_gender to find more. Other common grouping functions are
median, count (useful for checking to see the amount of missing values in different subsets) or sum.
Groupby evaluation is lazy, no work is done until an aggregation function is applied.
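The lazy behaviour can be illustrated on a small dataframe. The data below is synthetic and invented for this sketch (it is not the brain-size dataset); the column names merely mimic it.

```python
import pandas

# Synthetic toy data, NOT the brain-size dataset
toy = pandas.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'VIQ': [132.0, 150.0, 90.0, 100.0, float('nan')],
})

grouped = toy.groupby('Gender')      # only records the grouping, no statistics yet
means = grouped['VIQ'].mean()        # work happens here; NaN values are skipped
counts = grouped['VIQ'].count()      # number of non-missing values per group
sizes = grouped.size()               # number of rows per group, missing or not
print(means)
```

Comparing `counts` with `sizes` is one way to spot how many values are missing in each subset.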
Two populations
Exercise
• What is the mean value for VIQ for the full population?
• How many males/females were included in this study?
Hint use ‘tab completion’ to find out the methods that can be called, instead of ‘mean’ in the
above example.
• What is the average value of MRI counts expressed in log units, for males and females?
Note: groupby_gender.boxplot is used for the plots above (see this example).
Plotting data
Pandas comes with some plotting tools (pandas.plotting, called pandas.tools.plotting in older pandas
releases, using matplotlib behind the scenes) to display statistics of the data in dataframes:
Scatter matrices:
Exercise
Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations
correspond to gender?
16.2 Hypothesis testing: comparing two groups
For simple statistical tests, we will use the scipy.stats sub-module of scipy:
See also:
Scipy is a vast library. For a quick summary of the whole library, see the scipy chapter.
scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value
(technically if observations are drawn from a Gaussian distribution of given population mean). It returns
the T statistic, and the p-value (see the function’s help):
Tip: With a p-value of 10^-28 we can claim that the population mean for the IQ (VIQ measure) is not
0.
16.2.2 Paired tests: repeated measurements on the same individuals
The problem with this approach is that it forgets that there are links between observations: FSIQ
and PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is
confounding, and can be removed, using a “paired test”, or “repeated measures test”:
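To make the tests above concrete, the T statistics can be written out with the standard library alone. This is a sketch of the textbook formulas, not scipy's implementation: it computes only the statistic, whereas scipy.stats.ttest_1samp, ttest_ind and ttest_rel also return the p-value. The function names and the scores are made up for illustration.

```python
import math
import statistics


def t_1samp(sample, popmean):
    """T statistic for H0: the population mean equals popmean."""
    n = len(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    return (statistics.mean(sample) - popmean) / sem


def t_ind(a, b):
    """2-sample T statistic assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return ((statistics.mean(a) - statistics.mean(b))
            / math.sqrt(pooled * (1.0 / na + 1.0 / nb)))


def t_paired(a, b):
    """Paired T statistic: a 1-sample test on the per-subject differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return t_1samp(diffs, 0.0)


# Made-up scores for 5 subjects, each measured twice
first = [100.0, 120.0, 90.0, 130.0, 110.0]
second = [102.0, 123.0, 91.0, 134.0, 112.0]
print(t_ind(first, second), t_paired(first, second))
```

With these numbers the subjects differ a lot from one another but each shifts by a similar small amount, so the paired statistic is much larger in magnitude than the independent one: the inter-subject variance has been removed.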
Exercise
16.3 Linear models, multiple factors, and analysis of variance

16.3.1 “formulas” to specify statistical models in Python

A simple linear regression

Given two sets of observations, x and y, we want to
test the hypothesis that y is a linear function of x. In other terms:
y = x * coef + intercept + e
where e is observation noise. We will use the statsmodels module to:
1. Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
2. Test that coef is non zero.

>>> print(model.summary())
                            OLS Regression Results
==========================...
Dep. Variable:                      y   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     74.03
Date:                             ...   Prob (F-statistic):           8.56e-08
Time:                             ...   Log-Likelihood:                -57.988
No. Observations:                  20   AIC:                             120.0
Df Residuals:                      18   BIC:                             122.0
Df Model:                           1
Covariance Type:            nonrobust
==========================...
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------...
Intercept     -5.5335      1.036     -5.342      0.000      -7.710      -3.357
x              2.9369      0.341      8.604      0.000       2.220       3.654
==========================...
Omnibus:                        0.100   Durbin-Watson:                   2.956
Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
Skew:                          -0.058   Prob(JB):                        0.851
Kurtosis:                       2.390   Cond. No.                         3.03
==========================...

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Terminology:
Statsmodels uses a statistical terminology: the y variable in statsmodels is called ‘endogenous’ while
the x variable is called ‘exogenous’. This is discussed in more detail in the statsmodels documentation.
To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the
features you are using to make the prediction.
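To see what the fit of y = x * coef + intercept + e computes, here is a sketch of ordinary least squares for a single predictor using the closed-form solution and only the standard library. It is not statsmodels' implementation and does not compute the standard errors or p-values shown in the summary; `fit_line` is a hypothetical helper name and the data is synthetic.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = x * coef + intercept + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    coef = sxy / sxx
    intercept = mean_y - coef * mean_x
    return coef, intercept


# Synthetic data: a line with slope 3 and intercept -5 plus Gaussian noise
rng = random.Random(0)
x = [-5 + 10 * i / 19 for i in range(20)]
y = [3 * xi - 5 + rng.gauss(0, 1) for xi in x]
coef, intercept = fit_line(x, y)
print(coef, intercept)
```

The estimates land close to the true slope and intercept but not on them; quantifying that uncertainty is what the 'std err', 't' and 'P>|t|' columns of the statsmodels summary are for.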
“formulas” for statistics in Python
See the statsmodels documentation
Categorical variables: comparing groups or multiple categories
Let us go back to the data on brain size:
>>> data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")

We can write a comparison between IQ of male and female using a linear model:

>>> model = ols("VIQ ~ Gender + 1", data).fit()
>>> print(model.summary())
                            OLS Regression Results
==========================...
Dep. Variable:                    VIQ   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5969
Date:                             ...   Prob (F-statistic):              0.445
Time:                             ...   Log-Likelihood:                -182.42
No. Observations:                  40   AIC:                             368.8
Df Residuals:                      38   BIC:                             372.2
Df Model:                           1
Covariance Type:            nonrobust
==========================...
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------...
Intercept        109.4500      5.308     20.619      0.000      98.704     120.196
Gender[T.Male]     5.8000      7.507      0.773      0.445      -9.397      20.997
==========================...
Omnibus:                       26.188   Durbin-Watson:                   1.709
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                3.703
Skew:                           0.010   Prob(JB):                        0.157
Kurtosis:                       1.510   Cond. No.                         2.62
==========================...

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Forcing categorical: the ‘Gender’ is automatically detected as a categorical variable, and thus each
of its different values are treated as different entities.
An integer column can be forced to be treated as categorical using:
>>> model = ols('VIQ ~ C(Gender)', data).fit()
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept
using + 1.
Tip: By default, statsmodels treats a categorical variable with K possible values as K-1 ‘dummy’
boolean variables (one level, by default the first, serving as the reference absorbed into the intercept
term; above, Female is the reference and only Gender[T.Male] appears). This is almost always a
good default choice - however, it is possible to specify different encodings for categorical variables
(http://statsmodels.sourceforge.net/devel/contrasts.html).

To compare different types of IQ, we need to create a “long-form” table, listing IQs, where the type
of IQ is indicated by a categorical variable:

>>> data_fisq = pandas.DataFrame({'iq': data['FSIQ'], 'type': 'fsiq'})
>>> data_piq = pandas.DataFrame({'iq': data['PIQ'], 'type': 'piq'})
>>> data_long = pandas.concat((data_fisq, data_piq))
>>> print(data_long)
     iq  type
0   133  fsiq
1   140  fsiq
2   139  fsiq
...
31  137   piq
32  110   piq
33   86   piq
...

>>> model = ols("iq ~ type", data_long).fit()
>>> print(model.summary())
                            OLS Regression Results
...
==========================...
                  coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------...
Intercept     113.4500      3.683     30.807      0.000     106.119     120.781
type[T.piq]    -2.4250      5.208     -0.466      0.643     -12.793       7.943
...

We can see that we retrieve the same values for the t-test and corresponding p-values for the effect of the
type of IQ as in the previous t-test:

>>> stats.ttest_ind(data['FSIQ'], data['PIQ'])
Ttest_indResult(statistic=0.46563759638..., pvalue=0.64277250...)

Consider a linear model explaining a variable z (the dependent variable) with 2 variables x and y:
z = x * c1 + y * c2 + i + e
Such a model can be seen in 3D as fitting a plane to a cloud of (x, y, z) points.
Example: the iris data (examples/iris.csv)
Tip: Sepal and petal size tend to be related: bigger flowers are bigger! But is there in addition a
systematic effect of species?
16.3.3 Post-hoc hypothesis testing: analysis of variance (ANOVA)
In the above iris example, we wish to test whether the petal length differs between versicolor and virginica,
after removing the effect of sepal width. This can be formulated as testing the difference between the
coefficients associated with versicolor and virginica in the linear model estimated above (it is an Analysis of
Variance, ANOVA). For this, we write a ‘contrast’ vector on the estimated parameters: we want
to test "name[T.versicolor] - name[T.virginica]", with an F-test:
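As a rough illustration of what such a contrast F-test computes, the following sketch uses plain NumPy on synthetic data. It uses neither the iris data nor the statsmodels API: the three groups, the coefficients and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 90
group = np.repeat([0, 1, 2], n // 3)          # three hypothetical categories
covariate = rng.normal(size=n)
y = (1.0 + 0.5 * (group == 1) + 0.8 * (group == 2)
     + 2.0 * covariate + rng.normal(size=n))

# Columns: intercept, dummy for group 1, dummy for group 2, covariate
X = np.column_stack([np.ones(n), group == 1, group == 2, covariate]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof                   # residual variance estimate

# Contrast vector: coefficient of group 1 minus coefficient of group 2
c = np.array([0.0, 1.0, -1.0, 0.0])
estimate = c @ beta
variance = sigma2 * (c @ np.linalg.inv(X.T @ X) @ c)
f_stat = estimate ** 2 / variance              # F statistic with (1, dof) dof
print(f_stat, dof)
```

The contrast vector has one entry per fitted parameter, in the order of the design-matrix columns; this is the same role played by a vector such as [0, 1, -1, 0] passed to an F-test on a fitted model.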
Exercise
Going back to the brain size + IQ data, test if the VIQ of male and female are different after removing
the effect of brain size, height and weight.
16.3. Linear models, multiple factors, and analysis of variance 482 16.4. More visualization: seaborn for statistical exploration 483
>>> seaborn.pairplot(data, vars=['WAGE', 'AGE', 'EDUCATION'],
...                  kind='reg', hue='SEX')
Look and feel and matplotlib settings
Seaborn changes the default of matplotlib figures to achieve a more “modern”, “excel-like” look. It
does that upon import. You can reset the default using:
>>> from matplotlib import pyplot as plt
>>> plt.rcdefaults()
Tip: To switch back to seaborn settings, or to better understand styling in seaborn, see the relevant
section of the seaborn documentation.
• Hypothesis testing and p-values give you the significance of an effect / difference.
• Formulas (with categorical variables) enable you to express rich links in your data.
• Visualizing your data and fitting simple models give insight into the data.
• Conditioning (adding factors that can explain all or part of the variation) is an important
modeling aspect that changes the interpretation.
16.4. More visualization: seaborn for statistical exploration 486 16.5. Testing for interactions 487
16.6 Full code for the figures
Plot boxplots for FSIQ, PIQ, and the paired difference between the two: while the spread (error bars)
for FSIQ and PIQ is very large, there is a systematic (common) effect due to the subjects. This effect
is cancelled out in the difference, and the spread of the difference (“paired” by subject) is much smaller
than the spread of the individual measures.
import pandas
data = pandas.read_csv('brain_size.csv', sep=';', na_values='.')
plt.show()
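The effect of pairing on the spread can be reproduced with a small simulation using only the standard library. The numbers below are invented for illustration and are not the brain-size data: each hypothetical subject has a large individual effect shared by both measures, plus a small independent noise on each measure.

```python
import random
import statistics

random.seed(0)
# Hypothetical subjects: a large per-subject effect common to both measures
subject_effect = [random.gauss(100, 20) for _ in range(40)]
measure_a = [s + random.gauss(0, 3) for s in subject_effect]
measure_b = [s + random.gauss(0, 3) for s in subject_effect]
# Pairing by subject cancels the common effect
paired_diff = [a - b for a, b in zip(measure_a, measure_b)]

spread_a = statistics.stdev(measure_a)
spread_b = statistics.stdev(measure_b)
spread_diff = statistics.stdev(paired_diff)
print(spread_a, spread_b, spread_diff)
```

The standard deviation of the paired difference only reflects the measurement noise, so it comes out much smaller than the spread of either measure taken alone.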
import pandas
from pandas import plotting
# The parameter 'c' is passed to plt.scatter and will control the color
plotting.scatter_matrix(data, c=categories.codes, marker='o')
fig = plt.gcf()
fig.suptitle("blue: setosa, green: versicolor, red: virginica", size=13)
import pandas
Statistical analysis
# Let us try to explain the sepal width as a function of the petal
# length and the category of iris
model = ols('sepal_width ~ name + petal_length', data).fit()
print(model.summary())
# Now formulate a "contrast", to test if the offset for versicolor and
# virginica are identical
print('Testing the difference between effect of versicolor and virginica')
print(model.f_test([0, 1, -1, 0]))
plt.show()
Out:
OLS Regression Results
==============================================================================
Dep. Variable: sepal_width R-squared: 0.478
Model: OLS Adj. R-squared: 0.468
Method: Least Squares F-statistic: 44.63
Date: Thu, 18 Aug 2022 Prob (F-statistic): 1.58e-20
Time: 10:40:00 Log-Likelihood: -38.185
No. Observations: 150 AIC: 84.37
Df Residuals: 146 BIC: 96.41
Df Model: 3
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
...
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Testing the difference between effect of versicolor and virginica
<F test: F=array([[3.24533535]]), p=0.07369058781700064, df_denom=146, df_num=1>
import numpy as np
import matplotlib.pyplot as plt
import pandas
# For statistics. Requires statsmodels 0.5.0 or more
from statsmodels.formula.api import ols
# Analysis of Variance (ANOVA) on linear models
from statsmodels.stats.anova import anova_lm
Generate and show the data
x = np.linspace(-5, 5, 20)
# To get reproducible values, provide a seed value
np.random.seed(1)
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
# Plot the data
plt.figure(figsize=(5, 4))
plt.plot(x, y, 'o')
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
ANOVA results
            df       sum_sq      mean_sq          F        PR(>F)
x          1.0  1588.873443  1588.873443  74.029383  8.560649e-08
Residual  18.0   386.329330    21.462741        NaN           NaN
plt.show()
# Convert the data into a Pandas DataFrame to use the formulas framework
# in statsmodels
data = pandas.DataFrame({'x': x, 'y': y})
print('\nANOVA results')
print(anova_results)
Out:
import numpy as np
import matplotlib.pyplot as plt
import pandas
x = np.linspace(-5, 5, 21)
# We generate a 2D grid
X, Y = np.meshgrid(x, x)
print('\nANOVA results')
print(anova_results)
plt.show()
Out:
ANOVA results
             df        sum_sq       mean_sq           F        PR(>F)
x           1.0  39284.301219  39284.301219  623.962799  2.888238e-86
y           1.0   1055.220089   1055.220089   16.760336  5.050899e-05
Residual  438.0  27576.201607     62.959364         NaN           NaN
import pandas
import urllib.request
import os
if not os.path.exists('wages.txt'):
    # Download the file if it is not present
    urllib.request.urlretrieve('http://lib.stat.cmu.edu/datasets/CPS_85_Wages',
                               'wages.txt')
statistical analysis
import seaborn
seaborn.pairplot(data, vars=['WAGE', 'AGE', 'EDUCATION'],
kind='reg')
import os
import urllib.request
import pandas
if not os.path.exists('airfares.txt'):
    # Download the file if it is not present
    urllib.request.urlretrieve(
        'http://www.stat.ufl.edu/~winner/data/airq4.dat',
import seaborn
seaborn.pairplot(data_flat, vars=['fare', 'dist', 'nb_passengers'],
kind='reg', markers='.')
# A second plot, to show the effect of the year (ie the 9/11 effect)
seaborn.pairplot(data_flat, vars=['fare', 'dist', 'nb_passengers'],
kind='reg', hue='year', markers='.')
Plot the difference in fare
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 2))
seaborn.boxplot(data.fare_2001 - data.fare_2000)
plt.title('Fare: 2001 - 2000')
plt.subplots_adjust()
plt.figure(figsize=(5, 2))
seaborn.boxplot(data.nb_passengers_2001 - data.nb_passengers_2000)
plt.title('NB passengers: 2001 - 2000')
plt.subplots_adjust()
Statistical testing: dependence of fare on distance and number of passengers
import statsmodels.formula.api as sm
Out:
OLS Regression Results
==============================================================================
Dep. Variable: fare R-squared: 0.275
Model: OLS Adj. R-squared: 0.275
Method: Least Squares F-statistic: 1585.
Date: Thu, 18 Aug 2022 Prob (F-statistic): 0.00
Time: 10:40:20 Log-Likelihood: -45532.
No. Observations: 8352 AIC: 9.107e+04
Df Residuals: 8349 BIC: 9.109e+04
Df Model: 2
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 211.2428 2.466 85.669 0.000 206.409 216.076
dist 0.0484 0.001 48.149 0.000 0.046 0.050
nb_passengers -32.8925 1.127 -29.191 0.000 -35.101 -30.684
==============================================================================
Omnibus: 604.051 Durbin-Watson: 1.446
Prob(Omnibus): 0.000 Jarque-Bera (JB): 740.733
Skew: 0.710 Prob(JB): 1.42e-161
Kurtosis: 3.338 Cond. No. 5.23e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.23e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: fare No. Observations: 8352
Model: RLM Df Residuals: 8349
Method: IRLS Df Model: 2
Norm: HuberT
Scale Est.: mad
Cov Type: H1
import pandas
from statsmodels.formula.api import ols
Out: Out:
16.6. Full code for the figures 512 16.7. Solutions to this chapter’s exercises 513
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.4e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
<F test: F=array([[0.68319608]]), p=0.4140878441244722, df_denom=35, df_num=1>
Here we plot a scatter matrix to get intuitions on our results. This goes beyond what was asked in the
exercise.
# The parameter 'c' is passed to plt.scatter and will control the color
# The same holds for parameters 'marker', 'alpha' and 'cmap', that
# control respectively the type of marker used, their transparency and
# the colormap
plotting.scatter_matrix(data[['VIQ', 'MRI_Count', 'Height']],
c=(data['Gender'] == 'Female'), marker='o',
alpha=1, cmap='winter')
fig = plt.gcf()
fig.suptitle("blue: male, green: female", size=13)
plt.show()
CHAPTER 17
Author: Fabian Pedregosa
• Algebraic manipulations
– Expand
– Simplify
• Calculus
– Limits
– Differentiation
– Series expansion
– Integration
• Equation solving
• Linear Algebra
– Matrices
– Differential Equations
>>> a
1/2
– Symbols
>>> x = sym.Symbol('x')
>>> y = sym.Symbol('y')
Symbols can now be manipulated using some of python operators: +, -, *, ** (arithmetic), &, |, ~,
>>, << (boolean).
Printing
Sympy allows for control of the display of the output. From here we use the following setting for
printing:
>>> sym.init_printing(use_unicode=False, wrap_line=True)
SymPy is capable of performing powerful algebraic manipulations. We’ll take a look into some of the
most frequently used: expand and simplify.
17.2.1 Expand
Use this to expand an algebraic expression. It will try to denest powers and multiplications:
Simplification is a somewhat vague term, and more precise alternatives to simplify exist: powsimp
(simplification of exponents), trigsimp (for trigonometric expressions), logcombine, radsimp, together.
17.3 Calculus
17.3.1 Limits
Limits are easy to use in SymPy; they follow the syntax limit(function, variable, point), so to
compute the limit of 𝑓 (𝑥) as 𝑥 → 0, you would issue limit(f, x, 0):
>>> sym.limit(sym.sin(x) / x, x, 0)
1
>>> sym.limit(1 / x, x, sym.oo)
0
>>> sym.limit(x ** x, x, 0)
1
Higher derivatives can be calculated using the diff(func, var, n) method:
>>> sym.series(sym.cos(x), x)
     2    4
    x    x     / 6\
1 - -- + -- + O\x /
    2    24
>>> sym.series(1/sym.cos(x), x)
     2      4
    x    5*x     / 6\
1 + -- + ---- + O\x /
    2     24
17.3.4 Integration
SymPy has support for indefinite and definite integration of transcendental elementary and special
functions via the integrate() facility, which uses the powerful extended Risch-Norman algorithm and some
heuristics and pattern matching. You can integrate elementary functions:
>>> sym.integrate(6 * x ** 5, x)
 6
x
>>> sym.integrate(sym.sin(x), x)
-cos(x)
>>> sym.integrate(sym.log(x), x)
x*log(x) - x
>>> sym.integrate(2 * x + sym.sinh(x), x)
 2
x  + cosh(x)
Also special functions are handled easily:
>>> sym.integrate(sym.exp(-x ** 2) * sym.erf(x), x)
  ____    2
\/ pi *erf (x)
--------------
      4
It is also possible to compute definite integrals:
17.4 Equation solving
SymPy is able to solve algebraic equations, in one and several variables, using solveset():
>>> sym.solveset(x ** 4 - 1, x)
{-1, 1, -I, I}
As you can see, it takes as first argument an expression that is supposed to be equal to 0. It also has
(limited) support for transcendental equations:
Sympy is able to solve a large part of polynomial equations, and is also capable of solving multiple
equations with respect to multiple variables, giving a tuple as second argument. To do this you use
the solve() command:
>>> solution = sym.solve((x + 5 * y - 2, -3 * x + 6 * y - 15), (x, y))
>>> solution[x], solution[y]
(-3, 1)
Another alternative in the case of polynomial equations is factor. factor returns the polynomial factorized
into irreducible terms, and is capable of computing the factorization over various domains:
>>> f = x ** 4 - 3 * x ** 2 + 1
>>> sym.factor(f)
/ 2        \ / 2        \
\x  - x - 1/*\x  + x - 1/
>>> sym.factor(f, modulus=5)
       2        2
(x - 2) *(x + 2)
SymPy is also able to solve boolean equations, that is, to decide if a certain boolean expression is
satisfiable or not. For this, we use the function satisfiable:
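To make concrete what deciding satisfiability means, here is a brute-force sketch in plain Python that tries every truth assignment. It is not how SymPy implements satisfiable; the helper name `brute_force_satisfiable` and its calling convention (a Python function plus a list of variable names) are assumptions of this example.

```python
from itertools import product


def brute_force_satisfiable(formula, names):
    """Return a satisfying assignment as a dict, or False if none exists."""
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if formula(**assignment):
            return assignment
    return False


# x & y can be made true; x & ~x cannot
print(brute_force_satisfiable(lambda x, y: x and y, ['x', 'y']))
print(brute_force_satisfiable(lambda x: x and not x, ['x']))
```

The exhaustive search costs 2**n evaluations for n variables, so it is only usable for a handful of symbols; it is meant to show the meaning of the result, not an efficient algorithm.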
Exercises
$x \frac{d f(x)}{d x} + f(x) - f(x)^2 = 0$
2. Solve the same equation using hint='Bernoulli'. What do you observe?
17.5.1 Matrices
Matrices are created as instances from the Matrix class:
>>> sym.Matrix([[1, 0], [0, 1]])
[1 0]
[ ]
[0 1]
>>> A**2
[x*y + 1 2*x ]
[ ]
[ 2*y x*y + 1]
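The symbolic square shown above can be sanity-checked numerically. The sketch below assumes that A is the matrix [[1, x], [y, 1]] (its definition is not shown here, but this choice is consistent with the displayed result) and substitutes arbitrary integers for the symbols; `matmul2` is a hypothetical helper written for this example.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


x, y = 3, 5                      # arbitrary numeric values for the symbols
A = [[1, x], [y, 1]]             # assumed definition of A
squared = matmul2(A, A)
# Entries of the symbolic result, evaluated at the same x and y
expected = [[x * y + 1, 2 * x], [2 * y, x * y + 1]]
print(squared, expected)
```

A numerical check at one point does not prove the symbolic identity, but it quickly catches transcription mistakes.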
f and g are now undefined functions. We can call f(x), and it will represent an unknown function:
>>> f(x)
f(x)
CHAPTER 18
Chapters contents
– Mathematical morphology
• Image segmentation
– Binary segmentation: foreground + background
– Marker based methods
• Measuring regions’ properties
• Data visualization and interaction
• Feature extraction for computer vision
• Full code examples
• Examples for the scikit-image chapter
Other Python packages are available for image processing and work with NumPy arrays:
• scipy.ndimage : for nd-arrays. Basic filtering, mathematical morphology, regions properties
• Mahotas
Also, powerful image processing libraries have Python bindings:
• OpenCV (computer vision)
• ITK (3D images and registration)
• and many others
(but they are less Pythonic and NumPy friendly, to a variable extent).
>>> import os
>>> filename = os.path.join(skimage.data_dir, 'camera.png')
>>> camera = io.imread(filename)
Works with all data formats supported by the Python Imaging Library (or any other I/O plugin
provided to imread with the plugin keyword argument).
Also works with URL image paths:
>>> logo = io.imread('http://scikit-image.org/_static/img/logo.png')
Saving to files:
Different integer sizes are possible: 8-, 16- or 32-bit, signed or unsigned.
Warning: An important (if questionable) skimage convention: float images are supposed to lie
in [-1, 1] (in order to have comparable contrast for all float images)
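The kind of rescaling implied by this convention can be sketched with NumPy alone. The helper `to_float01` below is hypothetical and is not the scikit-image conversion routine; it only shows how unsigned integer images of different bit depths can be mapped onto a common float range ([0, 1], which lies inside the [-1, 1] interval mentioned above).

```python
import numpy as np


def to_float01(image):
    """Rescale an unsigned-integer image to floats in [0, 1]."""
    info = np.iinfo(image.dtype)          # integer range of the input dtype
    return image.astype(np.float64) / info.max


img8 = np.array([[0, 128, 255]], dtype=np.uint8)
img16 = np.array([[0, 32768, 65535]], dtype=np.uint16)
print(to_float01(img8))
print(to_float01(img16))
```

After conversion, an 8-bit and a 16-bit image of the same scene have comparable values, which is the point of fixing a float range.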
18.2. Input/output, data types and colorspaces 526 18.2. Input/output, data types and colorspaces 527
Some image processing routines need to work with float arrays, and may hence output an array with a
different type and the data range from the input array
Routines converting between different colorspaces (RGB, HSV, LAB etc.) are available in skimage.
color : color.rgb2hsv, color.lab2rgb, etc. Check the docstring for the expected dtype (and data
range) of input images.
3D images
Most functions of skimage can take 3D images as input arguments. Check the docstring to know if a
function can be used on 3D images (for example MRI or CT images).
Exercise
18.3.2 Non-local filters
Non-local filters use a large region of the image (or all the image) to transform the value of one pixel:
>>> from skimage import exposure
>>> camera = data.camera()
>>> camera_equalized = exposure.equalize_hist(camera)
18.3. Image preprocessing / enhancement 528 18.3. Image preprocessing / enhancement 529
Image segmentation is the attribution of different labels to different regions of the image, for example in
order to extract the pixels of an object of interest.
Tip: The Otsu method is a simple heuristic to find a threshold to separate the foreground from the
background.
>>> plt.figure()
<Figure size ... with 0 Axes>
>>> plt.imshow(clean_border, cmap='gray')
<matplotlib.image.AxesImage object at 0x...>
>>> coins_edges = segmentation.mark_boundaries(coins, clean_border.astype(int))
• Load the coins image from the data submodule.
• Separate the coins from the background by testing several segmentation methods: Otsu
thresholding, adaptive thresholding, and watershed or random walker segmentation.
• If necessary, use a postprocessing function to improve the coins / background segmentation.
Example: compute the size and perimeter of the two segmented regions:
18.5. Measuring regions’ properties 534 18.6. Data visualization and interaction 535
(this example is taken from the plot_corner example in scikit-image)
Points of interest such as corners can then be used to match objects in different images, as described in
the plot_matching example of scikit-image.
18.7. Feature extraction for computer vision 536 18.8. Full code examples 537
import numpy as np
import matplotlib.pyplot as plt
check = np.zeros((8, 8))
check[::2, 1::2] = 1
check[1::2, ::2] = 1
plt.matshow(check, cmap='gray')
plt.show()
camera = data.camera()
plt.figure(figsize=(4, 4))
plt.imshow(camera, cmap='gray', interpolation='nearest')
plt.axis('off')
plt.tight_layout()
plt.show()
18.9. Examples for the scikit-image chapter 538 18.9. Examples for the scikit-image chapter 539
text = data.text()
hsobel_text = filters.sobel_h(text)
plt.figure(figsize=(12, 3))
plt.subplot(121)
plt.imshow(text, cmap='gray', interpolation='nearest')
plt.axis('off')
plt.subplot(122)
plt.imshow(hsobel_text, cmap='nipy_spectral', interpolation='nearest')
plt.axis('off')
plt.tight_layout()
plt.show()
from skimage import data, segmentation
from skimage import filters
import matplotlib.pyplot as plt
import numpy as np
coins = data.coins()
mask = coins > filters.threshold_otsu(coins)
clean_border = segmentation.clear_border(mask).astype(int)
coins_edges = segmentation.mark_boundaries(coins, clean_border)
plt.figure(figsize=(8, 3.5))
plt.subplot(121)
plt.imshow(clean_border, cmap='gray')
plt.axis('off')
plt.subplot(122)
plt.imshow(coins_edges)
plt.axis('off')
plt.tight_layout()
plt.show()
18.9.7 Otsu thresholding
This example illustrates automatic Otsu thresholding.
import matplotlib.pyplot as plt
from skimage import data
from skimage import filters
from skimage import exposure
camera = data.camera()
val = filters.threshold_otsu(camera)
# Histogram of the image, used for the third panel below
hist, bins_center = exposure.histogram(camera)
plt.figure(figsize=(9, 4))
plt.subplot(131)
plt.imshow(camera, cmap='gray', interpolation='nearest')
plt.axis('off')
plt.subplot(132)
plt.imshow(camera < val, cmap='gray', interpolation='nearest')
plt.axis('off')
plt.subplot(133)
plt.plot(bins_center, hist, lw=2)
plt.axvline(val, color='k', ls='--')
plt.tight_layout()
plt.show()
18.9.8 Labelling connected components of an image
This example shows how to label connected components of a binary image, using the dedicated
skimage.measure.label function.
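As background for the labelling example, the sketch below shows what connected-component labelling does, using a breadth-first flood fill in pure Python. It is a teaching sketch, not the algorithm used by skimage.measure.label: the function name `label_components` is hypothetical, and it uses 4-connectivity (horizontal and vertical neighbours only), which is a choice made for this example.

```python
from collections import deque


def label_components(binary):
    """Label 4-connected foreground regions of a 2D list of 0/1 values."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                current += 1                      # start a new component
                labels[r][c] = current
                queue = deque([(r, c)])
                while queue:                      # breadth-first flood fill
                    i, j = queue.popleft()
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and binary[ni][nj] and not labels[ni][nj]):
                            labels[ni][nj] = current
                            queue.append((ni, nj))
    return labels, current


image = [[1, 1, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 0, 1]]
labels, count = label_components(image)
print(count)
print(labels)
```

Background pixels keep the label 0 and each foreground region receives its own positive integer, which is the form of output that region-measurement routines then consume.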
n = 12
l = 256
np.random.seed(1)
im = np.zeros((l, l))
points = l * np.random.random((2, n ** 2))
im[(points[0]).astype(int), (points[1]).astype(int)] = 1
im = filters.gaussian(im, sigma=l / (4. * n))
blobs = im > 0.7 * im.mean()
all_labels = measure.label(blobs)
blobs_labels = measure.label(blobs, background=0) from matplotlib import pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from skimage import filters
from skimage import restoration
coins = data.coins()
gaussian_filter_coins = filters.gaussian(coins, sigma=2)
med_filter_coins = filters.median(coins, np.ones((3, 3)))
tv_filter_coins = restoration.denoise_tv_chambolle(coins, weight=0.1)
plt.figure(figsize=(16, 4))
plt.subplot(141)
plt.imshow(coins[10:80, 300:370], cmap='gray', interpolation='nearest')
plt.axis('off')
plt.title('Image')
plt.subplot(142)
plt.imshow(gaussian_filter_coins[10:80, 300:370], cmap='gray',
           interpolation='nearest')
plt.axis('off')
plt.title('Gaussian filter')
plt.subplot(143)
plt.imshow(med_filter_coins[10:80, 300:370], cmap='gray',
           interpolation='nearest')
plt.axis('off')
plt.title('Median filter')
plt.subplot(144)
plt.imshow(tv_filter_coins[10:80, 300:370], cmap='gray',
           interpolation='nearest')
plt.axis('off')
plt.title('TV filter')
plt.show()
18.9.11 Watershed and random walker for segmentation
This example compares two segmentation methods in order to separate two connected disks: the
watershed algorithm, and the random walker algorithm.
Both segmentation methods require seeds, that are pixels belonging unambiguously to a region. Here,
local maxima of the distance map to the background are used as seeds.
import numpy as np
from skimage.morphology import watershed
from skimage.feature import peak_local_max
from skimage import measure
from skimage.segmentation import random_walker
import matplotlib.pyplot as plt
from scipy import ndimage
# Generate an initial image with two overlapping circles
x, y = np.indices((80, 80))
x1, y1, x2, y2 = 28, 28, 44, 52
r1, r2 = 16, 20
mask_circle1 = (x - x1) ** 2 + (y - y1) ** 2 < r1 ** 2
mask_circle2 = (x - x2) ** 2 + (y - y2) ** 2 < r2 ** 2
image = np.logical_or(mask_circle1, mask_circle2)
# Now we want to separate the two objects in image
# Generate the markers as local maxima of the distance
# to the background
distance = ndimage.distance_transform_edt(image)
local_maxi = peak_local_max(
    distance, indices=False, footprint=np.ones((3, 3)), labels=image)
markers = measure.label(local_maxi)
labels_ws = watershed(-distance, markers, mask=image)
markers[~image] = -1
labels_rw = random_walker(image, markers)
plt.figure(figsize=(12, 3.5))
plt.subplot(141)
plt.imshow(image, cmap='gray', interpolation='nearest')
plt.axis('off')
plt.title('image')
plt.subplot(142)
plt.imshow(-distance, interpolation='nearest')
plt.axis('off')
plt.title('distance map')
plt.subplot(143)
plt.imshow(labels_ws, cmap='nipy_spectral', interpolation='nearest')
plt.axis('off')
plt.title('watershed segmentation')
plt.subplot(144)
plt.imshow(labels_rw, cmap='nipy_spectral', interpolation='nearest')
plt.axis('off')
plt.title('random walker segmentation')
plt.tight_layout()
plt.show()
CHAPTER 19
Traits: building interactive dialogs
Tutorial content
• Introduction
• Example
• What are Traits
– Initialisation
– Validation
– Documentation
– Visualization: opening a dialog
– Deferral
– Notification
– Some more advanced traits
19.1 Introduction
Tip: The Enthought Tool Suite enables the construction of sophisticated application frameworks for
data analysis, 2D plotting and 3D visualization. These powerful, reusable components are released under
liberal BSD-style licenses.
Tip: In this tutorial we will explore the Traits toolset and learn how to dramatically reduce the amount
of boilerplate code you write, do rapid GUI application development, and understand the ideas which
underlie other parts of the Enthought Tool Suite.
Traits and the Enthought Tool Suite are open source projects licensed under a BSD-style license.
Intended Audience
Requirements
Tip: Annual electric energy production depends on the available water supply. In some installations
the water flow rate can vary by a factor of 10:1 over the course of a year.
The second part of the behaviour is the state of the storage, which depends on controlled and uncontrolled
parameters:
$storage_{t+1} = storage_t + inflows - release - spillage - irrigation$
Warning: The data used in this tutorial are not real and might not even make sense in reality.
19.3 What are Traits
A trait is a type definition that can be used for normal Python object attributes, giving the attributes
some additional characteristics:
• Standardization:
– Initialization
– Validation
Trait     Python Type             Built-in Default Value
Bool      Boolean                 False
Complex   Complex number          0+0j
Float     Floating point number   0.0
Int       Plain integer           0
Long      Long integer            0L
Str       String                  ''
Unicode   Unicode                 u''
A number of other predefined trait types exist: Array, Enum, Range, Event, Dict, List, Color, Set,
Expression, Code, Callable, Type, Tuple, etc.
Custom default values can be defined in the code:
from traits.api import HasTraits, Str, Float

class Reservoir(HasTraits):
    name = Str
    max_storage = Float(100)
19.3.2 Validation
Every trait does validation when the user tries to set its content:
reservoir = Reservoir(name='Lac de Vouglans', max_storage=605)
reservoir.max_storage = '230'
---------------------------------------------------------------------------
TraitError Traceback (most recent call last)
.../scipy-lecture-notes/advanced/traits/<ipython-input-7-979bdff9974a> in <module>()
----> 1 reservoir.max_storage = '230'
.../traits/trait_handlers.pyc in error(self, object, name, value)
166 """
167 raise TraitError( object, name, self.full_info( object, name, value ),
--> 168 value )
169
170 def arg_error ( self, method, arg_num, object, name, value ):
TraitError: The 'max_storage' trait of a Reservoir instance must be a float, but a value of '230'
<type 'str'> was specified.
19.3.3 Documentation
By essence, all the traits provide documentation about the model itself. The declarative approach to
the creation of classes makes it self-descriptive:
class Reservoir(HasTraits):
    name = Str
    max_storage = Float(100)
The desc metadata of the traits can be used to provide more descriptive information about the trait:
class Reservoir(HasTraits):
    name = Str
    max_storage = Float(100, desc='Maximal storage [hm3]')
Let’s now define the complete reservoir class:
if __name__ == '__main__':
    reservoir = Reservoir(
        name = 'Project A',
        max_storage = 30,
        max_release = 100.0,
        head = 60,
        efficiency = 0.8
    )
    release = 80
    print('Releasing {} m3/s produces {} kWh'.format(
        release, reservoir.energy_production(release)
    ))
19.3.4 Visualization: opening a dialog
The Traits library is also aware of user interfaces and can pop up a default view for the Reservoir class:
reservoir1 = Reservoir()
reservoir1.edit_traits()
TraitsUI simplifies the way user interfaces are created. Every trait on a HasTraits class has a default
editor that will manage the way the trait is rendered to the screen (e.g. the Range trait is displayed as
a slider, etc.).
In the very same vein as the Traits declarative way of creating classes, TraitsUI provides a declarative
interface to build user interfaces code:
A special trait allows to manage events and trigger function calls using the magic _xxxx_fired method:
class ReservoirState(HasTraits):
"""Keeps track of the reservoir state given the initial storage.
    update_storage = Event(desc='Updates the storage to the next time step')

    def _update_storage_fired(self):
        # update storage state
        new_storage = self.storage - self.release + self.inflows
        self.storage = min(new_storage, self.max_storage)
        overflow = new_storage - self.max_storage
        self.spillage = max(overflow, 0)

    def print_state(self):
        print('Storage\tRelease\tInflows\tSpillage')
        str_format = '\t'.join(['{:7.2f} 'for i in range(4)])
        print(str_format.format(self.storage, self.release, self.inflows,
                                self.spillage))
        print('-' * 79)

if __name__ == '__main__':
    projectA = Reservoir(
        name = 'Project A',
        max_storage = 30,
        max_release = 5.0,
        hydraulic_head = 60,
        efficiency = 0.8
    )

    # state attributes
    storage = Property(depends_on='inflows, release')

    # control attributes
    inflows = Float(desc='Inflows [hm3]')
    release = Range(low='min_release', high='max_release')
    spillage = Property(
        desc='Spillage [hm3]', depends_on=['storage', 'inflows', 'release']
    )

    ### Private traits.
    _storage = Float

    ### Traits property implementation.
    def _get_storage(self):
        new_storage = self._storage - self.release + self.inflows
        return min(new_storage, self.max_storage)

    def _set_storage(self, storage_value):
        self._storage = storage_value

    def _get_spillage(self):
        new_storage = self._storage - self.release + self.inflows
        overflow = new_storage - self.max_storage
        return max(overflow, 0)
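The storage and spillage rule used in the listings above can be isolated in a plain Python function, without any Traits machinery. This is a sketch for checking the arithmetic by hand; the function name `step_storage` is hypothetical and the input values are only loosely based on the example (capacity 30 hm3, initial storage 25 hm3).

```python
def step_storage(storage, inflows, release, max_storage):
    """Advance the reservoir by one time step; return (storage, spillage)."""
    new_storage = storage - release + inflows
    # Water above the reservoir capacity is spilled
    spillage = max(new_storage - max_storage, 0)
    return min(new_storage, max_storage), spillage


# No inflow: the storage simply drops by the release
print(step_storage(25, inflows=0, release=4, max_storage=30))
# Large inflow: the storage saturates at capacity and the excess is spilled
print(step_storage(25, inflows=10, release=4, max_storage=30))
```

Keeping the rule in a pure function makes it easy to test independently of the event or property mechanism that triggers it.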
    For the simplicity of the example, the release is considered in
    hm3/timestep and not in m3/s.
    """
    reservoir = Instance(Reservoir, ())
    name = DelegatesTo('reservoir')
    max_storage = DelegatesTo('reservoir')
    max_release = DelegatesTo('reservoir')
    min_release = Float

    # state attributes
    storage = Property(depends_on='inflows, release')

    # control attributes
    inflows = Float(desc='Inflows [hm3]')
    release = Range(low='min_release', high='max_release')
    spillage = Property(
        desc='Spillage [hm3]', depends_on=['storage', 'inflows', 'release']
    )

    state = ReservoirState(reservoir=projectA, storage=25)
    state.release = 4
    state.inflows = 0
    state.print_state()
    state.configure_traits()
The dynamic trait notification signatures are not the same as the static ones:
• def wake_up_watchman(): pass
• def wake_up_watchman(new): pass
• def wake_up_watchman(name, new): pass
• def wake_up_watchman(object, name, new): pass
• def wake_up_watchman(object, name, old, new): pass
Removing a dynamic listener can be done by:
• calling the remove_trait_listener method on the trait with the listener method as argument,
• calling the on_trait_change method with listener method and the keyword remove=True,
• deleting the instance that holds the listener.
Listeners can also be added to classes using the on_trait_change decorator:

from traits.api import HasTraits, Instance, DelegatesTo, Float, Range
from traits.api import Property, on_trait_change
from reservoir import Reservoir

class ReservoirState(HasTraits):
    """Keeps track of the reservoir state given the initial storage.

    For the simplicity of the example, the release is considered in
    hm3/timestep and not in m3/s.
    """
    reservoir = Instance(Reservoir, ())
    max_storage = DelegatesTo('reservoir')
    min_release = Float
    max_release = DelegatesTo('reservoir')

    # state attributes
    storage = Property(depends_on='inflows, release')

    @on_trait_change('storage')
    def print_state(self):
        print('Storage\tRelease\tInflows\tSpillage')
        str_format = '\t'.join(['{:7.2f} ' for i in range(4)])
        print(str_format.format(self.storage, self.release, self.inflows,
                                self.spillage))
        print('-' * 79)

if __name__ == '__main__':
    projectA = Reservoir(
        name = 'Project A',
        max_storage = 30,
        max_release = 5,
        hydraulic_head = 60,
        efficiency = 0.8
    )
    state = ReservoirState(reservoir=projectA, storage=25)
    state.release = 4
    state.inflows = 0

The patterns supported by the on_trait_change method and decorator are powerful. The reader should look at the docstring of HasTraits.on_trait_change for the details.

19.3.7 Some more advanced traits
The following example demonstrates the usage of the Enum and List traits:

from traits.api import HasTraits, Str, Float, Range, Enum, List
from traitsui.api import View, Item

class IrrigationArea(HasTraits):
    name = Str
    surface = Float(desc='Surface [ha]')
    crop = Enum('Alfalfa', 'Wheat', 'Cotton')
    release = 80
    print('Releasing {} m3/s produces {} kWh'.format(
        release, reservoir.energy_production(release)
    ))

Trait listeners can be used to listen to changes in the content of the list to e.g. keep track of the total crop surface linked to a given reservoir.

from traits.api import HasTraits, Str, Float, Range, Enum, List, Property
from traitsui.api import View, Item

class IrrigationArea(HasTraits):
    name = Str
    surface = Float(desc='Surface [ha]')
    crop = Enum('Alfalfa', 'Wheat', 'Cotton')

class Reservoir(HasTraits):
    name = Str
    max_storage = Float(1e6, desc='Maximal storage [hm3]')
    max_release = Float(10, desc='Maximal release [m3/s]')
    head = Float(10, desc='Hydraulic head [m]')
    efficiency = Range(0, 1.)

    irrigated_areas = List(IrrigationArea)

    total_crop_surface = Property(depends_on='irrigated_areas.surface')

    release = 80
    print('Releasing {} m3/s produces {} kWh'.format(
        release, reservoir.energy_production(release)
    ))

The next example shows how the Array trait can be used to feed a specialised TraitsUI Item, the ChacoPlotItem:

import numpy as np

from traits.api import HasTraits, Array, Instance, Float, Property
from traits.api import DelegatesTo
from traitsui.api import View, Item, Group
from chaco.chaco_plot_editor import ChacoPlotItem

from reservoir import Reservoir

class ReservoirEvolution(HasTraits):
    reservoir = Instance(Reservoir)

    name = DelegatesTo('reservoir')

    inflows = Array(dtype=np.float64, shape=(None))
    releases = Array(dtype=np.float64, shape=(None))
    initial_stock = Float
    stock = Property(depends_on='inflows, releases, initial_stock')
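The stock property is declared as depending on inflows, releases and initial_stock, but its getter is not shown in this part of the text. One plausible reading, and it is only an assumption here, is a running balance: the initial stock plus the cumulative sum of inflows minus releases, with no cap at max_storage. The helper name stock_series is hypothetical; the monthly series are the ones used in this example.

```python
import numpy as np

def stock_series(initial_stock, inflows, releases):
    """Running stock after each time step (assumed definition, no cap)."""
    inflows = np.asarray(inflows, dtype=np.float64)
    releases = np.asarray(releases, dtype=np.float64)
    return initial_stock + np.cumsum(inflows - releases)

inflows_ts = np.array([6., 6, 4, 4, 1, 2, 0, 0, 3, 1, 5, 3])
releases_ts = np.array([4., 5, 3, 5, 3, 5, 5, 3, 2, 1, 3, 3])
stock = stock_series(10., inflows_ts, releases_ts)
print(stock)
```

With these series the stock rises to 14 in the third month and is fully drawn down to 0 in the eighth, which is the kind of evolution a plot item would display.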
    def _get_month(self):
        return np.arange(self.stock.size)

if __name__ == '__main__':
    reservoir = Reservoir(
        name = 'Project A',
        max_storage = 30,
        max_release = 100.0,
        head = 60,
        efficiency = 0.8
    )
    initial_stock = 10.
    inflows_ts = np.array([6., 6, 4, 4, 1, 2, 0, 0, 3, 1, 5, 3])
    releases_ts = np.array([4., 5, 3, 5, 3, 5, 5, 3, 2, 1, 3, 3])

CHAPTER 20

Tip: Mayavi is an interactive 3D plotting package. matplotlib can also do simple 3D plotting, but Mayavi relies on a more powerful engine ( VTK ) and is more suited to displaying large or complex data.
Chapters contents
mlab.clf()
x, y = np.mgrid[-10:10:100j, -10:10:100j]
r = np.sqrt(x**2 + y**2)
z = np.sin(r)/r
mlab.surf(z, warp_scale='auto')

Hint: Points in 3D, represented with markers (or “glyphs”) and optionally different sizes.
Hint: A surface mesh given by x, y, z positions of its node points

Hint: If your data is dense in 3D, it is more difficult to display. One option is to take iso-contours of the data.

mlab.clf()
x, y, z = np.mgrid[-5:5:64j, -5:5:64j, -5:5:64j]
values = x*x*0.5 + y*y + z*z*2.0
mlab.contour3d(values)

This function works with a regular orthogonal grid: the value array is a 3D array that gives the shape of the grid.

x, y, z are 2D arrays, all of the same shape, giving the positions of the vertices of the surface. The connectivity between these points is implied by the connectivity on the arrays.
For simple structures (such as orthogonal grids) prefer the surf function, as it will create more efficient data structures.

Keyword arguments:
color the color of the vtk object. Overrides the colormap, if any, when specified. This is specified as a triplet of float ranging from 0 to 1, e.g. (1, 1, 1) for white.
colormap type of colormap to use.
extent [xmin, xmax, ymin, ymax, zmin, zmax] Default is the x, y, z arrays extents. Use this to change the extent of the object created.
figure Figure to populate.
line_width The width of the lines, if any used. Must be a float. Default: 2.0
mask boolean mask array to suppress some data points.
mask_points If supplied, only one out of ‘mask_points’ data point is displayed. This option is useful to reduce the number of points displayed on large datasets. Must be an integer or None.
mode the mode of the glyphs. Must be ‘2darrow’ or ‘2dcircle’ or ‘2dcross’ or ‘2ddash’ or ‘2ddiamond’ or ‘2dhooked_arrow’ or ‘2dsquare’ or ‘2dthick_arrow’ or ‘2dthick_cross’ or ‘2dtriangle’ or ‘2dvertex’ or ‘arrow’ or ‘cone’ or ‘cube’ or ‘cylinder’ or ‘point’ or ‘sphere’. Default: sphere
name the name of the vtk object created.
representation the representation type used for the surface. Must be ‘surface’ or ‘wireframe’ or ‘points’ or ‘mesh’ or ‘fancymesh’. Default: surface
resolution The resolution of the glyph created. For spheres, for instance, this is the number of divisions along theta and phi. Must be an integer. Default: 8
scalars optional scalar data.
scale_factor scale factor of the glyphs used to represent the vertices, in fancy_mesh mode. Must be a float. Default: 0.05
scale_mode the scaling mode for the glyphs (‘vector’, ‘scalar’, or ‘none’).
transparent make the opacity of the actor depend on the scalar.
tube_radius radius of the tubes used to represent the lines, in mesh mode. If None, simple lines are used.
tube_sides number of sides of the tubes used to represent the lines. Must be an integer. Default: 6
vmax vmax is used to scale the colormap. If None, the max of the data will be used.
vmin vmin is used to scale the colormap. If None, the min of the data will be used.

Example:

In [1]: import numpy as np
In [2]: r, theta = np.mgrid[0:10, -np.pi:np.pi:10j]
In [3]: x = r * np.cos(theta)
In [4]: y = r * np.sin(theta)
In [5]: z = np.sin(r)/r
In [6]: from mayavi import mlab
In [7]: mlab.mesh(x, y, z, colormap='gist_earth', extent=[0, 1, 0, 1, 0, 1])
Out[7]: <mayavi.modules.surface.Surface object at 0xde6f08c>
In [8]: mlab.mesh(x, y, z, extent=[0, 1, 0, 1, 0, 1],
   ...:           representation='wireframe', line_width=1, color=(0.5, 0.5, 0.5))
Out[8]: <mayavi.modules.surface.Surface object at 0xdd6a71c>
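The arrays passed to mlab.mesh in this example can be built and inspected with numpy alone. In the session above the grid starts at r = 0, so sin(r)/r is 0/0 there and evaluates to NaN with a runtime warning. The sketch below is one possible way around this, not the way the original example does it: it uses np.sinc, which is defined as sin(pi t)/(pi t) with value 1 at t = 0.

```python
import numpy as np

# Polar grid: 10 radii (0..9) by 10 angles covering a full turn
r, theta = np.mgrid[0:10, -np.pi:np.pi:10j]
x = r * np.cos(theta)
y = r * np.sin(theta)

# np.sinc(t) = sin(pi*t) / (pi*t), so np.sinc(r / pi) equals sin(r) / r
# and takes the limit value 1 at r = 0 instead of NaN
z = np.sinc(r / np.pi)

print(x.shape, y.shape, z.shape)
print(np.isfinite(z).all())
```

The three arrays share the same 2D shape, which is what mlab.mesh expects: one (x, y, z) position per grid node, with neighbours in the array taken as neighbours on the surface.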
Decorations
Tip: Different items can be added to the figure to carry extra information, such as a colorbar or a title.
In [11]: mlab.outline(Out[7])
Out[11]: <enthought.mayavi.modules.outline.Outline object at 0xdd21b6c>
In [12]: mlab.axes(Out[7])
Out[12]: <enthought.mayavi.modules.axes.Axes object at 0xd2e4bcc>
• If you compute the norm of the vector field, you can apply an isosurface to it.
• using mayavi.mlab.quiver3d() you can plot vectors. You can also use the ‘masking’ options (in the GUI) to make the plot a bit less dense.

20.3.2 Different views on data: sources and modules

Tip: As we see above, it may be desirable to look at the same data in different ways.
Mayavi visualizations are created by loading the data in a data source and then displayed on the screen using modules.
This can be seen by looking at the “pipeline” view. By right-clicking on the nodes of the pipeline, you can add new modules.

Quiz
Why is it not possible to add a VectorCutPlane to the vectors created by mayavi.mlab.quiver3d()?

Different sources: scatters and fields

Tip: Data comes in different descriptions.
• A 3D block of regularly-spaced values is structured: it is easy to know how one measurement is related to a neighboring one and how to continuously interpolate between these. We can call such data a field, borrowing from terminology used in physics, as it is continuously defined in space.
• A set of data points measured at random positions in a random order gives rise to much more difficult and ill-posed interpolation problems: the data structure itself does not tell us what are the neighbors of a data point. We call such data a scatter.

Unstructured and unconnected data: a scatter    Structured and connected data: a field
mlab.points3d, mlab.quiver3d                    mlab.contour3d

Exercise:
1. Create a contour (for instance of the magnetic field norm) by using one of those functions and adding the right module by clicking on the GUI dialog.
2. Create the right source to apply a ‘vector_cut_plane’ and reproduce the picture of the magnetic field shown previously.
Note that one of the difficulties is providing the data in the right form (number of arrays, shape) to the functions. This is often the case with real-life data.

See also:
Sources are described in detail in the Mayavi manual.

Transforming data: filters
If you create a vector field, you may want to visualize the iso-contours of its magnitude. But the isosurface module can only be applied to scalar data, and not vector data. We can use a filter, ExtractVectorNorm, to add this scalar value to the vector field.
Filters apply a transformation to data, and can be added between sources and modules.

Exercise
Using the GUI, add the ExtractVectorNorm filter to display iso-contours of the field magnitude.
The mlab scripting layer builds pipelines for you. You can reproduce these pipelines programmatically with the mlab.pipeline interface: each step has a corresponding mlab.pipeline function (simply convert the name of the step to lower-case underscore-separated: ExtractVectorNorm gives extract_vector_norm). This function takes as an argument the node that it applies to, as well as optional parameters, and returns the new node.
For example, iso-contours of the magnitude are coded as:

20.4 Animating the data

x, y, z = np.ogrid[-5:5:100j, -5:5:100j, -5:5:100j]
scalars = np.sin(x * y * z) / (x * y * z)

iso = mlab.contour3d(scalars, transparent=True, contours=[0.5])
for i in range(1, 20):
    scalars = np.sin(i * x * y * z) / (x * y * z)
    iso.mlab_source.scalars = scalars

See also:
For the interaction with the user (for instance changing the view with the mouse), Mayavi needs some time to process these events. The for loop above prevents this. The Mayavi documentation details a workaround.

20.5 Making interactive dialogs

def curve(n_turns):
    "The function creating the x, y, z coordinates needed to plot"
    phi = np.linspace(0, 2*np.pi, 2000)
    return [np.cos(phi) * (1 + 0.5*np.cos(n_turns*phi)),
            np.sin(phi) * (1 + 0.5*np.cos(n_turns*phi)),
            0.5*np.sin(n_turns*phi)]

Tip: Let us read a bit the code above (examples/mlab_dialog.py).
First, the curve function is used to compute the coordinates of the curve we want to plot.
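The curve function can be exercised on its own, without Mayavi or a dialog. The sketch below repeats its definition and checks numerically what it draws: a closed line that winds n_turns times around a torus of major radius 1 and minor radius 0.5. The variable rho and the value 11 for n_turns are choices made for this illustration.

```python
import numpy as np

def curve(n_turns):
    "The function creating the x, y, z coordinates needed to plot"
    phi = np.linspace(0, 2*np.pi, 2000)
    return [np.cos(phi) * (1 + 0.5*np.cos(n_turns*phi)),
            np.sin(phi) * (1 + 0.5*np.cos(n_turns*phi)),
            0.5*np.sin(n_turns*phi)]

x, y, z = curve(11)
rho = np.hypot(x, y)   # distance from the z axis: 1 + 0.5*cos(n_turns*phi)

print(x.shape)
print(rho.min(), rho.max(), np.abs(z).max())
# Every point satisfies (rho - 1)**2 + z**2 == 0.25, i.e. lies on the torus
print(np.allclose((rho - 1)**2 + z**2, 0.25))
```

Because the function returns three plain 1D arrays of equal length, the same output can be fed to plot3d when the dialog is created and to mlab_source.set when a parameter changes.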
Second, the dialog is defined by an object inheriting from HasTraits, as it is done with Traits. The important point here is that a Mayavi scene is added as a specific Traits attribute (Instance). This is important for embedding it in the dialog. The view of this dialog is defined by the view attribute of the object. In the init of this object, we populate the 3D scene with a curve.
Finally, the configure_traits method creates the dialog and starts the event loop.
You can look at the example coil application to see a full-blown application for coil design in 270 lines of code.

class Visualization(HasTraits):
    n_turns = Range(0, 30, 11)
    scene = Instance(MlabSceneModel, ())

    def __init__(self):
        HasTraits.__init__(self)
        x, y, z = curve(self.n_turns)
        self.plot = self.scene.mlab.plot3d(x, y, z)

    @on_trait_change('n_turns')
    def update_plot(self):
        x, y, z = curve(self.n_turns)
        self.plot.mlab_source.set(x=x, y=y, z=z)

    view = View(Item('scene', height=300, show_label=False,
                     editor=SceneEditor()),
                HGroup('n_turns'), resizable=True)
CHAPTER 21

Authors: Gaël Varoquaux

Prerequisites
• numpy
• scipy
• matplotlib (optional)
• ipython (the enhancements come in handy)

Acknowledgements
This chapter is adapted from a tutorial given by Gaël Varoquaux, Jake Vanderplas, Olivier Grisel.

See also:
Data science in Python
• The Statistics in Python chapter may also be of interest for readers looking into machine learning.
• The documentation of scikit-learn is very complete and didactic.

Chapters contents
• Supervised Learning: Classification of Handwritten Digits
• Supervised Learning: Regression of Housing Data
• Measuring prediction performance
• Unsupervised Learning: Dimensionality Reduction and Visualization
• The eigenfaces example: chaining PCA and SVMs
• The eigenfaces example: chaining PCA and SVMs
• Parameter selection, Validation, and Testing
• Examples for the scikit-learn chapter

Tip: Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather than just storing and retrieving data items like a database system would do.

Fig. 1: A classification problem

We’ll take a look at two very simple machine learning tasks here. The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm may be used to draw a dividing boundary between the two clusters of points:
By drawing this separating line, we have learned a model which can generalize to new data: if you were to drop another point onto the plane which is unlabeled, this algorithm could now predict whether it’s
The next simple task we’ll look at is a regression task: a simple best-fit line to a set of data.
Again, this is an example of fitting a model to data, but our focus here is that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: here, we might be given an x-value, and the model would allow us to predict the y value.

21.1.2 Data in scikit-learn
Machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]
• n_samples: The number of samples: each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
• n_features: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.

A Simple Example: the Iris Dataset
The application problem
As an example of a simple dataset, let us take a look at the iris data stored by scikit-learn. Suppose we want to recognize species of irises. The data consists of measurements of three different species of irises:

Quick Question:
Remember that there must be a fixed number of features for each sample, and feature number i must be a similar kind of quantity for each sample.

Loading the Iris Data with Scikit-learn
Scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:
• Features in the Iris dataset:
– sepal length (cm)
– sepal width (cm)
– petal length (cm)
– petal width (cm)
• Target classes to predict:
– Setosa
– Versicolour
– Virginica
scikit-learn embeds a copy of the iris CSV file along with a function to load it into numpy arrays:
The features of each sample flower are stored in the data attribute of the dataset:
>>> print(iris.data.shape)
(150, 4)
>>> n_samples, n_features = iris.data.shape
>>> print(n_samples)
150
>>> print(n_features)
4
>>> print(iris.data[0])
[5.1 3.5 1.4 0.2]

The information about the class of each sample is stored in the target attribute of the dataset:
>>> print(iris.target.shape)
(150,)
>>> print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

The names of the classes are stored in the last attribute, namely target_names:
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']

This data is four-dimensional, but we can visualize two of the dimensions at a time using a scatter plot:

21.2 Basic principles of machine learning with scikit-learn

21.2.1 Introducing the scikit-learn estimator object
Every algorithm is exposed in scikit-learn via an “Estimator” object. For instance a linear regression is: sklearn.linear_model.LinearRegression
>>> from sklearn.linear_model import LinearRegression

Estimator parameters: All the parameters of an estimator can be set when it is instantiated:
>>> model = LinearRegression(n_jobs=1, normalize=True)
>>> print(model.normalize)
True
>>> print(model)
LinearRegression(n_jobs=1, normalize=True)

Fitting on data
Let’s create some simple data with numpy:
>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> y = np.array([0, 1, 2])
>>> X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)
>>> X
array([[0],
       [1],
       [2]])
>>> model.fit(X, y)
LinearRegression(n_jobs=1, normalize=True)

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:
>>> model.coef_
array([1.])
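The estimated coefficient can be cross-checked with a plain least-squares solve in numpy. This is a sketch for illustration rather than what scikit-learn does internally; it assumes the model also fits an intercept, which is why a column of ones is appended. The names A, coef and intercept are introduced here.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([0, 1, 2])
X = x[:, np.newaxis]                 # shape (3, 1): 3 samples, 1 feature

# Append a column of ones so the solve also estimates an intercept
A = np.hstack([X, np.ones((X.shape[0], 1))])
(coef, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(X.shape)
print(coef, intercept)
```

The slope comes out as 1 and the intercept as 0 up to rounding, consistent with the coef_ value shown above for data lying exactly on the line y = x.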
Supervised learning is further broken down into two categories, classification and regression. In
classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy,
the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.

Classification: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class. Let’s try it out on our iris classification problem:

from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
print(iris.target_names[knn.predict([[3, 5, 4, 2]])])

model = LinearRegression()
model.fit(x, y)

# predict y from the data
x_new = np.linspace(0, 30, 100)
y_new = model.predict(x_new[:, np.newaxis])
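The nearest-neighbor rule described for the iris problem (find the closest reference samples, assign the predominant class) is short enough to write out with numpy. The sketch below is a minimal illustration, not scikit-learn's implementation: knn_predict is a hypothetical helper, it uses brute-force Euclidean distances, and the toy data are made up.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=1):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    y_train = np.asarray(y_train)
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of the k closest samples
    votes = y_train[nearest]                        # their labels, shape (n_new, k)
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical toy data: two well-separated groups in 2D
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.5]], k=3))    # [0 1]
print((knn_predict(X_train, y_train, X_train, k=1) == y_train).all())  # True
```

The second call shows that with k=1 every training point is its own nearest neighbor, so the error measured on the training set is zero regardless of how well the model generalizes.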
And now, let’s fit a 4th order and a 9th order polynomial to the data.
With your naked eyes, which model do you prefer, the 4th order one, or the 9th order one?
Let’s look at the ground truth:
Tip: Regularization is ubiquitous in machine learning. Most scikit-learn estimators have a parameter
to tune the amount of regularization. For instance, with k-NN, it is ‘k’, the number of nearest neighbors
used to make the decision. k=1 amounts to no regularization: 0 error on the training set, whereas large
k will push toward smoother decision boundaries in the feature space.
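The comparison between a 4th order and a 9th order polynomial can be reproduced numerically. The sketch below uses made-up data (a sine curve plus noise, chosen for this illustration) and numpy's Polynomial.fit. It only measures the error on the training points, which is exactly why that error is a poor guide: a higher degree can always fit the training data at least as well.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
# Hypothetical data: a smooth function plus noise
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

def train_rms(degree):
    """RMS residual on the training points of a least-squares polynomial fit."""
    poly = Polynomial.fit(x, y, degree)
    return np.sqrt(np.mean((poly(x) - y) ** 2))

rms4, rms9 = train_rms(4), train_rms(9)
print(rms4, rms9)
```

The 9th order fit never has a larger training residual than the 4th order one, since the lower-degree polynomials are a subset of the higher-degree ones. Whether it is closer to the ground truth is a separate question that needs data not used for fitting.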
Tip: For classification models, the decision boundary that separates the classes expresses the complexity of the model. For instance, a linear model, which makes a decision based on a linear combination of features, is less complex than a non-linear one.
In this section we’ll apply scikit-learn to the classification of handwritten digits. This will go a bit beyond the iris classification we saw before: we’ll discuss some of the metrics which can be used in evaluating the effectiveness of a classification model.

>>> from sklearn.datasets import load_digits
>>> digits = load_digits()

Let us visualize the data and remind us what we’re looking at:

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')

21.3.2 Visualizing the Data on its principal components
A good first step for many problems is to visualize the data using a Dimensionality Reduction technique. We’ll start with the most straightforward one, Principal Component Analysis (PCA).
PCA seeks orthogonal linear combinations of the features which show the greatest variance, and as such, can help give you a good idea of the structure of the data set.
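What "orthogonal linear combinations with the greatest variance" means can be seen in a few lines of numpy. The sketch below is an illustration under stated assumptions, not scikit-learn's PCA class: it centers made-up data, takes a singular value decomposition, and keeps the top two directions. The final rescaling to unit variance is an optional extra step.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 samples, 3 features
mixing = np.array([[3., 0., 0.], [1., 1., 0.], [0., 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

Xc = X - X.mean(axis=0)                        # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                            # top-2 directions, one per row
X_proj = Xc @ components.T                     # projected data, shape (200, 2)
X_white = X_proj / X_proj.std(axis=0, ddof=1)  # optional rescaling to unit variance

print(components @ components.T)               # close to the 2x2 identity
print(X_proj.var(axis=0, ddof=1))              # variances in decreasing order
print(np.corrcoef(X_white.T))                  # close to the identity
```

The rows of components are orthonormal, the first projected coordinate carries the most variance, and the projected coordinates are uncorrelated with each other.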
Question
Given these projections of the data, which numbers do you think a classifier might have trouble distinguishing?

Tip: Gaussian Naive Bayes fits a Gaussian distribution to each training label independently on each feature, and uses this to quickly give a rough classification. It is generally not sufficiently accurate for real-world data, but can perform surprisingly well, for instance on text data.

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split

>>> # use the model to predict the labels of the test data
>>> predicted = clf.predict(X_test)
>>> expected = y_test
>>> print(predicted)
[1 7 7 7 8 2 8 0 4 8 7 7 0 8 2 3 5 8 5 3 7 9 6 2 8 2 2 7 3 5...]
>>> print(expected)
[1 0 4 7 8 2 2 0 4 3 7 7 0 8 2 3 4 8 5 3 7 9 6 3 8 2 2 9 3 5...]

As above, we plot the digits with the predicted labels to get an idea of how well the classification is working.
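The Gaussian Naive Bayes idea from the tip (one Gaussian per class and per feature, treated as independent) can be written out with numpy. The sketch below is a minimal illustration, not scikit-learn's GaussianNB: the function names, the small variance floor and the two-blob toy data are all choices made here.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant added so a constant feature does not give zero variance
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log posterior under independent Gaussians."""
    classes, priors, means, variances = model
    # log N(x | mean, var) summed over features; result has shape (n_samples, n_classes)
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=2)
    return classes[np.argmax(log_lik + np.log(priors), axis=1)]

rng = np.random.default_rng(0)
# Hypothetical data: two Gaussian blobs in 2D
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, np.array([[0., 0.], [6., 6.]])))   # [0 1]
```

Fitting only needs class-wise means and variances, which is why the method is fast; the independence assumption between features is what makes it a rough classifier.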
This is a manual version of a technique called feature selection.

Tip: Sometimes, in Machine Learning it is useful to use feature selection to decide which features are the most useful for a particular problem. Automated methods exist which quantify this sort of exercise of choosing the most informative features.

21.4.2 Predicting Home Prices: a Simple Linear Regression Solution
Now we’ll use scikit-learn to perform a simple linear regression on the housing data. There are many possibilities of regressors to use. A particularly simple one is LinearRegression: this is basically a wrapper around an ordinary least squares calculation.

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
>>> from sklearn.linear_model import LinearRegression

Tip: The prediction at least correlates with the true price, though there are clearly some biases. We could imagine evaluating the performance of the regressor by, say, computing the RMS residuals between the true and predicted price. There are some subtleties in this, however, which we’ll cover in a later section.

There are many other types of regressors available in scikit-learn: we’ll try a more powerful one here.
Use the GradientBoostingRegressor class to fit the housing data.
hint You can copy and paste some of the above code, replacing LinearRegression with GradientBoostingRegressor:

from sklearn.ensemble import GradientBoostingRegressor
# Instantiate the model, fit the results, and scatter in vs. out

The solution is found in the code of this chapter

21.5 Measuring prediction performance

21.5.1 A quick test on the K-neighbors classifier
Here we’ll continue to look at the digits data, but we’ll switch to the K-Neighbors classifier. The K-neighbors classifier is an instance-based classifier. The K-neighbors classifier predicts the label of an unknown point based on the labels of the K nearest points in the parameter space.
Now we train on the training data, and test on the testing data:
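The train-on-one-part, test-on-another pattern is not specific to the K-neighbors classifier. As a hedged, numpy-only illustration, the sketch below applies it to an ordinary least squares regressor and reports the RMS residual on the held-out part. The data, the 75/25 split and the helper add_intercept are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression data: 100 samples, 3 features, small noise
X = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0 + 0.01 * rng.normal(size=100)

# Shuffle once, then hold out the last 25% of the samples for testing
order = rng.permutation(len(X))
train, test = order[:75], order[75:]

def add_intercept(A):
    """Append a column of ones so the fit includes an intercept term."""
    return np.hstack([A, np.ones((A.shape[0], 1))])

# Ordinary least squares on the training part only
w, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# RMS residual between true and predicted values on the held-out part
pred = add_intercept(X[test]) @ w
rms = np.sqrt(np.mean((pred - y[test]) ** 2))
print(w, rms)
```

Because the test samples play no role in the fit, the RMS residual computed on them estimates how the model behaves on new data rather than how well it memorized the training set.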
Tip: We have applied Gaussian Naive Bayes, support vector machines, and K-nearest neighbors classifiers to the digits dataset. Now that we have these validation tools in place, we can ask quantitatively which of the three estimators works best for this dataset.
• With the default hyper-parameters for each estimator, which gives the best f1 score on the validation set? Recall that hyperparameters are the parameters set when you instantiate the classifier: for example, the n_neighbors in clf = KNeighborsClassifier(n_neighbors=1)

>>> from sklearn.model_selection import ShuffleSplit
>>> cv = ShuffleSplit(n_splits=5)
>>> cross_val_score(clf, X, y, cv=cv)
array([...])

Tip: There exist many different cross-validation strategies in scikit-learn. They are often useful to take into account non-iid datasets.
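The shuffle-and-split strategy used above amounts to drawing a fresh random permutation for every round and cutting it into a training part and a test part. The sketch below shows that idea with numpy only. The generator shuffle_split is a hypothetical stand-in, not scikit-learn's ShuffleSplit, and the default test fraction of 10% is this sketch's own choice.

```python
import numpy as np

def shuffle_split(n_samples, n_splits=5, test_size=0.1, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random permutations."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_size * n_samples)))
    for _ in range(n_splits):
        order = rng.permutation(n_samples)
        yield order[n_test:], order[:n_test]

splits = list(shuffle_split(20, n_splits=5, test_size=0.25))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))   # 15 5
```

Each round gives disjoint training and test indices that together cover all samples. Unlike K-fold, the test sets of different rounds may overlap, since every round reshuffles from scratch.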
Basic Hyperparameter Optimization
We compute the cross-validation score as a function of alpha, the strength of the regularization for Lasso and Ridge. We choose 20 values of alpha between 0.0001 and 1:

Automatically Performing Grid Search
sklearn.model_selection.GridSearchCV is constructed with an estimator, as well as a dictionary of parameter values to be searched. We can find the optimal parameters this way:

>>> from sklearn.model_selection import GridSearchCV
>>> for Model in [Ridge, Lasso]:
...     gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
...     print('%s: %s' % (Model.__name__, gscv.best_params_))
Ridge: {'alpha': 0.062101694189156162}
Lasso: {'alpha': 0.01268961003167922}

Nested cross-validation
How do we measure the performance of these estimators? We have used data to set the hyperparameters, so we need to test on actually new data. We can do this by running cross_val_score() on our CV objects. Here there are 2 cross-validation loops going on, this is called ‘nested cross validation’:

Tip: PCA computes linear combinations of the original features using a truncated Singular Value Decomposition of the matrix X, to project the data onto a base of the top singular vectors.

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2, whiten=True)
>>> pca.fit(X)
PCA(n_components=2, ...)

Once fitted, PCA exposes the singular vectors in the components_ attribute:
>>> pca.components_
array([[ 0.3..., -0.08..., 0.85..., 0.3...],
       [ 0.6..., 0.7..., -0.1..., -0.07...]])

>>> X_pca = pca.transform(X)
>>> X_pca.shape
(150, 2)

PCA normalizes and whitens the data, which means that the data is now centered on both components with unit variance:

>>> X_pca.mean(axis=0)
array([...e-15, ...e-15])
>>> X_pca.std(axis=0, ddof=1)
array([1., 1.])

Furthermore, the sample components no longer carry any linear correlation:

>>> np.corrcoef(X_pca.T)
array([[1.00000000e+00, 0.0],
       [0.0, 1.00000000e+00]])

Tip: Note that this projection was determined without any information about the labels (represented by the colors): this is the sense in which the learning is unsupervised. Nevertheless, we see that the projection gives us insight into the distribution of the different flowers in parameter space: notably, iris setosa is much more distinct than the other two species.

>>> # Take the first 500 data points: it's hard to see 1500 points
>>> X = digits.data[:500]
>>> y = digits.target[:500]

>>> # Fit and transform with a TSNE
>>> from sklearn.manifold import TSNE
>>> tsne = TSNE(n_components=2, random_state=0)
>>> X_2d = tsne.fit_transform(X)

>>> # Visualize the data
>>> plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
<matplotlib.collections.PathCollection object at ...>

fit_transform
As TSNE cannot be applied to new data, we need to use its fit_transform method.

sklearn.manifold.TSNE separates quite well the different classes of digits even though it had no access to the class information.
sklearn.manifold has many other non-linear embeddings. Try them out on the digits dataset. Could
you judge their quality without knowing the labels y?
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> # ...
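One possible starting point for the exercise above is sketched below. It is only an illustration: the choice of Isomap and the value n_neighbors=10 are assumptions made here, not settings taken from the lecture.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
# Same subset as in the TSNE example: the first 500 samples
X = digits.data[:500]
y = digits.target[:500]

# Isomap is one of the non-linear embeddings in sklearn.manifold.
# n_neighbors=10 is an arbitrary choice for this sketch.
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

print(X_iso.shape)
```

The two columns of X_iso can be passed to plt.scatter with c=y, exactly as for the TSNE projection, to compare the embeddings visually.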
print(X_train.shape, X_test.shape)
Out:
One interesting part of PCA is that it computes the “mean” face, which can be interesting to examine:
plt.imshow(pca.mean_.reshape(faces.images[0].shape),
cmap=plt.cm.bone)
The components (“eigenfaces”) are ordered by their importance from top-left to bottom-right. We
see that the first few components seem to primarily take care of lighting conditions; the remaining
components pull out certain identifying features: the nose, eyes, eyebrows, etc.
With this projection computed, we can now project our original training and test data onto the PCA
basis:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print(X_train_pca.shape)
Out:
(300, 150)
print(X_test_pca.shape)
Out:
(100, 150)
These projected components correspond to factors in a linear combination of component images such
that the combination approaches the original face.
The principal components measure deviations about this mean along orthogonal axes.
print(pca.components_.shape)
Out:
import numpy as np
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test[i].reshape(faces.images[0].shape),
              cmap=plt.cm.bone)
    y_pred = clf.predict(X_test_pca[i, np.newaxis])[0]
    color = ('black' if y_pred == y_test[i] else 'red')
    ax.set_title(y_pred, fontsize='small', color=color)
The classifier is correct on an impressive number of images given the simplicity of its learning model!
Using a linear classifier on 150 features derived from the pixel-level data, the algorithm correctly identifies
a large number of the people in the images.
Again, we can quantify this effectiveness using one of several measures from sklearn.metrics. First we
can do the classification report, which shows the precision, recall and other measures of the “goodness”
of the classification:
from sklearn import metrics
y_pred = clf.predict(X_test_pca)
print(metrics.classification_report(y_test, y_pred))
Out:
      precision  recall  f1-score  support
 0       1.00     0.50     0.67       6
 1       1.00     1.00     1.00       4
 2       0.50     1.00     0.67       2
 3       1.00     1.00     1.00       1
 4       0.33     1.00     0.50       1
 5       1.00     1.00     1.00       5
 6       1.00     1.00     1.00       4
 7       1.00     0.67     0.80       3
 9       1.00     1.00     1.00       1
10       1.00     1.00     1.00       4
11       1.00     1.00     1.00       1
12       0.67     1.00     0.80       2
13       1.00     1.00     1.00       3
14       1.00     1.00     1.00       5
15       1.00     1.00     1.00       3
Another interesting metric is the confusion matrix, which indicates how often any two items are mixed
up. The confusion matrix of a perfect classifier would only have nonzero entries on the diagonal, with
zeros on the off-diagonal:
print(metrics.confusion_matrix(y_test, y_pred))
Out:
[[3 0 0 ... 0 0 0]
 [0 4 0 ... 0 0 0]
 [0 0 2 ... 0 0 0]
 ...
 [0 0 0 ... 3 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 3]]
21.8.3 Pipelining
Above we used PCA as a pre-processing step before applying our support vector machine classifier.
Plugging the output of one estimator directly into the input of a second estimator is a commonly used
pattern; for this reason scikit-learn provides a Pipeline object which automates this process. The above
problem can be re-expressed as a pipeline as follows:
from sklearn.pipeline import Pipeline
clf = Pipeline([('pca', decomposition.PCA(n_components=150, whiten=True)),
                ('svm', svm.LinearSVC(C=1.0))])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# confusion_matrix expects the true labels first, then the predictions
print(metrics.confusion_matrix(y_test, y_pred))
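The pipelining pattern can also be tried without downloading any face data. The following is a minimal sketch under stated assumptions: it uses the bundled digits dataset instead of the faces, and n_components=30 and max_iter=10000 are illustrative values chosen here, not taken from the lecture.

```python
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Assumption: digits stand in for the faces so the sketch runs offline
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA is fitted on the training data only, then its output feeds the SVM
clf = Pipeline([
    ('pca', PCA(n_components=30, whiten=True)),
    ('svm', LinearSVC(C=1.0, max_iter=10000)),
])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)  # true labels first
accuracy = metrics.accuracy_score(y_test, y_pred)
print(cm.shape, accuracy)
```

Because the pipeline behaves like a single estimator, it can be passed directly to cross_val_score or GridSearchCV, which refit the PCA step inside each fold.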
Bias-variance trade-off: illustration on a simple regression problem
Let us start with a simple 1D regression problem. This will help us to easily visualize the data and the
model, and the results generalize easily to higher-dimensional datasets. We’ll explore a simple linear
regression problem, with sklearn.linear_model.
In real-life situations, we have noise (e.g. measurement noise) in our data:
We can use another linear estimator that uses regularization, the Ridge estimator. This estimator
regularizes the coefficients by shrinking them toward zero, under the assumption that very high correlations
are often spurious. The alpha parameter controls the amount of shrinkage used.
regr = linear_model.Ridge(alpha=.1)
np.random.seed(0)
for _ in range(6):
    noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)
    plt.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    plt.plot(X_test, regr.predict(X_test))
As we can see, the estimator displays much less variance. However it systematically under-estimates the
coefficient. It displays a biased behavior.
This is a typical example of the bias/variance tradeoff: non-regularized estimators are not biased, but they
can display a lot of variance. Highly-regularized models have little variance, but high bias. This bias is
not necessarily a bad thing: what matters is choosing the tradeoff between bias and variance that leads
to the best prediction performance. For a specific dataset there is a sweet spot corresponding to the
highest complexity that the data can support, depending on the amount of noise and of observations
available.
21.9.2 Visualizing the Bias/Variance Tradeoff
On a given dataset, let us fit a simple polynomial regression model with varying degrees:
Tip: In the above figure, we see fits for three different values of d. For d = 1, the data is under-fit.
This means that the model is too simplistic: no straight line will ever be a good fit to this data. In this
case, we say that the model suffers from high bias. The model itself is biased, and this will be reflected
in the fact that the data is poorly fit. At the other extreme, for d = 6 the data is over-fit. This means
that the model has too many free parameters (6 in this case) which can be adjusted to perfectly fit the
training data. If we add a new point to this plot, though, chances are it will be very far from the curve
representing the degree-6 fit. In this case, we say that the model suffers from high variance. The reason
for the term “high variance” is that if any of the input points are varied slightly, it could result in a very
different model.
In the middle, for d = 2, we have found a good mid-point. It fits the data fairly well, and does not suffer
from the bias and variance problems seen in the figures on either side. What we would like is a way to
quantitatively identify bias and variance, and optimize the metaparameters (in this case, the polynomial
degree d) in order to determine the best algorithm.
Polynomial regression with scikit-learn
A polynomial regression is built by pipelining PolynomialFeatures and a LinearRegression:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
Validation Curves
Let us create a dataset like in the example above:
Validation curve A validation curve consists in varying a model parameter that controls its complexity
(here the degree of the polynomial) and measuring the error of the model both on training data and on test
data (e.g. with cross-validation). The model parameter is then adjusted so that the test error is minimized:
We use sklearn.model_selection.validation_curve() to compute train and test error, and plot it:
>>> from sklearn.model_selection import validation_curve
>>> degrees = np.arange(1, 21)
>>> model = make_pipeline(PolynomialFeatures(), LinearRegression())
>>> # Vary the "degrees" on the pipeline step "polynomialfeatures"
>>> train_scores, validation_scores = validation_curve(
...     model, x[:, np.newaxis], y,
...     param_name='polynomialfeatures__degree',
...     param_range=degrees)
>>> # Plot the mean train score and validation score across folds
>>> plt.plot(degrees, validation_scores.mean(axis=1), label='cross-validation')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(degrees, train_scores.mean(axis=1), label='training')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.legend(loc='best')
<matplotlib.legend.Legend object at ...>
A learning curve shows the training and validation score as a function of the number of training points.
Note that when we train on a subset of the training data, the training score is computed using this
subset, not the full training set. This curve gives a quantitative view into how beneficial it will be to
add training samples.
Questions:
• As the number of training samples is increased, what do you expect to see for the training
score? For the validation score?
• Would you expect the training score to be higher or lower than the validation score? Would you
ever expect this to change?
>>> # Plot the mean train score and validation score across folds
>>> plt.plot(train_sizes, validation_scores.mean(axis=1), label='cross-validation')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(train_sizes, train_scores.mean(axis=1), label='training')
[<matplotlib.lines.Line2D object at ...>]
• Gather more features for each sample.
• Decrease regularization in a regularized model.
Increasing the number of samples, however, does not improve a high-bias model.
Now let’s look at a high-variance (i.e. over-fit) model:
Fig. 6: For a degree=15 model
Here we show the learning curve for d = 15. From the above discussion, we know that d = 15 is a
high-variance estimator which over-fits the data. This is indicated by the fact that the training score
is much higher than the validation score. As we add more samples to this training set, the training score
will continue to decrease, while the cross-validation score will continue to increase, until they meet in
the middle.
Learning curves that have not yet converged with the full training set indicate a high-variance,
over-fit model.
A high-variance model can be improved by:
• Gathering more training samples.
• Using a less-sophisticated model (i.e. in this case, make d smaller)
• Increasing regularization.
In particular, gathering more features for each sample will not help the results.
21.9.3 Summary on model selection
We’ve seen above that an under-performing algorithm can be due to two possible situations: high
bias (under-fitting) and high variance (over-fitting). In order to evaluate our algorithm, we set aside
a portion of our training data for cross-validation. Using the technique of learning curves, we can
train on progressively larger subsets of the data, evaluating the training error and cross-validation error
to determine whether our algorithm has high variance or high bias. But what do we do with this
information?
High Bias
If a model shows high bias, the following actions might help:
• Add more features. In our example of predicting home prices, it may be helpful to make use of
information such as the neighborhood the house is in, the year the house was built, the size of the
lot, etc. Adding these features to the training and test sets can improve a high-bias estimator.
• Use a more sophisticated model. Adding complexity to the model can help improve on bias.
For a polynomial fit, this can be accomplished by increasing the degree d. Each learning technique
has its own methods of adding complexity.
• Use fewer samples. Though this will not improve the classification, a high-bias algorithm can
attain nearly the same error with a smaller training sample. For algorithms which are computationally
expensive, reducing the training sample size can lead to very large improvements in
speed.
• Decrease regularization. Regularization is a technique used to impose simplicity in some machine
learning models, by adding a penalty term that depends on the characteristics of the parameters.
If a model has high bias, decreasing the effect of regularization can lead to better results.
High Variance
If a model shows high variance, the following actions might help:
• Use fewer features. Using a feature selection technique may be useful, and decrease the over-fitting
of the estimator.
• Use a simpler model. Model complexity and over-fitting go hand-in-hand.
• Use more training samples. Adding training samples can reduce the effect of over-fitting, and
lead to improvements in a high variance estimator.
• Increase Regularization. Regularization is designed to prevent over-fitting. In a high-variance
model, increasing regularization can lead to better results.
These choices become very important in real-world situations. For example, due to limited telescope
time, astronomers must seek a balance between observing a large number of objects, and observing a
large number of features for each object. Determining which is more important for a particular learning
task can inform the observing strategy that the astronomer employs.
21.9.4 A last word of caution: separate validation and test set
Using validation schemes to determine hyper-parameters means that we are fitting the hyper-parameters
to the particular validation set. In the same way that parameters can be over-fit to the training set,
hyperparameters can be over-fit to the validation set. Because of this, the validation error tends to
under-predict the classification error of new data.
For this reason, it is recommended to split the data into three sets:
• The training set, used to train the model (usually ~60% of the data)
• The validation set, used to validate the model (usually ~20% of the data)
• The test set, used to evaluate the expected error of the validated model (usually ~20% of the
data)
Many machine learning practitioners do not separate test set and validation set. But if your goal is to
gauge the error of a model on unknown data, using an independent test set is vital.
21.10 Examples for the scikit-learn chapter
21.10.1 Measuring Decision Tree performance
Demonstrates overfit when testing on train set.
Get the data
# Note: load_boston was removed in scikit-learn 1.2; this example needs an older release
from sklearn.datasets import load_boston
data = load_boston()
Train and test a model
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor().fit(data.data, data.target)
predicted = clf.predict(data.data)
expected = data.target
Plot predicted as a function of expected
from matplotlib import pyplot as plt
plt.figure(figsize=(4, 3))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
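The overfitting that appears when a model is scored on its own training data can be sketched on synthetic data. This is an illustration under stated assumptions: make_regression and its parameters are chosen here for convenience and are not the dataset used in the lecture.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: a small noisy synthetic regression problem
X, y = make_regression(n_samples=300, n_features=5, noise=20.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize every training sample
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # scored on data it has seen
test_r2 = tree.score(X_test, y_test)     # scored on held-out data
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

The gap between the two scores is the high-variance signature discussed above; only the held-out score says anything about new data.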
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
21.10.3 A simple linear regression
Fit a PCA
X_pca = pca.transform(X)
target_ids = range(len(iris.target_names))
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# x from 0 to 30
x = 30 * np.random.random((20, 1))
ax.set_xlabel('x')
ax.set_ylabel('y')
21.10.4 Plot 2D views of the iris dataset
Plot a simple scatter plot of 2 features of the iris dataset.
Note that more elaborate visualization of this dataset is detailed in the Statistics in Python chapter.
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.figure(figsize=(5, 4))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.tight_layout()
plt.show()
(442, 10)
Visualize the data
target_ids = range(len(digits.target_names))
Compute the cross-validation score with the default hyper-parameters
Out:
Ridge: 0.4101758336587286
Lasso: 0.3375597834274947
We compute the cross-validation score as a function of alpha, the strength of the regularization for Lasso
and Ridge
import numpy as np
from matplotlib import pyplot as plt
plt.figure(figsize=(5, 3))
plt.legend(loc='lower left')
plt.xlabel('alpha')
plt.ylabel('cross validation score')
plt.tight_layout()
plt.show()
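The computation behind this plot can be sketched as follows. It is a minimal illustration, assuming the diabetes dataset (consistent with the (442, 10) shape printed above), 3-fold cross-validation, and a logarithmic grid of 20 alpha values; the grid spacing is an assumption made here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Assumption: 20 log-spaced values of alpha between 0.0001 and 1
alphas = np.logspace(-4, 0, 20)

mean_scores = {}
best_alpha = {}
for Model in [Ridge, Lasso]:
    # Mean cross-validation score for each value of alpha
    mean_scores[Model.__name__] = [
        cross_val_score(Model(alpha=alpha), X, y, cv=3).mean()
        for alpha in alphas
    ]
    # GridSearchCV runs the same search and stores the best setting
    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
    best_alpha[Model.__name__] = gscv.best_params_['alpha']

print(best_alpha)
```

Plotting mean_scores['Ridge'] and mean_scores['Lasso'] against alphas on a logarithmic x axis would give curves of the kind labelled by the plotting calls above.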
np.random.seed(0)
for _ in range(6):
    noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)
    plt.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    plt.plot(X_test, regr.predict(X_test))
In real-life situations, we have noise (e.g. measurement noise) in our data:
LinearSVC: 0.9341800269333108
GaussianNB: 0.8332741681010101
KNeighborsClassifier: 0.9804562804949924
------------------
LinearSVC(loss='hinge'): 0.9294570108037394
LinearSVC(loss='squared_hinge'): 0.9341371852581549
-------------------
KNeighbors(n_neighbors=1): 0.9913675218842191
KNeighbors(n_neighbors=2): 0.9848442068835102
KNeighbors(n_neighbors=3): 0.9867753449543099
KNeighbors(n_neighbors=4): 0.9803719053818863
KNeighbors(n_neighbors=5): 0.9804562804949924
KNeighbors(n_neighbors=6): 0.9757924194139573
KNeighbors(n_neighbors=7): 0.9780645792142071
KNeighbors(n_neighbors=8): 0.9780645792142071
KNeighbors(n_neighbors=9): 0.9780645792142071
KNeighbors(n_neighbors=10): 0.9755550897728812
print('-------------------')
# test the number of neighbors
for n_neighbors in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("KNeighbors(n_neighbors={0}): {1}".format(n_neighbors,
          metrics.f1_score(y_test, y_pred, average="macro")))
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
# make_blobs is imported from sklearn.datasets in current scikit-learn releases
from sklearn.datasets import make_blobs
plt.show()
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
For this we need to engineer features: the n_th powers of x:
plt.figure(figsize=(6, 4))
plt.scatter(x, y, s=4)
plt.legend(loc='best')
plt.axis('tight')
plt.title('Fitting a 4th and a 9th order polynomial')
rng = np.random.RandomState(0)
x = 2*rng.rand(100) - 1
The data
plt.figure(figsize=(6, 4))
plt.scatter(x, y, s=4)
Ground truth
plt.figure(figsize=(6, 4))
plt.scatter(x, y, s=4)
plt.plot(x_test, f(x_test), label="truth")
plt.axis('tight')
plt.title('Ground truth (9th order polynomial)')
plt.show()
clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
expected = y_test
plt.figure(figsize=(4, 3))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
Simple prediction
plt.figure(figsize=(4, 3))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
Print the error rate
import numpy as np
print("RMS: %r " % np.sqrt(np.mean((predicted - expected) ** 2)))
plt.show()
Out:
RMS: 0.5314909993118918
Prediction with gradient boosted tree
import numpy as np
from matplotlib import pyplot as plt
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
plt.show()
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
450
# label the image with the target value
if predicted[i] == expected[i]:
    ax.text(0, 7, str(predicted[i]), color='green')
else:
    ax.text(0, 7, str(predicted[i]), color='red')
And now, the ratio of correct predictions
from sklearn import metrics
print(metrics.classification_report(expected, predicted))
Out:
      precision  recall  f1-score  support
 0       1.00     1.00     1.00      51
 1       0.62     0.93     0.75      41
 2       0.94     0.70     0.80      46
 3       0.93     0.87     0.90      47
 4       1.00     0.84     0.91      43
 5       0.86     0.93     0.89      40
 6       0.98     0.98     0.98      45
 7       0.86     0.96     0.91      52
 8       0.65     0.69     0.67      49
 9       0.96     0.69     0.81      36
print(metrics.confusion_matrix(expected, predicted))
plt.show()
Out:
[[51 0 0 0 0 0 0 0 0 0]
 [ 0 38 0 0 0 0 0 0 3 0]
 [ 0 5 32 0 0 0 0 0 9 0]
 [ 0 1 0 41 0 2 0 0 2 1]
 [ 0 2 1 0 36 0 1 2 1 0]
 [ 0 1 0 0 0 37 0 1 1 0]
 [ 0 0 1 0 0 0 44 0 0 0]
 [ 0 0 0 0 0 1 0 50 1 0]
 [ 0 12 0 0 0 1 0 2 34 0]
 [ 0 2 0 3 0 2 0 3 1 25]]
21.10.14 The eigenfaces example: chaining PCA and SVMs
The goal of this example is to show how an unsupervised method and a supervised one can be chained
for better prediction. It starts with a didactic but lengthy way of doing things, and finishes with the
idiomatic approach to pipelining in scikit-learn.
Here we’ll take a look at a simple facial recognition example. Ideally, we would use a dataset consisting
of a subset of the Labeled Faces in the Wild data that is available with
sklearn.datasets.fetch_lfw_people(). However, this is a relatively large download (~200MB) so we will do the tutorial
on a simpler, less rich dataset. Feel free to explore the LFW dataset.
Let’s visualize these faces to see what we’re working with
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(8, 6))
# plot several images
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(faces.images[i], cmap=plt.cm.bone)
Tip: Note that these faces have already been localized and scaled to a common size. This is an
important preprocessing piece for facial recognition, and is a process that can require a large collection
of training data. This can be done in scikit-learn, but the challenge is gathering a sufficient amount of
training data for the algorithm to work. Fortunately, this piece is common enough that it has been done.
One good resource is OpenCV, the Open Computer Vision Library.
We’ll perform a Support Vector classification of the images. We’ll do a typical train-test split on the
images:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(faces.data,
        faces.target, random_state=0)
print(X_train.shape, X_test.shape)
Out:
One interesting part of PCA is that it computes the “mean” face, which can be interesting to examine:
plt.imshow(pca.mean_.reshape(faces.images[0].shape),
cmap=plt.cm.bone)
The components (“eigenfaces”) are ordered by their importance from top-left to bottom-right. We
see that the first few components seem to primarily take care of lighting conditions; the remaining
components pull out certain identifying features: the nose, eyes, eyebrows, etc.
With this projection computed, we can now project our original training and test data onto the PCA
basis:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print(X_train_pca.shape)
Out:
(300, 150)
print(X_test_pca.shape)
Out:
(100, 150)
These projected components correspond to factors in a linear combination of component images such
that the combination approaches the original face.
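The linear-combination statement can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the bundled digits images rather than the faces, 20 components, and a PCA without whitening, so that the projected values are directly the weights of the component images.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: digits images stand in for the faces
X = load_digits().data

# No whitening here: with whiten=True the weights would be rescaled
pca = PCA(n_components=20).fit(X)

# Project one sample: one weight per component image
weights = pca.transform(X[:1])

# Mean image plus the weighted sum of the component images
approx = pca.mean_ + weights @ pca.components_

# inverse_transform performs the same reconstruction
same = np.allclose(approx, pca.inverse_transform(weights))
print(weights.shape, approx.shape, same)
```

Increasing n_components brings approx closer to the original sample, which is the sense in which the combination approaches the original image.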
Out:
(150, 4096)
It is also interesting to visualize these principal components:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
Finally, we can evaluate how well this classification did. First, we might plot a few of the test-cases with
the labels learned from the training set:
import numpy as np
fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test[i].reshape(faces.images[0].shape),
              cmap=plt.cm.bone)
The classifier is correct on an impressive number of images given the simplicity of its learning model!
Using a linear classifier on 150 features derived from the pixel-level data, the algorithm correctly identifies
a large number of the people in the images.
Again, we can quantify this effectiveness using one of several measures from sklearn.metrics. First we
can do the classification report, which shows the precision, recall and other measures of the “goodness”
of the classification:
from sklearn import metrics
y_pred = clf.predict(X_test_pca)
print(metrics.classification_report(y_test, y_pred))
Out:
Another interesting metric is the confusion matrix, which indicates how often any two items are mixed
up. The confusion matrix of a perfect classifier would only have nonzero entries on the diagonal, with
zeros on the off-diagonal:
print(metrics.confusion_matrix(y_test, y_pred))
Out:
[[3 0 0 ... 0 0 0]
 [0 4 0 ... 0 0 0]
 [0 0 2 ... 0 0 0]
 ...
 [0 0 0 ... 3 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 3]]
Here we have used PCA “eigenfaces” as a pre-processing step for facial recognition. We chose it
because PCA is a broadly applicable technique, which can be useful for a wide array of
data types. Research in the field of facial recognition in particular, however, has shown that other, more
specific feature extraction methods can be much more effective.
import numpy as np
from matplotlib import pyplot as plt
    labels[far_pts] = -1
    return data, labels
21.10.16 Bias and variance of polynomial fit
Demo overfitting, underfitting, and validation and learning curves with polynomial regression.
Fit polynomials of different degrees to a dataset: for too small a degree, the model underfits, while for too
large a degree, it overfits.
for i, d in enumerate(degrees):
    ax = fig.add_subplot(131 + i, xticks=[], yticks=[])
    ax.scatter(x, y, marker='x', c='k', s=50)
    ax.set_xlim(-0.2, 1.2)
    ax.set_ylim(0, 12)
    ax.set_xlabel('house size')
    if i == 0:
        ax.set_ylabel('price')
    ax.set_title(titles[i])
n_samples = 200
test_size = 0.4
error = 1.0
# Plot the mean train error and validation error across folds
plt.figure(figsize=(6, 4))
plt.plot(degrees, validation_scores.mean(axis=1), lw=2,
label='cross-validation')
plt.plot(degrees, train_scores.mean(axis=1), lw=2, label='training')
plt.legend(loc='best')
plt.xlabel('degree of fit')
plt.ylabel('explained variance')
plt.title('Validation curve')
plt.tight_layout()
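The scores plotted above can be produced end to end as in the following sketch. It is an illustration under stated assumptions: the synthetic data (a smooth curve plus Gaussian noise), the range of degrees 1 to 10, and cv=5 are choices made here rather than the exact settings of the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: synthetic 1D data, a smooth function plus noise
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=200)
y = 10 - 1. / (x + 0.1) + rng.normal(scale=1.0, size=x.shape)

degrees = np.arange(1, 11)
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# One row per degree, one column per cross-validation fold
train_scores, validation_scores = validation_curve(
    model, x[:, np.newaxis], y,
    param_name='polynomialfeatures__degree',
    param_range=degrees, cv=5)

# Degree with the highest mean validation score
best_degree = degrees[validation_scores.mean(axis=1).argmax()]
print(best_degree)
```

The parameter name 'polynomialfeatures__degree' follows the naming that make_pipeline derives from the lowercased class name of each step.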
Learning curves
Plot train and test error with an increasing number of samples
# Plot the mean train error and validation error across folds
plt.figure(figsize=(6, 4))
plt.plot(train_sizes, validation_scores.mean(axis=1),
lw=2, label='cross-validation')
plt.plot(train_sizes, train_scores.mean(axis=1),
lw=2, label='training')
plt.ylim(ymin=-.1, ymax=1)
plt.legend(loc='best')
plt.xlabel('number of train samples')
plt.ylabel('explained variance')
plt.title('Learning curve (degree=%i)' % d)
plt.tight_layout()
plt.show()
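The train_sizes, train_scores and validation_scores arrays used in this plot can be obtained with sklearn.model_selection.learning_curve. The sketch below is an illustration under stated assumptions: the synthetic data, the ten training sizes and cv=5 are chosen here, and the degrees 1 and 15 echo the high-bias and high-variance cases discussed in the chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: synthetic 1D data, a smooth function plus noise
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=200)
y = 10 - 1. / (x + 0.1) + rng.normal(scale=1.0, size=x.shape)

curves = {}
for d in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    # Fractions of the available training data; returned as absolute counts
    train_sizes, train_scores, validation_scores = learning_curve(
        model, x[:, np.newaxis], y,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=5)
    curves[d] = (train_sizes, train_scores, validation_scores)

print(curves[15][0])
```

Each score array has one row per training size and one column per fold, so taking the mean over axis=1 gives the curves drawn by the plotting code above.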
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Rectangle, Polygon, Arrow, FancyArrow
Part IV
22
• BSGalvan • Zbigniew Jędrzejewski-Szmek
• Lars Buitinck • Thouis (Ray) Jones
• Pierre de Buyl • jorgeprietoarranz
• Ozan Çağlayan • josephsalmon
CHAPTER
• Lawrence Chan • Greg Kiar
• Adrien Chauve • kikocorreoso
• Robert Cimrman • Vince Knight
• Christophe Combelles • LFP6
• David Cournapeau • Manuel López-Ibáñez
E
equations
algebraic, 521
differential, 522
I
integration, 520
M
Matrix, 522
P
Python Enhancement Proposals
PEP 255, 268
PEP 3118, 304
PEP 3129, 277
PEP 318, 270, 277
PEP 342, 268
PEP 343, 278
PEP 380, 269
PEP 380#id13, 269
PEP 8, 272
S
solve, 521