Python Data Science
Chaolemen Borjigin
The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Publishing House of
Electronics Industry.
ISBN of the Co-Publisher’s edition: 978-7-121-41200-4
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
“Writing a textbook” holds immeasurable merit as it allows us to save others’ time with our own. In today’s
impetuous and realistic society, I have dedicated myself to writing textbooks, knowing that they may not be
counted among my personal achievements. However, I find immense joy in the process of writing this textbook.
As the old saying goes, “If you’re afraid, don't do it; if you’re doing it, don’t be afraid!” It has taken me 18
months of dedicated effort simply because I want to utilize my time to save the valuable time of the readers.
“Writing a textbook” requires an exceptional top-down design and stepwise refinement. Through years of teaching
experience, I have come to realize the urgent need for an excellent Python textbook for the education of data
science and big data professionals. Existing textbooks face several issues: firstly, they teach (or learn) Python as
if it were Java or C, failing to capture the unique characteristics of Python. Secondly, the style of “knowledge
first, code later” and the dominance of knowledge over practical implementation seem to invert their proper
order of importance. Thirdly, there is no clear distinction between Python textbooks used for data science and
computer science, leading to confusion. Lastly, some authors treat the readers (or themselves) as programming
novices, neglecting the fact that most readers possess prior knowledge of Java or C and are learning Python as a
second programming language. They do not require repetitive explanations of low-level concepts or redundant
explanations of the same knowledge in different languages. Overcoming these limitations and exploring new
teaching and textbook-writing patterns was my original intention in writing this book. Whether or not I have
achieved this goal remains to be seen and depends on your careful reading and fair judgment.
“Writing a textbook” necessitates the knowledge and practice of countless resources. Throughout the writing
process, I extensively referred to monographs, textbooks, papers, open-source projects, and original data. The
reference list contains detailed citations for the sources I have used. However, I may have inadvertently missed
a few. If so, I sincerely apologize to the relevant scholars. This book also incorporates data science research and
engineering projects completed by my team since 2012, as well as the questions and discussions raised by my
students. The course slides, raw data, source codes, and errata list for this textbook can be found at github.com.
For further information, please contact me at [email protected].
“Writing a textbook” is impossible without the help of others. The leaders and editors at Springer Press and
Publishing House of Electronics Industry, especially editor Zhang Haitao, have made significant contributions
to the publication of this book. I would like to express my gratitude to the Ministry of Education-IBM Industry-
University Cooperation Collaborative Education Project for their funding and support. Special thanks go to Zhang
Chen, Xiao Jiwen, Liu Xuan, Tianyi Zhang, Meng Gang, Sun Zhizhong, Wang Rui, Liu Yan, Yang Canjun, Li
Haojing, Wang Yuqing, Qu Hanqing, Zhao Qun, Li Xueming, Ji Jiayu, and other students at Renmin University
of China for their invaluable proofreading assistance.
“Writing a textbook” is a lengthy process of iterative refinement. This edition may still have some shortcomings,
and I genuinely welcome your feedback and suggestions. This textbook is my third book after Data Science and
Data Science Theory and Practice. Someone once said to me, “Prof. Chaolemen, you have already achieved
so much, why do you still work so tirelessly? You will become the number one in the field of data science.” I
replied, “No, that is not my purpose. I have undertaken all these endeavors with the belief that I strive to be the
one who works the hardest and is most willing to invest time and energy in this field. As for other matters, I am
not concerned. My hope is that my dedication and effort will inspire you to do the same!”
In this textbook, I have aimed to provide a comprehensive and cohesive guide to Python programming. I have
taken great care in addressing the shortcomings I observed in existing textbooks. Rather than treating Python as a
mere translation of Java or C, I have emphasized its unique features and characteristics. The content is structured
in a logical and progressive manner, ensuring that knowledge and practical implementation are presented in the
right order of importance.
Throughout the writing process, I have drawn from a wide range of resources, including monographs, textbooks,
research papers, open-source projects, and my team's own data science endeavors. The reference list provides
detailed citations for the sources I have used. While I have made every effort to be thorough, I acknowledge that
there may be a few omissions. For any oversights, I sincerely apologize to the respective scholars.
I am immensely grateful to the leaders and editors at Springer Press and Publishing House of Electronics
Industry, particularly editor Zhang Haitao, for their invaluable contributions to the publication of this book.
Additionally, I would like to express my appreciation to the Ministry of Education-IBM Industry-University
Cooperation Collaborative Education Project as well as The Quality Textbook Support Project for Graduate
Students at Renmin University of China for their support and funding.
I extend my heartfelt thanks to Zhang Chen, Xiao Jiwen, Liu Xuan, Tianyi Zhang, Meng Gang, Sun Zhizhong,
Wang Rui, Liu Yan, Yang Canjun, Li Haojing, Wang Yuqing, Qu Hanqing, Zhao Qun, Li Xueming, Ji Jiayu, and
other students at Renmin University of China for their diligent proofreading of the book.
“Writing a textbook” is a laborious process that requires dedication, perseverance, and continuous refinement.
While I have strived for excellence in this edition, I acknowledge that there may still be areas for improvement.
I genuinely welcome your feedback and advice, as it will contribute to the future enhancement of this textbook.
Once again, I want to emphasize that my purpose in writing this textbook is to share my knowledge and contribute
to the field of data science. It is my hope that through your engagement with this book, my efforts will become
your efforts, and together we can advance the understanding and application of Python programming in the
realm of data science.
Thank you for embarking on this data science journey with me.
Chaolemen Borjigin
May 13, 2023
Contents
2.13.3 Iteration
2.13.4 Unpacking
2.13.5 Repeat operator
2.13.6 Basic Functions
2.14 Sets
Q&A
2.14.1 Defining sets
2.14.2 Main features
2.14.3 Basic operations
2.14.4 Sets and data science
2.15 Dictionaries
Q&A
2.15.1 Defining dictionaries
2.15.2 Accessing dictionary items
2.15.3 Dictionary and data science
2.16 Functions
Q&A
2.16.1 Built-in functions
2.16.2 Module Functions
2.16.3 User-defined functions
2.17 Built-in functions
Q&A
2.17.1 Calling built-in functions
2.17.2 Mathematical functions
2.17.3 Type conversion functions
2.17.4 Other common used functions
2.18 Module functions
Q&A
2.18.1 import module name
2.18.2 import module name as alias
2.18.3 from module name import function name
2.19 User-defined functions
Q&A
2.19.1 Defining user-defined functions
2.19.2 Function docStrings
2.19.3 Calling user-defined functions
2.19.4 Returning values
2.19.5 Parameters and arguments
2.19.6 Scope of variables
2.19.7 Pass-by-value and pass-by-reference
2.19.8 Arguments in functions
2.20 Lambda functions
Q&A
2.20.1 Defining a lambda function
2.20.2 Calling a lambda function
Exercises
3. Advanced Python Programming for Data Science
3.1 Iterators and generators
Q&A
3.1.1 Iterable objects vs. iterators
3.1.2 Generator vs. iterators
3.2 Modules
Q&A
3.2.1 Importing and using modules
3.2.2 Checking built-in modules list
3.3 Packages
Q&A
3.3.1 Packages vs modules
3.3.2 Installing packages
3.3.3 Checking installed packages
3.3.4 Updating or removing installed packages
3.3.5 Importing packages or modules
3.3.6 Checking Package Version
3.3.7 Commonly used Packages
3.4 Help documentation
Q&A
3.4.1 The help function
3.4.2 DocString
3.4.3 Checking source code
3.4.4 The doc attribute
3.4.5 The dir() function
3.5 Exception and errors
Q&A
3.5.1 Try/Except/Finally
3.5.2 Exception reporting mode
3.5.3 Assertion
3.6 Debugging
Q&A
3.6.1 Enabling the Python Debugger
3.6.2 Changing exception reporting modes
3.6.3 Working with checkpoints
3.7 Search path
Q&A
3.7.1 The variable search path
3.7.2 The module search path
3.8 Current working directory
Q&A
3.8.1 Getting current working directory
3.8.2 Resetting current working directory
3.8.3 Reading/writing current working directory
3.9 Object-oriented programming
Q&A
3.9.1 Classes
3.9.2 Methods
3.9.3 Inheritance
3.9.4 Attributes
3.9.5 Self and Cls
3.9.6 __new__() and __init__()
Exercises
4. Data wrangling with Python
4.1 Random number generation
Q&A
4.1.1 Generating a random number at a time
4.1.2 Generating a random array at a time
Python has become the most popular data science programming language in recent years. This chapter will
introduce:
How to learn Python for data science
How to set up my Python IDE for data science
How to write and run my Python code
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 1
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2_1
Q&A
Python and Data Science 3
Figure 1.2 Guido van Rossum (the creator of Python Programming language) and the
official website of Python
Q&A
Once you have Anaconda installed, you do not need to download or install the editors,
interpreters, packages, or package managers, nor manually edit their configuration
files.
Tips
This book uses Jupyter Notebook/Lab, the most commonly used editor in data
science, and the Python 3 kernel (interpreter).
Notes
For more in-depth details about installing Anaconda on macOS or Linux, please
refer to the Anaconda official website: https://anaconda.com.
Tips
The Anaconda official website and introduction are shown in Figure 1.3. The menu items and their
usage are shown in Figure 1.4.
Figure 1.4 Windows Start menu items and their usage after Anaconda is installed
Figure 1.5 How to use Jupyter Notebook for data science projects
Q&A
1.3.1 Inputs
Jupyter Notebook documents consist of “cells”: input cells and output cells. We
only need to type code in the input cell; Jupyter Notebook evaluates it and
generates the output cell automatically. For instance, Out[1] shows the output of the
code in In[1] in the following cell.
Tips
In[1] x1=11
x1
Out[1] 11
When writing Python code, you need to pay attention to case sensitivity as well
as code indentation (the spaces at the beginning of a code line). For more information,
please refer to [2.4 Statements].
Notes
The default shortcut for running a “Cell” in Jupyter Notebook is Ctrl+Enter. For more
shortcuts, you can refer to the Help/Keyboard Shortcuts menu item in the “Menu Bar”
of Jupyter Notebook.
Tricks
In an input “Cell”, do not start a code line with “#” unless you intend it as a
comment, since “#” denotes a Python comment statement. A comment in Python starts
with the hash character (#) and extends to the end of the physical line. Please refer to
[2.6 Comments] for more details.
Tips
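The comment rule above can be illustrated with a minimal sketch. The variable name x2 and its value are assumptions chosen to fit the surrounding cells; they are not taken from the original listing.

```python
# A full-line comment: Python ignores everything after the hash character.
x2 = 12  # An inline comment: the assignment before the hash still runs.

# x2 = 99   <- this statement is commented out, so it never executes.
print(x2)
```

Because the second assignment sits behind a "#", x2 keeps its first value.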
In[3] X3=13
X3
Out[3] 13
The number in In[] is the sequential order in which the input cell was executed in the
current session of the Jupyter Notebook kernel.
Tips
When the same “Cell” is executed multiple times, the number in its In[] will be updated
accordingly. This means that the In[] number will reflect the order of execution,
indicating the current iteration of the cell. For more details, refer to [1.3.5 Tips for
Python programming].
Notes
To restart or stop the “current session (Session)”, we can restart the kernel of Jupyter
notebook by clicking Kernel > Restart (or stop) from the Jupyter menu.
Tricks
In[4] x4=14
In[5] x4=x4+1
x4
Out[5] 15
Python is an interpreted programming language, and we can run a cell as many
times as we want. As a result, the current value of a variable may change with each
run. For instance, “x4=x4+1” is a self-assignment statement, and the value
of x4 is incremented each time the cell is executed.
Tips
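The effect of re-running a self-assignment cell can be sketched outside Jupyter as follows. The helper run_cell is a hypothetical stand-in for pressing "Run" on the cell that contains x4=x4+1; it is not a Jupyter API.

```python
x4 = 14  # same starting value as In[4]


def run_cell(value):
    """Hypothetical stand-in for executing the cell `x4 = x4 + 1` once."""
    return value + 1


# Pretend the same cell is executed three times in a row.
for run in range(1, 4):
    x4 = run_cell(x4)
    print(f"after run {run}: x4 = {x4}")
```

Each simulated run increments x4 by one, which is why checking the current value before relying on it is worthwhile.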
Checking the current value of a variable is one of the good habits of successful
programmers in data science projects.
Notes
In[6] x5=16
x5
Out[6] 16
In Jupyter Notebook, Python code is executed with the “Cell” as the unit. Unlike a
C or Java program, the cells are not run in a predetermined order (such as from top
to bottom). Therefore, the code cells are executed individually and can be run in a
non-sequential order.
Notes
1.3.2 Outputs
In a Jupyter Notebook “Cell”, the output is displayed to the right of the output
label “Out[]:”.
Tips
In[7] y1=21
y1
Out[7] 21
In[8] y2=22
y2
Out[8] 22
The number displayed in Out[] is the In[] number of the cell that produced the output
result.
Tips
In[9] y3=23
print(y3)
Out[9] 23
In Jupyter Notebook, instead of using the print() function, you can directly write the
variable name to see the output result. Note that a result shown this way carries an
Out[] number, whereas output produced by print() does not have an Out[] number
associated with it.
Notes
Both y2 and print(y3) in In[8] and In[9] can produce the same result. Is there any
difference between the two?
(1) Displaying a value by writing the variable name alone is not a feature of Python
itself, but a function provided by Jupyter Notebook to facilitate our programming. In
Python, standard output still requires the print() function.
(2) The former puts the output result into the Out queue variable of Jupyter Notebook
and gives it an Out number, but the latter is not put into the Out queue variable and
has no Out number.
(3) The former is the display result after “optimization” by Jupyter Notebook, and
the output effect is often different from that of the print() function.
Tips
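One visible difference between the two display styles can be sketched in plain Python. This assumes that, for built-in types such as strings, Jupyter shows the repr() form of the last expression in a cell, while print() writes the str() form. The string value below is an assumption chosen so that the two forms differ.

```python
y2 = "22"  # a string this time, so the two display styles differ visibly

rich_display = repr(y2)   # roughly what a bare `y2` at the end of a cell shows
plain_display = str(y2)   # what print(y2) writes to standard output

print(rich_display)       # the quotes are part of the displayed text
print(plain_display)      # no quotes
```

For integers such as 22 the two forms look identical, which is why In[8] and In[9] appear to give the same result.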
The error message indicates that “z is an undefined object” because the name of the
defined variable is not “z” but “z1”.
Notes
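A minimal sketch of this kind of error follows. The value assigned to z1 is an assumption, since the text only gives the variable names. The try/except wrapper is added here so that the example runs to completion instead of stopping at the error.

```python
z1 = 31  # hypothetical value; only the name z1 is defined

try:
    print(z)  # z was never defined, so Python raises NameError
except NameError as err:
    message = str(err)
    print("NameError:", message)
```

Python is case-sensitive and does not guess at similar names, so z and z1 are treated as unrelated identifiers.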
For further details about Python errors or exceptions, please refer to [3.5 Exceptions
and Errors].
Tips
Prior to reading data source files such as Excel, CSV, or JSON, it is necessary to place
them in the current working directory.
Tips
os.getcwd():
Returns the current working directory of the session.
Tricks
In[11] import os
print(os.getcwd())
Out[11] C:\Users\soloman\clm
For further details about current working directory, please refer to [3.8 Current
working directory].
Tips
The data file, named “bc_data.csv”, needs to be placed in your current working
directory in advance. If the file is not found in the current working directory, Python
will raise a “FileNotFoundError”.
Notes
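Before loading a data file, it can help to check whether it is actually in the current working directory. The sketch below uses only the standard library. The file name no_such_file_for_demo.csv is a hypothetical name assumed not to exist, used to show the error safely.

```python
import os

file_name = "bc_data.csv"  # file name taken from the text
file_path = os.path.join(os.getcwd(), file_name)

# Report whether the data file sits in the current working directory.
status = "found" if os.path.exists(file_path) else "missing"
print(f"{file_name} is {status} in {os.getcwd()}")

# Opening a file that does not exist raises FileNotFoundError.
missing_name = "no_such_file_for_demo.csv"  # hypothetical, assumed absent
try:
    with open(missing_name) as handle:
        handle.read()
    outcome = "opened"
except FileNotFoundError as err:
    outcome = type(err).__name__
print(outcome)
```

Checking with os.path.exists() first gives a clearer message than waiting for the exception during loading.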
The code in the input cell loads the data file “bc_data.csv” from the local disk
into memory using the read_csv() function from the Pandas library. This function is
specifically designed to read data from CSV files.
Tips
Figure 1.8 shows the Edit and Esc state of a cell in Jupyter Notebook.
Tips
Many valuable learning and reference resources are available in the “Help” menu of
Jupyter Notebook/Lab. Python beginners are advised to take full advantage of these
essential resources, as illustrated in Figure 1.9.
Tricks
Exercises
[1] Python was created by ( ).
A. Wes McKinney
B. Guido van Rossum
C. James Gray
D. Hadley Wickham
[2] Which of the following is true of Python?
A. Python is a programming language that uses compiling.
B. Python is a language that represents simplicity.
C. Python is a scripting language.
D. Python is an advanced language.
[3] Which of the following is false of Python?
A. Python’s syntax is concise.
B. Python is a platform dependent language.
C. Chinese is supported in Python.
D. Python has rich resources of classes and libraries.
[4] Which of the following is false of programming languages?
A. A programming language is a concrete implementation of programming.
B. Natural languages are simpler, more rigorous and more precise than programming languages.
C. Programming languages are primarily used for interaction between humans and computers.
D. A programming language is an artificial language for interaction.
[5] What is false about the basic programming strategies?
A. Input is the beginning of a program.
B. Output is the way in which the program displays the results of operations.
C. Processing is the process in which the program calculates the input data and produces the output
results.
D. Output is the soul of a program.
[6] Python is suitable for ()
A. hardware development
B. mobile development
C. data analysis
D. game development
[7] Which of the following is the Python interpreter?
A. CPython
B. JPython
C. IronPython
D. All of the above
[8] Which of the following is false of the indentation in Python?
A. Indentation is a part of syntax.
B. Indentation does not affect the running of programs.
C. Indentation is the only way to represent the containing and hierarchical relationship between codes.
D. Indentation is normally represented by 4 spaces or 1 tab.
[9] Which of the following is false of the Python development environment configuration?
A. The installation of Python may vary depending on the operating system.
B. Python can be integrated into integrated development environments such as Eclipse and PyCharm.
C. Jupyter Notebook editor is widely used in data science and data analysis projects.
D. After installing Anaconda, we need to download the editors and packages required for Python
programming one by one.
Python is a general-purpose language, so it can be used for a wide range of applications, such as data science,
computer science, software engineering, mathematics, life science, linguistics, and journalism. However, learning
Python programming for data science requires specific knowledge tailored to its use in that field. This
chapter will introduce the basics of Python syntax for data science, including:
Data types (Lists, Tuples, Strings, Sequences, Sets, Dictionaries)
Variables
Operators and expressions
Statements (assignments, comments, if statements, for statements, and while statements)
Functions (built-in functions, module functions, user-defined functions, and lambda functions)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 19
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2_2
Q&A
Basic Python Programming for Data Science 21
In[1] # int
type(1)
Out[1] int
The term “built-in functions” refers to the functions that are inherently available in the
Python interpreter and can be used without the need for any additional imports. These
functions are part of the core Python language and are commonly used in various
programming tasks. The most commonly used functions, such as type(), isinstance(),
dir(), print(), int(), float(), str(), list(), tuple(), and set(), are built-in functions.
Notes
In[2] # float
type(1.2)
Out[2] float
In[3] # bool
type(True)
Out[3] bool
Python Boolean data type has two values: True and False
Tips
In Python, there is no difference between single-quoted and double-quoted strings. For further
details, please refer to [2.13 Strings]
Tips
In[5] # list
type([1,2,3,4,5,6,7,8,9])
Out[5] list
In Python, to create a list, the elements are placed inside square brackets ([]), separated
by commas. For further details, please refer to [2.10 Lists]
Tips
In contrast to C and Java, Python does not have a built-in data type called “array”.
Instead, Python uses “list” and “tuple” as its primary data structures for storing
collections of elements.
Notes
In[6] # tuple
type((1,2,3,4,5,6,7,8,9))
Out[6] tuple
In Python, a tuple is created by placing all the items (elements) inside parentheses (),
separated by commas. For further details, please refer to [2.11 Tuples].
Tips
In[7] # set
type({1,2,3,4,5,6,7,8,9})
Out[7] set
In Python, a set is created by placing all the items (elements) inside braces {}, separated
by commas (,). A dictionary also uses braces, but each item is a key-value pair in which a
colon (:) separates the key from its value. For further details, please refer to [2.14 Sets].
Tips
A dictionary holds key-value pairs. The relationship between a dictionary and a set: the keys
of a dictionary behave like a set, that is, each key is unique. For further details, please refer to [2.15 Dictionaries].
Tips
Here, Out[11] is True because the Boolean class is implemented as a subclass of
the integer class in Python.
Notes
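The cell that produced Out[11] is not reproduced in this text. As a minimal sketch of the same idea (the variable names below are illustrative assumptions), the subclass relationship between bool and int can be checked as follows:

```python
# bool is a subclass of int, so True and False behave like 1 and 0.
is_bool_subclass = issubclass(bool, int)   # True
true_is_int = isinstance(True, int)        # True
total = True + True + False                # 2, since True == 1 and False == 0

print(is_bool_subclass, true_is_int, total)
```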
In explicit type conversion, also known as type casting, we can convert the data type
of an object to the required data type by calling predefined functions such as int(), float(), and
str().
Notes
In[12] int(1.6)
Out[12] 1
In general, the name of the type casting function matches the name of the target data
type.
Notes
bool(0)
Out[14] False
tuple([1,2,1,1,3])
Out[15] (1, 2, 1, 1, 3)
list((1,2,3,4))
Out[16] [1, 2, 3, 4]
The difference between a list and a tuple: the former is “a mutable object”, while the
latter is “an immutable object”. In Python, mutable objects are those whose value can
be changed after creation, while immutable objects are those whose value cannot
be modified once they are created. For further details, please refer to [2.10 Lists] as
well as [2.11 Tuples].
Tips
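A short sketch of the difference, using illustrative variable names that do not appear in the original text: item assignment succeeds on a list and raises a TypeError on a tuple.

```python
my_list = [1, 2, 3]
my_list[0] = 99            # lists are mutable: item assignment works

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99       # tuples are immutable: this raises TypeError
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("TypeError:", err)

print(my_list, my_tuple)
```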
Python provides not only basic data types such as int, float, string, list, tuple and set,
but also some built-in constants including None, Ellipsis, and NotImplemented.
Tips
In[17] # None
x = None
print(x)
Out[17] None
Note that in Jupyter Notebook, None is displayed only when it is passed to the print() function;
a cell whose last expression evaluates to None shows no output.
Notes
The None keyword is used to represent a null value or indicate the absence of a value.
Consequently, None is distinct from 0, False, or an empty string.
Tricks
In[18] # NotImplemented
print(NotImplemented)
Out[18] NotImplemented
In[19] # Ellipsis
print(Ellipsis)
Out[19] Ellipsis
In Python, Ellipsis is a built-in constant (not a keyword); the literal “...” (three dots)
evaluates to the same object. It is a special value commonly used in combination with extended
slicing syntax, particularly for user-defined container data types.
Notes
x = 2+3j
print('x = ', x)
Out[20] x = (2+3j)
Tips
In[21] y=complex(3,4)
print('y = ', y)
Out[21] y = (3+4j)
To access the documentation for the print() function and learn about its arguments and
usage in Python, you can use either “print?” or “?print” in most interactive Python
environments, such as Jupyter Notebook or IPython.
Tricks
In this context, the symbol “e” stands for “times 10 raised to the power of” in scientific
notation (for example, 1.5e3 means 1.5 * 10**3), not the mathematical
constant “e” with a value of approximately 2.71828.
Notes
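The cell this note refers to is not shown in this text. A minimal sketch with assumed values illustrates the two meanings of “e”:

```python
import math

x = 1.5e3        # scientific notation: 1.5 * 10**3
y = 2e-2         # scientific notation: 2 * 10**-2
print(x, y)      # 1500.0 0.02
print(math.e)    # Euler's number, unrelated to the "e" in float literals
```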
2.1.5 Sequences
In Python, a sequence refers to a collection of items that are ordered by their positions.
It is a general term that does not specifically refer to an independent data type but
rather encompasses various ordered containers.
Notes
There are three basic sequence types: strings, lists, and tuples. The set type is not a
sequence, because its elements have no order.
Tips
mySeq1*3
Out[31] 'Data ScienceData ScienceData Science'
Tips
2.2 Variables
Q&A
Python is a dynamically typed language. We don’t have to declare the type of a variable
when assigning a value to it. In other words, the Python interpreter
does not know the type of the variable until the code is run.
Notes
In[2] x = 10
x = "testMe"
Code like this would raise an error in C or Java, but not in Python.
Notes
Variables in Python do not need to declare their type in advance, and the same
variable can be assigned to different object types.
Tips
Python is considered a strongly typed language because the interpreter keeps track
of variable types. Strong typing ensures that the type of a value does not change
unexpectedly.
Notes
In[3] "3" + 2
Out[3] ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-e8240368dace> in <module>
----> 1 "3"+2
In Python, automatic data type conversion during runtime is not performed by default,
except for conversions between int, float, bool, and complex types.
Notes
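As a hedged illustration of the point above (variable names are assumptions made for this sketch), mixing str and int requires an explicit conversion in one direction or the other:

```python
try:
    result = "3" + 2          # str + int is never converted implicitly
except TypeError as err:
    result = None
    print("TypeError:", err)

as_number = int("3") + 2      # explicit conversion to int gives 5
as_text = "3" + str(2)        # explicit conversion to str gives "32"
print(as_number, as_text)
```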
In[4] 3+True # Here, no error was raised.
Out[4] 4
In[5] 3+3.3 # Here, no error was raised.
Out[5] 6.3
In Python, variables are simply names that refer to objects. In other words, a Python
variable is a symbolic name that is a reference or pointer to an object.
Notes
In[7] i = 20
i = "myStr"
i = 30.1
i
Out[7] 30.1
The variable name represents (or is essentially) “a reference to a value”, rather than
“the value of the variable”.
Tips
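The idea that a name is only a reference can be sketched as follows; the names a, b, and the helper variables are illustrative and not taken from the original cells.

```python
a = [1, 2, 3]
b = a                            # b is a second name for the same list object
same_object = a is b             # True: both names refer to one object

i = 20
first_type = type(i).__name__    # 'int'
i = "myStr"                      # the name i now refers to a str object
second_type = type(i).__name__   # 'str'

print(same_object, first_type, second_type)
```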
2.2.5 Case-sensitivity
In[8] i = 20
I
Out[8] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-447541a63ca9> in <module>
1 i=20
----> 2 I
The variable that was defined is the lowercase “i”, whereas the name being evaluated is the
uppercase “I”. Python treats them as two different identifiers.
Tips
In Python, a NameError is raised when the identifier being accessed has neither been defined
in advance nor imported from another module or package. Hence, we can correct
NameErrors by:
defining the name in advance, or quoting it if a string constant was intended
importing the module or package that defines it
Tricks
In[10] myvariable_2 = 0
In[11] 2_myvariable = 0
Out[11] File "<ipython-input-10-6006d03e9e23>", line 1
2_myvariable=0
^
SyntaxError: invalid decimal literal
The reason for the error is that the variable name starts with a number.
Tips
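If the name of a built-in function, such as print, is used as a variable name, the name is
rebound to the new value and the original built-in function can no longer be called by that name.
(True keywords such as if or for cannot be used as variable names at all; doing so raises a SyntaxError.)
Workaround: Restart the session. To do this: Select Kernel→Restart in the menu bar
of Jupyter Notebook.
Notes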
Here, the meaning of print is redefined as a reference to the value 0. Hence, within the
scope of the current session of Jupyter Notebook Kernel, the variable “print” refers to
0, not to the original print(output) function.
Tips
IPython offers numbered prompts (In/Out) with input and output caching, also referred
to as ‘input history’. All input is saved and can be retrieved as variables.
Notes
In[13] x = 12+13
x
Out[13] 25
In[14] In[13]
Out[14] 'x = 12+13\nx'
Here, In and Out are not part of the Python language itself; they are special variables provided by
IPython (In is a list of input strings and Out is a dictionary of outputs) for editing code conveniently and tracing the execution process.
Notes
In[15] Out[13]
Out[15] 25
In[18] print(dir())
Out[18] ['In', 'Out', '_', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__',
'__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i2', '_i3', '_ih', '_ii', '_iii', '_oh',
'exit', 'get_ipython', 'keyword', 'quit', 'x']
In[19] i = 20
print(i)
del i
Here you need to restart the Jupyter Notebook kernel first; otherwise an error will be raised,
because “print=0” in In[12] redefined print as a variable name.
In Python, del is a statement, not a function. Writing del(i) is still accepted, because the
parentheses merely group the target, but del i is the idiomatic form.
Notes
In[20] i
Out[20] ---------------------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-397d543883c5> in <module>
----> 1 i
NameError: name 'i' is not defined
The naming conventions recommended by Guido van Rossum, the creator of Python, include the
following points.
1. Module or package names use lowercase letters and underscore-separated words, e.g.
regex_syntax, py_compile, winreg
2. Class or exception names capitalize the first letter of each word, e.g.
BaseServer, ForkingMixIn, KeyboardInterrupt
3. Global constants or class constants use uppercase letters and underscore-separated words, e.g.
MAX_LOAD
4. The names of other objects, including method names, function names, and ordinary
variable names, use lowercase letters with underscore-separated words, e.g.
my_thread
5. If any of the above is private (intended for internal use), its name starts with an underscore, e.g.
_my_thread. Names with two leading and two trailing underscores, such as __init__ and __new__,
are reserved for Python’s special methods.
Tips
To write better Python code, you can follow the guidelines published as PEPs (Python
Enhancement Proposals). PEP 8 is the style guide for Python code. PEP 20, known as “The
Zen of Python,” provides guiding principles for Python programmers.
To read the official PEP documents, you can visit the
official website at https://www.python.org/dev/peps/.
In addition to the PEPs, the Google Python Style Guide is another commonly used coding
specification, particularly in data science practice.
Following these coding specifications improves the quality, readability, and
maintainability of your Python code.
Notes
Q&A
Table 3-5 Bitwise operators (x=2, y=5; Note: You can use the built-in function bin() to get the corresponding binary.)
Operators   Meanings              Instances   Results
&           Bitwise AND           x&y         0
|           Bitwise OR            x|y         7
^           Bitwise XOR           x^y         7
~           Bitwise NOT           ~x          -3
<<          Bitwise left shift    x << y      64
>>          Bitwise right shift   x >> y      0
Figure 3.1 Precedence of Python operators
In[4] # Exponentiation
x=2
y=5
x ** y
Out[4] 32
In[13] x=2
y=5
y//=x+8
print(y)
Out[13] 0
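In Out[13], y is 0, not 10, because the right-hand side x+8 is evaluated first, so the statement
is equivalent to y = y // (x+8). For more details, please refer to [2.5 Assignment
statement].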
Notes
In[15] x=True
y=False
x or y
Out[15] True
In[16] x=True
not x
Out[16] False
An integer can be converted to its binary representation (a string such as '0b10') with the built-in function bin().
Tips
In[18] x=2
y=3
x&y
Out[18] 2
Tips
In[19] x=2
y=3
bin(x&y)
Out[19] '0b10'
In[20] x=2
y=3
bin(x | y)
Out[20] '0b11'
In[21] bin(x^y)
Out[21] '0b1'
In[22] bin(~x)
Out[22] '-0b11'
In[23] x=2
y=3
bin(x<<y)
Out[23] '0b10000'
In[24] x=2
y=3
bin(x>>y)
Out[24] '0b0'
Built-in functions (BIFs) are functions that are built into the Python interpreter and
can be called directly by their function name.
Tips
In[26] round(2.991)
Out[26] 3
Rounding function: round(number, ndigits). It rounds its first argument, number,
and keeps ndigits digits after the decimal point. If ndigits is omitted,
the result is rounded to the nearest integer.
Tips
In[27] round(2.991,2)
Out[27] 2.99
The meaning of the argument “2” is “keep 2 digits after the decimal point”.
Tips
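A brief sketch of how round() behaves with and without ndigits; the variable names are illustrative. The last line shows a detail that is easy to overlook: Python rounds exact ties to the nearest even integer.

```python
whole = round(2.991)           # ndigits omitted: nearest integer, returned as int
two_places = round(2.991, 2)   # keep 2 digits after the decimal point
half_even = round(2.5)         # ties go to the nearest even integer

print(whole, type(whole).__name__, two_places, half_even)
```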
We can get the help information of the round() function through “?round” or “round?”.
The help information given by the system is as follows:
round(number[,ndigits])
Notes
The arguments placed in [], such as ndigits, are optional.
In Python, many commonly used mathematical functions, such as sin() and cos(),
are not built-in functions but are provided by the math module. The math module
provides a wide range of mathematical operations and functions.
Tips
Tips
A ValueError will be raised when attempting to take the square root of a negative number
using the math module.
Tips
The functions for complex numbers are in another module called cmath.
Tips
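The two tips above can be sketched together. The example below assumes only the standard math and cmath modules; the variable names are illustrative.

```python
import cmath
import math

real_root = math.sqrt(16)            # 4.0
try:
    math.sqrt(-1)                    # not defined for real numbers
    negative_ok = True
except ValueError as err:
    negative_ok = False
    print("ValueError:", err)

complex_root = cmath.sqrt(-1)        # 1j
print(real_root, complex_root)
```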
In[33] 2**2**3
Out[33] 256
Operator associativity is relevant when two or more operators have the same precedence
in an expression. It determines the order in which operators are evaluated when they
have the same precedence. Associativity can be either left to right (left-associative) or
right to left (right-associative). The ** operator is right-associative, so 2**2**3 is evaluated as 2**(2**3).
Tips
In[34] (2**2)**3
Out[34] 64
In[35] x=2+3
x
Out[35] 5
Please analyze the reason why the result of the expression “1+2 and 3+4” is “7”.
Tips
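One way to approach this question is to run the expression and a variation of it. The sketch below relies on two facts: arithmetic operators bind more tightly than “and”, and “and” returns one of its operands rather than a Boolean.

```python
value = 1 + 2 and 3 + 4      # parsed as (1 + 2) and (3 + 4)
print(value)                 # the first operand (3) is truthy, so the second is returned

short_circuit = 0 and 3 + 4  # 0 is falsy, so it is returned as-is
print(short_circuit)
```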
2.4 Statements
Q&A
Python statements are usually written in a single line. The newline character marks
the end of the statement.
Notes
In[1] i=20
j=30
k=40
Unlike C and Java, Python does not require a statement terminator such as “;”.
Notes
Please refer to PEP8-Style Guide for Python Code and Google Python Style Guide for
the writing specifications of Python code.
Tips
Though not typically recommended, you can separate different statements on the
same line with a semicolon “;” in Python.
Tips
In[3] i;j;k
Out[3] 40
In Python, “i, j, k” differs from “i; j; k”. The former creates a tuple, while the latter
represents multiple statements.
Tips
In[4] i,j,k
Out[4] (20, 30, 40)
In Python, there is a distinct difference between “;” and “,”. The former is used for
representing multiple statements in a single line, while the latter is used for creating
tuples. Detailed information about tuples is described in Section [2.11 Tuples].
Notes
In[5] print(i;j;k)
Out[5] File "<ipython-input-5-efd9c261ba8d>", line 1
print(i;j;k) #Exception, SyntaxError: invalid syntax
^
SyntaxError: invalid syntax
It is easy for beginners to confuse the use of semicolons and commas. For example,
the above code will raise an error.
Notes
Attempting to print a statement like “print(i; j; k)” will result in a SyntaxError (“invalid
syntax”) exception. This is because semicolons represent statement separators in
Python and cannot be used in this context.
Tips
Here, “\” refers to the line continuation character. PEP8 recommends that a line of
Python code should be limited to a maximum of 79 characters. If a line needs to
extend beyond this limit, it should generally be split into multiple lines. You can use
the line continuation character “\” to indicate that a line should be continued, although
in many cases, Python allows line continuation inside parentheses, brackets, and
braces without the need for this character.
Notes
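Both forms of line continuation can be sketched as follows; the variable names are illustrative and the cell the note refers to is not reproduced here.

```python
# Explicit continuation with a backslash
total_backslash = 1 + 2 + 3 + \
                  4 + 5 + 6

# Implicit continuation inside parentheses (no backslash needed)
total_parens = (1 + 2 + 3 +
                4 + 5 + 6)

print(total_backslash, total_parens)
```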
In[7] sum=0
for i in range(1,10):
    sum=sum+i
    print(i)
print(sum)
Out[7] 1
2
3
4
5
6
7
8
9
45
In Python, indentation is used to represent the block structure of code, similar to the
way braces “{}” are used in Java and C. However, there are some unique aspects of
Python’s indentation rules:
A colon (“:”) is required at the end of the line before the start of an indented block.
This is usually the header line of a compound statement such as if, for, while, def, or class.
Indentation must be consistent. All lines within the same block
of code must be indented at the same level. This alignment is required to correctly
represent the structure of the code.
Notes
Tips
In[8] a = 10
if a >5:
    print("a+1=",a+1)
print("a=",a)
Out[8] a+1= 11
a= 10
Please note that incorrect indentation can cause an IndentationError (a kind of SyntaxError) or lead to
unexpected behavior because the code’s logic is interpreted differently than intended.
Tips
Python doesn’t require a specific number of spaces for indentation, but by convention
and according to PEP 8 (the official Python style guide), four spaces are typically used
to denote one level of indentation.
Tricks
In Python, a colon (:) must be added at the end of the line before starting an indented
block.
Notes
An empty statement is a statement that does nothing. In Python, the pass statement
serves this purpose, essentially acting as a placeholder for future code and having no
effect when executed.
Notes
In[9] x=1
y=2
if x>y:
    pass
else:
    print(y)
Out[9] 2
Python is often described as being like “executable pseudocode” due to its readability.
In Python, if you need to create an empty block (for instance, a function or a loop that
you have not yet implemented), you would use the pass statement as a placeholder. If
you don’t include a pass statement or some other statement in such a block, Python
will raise an IndentationError, which is a kind of syntax error.
Tips
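A minimal sketch of pass as a placeholder; the function and class names are hypothetical and chosen only for illustration.

```python
def not_implemented_yet():
    pass                         # placeholder body; the function does nothing


class EmptyModel:
    pass                         # placeholder class with no attributes yet


result = not_implemented_yet()   # a function without a return statement yields None
print(result)
```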
Q&A
Tips
In[3] j=2
i=j
i
j
Out[3] 2
In[4] i=1
i+=20
i
Out[4] 21
Operator    Description
+=          Addition
-=          Subtraction
*=          Multiplication
/=          Division
%=          Modulus
<<=         Left bit shift
>>=         Right bit shift
Notes
In[5] a=2
a*=1+3
a
Out[5] 8
Here, the Out[5] is 8 (not 5) because the right hand side is always evaluated completely
before the assignment when running an augmented assignment.
Tips
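The evaluation order described above can be checked with a short sketch. The names b and c are introduced here only for comparison.

```python
a = 2
a *= 1 + 3          # equivalent to a = a * (1 + 3)

b = 2
b = b * (1 + 3)     # explicit form, same result as the augmented assignment

c = 2 * 1 + 3       # what the result would be without the grouping
print(a, b, c)
```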
In[6] a,b,c=1,2,3
a,b,c
Out[6] (1, 2, 3)
For further details about sequences and their unpacked assignments, please refer to
[2.13 Sequences].
Tips
The output here is a tuple, which is displayed as comma-separated values enclosed in parentheses. Please refer to
[2.11 Tuples].
Notes
In[7] a=1
b=2
a,b=b,a
a,b
Out[7] (2, 1)
Here, a,b is equivalent to the tuple (a,b). Therefore, a,b=b,a is equivalent to (a,b)=(b,a),
which is an example of the sequence unpacking described in [2.5.4 Sequence
unpacking].
Tips
In C and Java, swapping two variables (a, b) requires the introduction of a third
variable (c), as shown in the sequence ‘c = a; a = b; b = c;’. Python, on the other hand,
allows the same operation to be performed more succinctly with the line ‘a, b = b, a’.
However, this does not necessarily mean that Python consumes less memory than C
or Java. Conceptually, the right-hand side ‘b, a’ is first evaluated as a temporary
tuple, which is then unpacked into a and b, so the swap can use memory beyond
just the variables a and b (an interpreter may optimize this small case).
Notes
2.6 Comments
Q&A
Unlike Java and C, comments in Python start with a hash mark(#) and extend to the
end of the physical line.
Notes
In[1] x=1
# y=2
print(x)
Out[1] 1
In[2] x=1
# y=2
print(y)
Out[2] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-4-f9b039d12571> in <module>
1 x=1
2 # y=2
----> 3 print(y) # NameError: the line that defines y is a comment
A NameError exception was raised when the Python interpreter executed In[2], because the
assignment y=2 was commented out.
Tips
Python does not have a specific syntax for multiline comments. However, we can
implement multi-line comments in Python either by using single-line comments
consecutively or by using triple-quoted Python strings.
Notes
In[3] x=1
y=2
print(y)
"""
This is
a
multiline comment
in python
"""
2
In Jupyter Notebook, we can switch between “code line” and “comment line” with the
keyboard shortcut [Ctrl + /].
Tips
By convention, a triple-quoted string that appears right after the header of a function, method, or class
definition is a docstring (documentation string). For more details, please refer to
“3.4 Help documentation”.
Notes
2.7 If statements
Q&A
In Python, indentation serves the same function as braces ({}) in C and Java; that is, it
signifies the scope of compound statements.
Notes
In Python, a colon (:) is required at the end of the line that introduces a new indentation
level, such as the start of a control structure or a function definition. Thus, a colon
often precedes an indentation.
Notes
In[2] if(a<=b):
    if(a<b):
        print(a)
    else:
        print(a)
else:
    print(b)
Out[2] 2
Tips
In Python, the keyword ‘elif’ is shorthand for ‘else if’. It’s useful in avoiding excessive
indentation and keeping the code concise.
Notes
In[3] if(a<=b):
    print(a)
elif(a==b):
    print(a)
else:
    print(b)
Out[3] 2
Unlike C and Java, the ‘if’ statement in Python can include an ‘elif’ clause. Additionally,
Python’s ‘try-except’, ‘while’, and ‘for’ statements can all include an
‘else’ clause. For a loop, the ‘else’ clause runs when the loop finishes without a ‘break’;
for a ‘try’ statement, it runs when no exception was raised.
Tips
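The behavior of the else clause on a loop and on a try statement can be sketched as follows. The helper functions find_first_even and safe_int are hypothetical names introduced for this illustration.

```python
def find_first_even(numbers):
    """Return the first even number, or None when the loop ends without break."""
    for n in numbers:
        if n % 2 == 0:
            found = n
            break              # leaving via break skips the else clause
    else:
        found = None           # runs only when the loop was not broken
    return found


def safe_int(text):
    """Convert text to int; the else clause runs only if no exception occurred."""
    try:
        value = int(text)
    except ValueError:
        return None
    else:
        return value


print(find_first_even([1, 3, 4, 5]), find_first_even([1, 3, 5]))
print(safe_int("12"), safe_int("abc"))
```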
In[4] x=0
Result="Y" if x>0 else "N"
Result
Out[4] 'N'
The ‘if’ statement in Python can be written as a single-line expression, similar to the
ternary conditional operator (?:) in C and Java.
Notes
In Python’s conditional expression, the value for the true case (“Y” here) comes before the ‘if’ keyword.
Tips
In Python, the if statements, the for statements, and functions can all be written on
a single line, using ternary operators, list comprehensions, and lambda functions,
respectively.
Notes
In[5] x=1
Result="Y" if x>0 else "N"
Result
Out[5] 'Y'
In[6] if(a<=b):
else:
    print(b)
Out[6] File "<ipython-input-6-12262625dfcc>", line 2
else:
^
IndentationError: expected an indented block
In Python, each branch of an if statement must contain at least one statement. If a
branch is empty, the Python interpreter raises an error (an IndentationError here), because an
indented block is expected after the colon. Use the pass statement as a placeholder; see the pass statement in [2.4 Statements] for more information.
Notes
Tips
In[7] if(a<=b):
    pass # no error
else:
    print(b)
In this case, the pass statement serves as a placeholder to indicate that no action is
taken when a is less than or equal to b. If a is greater than b, the code will execute
the print(b) statement in the else block.
Tips
To check whether a year is a leap year in Python, you can use the following
suggestion:
Tips
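The suggested code is not reproduced in this text. One common formulation, given here as a sketch under the usual Gregorian calendar rule (the function name is_leap_year is an assumption), uses an if/elif chain:

```python
def is_leap_year(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    if year % 400 == 0:
        return True
    elif year % 100 == 0:
        return False
    elif year % 4 == 0:
        return True
    else:
        return False


for y in (1900, 2000, 2023, 2024):
    print(y, is_leap_year(y))
```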
Q&A
In[1] sum=0
for i in (1,2,3):
    sum=sum+i
    print(i,sum)
Out[1] 1 1
2 3
3 6
Unlike C and Java, there is only one way to write the for statement in Python:
[for ... in ...]. Make sure to include the colon at the end of the line and pay attention to
the indentation. You can refer to [2.4 Statements] for more information.
Notes
The in keyword in Python is used to iterate over iterables or iterators. In the given
context, the parentheses () represent a tuple, which is an iterable. For more information
on iterators and decorators, you can refer to [3.1 Iterators and Decorators].
Tips
Before the for statement, it is necessary to assign a value to the sum variable; otherwise,
an error will be raised due to the variable being undefined.
Notes
The range() function is commonly used after the in keyword in the for statement,
such as range(1, 10). The range() function returns a range object, an iterable that generates a
sequence of numbers from the start value (1 in this case) up to, but not including, the stop value (10 in this
case).
Notes
Please refer to [3.1 Iterators and Decorators] for more information on iterators.
Tips
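A short sketch of how the value returned by range() behaves; the variable names are illustrative.

```python
r = range(1, 10)
print(r)                        # range(1, 10): the numbers are not materialized yet

numbers = list(r)               # the stop value 10 is excluded
evens = list(range(0, 10, 2))   # start, stop, step
print(numbers, evens, len(r), 10 in r)
```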
In[3] myList=list(range(1,10))
myList
Out[3] [1, 2, 3, 4, 5, 6, 7, 8, 9]
To examine the contents of a range object, you can use the list() function to convert
it into a list. This allows you to view all the elements that
it generates.
Tricks
In the return value of range(1, 10), the generated sequence includes the
number 1 but excludes the number 10. This is a characteristic of the range() function
in Python, where the end value is exclusive. For more details on working with lists,
you can refer to [2.10 Lists].
Notes
Unlike C and Java, the for statement in Python can be used together with the else
statement.
Tips
In[5] myList=list(range(1,10))
for j in [1,3,4,5]:
    print(myList[j])
Out[5] 2
4
5
6
Similar to C, Java, etc., the for statement in Python supports the break and continue
statements.
Tips
The difference between the break and continue statements is as follows: The break
statement “exits the loop entirely,” while the continue statement “skips the remaining
code inside the loop for the current iteration and moves to the next iteration.”
Notes
In contrast to the break statement, the continue statement in Python means “jump
back to the top of the loop.” It skips the remaining statements in the current
iteration and moves on to the next iteration, so any code
following the continue statement within the loop body is bypassed for that iteration.
Tips
Q&A
In[1] i=1
sum=0
while(i<=100):
    sum=sum+i
    i+=1
print(sum)
Out[1] 5050
In Python, the while statement has a single form, and there is no equivalent of the
do-while statement found in some other programming languages. The while loop
repeatedly executes a block of code as long as a specified
condition is true. The condition is checked before each iteration, and if it evaluates to
False initially, the loop body is not executed at all.
Notes
In[2] i=1
sum=0
while(i<=10):
    sum=sum+i
    i+=1
    if i==6:
        continue
    if i==9:
        break
    print(i,sum)
else:
    print("here is else")
Out[2] 2 1
3 3
4 6
5 10
7 21
8 28
To summarize, break exits the loop entirely, while continue skips the remaining
statements within the loop for the current iteration and proceeds to the next iteration.
Notes
In[3] i=1
sum=0
while(i<=10):
    sum=sum+i
    i+=1
    print(i,sum)
else:
    print("here is else")
Out[3] 2 1
3 3
4 6
5 10
6 15
7 21
8 28
9 36
10 45
11 55
here is else
Unlike C and Java, the while statement in Python can indeed include an else clause.
The else clause in a while loop will be executed only when the condition of the loop
becomes False and the loop completes its iterations normally, without encountering
a break or return statement.
Tips
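The two cells above can be condensed into one sketch that reports whether the else clause ran. The function run_loop and its stop_at parameter are hypothetical names used only for this illustration.

```python
def run_loop(limit, stop_at=None):
    """Sum 1..limit; report whether the else clause of the while loop ran."""
    i, total, else_ran = 1, 0, False
    while i <= limit:
        if i == stop_at:
            break                 # leaving via break skips the else clause
        total += i
        i += 1
    else:
        else_ran = True           # reached only when the condition became False
    return total, else_ran


print(run_loop(10))               # loop completes normally
print(run_loop(10, stop_at=4))    # loop is interrupted by break
```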
2.10 Lists
Q&A
In the basic Python syntax, parentheses (), brackets [], and braces {} represent tuples,
lists, and sets/dictionaries, respectively.
Notes
In[2] myList2=myList1
myList2
Out[2] [21, 22, 23, 24, 25, 26, 27, 28, 29]
Method 2: Using an assignment statement, where you assign a defined list variable to
a new list variable.
Tips
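One consequence of Method 2 is worth a sketch: assignment does not copy the list, so both names refer to the same object. The shortened list contents and the name myList3 below are assumptions made for this illustration.

```python
myList1 = [21, 22, 23]
myList2 = myList1            # no copy: both names refer to the same list
myList2[0] = 0
print(myList1)               # the change made through myList2 is visible here

myList3 = myList1.copy()     # shallow copy: an independent list object
myList3[1] = 0
print(myList1, myList3)
```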
In[3] myList3=list("Data")
myList3
Out[3] ['D', 'a', 't', 'a']
Method 3: Using type casting to convert other types of objects to the list type.
Tips
Negative subscripts or negative indexes can be used in Python to access elements from
the end of a sequence, such as a string or a list.
The positive indexes start from 0, where 0 represents the first element.
The negative indexes start from –1, where –1 represents the last element.
Notes
In[4] myList1[-1]
Out[4] 29
In[5] myList1[-9]
Out[5] 21
In[6] myList1[9]
Out[6] ---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-6-8724c27fc4be> in <module>
----> 1 myList1[9]
Here, the reason for the error is that the index is out of range.
Notes
The difference between positive indexes and negative indexes in Python is as follows:
Positive indexes start with 0 and are numbered from left to right, while negative
indexes start with –1 and are numbered from right to left.
Tips
2.10.2 Slicing
In[7] myList1
Out[7] [21, 22, 23, 24, 25, 26, 27, 28, 29]
Displaying a variable lets you check its current value. In a data science project,
it is important to keep track of the current values of variables
to ensure accurate results and proper data analysis.
Tips
In[8] myList1[1:8]
Out[8] [22, 23, 24, 25, 26, 27, 28]
In Python, we can slice a list using indexes, and the notation for slicing is Start:Stop:Step.
Tips
When a colon (:) appears in the index of a Python sequence, it typically indicates
slicing the sequence. This slicing notation allows you to specify the start, stop, and
step values to extract a portion of the sequence.
Notes
In[9] myList1[1:8:2]
Out[9] [22, 24, 26, 28]
It’s important to note that the start, stop, and step values can be omitted when writing
a slice. When any of these values are omitted, they take on default values:
Tips
In[10] myList1[:5]
Out[10] [21, 22, 23, 24, 25]
If the start value is omitted, it defaults to the beginning of the sequence. If the stop
value is omitted, it defaults to the end of the sequence. If the step value is omitted,
it defaults to 1, indicating consecutive elements. For more details on working with
sequences and slicing, you can refer to [2.13 Sequences].
Tips
The element at the “stop” index is not included in the slice. In this
example, the stop index is 5; the element at index 5, whose value is 26,
is therefore excluded.
Notes
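The default values and the exclusive stop index can be sketched with the same list used in this section; the result variable names are illustrative.

```python
myList1 = [21, 22, 23, 24, 25, 26, 27, 28, 29]

first_five = myList1[:5]      # start omitted: from the beginning; index 5 excluded
from_two = myList1[2:]        # stop omitted: through the end
every_other = myList1[1:8:2]  # explicit step of 2
full_copy = myList1[:]        # start and stop omitted: a new list with all elements

print(first_five, from_two, every_other, full_copy is myList1)
```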
In[11] myList1[:]
Out[11] [21, 22, 23, 24, 25, 26, 27, 28, 29]
Tips
In[12] myList1[2:]
Out[12] [23, 24, 25, 26, 27, 28, 29]
Tips
In[13] myList1[:-1]
Out[13] [21, 22, 23, 24, 25, 26, 27, 28]
Tips
2.10.3 Reversing
In[14] myList1
Out[14] [21, 22, 23, 24, 25, 26, 27, 28, 29]
In[15] myList1[::-1]
Out[15] [29, 28, 27, 26, 25, 24, 23, 22, 21]
Reversing lists can be achieved using the index [::-1], which means setting step to -1.
Tricks
In[16] myList1
Out[16] [21, 22, 23, 24, 25, 26, 27, 28, 29]
In Python, slicing a list does not change the list itself; instead, it creates a new list with
the selected elements.
Notes
In[17] myList1[:-1]
Out[17] [21, 22, 23, 24, 25, 26, 27, 28]
Here, [:-1] has the same meaning as [:n-1], where n is the length of the list. In data science projects,
the index -1 is used frequently; it refers to the last element, that is, the element with the largest index.
Tips
In[18] reversed(myList1)
Out[18] <list_reverseiterator at 0x18ef35863d0>
In Python, to reverse lists, we can also use the built-in function reversed() or the list
method reverse().
Tricks
The return value of the reversed() function is an iterator, and its values can be displayed
by passing it to the list() function.
Notes
For information about iterators, please refer to [3.1 Iterators and Decorators].
Tips
In[19] list(reversed(myList1))
Out[19] [29, 28, 27, 26, 25, 24, 23, 22, 21]
In[20] myList1
Out[20] [21, 22, 23, 24, 25, 26, 27, 28, 29]
As Out[20] shows, reversed() leaves myList1 unchanged. To reverse the list in place, use its
reverse() method as follows: myList1.reverse().
Tips
In[21] myList1.reverse()
myList1
Out[21] [29, 28, 27, 26, 25, 24, 23, 22, 21]
When you use reversed(), it returns an iterator that allows you to iterate over the list
in reverse order without modifying the original list. However, if you use the reverse()
method directly on a list, it will reverse the elements of the list itself.
Notes
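The contrast between the two approaches, sketched with a shortened list (the contents and helper names are assumptions for illustration):

```python
myList1 = [21, 22, 23]

rev_copy = list(reversed(myList1))   # new list; myList1 is unchanged
unchanged = myList1 == [21, 22, 23]

ret = myList1.reverse()              # in place; the method itself returns None
print(rev_copy, unchanged, ret, myList1)
```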
In[22] list("chaolemen")
Out[22] ['c', 'h', 'a', 'o', 'l', 'e', 'm', 'e', 'n']
We can use the list() function to convert an object of a different type into a list.
Tips
In Python, the “+” operation for lists and the extend() method of a list have similar
functionality. Both operations are used to concatenate or combine lists.
Notes
The difference between the append() and extend() methods of a list is that append()
is used to add a single element to the list, while extend() is used to add multiple
individual elements.
Notes
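The cells for these notes are not reproduced in this text. A minimal sketch with illustrative list names compares append(), extend(), and the “+” operator; note that “+” builds a new list, whereas extend() modifies the list in place.

```python
lst_a = [1, 2]
lst_a.append([3, 4])        # adds ONE element (the list itself)

lst_b = [1, 2]
lst_b.extend([3, 4])        # adds each element, in place

lst_c = [1, 2]
lst_d = lst_c + [3, 4]      # builds a NEW list; lst_c is unchanged

print(lst_a, lst_b, lst_c, lst_d)
```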
The zip() function in Python is used to iterate in parallel over two or more iterables. It
takes multiple iterables as input and returns an iterator that generates tuples containing
elements from each iterable, paired together based on their respective positions.
Notes
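A short sketch of zip(), using made-up sample data. When the iterables differ in length, zip() stops at the shortest one.

```python
names = ["a", "b", "c"]
scores = [90, 85, 77, 60]              # one extra item on purpose

pairs = list(zip(names, scores))       # stops at the shortest iterable
lookup = dict(zip(names, scores))      # a common way to build a dictionary

print(pairs)
print(lookup)
```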
In Python, list comprehension (or list derivation) is a concise way to create lists based
on existing lists or other iterables. List comprehension is written within
square brackets ([]).
You can refer to section 2.10.6 titled “Lists Derivation” for more detailed information
on this topic.
Tips
List comprehension (or list derivation) must be enclosed within square brackets ([]).
You can refer to section 2.10 titled “Lists” for more information on this topic.
Tips
List comprehension is typically written within square brackets ([]), and it allows you
to generate new lists by applying an expression to each item in an iterable, optionally
including conditions for filtering the elements.
Tips
In[30] range(10)
Out[30] range(0, 10)
Tips
In[31] list(range(0,10,2))
Out[31] [0, 2, 4, 6, 8]
In the code snippet (range(0,10,2)), the numbers 0, 10, and 2 represent the start, stop,
and step arguments of the range() function, respectively.
Tips
String placeholders, such as %d, can be used in Python list comprehensions, which are
similar to the placeholders used in the printf() and scanf() functions in C.
Tricks
We can add or insert elements to a list using the insert() method of the list.
Tips
Here, the number “8” represents the element to be inserted, and the number “1”
represents the position at which the element will be inserted into the lst_1 list.
Notes
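As a hedged sketch of the call described above, assuming a hypothetical starting value for lst_1 (the original cell is not reproduced here), insert() takes the position first and the element second:

```python
# Assumption: the initial contents of lst_1 are invented for illustration.
lst_1 = [10, 11, 12]

# insert(position, element): put the element 8 at index 1.
lst_1.insert(1, 8)

print(lst_1)   # [10, 8, 11, 12]
```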
We can use the pop() method of a list to delete a specific element based on its index.
To remove the element at index 2, you can use the above code.
Tips
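A minimal sketch of pop() with index 2, using a hypothetical list (the original cell is not reproduced here). Note that pop() both removes the element and returns it:

```python
# Assumption: the list contents are invented for illustration.
lst_1 = [10, 11, 12, 13]

removed = lst_1.pop(2)   # remove and return the element at index 2

print(removed)   # 12
print(lst_1)     # [10, 11, 13]
```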
Tips
In addition to deleting an element based on its index, Python also supports removing
an element from a list based on its value. You can use the remove() method for this
purpose.
Tips
Here, only the first occurrence of 10 is removed, not the second occurrence.
Notes
If you want to remove all occurrences of a particular value from the list, you can use
other techniques such as a list comprehension or a loop.
Tips
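The two behaviours above can be sketched as follows, assuming a hypothetical list that contains the value 10 several times. The list comprehension is one possible technique for removing every occurrence, not necessarily the one the book uses:

```python
# Assumption: the list contents are invented for illustration.
lst_1 = [10, 11, 10, 12, 10]

lst_1.remove(10)          # removes only the FIRST occurrence of 10
print(lst_1)              # [11, 10, 12, 10]

# Build a new list that keeps every element except 10.
without_tens = [x for x in lst_1 if x != 10]
print(without_tens)       # [11, 12]
```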
To calculate the length of a list in Python, you can use the built-in function len().
Tips
To sort lists in Python, you can use the built-in function sorted().
Tips
In[42] lst_1
Out[42] [10, 10, 11, 12, 11, 13, 14, 15]
In Python, the built-in function sorted() does not change the order of the elements in
a list.
Notes
In addition to the built-in function sorted(), the list method sort() can also be used to
sort lists.
Tips
The difference between the built-in function sorted() and the list method sort() is that
the sort() method directly modifies the order of elements within the list itself, while
the sorted() function returns a new sorted list without modifying the original list.
Notes
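A minimal sketch of the difference between sorted() and sort(), using a hypothetical list. One detail worth noticing is that sort() returns None, so its result should not be assigned to a variable:

```python
# Assumption: the list contents are invented for illustration.
scores = [13, 10, 12, 11]

ordered = sorted(scores)   # returns a new sorted list
print(ordered)             # [10, 11, 12, 13]
print(scores)              # [13, 10, 12, 11]  (unchanged)

result = scores.sort()     # sorts in place and returns None
print(result)              # None
print(scores)              # [10, 11, 12, 13]
```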
Note the difference between the list methods extend() and append().
Notes
Tips
Here, lst_1 is appended directly after lst_2; that is, the elements of the two lists
are merged into a single list.
Tips
To print lists in Python, you can use the built-in function print().
Tips
The difference between the built-in function reversed() and the list method reverse()
is that the former does not modify the list itself, while the latter directly modifies the
list itself.
Tips
‘reversed(lst_1)’ returns an iterator that needs to be converted using the list() function
before printing.
Notes
In[48] reversed(lst_1)
Out[48] <list_reverseiterator at 0x18ef367bf10>
In[49] lst_1
Out[49] [1, 2, 3, 'Python', True, 4.3, None]
The reversed() function is a built-in function in Python that does not change the list
itself. Instead, it temporarily returns the list in reverse order as an iterator.
Tips
In[51] str1=[1,2,3,4,5]
str2=[20,21,23,24,25]
print(zip(str1,str2))
Out[51] <zip object at 0x0000018EF368C280>
To aggregate elements from two lists simultaneously in Python, you can use the zip()
function.
Tips
In[52] print(list(zip(str1,str2)))
Out[52] [(1, 20), (2, 21), (3, 23), (4, 24), (5, 25)]
The return value of the zip() function is an iterator, which needs to be cast by list() to
get its value. Please refer to [3.1 Iterators and Decorators] for details.
Tips
In[53] str1=["a","about","c","china","b","beijing"]
[x.upper() for x in str1 if len(x)>1]
Out[53] ['ABOUT', 'CHINA', 'BEIJING']
Unlike C and Java, Python introduces the concept of list comprehension, which can be
used to simplify complex for statements.
Tips
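To make the simplification concrete, the sketch below writes the same filter-and-transform step twice, once as an explicit for statement and once as a list comprehension. The variable names are illustrative only:

```python
words = ["a", "about", "c", "china", "b", "beijing"]

# Explicit loop: filter, transform, and collect step by step.
upper_loop = []
for w in words:
    if len(w) > 1:
        upper_loop.append(w.upper())

# Equivalent list comprehension: expression, for clause, then if filter.
upper_comp = [w.upper() for w in words if len(w) > 1]

print(upper_loop)                 # ['ABOUT', 'CHINA', 'BEIJING']
print(upper_loop == upper_comp)   # True
```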
In[55] str1=["a","about","c","china","b","beijing"]
[str2.upper() for str2 in str1 if len(str2)>1]
Out[55] ['ABOUT', 'CHINA', 'BEIJING']
The code above uses an if clause inside the list comprehension to filter the elements;
it is a filter condition rather than a ternary operator. Please refer to [2.7 The if
statement] for details.
Tips
In[56] myList=[2,3,5,6,7,3,2]
list(enumerate(myList))
Out[56] [(0, 2), (1, 3), (2, 5), (3, 6), (4, 7), (5, 3), (6, 2)]
In Python, to track the index of a list, you can use the built-in function enumerate().
Tips
List comprehension provides a concise and efficient way to create lists based on
existing lists or other iterables. It simplifies code by condensing multiple lines of
code into a single line, making it more readable and expressive. This approach is
particularly valuable in data analysis and data science tasks that involve working with
large datasets.
By utilizing list comprehension, data scientists can express complex operations more
succinctly and intuitively, resulting in more manageable and error-resistant code.
2.11 Tuples
Q&A
In[1] myTuple1=(1,3,5,7,2)
print(myTuple1)
Out[1] (1, 3, 5, 7, 2)
In[2] 1,3,5,7,2
Out[2] (1, 3, 5, 7, 2)
In Python, parentheses can be omitted when defining tuples; however, commas cannot
be omitted.
Notes
In[3] myTuple2=myTuple1
print(myTuple2)
Out[3] (1, 3, 5, 7, 2)
The second method to create a tuple in Python is through an ‘assignment statement’
that assigns an existing tuple to a new variable. No elements are copied or unpacked:
after the assignment, myTuple2 and myTuple1 refer to the same tuple object.
Notes
In[4] myTuple3=tuple("Data")
myTuple3
Out[4] ('D', 'a', 't', 'a')
The third method involves using type casting to convert other data types into tuples.
Notes
In[5] myTuple4=1,3,5,7,2
print(myTuple4)
Out[5] (1, 3, 5, 7, 2)
The fourth method involves using the comma. This means that the parentheses which
are typically used in the first method can be omitted. In Python, it is the comma,
not the parentheses, that creates a tuple.
Notes
In[6] 1,3,5,7,2
Out[6] (1, 3, 5, 7, 2)
Tips
In[7] myTuple=1,3,5,7,2
myTuple[2]=100
Out[7] ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-bab615dd7a09> in <module>
1 myTuple=1,3,5,7,2
----> 2 myTuple[2]=100 # Why that exception: Tuples are immutable objects.
One of the key differences between tuples and lists in Python is that tuples are
‘immutable objects’, which means they cannot be changed after they are created.
Lists, on the other hand, are ‘mutable objects’ and can be modified even after their
creation.
Tips
In[8] myList=[1,3,5,7,2]
myList[2]=100
myList
Out[8] [1, 3, 100, 7, 2]
In this case, no exception is raised because a ‘list’ is a mutable object in Python,
so its elements can be reassigned in place.
Tips
In[9] myTuple=1,3,5,7,2
myTuple[2:5]
Out[9] (5, 7, 2)
Similar to lists, tuples support slicing operations because both are sequence types in
Python.
Notes
In[10] myTuple=1,3,5,7,2
len(myTuple)
Out[10] 5
To calculate the length of a tuple in Python, you can use the built-in function, len().
Tips
In[11] myTuple=1,3,5,7,2
print(sorted(myTuple))
Out[11] [1, 2, 3, 5, 7]
To sort tuples in Python, you can use the built-in function sorted().
Notes
The sorted() function in Python returns a new result that is of type ‘list’, not ‘tuple’.
Tips
In[12] myTuple=1,3,5,7,2
myTuple.sort()
Out[12] ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-d7b571f24488> in <module>
1 myTuple=1,3,5,7,2
----> 2 myTuple.sort()
3 # Why that exception: Tuples do not have the method.
AttributeError: 'tuple' object has no attribute 'sort'
Unlike lists, tuples in Python do not have a sort() method. This is because tuples are
immutable objects and the sort() method would require changing the original object
itself, which is not possible with tuples.
Notes
The code myTuple.sort() causes an error because tuples in Python do not have a sort()
method, given their immutability.
Tips
In[13] myTuple=1,3,5,7,2
5 in myTuple
Out[13] True
The in operator can be used with tuples in Python to check if a specific value exists
within the tuple. For example, 5 in myTuple checks if the number 5 is an element of
the myTuple tuple.
Tips
In[14] myTuple=1,3,5,7,2
myTuple.count(11)
Out[14] 0
To count the frequency of an element in a tuple, you can use the count() method. For
instance, myTuple.count(11) counts the occurrences of the value 11 in the myTuple
tuple.
Tips
In[15] myTuple=1,3,5,7,2
x1,x2,x3,x4,x5=myTuple
x2
Out[15] 3
In Python, the rule for unpacking tuples is ‘assignment by position’. This means that
variables are assigned to the corresponding values in the tuple based on their positions.
Tips
In[17] myTuple=(1,5,6,3,4)
print(myTuple)
print(len(myTuple))
print(max(myTuple))
Out[17] (1, 5, 6, 3, 4)
5
6
In[18] myTuple=(11,12,13,12,11,11)
a1,a2,a3,a4,a5,a6=myTuple
a3
Out[18] 13
In Python, tuples support the feature of unpacking assignment, which allows for the
assignment of tuple values to a corresponding set of variables in a single line of code.
Tips
In[19] myTuple=(11,12,13,12,11,11)
myTuple.count(11)
Out[19] 3
Tips
In Python, a tuple used as a formal parameter with a ‘*’ prefix in function definition
means that the function can receive a variable number of actual arguments. These
arguments are collected into a tuple.
Tips
In a function definition, the ‘*’ prefix is used to represent a tuple, while the ‘**’
prefix is used to represent a dictionary. A ‘*’ parameter collects the extra positional
arguments into a tuple, and a ‘**’ parameter collects the extra keyword arguments
(key-value pairs) into a dictionary.
Tips
For a ‘**’ parameter, the keys must be given explicitly as keyword names in the actual
arguments, such as x1, x2.
Notes
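The collection of arguments into a tuple and a dictionary can be sketched as follows. The function name describe() and the argument values are hypothetical; only the keyword names x1 and x2 echo the text above:

```python
# Hypothetical function, used only to show what * and ** collect.
def describe(*args, **kwargs):
    # args is a tuple of positional arguments;
    # kwargs is a dict of keyword arguments.
    return args, kwargs

positional, named = describe(1, 2, x1=10, x2=20)

print(positional)   # (1, 2)
print(named)        # {'x1': 10, 'x2': 20}
```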
In Python, the return value of many functions is often a tuple because the syntax
‘return 1, 2, 3’ is equivalent to ‘return (1, 2, 3)’. This shorthand allows multiple values
to be returned as a tuple without explicitly using parentheses.
Tips
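A minimal sketch of a function that returns several values as a tuple, which the caller can then unpack by position. The function name min_max() is hypothetical:

```python
# Hypothetical helper: returns two values, which Python packs into a tuple.
def min_max(values):
    return min(values), max(values)   # same as: return (min(values), max(values))

result = min_max([3, 1, 2])
low, high = result                    # unpack the returned tuple by position

print(result)      # (1, 3)
print(low, high)   # 1 3
```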
In[23] 1,2
Out[23] (1, 2)
It’s important to note that the parentheses are not part of the tuple itself; they are added
for clarity and readability.
Tips
In[24] x=1
y=2
x,y=y,x
print(x,y)
Out[24] 2 1
In Python, swapping the values of two variables can be achieved using tuples: the
right-hand side y,x is packed into a tuple, which is then unpacked into x and y. This
is commonly known as “tuple unpacking”.
Tricks
Tips
2.12 Strings
Q&A
In[1] print('abc')
print("abc")
Out[1] abc
abc
Unlike in C and Java, the concepts of ‘character’ and ‘string’ are more closely
unified in Python, resulting in fewer practical distinctions between them. In Python, a
character is typically represented as a string of length 1, which allows it to be treated
as a special case of a string.
Tips
Strings can be enclosed either with single quotes or double quotes in Python.
Tips
In[2] print("abc'de'f")
Out[2] abc'de'f
When the string itself contains single quotes, it should be enclosed with double quotes,
and vice versa.
Tips
In this case, the argument of the print() function is enclosed within double quotes
because the string itself contains single quotes.
Notes
In[3] print('abc"de"f')
Out[3] abc"de"f
In this case, the argument of the print() function is enclosed within single quotes
because the string itself contains double quotes.
Notes
In[4] str1='''
Hello
world
!
'''
str1
Out[4] '\n Hello \n world \n !\n'
Triple quotes can also be used in Python to indicate strings with newlines. For more
details, please refer to the official Python documentation on string literals.
Tips
In[5] str1[1:4]="2222"
# Why that exception: TypeError: 'str' object does not support item assignment
Out[5] ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-d80a51ea9762> in <module>()
1
----> 2 str1[1:4]="2222"
3 # Why that exception: TypeError: 'str' object does not support item assignment
TypeError: 'str' object does not support item assignment
Tips
In[6] str1="abc"
str1="defghijk"
str1[1:4]
Out[6] 'efg'
The execution of the above code will not raise an error, because assigning a new value
to str1 rebinds the variable to a new string object rather than modifying the original
string in place. Please refer to [2.2.2 Dynamically Typed Language].
Tips
“Immutable object” means that the value of the object cannot be altered locally, and
“dynamically typed language” is a different concept from “Immutable object”.
Notes
The second feature of strings in Python is that they are considered ‘sequences’. This
means that all operators and functions that support sequences can be used with strings.
For instance, strings in Python support operations like slicing, which allows you to
extract portions of a string by specifying a range of indices.
Tips
In[7] 'clm'[0:2]
Out[7] 'cl'
Strings in Python support the operation of slicing. The rule for slicing is that it
includes the beginning index but excludes the ending index. For example, when
slicing a string, the resulting substring will include the element at index 0 but not
the element at index 2.
Tips
In[8] str3="chaolemen"
str4=str3[1:3]
str4
Out[8] 'ha'
In[9] "chaolemen"[:6]
Out[9] 'chaole'
To remove whitespace, such as spaces and newlines, from the beginning and end of a
string, you can use the strip() method.
Tips
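The original cell for this operation is not reproduced here. As a hedged sketch, assuming the standard string methods strip(), lstrip(), and rstrip() and a hypothetical input string:

```python
# Assumption: the sample string is invented for illustration.
raw = "  \n data science \t\n"

both = raw.strip()     # remove whitespace from both ends
left = raw.lstrip()    # remove leading whitespace only
right = raw.rstrip()   # remove trailing whitespace only

print(repr(both))    # 'data science'
print(repr(left))    # 'data science \t\n'
print(repr(right))   # '  \n data science'
```

Because strings are immutable, each method returns a new string and leaves raw unchanged.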
To check if a character or string appears within another string in Python, you can use
the in keyword.
Tips
In[14] len('clm')
Out[14] 3
To calculate the length of a string in Python, you can use the built-in function len().
Tips
In[15] print(ord('A'))
print(chr(97))
Out[15] 65
a
In Python, you can use the built-in function ord() to obtain the Unicode value of a
character.
Tips
The built-in function chr() in Python is the counterpart of the ord() function. It
takes a Unicode value as an argument and returns the corresponding character string.
Notes
In[16] print(ord('朝'))
print(chr(26397))
Out[16] 26397
朝
By importing the sys module and calling the sys.getdefaultencoding() function from
it, you can obtain the default character encoding used in Python.
Tips
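A minimal sketch of the call described above. In Python 3 the default encoding is UTF-8; the sample character reuses the one from the ord() example:

```python
import sys

# The default encoding used when converting between str and bytes.
encoding = sys.getdefaultencoding()
print(encoding)   # utf-8

# encode() uses that default when no encoding is given.
encoded = '朝'.encode()
print(encoded == '朝'.encode('utf-8'))   # True
```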
In[17] s='a\tbbc'
s
Out[17] 'a\tbbc'
Here, \t is an escape character that represents a tab.
Tips
In[18] print(s)
Out[18] a bbc
When a string contains ‘escape characters’, there is a difference between the output
of s and print(s). The former displays the representation of the string, leaving the
escape characters uninterpreted, while the latter interprets the escape characters and
displays the string accordingly.
Notes
In[19] str(1234567)
Out[19] '1234567'
The integer can be converted into a string using the str() function.
Tips
In[20] "abc".upper()
Out[20] 'ABC'
To convert uppercase characters to lowercase, you can use the lower() method.
Conversely, to convert lowercase characters to uppercase, you can use the upper()
method.
Tips
When working with special characters and path strings in Python, it is important to
be mindful of certain issues. For example, assigning a path string to a variable, such
as s1 = "E:\SparkR\My\T", can lead to unexpected behavior due to the interpretation
of backslashes as escape characters.
Tips
In[21] s1="E:\SparkR\My\T"
s1
Out[21] 'E:\\SparkR\\My\\T'
In the Jupyter notebook, evaluating s1 directly displays the representation of the
string, in which each backslash is doubled, whereas the built-in function print()
displays the string itself with single backslashes.
Tips
In[22] s1=r"http://www.chaolemen.org"
s1
Out[22] 'http://www.chaolemen.org'
In Python, strings prefixed with r or R, such as r'...' or r"...", are referred to as raw
strings.
Raw strings treat backslashes (\) as literal characters instead of escape characters.
This means that they preserve the original backslashes and do not interpret them as
escape sequences.
Notes
Raw strings are commonly used when dealing with regular expressions, file paths, or
any situation where backslashes need to be handled as literal characters.
Tricks
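The effect of the r prefix can be sketched by comparing a raw string with its explicitly escaped equivalent. The path is the one used earlier in this section; the other names are illustrative:

```python
# The same path written twice: with doubled backslashes and as a raw string.
escaped = "E:\\SparkR\\My\\T"
raw = r"E:\SparkR\My\T"

print(escaped == raw)   # True
print(len(raw))         # 14

# With a recognised escape such as \t, the two forms differ.
tab_text = "a\tb"       # 3 characters: 'a', TAB, 'b'
raw_text = r"a\tb"      # 4 characters: 'a', '\', 't', 'b'
print(len(tab_text), len(raw_text))   # 3 4
```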
In Python 3, the use of Unicode string literals (string literals prefixed by u) is no longer
necessary. While they are still valid, they are primarily maintained for compatibility
purposes with Python 2.
Tips
The join() method in Python returns a string by concatenating all the elements of an
iterable object (iterObj), separated by a specified string separator (sepStr). For more
detailed information about iterable objects, please refer to the relevant sections on
iterators and generators in the appropriate Python documentation.
Tips
The argument of the join() method in Python is a sequence, and the variable before
the dot (referred to as seq_str here) represents the separator. The join() method
concatenates all the elements of the sequence, using seq_str as the separator between
them.
Notes
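A minimal sketch of join(), with hypothetical separator and element values. The elements must already be strings, so numbers are converted with str() first:

```python
# Assumption: the sample values are invented for illustration.
parts = ["2024", "01", "15"]
date_text = "-".join(parts)            # the separator comes before the dot
print(date_text)                       # 2024-01-15

numbers = [1, 2, 3]
number_text = ", ".join(str(n) for n in numbers)   # convert non-strings first
print(number_text)                     # 1, 2, 3
```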
In[24] str1=["abc","aaba","adefg","bb","c"]
str1.sort()
str1
Out[24] ['aaba', 'abc', 'adefg', 'bb', 'c']
In Python, you can use the set() function to convert another collection, such as a list
of strings, into a set data structure.
Tips
In[28] print("set(str1)=",set(str1))
Out[28] set(str1)= {'c', 'adefg', 'bb', 'abc', 'aaba'}
The re module in Python provides support for regular expressions, including regular
expression syntax, pattern matching, and various operations for working with
patterns. It offers powerful tools for pattern matching, searching, substitution, and
other advanced operations involving text processing based on regular expressions.
Notes
In[29] import re
p1 = re.compile('[a-dA-D]')
r1 = p1.findall('[email protected]')
r1
Out[29] ['c', 'a', 'c', 'd', 'c']
The syntax of regular expressions in Python can be found in the official documentation.
For Python 3, the documentation can be accessed at: https://docs.python.org/3/library/
re.html.
Notes
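As a hedged illustration of matching and substitution with the re module, the sketch below reuses the character class [a-dA-D] from the cell above on hypothetical input strings:

```python
import re

# Same character class as above: any letter a-d or A-D.
pattern = re.compile('[a-dA-D]')
matches = pattern.findall('Big Data')     # hypothetical input string
print(matches)                            # ['B', 'D', 'a', 'a']

# Substitution: replace every run of digits with '#'.
masked = re.sub(r'\d+', '#', 'a1b22c333')
print(masked)                             # a#b#c#
```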
2.13 Sequences
Q&A
2.13.1 Indexing
In[1] myString="123456789"
myString[1]
Out[1] '2'
Tips
In[2] myList=[11,12,13,14,15,16,17,18,19]
myList[1]
Out[2] 12
In[3] myTuple=(21,22,23,24,25,26,27,28,29)
myTuple[1]
Out[3] 22
2.13.2 Slicing
In[4] myString="123456789"
myString[1:9:2]
Out[4] '2468'
Slicing can be used through [start:stop:step]. For further details, please refer to [2.10
Lists].
Tips
In[5] myList=[11,12,13,14,15,16,17,18,19]
myList[1:9:2]
Out[5] [12, 14, 16, 18]
In[6] myTuple=(21,22,23,24,25,26,27,28,29)
myTuple[1:9:2]
Out[6] (22, 24, 26, 28)
2.13.3 Iteration
In[7] myString="123456789"
for i in myString:
    print(i,end=" ")
Out[7] 1 2 3 4 5 6 7 8 9
A sequence is an example of an iterable data type that can be iterated over using the
for statement in Python.
Tips
In[8] myList=[11,12,13,14,15,16,17,18,19]
for i in myList:
    print(i,end=" ")
Out[8] 11 12 13 14 15 16 17 18 19
In[9] myTuple=(21,22,23,24,25,26,27,28,29)
for i in myTuple:
    print(i,end=" ")
Out[9] 21 22 23 24 25 26 27 28 29
2.13.4 Unpacking
In[10] myString="123456789"
a1,a2,a3,a4,a5,a6,a7,a8,a9=myString
a1,a2,a3,a4,a5,a6,a7,a8,a9
Out[10] ('1', '2', '3', '4', '5', '6', '7', '8', '9')
In[11] myList=[11,12,13,14,15,16,17,18,19]
a1,a2,a3,a4,a5,a6,a7,a8,a9=myList
a1,a2,a3,a4,a5,a6,a7,a8,a9
Out[11] (11, 12, 13, 14, 15, 16, 17, 18, 19)
In[12] myTuple=(21,22,23,24,25,26,27,28,29)
a1,a2,a3,a4,a5,a6,a7,a8,a9=myTuple
a1,a2,a3,a4,a5,a6,a7,a8,a9
Out[12] (21, 22, 23, 24, 25, 26, 27, 28, 29)
Tips
In Python, the * operator, when used with a sequence, performs a “repeat operation”
rather than a “multiplication” operation. This means that the sequence is repeated a
certain number of times to create a new sequence.
Notes
In[14] myList=[11,12,13,14,15,16,17,18,19]
myList * 3
Out[14] [11,
12,
13,
14,
15,
16,
17,
18,
19,
11,
12,
13,
14,
15,
16,
17,
18,
19,
11,
12,
13,
14,
15,
16,
17,
18,
19]
In[15] myTuple=(21,22,23,24,25,26,27,28,29)
myTuple * 3
Out[15] (21,
22,
23,
24,
25,
26,
27,
28,
29,
21,
22,
23,
24,
25,
26,
27,
28,
29,
21,
22,
23,
24,
25,
26,
27,
28,
29)
In Python, all objects of “sequence”, regardless of their data type (such as lists, tuples,
strings), support common functions.
Notes
In Python, the built-in function len() is used to calculate the length of sequences. This
function can be applied to various sequence types, such as lists, tuples, and strings, to
determine the number of elements they contain.
Tips
In[17] sorted(myString),sorted(myList),sorted(myTuple)
Out[17] (['1', '2', '3', '4', '5', '6', '7', '8', '9'],
[11, 12, 13, 14, 15, 16, 17, 18, 19],
[21, 22, 23, 24, 25, 26, 27, 28, 29])
In Python, the sorted() function is used to sort sequences. It takes an iterable as input
and returns a new sorted list containing the elements of the original sequence.
Tips
In[18] reversed(myString),reversed(myList),reversed(myTuple)
Out[18] (<reversed at 0x15bad2819a0>,
<list_reverseiterator at 0x15bad281070>,
<reversed at 0x15bad2d6af0>)
In[19] list(reversed(myString))
Out[19] ['9', '8', '7', '6', '5', '4', '3', '2', '1']
What the reversed() function returns is an iterator, which supports lazy evaluation, and
can be converted into a list using the built-in function list().
Notes
In[20] enumerate(myString),enumerate(myList),enumerate(myTuple)
Out[20] (<enumerate at 0x15bad2d1280>,
<enumerate at 0x15bad2cb500>,
<enumerate at 0x15bad2cb300>)
In Python, the enumerate() function is used to track and enumerate indexes while
iterating over a sequence. It returns an iterator that generates pairs of index and value
for each element in the sequence.
Tips
In[21] list(enumerate(myString))
Out[21] [(0, '1'),
(1, '2'),
(2, '3'),
(3, '4'),
(4, '5'),
(5, '6'),
(6, '7'),
(7, '8'),
(8, '9')]
The enumerate() function returns an iterator that can be converted to a list using list().
Notes
In[22] zip(myList,myTuple)
Out[22] <zip at 0x15bad2d7600>
In Python, the zip() function is used to aggregate elements from two or more iterables
into tuples. It takes multiple iterables as input and returns an iterator that generates
tuples containing elements from each iterable, paired together.
Tips
In[23] list(zip(myList,myTuple))
Out[23] [(11, 21),
(12, 22),
(13, 23),
(14, 24),
(15, 25),
(16, 26),
(17, 27),
(18, 28),
(19, 29)]
The built-in function zip() returns an iterator, which can be converted into a list
using another built-in function list(). For details, please refer to [3.1 Iterators and
Decorators].
Notes
In contrast with list, tuple, set, and dictionary, “sequence” is not an independent data
type in Python, but a general term for multiple data types including list, tuple, and
string.
Tips
2.14 Sets
Q&A
In[1] mySet1={1,2,3,4,1,2,23}
mySet1
Out[1] {1, 2, 3, 4, 23}
In Python, there are several methods to define sets. The first method involves directly
defining a set using braces {}.
From this definition, it’s clear that a set is essentially an unordered data structure
consisting only of values with no keys.
Tips
In[2] mySet2=mySet1
mySet2
Out[2] {1, 2, 3, 4, 23}
The second method: use the assignment statement to assign values to new set variables
from pre-existing defined set variables.
Tips
In[3] myList1=[1,2,3,3,2,2,1,1]
mySet3=set(myList1)
mySet3
Out[3] {1, 2, 3}
The third method: use the set() function to convert objects of other types into a set
object.
Tips
In[4] mySet4=set("chaolemen")
mySet4
Out[4] {'a', 'c', 'e', 'h', 'l', 'm', 'n', 'o'}
A key feature of a set is certainty: for any given set and any specific element, that
element either belongs to the set or it does not. There is no ambiguity permitted.
Tips
In[6] mySet4[2]
Out[6] ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-78241c857f8a> in <module>
1 # Unordered
2 # The elements in the sets are unordered, so the elements in the
# set cannot be accessed with indexes
----> 3 mySet4[2] # Why that exception: TypeError: 'set' object does not support
# indexing
Unordered: The elements in sets are unordered, meaning they don’t have a specific
arrangement. Therefore, in Python, it’s not possible to use indices to access elements
within a set.
Tips
In[7] mySet5={1,2,3}
mySet6={1,2,1,1,3}
mySet5==mySet6
Out[7] True
Uniqueness: The elements in a set are distinct from each other, meaning each element
appears only once. Therefore, in Python, two sets with the same elements, regardless
of their order, are considered equal.
Tips
In[9] # Include
3 in mySet7
Out[9] True
In[11] # Equal to
mySet7 == mySet8
Out[11] False
In[13] # Subset
{1,5} < mySet7
Out[13] True
In[14] # Union
mySet7|mySet8
Out[14] {1, 2, 3, 4, 5, 6, 10}
In[15] # Intersection
mySet7&mySet8
Out[15] {10}
In[16] # Difference
mySet7-mySet8
Out[16] {1, 3, 5}
In[18] #To check whether one set is a subset of another set in Python
print({1,3}.issubset(mySet7))
Out[18] True
In[19] #To check whether one set is a superset of another set in Python
print({1,3,2,4}.issuperset(mySet7))
Out[19] False
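The cells above use mySet7 and mySet8, whose definitions are not shown in this section. As a sketch, assuming the values below (an assumption chosen only because it is consistent with every output shown above), the same operations can be reproduced as follows:

```python
# Assumption: these definitions are inferred from the outputs above,
# not taken from the book's own cell.
mySet7 = {1, 3, 5, 10}
mySet8 = {2, 4, 6, 10}

print(3 in mySet7)                         # True   (membership)
print(mySet7 == mySet8)                    # False  (equality)
print({1, 5} < mySet7)                     # True   (proper subset)
print(mySet7 | mySet8)                     # union
print(mySet7 & mySet8)                     # intersection: {10}
print(mySet7 - mySet8)                     # difference
print({1, 3}.issubset(mySet7))             # True
print({1, 3, 2, 4}.issuperset(mySet7))     # False
```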
In[20] mySet9={1,2,3,4}
mySet9.add(4)
mySet9.remove(1)
mySet9
Out[20] {2, 3, 4}
The set type is mutable, meaning that after its creation, you can modify it by adding
or removing elements.
Tips
In[21] mySet10=frozenset({1,2,3,4})
mySet10
Out[21] frozenset({1, 2, 3, 4})
The frozenset is an immutable type of set in Python. This means that once a frozenset
is created, it cannot be modified – you can’t add or remove elements from it.
Tips
In data science projects, to safeguard data from unintentional modifications during the
analysis process, we typically employ immutable objects.
Notes
In[22] mySet10.add(5)
Out[22] ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-d051a89f1878> in <module>
----> 1 mySet10.add(5) # Why that exception: AttributeError: 'frozenset' object has no
attribute 'add'
Due to the uniqueness of elements in sets, they are commonly used to perform
deduplication operations in data analysis and data science projects.
Tips
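A minimal sketch of deduplication with a set, using hypothetical values. Because sets are unordered, the second variant, which relies on dict.fromkeys() keeping insertion order, is one possible option when the original order matters:

```python
# Assumption: the sample values are invented for illustration.
values = [3, 1, 3, 2, 1]

unique_values = set(values)                 # duplicates removed, order not kept
print(unique_values == {1, 2, 3})           # True

# dict keys are unique and keep insertion order, so the order survives.
unique_in_order = list(dict.fromkeys(values))
print(unique_in_order)                      # [3, 1, 2]
```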
2.15 Dictionaries
Q&A
Tips
In[2] myDict3={"grade":2,"gender":"M","grade":15,"grade":5}
myDict3
Out[2] {'grade': 5, 'gender': 'M'}
In Python dictionaries, duplicate keys are not allowed. If you provide duplicate keys,
the value of the last key will be preserved, effectively overwriting previous assignments
to that key.
Tips
In[3] myDict1['name']
Out[3] 'Jerry'
In[4] myDict1[name]
Out[4] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-9f850ce95d5e> in <module>
1 myDict2={2:2,2:3,4:5}
----> 2 myDict2[name]
In Python, you can change the value of a specific item in a dictionary by referring to
its key name and assigning a new value to it.
Notes
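A minimal sketch of updating a dictionary by key. The dictionary name and its contents are hypothetical and do not reproduce the book's myDict1:

```python
# Assumption: the dictionary contents are invented for illustration.
record = {"name": "Jerry", "grade": 2}

record["grade"] = 3        # existing key: the value is replaced
record["gender"] = "M"     # new key: a new item is added

print(record)   # {'name': 'Jerry', 'grade': 3, 'gender': 'M'}
```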
Here, the key [2,3] is a list, which is an unhashable object, so an unhashable type
error (TypeError) is raised.
Tips
In Python, a TypeError will be raised when an unhashable data type is used in code
that requires hashable data.
Tricks
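The unhashable-key error can be sketched as follows. The values are hypothetical; replacing the list with a tuple is a common workaround, since tuples of hashable elements are themselves hashable:

```python
# A list cannot be used as a dictionary key because it is unhashable.
try:
    bad = {[2, 3]: "value"}
    error_name = None
except TypeError as err:
    error_name = type(err).__name__
    print(error_name, err)

# A tuple with the same elements is hashable and works as a key.
good = {(2, 3): "value"}
print(good[(2, 3)])   # value
```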
Dictionaries are widely utilized in data science projects for various purposes, including
but not limited to storing temporary data, such as function arguments using **args.
However, dictionaries have broader applications in tasks like data preprocessing,
feature engineering, configuration parameter storage, categorical variable mapping,
and efficient data retrieval.
Tips
2.16 Functions
Q&A
There are three types of functions in Python: built-in functions, functions inside
modules, and user-defined functions.
User-defined functions can be written as single-line functions, known as “lambda
functions”.
Notes
User-defined functions can be defined both inside and outside a class. This is because
Python supports both object-oriented programming and procedural programming
paradigms.
A built-in function (BIF) refers to a function that is included as part of the Python
programming language. These functions are built into the Python interpreter and can
be called directly by their function name.
Tips
A function inside a module, also known as a module function, refers to a function that
is defined within a Python module. To call a module function, you first need to import
the module to which it belongs, and then you can use its name to invoke the function.
Tips
To define a user-defined function in Python, you use the def keyword. This keyword is
followed by the name of the function, parentheses for any parameters, and a colon to
indicate the start of the function block.
Notes
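A minimal sketch of a user-defined function written with def, next to an equivalent single-line lambda function. The names add and add_lambda are hypothetical:

```python
# def keyword, function name, parameters in parentheses, then a colon.
def add(a, b):
    """Return the sum of a and b."""
    return a + b

# The same logic as a single-line lambda function.
add_lambda = lambda a, b: a + b

total = add(2, 3)
total_lambda = add_lambda(2, 3)
print(total, total_lambda)   # 5 5
```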
Q&A
Tips
In[3] min([1,2,3])
Out[3] 1
Tips
In[4] max([1,2,3])
Out[4] 3
Tips
In[5] pow(2,10)
Out[5] 1024
Tips
In[6] round(2.991,2)
Out[6] 2.99
The round() function in Python is used for rounding numbers. The second argument
of the round() function specifies the number of decimal places to retain after rounding.
Tips
Tips
In general, the function names used for casting in Python are often similar to the
names of the target data types.
Notes
In[8] bool(1)
Out[8] True
Tips
In[9] float(1)
Out[9] 1.0
Tips
In[10] str(123)
Out[10] '123'
Tips
In[11] list("chao")
Out[11] ['c', 'h', 'a', 'o']
Tips
In[12] set("chao")
Out[12] {'a', 'c', 'h', 'o'}
Tips
In[13] tuple("chao")
Out[13] ('c', 'h', 'a', 'o')
Tips
Tips
In Python, you can use the isinstance() function to check the data type of an object.
The isinstance() function takes two arguments: the object you want to check and the
data type you want to compare it against. It returns True if the object is an instance of
the specified data type, and False otherwise.
Tips
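A minimal sketch of isinstance() with hypothetical values. The second argument may also be a tuple of types, and subclasses count as instances of their parent type:

```python
is_int = isinstance(3, int)                # True
is_float_int = isinstance(3.0, int)        # False: a float is not an int
is_either = isinstance("3", (int, str))    # True: matches one type in the tuple
is_bool_int = isinstance(True, int)        # True: bool is a subclass of int

print(is_int, is_float_int, is_either, is_bool_int)   # True False True True
```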
In[16] dir()
Out[16] ['In',
'Out',
'_',
'_1',
'_10',
'_11',
……
'_ih',
'_ii',
'_iii',
'_oh',
'exit',
'get_ipython',
'i',
'quit']
To list the names (variables, functions, and so on) defined in the current scope, you
can use the dir() function or the magic commands %whos and %who in interactive
environments like IPython or Jupyter Notebook.
Tips
In[17] help(dir)
Out[17] Help on built-in function dir in module builtins:
dir(...)
dir([object]) -> list of strings
Tips
In[18] myList=[1,2,3,4,5]
len(myList)
Out[18] 5
Tips
In[19] range(1,10,2)
Out[19] range(1, 10, 2)
Tips
The range(1, 10, 2) function is used to generate a range object, a lazily evaluated
iterable that begins at 1 (inclusive), ends at 10 (exclusive), and increments by a step
size of 2. Please refer to [2.10 Lists] for more details.
Tips
In[20] list(range(1,10,2))
Out[20] [1, 3, 5, 7, 9]
The range() function in Python returns a range object, which is iterable and uses a
form of lazy evaluation: its values are computed on demand rather than stored. To
evaluate and print the values, you can use the list() function to convert the range
object into a list.
Tips
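The lazy behaviour of range() can be sketched as follows. The example also shows that a range object supports len(), indexing, and membership tests without being converted to a list, and that an iterator obtained from it with iter() is consumed as it is read:

```python
r = range(1, 10, 2)

size = len(r)            # 5, computed without building a list
second = r[1]            # 3, ranges support indexing
has_seven = 7 in r       # True
values = list(r)         # [1, 3, 5, 7, 9]

it = iter(r)             # an iterator over the range
first = next(it)         # 1
rest = list(it)          # [3, 5, 7, 9]  (the iterator is now exhausted)

print(size, second, has_seven, values, first, rest)
```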
In[21] callable(dir)
Out[21] True
Tips
In[22] bin(8)
Out[22] '0b1000'
Tips
In[23] hex(8)
Out[23] '0x8'
Tips
Python and its third-party packages offer various features and programming concepts
that better support the specific needs of data science projects compared to traditional
software development. These features are outlined below, along with the corresponding
references for further reading:
Notes
Q&A
To call a module function, you can use the following method: import the module by
its name.
Tips
Unlike built-in functions, functions inside a module are defined within packages or
modules provided by the standard library or by third parties. To call these functions,
you need to first import the module where the function is defined. The function is
usually called using the module name followed by the function name.
Notes
In Python, there are multiple ways to import modules, and each method corresponds
to a different way of calling functions from the imported modules.
Notes
In[2] cos(1.5)
Out[2] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-edeaf624fe76> in <module>
----> 1 cos(1.5) # Why that exception: NameError: name 'cos' is not defined
In[3] math.cos(1.5)
Out[3] 0.0707372016677029
Workaround: <module_name>.<function_name>
Tips
In principle, we have the flexibility to create our own “alias” when using the syntax
“import module_name as alias”. However, in the practice of data science, it is common
to follow conventional “alias” names to ensure the readability of the source code.
Tips
Tips
It is recommended to carry out a comparative analysis with In[2]. The reason why the
interpreter does not raise an error here is that the method of importing module has
changed.
Tips
By using the method of importing specific functions from a module, you can
directly import the desired function and use it without needing to reference the
module name.
Tips
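The three import styles discussed in this section can be compared side by side. This is a minimal sketch using the standard math module; the alias m is an arbitrary choice for illustration.

```python
import math              # call as math.cos(...)
import math as m         # call through the alias: m.cos(...)
from math import cos     # call directly: cos(...)

a = math.cos(1.5)
b = m.cos(1.5)
c = cos(1.5)

# All three names refer to the same function, so the results are identical.
print(a, b, c)
```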
Q&A
Unlike C and Java, a user-defined function is defined using the “def” keyword in
Python.
Notes
Python supports the definition of “inner functions,” which means that a function can
be defined within another function. If the inner function, func2(), references a local
variable (not a global variable) from the outer function, it is referred to as a “closure.”
Notes
    def func2(i):
        print('pass'+str(i)+str(j))
    return func2
The inner function, func2(), is a local function whose name is visible only inside
the outer function, func1(). This means that func2() cannot be called by name from
outside func1(). The return func2 statement in the outer function returns the inner
function itself, so that the caller can invoke it. Without this statement, func2()
would never be executed, since there are no other statements to call it.
Tips
In[2] func1()
In[3] func1()(2)
According to the definition in In[1], func2 is the return value of func1(). Therefore, in
terms of the running process, calling func1()(2) is equivalent to calling
func2(2).
Notes
When func1() is executed, the return func2 statement will also be executed. As a result,
func2 will be returned and can be subsequently executed. If the return statement is
not present, func1() will automatically return None, and the call func1()(2) will raise
“TypeError: 'NoneType' object is not callable.”
Tips
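The closure mechanism can be reproduced with a small self-contained sketch. The value j = 0 is an assumption (the original definition of func1() is not shown here), and the inner function returns a string instead of printing it so that the result is easy to inspect.

```python
def func1():
    j = 0  # assumed value; a local variable of the outer function

    def func2(i):
        # func2 reads j from the enclosing scope, which makes it a closure
        return 'pass' + str(i) + str(j)

    return func2  # return the inner function itself, without calling it


inner = func1()            # inner now refers to func2
message = inner(2)         # equivalent to func1()(2)
print(message)             # pass20

# The captured variable is recorded on the function's code object.
closure_vars = inner.__code__.co_freevars
print(closure_vars)        # ('j',)
```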
Docstrings serve as documentation for functions and can be accessed using either
the built-in help() function or the ? symbol in certain Python environments, such as
Jupyter Notebook or IPython.
Notes
In[5] help(get_name)
get_name(msg)
Get the user name according to the user prompt msg. If the input is blank, the default
is Anonymous User
In[6] get_name?
To call a user-defined function, you can simply use the function name directly followed
by parentheses.
Notes
We can use the built-in function callable() to check whether the function is “callable”.
Tricks
In[8] print(callable(get_name))
Out[8] True
In Python, when defining a function, you have the option to use the return statement
to specify the value or values that the function should return.
Notes
Out[9] 4
In Python, if a function does not have a return statement, the return value of the
function is None. In Python, None is a special object that represents the absence of a
value or a missing value.
Notes
Out[10] None
When multiple values are returned, they are usually bundled together in a tuple data
structure.
Notes
Out[11] (3, 4)
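A short sketch of how return values behave. The function names min_max and no_return are hypothetical and are used only for illustration.

```python
def min_max(values):
    # Returning two comma-separated values builds a single tuple.
    return min(values), max(values)


def no_return():
    pass  # no return statement, so the call evaluates to None


result = min_max([3, 4, 3, 4])
print(result)                    # (3, 4)

low, high = min_max([3, 4, 3, 4])   # tuple unpacking at the call site
nothing = no_return()
print(low, high, nothing)        # 3 4 None
```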
From the perspective of function definition, formal parameters are divided into
optional parameters and required parameters. The way to distinguish them is that the
parameters with default values are called “optional parameters,” which can be called
without giving arguments, such as x4 and x5.
Tips
Out[12] 1
(2, 4)
3
4
5
After the other arguments have been matched to their formal parameters, the remaining
positional arguments (2 and 4) are packed into a tuple and bound to the starred parameter x2.
Notes
In[13] my_func(1,2,x4=4,x3=3,x5=5)
Out[13] 1
(2,)
3
4
5
From the perspective of function definition, any “formal parameters” defined after
the *parameter in Python are called “keyword-only parameters” (forced named
parameters). In the example def my_func(x1, *x2, x3, x5=5, x4=4):, the parameters
x3, x5, and x4 are keyword-only parameters.
Notes
When calling a function with parameters defined after the *parameter (also known as
a “starred parameter” or “splat parameter”), you must pass those parameters by name.
If you omit the name of a required keyword-only parameter such as x3, the Python
interpreter will raise a TypeError.
Notes
In[14] my_func(1,2,4,x3=3,x5=5)
Out[14] 1
(2, 4)
3
4
5
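The packing and keyword-only rules can be checked with a sketch like the one below. It reuses the signature quoted in the text, but returns the values as a tuple instead of printing them, which is an assumption made for easier inspection.

```python
def my_func(x1, *x2, x3, x5=5, x4=4):
    # x2 collects surplus positional arguments; x3, x4, x5 are keyword-only.
    return x1, x2, x3, x4, x5


packed = my_func(1, 2, 4, x3=3)
print(packed)                 # (1, (2, 4), 3, 4, 5)

try:
    my_func(1, 2, 3)          # 3 is absorbed by *x2, so x3 is still missing
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```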
Local variables in Python are variables that are defined or declared inside a function’s
body. These variables have a local scope, meaning they can only be accessed and used
within that specific function.
Notes
In[15] x=0
def myFunc(i):
    x=i
    print(x)
myFunc(1)
print(x)
The second x is not the same one as the x in the first line. The second x is a local
variable.
Tips
Out[15] 1
0
To convert a local variable to a global variable in Python, you can use the global
keyword followed by the variable name. Simply declaring global x will make the
variable x accessible and modifiable in the global scope.
Notes
In[16] x=0
def myFunc(i):
    global x
    #Then x is the global variable, not a local variable.
    x=i
    print(x)
myFunc(1)
print(x)
Here, “global x” is a statement in its own right: it cannot be combined with an
assignment (for example, “global x = i” is a syntax error), and the keyword “global”
must be followed by the variable name “x”.
Tips
Out[16] 1
1
Similar to global variables, Python also has “nonlocal” variables, which are used in
inner functions. The usage of nonlocal variables is similar to that of global variables,
but they refer to variables of the enclosing function rather than to global variables.
Notes
In[17] x=0
def myFunc(i):
    x=i
    def myF():
        nonlocal x  #nonlocal must be a statement of its own.
        x=2
        print(x)
    print(x)
myFunc(1)
print(x)
Both “global x” and “nonlocal x” are statements in their own right and cannot be
combined with an assignment in the same statement.
Tips
Out[17] 1
0
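The difference between global and nonlocal can be illustrated with a minimal sketch. The names make_counter, increment, total, and add_to_total are hypothetical.

```python
def make_counter():
    count = 0  # local variable of the enclosing function

    def increment():
        nonlocal count   # rebind the variable of make_counter(), not a new local
        count += 1
        return count

    return increment


counter = make_counter()
first = counter()
second = counter()
print(first, second)      # 1 2

total = 0  # module-level (global) variable


def add_to_total(n):
    global total          # rebind the module-level variable
    total += n


add_to_total(5)
print(total)              # 5
```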
Python passes arguments as references to objects. The visible effect depends on
whether the object is mutable:
1. Immutable objects (int, float, str, bool, tuple): the behavior resembles
pass-by-value. Changes to formal parameters do not affect the original arguments.
2. Mutable objects (list, set, dict): the behavior resembles pass-by-reference.
In-place changes to formal parameters affect the original arguments.
Notes
(1) Pass-by-value: When the “argument” is an immutable object (int, float, str, bool,
tuple), the object cannot be modified in place. When the “formal parameter” is assigned
a new value inside the called function, it is bound to a new object, so the value
of the argument will not be changed.
Notes
In[18] i=100
def myfunc(j,k=2):
    j+=2
myfunc(i)
print(i)
In Python, parameters with default values are often referred to as “optional parameters”.
These parameters allow the function to be called without explicitly providing a value
for them.
Tips
Out[18] 100
(2) Pass-by-reference: when the “argument” is a mutable object (list, set, dict), the
argument and the “formal parameter” refer to the same object, that is, when the
“formal parameter” is modified in place, the “argument” will also be changed.
Notes
In[19] i=[100]
def myfunc(j,k=2):
    j[0]+=2
myfunc(i)
print(i)
Out[19] [102]
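A small sketch may help to see what happens in both cases. It uses id() to show that the parameter initially refers to the same object as the argument; the function names probe, rebind, and mutate are hypothetical.

```python
def probe(param):
    # Return the identity of the object the parameter refers to.
    return id(param)


def rebind(j):
    j += 2        # int is immutable: j is rebound to a new object
    return j


def mutate(j):
    j[0] += 2     # the list is modified in place


i = 100
same_object = probe(i) == id(i)   # the parameter refers to the caller's object
returned = rebind(i)
print(same_object, i, returned)   # True 100 102

numbers = [100]
mutate(numbers)
print(numbers)                    # [102]
```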
The principle of passing data between “arguments” and “formal parameters” is that
“arguments” should correspond to “formal parameters” one by one. The exceptions are
the special parameters “self” and “cls”: the interpreter fills them in automatically,
so the caller does not pass arguments for them, such as:
def class_func(cls):
def __init__(self, name, age):
Notes
When using user-defined functions, we have to pay attention to the following three
problems.
Notes
Firstly, “arguments” are divided into “positional arguments” and “keyword arguments”.
The distinction between them is not based on the presence of a default value. Instead,
it lies in how the arguments are provided during function calls.
Notes
If there is no return statement in a Python function, the return value of the function is
None. In Python, the value None is commonly used to represent a missing or empty
value.
Tips
The reason for this error is that in Python, when defining function parameters,
parameters without default values must come before parameters with default values.
The order should be: non-default parameters first, followed by default parameters.
Tips
Secondly, if a return statement is not explicitly written in a function, the default return
value will be None. The value None can be displayed using the print() function or by
accessing it directly.
Notes
Out[22] None
In[23] d is None
Out[23] True
Thirdly, functions are treated as objects in Python. This means that in Python, the
language follows the philosophy of “everything is an object.” Functions can be
assigned to variables, passed as arguments to other functions, stored in data structures,
and have attributes just like any other object in Python.
Notes
In[24] myfunc=abs
print(type(myfunc))
#Like other objects, Python function names can be used as arguments of type(), and
#the return value is the function type.
print(myfunc(-100))
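The idea that functions are ordinary objects can be sketched as follows. The helper apply() and the dictionary func_table are hypothetical names introduced for illustration.

```python
def apply(func, value):
    # A function received as an argument is called like any other function.
    return func(value)


my_func = abs                         # bind a second name to the built-in abs
func_table = {                        # functions stored in a data structure
    'abs': abs,
    'neg': lambda v: -v,
    'text': str,
}

results = {name: apply(f, -100) for name, f in func_table.items()}
print(results)                        # {'abs': 100, 'neg': 100, 'text': '-100'}

type_name = type(my_func).__name__    # functions have a type and attributes
print(type_name, my_func.__name__)    # builtin_function_or_method abs
```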
Q&A
The lambda function has a colon (:). Before the colon are the formal parameters, and
after the colon is the function’s return value.
Tips
In[1] x=2
y= lambda x:x+3
y(2)
Out[1] 5
In[2] x=2
def myfunc(x):
    return x+3
myfunc(2)
Out[2] 5
In data science projects, lambda functions are commonly used as arguments for other
functions, with the filter() function being a common example.
Notes
The filter() function uses an iterative reading mode, which reads the value of each
element in the order of subscript from the second argument (e.g., MyList), and assigns
it to the variable x in the first argument (a lambda function).
Tips
In Python, the return value of the filter() function is an iterator. This iterator’s values
can be displayed after converting it into a list. For more information, refer to section
‘3.1 Iterators and Generators’.
Notes
Out[4] [3, 6, 9]
Out[6] 55
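The following sketch shows lambda functions used as arguments. The input list (the integers 1 to 10) and the use of functools.reduce() are assumptions chosen to produce results similar to the outputs above; the original cells are not shown here.

```python
from functools import reduce

my_list = list(range(1, 11))          # assumed input: 1, 2, ..., 10

# filter() is lazy: it returns an iterator, so list() is needed to see the values.
lazy_result = filter(lambda x: x % 3 == 0, my_list)
multiples_of_three = list(lazy_result)
print(multiples_of_three)             # [3, 6, 9]

# reduce() folds the list into one value by applying the lambda pairwise.
total = reduce(lambda acc, x: acc + x, my_list)
print(total)                          # 55
```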
Exercises
[1] Which of the following is not a sequence type?
A. list
B. tuple
C. set
D. str
[2] Which of the following variables complies with the naming rules?
A. 3q
B. _
C. while
D. ds@
A. 1
B. True
C. 21
D. raise an exception
[7] How many times will the following while loop be executed?
k = 100
while k > 1:
print(k)
k = k // 2
A. 3
B. 4
C. 5
D. 6
A. [3,4,5,3,6,7,8]
B. [3,4,3,5,6,7,8]
C. [3,4,5,6,7,8,[2,3]]
D. [3,4,5,6,7,8,2,3]
[10] Which of the following is not used in Python 3 to solve the problem of special characters in the path?
A. s = "D:\test"
B. s = r"D:\test"
C. s = u"D:\test"
[13] Which of the following data structures is normally used to remove duplicates in Python?
A. list
B. tuple
C. set
D. string
[15] Which of the following is the correct way to call a built-in function?
A. First import the module to which it belongs, and then call through the module name.
B. Call directly with the function name.
C. Call with the def keyword.
[19] Which of the following statements about the user-defined function is wrong?
A. The mutable variable of the user-defined function is passed by position.
B. Immutable variables of the user-defined function are passed by value.
C. The user-defined functions cannot be placed in a class.
D. The user-defined functions can be written as single line functions.
[20] Which of the following statements about the lambda function is wrong?
A. Small anonymous functions can be created with the lambda keyword.
B. Like nested function definitions, the lambda functions can reference variables from the containing
scope.
C. The lambda functions are syntactically restricted to a single expression.
D. All of the above
3. Advanced Python Programming for Data Science
This chapter will introduce the advanced Python programming concepts and skills necessary to excel as a data
scientist. The topics we will cover include:
Iterators and generators
Modules
Packages
Help documentation
Exceptions and errors
Debugging
Search path
Current working directory
Object-oriented programming
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 145
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2_3
Q&A
1. Iterable object: An object that can be used directly in a loop statement, such as a
‘for’ loop.
2. Iterator: An object that can be called by the built-in next() function and will
continuously return the next value in sequence.
Tips
(1) While all iterators are iterable, the converse is not necessarily true. That is, iterable
objects are not always iterators.
Notes
In[1] myList=[1,2,3,4,5]
next(myList)
In[2] myList=[1,2,3,4,5]
from collections.abc import Iterable
result = isinstance(myList, Iterable)
(2) The built-in iter() function in Python is used to convert iterable objects into
iterators.
Notes
In[3] myIterator=iter(myList)
print(next(myIterator))
print(next(myIterator))
print(next(myIterator))
The built-in next() function in Python is used to traverse the items of an iterator,
retrieving them one at a time.
Tips
Out[3] 1
2
3
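The relationship between iterables and iterators can be verified with a short sketch. It also shows what happens when an iterator runs out of items: next() raises StopIteration, which is the signal a ‘for’ loop relies on internally. The variable names are illustrative.

```python
from collections.abc import Iterable, Iterator

my_list = [1, 2, 3]
is_iterable = isinstance(my_list, Iterable)    # a list can be looped over
is_iterator = isinstance(my_list, Iterator)    # but next(my_list) would fail
print(is_iterable, is_iterator)                # True False

my_iterator = iter(my_list)                    # convert the iterable to an iterator
collected = []
while True:
    try:
        collected.append(next(my_iterator))
    except StopIteration:                      # raised when no items remain
        break
print(collected)                               # [1, 2, 3]
```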
(1) In Python, a generator is a special type of function that does not return a single
value. Instead, it returns an iterator object that generates a sequence of values.
This is accomplished by using yield statements instead of return statements.
Notes
In[5] myGen()
(3) One of the key features of a generator is that its elements are executed only when
they are accessed or called. This behavior is part of the ‘lazy execution’ model
followed by generators in Python.
Notes
To directly display the items produced by a generator, use the print() function with
the unpacking operator (*), as in the following example: print(*myGen()).
Tricks
Out[6] 3,4,5,6,7,8,9,10,11,12,
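A generator of this kind might look like the sketch below. The name my_gen and the bounds 3 and 13 are assumptions chosen to reproduce the values shown above; the original definition of myGen() is not shown here.

```python
def my_gen(start=3, stop=13):
    # The body does not run until a value is requested.
    for value in range(start, stop):
        yield value          # pause here and hand one value to the caller


gen = my_gen()
first = next(gen)            # runs the body up to the first yield
rest = list(gen)             # consumes the remaining values
exhausted = list(gen)        # a generator can be traversed only once
print(first, rest, exhausted)

print(*my_gen(), sep=',')    # unpack a fresh generator into print()
```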
3.2 Modules
Q&A
In Python, we can import a module into our code using the import statement. The
import statement performs two operations: searching for the named module and
binding the search results to a name in the local scope.
There are three types of import statements:
import module_name
import module_name as alias_name
from module_name import function_name
Tips
Out[1] 0.9974949866040544
In[2] cos(1.5)
Out[3] 0.9974949866040544
Out[4] 0.0707372016677029
3.3 Packages
Q&A
The two most commonly used tools for managing Python packages or modules are:
1. Pip: Pip is the recommended tool by the Python Packaging Authority for installing
packages from the Python Package Index (PyPI).
2. Conda: Conda is a cross-platform package and environment manager that not
only installs and manages conda packages from the Anaconda repository but also
supports packages from the Anaconda Cloud.
Notes
When you run the command pip install scipy, pip may report “Requirement
already satisfied: scipy in c:\anaconda\lib\site-packages”. This indicates that the scipy
package is already installed and there is no need to reinstall it. However, when you run
the command pip install orderPy, there is no such prompt.
Notes
pip list
or
conda list
Notes
Run pip uninstall and conda uninstall to remove the package installed.
Tips
To remove installed packages, you can use either pip or conda, depending on your
package management setup. Here’s the syntax for uninstalling packages with each
tool:
1. Using pip: pip uninstall package_name
2. Using conda: conda uninstall package_name
Notes
To import multiple modules and provide aliases, you should use separate import
statements for each module and alias.
Tips
If you are unable to download or install a package using pip or conda commands,
you can visit the official website of the package. From there, you can download the
package and follow the installation steps outlined in the official documentation.
Notes
To check the version of a package using its built-in attributes and methods, you can
typically access the __version__ attribute of the package.
Notes
In[5] pd.__version__
Q&A
The most basic and generic way to check help information is the built-in help()
function.
Notes
In[1] help(len)
len(obj, /)
Return the number of items in a container.
3.4.2 DocString
In[2] len?
The syntax to check help information using the question mark character (?) is a
functionality provided by IPython, an enhanced interactive Python shell, and is not a
syntax inherent to the Python language.
Notes
In[3] myList1=[1,2,3,4]
myList1.append?
Tips
testDocString?
IPython (or an IPython-based system such as Jupyter Notebook/Lab) displays the help
information as follows:
Notes
In[5] ?testDocString
IPython (or an IPython-based system such as Jupyter Notebook/Lab) displays the same
help information as In[4].
Tips
In[6] testDocString??
Tips
Prerequisite: The target object must be written in Python, as the source code cannot
be checked if it is not. In such cases, the functionality of ?? becomes the same as ?.
Notes
To check the help information for the built-in len() function, we can use len?
Notes
In[7] len?
IPython (or an IPython-based system such as Jupyter Notebook/Lab) displays the help
information as follows:
Notes
In[8] len??
In[9] testDocString.__doc__
Out[9] 'This is docString,\nYou can use "?" to view the help information'
In[10] len.__doc__
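Outside IPython, where the ? syntax is not available, the same help text can be reached through __doc__ or the standard inspect module. The sketch below defines a hypothetical function test_doc_string() whose docstring mirrors the one shown in Out[9].

```python
import inspect


def test_doc_string():
    """This is docString,
    You can use "?" to view the help information"""
    return 1


raw_doc = test_doc_string.__doc__             # the docstring as stored
clean_doc = inspect.getdoc(test_doc_string)   # indentation normalized
print(clean_doc)

builtin_doc = len.__doc__                     # built-ins carry docstrings too
print(builtin_doc)
```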
The dir() function is used to retrieve a list of all attributes and methods available in
the specified object.
Notes
In[11] dir(print)
Out[11] ['__call__',
'__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__le__',
'__lt__',
'__module__',
'__name__',
'__ne__',
'__new__',
'__qualname__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__self__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__text_signature__']
In[12] dir(len)
Out[12] ['__call__',
'__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__name__',
'__ne__',
'__new__',
'__qualname__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__self__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__text_signature__']
In[13] dir?
Python follows a principle called ‘duck typing’ in its programming style. The term
‘duck typing’ refers to a concept that focuses on an object’s behavior rather than its
specific type or class. In other words, if an object walks like a duck (supports certain
attributes and methods) and quacks like a duck (exhibits expected behavior), then it is
considered a ‘duck’, regardless of its actual class or type.
Tips
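Duck typing can be illustrated with a minimal sketch. The classes Duck, Robot, and Playlist are hypothetical; none of them share a parent class, yet each works wherever the expected method is available.

```python
class Duck:
    def quack(self):
        return 'Quack'


class Robot:
    def quack(self):
        return 'Beep-quack'


def make_it_quack(thing):
    # No type check: any object with a quack() method is accepted.
    return thing.quack()


sounds = [make_it_quack(obj) for obj in (Duck(), Robot())]
print(sounds)                      # ['Quack', 'Beep-quack']


class Playlist:
    def __init__(self, songs):
        self.songs = songs

    def __len__(self):             # enough for the built-in len() to work
        return len(self.songs)


size = len(Playlist(['a', 'b']))
print(size)                        # 2
```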
Q&A
3.5.1 Try/Except/Finally
In Python, errors that occur at runtime (after passing the syntax test) are called
exceptions.
Notes
In Python’s try/except/finally statements, a colon (:) is placed after try, except, and
finally to signify the start of a code block associated with each statement.
Notes
In[1] try:
    f=open('myfile.txt','w')
    while True:
        s=input("please enter Q")
        if s.upper()=='Q':break
        f.write(s+'\n')
except KeyboardInterrupt:
    print("program interruption")
finally:
    f.close()
The finally section refers to the code that will be executed regardless of whether
an exception occurs or not. The finally block is useful for releasing resources and
performing cleanup operations.
Tips
Unlike the C and Java languages, Python allows an else clause in exception handling
constructs. The else block is executed only when no exception occurs in the try block.
Notes
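The order in which the try, except, else, and finally blocks run can be traced with a minimal sketch. The function safe_divide() and the list steps are hypothetical helpers used only to record which blocks were executed.

```python
def safe_divide(a, b):
    steps = []                        # records which blocks were executed
    try:
        result = a / b
    except ZeroDivisionError:
        steps.append('except')        # runs only when the exception occurs
        result = None
    else:
        steps.append('else')          # runs only when no exception occurs
    finally:
        steps.append('finally')       # runs in both cases
    return result, steps


ok = safe_divide(4, 2)
failed = safe_divide(4, 0)
print(ok)        # (2.0, ['else', 'finally'])
print(failed)    # (None, ['except', 'finally'])
```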
Python defines a wide range of exception classes (Exceptions) and error classes
(Errors). More detailed information about these classes can be obtained from the
Python official website’s tutorial on errors and exceptions:
https://docs.python.org/3/tutorial/errors.html.
Tips
3.5.3 Assertion
In data science projects, assertions are mainly used to set checkpoints and to test
whether certain assumptions remain true.
Notes
In[5] a=1
b=2
assert b!=0 , "The denominator can't equal 0"
In[6] a=1
b=0
assert b!=0 , "The denominator can't equal 0"
Out[6] ---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_4976\796219993.py in <module>
1 a=1
2 b=0
----> 3 assert b!=0 , "The denominator can't equal 0"
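An assertion used as a checkpoint inside a function might look like the sketch below. The function ratio() is a hypothetical example. Note that assert statements are skipped when Python runs with the -O option, so they are best suited to checks during development rather than to validating user input.

```python
def ratio(a, b):
    # Checkpoint: stop early with a clear message if the assumption fails.
    assert b != 0, "The denominator can't equal 0"
    return a / b


value = ratio(1, 2)
print(value)                     # 0.5

try:
    ratio(1, 0)
except AssertionError as exc:
    message = str(exc)           # the optional text after the comma
    print(message)
```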
3.6 Debugging
Q&A
In[1] x=1
x1
Tips
Out[1] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_19180\372800001.py in <module>
1 x=1
----> 2 x1
To open the Python Debugger (PDB), you can type the magic command %debug in
IPython or Jupyter Notebook.
Notes
In[2] %debug
To exit the Python Debugger (PDB), you can press ‘q’ or type ‘quit’ while in the
debugger mode.
Notes
Tips
ipdb> x
1
ipdb> x1
*** NameError: name 'x1' is not defined
ipdb> x
1
ipdb> q
In[5] %debug
Out[5] > c:\users\szz\appdata\local\temp\ipykernel_19180\3286916471.py(3)<module>()
ipdb> y
1
ipdb> Y
*** NameError: name 'Y' is not defined
ipdb> y
1
ipdb> quit
In data science, assertions (or assert statements) can be used as checkpoints to validate
assumptions and ensure data integrity.
Notes
In[6] a=1
b=0
assert b!=0,"The denominator can't equal 0"
Tips
When coding an assert statement in Python, don’t forget to include a comma (,) to
separate the expression being evaluated from an optional error message.
Notes
Out[6] ---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_19180\3245325190.py in <module>
1 a=1
2 b=0
----> 3 assert b!=0,"The denominator can't equal 0"
global b = 0
In[7] %debug
Out[7] > c:\users\szz\appdata\local\temp\ipykernel_19180\3245325190.py(3)<module>()
ipdb> a
ipdb> b
ipdb> a
ipdb> quit
Q&A
To see all variables that exist in the search path of the Python interpreter, you can use
the built-in dir() function or the magic commands %whos and %who.
Notes
In[1] dir()
Out[1] ['In',
'Out',
'_',
'__',
'___',
'__builtin__',
'__builtins__',
'__doc__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'_dh',
'_i',
'_i1',
'_ih',
'_ii',
'_iii',
'_oh',
'exit',
'get_ipython',
'quit']
To add a variable to the search path, you can define a new variable using an assignment
statement. For example:
Tips
In[2] vi=1
To display the search path in Python and check whether the newly defined variable “vi”
is present, you can use the dir() function or the %whos magic command in IPython.
Tips
In[3] dir()
Out[3] ['In',
'Out',
'_',
'_1',
'_2',
'_3',
'__',
'___',
'__builtin__',
'__builtins__',
'__doc__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'_dh',
'_i',
'_i1',
'_i2',
'_i3',
'_i4',
'_i5',
'_ih',
'_ii',
'_iii',
'_oh',
'exit',
'get_ipython',
'quit',
'vi']
To remove a variable from the search path, use the del statement followed by the
variable name.
Notes
In[4] del vi
Tips
In[5] vi
The reason for this error is that the variable ‘vi’ was deleted in In[4] and is
therefore no longer defined.
Tips
Out[5] ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-c5bfa1c921c4> in <module>()
----> 1 vi
To check the module search path in Python, you can use the path attribute provided by
the sys module and the python -m site command in the Anaconda prompt.
Notes
Out[6] ['',
'C:\\Anaconda\\python36.zip',
'C:\\Anaconda\\DLLs',
'C:\\Anaconda\\lib',
'C:\\Anaconda',
'C:\\Anaconda\\lib\\site-packages',
'C:\\Anaconda\\lib\\site-packages\\win32',
'C:\\Anaconda\\lib\\site-packages\\win32\\lib',
'C:\\Anaconda\\lib\\site-packages\\Pythonwin',
'C:\\Anaconda\\lib\\site-packages\\IPython\\extensions',
'C:\\Users\\soloman\\.ipython']
To add a new path to the module search path in Python, you can use the
sys.path.append() method.
Notes
In[8] sys.path
To display the module search path in Python and check whether the newly added path
from In[7] has appeared, you can use the sys.path attribute. Here’s an example of how
you can do it:
Tips
Out[8] ['C:\\Users\\szz',
'D:\\Anacoda\\python37.zip',
'D:\\Anacoda\\DLLs',
'D:\\Anacoda\\lib',
'D:\\Anacoda',
'',
'C:\\Users\\szz\\AppData\\Roaming\\Python\\Python37\\site-packages',
'D:\\Anacoda\\lib\\site-packages',
'D:\\Anacoda\\lib\\site-packages\\pyquery-1.4.3-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\cssselect-1.1.0-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\pip-21.1.1-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\win32',
'D:\\Anacoda\\lib\\site-packages\\win32\\lib',
'D:\\Anacoda\\lib\\site-packages\\Pythonwin',
'D:\\Anacoda\\lib\\site-packages\\IPython\\extensions',
'C:\\Users\\szz\\.ipython',
'H:\\Python\\Anaconda']
To remove a path from the module search path, you can use the sys.path.remove()
method.
Notes
In[9] sys.path.remove('H:\\Python\\Anaconda')
In[10] sys.path
We can display the module search path again and check whether the path that was
removed in In[9] is no longer displayed on the module search path.
Tips
Out[10] ['C:\\Users\\szz',
'D:\\Anacoda\\python37.zip',
'D:\\Anacoda\\DLLs',
'D:\\Anacoda\\lib',
'D:\\Anacoda',
'',
'C:\\Users\\szz\\AppData\\Roaming\\Python\\Python37\\site-packages',
'D:\\Anacoda\\lib\\site-packages',
'D:\\Anacoda\\lib\\site-packages\\pyquery-1.4.3-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\cssselect-1.1.0-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\pip-21.1.1-py3.7.egg',
'D:\\Anacoda\\lib\\site-packages\\win32',
'D:\\Anacoda\\lib\\site-packages\\win32\\lib',
'D:\\Anacoda\\lib\\site-packages\\Pythonwin',
'D:\\Anacoda\\lib\\site-packages\\IPython\\extensions',
'C:\\Users\\szz\\.ipython']
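The whole add, check, and remove cycle can be condensed into one sketch. The directory name is taken from In[9] and is treated here as a hypothetical path; sys.path accepts entries that do not exist on disk, so the sketch runs on any machine.

```python
import sys

extra_path = 'H:\\Python\\Anaconda'   # hypothetical directory, as in In[9]

sys.path.append(extra_path)           # add to the end of the module search path
added = extra_path in sys.path
print(added)                          # True

sys.path.remove(extra_path)           # remove the first matching entry
removed = extra_path not in sys.path
print(removed)                        # True
```

Changes made this way last only for the current interpreter session.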
Q&A
To obtain the current working directory in Python, you can use the getcwd() function
provided by the os module.
Notes
In[1] import os
print(os.getcwd())
Out[1] C:\Users\szz
To change the current working directory in Python, you can use the chdir() function
from the os module.
Notes
Before changing the current working directory, you need to create a new working
directory to replace the original current working directory. For example, you can
create a new directory named ‘PythonProjects’ on the E: drive.
Tips
In[3] os.chdir(r'E:\PythonProjects')  #a raw string keeps the backslash literal
print(os.getcwd())
Out[3] E:\PythonProjects
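A portable variant of the same steps is sketched below. Instead of a fixed drive letter, it uses a temporary directory from the standard tempfile module, which is an assumption made so that the example runs on any operating system.

```python
import os
import tempfile

original_dir = os.getcwd()                     # remember where we started

with tempfile.TemporaryDirectory() as new_dir:
    os.chdir(new_dir)                          # switch the working directory
    changed = os.path.samefile(os.getcwd(), new_dir)
    print(changed)                             # True
    os.chdir(original_dir)                     # switch back before cleanup

restored = os.getcwd() == original_dir
print(restored)                                # True
```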
For example, read the file “bc_data.csv” from the current working directory into the
data dataframe.
Notes
Here, executing the statement read_csv('bc_data.csv') assumes that the target file
‘bc_data.csv’ has been placed in the current working directory, such as
“E:\PythonProjects”.
Notes
Readers can find the data file ‘bc_data.csv’ in the supporting resources provided with
this book.
Tips
Q&A
3.9.1 Classes
In Python, class definitions begin with the keyword “class,” followed by the name of
the class and a colon.
Notes
    def say_hi(self):
        print(self.name)
p1 = Person('Tom', 30)
p1.say_hi()
The method of defining attribute and method visibility in Python differs from that of
languages like C++, C#, and Java. In Python, the convention is to use one underscore
or two underscores at the beginning of the name to indicate different levels of visibility,
distinguishing between protected and private attributes and methods.
Notes
Person('Tom', 30) does two things: it creates a new object and initializes it. The
resulting instance is then bound to the name p1.
Tips
Out[1] Tom
3.9.2 Methods
The difference between a method and a function lies in their relationship to object-
oriented programming. In object-oriented programming, a method is a function that is
associated with a class or object.
Notes
In Python, there are three types of the user-defined methods: instance methods, class
methods and static methods.
Notes
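The three kinds of methods can be placed side by side in a small sketch. The class body below is an assumption modeled on the fragments shown in this section (nationality, say_hi(), class_func()); static_func() is a hypothetical addition.

```python
class Person:
    nationality = 'China'                 # class attribute

    def __init__(self, name, age):
        self.name = name
        self.age = age

    def say_hi(self):                     # instance method: receives the instance
        return self.name

    @classmethod
    def class_func(cls):                  # class method: receives the class
        return 'I live in ' + cls.nationality.upper()

    @staticmethod
    def static_func(x, y):                # static method: receives neither
        return x + y


p1 = Person('Tom', 20)
print(p1.say_hi())                        # Tom
print(Person.class_func())                # I live in CHINA
print(Person.static_func(1, 2))           # 3
```

Class methods and static methods can be called on the class itself or on an instance; instance methods need an instance.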
"""
nationality = 'China'
_deposit=10e10
__gender="M"
p1 = Person('Tom', 20)
p1.say_hi()
Out[2] Tom
Tips
3.9.3 Inheritance
In Python, the syntax for specifying inheritance between classes is unique. When
defining a class, you can indicate its parent class or classes by putting the parent class
name(s) in parentheses after the class name.
Notes
t1=Teacher("zhang",20)
In[8] Person.class_func()
Out[8] I live in CHINA
In[9] t1.class_func()
As you can see from the above output, the class Teacher has inherited its parent class
Person’s class_func () method.
Tips
A subclass can inherit the protected attribute from its parent class, such as the _deposit
attribute in In[12].
Tips
Out[12] 100000000000.0
In[13] t1.__gender #AttributeError: 'Teacher' object has no attribute '__gender'
A subclass cannot access the private attribute of its parent class by its original
name, such as the __gender attribute in In[2].
Tips
Out[13] ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18396\514368724.py in <module>
----> 1 t1.__gender
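The behavior of protected and private attributes under inheritance can be examined with the sketch below. The class bodies are assumptions modeled on the fragments in this section. The sketch also shows the mechanism behind the AttributeError: Python stores a double-underscore attribute under a mangled name that includes the defining class.

```python
class Person:
    _deposit = 10e10          # "protected" by convention only
    __gender = 'M'            # stored as _Person__gender (name mangling)

    def __init__(self, name, age):
        self.name = name
        self.age = age


class Teacher(Person):        # parent class goes in the parentheses
    pass


t1 = Teacher('zhang', 20)
deposit = t1._deposit                        # inherited and accessible
has_plain_name = hasattr(t1, '__gender')     # the original name is not found
mangled = t1._Person__gender                 # the mangled name still exists
bases = Teacher.__bases__

print(deposit, has_plain_name, mangled, bases)
```

Accessing the mangled name from outside the class works technically, but it goes against the intent of marking the attribute as private.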
Use the following operations to check the docString of the Teacher class and its parent
class Person.
Notes
In[14] Person?
Teacher?
To check the name of a class in Python, you can use the __name__ attribute.
Notes
In[15] Person.__name__
Out[15] 'Person'
In Python, system-defined names are also known as “dunder” names. These names
are defined by the interpreter and its implementation (including the standard library).
Commonly used dunder attributes are as follows:
__name__: Return the class name
__doc__: Return the docString of the class
__bases__: Return the tuple of all parent classes of the class
__dict__: Return a dictionary of the attributes and methods defined in the class
__module__: Return the name of the module where the class definition is located
__class__: Return the class corresponding to the instance
Notes
3.9.4 Attributes
Unlike Java and C++, Python does not use the private keyword to define private
variables. Instead, the convention in Python is to use “double underscores” at the
beginning of variable names to indicate privacy, although it does not enforce true
encapsulation.
Notes
A function decorated with the @property decorator in Python is accessed as an
attribute, without (). If you attempt to call a property-decorated function with
parentheses, Python tries to call the value that the property returns, which typically
raises a TypeError such as “'str' object is not callable.”
Notes
Out[16] Zhang
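A minimal sketch of a read-only property follows. The class Student and the attribute _name are hypothetical; the original class definition is not shown here.

```python
class Student:
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        # Accessed as s.name, without parentheses.
        return self._name


s = Student('Zhang')
value = s.name
print(value)                   # Zhang

try:
    s.name()                   # calls the returned string, not the method
except TypeError as exc:
    error_text = str(exc)
    print(error_text)
```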
In Python, self and cls are passed to methods as the first argument. They are
references to an instance and a class, respectively.
Always use self for the first argument to instance methods.
Always use cls for the first argument to class methods.
Notes
For instance, when defining a class, self stands for “a reference to an instance”, as is
often used in __init__(); cls stands for “a reference to a class”, as is often used in __new__().
It is important to note that if the __new__() method of your class does not return an
instance of the class (cls), the __init__() method will not be called. This means that
the initialization step will be skipped, and the object will not be properly initialized.
Notes
    def __new__(cls,name,age):
        print('__new__() is called')
    def __init__(self,name,age):
        print('__init__() is called')
        self.name = name
        self.age = age
    def sayHi(self):
        print(self.name,self.age)
In Python, the __new__() method is called when an object is created, and it is responsible
for creating and returning a new instance of the class. The __init__() method, on the
other hand, is called after the __new__() method and is used to initialize the newly
created object.
Tips
Out[19] None
Out[19] None
In[20] s1.sayHi()
Here, the Python interpreter will raise an AttributeError that ‘NoneType’ object has
no attribute ‘sayHi’
Tips
Out[20] ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18396\2394443593.py in <module>
----> 1 s1.sayHi()
The AttributeError is raised because there is no return statement in the __new__() method, so it returns None.
To fix this, add return object.__new__(cls) to the __new__() method.
Tips
    def __init__(self,name,age):
        print('__init__() is called')
        self.name = name
        self.age = age
    def sayHi(self):
        print(self.name,self.age)
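The interplay between __new__() and __init__() can be sketched end to end as follows. The class names Student and Broken and the sample arguments are assumptions chosen for illustration.

```python
class Student:
    def __new__(cls, name, age):
        print('__new__() is called')
        return object.__new__(cls)      # returning the instance lets __init__() run

    def __init__(self, name, age):
        print('__init__() is called')
        self.name = name
        self.age = age

    def sayHi(self):
        print(self.name, self.age)


class Broken:
    def __new__(cls, name, age):
        print('__new__() is called')    # no return statement, so None is returned

    def __init__(self, name, age):
        print('__init__() is called')   # never reached


s1 = Student("Zhang", 20)
s1.sayHi()

b1 = Broken("Li", 21)
print(b1)                               # None, so b1.sayHi() would raise AttributeError
```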
Exercises
[1] Which of the following statements about the iterable object and the iterator object in Python is wrong?
A. Functions that can receive the iterator objects can receive the iterable objects.
B. The iterator object can be called by the next function, constantly returning the object of the next
value.
C. The iter function can convert the iterable object into the iterator object.
D. The iterable objects are not necessarily the iterator objects.
[5] Which of the following statements about the relationship between modules and packages is false?
A. Packages can be used to group a set of modules under a common package name.
B. A package can only correspond to one module.
C. Each package directory will have an __init__.py, and __init__.py itself is a module.
A. 2.0
B. AssertionError: n must be positive
C. AssertionError: n must be positive
2.0
[16] Which of the following is false of Python Pdb?
A. The prompt of the debugger is (Pdb).
B. The debugger is not extensible.
C. The pdb module defines an interactive source code debugger for Python programs.
D. pdb supports post-mortem debugging and can be called under program control.
A. NameError
B. SyntaxError
C. ValueError
D. AssertionError
[20] Which of the following statements about the current working directory is false?
A. The getcwd function in the os module is used to check the current working directory.
B. Current working path can be modified with os.chdir(path).
C. The current working directory is the search path, which refers to the default read-write path of files
and folders in Python.
D. The open function is a common file import function.
4. Data wrangling with Python
Data wrangling is the process of transforming and mapping data from one raw data form into another format with
the intent of making it more appropriate and valuable for data science purposes. This chapter will introduce the
essential data wrangling skills for data scientists, including:
These skills are crucial for data scientists to effectively manipulate, analyze, and visualize data in their projects.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 195
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2_4
Q&A
There are two common methods for generating random numbers from a computer:
Pseudo-Random Number Generators (PRNGs) and True Random Number Generators
(TRNGs).
PRNGs rely on a seed number and an algorithm to generate numbers that appear to
be random but are actually predictable. They are widely used in computer programs
and simulations.
In contrast, TRNGs generate randomness from physical phenomena using hardware
and integrate it into a computer. These generators provide a higher level of true
randomness compared to PRNGs.
Tips
The Python standard library provides a module called random that offers a suite of
functions for generating pseudo random numbers.
Notes
The random package is not the only package in Python that generates random numbers.
Other packages such as NumPy, SciPy, and scikit-learn also provide functions for
generating random numbers.
Tips
The random module contains several functions that allow you to generate random
numbers. For instance, the randint(a,b) function generates random integers from a
(inclusive) to b (inclusive).
Notes
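The role of the seed and the bounds of random.randint() can be checked with a short experiment. This is a minimal sketch; the seed value and the sample size of 1000 are arbitrary choices for illustration.

```python
import random

random.seed(3)                                        # fixing the seed makes the sequence repeatable
first = [random.randint(1, 3) for _ in range(1000)]

random.seed(3)                                        # same seed again
second = [random.randint(1, 3) for _ in range(1000)]

print(first == second)                                # the two sequences are identical
print(min(first), max(first))                         # randint(1, 3) includes both 1 and 3

half_open = [random.randrange(1, 3) for _ in range(1000)]
print(min(half_open), max(half_open))                 # randrange(1, 3) excludes 3
```

If a half-open interval is needed, random.randrange(a, b) follows the same “including the start but excluding the stop” rule as range().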
The “2” in “round(random.uniform(-10, 10),2)” means that the round function will
return random.uniform(-10, 10) rounded to 2 decimal places.
Notes
In[4] random.seed(3)
round(random.uniform(-10, 10),2)
Out[4] -5.24
import numpy as np
rand=np.random.RandomState(32)
x=rand.randint(0,10,(3,6))
x
Out[7] array([[7, 5, 6, 8, 3, 7],
[9, 3, 5, 9, 4, 1],
[3, 1, 2, 3, 8, 2]])
Q&A
import numpy as np
An ndarray can be created in several ways, one of which is using the np.arange() function.
For instance, numpy.arange(a,b) returns evenly spaced values within the half-open
interval [a, b), that is, the interval including a but excluding b.
Notes
numpy.arange(): returns evenly spaced values within the half-open interval [start, stop), that is,
the interval including start but excluding stop.
Tips
range(1,10,2)
Out[3] range(1, 10, 2)
In Python 2, the built-in function range() creates a list, so it is
eagerly evaluated. In Python 3, it creates a range object, whose individual values are
lazily evaluated. In other words, in Python 2, range() returns a list, which is equivalent
to list(range()) in Python 3.
Notes
list(range(1,10,2))
Out[4] [1, 3, 5, 7, 9]
In[5] np.arange(1,10,2)
Out[5] array([1, 3, 5, 7, 9])
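The three forms compared above can be placed side by side. This is a minimal sketch; the variable names and the non-integer step in the last line are illustrative additions.

```python
import numpy as np

r = range(1, 10, 2)              # lazy range object (Python 3)
as_list = list(r)                # values materialized as a list
as_array = np.arange(1, 10, 2)   # values materialized as an ndarray

print(r)
print(as_list)
print(as_array)

# Unlike range(), np.arange() also accepts a non-integer step
quarters = np.arange(0, 1, 0.25)
print(quarters)
```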
The second way to create an ndarray is by calling the np.array() function from the
NumPy module.
Notes
In[6] MyArray2=np.array([1,2,3,4,3,5])
MyArray2
Out[6] array([1, 2, 3, 4, 3, 5])
In[7] np.array(range(1,10,2))
Out[7] array([1, 3, 5, 7, 9])
The third way to create an ndarray is by calling functions like np.zeros(), np.ones(),
and others provided by NumPy. These functions allow you to create arrays filled with
zeros, ones, or specific values.
Notes
In[8] MyArray3=np.zeros((5,5))
MyArray3
Out[8] array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
In np.zeros((5,5)), the argument (5,5) represents the shape of the target array, which is
an array of 5 rows and 5 columns.
Notes
In[9] MyArray4=np.ones((5,5))
MyArray4
Out[9] array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
In[10] # To create a new array with 3 rows and 5 columns, filled with 2
np.full((3,5),2)
Out[10] array([[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2]])
To generate random arrays using the np.random module, please refer to section [4.1.2 Generating
a random array at a time] for detailed instructions and examples.
Notes
In[11] rand=np.random.RandomState(30)
MyArray5=rand.randint(0,100,[3,5])
MyArray5
Out[11] array([[37, 37, 45, 45, 12],
[23, 2, 53, 17, 46],
[ 3, 41, 7, 65, 49]])
Here, 0 and 100 are the lower bound (inclusive) and the upper bound (exclusive) of the random
values, and [3,5] represents the shape of the target array with 3 rows and 5 columns.
Notes
The shape argument represents the shape of the array, and its value can be a tuple,
such as (3,5).
Notes
In[13] np.ones((3,5),dtype=float)
Out[13] array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
The value of the shape argument can also be specified as a list, such as [3, 5].
Notes
In[14] np.ones([3,5],dtype=float)
Out[14] array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
Both lists in Python and ndarrays in NumPy allow slicing and indexing, and they
share a similar syntax.
Notes
import numpy as np
myArray=np.array(range(1,10))
myArray
Out[15] array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17] myArray[0]
Out[17] 1
(2) Python supports negative indexes. Please refer to [2.10 Lists] for details.
Notes
In[18] myArray[-1]
Out[18] 9
In[19] # to create and show the current value of the variable myArray
import numpy as np
myArray=np.array(range(0,10))
print("myArray=",myArray)
Out[19] myArray= [0 1 2 3 4 5 6 7 8 9]
Slicing means taking elements from one given index to another given index.
Tips
1, 9 and 2 are the start, stop and step of the slice, respectively.
Notes
In[20] print("myArray[1:9:2]=",myArray[1:9:2])
Out[20] myArray[1:9:2]= [1 3 5 7]
In[21] print("myArray[:9:2]=",myArray[:9:2])
Out[21] myArray[:9:2]= [0 2 4 6 8]
In[22] print("myArray[::2]=",myArray[::2])
Out[22] myArray[::2]= [0 2 4 6 8]
In[23] print("myArray[::]=",myArray[::])
Out[23] myArray[::]= [0 1 2 3 4 5 6 7 8 9]
In[24] print("myArray[:8:]=",myArray[:8:])
Out[24] myArray[:8:]= [0 1 2 3 4 5 6 7]
In[25] print("myArray[:8]=",myArray[:8])
Out[25] myArray[:8]= [0 1 2 3 4 5 6 7]
In[26] print("myArray[4::]=",myArray[4::])
Out[26] myArray[4::]= [4 5 6 7 8 9]
In[27] print("myArray[9:1:-2]=",myArray[9:1:-2])
print("myArray[::-2]=",myArray[::-2])
Out[27] myArray[9:1:-2]= [9 7 5 3]
myArray[::-2]= [9 7 5 3 1]
Fancy indexing refers to the practice of using an array of indices to access multiple
elements of an array simultaneously.
Tips
Fancy indexing is a very flexible way of selecting elements: it picks elements at
irregular positions without an explicit loop. The notation of fancy indexing is the nesting of
[], that is, another [] appears in the []. For example, myArray[[2,5,6]] means to locate
the three elements with indexes 2, 5 and 6.
Notes
In[28] print("myArray[[2,5,6]]=",myArray[[2,5,6]])
Out[28] myArray[[2,5,6]]= [2 5 6]
In[29] print("myArray[myArray>5]=",myArray[myArray>5])
Out[29] myArray[myArray>5]= [6 7 8 9]
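The two cells above use an index list and a boolean mask, respectively. The following minimal sketch repeats both and additionally shows that the result is a copy rather than a view; the variable names picked, mask, and filtered are illustrative.

```python
import numpy as np

myArray = np.arange(10)

picked = myArray[[2, 5, 6]]      # fancy indexing with a list of positions
mask = myArray > 5               # boolean array with the same shape as myArray
filtered = myArray[mask]         # keeps the elements where mask is True

picked[0] = 99                   # the result is a copy, so myArray is not affected
print(picked)
print(filtered)
print(myArray)
```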
In slicing, the start index is inclusive (e.g., “0” in this code), but the stop index is
exclusive (e.g., “2” in this code). This is because the slicing rule in Python is “including
the start but excluding the stop”.
Notes
In[30] myArray[0:2]
Out[30] array([0, 1])
When slicing an ndarray, NumPy returns a view of the elements in the original array, not
a copy. This means that any modifications made through the slice will affect
the original array as well.
Notes
In[31] myArray
Out[31] array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
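The view semantics of slicing can be verified as follows. This is a minimal sketch; np.shares_memory() is used here only to make the sharing visible.

```python
import numpy as np

myArray = np.arange(10)

view = myArray[0:2]                      # a slice is a view of the same data
view[0] = 100                            # writing through the view changes myArray
print(myArray)
print(np.shares_memory(view, myArray))   # True

independent = myArray[0:2].copy()        # an explicit copy is independent
independent[1] = -1
print(myArray[1])                        # still 1
```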
(4) To access non-consecutive elements of an array, you can use slicing. Please refer
to [2.10 Lists] for more details.
Notes
In[32] myArray=np.array(range(1,11))
myArray
Out[32] array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
When the indexes are irregular, an error will be raised if fancy indexing is not used.
Fancy indexing refers to passing an array of indices to access multiple array elements
at once. For more details, please refer to the section on Fancy Indexing in [4.2
Multidimensional arrays].
Notes
In[33] myArray[1,3,6]
IndexError Traceback (most recent call last)
<ipython-input-30-13b1cd8a6af6> in <module>()
1
----> 2 myArray[1,3,6]
3
In[34] myArray[[1,3,6]]
Out[34] array([2, 4, 7])
In[35] myArray
Out[35] array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In NumPy, the np.newaxis is an alias for the Python constant None, hence, wherever
we use np.newaxis we could also use None:
Tricks
In[36] myArray[:,np.newaxis]
Out[36] array([[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10]])
Here, the np.newaxis is generally used with slicing. It indicates that you want to add an
additional dimension to the array. In addition, the colon (:) cannot be omitted here.
Tricks
Tips
To get the shape of a given ndarray, you can use the .shape attribute. For example, if
arr is an ndarray, you can access its shape using arr.shape.
Notes
In[37] myArray[:,np.newaxis].shape
Out[37] (10, 1)
To convert the shape of an ndarray in NumPy, you can use the numpy.reshape()
function. It allows you to reshape an array into a specified shape while keeping the
same elements.
Notes
In[38] myArray2=np.arange(1,21).reshape([5,4])
myArray2
Out[38] array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
In[39] myArray2[[2,4],3]
Out[39] array([12, 20])
Tips
In[40] x=[2,4]
myArray2[x,3]
Out[40] array([12, 20])
Here, the value of myArray1 has changed because myArray1 and myArray2 share the same
memory address.
Tricks
Deep copy: A deep copy creates a new array object with its own separate copy of the
original array’s data. Any modifications made to the data in one array will not affect
the other. You can create a deep copy using the numpy.copy() function.
Notes
The myArray1 here has not changed. The reason is that ‘myArray2 = myArray1.copy()’
creates a deep copy, resulting in myArray1 and myArray2 being mutually independent.
Tricks
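The difference between plain assignment and copy() can be sketched as follows. The sample values and the name alias are assumptions for illustration.

```python
import numpy as np

myArray1 = np.array([1, 2, 3])

alias = myArray1                 # plain assignment: both names refer to the same array
alias[0] = 100
print(myArray1)                  # the change is visible through myArray1

myArray2 = myArray1.copy()       # copy() allocates separate data
myArray2[1] = 200
print(myArray1)                  # unchanged by the edit to myArray2
print(myArray2)
```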
Reshaping means returning a transformed array with the new shape specified in the
NumPy method reshape().
Notes
In numpy, to check the shape of an array, you can use the attribute ndarray.shape.
Notes
In[44] MyArray5.shape
Out[44] (20,)
In[45] MyArray6=MyArray5.reshape(4,5)
MyArray6
Out[45] array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]])
In NumPy, reshape() does not modify an array in place. Instead, it
returns a reshaped array (a view of the same data whenever possible) while leaving the
shape of the original array unchanged.
Notes
In[46] MyArray5.shape
MyArray5
Out[46] (20,)
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
The numpy.reshape() function returns a new array with a modified shape but preserves the
original data of the array. It does not modify the array in place.
Tips
In[47] MyArray5.reshape(5,4)
Out[47] array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
In[48] MyArray5.reshape(5,5)
ValueError Traceback (most recent call last)
<ipython-input-46-8920a583f59a> in <module>()
----> 1 MyArray5.reshape(5,5)
2
Here, a ValueError is raised because an array of size 20 cannot be reshaped into
shape (5,5).
Tips
In[49] MyArray5
Out[49] array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
The resize() method in NumPy can be used to change the shape of an array in-place.
Unlike reshape(), resize() modifies the array itself rather than returning a new array.
Notes
In[50] MyArray5.resize(4,5)
MyArray5
Out[50] array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]])
The main difference between resize() and reshape() in NumPy is that resize() performs
in-place modification, meaning it modifies the array itself, while reshape() returns a new
array with the specified shape without modifying the original array.
Tips
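The contrast between reshape() and resize() can be sketched as follows. The array names a, b, c, and d are illustrative; a separate array d is used for resize() so that no other array refers to its data.

```python
import numpy as np

a = np.arange(1, 21)

b = a.reshape(4, 5)              # returns a reshaped array; a keeps its shape
print(a.shape, b.shape)

c = a.reshape(5, -1)             # -1 lets NumPy infer the remaining dimension
print(c.shape)

try:
    a.reshape(5, 5)              # 20 elements cannot fill a 5 x 5 array
except ValueError as err:
    reshape_error = str(err)
    print("ValueError:", reshape_error)

d = np.arange(1, 21)
d.resize(4, 5)                   # modifies d in place and returns None
print(d.shape)
```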
(3) The swapaxes() method in NumPy allows you to interchange two axes of an array.
Notes
In[51] MyArray5.swapaxes(0,1)
Out[51] array([[ 1, 6, 11, 16],
[ 2, 7, 12, 17],
[ 3, 8, 13, 18],
[ 4, 9, 14, 19],
[ 5, 10, 15, 20]])
In[52] MyArray5
Out[52] array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]])
In data science, it is important to pay attention to whether the evaluation of a data object
changes the data itself or returns a copy of the new value.
Tricks
(4) Use the flatten() method to convert a multidimensional array into a one-dimensional
array.
Notes
In[53] MyArray5.flatten()
Out[53] array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
ndarray.flatten( ):
Return a copy of the array collapsed into one dimension.
Tips
(5) Use the tolist() method to convert the multidimensional array to a nested list.
Notes
In[54] MyArray5.tolist()
Out[54] [[1, 6, 11, 16],
[2, 7, 12, 17],
[3, 8, 13, 18],
[4, 9, 14, 19],
[5, 10, 15, 20]]
ndarray.tolist():
Return the array as an a.ndim-levels deep nested list of Python scalars.
Tips
In[55] MyArray5.astype(np.float64)
numpy.ndarray.astype():
Returns a copy of the array, cast to a specified type.
Tips
In[56] MyArray5
Out[56] array([[ 1, 6, 11, 16],
[ 2, 7, 12, 17],
[ 3, 8, 13, 18],
[ 4, 9, 14, 19],
[ 5, 10, 15, 20]])
In[57] np.rank(MyArray5)
C:\Anaconda\lib\site-packages\ipykernel_launcher.py:3: VisibleDeprecationWarning:
‘rank’ is deprecated; use the ‘ndim’ attribute or function instead. To find the rank of a
matrix see ‘numpy.linalg.matrix_rank’.
This is separate from the ipykernel package so we can avoid doing imports until
Out[57] 2
The system prompts “‘rank’ is deprecated”, indicating that this function has been deprecated
(it has since been removed from recent NumPy releases).
The system also prompts “use the ‘ndim’ attribute or function instead”.
Variations of this naming convention are common among Python third-party packages.
Tricks
In[58] np.ndim(MyArray5)
Out[58] 2
In[59] MyArray5.ndim
Out[59] 2
(2) To get the shape of the array: use the np.shape() function or the shape attribute.
Notes
In[60] np.shape(MyArray5)
Out[60] (5, 4)
In[61] MyArray5.shape
Out[61] (5, 4)
(3) In NumPy, the numpy.size function can be used to evaluate the number of elements
in an array. It returns the total number of elements present in the array, regardless
of its shape or dimensions.
Notes
In[62] MyArray5.size
Out[62] 20
In[63] type(MyArray5)
Out[63] numpy.ndarray
Here, the type() function is a built-in function in Python and not specific to NumPy.
Therefore, you do not need to prefix it with np when using it to determine the type of a
NumPy array.
Tips
In[64] MyArray5*10
Out[64] array([[ 10, 60, 110, 160],
[ 20, 70, 120, 170],
[ 30, 80, 130, 180],
[ 40, 90, 140, 190],
[ 50, 100, 150, 200]])
There are three commonly used ways to multiply NumPy ndarrays in data science:
numpy.dot(array a, array b): returns the dot product of two arrays.
numpy.multiply(array a, array b): returns the element-wise product
of two arrays.
numpy.matmul(array a, array b): returns the matrix product of two arrays.
Tips
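The three kinds of multiplication can be compared on two small matrices. This is a minimal sketch; the sample arrays a and b are assumptions for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = np.multiply(a, b)   # same as a * b
matrix = np.matmul(a, b)          # same as a @ b
dot2d = np.dot(a, b)              # for 2-D arrays, equal to the matrix product
inner = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))   # for 1-D arrays, the inner product

print(elementwise)
print(matrix)
print(inner)
```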
In[65] x=np.array([11,12,13,14,15,16,17,18])
x1,x2,x3=np.split(x,[3,5])
print(x1,x2,x3)
Out[65] [11 12 13] [14 15] [16 17 18]
The np.vsplit() function is used to perform a vertical split of an array. Here,
MyArray5.reshape(4, 5) is split into two parts at index 2 along the vertical axis. The resulting
splits are assigned to the variables upper and lower using unpacking assignment.
Notes
In[66] upper,lower=np.vsplit(MyArray5.reshape(4,5),[2])
print("The upper part is\n",upper)
print("\n\nThe lower part is\n",lower)
Out[66] The upper part is
[[ 1 6 11 16 2]
[ 7 12 17 3 8]]
In[67] np.concatenate((lower,upper),axis=0)
Out[67] array([[13, 18, 4, 9, 14],
[19, 5, 10, 15, 20],
[ 1, 6, 11, 16, 2],
[ 7, 12, 17, 3, 8]])
Here, axis = 0 is the axis along which the arrays will be joined. If axis is None,
the arrays are flattened before use. The default is 0.
Tricks
(4) np.vstack() and np.hstack() support vertical and horizontal merging (stacking),
respectively.
The premise of calling np.vstack(): the number of columns of the arrays is the same.
Notes
In[68] np.vstack([upper,lower])
Out[68] array([[ 1, 6, 11, 16, 2],
[ 7, 12, 17, 3, 8],
[13, 18, 4, 9, 14],
[19, 5, 10, 15, 20]])
The premise of calling np.hstack(): the number of rows of the arrays is the same.
Notes
In[69] np.hstack([upper,lower])
Out[69] array([[ 1, 6, 11, 16, 2, 13, 18, 4, 9, 14],
[ 7, 12, 17, 3, 8, 19, 5, 10, 15, 20]])
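Splitting and merging are inverse operations, which can be checked with a short round trip. This is a minimal sketch; the array grid is an assumed example and differs from MyArray5.

```python
import numpy as np

grid = np.arange(1, 21).reshape(4, 5)

upper, lower = np.vsplit(grid, [2])                 # rows 0-1 and rows 2-3

v = np.vstack([upper, lower])                       # stack vertically: shape (4, 5)
h = np.hstack([upper, lower])                       # stack horizontally: shape (2, 10)

c0 = np.concatenate((upper, lower), axis=0)         # equivalent to vstack here
c1 = np.concatenate((upper, lower), axis=1)         # equivalent to hstack here

print(v.shape, h.shape)
print(np.array_equal(v, grid))                      # the vertical stack restores grid
```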
In[70] np.add(MyArray5,1)
Out[70] array([[ 2, 7, 12, 17],
[ 3, 8, 13, 18],
[ 4, 9, 14, 19],
[ 5, 10, 15, 20],
[ 6, 11, 16, 21]])
Summing the elements of an array can be achieved using either
the built-in Python function sum() or the NumPy function numpy.sum().
Tricks
To delete a specific element in a NumPy array, you can use the np.delete() function.
Notes
numpy.delete(arr, obj, axis=None) is a function that returns a new array with sub-
arrays along a specified axis deleted. The arr parameter represents the input array, obj
specifies the indices or slice objects of the elements to be deleted, and axis (optional)
indicates the axis along which the deletion should occur.
Tips
In[72] np.insert(myArray1,1,88)
Out[72] array([11, 88, 12, 13, 14, 15, 16, 17, 18])
numpy.insert(arr, obj, values, axis=None) is a function that inserts values into an array
along a specified axis before the given indices.
Tips
In[73] np.isnan(myArray)
Out[73] array([False, False, False, False, False, False, False, False, False, False])
In[74] np.any(np.isnan(myArray))
Out[74] False
In[75] np.all(np.isnan(myArray))
Out[75] False
numpy.isnan(): to test element-wise for NaN and return the result as a boolean array.
numpy.all(): to test whether all array elements along a given axis evaluate to True.
numpy.any(): to test whether any array element along a given axis evaluates to True.
Tips
In many function evaluations, if missing values are encountered, an error may be raised
or a NaN (Not a Number) value may be obtained as the result. To handle missing
values in such cases, you can use NaN-safe functions provided by NumPy, such as
np.nansum().
Notes
In[76] MyArray=np.array([1,2,3,np.nan])
np.nansum(MyArray)
Out[76] 6.0
In[77] np.sum(MyArray)
Out[77] nan
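The difference between ordinary and NaN-safe functions can be summarized in one sketch. np.nanmean() is included here as a further example of a NaN-safe function; it does not appear in the cells above.

```python
import numpy as np

MyArray = np.array([1, 2, 3, np.nan])

has_missing = np.isnan(MyArray).any()   # True if at least one element is NaN
plain_sum = np.sum(MyArray)             # NaN propagates through the ordinary sum
safe_sum = np.nansum(MyArray)           # NaN values are treated as zero
safe_mean = np.nanmean(MyArray)         # NaN values are ignored

print(has_missing, plain_sum, safe_sum, safe_mean)
```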
In NumPy, broadcasting refers to the mechanism by which arrays with different shapes
are treated during arithmetic operations. When performing operations between arrays
of different shapes, NumPy automatically adjusts the shapes of the arrays to make
them compatible, following a set of rules or constraints.
Tips
Rule 1: If the arrays do not have the same number of dimensions, the shape of the array with
fewer dimensions is padded with ones on its left side. Then, in any dimension where one array
has size 1 and the other has a larger size, the array with size 1 behaves as if it were
replicated along that dimension. NumPy performs this stretching without actually copying the data.
Notes
A1 has three rows and three columns, while A2 is a one-dimensional array whose length equals the number of columns of A1.
Notes
In[79] A2=np.array([10,10,10])
A2
Out[79] array([10, 10, 10])
Before A1+A2 is executed, the operation of broadcasting is performed row by row. After
A1 and A2 are converted to the same structure, the evaluation will be executed.
Notes
In[80] A1+A2
Out[80] array([[11, 12, 13],
[14, 15, 16],
[17, 18, 19]])
Rule 2: If the shapes of the arrays being operated on are not compatible, meaning that in some
dimension the sizes differ and neither of them is 1, broadcasting fails and a ValueError
is raised.
Notes
In[81] A3=np.arange(10).reshape(2,5)
A3
Out[81] array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Notes
In[82] A4=np.arange(16).reshape(4,4)
A4
Out[82] array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
An error is raised: ValueError: operands could not be broadcast together with shapes
(2,5) (4,4)
Notes
In[83] A3+A4
ValueError Traceback (most recent call last)
<ipython-input-86-0fe8480883de> in <module>()
----> 1 A3+A4
2
ValueError: operands could not be broadcast together with shapes (2,5) (4,4)
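Both broadcasting outcomes can be reproduced in a few lines. This is a minimal sketch: the definition of A1 is an assumption chosen to be consistent with the output of A1+A2 shown earlier, and the array named column is an additional illustrative example.

```python
import numpy as np

A1 = np.arange(1, 10).reshape(3, 3)         # assumed definition of A1
A2 = np.array([10, 10, 10])

result = A1 + A2                            # A2 (shape (3,)) is treated as shape (1, 3)
stretched = np.broadcast_to(A2, A1.shape)   # what A2 looks like after broadcasting
print(result)
print(stretched)

column = np.array([[100], [200], [300]])    # shape (3, 1) is stretched along the columns
print(A1 + column)

A3 = np.arange(10).reshape(2, 5)
A4 = np.arange(16).reshape(4, 4)
try:
    A3 + A4                                 # (2, 5) and (4, 4) are incompatible
except ValueError as err:
    broadcast_error = str(err)
    print("ValueError:", broadcast_error)
```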
In[85] np.sort(myArray)
Out[85] array([11, 12, 13, 14, 15, 16, 17, 18, 19])
In[86] np.argsort(myArray)
Out[86] array([0, 3, 2, 6, 5, 8, 7, 1, 4], dtype=int64)
In NumPy, multidimensional arrays can be sorted along a specified axis by using the axis
parameter in the np.sort() function.
Tricks
Here, axis = 1 is the axis along which to sort. If axis is None, the array is flattened before
sorting. The default is -1, which sorts along the last axis.
Tips
In[88] np.sort(MyArray,axis=0)
Out[88] array([[ 1, 2, 3, 24, 4],
[ 21, 22, 23, 32, 25],
[ 35, 34, 33, 100, 31]])
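The effect of the axis parameter, and the relationship between np.sort() and np.argsort(), can be sketched on a small matrix. The sample array m is an assumption for illustration.

```python
import numpy as np

m = np.array([[3, 1, 2],
              [9, 7, 8],
              [6, 4, 5]])

by_row = np.sort(m, axis=1)        # sort the values within each row
by_col = np.sort(m, axis=0)        # sort the values within each column
flat = np.sort(m, axis=None)       # flatten first, then sort

order = np.argsort(m[0])           # positions that would sort the first row
print(by_row)
print(by_col)
print(flat)
print(order, m[0][order])
```

np.sort() returns a sorted copy, so m itself is left unchanged.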
4.3 Series
Q&A
When defining a Series in Pandas, the length of the index and the length of the data
should be the same. If the lengths do not match, an exception will be raised indicating
the mismatch between the index and data lengths.
Tricks
mySeries1
NameError Traceback (most recent call last)
<ipython-input-3-88cdcb222886> in <module>()
1 import pandas as pd
----> 2 mySeries1=pd.Series([11,12,13,14,15,16,17],
index=[a,b,c,d,e,f,g])
3
4 mySeries1
Here, a NameError is raised because the quotation marks around the string ‘a’ are missing.
Strings in Python should be surrounded by either single quotation marks or double
quotation marks.
Tips
If the “data” parameter is only one value, the pd.Series() constructor will assign the same
value to each index.
Notes
When defining a Series in Pandas, if the “data” parameter contains more than one
value, the length of both the data values and the index should be the same. This ensures
that each data value is paired with a corresponding index value. If the lengths do not
match, an exception will be raised.
Notes
The returned data type is Index, which is a special type defined in Pandas.
Tricks
In[6] mySeries4.values
Out[6] array([21, 22, 23, 24, 25, 26, 27], dtype=int64)
In[7] mySeries4['b']
Out[7] 22
The Series also supports Fancy Indexing to pass an array of indices to access multiple
elements at once.
Notes
In[9] mySeries4[["a","b","c"]]
Out[9] a 21
b 22
c 23
dtype: int64
In Pandas, explicit indexes (labels) can be used as start and stop positions for slicing operations.
Unlike in plain Python slicing, both the start and stop labels will be included in the returned result.
This means that the sliced Series will contain elements starting from the start label up
to and including the stop label.
Notes
In[10] mySeries4["a":"d"]
Out[10] a 21
b 22
c 23
d 24
dtype: int64
In[11] mySeries4[1:4:2]
Out[11] b 22
d 24
dtype: int64
In[12] mySeries4
Out[12] a 21
b 22
c 23
d 24
e 25
f 26
g 27
dtype: int64
Notice that when slicing with an explicit index (i.e. mySeries4["a":"d"]), the
final index is included in the slice, while when slicing with an implicit index (i.e.
mySeries4[1:4:2]), the final index is excluded from the slice.
Tips
(5) Checking whether a value is an element of the explicit index( labels) of a series
or not.
Notes
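The slicing rules and the membership check on labels can be sketched together. This sketch rebuilds mySeries4 from the values shown above and, as an assumption made for clarity, uses the .loc and .iloc indexers so that label-based and position-based slicing are explicit.

```python
import pandas as pd

mySeries4 = pd.Series([21, 22, 23, 24, 25, 26, 27],
                      index=["a", "b", "c", "d", "e", "f", "g"])

by_label = mySeries4.loc["a":"d"]       # label slice: the stop label "d" is included
by_position = mySeries4.iloc[1:4:2]     # position slice: the stop position 4 is excluded

print(by_label)
print(by_position)

print("c" in mySeries4)                 # the in operator checks the index labels
print(23 in mySeries4)                  # 23 is a value, not a label
print(23 in mySeries4.values)           # check the values explicitly
```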
The Series.reindex() method returns a Series conformed to the new index labels, and the
correspondence between key and value is not destroyed.
Tricks
Regardless of the order of the results, the index and values of mySeries4 itself have not
changed.
Notes
In[16] mySeries5=mySeries4.reindex(index=["b","c","a","d","e","g","f"])
mySeries4
Out[16] a 21
b 22
c 23
d 24
e 25
f 26
g 27
dtype: int64
The Series.reindex() method is used to create a new Series that conforms to a new index.
By default, holes in the new index that do not have corresponding records in the
original Series are assigned NaN.
Notes
In[17] mySeries5=mySeries4.reindex(index=["new1","c","a","new2","e","g","new3"])
mySeries5
Out[17] new1 NaN
c 23.0
a 21.0
new2 NaN
e 25.0
g 27.0
new3 NaN
dtype: float64
The Series.reindex() method does not modify the explicit index of the original Series
object.
Notes
In[18] mySeries4
Out[18] a 21
b 22
c 23
d 24
e 25
f 26
g 27
dtype: int64
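The behavior of reindex() described above can be checked in one place. This sketch rebuilds mySeries4 from the values shown; the fill_value argument in the last call is an additional illustration that does not appear in the cells above.

```python
import pandas as pd

mySeries4 = pd.Series([21, 22, 23, 24, 25, 26, 27],
                      index=["a", "b", "c", "d", "e", "f", "g"])

reordered = mySeries4.reindex(index=["b", "c", "a", "d", "e", "g", "f"])
with_new = mySeries4.reindex(index=["new1", "c", "a"])              # missing label becomes NaN
filled = mySeries4.reindex(index=["new1", "c", "a"], fill_value=0)  # missing label becomes 0

print(reordered)
print(with_new)       # dtype becomes float64 because NaN is a float
print(filled)
print(mySeries4)      # the original Series is unchanged
```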
4.4 DataFrame
Q&A
There are two common ways to create a DataFrame in data science projects.
The first way is to type values directly in pandas; this way is rarely used.
The second, more commonly used way is to load datasets from existing files.
(1) The pd.DataFrame() constructor is used to type values.
(2) When importing values from existing files into Python using the pandas package,
the data stored on the computer is automatically converted into a DataFrame
object.
Notes
We use the Fancy Indexing method here to select the “id”, “diagnosis”, “area_mean”
columns of the df2 object. Refer to [4.2 Multidimensional arrays] in this book.
Notes
In[3] df2=df2[["id","diagnosis","area_mean"]]
df2.head()
Out[3] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
3 84348301 M 386.1
4 84358402 M 1297.0
The .head() function and the .tail() function are two functions commonly used in
data science projects, which are used to return the first and last n (the default value is
5) rows. If we have a large amount of data, it is not possible or necessary to display
all of the rows.
Tips
The .index attribute is used to retrieve the axis labels of a pandas object, such as a
DataFrame or a Series.
Notes
In[4] df2.index
Out[4] RangeIndex(start=0, stop=569, step=1)
The .index.size attribute is used to get the number of elements in the index, that is, the number of rows.
Notes
In[5] df2.index.size
Out[5] 569
The .columns attribute is used to get the column labels of the DataFrame.
Notes
In[6] df2.columns
Out[6] Index(['id', 'diagnosis', 'area_mean'], dtype='object')
In[7] df2.columns.size
Out[7] 3
The .shape attribute returns the number of rows and the number of columns of the DataFrame
together, as a tuple.
Notes
In[8] df2.shape
Out[8] (569, 3)
In this case, the tuple represents the dimensions of the DataFrame ‘df2’, with the
first element indicating the number of rows and the second indicating the number of
columns. Therefore, ‘df2.shape[0]’ accesses the 0th element of the tuple (the number
of rows), and ‘df2.shape[1]’ accesses the 1st element (the number of columns).
Notes
In[10] df2["id"].head()
Out[10] 0 842302
1 842517
2 84300903
3 84348301
4 84358402
Name: id, dtype: int64
In[11] df2.id.head()
Out[11] 0 842302
1 842517
2 84300903
3 84348301
4 84358402
Name: id, dtype: int64
In[12] df2["id"][2]
Out[12] 84300903
In pandas, indexing a DataFrame with [] selects columns by label. Hence, we first specify
the column (‘id’), which returns a Series, and then the row label (2). This is why
‘df2[2][“id”]’ is not valid and will raise an exception: it looks for a column labeled 2,
which does not exist.
Tricks
In[13] df2.id[2]
Out[13] 84300903
In[14] df2["id"][[2,4]]
Out[14] 2 84300903
4 84358402
Name: id, dtype: int64
In[15] df2.loc[1,"id"]
Out[15] 842517
In[16] df2.iloc[1,0]
Out[16] 842517
The main difference between .loc and .iloc lies in how they handle indexing:
(1) .loc[] is label-based.
(2) .iloc[] is integer position-based.
Tricks
The .loc, .iloc, and .ix indexers in pandas are accessed using square brackets (e.g.,
‘df.loc[]’, ‘df.iloc[]’), not parentheses. It’s worth noting that the .ix indexer was available
in earlier versions of pandas, but it was deprecated in version 0.20.0 and removed in
version 1.0. Thus, for current versions of pandas, only .loc and .iloc should be used for
label-based and integer-based indexing, respectively.
Tips
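The difference between .loc and .iloc is easiest to see when the row labels differ from the row positions. In the following minimal sketch, the DataFrame df reuses the first three rows shown above, while the row labels 10, 20, and 30 are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [842302, 842517, 84300903],
     "diagnosis": ["M", "M", "M"],
     "area_mean": [1001.0, 1326.0, 1203.0]},
    index=[10, 20, 30],               # hypothetical row labels
)

by_label = df.loc[20, "id"]           # row label 20, column label "id"
by_position = df.iloc[1, 0]           # second row, first column

label_slice = df.loc[10:20]           # label slice: the stop label 20 is included
position_slice = df.iloc[0:1]         # position slice: the stop position 1 is excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```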
In[17] df2[["area_mean","id"]].head()
Out[17] area_mean id
0 1001.0 842302
1 1326.0 842517
2 1203.0 84300903
3 386.1 84348301
4 1297.0 84358402
(4) Rows and columns of a DataFrame each have their unique explicit indices (or
labels).
The ‘index’ attribute of the DataFrame is used to get the labels of the rows.
Notes
The ‘columns’ attribute is used to get the labels of the columns.
In[18] df2.index
Out[18] RangeIndex(start=0, stop=569, step=1)
The return value of ‘df2.index’ is a RangeIndex object, which stores only the start, stop,
and step values instead of every label, and thus saves memory. To print all values of the
index directly, we can use the ‘*’ operator within a print function, like so: ‘print(*df2.index)’.
Tricks
In[19] df2.columns
Out[19] Index(['id', 'diagnosis', 'area_mean'], dtype='object')
In[20] df2["id"].head()
Out[20] 0 842302
1 842517
2 84300903
3 84348301
4 84358402
Name: id, dtype: int64
Just like with Series, the reindex() method in a DataFrame can be used to create a
new object with the data conformed to a new index. This function does not modify the
explicit index of the original DataFrame.
Tricks
In pandas, we can add a new column during the reindexing process, effectively
creating an explicit index for that column. For instance, if we want to add a new
column named ‘MyNewColumn’, we can include it in the list of columns when
calling the reindex method:
Notes
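A minimal sketch of such a call, assuming a small stand-in frame (df_small is a hypothetical name; the book's df2 would be used the same way):

```python
import pandas as pd

df_small = pd.DataFrame({"id": [842302, 842517], "area_mean": [1001.0, 1326.0]})

# reindex() returns a new object; a column that did not exist before is filled with NaN.
df_new = df_small.reindex(columns=["id", "area_mean", "MyNewColumn"])

print(df_new)
print(df_small.columns.tolist())   # the original frame is unchanged
```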
When used with an argument like ‘df2.drop([2])’, the ‘2’ is interpreted as a label-
based (or explicit) index, rather than a positional (or implicit) index.
Notes
In[25] df2.drop([2]).head()
Out[25] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
3 84348301 M 386.1
4 84358402 M 1297.0
5 843786 M 477.1
The .drop() method in pandas does not modify the original DataFrame unless the
‘inplace’ parameter is set to True.
Notes
In[26] df2.head()
Out[26] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
3 84348301 M 386.1
4 84358402 M 1297.0
Running these lines of code may raise exceptions if the initial state of ‘df2’ isn’t
preserved. This could be due to previous operations that have modified ‘df2’.
Tricks
The first parameter of the ‘df2.drop()’ function in pandas is ‘labels’, which refers to
the labels of the rows or columns you want to drop. The ‘labels’ parameter can accept
a single label or a list-like object containing multiple labels.
Tricks
df2.head()
Out[28] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
3 84348301 M 386.1
4 84358402 M 1297.0
Another method is to use the .drop() function with the ‘columns’ parameter.
Notes
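As an illustration of the ‘columns’ parameter, the following sketch drops two columns from a hypothetical three-column frame (df_small is an assumed name, with values taken from the first rows shown above).

```python
import pandas as pd

df_small = pd.DataFrame({"id": [842302, 842517],
                         "diagnosis": ["M", "M"],
                         "area_mean": [1001.0, 1326.0]})

# drop() returns a new DataFrame; df_small itself keeps all three columns.
only_area = df_small.drop(columns=["id", "diagnosis"])

print(only_area)
```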
df2.head()
Out[30] area_mean
0 1001.0
1 1326.0
2 1203.0
3 386.1
4 1297.0
df2=df2[["id","diagnosis","area_mean"]]
df2[df2.area_mean> 1000].head()
Out[31] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
4 84358402 M 1297.0
6 844359 M 1040.0
To select and display only the ‘id’ and ‘diagnosis’ columns for the first five rows where
‘area_mean’ is greater than 1000 in the DataFrame ‘df2’, you can use the following
command.
Notes
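One possible form of such a command combines a Boolean row filter with a column list inside .loc. This is a sketch on a hypothetical four-row frame (df_small is an assumed name), not necessarily the exact cell used in the book.

```python
import pandas as pd

df_small = pd.DataFrame({"id": [842302, 842517, 84300903, 84348301],
                         "diagnosis": ["M", "M", "M", "M"],
                         "area_mean": [1001.0, 1326.0, 1203.0, 386.1]})

# Rows where area_mean > 1000, restricted to the 'id' and 'diagnosis' columns.
selected = df_small.loc[df_small["area_mean"] > 1000, ["id", "diagnosis"]].head()

print(selected)
```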
When performing arithmetic operations between two DataFrames, pandas aligns them
on their explicit row and column labels before computing. If the two DataFrames do not
have the same structure, any label found in only one of them yields NaN in the result.
Notes
In[33] df4=pd.DataFrame(np.arange(6).reshape(2,3))
df4
Out[33] 0 1 2
0 0 1 2
1 3 4 5
In languages like C and Java, calculations involving arrays or lists are commonly
performed based on position indices (implicit indices). However, in Python, particularly
when using the pandas library, operations can be performed based on explicit indices
as well as position indices.
Tricks
In[34] df5=pd.DataFrame(np.arange(10).reshape(2,5))
df5
Out[34] 0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
In[35] df4+df5
Out[35] 0 1 2 3 4
0 0 2 4 NaN NaN
1 8 10 12 NaN NaN
When performing arithmetic operations using operators like +, -, *, etc., the resulting
DataFrame includes NaN values wherever a label is present in only one operand or an
operand itself is NaN. To handle these cases, pandas provides specific methods such as
add(), sub(), mul(), and div(), which accept a fill_value argument.
Notes
In[36] df6=df4.add(df5,fill_value=10)
df6
Out[36] 0 1 2 3 4
0 0 2 4 13.0 14.0
1 8 10 12 18.0 19.0
Though basic arithmetic operators like ‘+’, ‘-’, ‘*’, and ‘/’ can be used in data science
tasks with pandas, it’s generally recommended to use corresponding pandas DataFrame
methods such as add(), sub(), mul(), and div() instead. This is because these methods
are more flexible and allow for additional parameters to be set.
Tips
When performing arithmetic operations with broadcasting rules, we need to ensure the
DataFrames involved have compatible shapes. This is so that the smaller DataFrame
can be ‘broadcast’ across the larger DataFrame, meaning its values are reused to
match the shape of the larger DataFrame.
Notes
In[37] s1=pd.Series(np.arange(3))
s1
Out[37] 0 0
1 1
2 2
dtype: int32
In[38] df6-s1
Out[38] 0 1 2 3 4
0 0.0 1.0 2.0 NaN NaN
1 8.0 9.0 10.0 NaN NaN
In[39] df5=pd.DataFrame(np.arange(10).reshape(2,5))
s1=pd.Series(np.arange(3))
df5-s1
Out[39] 0 1 2 3 4
0 0.0 0.0 0.0 NaN NaN
1 5.0 5.0 5.0 NaN NaN
We can also apply the sub() function in pandas, setting the axis parameter to 1 to subtract
a Series from a DataFrame by matching the index of the Series against the column labels.
The add(), sub(), mul(), and div() functions in pandas correspond to the arithmetic
operators +, -, *, and /, respectively.
Notes
In[40] df5=pd.DataFrame(np.arange(10).reshape(2,5))
s1=pd.Series(np.arange(3))
df5.sub(s1,axis=1)
Out[40] 0 1 2 3 4
0 0.0 0.0 0.0 NaN NaN
1 5.0 5.0 5.0 NaN NaN
In pandas, setting the parameter axis=1 during an operation signifies the following:
1. The number of rows remains the same before and after the operation.
2. The operation is performed across all columns in each row.
3. Each column is considered as a whole during the operation.
Notes
When performing arithmetic operations along the vertical axis (axis=0) in pandas, the
index of the Series is matched against the row labels of the DataFrame, and each element
of the Series is applied across the row with the same label. Row labels that appear in
only one of the two operands produce rows of NaN.
Notes
In[41] df5=pd.DataFrame(np.arange(10).reshape(2,5))
s1=pd.Series(np.arange(3))
df5.sub(s1,axis=0)
Out[41] 0 1 2 3 4
0 0.0 1.0 2.0 3.0 4.0
1 4.0 5.0 6.0 7.0 8.0
2 NaN NaN NaN NaN NaN
In[42] df7=pd.DataFrame(np.arange(20).reshape(4,5))
df7
Out[42] 0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In[43] df7+2
Out[43] 0 1 2 3 4
0 2 3 4 5 6
1 7 8 9 10 11
2 12 13 14 15 16
3 17 18 19 20 21
In[44] print(df7)
print("df7.cumsum=",df7.cumsum())
Out[44] 0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df7.cumsum= 0 1 2 3 4
0 0 1 2 3 4
1 5 7 9 11 13
2 15 18 21 24 27
3 30 34 38 42 46
In[45] df7
Out[45] 0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In[46] df7.rolling(2).sum()
Out[46] 0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 5.0 7.0 9.0 11.0 13.0
2 15.0 17.0 19.0 21.0 23.0
3 25.0 27.0 29.0 31.0 33.0
In[47] df7.rolling(2,axis=1).sum()
Out[47] 0 1 2 3 4
0 NaN 1.0 3.0 5.0 7.0
1 NaN 11.0 13.0 15.0 17.0
2 NaN 21.0 23.0 25.0 27.0
3 NaN 31.0 33.0 35.0 37.0
In[48] df7.cov()
Out[48] 0 1 2 3 4
0 41.666667 41.666667 41.666667 41.666667 41.666667
1 41.666667 41.666667 41.666667 41.666667 41.666667
2 41.666667 41.666667 41.666667 41.666667 41.666667
3 41.666667 41.666667 41.666667 41.666667 41.666667
4 41.666667 41.666667 41.666667 41.666667 41.666667
In[49] df7.corr()
Out[49] 0 1 2 3 4
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
df2=df2[["id","diagnosis","area_mean"]][2:5]
df2.T
Out[50] 2 3 4
id 84300903 84348301 84358402
diagnosis M M M
area_mean 1203 386.1 1297
In[51] print(df6)
Out[51] 0 1 2 3 4
0 0 2 4 13.0 14.0
1 8 10 12 18.0 19.0
In[52] df6>5
Out[52] 0 1 2 3 4
0 False False False True True
1 True True True True True
In[53] print(s1)
Out[53] 0 0
1 1
2 2
dtype: int32
In[54] df6>s1
Out[54] 0 1 2 3 4
0 False True True False False
1 True True True False False
df2 = pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
df2.describe()
Out[55] id area_mean
count 5.690000e+02 569.000000
mean 3.037183e+07 654.889104
std 1.250206e+08 351.914129
min 8.670000e+03 143.500000
25% 8.692180e+05 420.300000
50% 9.060240e+05 551.100000
75% 8.813129e+06 782.700000
max 9.113205e+08 2501.000000
In[56] dt = df2[df2.diagnosis=='M']
In data science projects, the amount of data can be large, and it’s often unnecessary to
access all rows of the data at once. Instead, it’s common to only need the first or last
few rows for analysis or inspection purposes. This is particularly applicable when the
data has a consistent structure, with each row having the same set of columns.
Notes
In[57] dt.head()
Out[57] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
3 84348301 M 386.1
4 84358402 M 1297.0
By using functions like .head() or .tail(), you can easily access a specified number
of rows from the beginning or end of the DataFrame, respectively. These functions
are efficient ways to quickly examine a subset of the data without displaying the entire
dataset, which can be time-consuming and resource-intensive.
Notes
In[58] dt.tail()
Out[58] id diagnosis area_mean
563 926125 M 1347.0
564 926424 M 1479.0
565 926682 M 1261.0
566 926954 M 858.1
567 927241 M 1265.0
The function DataFrame.tail(n=5) in pandas returns the last ‘n’ rows from the
DataFrame based on their position. This function is particularly useful for quickly
verifying data, such as after performing sorting or appending rows to the DataFrame.
Tricks
The count() method counts the number of non-null (non-empty) values for each
column by default.
Notes
In[59] df2[df2.diagnosis=='M'].count()
Out[59] id 212
diagnosis 212
area_mean 212
dtype: int64
In[60] df2[["area_mean","id"]].head()
Out[60] area_mean id
0 1001.0 842302
1 1326.0 842517
2 1203.0 84300903
3 386.1 84348301
4 1297.0 84358402
In[61] df2.head(8)
Out[61] id diagnosis area_mean
0 842302 M 1001.0
1 842517 M 1326.0
2 84300903 M 1203.0
3 84348301 M 386.1
4 84358402 M 1297.0
5 843786 M 477.1
6 844359 M 1040.0
7 84458202 M 577.9
The sort_values() method in pandas can be used to sort the DataFrame by the values
along either axis, which can be the rows (axis=0) or the columns (axis=1).
Notes
In[62] df2.sort_values(by="area_mean",axis=0,ascending=True).head()
Out[62] id diagnosis area_mean
101 862722 B 143.5
539 921362 B 170.4
538 921092 B 178.8
568 92751 B 181.0
46 85713702 B 201.9
The sort_index() method in pandas is used to sort an object (e.g., DataFrame or Series)
by its labels along a specified axis. By default, it sorts the object based on the index
labels, but you can also specify axis=1 to sort along the columns.
Notes
In[63] df2.sort_index(axis=1).head(3)
Out[63] area_mean diagnosis id
0 1001.0 M 842302
1 1326.0 M 842517
2 1203.0 M 84300903
Setting axis=0 in pandas implies that the operation is applied vertically to all rows in
each column, while maintaining the same number of columns. Each row is treated as
a collective entity during the operation.
Notes
In[64] df2.sort_index(axis=0,ascending=False).head(3)
Out[64] id diagnosis area_mean
568 92751 B 181.0
567 927241 M 1265.0
566 926954 M 858.1
The prerequisite for importing and exporting a DataFrame is to know the current
working directory, as described in [3.8 Current working directory]. To retrieve the
current working directory of a process, you can use the getcwd() function from the os
module.
Notes
In[65] import os
print(os.getcwd())
Out[65] C:\Users\soloman\clm
In[66] df2.head(3).to_csv("df2.csv")
One more example is calling the to_excel() method to write the first 3 rows of
DataFrame df2 to an Excel file named “df3.xls”.
Notes
In[68] df2.head(3).to_excel("df3.xls")
Next, we can use the read_excel() method to read the Excel file ‘df3.xls’ into a pandas
DataFrame and save it as ‘df3’.
Notes
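The same export and import cycle can be sketched with CSV, which avoids the extra Excel engine that to_excel() and read_excel() depend on. This is a hedged illustration rather than the book's cell: it swaps in to_csv() and read_csv(), and the temporary directory and the name df_small are assumptions.

```python
import os
import tempfile

import pandas as pd

df_small = pd.DataFrame({"id": [842302, 842517, 84300903],
                         "diagnosis": ["M", "M", "M"],
                         "area_mean": [1001.0, 1326.0, 1203.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df2.csv")
    df_small.to_csv(path)                        # the row index is written as the first column
    restored = pd.read_csv(path, index_col=0)    # read that first column back as the index

print(restored)
```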
When accessing the .empty attribute of a DataFrame, it returns True if the DataFrame
is empty, i.e., if any of its axes has length 0 (no rows or no columns). Otherwise it
returns False, even if every value in the DataFrame is NaN.
Notes
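The two boundary cases can be checked with a short sketch; both frames below are hypothetical examples built for this purpose.

```python
import numpy as np
import pandas as pd

no_rows = pd.DataFrame(columns=["id", "diagnosis"])   # columns exist, but there are no rows
only_nan = pd.DataFrame({"a": [np.nan]})              # one row whose only value is NaN

print(no_rows.empty)    # True: the row axis has length 0
print(only_nan.empty)   # False: NaN still counts as an entry
```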
In[70] df3.empty
Out[70] False
np.nan is a floating-point value, while None is a Python object of type NoneType. As a
result, np.nan can be used in mathematical operations (the result is always nan), while
None cannot.
Notes
In[71] np.nan-np.nan +1
Out[71] nan
In[72] np.nan-np.nan
Out[72] nan
In[73] None+1
TypeError Traceback (most recent call last)
<ipython-input-83-6e170940e108> in <module>()
----> 1 None+1
Here, the built-in list() function is used to convert a string, such as “ab”, into a list of
single-character strings, [‘a’, ‘b’], in Python. For more details, please refer to [2.17 Built-in
Functions].
Notes
In[75] list("ab")
Out[75] ['a', 'b']
In[76] B=pd.DataFrame(np.array([1,1,1,2,2,2,3,3,3]).reshape(3,3),
columns=list("abc"),index=list("SWT"))
B
Out[76] a b c
S 1 1 1
W 2 2 2
T 3 3 3
In[77] C=A+B
C
Out[77] a b c
S 11.0 11.0 NaN
T NaN NaN NaN
W 22.0 22.0 NaN
In the expression A.add(B, fill_value=0), the fill_value=0 parameter specifies that a
value missing from only one of the two DataFrames is replaced with 0 before the addition.
If a value is missing from both A and B, the result is still missing.
Notes
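The effect can be reproduced with the sketch below. The definition of A is an assumption reconstructed from the A.stack() output shown later in this section; B follows the definition in In[76].

```python
import numpy as np
import pandas as pd

# Assumed definition of A (rows S and W, columns a and b).
A = pd.DataFrame([[10, 10], [20, 20]], index=list("SW"), columns=list("ab"))
B = pd.DataFrame(np.array([1, 1, 1, 2, 2, 2, 3, 3, 3]).reshape(3, 3),
                 columns=list("abc"), index=list("SWT"))

plain = A + B                      # labels found only in B give NaN
filled = A.add(B, fill_value=0)    # the absent side is treated as 0 instead

print(plain)
print(filled)
```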
In[78] A.add(B,fill_value=0)
Out[78] a b c
S 11.0 11.0 1.0
T 3.0 3.0 3.0
W 22.0 22.0 2.0
In[79] A.add(B,fill_value=A.stack().mean())
Out[79] a b c
S 11.0 11.0 16.0
T 18.0 18.0 18.0
W 22.0 22.0 17.0
In[80] A.mean()
Out[80] a 15.0
b 15.0
dtype: float64
In pandas, the stack() method is used to pivot a DataFrame from a wide format to a
long format by creating a multi-level index. It essentially “stacks” or compresses the
columns of the DataFrame into a single column, resulting in a reshaped DataFrame or
Series with a multi-level index.
Notes
In[81] A.stack()
Out[81] S a 10
b 10
W a 20
b 20
dtype: int32
In[82] A.stack().mean()
Out[82] 15.0
In[83] C
Out[83] a b c
S 11.0 11.0 NaN
T NaN NaN NaN
W 22.0 22.0 NaN
There are four important functions in pandas to handle missing values in a DataFrame:
isnull(), notnull(), dropna(), and fillna().
Notes
(1) isnull(): This function returns a Boolean mask that identifies missing values in the
DataFrame.
Notes
In[84] C.isnull()
Out[84] a b c
S False False True
T True True True
W False False True
(2) notnull(): This function returns a Boolean mask that identifies non-missing values in
the DataFrame.
In[85] C.notnull()
Out[85] a b c
S True True False
T False False False
W True True False
(3) dropna(): This function is used to remove or drop rows or columns that contain
missing values.
Notes
In[86] C.dropna(axis='index')
Out[86] a b c
(4) fillna(): This function is used to fill missing values in the DataFrame with a specified
value or a calculated value.
Notes
In[87] C.fillna(0)
Out[87] a b c
S 11.0 11.0 0.0
T 0.0 0.0 0.0
W 22.0 22.0 0.0
By specifying method=“ffill”, missing values in the DataFrame are filled with the last
known non-null value, moving down each column by default.
Notes
In[88] C.fillna(method="ffill")
Out[88] a b c
S 11.0 11.0 NaN
T 11.0 11.0 NaN
W 22.0 22.0 NaN
By specifying method=“bfill”, missing values in the DataFrame are filled with the next
non-null value. With axis=1, the fill runs along each row instead of down each column.
Notes
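Recent pandas releases deprecate the method= argument of fillna() in favour of the dedicated ffill() and bfill() methods, so the two fills can also be sketched as follows. The frame C below is rebuilt by hand from the output shown earlier; that reconstruction is an assumption.

```python
import numpy as np
import pandas as pd

C = pd.DataFrame({"a": [11.0, np.nan, 22.0],
                  "b": [11.0, np.nan, 22.0],
                  "c": [np.nan, np.nan, np.nan]}, index=list("STW"))

down = C.ffill()           # forward fill down each column
across = C.bfill(axis=1)   # backward fill along each row

print(down)
print(across)
```

Column c stays empty in both results: it has no earlier value to carry forward, and no column to its right to fill from.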
In[89] C.fillna(method="bfill",axis=1)
Out[89] a b c
S 11.0 11.0 NaN
T NaN NaN NaN
W 22.0 22.0 NaN
Tips
In[91] df2.groupby("diagnosis")["area_mean"].mean()
Out[91] diagnosis
B 462.790196
M 978.376415
Name: area_mean, dtype: float64
To aggregate using one or more operations over the specified axis, we can call the
aggregate() method.
Notes
In[92] df2.groupby("diagnosis")["area_mean"].aggregate(["mean","sum","max",
np.median])
Out[92] mean sum max median
diagnosis
B 462.790196 165216.1 992.1 458.4
M 978.376415 207415.8 2501.0 932.0
In[93] df2.groupby("diagnosis")["area_mean"].aggregate(["mean","sum"]).
unstack()
Out[93] diagnosis
mean B 462.790196
M 978.376415
sum B 165216.100000
M 207415.800000
dtype: float64
The stack(), unstack(), pivot(), and melt() methods are commonly used in data science
to convert data formats:
1. pandas.DataFrame.stack(): This method returns a reshaped DataFrame or Series
with a multi-level index. It adds one or more new inner-most levels compared to
the current DataFrame, creating a hierarchical structure.
2. pandas.DataFrame.unstack(): The unstack() method returns a DataFrame with
a new level of column labels. The inner-most level of the resulting DataFrame
consists of the pivoted index labels. This operation is useful for reshaping data
from long to wide format.
3. pandas.DataFrame.pivot(): The pivot() method reshapes data, essentially
producing a “pivot” table. It uses unique values from specified index/columns
to form axes of the resulting DataFrame, allowing for easy restructuring of data
based on column values.
4. pandas.DataFrame.melt(): The melt() method unpivots a DataFrame from wide to
long format, keeping the identifier columns and gathering all other columns into
a pair of variable and value columns.
Tricks
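The four methods can be compared on one small frame. The sketch below is illustrative only; the frame wide and the column names chosen for melt() are assumptions.

```python
import pandas as pd

wide = pd.DataFrame({"a": [10, 20], "b": [30, 40]}, index=["S", "W"])

long = wide.stack()       # Series indexed by (row label, column label)
back = long.unstack()     # inner index level becomes the columns again

# melt() needs the identifiers as a regular column, hence reset_index().
tidy = wide.reset_index().melt(id_vars="index", var_name="column", value_name="value")
pivoted = tidy.pivot(index="index", columns="column", values="value")

print(long)
print(tidy)
print(pivoted)
```

Here stack() and unstack() work through the index, while melt() and pivot() work through ordinary columns.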
By utilizing the apply() method in pandas, you can apply a user-defined function to
groups within a DataFrame.
Notes
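The definition of myfunc is not shown here. The output values are consistent with dividing each area_mean by the sum of its group (for example, 1001.0 / 207415.8 ≈ 0.004826), so the sketch below illustrates that idea. It swaps in transform() for apply(), and the frame df_small and its values for group B are hypothetical.

```python
import pandas as pd

df_small = pd.DataFrame({"diagnosis": ["M", "M", "B", "B"],
                         "area_mean": [1001.0, 1326.0, 500.0, 300.0]})

# Share of each row within its diagnosis group; transform() keeps the original row order.
share = df_small.groupby("diagnosis")["area_mean"].transform(lambda s: s / s.sum())

print(share)
```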
df2.groupby("diagnosis").apply(myfunc).head()
Out[94] id diagnosis area_mean
0 842302 M 0.004826
1 842517 M 0.006393
2 84300903 M 0.005800
3 84348301 M 0.001861
4 84358402 M 0.006253
Q&A
The Python built-in module datetime provides three different classes for creating dates
or times:
datetime.time() returns an idealized time, independent of any particular day,
including attributes: hour, minute, second, microsecond.
datetime.date() returns an idealized naïve date with attributes: year, month, and
day.
datetime.datetime() returns a combination of a date and a time, including attributes:
year, month, day, hour, minute, second, microsecond, and tzinfo.
Tricks
In[3] dt.datetime?
Out[3] Init signature: dt.datetime(self, /, *args, **kwargs)
Docstring:
datetime(year, month, day[, hour[, minute[, second[, microsecond[,tzinfo]]]]])
The year, month and day arguments are required. tzinfo may be None,
or an instance of a tzinfo subclass. The remaining arguments may be ints.
File:
c:\users\administrator\appdata\local\programs\python\python36\lib\datetime.py
Type: type
In the dt.datetime() function, the year, month, and day arguments are required. The
tzinfo argument may be set to None or an instance of a tzinfo subclass. The remaining
arguments are optional but must be integers.
Notes
In[4] dt.datetime(month=3,day=3,second=59)
TypeError Traceback (most recent call last)
<ipython-input-5-6fbb4e101d77> in <module>()
----> 1 dt.datetime(month=3,day=3,second=59)
However, the second, minute, and hour arguments are optional for the
datetime.datetime() function.
Notes
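A short sketch of both cases, using hypothetical dates: omitted time arguments default to 0, and individual ones can be passed by keyword.

```python
import datetime as dt

d_date_only = dt.datetime(year=2022, month=3, day=3)   # hour, minute, second default to 0
d_with_second = dt.datetime(2022, 3, 3, second=59)     # only the second is set

print(d_date_only)
print(d_with_second)
```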
There are many formats used to represent time or date, such as ‘3rd of July, 2022’,
‘2022-1-3’, and ‘2022-07-03 00:00:00’. However, the constructors of the Python built-in
module datetime expect integer arguments rather than such strings. Passing one of
these strings directly to dt.datetime() raises a TypeError.
Tips
In[7] dt.datetime("2022-1-3")
TypeError Traceback (most recent call last)
<ipython-input-9-c1b53c571977> in <module>()
----> 1 dt.datetime("2022-1-3")
In data science, there are common methods used for parsing a string into a standard
date or time format:
1. The parser.parse() method in the dateutil package
2. The to_datetime() method in the pandas package
Tricks
(1) parser.parse()
Notes
from dateutil import parser  # import needed for parser.parse()
date = parser.parse("2022-1-3")
print(date)
Out[9] 2022-01-03 00:00:00
import pandas as pd
pd.to_datetime("3rd of July, 2022")
Out[10] Timestamp('2022-07-03 00:00:00')
import pandas as pd
pd.to_datetime("2022-1-3")
Out[11] Timestamp('2022-01-03 00:00:00')
(1) To obtain the current local date and time, you can use the datetime.datetime.now()
function.
Notes
In[12] dt.datetime.now()
Out[12] datetime.datetime(2022, 5, 24, 21, 39, 50, 155634)
(2) The datetime.datetime.today() function also returns the current local date and
time. To obtain only the current local date, you can use the datetime.date.today()
function.
Notes
In[13] dt.datetime.today()
Out[13] datetime.datetime(2022, 5, 24, 21, 39, 50, 913872)
In[14] now=dt.datetime.now()
now.strftime("%W"),now.strftime("%a"),now.strftime("%A"),
now.strftime("%B"),now.strftime("%C"),now.strftime("%D")
Out[14] ('51', 'Sun', 'Sunday', 'December', '20', '12/23/18')
You can evaluate the duration, or the difference between two date or time objects,
by subtracting one object from another.
Notes
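Subtracting two datetime objects yields a timedelta, which stores the difference as days, seconds, and microseconds. The sketch below uses two fixed, hypothetical dates so that the result does not depend on when the code is run.

```python
import datetime as dt

d1 = dt.datetime(2018, 5, 24, 12, 0, 0)
d2 = dt.datetime(2017, 3, 3)

delta = d1 - d2                  # a datetime.timedelta object
print(delta.days)                # whole days
print(delta.seconds)             # remaining seconds within the last day
print(delta.total_seconds())     # the full duration expressed in seconds
```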
In[15] d1=dt.datetime.now()
d2=dt.datetime(year=2017,month=3,day=3)
(d1-d2).days
Out[15] 447
In[16] myindex=pd.DatetimeIndex(["2023-1-1","2024-1-2","2023-1-3","2023-1-4",
"2023-1-5"])
In[16] data=pd.Series([1,2,3,4,5],index=myindex)
data
Out[16] 2023-01-01 1
2024-01-02 2
2023-01-03 3
2023-01-04 4
2023-01-05 5
dtype: int64
In[17] data["2023-1-2"]
Out[17] Series([], dtype: int64)
In[18] data["2023"]
Out[18] 2023-01-01 1
2023-01-03 3
2023-01-04 4
2023-01-05 5
dtype: int64
In[21] data.to_period(freq="D")
Out[21] 2023-01-01 1
2024-01-02 2
2023-01-03 3
2023-01-04 4
2023-01-05 5
Freq: D, dtype: int64
In[22] data.to_period(freq="M")
Out[22] 2023-01 1
2024-01 2
2023-01 3
2023-01 4
2023-01 5
Freq: M, dtype: int64
Q&A
Matplotlib, Seaborn, and Pandas are widely used and important packages for data
visualization in Python.
Tips
Matplotlib is organized in a hierarchy. At the top of the hierarchy is the matplotlib
“state-machine environment” which is provided by the matplotlib.pyplot module. At
this level, simple functions are used to add plot elements (lines, images, text, etc.) to
the current axes in the current figure.
Notes
%matplotlib inline
You can also use the magic command “%matplotlib notebook” to create interactive
figures if your environment allows it.
Notes
To generate a dataset t for visualization purposes, you can use the np.arange function
with the specified parameters:
Notes
The method to display multiple lines in a figure using Matplotlib is to pass multiple
arguments to the plt.plot() function. The argument format is as follows: “x1, y1, x2,
y2, x3, y3, x4, y4, ...”.
Notes
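A minimal sketch of this argument format is given below. The range passed to np.arange and the non-interactive "Agg" backend are assumptions, chosen so that the snippet also runs outside a notebook; in Jupyter, the backend lines can be dropped and plt.show() called at the end.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 4.0, 0.1)     # hypothetical sample points

# Three (x, y) pairs in a single call give three lines on the same axes.
lines = plt.plot(t, t, t, t + 2, t, t ** 2)

print(len(lines))
```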
In[6] plt.plot(t,t,t,t+2,t,t**2,t,t+8)
plt.show()
Out[6]
(2) To set line styles and colors, e.g. ‘g--’ for a green dashed line (‘--’).
Line Styles:
‘-’ solid line style
‘--’ dashed line style
‘-.’ dash-dot line style
‘:’ dotted line style
Colors:
‘b’ blue
‘g’ green
‘r’ red
‘c’ cyan
‘m’ magenta
‘y’ yellow
‘k’ black
‘w’ white
Notes
One more example of setting line styles and colors is “rD”, which means “red + diamond”.
For more arguments, please refer to Matplotlib’s official website documentation.
You can learn more about the meaning of the third argument of plt.plot() through the
help documentation. The specific command is: plt.plot?
Notes
(3) To set the title of a plot and change axis labels: plt.title(), plt.xlabel() and
plt.ylabel().
Notes
plt.show()
Out[10]
plt.title(), plt.xlabel() and plt.ylabel() correspond to the title, X-axis label and Y-axis
label.
Tricks
Placing plt.title(), plt.xlabel(), and plt.ylabel() between plt.plot() and plt.show()
ensures that the plot is configured with the desired title and axis labels before it is
shown.
Tricks
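The ordering can be sketched as follows. The data, the label texts, and the off-screen "Agg" backend are assumptions made so that the example is self-contained; in a notebook, plt.show() would follow as the last line.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 4.0, 0.1)     # hypothetical sample points

plt.plot(t, t ** 2, "g--", label="t squared")   # 'g--' means a green dashed line
plt.title("Demo title")
plt.xlabel("t")
plt.ylabel("t squared")
plt.legend(loc="upper left")

ax = plt.gca()                   # the axes that the calls above configured
print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
```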
plt.legend(loc="upper left",labels=["Legend"])
plt.show()
Out[11]
The argument loc=“upper left” in the context of plt.legend() specifies that the legend
should be positioned in the upper left corner of the plot.
To gain more information about the available options for the loc argument, you can
Tricks refer to the docstring of plt.legend().
To switch the plotting functions in Matplotlib, such as changing from plt.plot() to
plt.scatter() to create a scatter plot, you can use the appropriate function based on the
type of plot you want to generate.
Notes
I highly recommend accessing the example plots provided on the official Matplotlib
website (https://matplotlib.org/stable/gallery/index.html). Each example not only
showcases the visualization effects but also provides the corresponding source code.
Tricks
plt.xlim(11, -2) sets the x-axis to run from 11 down to -2. Because the first limit is
larger than the second, the axis is drawn reversed.
plt.ylim(2.2, -1.3) likewise sets a reversed y-axis running from 2.2 down to -1.3.
Notes
plt.plot(x,np.sin(x))
plt.xlim(11,-2)
plt.ylim(2.2,-1.3)
To get or set various axis properties in Matplotlib, you can use the
matplotlib.pyplot.axis() function.
Notes
In[14] plt.plot(x,np.sin(x))
plt.axis([-1,21,-1.6,1.6])
Out[14] (-1.0, 21.0, -1.6, 1.6)
To set equal scaling for both the x-axis and y-axis by changing the axis limits:
plt.axis(“equal”)
Notes
In[15] plt.plot(x,np.sin(x))
plt.axis([-1,21,-1.6,1.6])
plt.axis("equal")
Out[15] (-0.5, 10.5, -1.0993384025373631, 1.0996461858110391)
To set limits just large enough to show all data and disable further autoscaling:
plt.axis(“tight”)
Notes
In[16] plt.plot(x,np.sin(x))
plt.axis([-1,21,-1.6,1.6])
plt.axis("tight")
Out[16] (-0.5, 10.5, -1.0993384025373631, 1.0996461858110391)
To create multiple plots on the same coordinates, you can call several plotting functions
in the same cell and then call plt.legend() to display the labels of the plots.
Notes
In[17] plt.plot(x,np.sin(x),label="sin(x)")
plt.plot(x,np.cos(x),label="cos(x)")
plt.axis("equal")
plt.legend()
Out[17] <matplotlib.legend.Legend at 0x124156820>
To add an Axes to the current figure or retrieve an existing Axes, you can call the
plt.subplot(nrows, ncols, index) function before each line of code that creates a plot.
The index counts the grid cells from 1, row by row.
Notes
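The grid numbering can be sketched with two cells of a 2 x 3 grid. The height and weight values are the first rows of the women data shown later in the book; the off-screen "Agg" backend is an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

height = [58, 59, 60, 61, 62]
weight = [115, 117, 120, 123, 126]

fig = plt.figure()
plt.subplot(2, 3, 1)             # first row, first column
plt.scatter(height, weight)
plt.subplot(2, 3, 5)             # second row, middle column
plt.scatter(height, weight)

print(len(fig.axes))
```

Only the two requested cells are created; the other four positions of the grid stay blank.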
In[18] plt.subplot(2,3,5)
plt.scatter(women["height"], women["weight"])
plt.subplot(2,3,1)
plt.scatter(women["height"], women["weight"])
plt.show()
Out[18]
The code plt.savefig(“savefig.png”) will save the current plot to the current working
directory with the filename “savefig.png”. However, you can customize the file
name and path by providing the desired directory and filename in the plt.savefig()
function.
Tricks
First, generate the experimental datasets X and y that will be used for visualization.
The make_blobs function generates isotropic Gaussian blobs, that is, clusters of random
points that each follow a normal distribution.
Notes
Pandas provides several different options for visualizing data with .plot() and their
usage is similar to matplotlib.
Notes
In[22] women.plot(kind="barh")
plt.show()
Out[22]
In[23] women.plot(kind="bar",x="height",y="weight",color="g")
plt.show()
Out[23]
In[24] women.plot(kind="kde")
plt.show()
Out[24]
In[25] women.plot(kind="bar",x="height",y="weight",color="g")
plt.legend(loc="best")
plt.show()
Out[25]
Note that the function name for drawing plots in Seaborn is lmplot, which is different
from the function name in Matplotlib. Additionally, the arguments for lmplot also
differ from those in Matplotlib.
Tricks
To create a Kernel Density Estimation (KDE) plot for visualizing the distribution of
observations in a dataset : sns.kdeplot().
Notes
In[28] sns.distplot(women.height)
Out[28] <matplotlib.axes._subplots.AxesSubplot at 0x135b8c6a0>
In[29] sns.pairplot(women)
Out[29] <seaborn.axisgrid.PairGrid at 0x135c44eb0>
To create a plot of two variables with bivariate and univariate graphs : sns.jointplot()
Notes
In[30] sns.jointplot(x=women.height,y=women.weight,kind="reg")
Out[30] <seaborn.axisgrid.JointGrid at 0x22d35c20280>
The data file salaries.csv is available in the electronic resources of this book.
Tricks
Calling the pandas.read_csv() method to read the file “salaries.csv” and store the
data in a DataFrame object called df_salaries.
Notes
Calling the df_salaries.head() method to display the first 6 rows of the DataFrame
object df_salaries. For detailed descriptions, please refer to [4.4 DataFrame].
Notes
In[33] df_salaries.head(6)
Out[33] rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 AssocProf B 6 6 Male 97000
In[35] sns.set_style('darkgrid')
sns.stripplot(data=df_salaries, x='rank', y='salary', jitter=True, alpha=0.5)
sns.boxplot(data=df_salaries, x='rank', y='salary')
Out[35] <matplotlib.axes._subplots.AxesSubplot at 0xd770b38>
Here, the argument “jitter=True” is used to add a small amount of random noise to the
data points in order to prevent them from overlapping and make the distribution more
visible. The argument “alpha=0.5” is used to adjust the transparency of the data
points, where 0.5 represents a medium level of opacity.
Tricks
Exercises
C.
a 1
a 4
b 2
b 5
dtype: int64
[12] What will the following program print out?
import numpy as np
import pandas as pd
mySeries1=pd.Series([1,2,3,4,5], index=["a","b","c","d","e"])
mySeries2=mySeries1.reindex(index=["b","c","a","d","e"])
np.all(mySeries2.values==mySeries1.values)
A. False
B. True
C. ValueError
[13] What will the following program print out?
import pandas as pd
mySeries4=pd.Series([21,22,23,24,25,26,27], index=["a","b","c","d","e","f","g"])
"c" in mySeries4
A. False
B. True
C. ValueError
[14] What will the following program print out?
import numpy as np
import pandas as pd
df=pd.DataFrame(np.arange(1,21).reshape(5,4))
df.iloc[3,2]
A. 18
B. 10
C. 15
[15] Which of the following statements about DataFrame is wrong?
A. The dataframe with only one-dimensional data is series, both of which are under the Pandas package.
B. The row name of the dataframe can be accessed with the rows attributes.
C. The column name of the dataframe can be accessed with the columns attributes.
[16] For the following code:
{import datetime as dt}
Which of the following time and date definitions is wrong?
A. dt.datetime(2019,12,12,23,23,23)
B. dt.datetime(2019,0,0,23,23,23)
C. dt.datetime(2019,12,12,0)
D. dt.time(23,23,23)
[17] When calculating the time difference, the calculating unit can be( )
A. days
B. seconds
C. microseconds
D. All of the above
Data analysis is one of the most critical stages in the data science life cycle. This chapter will introduce various data
analysis skills including:
Statistical modelling with statsmodels
Machine learning with scikit-learn
Natural language understanding with NLTK
Image processing with OpenCV
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 295
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2_5
Q&A
There are two main concepts in statistical analysis: the feature matrix and the target
vector.
Taking y = F(X) as an example, X represents the feature matrix. The feature matrix
is assumed to be two-dimensional, with a shape of [n_samples, n_features]. It is
typically stored in a NumPy array or a Pandas DataFrame, although some Scikit-Learn
models also accept SciPy sparse matrices. Each sample in the feature matrix is
stored in a separate row.
Notes
y is the dependent variable and also termed “target vector” or “target array”.
The target vector is usually one dimensional, with length n_samples. It is commonly
stored in a NumPy array or Pandas Series.
Notes
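The two shapes can be checked directly. The sketch below builds a three-row stand-in for the women data (df_demo is a hypothetical name) and shows one way to obtain a two-dimensional X and a one-dimensional y.

```python
import pandas as pd

df_demo = pd.DataFrame({"height": [58, 59, 60], "weight": [115, 117, 120]})

X = df_demo[["height"]]   # double brackets keep two dimensions: (n_samples, n_features)
y = df_demo["weight"]     # single brackets give a one-dimensional Series: (n_samples,)

print(X.shape, y.shape)
```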
In[1] # To obtain the current working directory in Python, you can use the os.getcwd() method.
import os
print(os.getcwd())
Out[1] C:\Users\soloman\clm
To get the current working directory in Python, you can use the os.getcwd()
method. And if you want to change the current working directory, you can use the
os.chdir(path) method.
Tips
In[2] # Load data from the current working directory into a pandas DataFrame
import pandas as pd
df_women = pd.read_csv("women.csv", index_col=0)  # assumed call; file name taken from the tip below
print(df_women.head())
Out[2]
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
Tips
The women.csv file is available in the learning resources for this textbook.
Tips
df_women.shape
Out[3] (15, 2)
print(df_women.columns)
Out[4] Index(['height', 'weight'], dtype='object')
df_women.describe()
Out[5] height weight
count 15.000000 15.000000
mean 65.000000 136.733333
std 4.472136 15.498694
min 58.000000 115.000000
25% 61.500000 124.500000
50% 65.000000 135.000000
75% 68.500000 148.000000
max 72.000000 164.000000
Out[6]
Tips
(4) The data visualization chart indicates that the relationship between dependent
variable and independent variable is linear. So we can conduct a linear regression
analysis. Firstly, we need to arrange data into a feature matrix and target vector.
Tips
In[7] X = df_women["height"]
y = df_women["weight"]
Out[7] (15, 2)
In[8] X
Out[8] 1 58
2 59
3 60
4 61
5 62
6 63
7 64
8 65
9 66
10 67
11 68
12 69
13 70
14 71
15 72
Name: height, dtype: int64
In[9] y
Out[9] 1 115
2 117
3 120
4 123
5 126
6 129
7 132
8 135
9 139
10 142
11 146
12 150
13 154
14 159
15 164
Name: weight, dtype: int64
In fact, the data type of y here is not the one we require for the target vector. We can
check its data type with type(y). Subsequent lines of code won’t raise exceptions, because
the statsmodels package handles data type conversions automatically. However, if other
packages, such as scikit-learn, are used, exceptions may occur. To prevent this, we can
use the np.ravel() function to convert y into a one-dimensional NumPy array.
Tips
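The difference between a pandas Series, a flat NumPy target vector, and a two-dimensional feature matrix can be checked directly. The following is a minimal sketch that uses only the first five rows printed in Out[2]; the variable names y_series, y_array, and X_matrix are illustrative and do not come from the original code.

```python
import numpy as np
import pandas as pd

# First five rows of the women data, as printed in Out[2]
df = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                   "weight": [115, 117, 120, 123, 126]})

y_series = df["weight"]            # a pandas Series
y_array = np.ravel(df["weight"])   # a flat NumPy array of shape (n_samples,)
X_matrix = df[["height"]]          # double brackets keep a 2-D (n_samples, 1) table

print(type(y_series).__name__, type(y_array).__name__)
print(y_array.shape, X_matrix.shape)
```

Selecting with double brackets is one common way to obtain a feature matrix with an explicit column dimension.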
import statsmodels.api as sm
Statsmodels, statistics, and scikit-learn are three popular packages used for statistical
analysis and machine learning in Python.
Tips
In[11] X
Out[11] 1 58
2 59
3 60
4 61
5 62
6 63
7 64
8 65
9 66
10 67
11 68
12 69
13 70
14 71
15 72
Name: height, dtype: int64
302 Python Data Science
By default, sm.OLS() does not include an intercept. To fit one, we add a column of ones
to the feature matrix with the sm.add_constant() function.
Notes
In[12] X_add_const=sm.add_constant(X)
X_add_const
Out[12] const height
1 1.0 58
2 1.0 59
3 1.0 60
4 1.0 61
5 1.0 62
6 1.0 63
7 1.0 64
8 1.0 65
9 1.0 66
10 1.0 67
11 1.0 68
12 1.0 69
13 1.0 70
14 1.0 71
15 1.0 72
Tips
statsmodels uses endog and exog as names for the data, i.e. the observed variables that
are used in an estimation problem: endog is the dependent (response) variable and exog
holds the independent (explanatory) variables. For further details, please refer to
https://www.statsmodels.org/stable/endog_exog.html.
The first two arguments of the sm.OLS() function are endog (y) and exog (X_add_const).
Tips
Here, the results object has many useful attributes. For further details, please refer to
the official website of the statsmodels package.
Tips
R-squared values range from 0 to 1. The closer its value is to 1, the better the regression
line fits the data.
Tips
When conducting data science projects with statistical methods, it is not only necessary
to evaluate the model results, but also to test the underlying statistical assumptions.
Notes
In statistical analysis, all parametric tests make certain assumptions about the data. It’s
important to test these assumptions to ensure valid results. Taking linear regression as
an example, these assumptions include:
The first assumption is that a linear relationship exists between the dependent and
independent variables. This can be tested by calculating the F-statistic.
The second assumption is that there’s no autocorrelation in the residuals. This can be
tested using the Durbin-Watson statistic.
The third assumption is that the underlying residuals are normally distributed, or
approximately so. This can be tested using the Jarque-Bera test, a goodness-of-fit test
of normality.
Notes
In[17] results.f_pvalue
Out[17] 1.0909729585997406e-14
The F-test of overall significance indicates whether the regression model provides a
better fit to the data than a model that contains no independent variables.
Notes
A p-value less than the chosen significance level (e.g. α = 0.05) is statistically significant.
It indicates strong evidence against the null hypothesis: if the null hypothesis were true,
a result at least this extreme would occur less than 5% of the time. Hence, we reject the
null hypothesis in favor of the alternative hypothesis.
Tricks
In[18] sm.stats.stattools.durbin_watson(results.resid)
Out[18] 0.31538037486218456
The Durbin-Watson statistic is a test for autocorrelation in the residuals from regression
models. It always takes a value between 0 and 4. A value of 2 indicates that no
autocorrelation is detected in the sample; values below 2 point to positive autocorrelation
and values above 2 to negative autocorrelation.
Notes
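To see what the statistic measures, it can be computed by hand from the residuals. The sketch below uses NumPy only (np.polyfit stands in for the statsmodels fit, which is a substitution made here for self-containment) and the data printed in Out[8] and Out[9].

```python
import numpy as np

height = np.arange(58, 73)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])

# Residuals of the straight-line fit weight ~ const + height
slope, intercept = np.polyfit(height, weight, deg=1)
resid = weight - (intercept + slope * height)

# Durbin-Watson: sum of squared successive differences over the sum of squares
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)
```

The result is about 0.315, in line with Out[18]. Being far below 2, it suggests positively autocorrelated residuals.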
In[19] sm.stats.stattools.jarque_bera(results.resid)
Out[19] (1.6595730644310005,
0.43614237873238126,
0.7893583826332368,
2.5963042257390314)
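The four values in Out[19] are, in order, the Jarque-Bera statistic, its p-value, the skewness, and the kurtosis of the residuals. As a rough check, they can be recomputed with NumPy alone; the sketch below again uses np.polyfit in place of the statsmodels fit, which is an assumption made for self-containment.

```python
import math
import numpy as np

height = np.arange(58, 73)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])

slope, intercept = np.polyfit(height, weight, deg=1)
resid = weight - (intercept + slope * height)

n = resid.size
centered = resid - resid.mean()
m2 = np.mean(centered ** 2)
skew = np.mean(centered ** 3) / m2 ** 1.5     # sample skewness
kurtosis = np.mean(centered ** 4) / m2 ** 2   # sample kurtosis (3 for a normal)

# Jarque-Bera statistic and its p-value under a chi-square law with 2 degrees
# of freedom, whose survival function is exp(-x / 2)
jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
p_value = math.exp(-jb / 2.0)

print(jb, p_value, skew, kurtosis)
```

A p-value of about 0.44 is well above 0.05, so normality of the residuals is not rejected.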
In the statsmodels package, once a model has been fitted, calling predict() without arguments returns the fitted values.
Tips
In addition to the statistics (e.g. R-squared), we can also assess the goodness of fit
visually.
Tips
As can be seen from the above figure, the fit of the simple linear regression in this case
could be improved. Hence, we replace simple linear regression with polynomial
regression.
Notes
In the polynomial regression analysis, the feature matrix X consists of three parts: X, the
square of X, and the cube of X.
Tips
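One way to build such a feature matrix is np.column_stack(). This is a minimal sketch; the names X_poly and X_poly_const are illustrative, and the original cell may have constructed the matrix differently.

```python
import numpy as np

height = np.arange(58, 73, dtype=float)

# Columns: X, X squared, X cubed
X_poly = np.column_stack((height, height ** 2, height ** 3))

# Prepend the column of ones that represents the intercept
X_poly_const = np.column_stack((np.ones(len(height)), X_poly))

print(X_poly_const.shape)
print(X_poly_const[0])
```

The first row is 1, 58, 3364, 195112, which agrees with the first row of Out[23].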
In[23] X_add_const=sm.add_constant(X)
X_add_const
Out[23] array([[1.00000e+00, 5.80000e+01, 3.36400e+03, 1.95112e+05],
[1.00000e+00, 5.90000e+01, 3.48100e+03, 2.05379e+05],
[1.00000e+00, 6.00000e+01, 3.60000e+03, 2.16000e+05],
[1.00000e+00, 6.10000e+01, 3.72100e+03, 2.26981e+05],
[1.00000e+00, 6.20000e+01, 3.84400e+03, 2.38328e+05],
[1.00000e+00, 6.30000e+01, 3.96900e+03, 2.50047e+05],
[1.00000e+00, 6.40000e+01, 4.09600e+03, 2.62144e+05],
[1.00000e+00, 6.50000e+01, 4.22500e+03, 2.74625e+05],
[1.00000e+00, 6.60000e+01, 4.35600e+03, 2.87496e+05],
[1.00000e+00, 6.70000e+01, 4.48900e+03, 3.00763e+05],
[1.00000e+00, 6.80000e+01, 4.62400e+03, 3.14432e+05],
[1.00000e+00, 6.90000e+01, 4.76100e+03, 3.28509e+05],
[1.00000e+00, 7.00000e+01, 4.90000e+03, 3.43000e+05],
[1.00000e+00, 7.10000e+01, 5.04100e+03, 3.57911e+05],
[1.00000e+00, 7.20000e+01, 5.18400e+03, 3.73248e+05]])
Here, the purpose of calling the sm.add_constant() function is to add a column of ones
to the feature matrix, which represents the intercept term in the regression model.
Tips
In[28] plt.rcParams['font.family']="simHei"
plt.scatter(df_women["height"], df_women["weight"])
plt.plot(df_women["height"], y_predict_updated)
plt.title('Linear regression analysis of women weight and height')
plt.xlabel('height')
plt.ylabel('weight')
Out[28] Text(0, 0.5, 'weight')
We can apply the fitted model to predict new data. For instance, it can be used to
predict the weight of a woman who stands 63.5 inches tall.
Tips
The argument passed to the predict() method must have the same structure as the model’s
independent variables. We can view the docstring by typing results.predict? in IPython or Jupyter.
Notes
Q&A
In machine learning, the original dataset is usually split into three independent
subsets:
The training set is a subset used to train a model.
The test set is a subset used to test the trained model after training.
The validation set is a subset used to validate model performance during training,
especially to tune the hyperparameters and make model selection.
Notes
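A three-way split can be obtained by calling scikit-learn's train_test_split() twice. The sketch below uses synthetic data; the 60/20/20 proportions and the random_state value are arbitrary choices for illustration, not values from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ... then take 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

With 100 samples this yields 60 training, 20 validation, and 20 test samples.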
Tips
In[2] # to load data from the current working directory into a pandas DataFrame
The original data file, ‘bc_data.csv’, can be found in the learning resources associated
with this textbook.
Tips
The bc_data.head() method returns the first 5 rows. For further details, please refer to
[4.4 DataFrame].
Tips
In[3] print(bc_data.shape)
Out[3] (569, 32)
Tips
In[4] print(bc_data.columns)
Out[4] Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean',
'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se',
'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst'], dtype='object')
Tips
In[5] print(bc_data.describe())
Out[5] id radius_mean texture_mean perimeter_mean area_mean \
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000
fractal_dimension_worst
count 569.000000
mean 0.083946
std 0.018061
min 0.055040
25% 0.071460
50% 0.080040
75% 0.092080
max 0.207500
[8 rows x 31 columns]
Tips
Data wrangling is one of the crucial phases in data science projects. In this case,
it refers to the process of defining the feature matrix and target vector, as well as
splitting the dataset into the training set and test set.
Notes
Here, the ID column is not an independent variable, so we will remove it from the
data DataFrame and create a new feature matrix named X_data.
Tips
5 rows × 30 columns
Here,
axis=0 acts along the rows, i.e. on all the rows in each column;
axis=1 acts along the columns, i.e. on all the columns in each row.
Notes
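The effect of the axis argument can be seen on a small table. The values below are made up for illustration, and the name data is an assumption; only X_data and the id column are taken from the text.

```python
import pandas as pd

# Toy table with invented values
data = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [10.0, 12.0, 14.0],
    "texture_mean": [20.0, 22.0, 24.0],
})

X_data = data.drop("id", axis=1)   # axis=1: the label is looked up among the columns
col_means = X_data.mean(axis=0)    # axis=0: collapse the rows, one value per column
row_sums = X_data.sum(axis=1)      # axis=1: collapse the columns, one value per row

print(X_data.shape)
print(col_means)
print(row_sums)
```

Dropping with axis=1 removes a column, whereas the default axis=0 would search for a row labelled "id".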
In data science projects, the np.ravel() function can be used to flatten a column into a
one-dimensional target vector (array).
Notes
In[9] # to split the original data into training subsets and test subsets
Here, X_trainingSet is the feature matrix and y_trainingSet is the target vector of
the training set. Likewise, X_testSet is the feature matrix and y_testSet is the target
vector of the test set.
Tips
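The split itself can be sketched with scikit-learn's train_test_split(). As a stand-in for bc_data.csv, this example loads the copy of the Wisconsin breast cancer data bundled with scikit-learn, which also has 569 samples and 30 features but encodes the labels as 0/1 rather than 'M'/'B'. The test_size and random_state values are assumptions; the original cell's arguments are not shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for bc_data.csv: 569 samples, 30 features
X_data, y_data = load_breast_cancer(return_X_y=True)

X_trainingSet, X_testSet, y_trainingSet, y_testSet = train_test_split(
    X_data, y_data, test_size=0.25, random_state=1)

print(X_trainingSet.shape)
print(X_testSet.shape)
```

A 25% test share of 569 samples gives 143 test rows and 426 training rows, the shapes reported in Out[10] and Out[11].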
In[10] print(X_trainingSet.shape)
Out[10] (426, 30)
In[11] print(X_testSet.shape)
Out[11] (143, 30)
The first step is to select an appropriate algorithm. In this case we select KNN, so
KNeighborsClassifier is imported.
Tips
myModel = KNeighborsClassifier(algorithm='kd_tree')
The second step is to instantiate the machine learning algorithm and set its
hyperparameter, here algorithm='kd_tree'.
KNeighborsClassifier can use different algorithms (BallTree, KDTree, or brute
force) to find the nearest neighbors.
Tips
In[14] myModel.fit(X_trainingSet, y_trainingSet)
Out[14] KNeighborsClassifier(algorithm='kd_tree')
Here,
X_trainingSet is the feature matrix in the training set.
y_trainingSet is the target vector of the training set.
Tips
The trained model can be utilized to predict the labels for the test set.
Notes
y_predictSet = myModel.predict(X_testSet)
Tips
In[16] print(y_predictSet)
Out[16] ['M' 'M' 'B' 'M' 'M' 'M' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B'
'B' 'M' 'B' 'B' 'M' 'B' 'M' 'B' 'B' 'M' 'M' 'M' 'M' 'B' 'M' 'B' 'B' 'B'
'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'B' 'B'
'B' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'B' 'M' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B'
'M' 'M' 'B' 'M' 'B' 'B' 'B' 'M' 'B' 'M' 'B' 'M' 'B' 'B' 'M' 'B' 'M' 'B'
'B' 'M' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
'M' 'M' 'B' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'M'
'M' 'M' 'M' 'M' 'B' 'B' 'B' 'M' 'B' 'M' 'M' 'M' 'B' 'B' 'M' 'M' 'B']
In[17] # to print the labels in the test set and compare against the predicted labels
print(y_testSet)
Out[17] ['B' 'M' 'B' 'M' 'M' 'M' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B'
'B' 'M' 'B' 'B' 'M' 'B' 'M' 'B' 'B' 'M' 'M' 'M' 'M' 'B' 'M' 'M' 'B' 'B'
'M' 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'B' 'B'
'B' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B'
'M' 'M' 'B' 'M' 'M' 'M' 'B' 'M' 'B' 'M' 'B' 'M' 'B' 'B' 'M' 'B' 'M' 'B'
'B' 'M' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
'M' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'M'
'M' 'B' 'M' 'M' 'B' 'B' 'B' 'M' 'B' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'B']
Here,
y_testSet holds the true labels of the test set;
y_predictSet holds the predicted labels.
Tips
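Comparing the two arrays by eye is tedious; the share of matching labels can be computed with accuracy_score(). The sketch below is self-contained and therefore uses scikit-learn's bundled breast cancer data (labels 0/1) and an assumed random_state, so its score need not equal the one obtained from bc_data.csv.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_data, y_data = load_breast_cancer(return_X_y=True)
X_trainingSet, X_testSet, y_trainingSet, y_testSet = train_test_split(
    X_data, y_data, test_size=0.25, random_state=1)

myModel = KNeighborsClassifier(algorithm="kd_tree")   # n_neighbors defaults to 5
myModel.fit(X_trainingSet, y_trainingSet)
y_predictSet = myModel.predict(X_testSet)

# Fraction of test samples whose predicted label equals the true label
accuracy = accuracy_score(y_testSet, y_predictSet)
print(accuracy)
```

The same value is returned by myModel.score(X_testSet, y_testSet).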
We use the elbow method to select the optimal number of neighbors (k) for the KNN classifier.
Notes
In[19] # to create a for loop that trains various KNN models with different k values
scores
Out[19] [0.9230769230769231,
0.9020979020979021,
0.9230769230769231,
0.9440559440559441,
0.9370629370629371,
0.9230769230769231,
0.9300699300699301,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.9230769230769231,
0.916083916083916,
0.916083916083916,
0.916083916083916,
0.916083916083916,
0.916083916083916,
0.916083916083916,
0.9090909090909091]
We measure the accuracy score of the KNN model for each value of k generated by
range(1, 23), i.e. k = 1 to 22, and store the results in a list named “scores”. This
matches the 22 scores shown in Out[19].
Tips
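Such a loop might look as follows. This is a sketch under the same assumptions as before (scikit-learn's bundled data and an arbitrary random_state), so the individual scores will differ from Out[19]; the names k_values and best_k are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_data, y_data = load_breast_cancer(return_X_y=True)
X_trainingSet, X_testSet, y_trainingSet, y_testSet = train_test_split(
    X_data, y_data, test_size=0.25, random_state=1)

k_values = range(1, 23)   # k = 1, ..., 22
scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")
    model.fit(X_trainingSet, y_trainingSet)
    scores.append(model.score(X_testSet, y_testSet))   # accuracy on the test set

# Smallest k that reaches the highest accuracy
best_k = k_values[scores.index(max(scores))]
print(best_k, max(scores))
```

Strictly speaking, k would be tuned on a validation set rather than on the test set, as the note on dataset splitting points out.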
In[20] # to visualize the accuracy scores of the KNN models with k=1 to 22
Tips
In[21] # to retrain KNN model with the best K value(k=4) and calculate its accuracy score
Tips
Q&A
Natural Language Tool Kit (NLTK) and spaCy are two of the most popular English
Natural Language Processing (NLP) tools available in Python.
Notes
In this chapter, we will use NLTK (Natural Language Toolkit) to analyze the inaugural
speeches of the US presidents from 1789 to 2017 and compare the first speeches
of four presidents: Clinton, Bush, Obama, and Trump. The data consists of multiple
inaugural speeches collected from the inaugural corpus of NLTK.
Notes
If the output is “True”, the download has completed. If the download
is slow or fails, an alternative is to download the package directly
from GitHub (https://github.com/nltk/nltk_data) and place it in the
“nltk_data” directory.
Tips
After the corpus is downloaded successfully, we import the package directly without
executing nltk.download().
Tips
In[4] len(inaugural.fileids())
Out[4] 58
It is evident that there are 58 documents in the inaugural corpus. We first analyze all
the documents, and then select a few presidential inaugural speeches for in-depth
comparative analysis.
Tips
We fill in the DataFrame with the speech year, president name, combination of time
and name, speech text.
Notes
df_inaugural.head()
The first column holds the first four characters of the file name, i.e. the speech year.
The second column holds the president’s name, which is extracted from the
characters between the symbol “-” and the symbol “.” with a regular expression.
The third column holds the combination of year and name.
The fourth column holds the president’s speech text.
Tips
To fill in a DataFrame with speech year, president name, combination of time and
name, and speech text, you can use the DataFrame.apply() method. This method
allows you to apply a function to each row or column of the DataFrame.
Notes
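The extraction of year and name with apply() can be sketched on the four file names mentioned later in this section. The column names fileid, year, name, and year_name, and the helper extract_name(), are illustrative choices and not necessarily those of the original code.

```python
import re
import pandas as pd

# File ids following the "<year>-<name>.txt" pattern of the inaugural corpus
fileids = ["1993-Clinton.txt", "2001-Bush.txt", "2009-Obama.txt", "2017-Trump.txt"]
df = pd.DataFrame({"fileid": fileids})

def extract_name(fileid):
    """Return the characters between '-' and '.' in a file id."""
    return re.search(r"-(.+?)\.", fileid).group(1)

df["year"] = df["fileid"].apply(lambda f: f[:4])   # first four characters
df["name"] = df["fileid"].apply(extract_name)      # regular expression
df["year_name"] = df["year"] + df["name"]          # combination of both

print(df)
```

Here apply() is called on a single column (a Series), so the function receives one file id at a time.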
df_inaugural.head()
Out[7]
0  1789  Washington  1789 Washington  Fellow-Citizens of the Senate and of the House...  2  10  13
1  1793  Washington  1793 Washington  Fellow citizens, I am again called upon by the...  1  0  1
4  1805  Jefferson  1805 Jefferson  Proceeding, fellow citizens, to that qualifica...  1  22  8
“America”, “we”, and “you” are selected as keywords, and the number of times each of
these three words appears in each speech text is counted.
Tips
In[8] fig = plt.figure(figsize=(16,5))
plt.xticks(size = 8, rotation = 60)
plt.plot(df_inaugural['president'],df_inaugural['"America" count'],c='r',label='"America" count')
plt.plot(df_inaugural['president'],df_inaugural['"we" count'],c='g',label='"we" count')
plt.plot(df_inaugural['president'],df_inaugural['"you" count'],c='y',label='"you" count')
plt.legend()
plt.title("The number of times the three words 'America', 'we', and 'you' appear in the presidents' inaugural speeches")
plt.show()
Out[8]
The word “we” appeared most frequently, and “we” is the most common word in
Van Buren’s speech in 1837. The word “America” was used frequently by Trump
in 2017.
Tips
We count the number of words in each speech by splitting words with spaces.
Notes
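Counting by splitting on spaces can be sketched as below. The two sentences are invented sample texts, and counting a keyword with list.count() on lower-cased tokens is an assumption; the original counting code is not shown.

```python
import pandas as pd

# Two invented sample texts standing in for the speeches
df = pd.DataFrame({"speech text": [
    "We the people of America thank you",
    "You and we will work together",
]})

# Word count: number of pieces obtained by splitting on single spaces
df["word count"] = df["speech text"].apply(lambda text: len(text.split(" ")))

# Keyword count: occurrences of "we" among the lower-cased tokens
df['"we" count'] = df["speech text"].apply(
    lambda text: text.lower().split(" ").count("we"))

print(df[["word count", '"we" count']])
```

Splitting on spaces is a rough approximation: punctuation stays attached to words, so "we," would not be counted as "we".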
Out[9] year president president speech text “America” “we” “you” word
name count count count count
0  1789  Washington  1789Washington  Fellow-Citizens of the Senate and of the House...  2  10  13  1426
1  1793  Washington  1793Washington  Fellow citizens, I am again called upon by the...  1  0  1  135
2  1797  Adams  1797Adams  When it was first perceived, in early times, t...  8  23  1  2306
3  1801  Jefferson  1801Jefferson  Friends and Fellow Citizens:\n\nCalled upon to...  0  18  14  1725
4  1805  Jefferson  1805Jefferson  Proceeding, fellow citizens, to that qualifica...  1  22  8  2153
We also plot a bar chart of the total number of words in each speech text.
Notes
The sent_tokenize() function in the nltk.tokenize module can be used to split a text
into sentences.
Notes
Out[11] year president president speech text “America” “we” “you” word sentence
name count count count count count
0  1789  Washington  1789 Washington  Fellow-Citizens of the Senate and of the House...  2  10  13  1426  23
1  1793  Washington  1793 Washington  Fellow citizens, I am again called upon by the...  1  0  1  135  4
2  1797  Adams  1797 Adams  When it was first perceived, in early times, t...  8  23  1  2306  37
3  1801  Jefferson  1801 Jefferson  Friends and Fellow Citizens:\n\nCalled upon to...  0  18  14  1725  41
4  1805  Jefferson  1805 Jefferson  Proceeding, fellow citizens, to that qualifica...  1  22  8  2153  45
It can be seen that Van Buren’s speech text had a relatively high number of words and
sentences in 1837, while Washington’s speech had a relatively low number in 1793.
Tips
Next, we choose the speeches of Trump, Obama, Bush, and Clinton for analysis.
Since some presidents were re-elected, we choose the speeches from their
first inaugurations: “2017-Trump.txt”, “2009-Obama.txt”, “2001-Bush.txt”, and
“1993-Clinton.txt”.
Notes
Out[13] year president president speech text “America” “we” “you” word sentence
name count count count count count
0  1993  Clinton  1993Clinton  My fellow citizens, today we celebrate the mys...  33  57  12  1583  81
1  2001  Bush  2001Bush  President Clinton, distinguished guests and my...  20  43  9  1580  97
2  2009  Obama  2009Obama  My fellow citizens:\n\nI stand here today humb...  15  75  18  2383  110
3  2017  Trump  2017Trump  Chief Justice Roberts, President Carter, Presi...  35  37  23  1425  90
Tips
Tips
5.3.5 Tokenization
Stopwords are words that are extremely common in human language but carry
minimal meaning, such as “the”, “of”, and “to”.
Notes
NLTK includes a built-in list of English stop words, such as “a”, “an”, “the”, “of”, and “in”.
Tips
Some uninformative words (such as “us”) still remain after filtering out the stopwords, so we
add custom stopwords and remove them from the speech texts as well.
Tips
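The filtering and counting steps can be sketched without downloading any corpus. The stop word set below is a tiny hand-written stand-in for nltk.corpus.stopwords.words("english"), and both the sample sentence and the custom stop word set are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for the NLTK English stop word list
stop_words = {"the", "of", "and", "to", "a", "in", "is", "our", "we", "will"}
custom_stop_words = {"us"}   # hypothetical custom additions
all_stop_words = stop_words | custom_stop_words

# Invented sample sentence
text = "America will start winning again and the people of America will lead us"
tokens = text.lower().split()

# Keep only the tokens that are not stop words
filtered = [word for word in tokens if word not in all_stop_words]

# Five most frequent remaining words, as a pandas Series
freq_words = pd.Series(filtered).value_counts().head(5)
print(freq_words)
```

The resulting Series has the words as its index and their counts as its values, the same layout as Out[19].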
Tips
In[19] freq_words_2017Trump
Out[19] america 19
american 11
people 10
country 9
one 8
dtype: int64
The high frequency words in Trump’s speech are “america”, “american”, “people”,
“country”, “one”. The word “america” appears 19 times in his speech.
Tips
In[20] plt.figure(figsize=(16,16))
fig,ax = plt.subplots(2, 2, figsize=(10,6))
plt.subplots_adjust(wspace=1.0, hspace=0.3)
ax[0][0].barh(freq_words_1993Clinton.index, freq_words_1993Clinton, color='red',
alpha=0.3)
ax[0][0].set_title("High-frequency words in Clinton's inaugural speech in 1993")
ax[0][1].barh(freq_words_2001Bush.index, freq_words_2001Bush, color='green',
alpha=0.3)
ax[0][1].set_title("High-frequency words in Bush's inaugural speech in 2001")
ax[1][0].barh(freq_words_2009Obama.index, freq_words_2009Obama, color='yellow',
alpha=0.3)
ax[1][0].set_title("High-frequency words in Obama's inaugural speech in 2009")
ax[1][1].barh(freq_words_2017Trump.index, freq_words_2017Trump, color='teal',
alpha=0.3)
ax[1][1].set_title("High-frequency words in Trump's inaugural speech in 2017")
plt.show()
Out[20] <Figure size 1152x1152 with 0 Axes>
The horizontal bar charts of high frequency words are drawn with matplotlib.
Tips
Tips
Finally, we import the wordcloud package to generate word clouds for the speeches of
the four presidents, respectively.
Notes
word_cloud.generate(speech_2001Bush)
plt.subplots(figsize=(8,5))
plt.imshow(word_cloud)
plt.axis('off')
plt.title("Word cloud of Bush's inaugural speech in 2001")
In[23] word_cloud.generate(speech_2009Obama)
plt.subplots(figsize=(8,5))
plt.imshow(word_cloud)
plt.axis('off')
plt.title("Word cloud of Obama's inaugural speech in 2009")
Out[23] Text(0.5, 1.0, "Word cloud of Obama’s inaugural speech in 2009")
In[24] word_cloud.generate(speech_2017Trump)
plt.subplots(figsize=(8,5))
plt.imshow(word_cloud)
plt.axis('off')
plt.title("Word cloud of Trump's inaugural speech in 2017")
Out[24] Text(0.5, 1.0, “Word cloud of Trump’s inaugural speech in 2017”)
Q&A
Here, the module import name (cv2) differs from the package name (opencv-python).
cv2 (the old interface in earlier OpenCV versions was named cv) is the name that
OpenCV developers chose when they created the binding generators.
Notes
import cv2
To load the image file “test.jpg” into the image object named “image,” you can use the
imread() method from the “opencv-python” package.
Notes
In[2] # to load (read) the image file from the current working directory
image = cv2.imread("test.jpg")
The image file “test.jpg” is available in the learning resources for this textbook.
Tips
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
The cv2.waitKey() function keeps a window on display for a given number of milliseconds
or until a key is pressed.
Here, cv2.waitKey(0) means that the window is displayed indefinitely until the user
presses a key.
Tips
faceCascade=cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
A range of Haar cascade XML files is provided with OpenCV, each of which holds
the Haar features for a different kind of object. In this data science project, we employ a
predefined Haar cascade XML file (haarcascade_frontalface_default.xml) to
detect frontal faces in an image. You can access the list of Haar cascade XML files
from this link: https://github.com/opencv/opencv/tree/master/data/haarcascades.
Tips
In[5] faces=faceCascade.detectMultiScale(gray
,scaleFactor=1.1
,minNeighbors=5
,minSize=(30,30))
Out[5] 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\cv2\\data\\'
We call the cv2.imshow() function to display the image with the detected face and the
added rectangle border.
Notes
To write an image with rectangles according to the specified format in the current
working directory using OpenCV-Python, you can use the cv2.imwrite() function.
Notes
cv2.imwrite("test.png",image)
Out[6] True
Exercises
[7] According to the generation mode of individual learners, ensemble learning can be roughly divided
into two categories. One is a parallelization method that can be generated simultaneously without
strong dependency between individual learners. The representative of this method is ( )
A. boosting
B. bagging
C. decision tree
D. reboot
[8] Which of the following algorithms has no corresponding API in sklearn?
A. Support vector machine
B. K nearest neighbor classification
C. Gauss naive Bayes
D. Bayes
[9] Which of the following is not an evaluation indicator of the classification model?
A. Accuracy rate
B. Recall rate
C. Mean square error
D. ROC curve
[10] Which of the following statements about the arguments of the train_test_split function is false?
A. Test size represents the size of the test set.
B. Train size represents the size of the training set.
C. Random state represents random seed number, which by default is 1.
D. shuffle represents whether to sample with or without replacement.
[11] Which of the following is not a method of the sklearn converter?
A. fit
B. transform
C. fit transform
D. transform fit
[12] Which of the following is a package for Chinese natural language processing in Python?
A. NLTK
B. spaCy
C. Jieba
[13] Which of the following functions can be used to customize vocabulary?
A. pynlpir.AddUserWord()
B. nlpir.AddUserWord()
C. pynlpir.get_key_words()
D. pynlpir.nlpir.AddUserWord()
[14] What are the features of text corpus?
A. Word count in text
B. Vector annotation of words
C. Part of speech tag
D. Basic dependency grammar
E. All of the above
[15] Which of the following indicators can be used to calculate the distance between two word vectors?
A. Lemmatization
B. Euclidean distance
C. N-grams
2. Books
[1]. VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O’Reilly Media, Inc.
[2]. McKinney, W. (2012). Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.
[3]. Kirk, M. (2017). Thoughtful machine learning with Python: A test-driven approach. O’Reilly Media, Inc.
[4]. Ramalho, L. (2015). Fluent Python: Clear, concise, and effective programming. O’Reilly Media, Inc.
[5]. Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O’Reilly Media, Inc.
[6]. Grus, J. (2019). Data science from scratch: First principles with Python. O’Reilly Media, Inc.
[7]. Lutz, M. (2013). Learning Python: Powerful object-oriented programming. O’Reilly Media, Inc.
[8]. Matthes, E. (2019). Python crash course: A hands-on, project-based introduction to programming. No Starch Press.
3. Python Packages
[1]. Data Wrangling: Pandas, NumPy, SciPy
[2]. Data Visualization: Matplotlib, Seaborn, Bokeh, Basemap, Plotly, NetworkX
[3]. Machine Learning: Scikit-Learn, PyTorch, TensorFlow, Theano, Keras
[4]. Statistical Analysis: Statsmodels
[5]. Natural Language Processing: Natural Language Toolkit (NLTK), Gensim, CoreNLP, spaCy, TextBlob, PyNLPl
[6]. Web Scraping: Scrapy, Beautiful Soup, Requests, Urllib
[7]. Image Processing: OpenCV, Scikit-Image, Mahotas, SimpleITK, Pillow
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 343
C. Borjigin, Python Data Science, https://doi.org/10.1007/978-981-19-7702-2
Appendix II Answers to Chapter Exercises
Chapter I Python and Data Science
1.B 2.A 3.B 4.B 5.D
6.C 7.D 8.B 9.D 10.C