HW0: Intro To Python Problem 1: 'Hello World'
HW0: Intro To Python Problem 1: 'Hello World'
HW0: Intro To Python Problem 1: 'Hello World'
Problem 1
We’re going to use Google Colaboratory (’Colab’) in this course. Colab is a Python development environment that
runs in your browser using Google Cloud. This standardizes the programming format and eliminates the hassle of
getting everyone to install python on their computer. Colab provides a nice user interface for writing / executing
python code in files referred to as ’notebooks’.
print('hello world')
f) In a new code block, store 2 numbers as separate variables “x” and “y” and print the output of their addition.
x=5
y=7
print(x+y)
g) In the same code block, modify code to print the result of xˆy. In python, the symbol for the power operator
is ** not ˆ
Problem 2
Now let’s work with some real data in python.
a) Navigate to the Files tab on the leftmost part of Colab’s interface. You should see a folder named “sam-
ple data” available to you. We’re going to have our computer access the data in ‘mnist test.csv’ using python
and then manipulate it. Note, the Files tab in Colab is like the root node (i.e. main directory/folder) of your
computer, this is where python will look when tasked with finding a file.
b) In a new code block, let’s assign a variable that will designate the path to “mnist test.csv” relative to the
main directory by typing: filepath = ‘sample data/mnist test.csv’. This string, now stored in the variable
“filepath”, specifies the exact location of the file relative to the starting directory.
c) In the same code block, let’s now have our computer open mnist text.csv with python’s open() function.
The open() function accesses a file and returns it as a file “object” which can be used to read, write, and
modify the file. You can think of objects in python as fancy variables that store data as well as functions
that can specifically work with that data. Run the following command, where “r’ specifies accessing the file
in read-only mode.
filepath='sample_data/mnist_test.csv'
f = open(filepath,'r')
1
data = f.readlines()
e) Let’s see how big our list is (i.e. how many elements it contains i.e. how many lines of text are in the opened
file) by running the following command. Python executes commands sequentially from the inner most set of
parentheses to the outer. So first it is returning a number value of how many element are in the list ”data”
then printing this returned value to the screen. How many lines are in “mnist test.csv”?
print(len(data))
f) Output the first line of ’mnist test.csv’ by printing the first element of the list ’data’. You can select the
first element of a list using list indexing which is just the process of selecting elements not based on their
values but by their position in the list. Python recognizes you are passing along indices (aka element positions)
whenever you use brackets [] instead of parentheses () which usually specify function inputs (aka ”arguments”).
Importantly, Python uses ”zero-based” indexing which means it includes 0 as a valid position, so the first
element in data is data[0]
g) What is the length of the first line? What type of data is it? Use python’s built-in len() and type() functions.
Problem 3
The mnist database is a common dataset used in computer science (https://en.wikipedia.org/wiki/MNIST_
database) to train and test image processing software and machine learning models. The dataset contains images
of handwritten numbers (0-9), digitized/normalized into a 28x28 pixel array with each pixel’s grayscale intensity
represented as a number between 0 and 255. Here, we are working with the “test” set which contains 10,000 images.
a) Earlier, we printed the first line (i.e. image) of the dataset and saw it was a string of comma-separated values.
The mnist dataset is structured so that the first number in each line represents the digit that was actually
drawn and the remaining 784 numbers (28x28) represent the pixel intensities if you were to move left to right
and then top to bottom across the image. To convert this data into a list of values, use python’s split()
method and store the returned list into a variable “vals”
vals = data[0].split(',')
print(vals)
Here, we are iterating over each value in the list “vals” using a for-loop, assigning each value to the temporary
variable x which we then manipulate and store in a new list. The int() built-in function converts a string to
an integer if possible (you can try int(‘a’) but it will return an error). Overall, we define a new list (“ad-
justed vals”) containing the integer-ized values in “vals” so that python now recognizes the pixel intensities
as true numbers instead of text characters. This is important because computers are extremely literal. They
will not treat a value as a number unless it is declared as a number.
2
d) As mentioned, the first element in the list “adjusted vals” is the digit being drawn whereas the remaining
784 values in the list are the pixel intensities of the 28x28 image array. Separate these elements by assigning
the first element of “adjusted vals” to a new variable called “true val” and the remaining 784 elements to a
list called “pixel intensities”. This can be done by “slicing” the list by the element indexes i.e. specifying the
positions of the elements you want to copy to a new variable.
true_val = adjusted_vals[0]
pixel_intensities = adjusted_vals[1:]
The general format for indexing/slicing is [start:stop] where the [] specifies you’re speaking in terms of indexes,
“start” is the first index/position to be included and “stop” is the index at which to stop and is NOT included.
Yes, this is a somewhat annoying quirk of Python, that the start index is included but the stop index is not.
Here, [1:] means select the second element of the list through the last element b/c we didn’t specify a stop
index. Remember python is zero-indexing based. Overall, the list “adjusted vals”, which contains 785 items,
implicitly has indexes 0-784. If I were instead to specify [1:784], the last element of the list would have
been excluded because python includes everything up until the ‘stop’ index but not the stop index itself.
Alternative ways to correctly slice the list when assigning “pixel intensitites” would have been [1:785] or
[1:len(adjusted vals)].
e) Let’s put everything we’ve done so far in a for-loop so we can iterate over each line/image of the dataset.
We will have to create a new temporary variable “line” to hold the data for each line in data as we iterate
through it.
The break command within our for-loop interrupts the loop so, as currently written, the program will stop
after passing through the first line of data. This is intentional for now, we don’t need to iterate through the
full 10,000 line dataset.
f) Let’s format the list “pixel intensities” further to get it in the shape of a 28x28 array. We can do this by
making a nested list i.e. a list of elements where each element is a list itself. If you think about it, this is a
two-dimensional data structure, like a table or Excel sheet, where each element within the outer list represents
a row of values (i.e. the 28 pixel values moving left to right across the “columns” of the image). You can
pretty much store anything you want in a python list, including other lists. Within our for-loop, let’s add a
new line directly above the break command:
This is a fairly complicated line of code but we will break it down step by step. Overall, we are defining a new
list using list comprehension and assigning it to the variable “pixels by row”. On the left side, you can see
we are slicing the list “pixel intensities” to select indexes i through i+28. What is i? Well it is a temporary
variable assigned to each value specified by range(0,len(pixel intensities),28). Range() is a python function
that returns a sequence of numbers. It accepts three arguments: range(start,stop,step) which work exactly
how you’d expect: start specifies the number on which to start, stop specifies the number on which to end,
and step specifies the step that should be taken between each number. Of note, range() only works with whole
numbers which is fine here because list indexes are only ever integers. Also, like list slicing, the stop value
itself is not included in the returned sequence. So, overall, the command range(0,len(pixel intensities),28) is
returning numbers from 0 to 784 in incremental steps of 28 i.e. 0, 28, 56, 84. . . . 756. These numbers represent
the start of each new row within “pixel intensities” so pixel intensities[i:i+28] capture all the elements within
each row and assigns the sliced list as a new element within the list “pixels by row”. To aid in understanding,
output the result of this command by printing each element of pixels by row in a for-loop after assignment
(but before the break statement). What is the length of “pixels by row”? What is the length of each element
within pixels by row? Does this make sense? Explain.
3
for line in data:
vals = line.split(',')
adjusted_vals = [int(x) for x in vals]
true_val = adjusted_vals[0]
pixel_intensities=adjusted_vals[1:]
pixels_by_row = [pixel_intensities[i:i+28] for i in range(0,len(pixel_intensities),28)]
for row in pixels_by_row:
print(row)
break
g) Let’s plot our re-formatted data as a heatmap to better visualize. Again, this is all just for the first image/line
in the dataset so keep the break statement in the for-loop. To do so, we will need to import a plotting package
called “seaborn” which makes nice visuals (https://seaborn.pydata.org/). All we need to do in Google
Colab is, in a new code block above our current code, type “import seaborn as sns” and execute. Google
Colab already has the seaborn package pre-installed but by importing it (and storing in a variable call “sns”)
we are essentially loading it into our python session so it is ready to call/use. Now, we want to use the
pre-defined seaborn heatmap() function and input our two dimensional list of pixel intensities so type the
following code above the break statement:
sns.heatmap(pixels_by_row)
The heatmap() function accepts data in a two dimensional format (such as a list of lists) and plots it on
an x-y grid, where the color of each gridblock is the value specified i.e. the grayscale intensity for our given
dataset. What number does the image appear to contain? Does this match the true value? Now train a
convolutional neural network to predict handwritten digits based on pixel intensities as input. Only kidding...
Instead, reformat the heatmap in some way, using the whitepages for seaborn’s heatmap() function to guide
you (https://seaborn.pydata.org/generated/seaborn.heatmap.html)
h) To complete our code block working with the ‘mnist test.csv’ dataset, modify the start of your for-loop as
follows. Don’t change any of the code within the loop.
What did we just do? Well, we added a call to some built-in python function called enumerate(). This is a
handy-little function that returns two things: (1) the index of the element and (2) the element itself within
the passed list i.e. “data” in this case. We then assign the element index to the variable “l” and the element
itself to “line” just as before when we didn’t include the enumerate() function. These variables, “l” and “line”,
change as we iterate through “data”. What it allows us to do though is now easily keep track of each image in
the dataset and print the results of, let’s say, image #100, specifically. Do this by running the following code
in a new code block. Note, if you try copying and pasting you may run into issues with spacing so rewrite
yourself and think about what each line is doing along the way.
if l==99: # a conditional statement that will only execute the code inside of it if TRUE
vals = line.split(',')
adjusted_vals = [int(x) for x in vals]
true_val = adjusted_vals[0]
pixel_intensities = adjusted_vals[1:]
pixels_by_row = [pixel_intensities[i:i+28] for i in range(0,len(pixel_intensities),28)]
sns.heatmap(pixels_by_row)
#break
Here, we are iterating over all the line/images in “data” and, once we find image #100 (which would have
an index of 99), running our prepared code which outputs the relevant data in the form of a heatmap.
We’ve commented out the “break” statement so as to not interrupt the code after passing through the first
4
line/image of the dataset. This is, overall, an inefficient way to work with the contents of line/image #100
as we are iterating over every line in the dataset to do so but it highlights the usefulness of the enumerate()
function with for-loops to keep track of the different elements and their indices as well as the number of
iterations through the for-loop. As good practice, you may want to get in the habit of using “for i,item in
enumerate(list)” for all your for-loops. What is the output of the 100th image in the dataset? Print both the
true value and the heatmap representing the image.
Bonus: how long does it take for python to run the above code block? Import the time package and use
the time() function within it to answer the question, using a print statement to output the elapsed run time.
What’s the time to execute if we print the 400th image instead of the 100th? How about the 8011th? Explain.
Problem 4
You now know how to upload, import, and manipulate text files in python using built-in functions. For
tabulated data, another option is to use the pandas package which includes a special object for tabulated
data referred to as a pandas DataFrame. The pandas package and its DataFrame class have many useful
functions including one to load in csv/tsv/xlsx files so let’s do that with the same dataset in a new code
block.
a) First, we must load the pandas package which is pre-installed in google Colab. Do this with the command
“import pandas as pd”
b) In a new code block, we now load the data stored in “mnist test.csv” using the pandas function
pd.read csv()
data_df = pd.read_csv('sample_data/mnist_test.csv',header=None)
We need to pass the optional argument header=None because, by default, pandas assumes the first line
in the csv file are the column names which is not true for our given dataset. Print the new data variable.
Make sure the structure of the object makes sense to you before proceeding.
c) A pandas DataFrame is a custom object (“fancy variable”) that holds data in a tabulated format with
dedicated row and column names while having, at its disposal, a bunch of predefined functions for
manipulating said data. It takes some time to learn all the ways in which to navigate/manipulate
a DataFrame object but we will go through a couple and the rest are easily accessible by searching
online. This is common practice as you will find it’s easy to forget the commands to perform a specific
manipulation. Let’s start by separating the first column of the DataFrame, representing true values for
each of the images, from the pixel intensities.
true_vals_df = data_df[0]
print(true_vals_df)
What did we just do and what does it resemble from earlier? When no header is specified, pandas
automatically assigns the column index as the column name. Column names can be strings or numbers.
If, instead, the dataset came with a header and the name for the first column was “column 1” we could
have done: true vals df = data df[‘column 1’] and gotten the same result. What datatype is data df?
How about true vals df? Use the type() function to answer. A pandas series is just a one dimensional
DataFrame i.e. a single column of data. This datatype can easily be converted to a python list, if
desired, by the .tolist() method.
print(true_vals_df.tolist())
d) If you once again output data df you will notice that the first column of true values is still there, we
didn’t cut it out when we assigned it to true val df, we simply copied it. To instead “pop” out the
column, we can use the pandas method pop()
true_values_df = data_df.pop(0)
Here, we assigned the first column to true values df while also permanently altering the original “data df”
DataFrame to exclude the first column. This means that “data df” no longer contains the column 0
which you will see if you print(data df) in a new code block, If you try rerunning the same code block
that contains the .pop() command, you will get an error since it will once again try to pop out column
5
0 from data df when it no longer contains the column. This is something to be aware of as you start to
manipulate DataFrames: the difference between functions that simply make a copy of the DataFrame or
permanently change it. In the latter case, the sequential order of commands and number of times you
execute a given block of code really matter.
e) Let’s convert true vals df to a standard python list by using the .to list() method that is available to
pandas Series in a new code block. Print the length of the list. Print the maximum and minimum values
in the list with the built-in min() and max() functions. Print the # of “3”s in the dataset by using the
.count() method available to python lists (but not Dataframes since that’s a different type of object).
num_occurences = true_vals_list.count(3)
Note that Pandas loaded in our numerical data as integers instead of strings so we don’t have to go
about converting the values between datatypes as before which is why the above command is .count(3)
and not .count(’3’).
f) Since there are 10,000 images in this dataset of hand-drawn, single-digit numbers (0-9), we’d expect
around 1,000 images for each number if the dataset is well balanced. Is the dataset well balanced?
Check the # of occurrences for each of the 10 digits, saving the counts of each digit in a single list called
“group vals”.
group_vals = []
for num in range(0,10):
group_vals.append(true_vals_list.count(num))
print(group_vals)
g) To better visualize the evenness of the dataset, let’s create a barplot of the # of occurrences of each
digit. Another plotting package commonly used in python is Matplotlib. To load the package into your
active session, type the following in a new code block:
import matplotlib.pyplot as plt
Here, we are importing a specific piece (“submodule”) of the matplotlib package and saving it in a variable
named “plt”. The pyplot submodule contains a bunch of functions for plotting including a function
to create bar plots (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html).
The bar() function requires two lists as input, the first specifying the name of each group and the second
specifying the bar heights i.e. the value associated with each group, in our case the # of images in the
dataset corresponding to a given digit. You should already have created a list of the latter. For the list
of group names, simply create a new array of digits 0-9 storing as “group names”. Next, to create our
bar plot, run:
plt.figure()
plt.bar(group_names,group_vals)
plt.show()
plt.figure() creates a figure object which we can then ’fill’, manipulate, and visualize. plt.show() reveals
the figure object. Everything in-between modifies the created figure object. To add labels to the x and
y-axes, below plt.bar() but above plt.show() add the lines:
plt.xlabel('Digit')
plt.ylabel('# of occurrences')
Also add a title:
plt.title('mnist_test.csv')
Now, rerun code block and observe the changes. To change the size of the figure, you can add an
argument to the plt.figure() command as follows:
plt.figure(figsize=(20,8))
The array (20,8) specifies the size of the figures in terms of width (x dimension) and height (y-dimension).
Change the values once more to get a better sense of this. Lastly, adjust the font size of the text within
the figure by searching online for the correct command. Disclaimer, there is more than one way to do
this.
6
h) Save the final, formatted figure with the function plt.savefig(). The method accepts a few important
arguments such as the filepath at which the figure should be saved and the format of the image file. Note,
the plt.savefig() command should also be placed above plt.show() as the latter closes/clears
the file object after execution. Let’s save our image in the “sample data” folder where the mnist test
dataset resides:
plt.savefig('sample_data/final_image.png',format='png')
Here, we chose to export the image as a PNG file but plt.savefig() can also export images as a jpeg, pdf,
svg, etc. I included “.png” in the filename so that the format is obvious but this is not necessary as
we also specify the image format with the ’format=’ argument. After executing, check the sample data
folder in colab. It may take a few seconds for the image to appear. Once it does, download it to your
computer and open it to verify that everything worked. Congrats! You’ve now analyzed data in python
and summarized your findings in a concise visual.
Problem 5
Let’s start working with sequence data as is common in computational genomics.
a) Download the file ‘seq1.txt’ from canvas then upload to colab. Read the text file into your open session
using the open() command. Save the txt to a variable called “lines” using the readlines() command.
How many lines are in the file? You may notice the first line in your list ends with \n which is the
computer symbol for “next line”. To omit this character from your imported data, you could do:
f = open('seq1.txt','r')
lines = [x.rstrip() for x in f.readlines()]
f.close()
We know readlines() returns a list where each element is the contents of one line from the text file
object. What we’re doing in the code above is using list comprehension to create a new list containing
each element returned by f.readlines() after modifying the element with the .rstrip() command which is a
built-in python function that removes the last (‘trailing’) character from a string. If given no argument,
the function defaults to removing these \n characters, if present. Also, we include f.close() as convention
since it’s good practice to close the file object after using it. Save the DNA sequence in line 2 to a
variable named “seq1”.
b) This .txt file you read into python, now stored in “lines”, contains information on one piece of dna in
FASTA format which is a standardized file format for storing DNA/RNA sequences. In FASTA format,
sequences take up two lines, with the first line marked by a ”>” containing the sequence information
and the second line, directly below it, containing the nucleic acid sequence itself in the 5’ to 3’ direction.
Text files following FASTA format are often given the extension .fasta or .fna or .fa instead of .txt but
don’t be fooled, it is still a text-based file and can be opened in python the same exact way as you’ve
opened .txt files. Below is an example of a 20mer sequence with ID “example seq” in FASTA format
>example seq
AGCGTTAGCGATTAGCGAAT
What is the length of “seq1” which you stored earlier as a variable? What is the GC content of the
sequence i.e. what is the percentage of bases which are either guanine (G) or cytosine (C)? Use the
count() command here and print the result as a percentage. Note, ”C” is not equal to ”c” in python,
the capitalization of the text matters.
c) Let’s say we wanted to turn this DNA sequence into RNA. RNA contains uracil (U) in place of thymines
(T). Let’s write our own function in python to achieve this for any given sequence and then run the
function on “seq1”. Functions are useful if we want to perform the same manipulation multiple times
throughout our code.
7
def DNA_to_RNA(input_sequence):
DNA_seq = input_sequence.upper() # make input seq uppercase
RNA_seq = '' #an empty string
for char in DNA_seq:
if char=='T':
RNA_seq=RNA_seq+'U'
else:
RNA_seq=RNA_seq+char
return(RNA_seq)
Once the function is defined, it is easy to run it on “seq1”.
DNA_to_RNA(seq1)
Save the returned value of the function to the variable ‘seq1 RNA’ and print. For direct comparison to
the input, also print seq1 and verify the function is running as intended.
d) Another common manipulation we perform on DNA sequences is to generate its reverse complement
since DNA is often double-stranded in nature. Write a function to generate the reverse complement of
any given DNA sequence then run the function on ‘seq1’ and save the output to a new variable ‘seq1 rc’.
Print this variable. NOTE: to loop backwards over a string in python you can run:
x = 'string'
for char in x[::-1]:
print(char)
Running the above code will output:
g
n
i
r
t
s
The code x[::-1] is a case of indexing in which we don’t specify the start or stop position so the full string
is included but we do specify a ‘step’ of -1 which python interprets as moving backwards through the
string
e) Now, let’s save our reverse complement sequence to a new file “seq1 rc.fasta’. To do so, we must create
the new file in python and then ‘write’ text into it.
f = open('sample_data/seq1_rc.fasta','w') # creates python file object & file itself
f.close()
Checking your sample data directory, you should now see ‘seq1 rc.fasta’. You can open txt files directly
in colab by double clicking on them. In this case, you should see the file we created is empty since we
didn’t write any text to it. So, you can create a text-based file with the open() command in python
simply by creating a file object with the ‘w’ argument, meaning you intend to write to it. Importantly,
if you already had a file “seq1 rc.fasta” in sample data when you executed the above code, it would have
been deleted and replaced with an empty text file. This is an exercise to point out that you need to be
careful when creating/writing files. If, instead, you had specified “r” for read or “a” for append as an
argument in open(), the file is not replaced but rather just read or modified while preserving all original
content.
Now, in a new code block, let’s again create the file ‘sample data/seq1 rc.fasta’ but this time store text
within it.
f = open('sample_data/seq1_rc.fasta','w')
f.write('test')
f.close()
When you reopen ‘seq1 rc.fasta’ you should now see it includes one line of text reading ‘test’. If we
wanted to add another piece of text to the file, we can in python with the command:
8
f = open('sample_data/seq1_rc.fasta','a')
f.write('hello')
f.close()
The ‘a’ argument tells python we want to add additional text to our file while keeping the original
content. If you reopen seq1 rc.fasta you should see the addition of “hello” on the same line as the
original text. If we want to add a new line of text we need to include the new line character \n to
whatever string we write.
f = open('sample_data/seq1_rc.fasta','a')
f.write('\ndemo')
f.close()
You now know how to write files in python. One last thing for clarity, when you create a new file object
with open() in write-mode (i.e. using the ‘w’ argument), all the lines you write to it before closing the
object will be included in the output file.
f = open('sample_data/example.txt','w')
f.write('line1')
f.write('\nline2')
f.close()
Here, both “line1” and “line2” are saved in example.txt. If ‘example.txt’ existed beforehand and con-
tained 10 lines of text, all those lines are now gone since the file was essentially deleted and remade when
the open() command was executed.
With all this newfound knowledge, please create a file called ‘seq1 rc.fa’ which stores the reverse com-
plement of seq1 in FASTA format.
f) The Biopython package is an open source collection of tools/functions for working with biological data
such as sequence data stored within FASTA files. The bioinformatics package is not preinstalled in
google colab but installation in simple. In a new code block, run:
!pip install biopython
Now that it is installed, lets load the Sequence submodule from Biopython into our active session (in a
new code block):
from Bio.Seq import Seq
The Bio.Seq module provides tools to work with biological sequences such as the Sequence object for
storing/manipulating sequences (https://biopython.org/docs/1.75/api/Bio.Seq.html)
Let’s store ‘seq1’, from earlier, as a Sequence object in a new code block.
my_seq = Seq(seq1)
What is the datatype of ‘seq1’ ? What about ‘my seq’ ?
Sequence objects come with built-in methods such as reverse complement() and transcribe() that perform
common manipulations on the stored sequence and return the result.
my_seq.transcribe()
my_seq.reverse_complement()
What is the output from each of these commands? Do they match our own functions from earlier?
g) Biopython also contains a module (“pairwise2”) for aligning sequences to one another (https://biopython.
org/docs/1.75/api/Bio.pairwise2.html). This module is based on the Needleman-Wunsch dynamic
programming algorithm. Load this module into your session with the command:
from Bio import pairwise2
Create the following strings:
seqX = 'AACGT'
seqY = 'ACT'
Now align the two sequences and store output in variable “local alignment’:
local_alignment = pairwise2.align.localms(seqX,seqY,1,-1,-4,-1,one_alignment_only=True)
9
We specify the argument one alignment=True so that the function only returns the top scoring alignment
based on the algorithm scoring parameters (here, in order, matching bases = 1, mismatching bases =
-1, gap opening = -4, gap extension = -1). Otherwise, pairwise2.align() returns a list of alignments that
exceed a predetermined score threshold in order of highest to lowest score. The output of the function
is a list of alignment objects which you can print to examine. To better visualize, however, run:
print(local_alignment[0][0]) #aligned seqX
print(local_alignment[0][1]) #aligned seqY
print(local_alignment[0][2]) # alignment score
Now repeat the process but this time performing a “global” alignment with the same scoring criteria
global_alignment = pairwise2.align.globalms(seqX,seqY,1,-1,-4,-1,one_alignment_only=True)
Print the output alignments and the alignment score. Based on the results from this exercise, what
appears to be the difference between local and global alignments? What would the score be if, instead,
the global alignment was:
AACGT
-AC-T
10