Python and Excel
Datasets can be found on Kaggle, Quandl, or Google, or on code repositories (GitHub, Bitbucket, GitLab, etc.).
# Import `os`
import os

# Change directory
os.chdir("/path/to/your/folder")
# Install virtualenv
$ pip install virtualenv

# Create a virtual environment `venv`
$ virtualenv venv

# Activate `venv`
$ source venv/bin/activate

# Deactivate `venv`
$ deactivate
openpyxl, the recommended package for reading and writing Excel 2010 files (i.e. .xlsx). Documentation.
xlsxwriter, an alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (i.e. .xlsx). Documentation.
xlrd, a package for reading data and formatting information from older Excel files (i.e. .xls). Documentation.
xlwt, a package for writing data and formatting information to older Excel files (i.e. .xls). Documentation.
xlutils, a package that collects utilities requiring both xlrd and xlwt, including the ability to copy, modify, or filter existing Excel files. Documentation.
# Load the spreadsheet (assumes `pandas` is imported as `pd` and `file` holds the file path)
xl = pd.ExcelFile(file)
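From this ExcelFile object we can, for example, list the sheet names and parse one sheet into a DataFrame; a minimal sketch, assuming the workbook contains a sheet called 'Sheet1':

# Print the sheet names
print(xl.sheet_names)

# Load a sheet into a DataFrame by name
df1 = xl.parse('Sheet1')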
CSV
# Import pandas
import pandas as pd
# Load csv
df = pd.read_csv("example.csv")
The read_csv() function takes many more options to fine-tune how the file is read.
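For instance, a sketch of reading a semicolon-delimited file, treating the first row as the header and the first column as the index (the delimiter and layout here are just assumptions about the file):

# Read a ;-delimited file, using row 0 as the header and column 0 as the index
df = pd.read_csv("example.csv", sep=";", header=0, index_col=0)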
Pass the writer variable to the to_excel() function and specify the sheet name to add a sheet with the data to an existing workbook: with an ExcelWriter you can save multiple, (slightly) different DataFrames to one workbook.
# Specify a writer
writer = pd.ExcelWriter('example.xlsx')
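As a sketch, assuming df1 and df2 are existing DataFrames, each to_excel() call then adds one sheet to the same workbook:

# Write each DataFrame to a separate sheet of the same workbook
df1.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')

# Write the file to disk
writer.close()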
To save just one DataFrame to a file, you don't need to specify the engine argument in the pd.ExcelWriter() function; the rest of the steps stay the same.
The general advice for installing this package is to do it in a Python virtual environment without system packages. We can use virtualenv to create isolated Python environments: it creates a folder that contains all the necessary executables to use the packages that a Python project needs.
Go to the directory in which the project is located and re-activate the virtual
environment venv.
Then proceed to install openpyxl with pip to make sure that we can read and write
files with it:
# Activate virtualenv
$ source venv/bin/activate
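With the environment active, the install itself is a single pip command:

# Install `openpyxl`
$ pip install openpyxl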
# Check `anotherSheet`
anotherSheet
# Import the column utilities
from openpyxl.utils import get_column_letter, column_index_from_string

# Return 'A'
get_column_letter(1)

# Return 1
column_index_from_string('A')
Print the coordinate and the value that is contained within that cell.
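A minimal sketch, assuming sheet is a worksheet obtained from a workbook loaded with openpyxl's load_workbook() and that the range A1:C3 exists:

# Print the coordinate and the value of every cell in the range
for row in sheet['A1:C3']:
    for cell in row:
        print(cell.coordinate, cell.value)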
Pandas DataFrames.
# Import `pandas`
import pandas as pd

# Import `Workbook` and `dataframe_to_rows`
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Initialize a workbook
wb = Workbook()
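From here, a sketch of writing an existing DataFrame df to the new workbook with dataframe_to_rows() (the output file name is just an example):

# Select the active sheet of the workbook
ws = wb.active

# Append the DataFrame rows, with the header but without the index
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)

# Save the workbook
wb.save("pandas_openpyxl.xlsx")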
# Import `xlrd`
import xlrd
# Open a workbook
workbook = xlrd.open_workbook('example.xls')
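From the xlrd workbook we can then pick a sheet and read cell values; a short sketch, assuming the file has at least one sheet with data in its first cell:

# Select the first sheet of the workbook
sheet = workbook.sheet_by_index(0)

# Print the number of rows and the value of the first cell
print(sheet.nrows)
print(sheet.cell_value(0, 0))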
# Import `xlwt`
import xlwt

# Initialize a workbook
book = xlwt.Workbook(encoding="utf-8")

# Add a sheet to the workbook
sheet1 = book.add_sheet("Sheet 1")

# The data
cols = ["A", "B", "C", "D", "E"]
txt = [0, 1, 2, 3, 4]

# Loop over the rows and columns and fill in the values
for num in range(5):
    row = sheet1.row(num)
    for index, col in enumerate(cols):
        value = txt[index] + num
        row.write(index, value)

# Save the result
book.save("example.xls")
Documentation.
pyexcel can get the data of a spreadsheet in an array.
# Import `pyexcel`
import pyexcel
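For example, a sketch of reading a whole .xls file into an array (the file name is just an assumption):

# Get an array from the data
my_array = pyexcel.get_array(file_name="example.xls")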
It can also return the data in an ordered dictionary of lists, or even all of the workbook's sheets in a single dictionary. These arrays and dictionaries make it easy to create DataFrames of the data with pd.DataFrame().
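A short sketch of both steps, assuming the same example.xls file and that pandas is imported as pd:

# Get the data in an ordered dictionary of lists
my_dict = pyexcel.get_dict(file_name="example.xls")

# Get all the sheets of the workbook in a single dictionary
book_dict = pyexcel.get_book_dict(file_name="example.xls")

# Build a DataFrame from the dictionary
df = pd.DataFrame(my_dict)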
# Delimiter to use in the destination file
dest_delimiter = ";"
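This delimiter can then be passed to pyexcel's save_as() when exporting an array to a delimited text file; a sketch with made-up data and an assumed file name:

# The data to export
data = [[1, 2, 3], [4, 5, 6]]

# Save the array to a ;-delimited file
pyexcel.save_as(array=data,
                dest_file_name="array_data.csv",
                dest_delimiter=dest_delimiter)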
Dictionary.
# The data
two_d_array_dictionary = {'Sheet 1': [
                              ['ID', 'AGE', 'SCORE'],
                              [1, 22, 5],
                              [2, 15, 6],
                              [3, 28, 9]
                          ],
                          'Sheet 2': [
                              ['X', 'Y', 'Z'],
                              [1, 2, 3],
                              [4, 5, 6],
                              [7, 8, 9]
                          ],
                          'Sheet 3': [
                              ['M', 'N', 'O', 'P'],
                              [10, 11, 12, 13],
                              [14, 15, 16, 17],
                              [18, 19, 20, 21]
                          ]}
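Such a dictionary can be written out as a multi-sheet workbook with pyexcel's save_book_as(); a sketch, with an assumed output file name:

# Save the dictionary to a workbook with one sheet per key
pyexcel.save_book_as(bookdict=two_d_array_dictionary,
                     dest_file_name="2d_array_data.xls")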
NumPy
genfromtxt() allows us to load the data contained in .csv files into arrays, which we can then put into DataFrames.
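A minimal sketch, assuming example.csv is comma-delimited with a header row:

# Import `numpy` and `pandas`
import numpy as np
import pandas as pd

# Load the csv file into a structured array, taking column names from the header row
my_array = np.genfromtxt("example.csv", delimiter=",", names=True, dtype=None, encoding="utf-8")

# Put the array into a DataFrame
df = pd.DataFrame(my_array)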