Python Tutorial: Tarfile Module
Contributor(s):
Last Revised: 2008.12.12
1. Introduction
"Tar" is an archiving format that has become rather popular in the open
source world. In essence, it takes several files and bundles them into one
file. Originally, the tar format was made for tape archives, hence the name;
today it is often used for distributing source code or for making backups of
data. Most Linux distributions have tools in the standard installation for
creating and unpacking tar files.
Python's standard library comes with a module which makes creating and
extracting tar files very simple. Examples of when individuals might want such
functionality include programming a custom backup script or a script to create
a snapshot of other personal projects.
Background Reading
There is significant documentation of both tar files and Python's tarfile
module. In addition to this document, the following resources are recommended
reading:
Wikipedia: tar file
Python Library Reference 12.5: tarfile
2. Tutorial
This is a basic tutorial designed to teach three things: how to add files
to an archive, how to retrieve information on files in the archive, and how to
extract files from the archive.
Adding Files
To begin, import the tarfile module. Then, create what is called a
"TarFile Object". This is an object with special functions for interacting with
the tar file. In this case, we are opening the file "archive.tar.gz". Note that
the mode is "w:gz", which opens the file for writing and with gzip compression.
As usual, "w" does not preserve the previous contents of the file. If the
archive already exists, use "a" to append files to the end of it (n.b.: you
cannot append to a compressed archive; there is no such mode as "a:gz").
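A minimal sketch of opening archives in the two modes described above. The scratch directory and the file names are assumptions made for the demonstration, not part of the original tutorial.

```python
import os
import tarfile
import tempfile

# Scratch directory so the demo does not touch real files (an assumption).
workdir = tempfile.mkdtemp()

# "w:gz" creates (or truncates) the archive and compresses it with gzip.
gz_path = os.path.join(workdir, "archive.tar.gz")
tar = tarfile.open(gz_path, "w:gz")
tar.close()  # closing writes the end-of-archive marker

# "a" appends to an existing archive, but only to an uncompressed one.
plain_path = os.path.join(workdir, "archive.tar")
tarfile.open(plain_path, "w").close()
tar = tarfile.open(plain_path, "a")
tar.close()
```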
Files are added to the archive with the add() method. If you want the file to
have a different name in the archive, use the arcname argument.
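As a rough illustration, the sketch below adds one file under a different name. The file "text.txt" and the archive name "notes/text.txt" are made up for the example.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory (an assumption)

# Create a small file to archive.
source = os.path.join(workdir, "text.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

archive_path = os.path.join(workdir, "archive.tar.gz")
tar = tarfile.open(archive_path, "w:gz")
# Without arcname the member would be stored under the path of `source`;
# with arcname it is stored under the name given here.
tar.add(source, arcname="notes/text.txt")
tar.close()

# Read the archive back to see the stored name.
with tarfile.open(archive_path, "r:gz") as tar:
    names = tar.getnames()
```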
Adding directories works in the same way. Note that by default a directory
will be added recursively: every file and folder under it will be included.
This behavior can be changed by setting recursive to False.
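The difference between the two settings can be sketched as follows. The directory layout ("project/src/main.py") is hypothetical and exists only to make the contrast visible.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory (an assumption)

# Build a small tree: project/src/main.py
project = os.path.join(workdir, "project")
os.makedirs(os.path.join(project, "src"))
with open(os.path.join(project, "src", "main.py"), "w") as handle:
    handle.write("print('hi')\n")

# Default: the directory and everything beneath it are added.
full_path = os.path.join(workdir, "full.tar")
with tarfile.open(full_path, "w") as tar:
    tar.add(project, arcname="project")

# recursive=False: only the directory entry itself is added.
shallow_path = os.path.join(workdir, "shallow.tar")
with tarfile.open(shallow_path, "w") as tar:
    tar.add(project, arcname="project", recursive=False)

with tarfile.open(full_path) as tar:
    full_names = sorted(tar.getnames())
with tarfile.open(shallow_path) as tar:
    shallow_names = tar.getnames()
```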
File Information
The tarfile module includes the ability to retrieve information about the
individual contents of a tar file. Each item is accessed as a "TarInfo Object".
For example, getmembers() will return a list of all TarInfo objects in a tar
file:
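A minimal sketch of listing the members, assuming an archive that holds a single file named "text.txt" (the archive is built here so the example runs on its own):

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory (an assumption)

# Build a one-file archive to inspect.
source = os.path.join(workdir, "text.txt")
with open(source, "w") as handle:
    handle.write("hello\n")
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(source, arcname="text.txt")

# Open for reading and collect one TarInfo object per member.
tar = tarfile.open(archive_path, "r:gz")
members = tar.getmembers()
tar.close()

for member in members:
    print(member.name, member.size)
```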
Each TarInfo object has several attributes and methods associated with it.
Some examples are below; the full list is in the TarInfo section of the
tarfile library reference.
TarInfo information
>>> members[0].name
'text.txt'
>>> members[0].isfile()
True
Extracting Files
Extracting the contents is a very simple process. To extract the entire
tar file, simply use extractall(). This will extract the files to the current
working directory. Optionally, a path may be specified to have the tar extract
elsewhere.
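A short sketch of extracting into a chosen directory. The destination name "unpacked" is an assumption; only archives you trust should be extracted this way.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory (an assumption)

# Build a one-file archive to extract.
source = os.path.join(workdir, "text.txt")
with open(source, "w") as handle:
    handle.write("hello\n")
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(source, arcname="text.txt")

# Extract everything into `destination`; leaving out `path` would
# extract into the current working directory instead.
destination = os.path.join(workdir, "unpacked")
os.makedirs(destination)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=destination)

extracted = os.path.join(destination, "text.txt")
```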
You should be aware that there is at least one security concern to take
into account when extracting tar files. Namely, a tar can be designed to
overwrite files outside of the extraction directory (/etc/passwd, for
example) by storing members under absolute paths or paths containing "..".
Never extract a tar as the root user if you do not trust it.
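One possible precaution is to check every member name before extracting anything. The sketch below is an illustration under stated assumptions, not a complete defence: the helper names is_within() and checked_members() are hypothetical, and the check covers member paths only, not symbolic or hard links stored in the archive. Recent Python releases also accept a filter argument to extractall() (for example filter="data"), which is worth preferring where it is available.

```python
import io
import os
import tarfile
import tempfile


def is_within(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return os.path.commonpath([directory, target]) == directory


def checked_members(tar, destination):
    """Return all members, or raise ValueError if any would escape."""
    members = tar.getmembers()
    for member in members:
        if not is_within(destination, os.path.join(destination, member.name)):
            raise ValueError("unsafe path in archive: %s" % member.name)
    return members


# Demo: build an archive whose only member tries to climb out with "..".
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "hostile.tar")
payload = b"oops\n"
info = tarfile.TarInfo(name="../evil.txt")
info.size = len(payload)
with tarfile.open(archive_path, "w") as tar:
    tar.addfile(info, io.BytesIO(payload))

destination = os.path.join(workdir, "unpacked")
os.makedirs(destination)
try:
    with tarfile.open(archive_path) as tar:
        # The check runs before extractall() touches the disk.
        tar.extractall(path=destination,
                       members=checked_members(tar, destination))
    blocked = False
except ValueError:
    blocked = True
```

Because checked_members() validates the whole list before returning, nothing is written to disk when any single member fails the check.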
3. Examples
Archiving Select Files from a Directory
archiver.py
import os
import tarfile

tar = tarfile.open("archive.tar.gz", "w:gz")
# ... add the selected files here with tar.add() ...
tar.close()
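A minimal sketch of a complete script of this kind. The selection rule (a file-name extension), the function name archive_selected(), and the sample file names are all assumptions chosen for illustration; the original script may have selected files differently.

```python
import os
import tarfile
import tempfile


def archive_selected(directory, archive_path, extension=".txt"):
    """Add every regular file in `directory` ending in `extension`."""
    tar = tarfile.open(archive_path, "w:gz")
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and name.endswith(extension):
            # Store the bare file name rather than the full path.
            tar.add(full_path, arcname=name)
    tar.close()


# Demo in a scratch directory with three files, two of them selected.
workdir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.log"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(name + "\n")

archive_path = os.path.join(workdir, "selected.tar.gz")
archive_selected(workdir, archive_path)

with tarfile.open(archive_path, "r:gz") as tar:
    archived = tar.getnames()
```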
4. Extending
Removing Files
The tarfile module does not contain any function to remove an item from an
archive. It is presumed that this is because of the nature of tape drives,
which were not designed to move back and forth (consider this post to the
Python tutor mailing list). Nevertheless, other programs for creating tar
archives do have a delete feature.
The following code uses the popular GNU tar program that comes with most
Linux distributions. The "--delete" flag is described in the GNU tar manual;
note that the manual warns against using it on an actual tape drive. The
reliance on an external program obviously makes the code far less portable,
but it is suitable for personal scripts.
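A sketch of how such a call might look using the subprocess module. It assumes that a GNU tar executable named "tar" is on the PATH and that the archive is uncompressed; the helper names has_gnu_tar() and delete_member() are hypothetical. The demo runs only when GNU tar is detected.

```python
import os
import shutil
import subprocess
import tarfile
import tempfile


def has_gnu_tar(tar_program="tar"):
    """Return True if `tar_program` exists and identifies itself as GNU tar."""
    if shutil.which(tar_program) is None:
        return False
    result = subprocess.run([tar_program, "--version"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return "GNU tar" in result.stdout


def delete_member(archive_path, member_name, tar_program="tar"):
    """Remove one member from an uncompressed archive via GNU tar."""
    subprocess.run([tar_program, "--delete",
                    "--file=" + archive_path, member_name],
                   check=True)  # raises CalledProcessError on failure


remaining = None
if has_gnu_tar():
    workdir = tempfile.mkdtemp()  # scratch directory (an assumption)
    archive_path = os.path.join(workdir, "archive.tar")
    with tarfile.open(archive_path, "w", format=tarfile.GNU_FORMAT) as tar:
        for name in ("keep.txt", "drop.txt"):
            path = os.path.join(workdir, name)
            with open(path, "w") as handle:
                handle.write(name + "\n")
            tar.add(path, arcname=name)

    delete_member(archive_path, "drop.txt")

    with tarfile.open(archive_path) as tar:
        remaining = tar.getnames()
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.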