Squeeze Your Data With Python

Python’s powerful solutions for archiving and zipping Files

Squeeze Your Data With Python
Compressing data with Python’s zipfile and tarfile. Image generated by Midjourney, prompt by author.

Archiving and compressing files are essential tasks in the digital world, enabling efficient storage and transfer of large amounts of data.

This article will explore how Python simplifies these processes with its built-in libraries, zipfile and tarfile. Furthermore, we will compare the compression methods offered by both libraries to understand their strengths and weaknesses.

By the end of this tutorial, you'll be able to create, extract, and manipulate archive files in various formats using Python while also being well-informed on the choice of compression methods.

All the examples are available in this GitHub library.


Harnessing Python’s zipfile library for compression

Python’s built-in zip file library allows you to work with ZIP archives seamlessly. This section will discuss using this library to create, extract, add, and read metadata from ZIP files.

Reading metadata from a ZIP archive

First, we will read the metadata from an existing ZIP archive. Reading metadata from a ZIP archive involves using Python’s built-in zipfile library. Metadata can include details such as file name, file size, and last modification time.

The example below shows how to open a ZIP archive, read its metadata, and display the information for each file within the archive.

We utilize the ZipFile context manager in the main function to open the meta.zip file. The zip_file object provides an infolist method, which returns a list containing ZipInfo objects for all files within meta.zip.

Next, we iterate through all the ZipInfo objects and display each file's metadata using the print_metadata function. The metadata presented consists of the file name, compressed size, last modified date and time, and file size.

import zipfile 
 
def print_metadata(file_info): 
    file_name = file_info.filename 
    file_size = file_info.file_size 
    compressed_size = file_info.compress_size 
    l_mod = file_info.date_time 
    l_mod_date = f"{l_mod[0]:02d}-{l_mod[1]:02d}-{l_mod[2]:02d}" 
    l_mod_time = f"{l_mod[3]:02d}:{l_mod[4]:02d}:{l_mod[5]:02d}" 
 
    compression_ratio = compressed_size / file_size if file_size > 0 else 0 
 
    print(f"File Name: {file_name}") 
    print(f"File Size: {file_size} bytes") 
    print(f"Compressed Size: {compressed_size} bytes") 
    print(f"Last Modified: {l_mod_date} {l_mod_time}") 
    print(f"Compression Ratio: {compression_ratio:.2%}\n") 
 
def main(): 
    # Open the ZIP archive in read mode 
    with zipfile.ZipFile('meta.zip', 'r') as zip_file: 
        # Iterate through the files in the archive 
        for file_info in zip_file.infolist(): 
            print_metadata(file_info) 
 
if __name__ == "__main__": 
    main()

When we run the example, it generates the following output.

Creating a ZIP archive

Now, let’s examine how to create a ZIP archive using the zipfile library in Python. Creating a ZIP archive involves generating a new archive or updating an existing one by adding or modifying files. This process is essential for compressing multiple files into a single, smaller file, which is beneficial for storage and distribution.

In this example, the create_zip_archive function takes an input_folder and an output_file as arguments. The function opens a new ZIP archive with the specified output file name in write mode 'w' and uses the ZIP_DEFLATED compression method.

It then walks through the input folder's directory structure, adding each file to the archive with its relative path. This ensures that the folder structure is maintained within the ZIP archive.

import zipfile 
import os 
 
def create_zip_archive(input_folder, output_file): 
    with zipfile.ZipFile(output_file, 'w', zipfile.ZIP_DEFLATED) as zip_file: 
        for root, dirs, files in os.walk(input_folder): 
            for file in files: 
                file_path = os.path.join(root, file) 
                print(f"Adding {file_path} to {output_file}") 
                zip_file.write(file_path, 
                               os.path.relpath(file_path, input_folder)) 
 
def main(): 
    input_folder = 'example_folder' 
    output_file = 'example_archive.zip' 
 
    create_zip_archive(input_folder, output_file) 
 
if __name__ == "__main__": 
    main()

The example uses the zipfile.ZIP_DEFLATE compression method. The zipfile library in Python supports the three compression methods and one method that stores files without compression.

  1. zipfile.ZIP_STORED: This method stores files without any compression. It's the default method when no compression method is specified. The files are archived without reducing their size, which can be useful when dealing with already compressed files like images or video files where further compression would not lead to significant size reduction.
  2. zipfile.ZIP_DEFLATED: This method uses the DEFLATE algorithm for compressing files within the archive. DEFLATE is the most widely used compression method in ZIP archives, balancing compression speed and efficiency well. You must have the zlib module available in your Python installation to use DEFLATE compression.
  3. zipfile.ZIP_BZIP2: This method employs the BZIP2 algorithm for compressing files within the archive. BZIP2 generally offers better compression ratios than DEFLATE but can be slower. To use BZIP2 compression, you must have the bz2 module in your Python installation.
  4. zipfile.ZIP_LZMA: This method uses the LZMA algorithm to compress archive files. LZMA can provide better compression ratios than DEFLATE and BZIP2, especially for larger files, but it is usually slower. To use LZMA compression, you must have the lzma module available in your Python installation.

Having explored creating a ZIP archive, let’s now focus on extracting compressed files from the archive.

Extracting files from a ZIP archive

This section will delve into retrieving files from a ZIP archive, enabling access to the original, uncompressed data.

In the example below, the extract_zip_archive function takes an input_file (the ZIP archive) and an output_folder (the destination for the extracted files) as arguments. The function opens the ZIP archive in read mode 'r' and extracts all files to the specified output folder.

import zipfile 
 
def extract_zip_archive(input_file, output_folder): 
    with zipfile.ZipFile(input_file, 'r') as zip_file: 
        zip_file.extractall(output_folder) 
 
def main(): 
    input_file = 'example_archive.zip' 
    output_folder = 'extracted_files' 
 
    extract_zip_archive(input_file, output_folder) 
 
if __name__ == "__main__": 
    main()

In our final example, we demonstrate how to append a file to an existing ZIP archive, allowing for the expansion of an archive’s contents.

Adding files to an existing ZIP archive

When working with ZIP archives, modifying their contents by appending new files is often necessary. In this section, we’ll explore the process of adding files to an existing ZIP archive, enabling effortless expansion and updates

In the example below, the add_file_to_zip_archive function takes an input_file (the file to add) and an archive_file (the existing ZIP archive) as arguments. The function opens the ZIP archive in append mode 'a' with the ZIP_DEFLATED compression method and writes the input file to the archive, preserving its base name.

import zipfile 
import os 
 
def add_file_to_zip_archive(input_file, archive_file): 
    with zipfile.ZipFile(archive_file, 'a', zipfile.ZIP_DEFLATED) as zip_file: 
        zip_file.write(input_file, os.path.basename(input_file)) 
 
def main(): 
    input_file = 'new_file.txt' 
    archive_file = 'example_archive.zip' 
 
    add_file_to_zip_archive(input_file, archive_file) 
 
if __name__ == "__main__": 
    main()

With this example, we conclude our exploration of the zipfile library and will now transition to examining the tarfile library for handling TAR archives.


Mastering Archive with Python’s tarfile Library

A TAR archive, short for “tape archive”, is a file format that combines multiple files and directories into a single file while preserving the file structure and metadata. TAR archives are commonly used for file backup and distribution, making it easy to package a group of files for transportation or storage.

The tarfile library in Python is a built-in module that enables you to easily create, read, and extract TAR archives. It supports compression formats like gzip, bzip2, and lzma, which can be used with TAR to create compressed archives with extensions such as .tar.gz, .tar.bz2, or .tar.xz.

Similar to our exploration of the zipfile library, we will begin by examining the metadata of a tar file. Following this, we will delve into creating a TAR archive, extracting files from it, and ultimately adding files to an existing TAR archive.

Reading metadata from a TAR archive

This example will demonstrate how to read metadata from a tar archive using Python’s built-in tarfile library. The metadata includes information such as file name, file size, and last modification time.

The script consists of three main parts: the print_metadata method, the read_tar_metadata method, and the main method. Like with the zipfile, we use the context manager in the read_tar_metadata function.

import tarfile 
import time 
 
def print_metadata(file_info): 
    file_name = file_info.name 
    file_size = file_info.size 
    last_modified = file_info.mtime 
 
    last_modified_str = time.strftime('%Y-%m-%d %H:%M:%S', 
                                      time.localtime(last_modified)) 
 
    print(f"File Name: {file_name}") 
    print(f"File Size: {file_size} bytes") 
    print(f"Last Modified: {last_modified_str}\n") 
 
def read_tar_metadata(tar_path): 
    with tarfile.open(tar_path, 'r') as tar_file: 
        for file_info in tar_file.getmembers(): 
            print_metadata(file_info) 
 
def main(): 
    tar_path = 'meta.tgz' 
    read_tar_metadata(tar_path) 
 
if __name__ == '__main__': 
    main()

When we run the example, it generates the following output.

Creating a TAR archive

Now, let’s explore how to create a TAR archive using the tarfile library in Python. Creating a TAR archive involves generating a new archive or updating an existing one by adding or modifying files. This process is crucial for bundling multiple files into a single, consolidated file, which is advantageous for storage and distribution.

In the example, we employ the tarfile.open function with the 'w:gz' mode to directly create a compressed TAR archive, commonly called a tarball. A tarball represents a TAR file that has undergone compression to reduce size.

import tarfile 
 
def create_gzipped_tar_archive(tar_path, file_paths): 
    with tarfile.open(tar_path, 'w:gz') as tar_file: 
        for file_path in file_paths: 
            tar_file.add(file_path, arcname=file_path) 
 
def main(): 
    tar_path = 'example.tgz' 
    file_paths = ['file1.txt', 'file2.txt', 'file3.txt'] 
 
    create_gzipped_tar_archive(tar_path, file_paths) 
    print(f"{tar_path} created with the files: {', '.join(file_paths)}") 
 
if __name__ == '__main__': 
    main()

The tarfile library in Python supports several modes for opening TAR files with different compression algorithms. The primary compression options include:

  1. gzip compression: To open a TAR file with gzip compression, use the mode 'w:gz' (for writing) or 'r:gz' (for reading). The resulting archive will have the extension .tar.gz or .tgz.
  2. bzip2 compression: To open a TAR file with bzip2 compression, use the mode 'w:bz2' (for writing) or 'r:bz2' (for reading). The resulting archive will have the extension .tar.bz2 or .tbz2.
  3. lzma compression (also known as xz compression): To open a TAR file with lzma compression, use the mode 'w:xz' (for writing) or 'r:xz' (for reading). The resulting archive will have the extension .tar.xz or .txz.

Extracting files from a TAR archive

We will now retrieve files from a TAR archive, enabling access to the original, uncompressed data.

In the example below, the extract_tar_archive function takes an input_file (the TAR archive) and an output_folder (the destination for the extracted files) as arguments. The function opens the TAR archive in read mode 'r' and extracts all files to the specified output folder.

import tarfile 
 
def extract_tar_archive(tar_path, output_path): 
    with tarfile.open(tar_path, 'r') as tar_file: 
        tar_file.extractall(output_path) 
 
def main(): 
    tar_path = 'example.tgz' 
    output_path = 'extracted_files' 
 
    extract_tar_archive(tar_path, output_path) 
    print(f"Files from {tar_path} have been extracted to {output_path}") 
 
if __name__ == '__main__': 
    main()

Our final example demonstrates how to append a file to an existing TAR archive.

Adding files to an existing TAR archive

In this last example, we define a function add_file_to_tar_archive that takes two input arguments: the path of the TAR file to modify (tar_path) and the file's path to add (file_path).

We open the TAR archive in append mode inside the function using the tarfile.open() function. We use a with statement to ensure that the TAR archive is closed automatically after executing the code block.

Next, we call the add() method of the TarFile object, passing the file_path as an argument. The arcname parameter is set to the original file name in this example.

import tarfile 
 
def add_file_to_tar_archive(tar_path, file_path): 
    with tarfile.open(tar_path, 'a') as tar_file: 
        tar_file.add(file_path, arcname=file_path) 
 
def main(): 
    tar_path = 'example.tar' 
    file_path = 'file4.txt' 
 
    add_file_to_tar_archive(tar_path, file_path) 
    print(f"{file_path} added to {tar_path}") 
 
if __name__ == '__main__': 
    main()

With this last example, we have covered various operations related to TAR archives in Python, including creating, extracting, and updating archives.


Choosing the Right Archiving Format

Throughout the various examples we’ve covered, we’ve demonstrated how to archive and extract files using Python’s zipfile and tarfile libraries. These powerful libraries make handling ZIP and TAR files a breeze. Now, you may wonder about the differences between these archive formats and which one is better suited for your needs.

When choosing between ZIP and TAR formats, consider the following factors:

Compatibility: ZIP might be a better choice if you need to share the archive across different platforms due to its widespread compatibility.

Compression: Both formats can work well if you require efficient compression. Remember that you should compress the TAR files using an external tool to create a tarball.

Metadata preservation: If preserving file metadata is crucial, TAR may be a better choice, especially when working with Unix and Unix-like systems.

Random access: If you need to extract individual files frequently, ZIP provides a more efficient solution due to its random access capability.

In conclusion, your choice between ZIP and TAR formats depends on your specific use case and requirements. Evaluate the factors mentioned above to determine which format best suits your needs.


Conclusion

This article has provided a comprehensive guide on how to work with ZIP and TAR archives using Python’s built-in zipfile and tarfile libraries.

We have demonstrated various operations, including creating, extracting, and updating archives and reading metadata from archived files.

While ZIP and TAR formats offer different advantages, your choice ultimately depends on your specific use case and requirements.

Factors such as compatibility, compression, metadata preservation, and random access should be considered when deciding.

By understanding the strengths and weaknesses of each format and harnessing the power of Python’s built-in libraries, you can efficiently manage large amounts of data, optimize storage, and improve data transfer in your projects.

All the examples are available in this GitHub library.

More content at PlainEnglish.io.

Sign up for our free weekly newsletter. Follow us on Twitter, LinkedIn, YouTube, and Discord.

Interested in scaling your software startup? Check out Circuit.