13 Working with Files in Python

Learning Objectives

Be able to read and write files in Python
Understand how to use the with method and a context manager to work with files
Introduce ‘garbage collection’ in python

Working with files is simple in python. Many packages have their own built-in file reading and writing functions including for example ‘pandas’ which we will look at next. Here we use the built-in open function.

13.1 Working Directory

A working directory is the directory within the file system where your program will use as a default to look for and read files or output files. It easier than always defining absolute file paths and can allow code to be transferred.

When running python scripts from the console, the working directory is the one where you launched the script from.

The os module can be used in Jupyter Notebooks to check the working directory and change it.

Get working Directory:

import os
print(os.getcwd())  #current working directory

Change working Directory:

os.chdir('/path/to/directory')

You can also use the magic command:

%cd /path/to/directory

Magic commands in Jupyter Notebooks can be super useful. We will not look at them further here, but also note that they cannot be used ouside the iPython Kernel

You can also set default working directory in the config file. We will not go over this here (and I myself tend not to work this way - although I also don’t often use Jupyter Notebooks).

I recommend having scripts and data in the same directory, or a subdirectory of the project directory. My common setup might be:

/home/user/project/
    ├── data/
    │   └── dataset.csv
    ├── scripts/
    |   └── analysis.py
    ├── logs/
    │   └── error-log.log
    │   └── output-log.log
    └── output/
        └── output.csv

13.2 Opening and closing files in python

You can open files in different modes using the open() function. The common modes include:

'r': Read (default)
'w': Write (overwrites the file if it exists)
'a': Append (adds content to the end of the file)
'b': Binary mode (for non-text files)
'x': Exclusive creation (fails if the file already exists)

Example:

# Open the file in read mode
file = open('dna_sequence.txt', 'r')

# Read the contents of the file into a string
dna_sequence = file.read()

# Close the file
file.close()

# The dna_sequence variable now holds the content of dna_sequence.txt
print(dna_sequence)  # This will display the content of the file

file.readlines() will return a list with each line:

file = open('dna_sequences.txt', 'r')
dna_sequences = file.readlines()
file.close()

Using for loops to iterate line by line:

file = open('dna_sequences.txt', 'r')
for line in file:
    # Strip trailing whitespace or newline characters from the end of each line
    dna_sequence = line.strip()
    print(dna_sequence)  # Process or print the current line
file.close()

This method is memory-efficient, especially for large files, since it doesn’t load the entire file into memory at once.

Consider the problems with working with files this way:

Exercise 1 - Discussion: Exceptions after open()

Level:

If an exception occurs after open() what will happen? What have you already learnt as a method to stop this?

Answer

The file will never close properly - you could use try ,except , finally. The with statement presented in the next section is a better way to deal with this.

13.3 Using `with` when working with files:

Using the with statement is a better way to work with files in Python as it ensures that files are properly opened and closed.

Example:

with open('dna_sequence.txt', 'r') as file:
    for line in file:
        # Strip the newline character from each line and print it
        dna_sequence = line.strip()
        print(dna_sequence)

Note it uses the same kind of syntax as for loops and conditionals

with does this by working with a context manager. The open() function returns a fileobject which has the .__enter___() and .__exit__() methods. You can read more about this in your own time.

13.4 Aside: Garbage collection

Garbage collection frees memory by cleaning up objects and variables that are no longer needed.

Python does this automatically through reference counting and generational garbage collection.

You can interact with this using the gc module.

13.5 Writing to a file

The same methods are used to write to a file:

Write to a new file or replace an existing file

Writing to a new file or overwriting an existing file is simple:

with open('sequences.txt', 'w') as file:
    file.write("ACGTGCTTCCAAACGTA\n")
    file.write("ACGTGCTTCCAAACGTA\n")

file.writelines(): Takes an iterable (like a list) and writes each item to the file. You can also use for loops.

the \n is needed when manually adding lines.

Appending to an existing file

This is exactly the same however you need to replace 'w' with 'a' for append mode.

binary files Use the 'wb' for writing binaries for example.

Exercise 2 - Reading and writing files

Level:

Use the above to first write something to a file. Then read and append to the file.

13.6 Moving files

The shutil module can be used to move files in python.

import shutil
import os

# Define paths
source_file = 'source_folder/file.txt' 
destination_file = 'destination_folder/file.txt' 

# Create the destination folder if it doesn't exist
os.makedirs(os.path.dirname(destination_file), exist_ok=True)

# Move the file
shutil.move(source_file, destination_file)

The same can be used to move directories

13.7 Summary

It is important to be able to manipulate files in python. Many packages like pandas come with functions to read and write specific file types. See the following file types and modules to work with them:

File Type	Extension(s)	Description	Modules
Text File	`.txt`	Plain text files that contain unformatted text.	Built-in `open()`, `io`
CSV File	`.csv`	Comma-separated values, files used to store tabular data.	`csv`, `pandas`
TSV File	`.tsv`	Tab-separated values, files used to store tabular data, similar to CSV.	`csv`, `pandas`
JSON File	`.json`	JavaScript Object Notation, a lightweight data interchange format.	`json`, `pandas`
XML File	`.xml`	Extensible Markup Language, used to store structured data.	`xml.etree.ElementTree`, `lxml`
Excel File	`.xls`, `.xlsx`	Microsoft Excel files for storing spreadsheet data.	`pandas`, `openpyxl`, `xlrd`
Pickle File	`.pkl`, `.pickle`	Serialized Python objects for saving and loading Python data types.	`pickle`
SQLite Database	`.sqlite`, `.db`	Lightweight database files used with SQLite, a C-based database engine.	`sqlite3`, `pandas`
HDF5 File	`.h5`, `.hdf5`	Hierarchical Data Format for storing large amounts of data efficiently.	`h5py`, `pandas`
Parquet File	`.parquet`	Columnar storage file format optimized for use with big data processing.	`pyarrow`, `pandas`
Zarr File	`.zarr`	Chunked, compressed N-dimensional arrays for large datasets, designed for cloud storage.	`zarr`, `xarray`
HTML File	`.html`, `.htm`	HyperText Markup Language files used for web pages.	`html.parser`, `BeautifulSoup`
Markdown File	`.md`	Plain text files formatted using Markdown for easy reading and writing.	`markdown`, `Mistune`
Log File	`.log`	Text files that record events or messages from applications.	Built-in `open()`, `logging`
Image Files	`.jpg`, `.jpeg`, `.png`, `.gif`	Common image formats for graphics.	`PIL`, `OpenCV`, `imageio`
Audio Files	`.mp3`, `.wav`, `.ogg`	Common audio formats for music and sound.	`pydub`, `wave`, `soundfile`
Video Files	`.mp4`, `.avi`, `.mov`	Common video formats for visual media.	`moviepy`, `opencv`, `imageio`
Archive Files	`.zip`, `.tar`, `.gz`	Compressed files for reducing file size and bundling multiple files.	`zipfile`, `tarfile`, `shutil`
PDF File	`.pdf`	Portable Document Format, used for presenting documents.	`PyPDF2`, `pdfplumber`, `reportlab`
FASTA File	`.fasta`, `.fa`	Text-based format for representing nucleotide or peptide sequences.	`BioPython`
GenBank File	`.gb`, `.gbk`	Format for storing genomic data, including sequence and annotation.	`BioPython`
Alignment File	`.maf`	Files that store multiple sequences aligned against each other.	`BioPython`

Key Points

It is essential to be able to work with different file types
It is better to use the with method when working with files to handle opening and closing files
Many modules can be used to work with different file types

13.1 Working Directory

13.2 Opening and closing files in python

13.3 Using with when working with files:

13.4 Aside: Garbage collection

13.5 Writing to a file

13.6 Moving files

13.7 Summary

13.3 Using `with` when working with files: