
Tip of the day #4: Find and Remove Duplicate Files With Python

Have you ever wondered how to clean up your cluttered computer storage space efficiently? Over time, it's common to accumulate duplicate files that consume valuable disk space. In this tutorial, I'll walk you through a Python script to find and remove duplicate files based on their name and size.



Understanding the Problem

Before we dive into the code, let's understand the problem we're trying to solve. Duplicate files are identical files that exist in different locations on your computer. They can be unintentionally created when you download or copy files multiple times. Identifying and removing these duplicates not only saves space but also makes your file organization more efficient.


Removing Duplicate Files with Python

To solve this problem, let's break it down into a few steps for our code to handle:


1. Importing Necessary Modules

We start by importing the os module, which allows us to interact with the file system.


import os

2. Identifying the Size of a File

Next, we write a function that receives the path to a file and returns its size in bytes, which we will use to identify duplicates.


def get_file_size(filepath):
    return os.path.getsize(filepath)
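
As a quick sanity check, you can call it on any existing file and print the result (the path below is just a placeholder; substitute a real file from your machine):

# "example.txt" is a placeholder path; os.path.getsize() returns the size in bytes
print(get_file_size("example.txt"))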

3. Finding Duplicate Files

To find duplicate files, we'll traverse through a specified root folder and its subfolders using the os.walk() function.

For each file, we'll determine its size and use a dictionary to group files with the same size.

We will create a unique identifier for each file: a tuple built from the file name and its size.

We will use this tuple as the key to the dictionary, and the value will be a list of the paths to files of the same name and size.

def find_duplicate_files(root_folder):
    # Create a dictionary that groups files by (name, size)
    files_dict = {}

    # Traverse through the root folder and its subfolders
    for foldername, subfolders, filenames in os.walk(root_folder):
        for filename in filenames:
            filepath = os.path.join(foldername, filename)
            file_size = get_file_size(filepath)

            # Create a unique identifier for each file
            file_identifier = (filename, file_size)

            # Add the file path to the dictionary
            if file_identifier in files_dict:
                files_dict[file_identifier].append(filepath)
            else:
                files_dict[file_identifier] = [filepath]

Then we create an empty list and append every group of file paths that contains more than one path; these groups are our duplicates.


    # Identify duplicate files based on name and size
    duplicate_files = []
    for file_paths in files_dict.values():
        if len(file_paths) > 1:
            duplicate_files.append(file_paths)

    return duplicate_files

The full function to find duplicate files looks like this:

def find_duplicate_files(root_folder):
    # Create a dictionary that groups files by (name, size)
    files_dict = {}

    # Traverse through the root folder and its subfolders
    for foldername, subfolders, filenames in os.walk(root_folder):
        for filename in filenames:
            filepath = os.path.join(foldername, filename)
            file_size = get_file_size(filepath)

            # Create a unique identifier for each file
            file_identifier = (filename, file_size)

            # Add the file path to the dictionary
            if file_identifier in files_dict:
                files_dict[file_identifier].append(filepath)
            else:
                files_dict[file_identifier] = [filepath]

    # Identify duplicate files based on name and size
    duplicate_files = []
    for file_paths in files_dict.values():
        if len(file_paths) > 1:
            duplicate_files.append(file_paths)

    return duplicate_files
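
Before deleting anything, it's a good idea to preview what the function finds. Here's a minimal sketch (the folder path is just an example):

# Preview duplicate groups without removing anything; the path is an example
duplicates = find_duplicate_files("/path/to/folder")
for group in duplicates:
    print(group)  # each group is a list of paths sharing the same name and size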

4. Removing Duplicate Files

Once we have identified duplicate files, we can remove them. This is done by looping through the list of duplicate file paths and using os.remove() to delete the duplicates.

Keep in mind that we keep one copy of each file: the loop skips the first path in each group by slicing the list with [1:], and removes the rest.

def remove_duplicate_files(root_folder):
    duplicate_files = find_duplicate_files(root_folder)

    for file_paths in duplicate_files:
        for duplicate_file in file_paths[1:]:
            os.remove(duplicate_file)
            print(f"Removed duplicate file: {duplicate_file}")

5. Running the Script

To run the script, we simply need to call the remove_duplicate_files() function with the path to the root folder we want to clean up.


The full script looks like this:

import os

def get_file_size(filepath):
    return os.path.getsize(filepath)

def find_duplicate_files(root_folder):
    # Create a dictionary that groups files by (name, size)
    files_dict = {}

    # Traverse through the root folder and its subfolders
    for foldername, subfolders, filenames in os.walk(root_folder):
        for filename in filenames:
            filepath = os.path.join(foldername, filename)
            file_size = get_file_size(filepath)

            # Create a unique identifier for each file
            file_identifier = (filename, file_size)

            # Add the file path to the dictionary
            if file_identifier in files_dict:
                files_dict[file_identifier].append(filepath)
            else:
                files_dict[file_identifier] = [filepath]

    # Identify duplicate files based on name and size
    duplicate_files = []
    for file_paths in files_dict.values():
        if len(file_paths) > 1:
            duplicate_files.append(file_paths)

    return duplicate_files

def remove_duplicate_files(root_folder):
    duplicate_files = find_duplicate_files(root_folder)

    for file_paths in duplicate_files:
        for duplicate_file in file_paths[1:]:
            os.remove(duplicate_file)
            print(f"Removed duplicate file: {duplicate_file}")

if __name__ == "__main__":
    root_folder = input("Enter the root folder path: ")
    remove_duplicate_files(root_folder)
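
One caveat: matching on name and size is a fast heuristic, but two different files can share both, and renamed copies will be missed. If you want to compare actual contents, you can hash each file and group by the hash instead. Here's a minimal sketch using Python's standard hashlib module; you would use get_file_hash(filepath) as the dictionary key in find_duplicate_files() in place of (filename, file_size):

import hashlib

def get_file_hash(filepath, chunk_size=65536):
    # Read the file in chunks so large files don't exhaust memory
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()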

That's it!


Go ahead and clean up your computer.


If you're looking to learn Python, check out my course, Python Complete. It's intended for anyone interested in learning to code, whether you're an absolute beginner or already have some background in programming; the course covers Python basics and works through advanced concepts.



You can also try the free trial which includes the first 2 chapters of the course to see if this course is right for you. Get your free trial here.

