Removing duplicate files
My sister received a new PowerBook for Christmas and I’m in the process of moving her world from an old iMac to the new PowerBook. I chose not to use the automatic migration tool because there are several years of garbage strewn across the old system’s hard drive.
Somehow, my sister managed to duplicate a huge number of the thousands of digital photos in her iPhoto library. Some were duped many times, and there were a number of copies of the photo library left over from various efforts to back it up. Worse, some of the “backups” had photos that weren’t in the other libraries.
I needed something that could detect and remove duplicate files found within any random directory. I couldn’t find a freeware tool that worked without lots of user interaction and that did the dupe test in a fashion I found reasonable.
So, I wrote a Python script that does what I need. Maybe others will find it useful.
The latest version can be found at http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py. It is a one-off that solved a problem, not an attempt to write the world’s best Python script.
It works as follows:
- It is launched from the command line and passed a set of directories to scan.
- It traverses all of the directories and groups all files by size.
- For each set of files of the same size, it checksums (md5) the first 1024 bytes of every file.
- For all files that have the same checksum for the first 1024 bytes, it checksums the whole file and collects together all real duplicates.
- It deletes all duplicates of any one file, leaving the first encountered file as the one remaining copy.
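For illustration, here is a minimal sketch of the steps above. It is not the dupinator.py script itself: the function names (`md5_of`, `group_by`, `find_duplicates`, `remove_duplicates`) and the `dry_run` flag are assumptions made up for this example, and the real script may differ in details such as how it handles symlinks or empty files.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, or of its first `limit` bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(want)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Bucket paths by key(path); keep only buckets with more than one entry."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def find_duplicates(directories):
    """Return lists of identical files, each list in traversal order."""
    all_files = []
    for directory in directories:
        for root, _dirs, names in os.walk(directory):
            for name in names:
                path = os.path.join(root, name)
                # Assumption for this sketch: skip symlinks and special files.
                if os.path.isfile(path) and not os.path.islink(path):
                    all_files.append(path)

    duplicates = []
    # Cheap test first (size), then the 1024-byte checksum, then the full one.
    for same_size in group_by(all_files, os.path.getsize):
        for same_prefix in group_by(same_size, lambda p: md5_of(p, limit=1024)):
            duplicates.extend(group_by(same_prefix, md5_of))
    return duplicates


def remove_duplicates(groups, dry_run=True):
    """Delete every file in each group except the first; return those paths.

    With dry_run=True (the default here) nothing is deleted.
    """
    removed = []
    for _keeper, *extras in groups:
        for path in extras:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Calling `remove_duplicates(find_duplicates(["some/dir"]))` only reports what would be deleted; passing `dry_run=False` actually removes the files. The point of the staged checks is that most files are ruled out by size alone, and most of the rest by reading just 1024 bytes, so only likely duplicates are ever read in full.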
It is acceptably fast, processing 3.3 gigabytes of random files in only a few minutes (removing 1.2 gigabytes of duplicate files).

