Removing duplicate files

My sister received a new PowerBook for Christmas, and I’m in the process of moving her world from an old iMac to the new PowerBook. I chose not to use the automatic migration tool because there are several years of garbage strewn across the old system’s hard drive.

Somehow, my sister managed to duplicate a huge number of the thousands of digital photos in her iPhoto library. Some are duped many times, and there are several copies of the photo library left over from various efforts to back it up. Worse, some of the “backups” have photos that aren’t in the other libraries.

I needed something that could detect and remove duplicate files found within any random directory. I couldn’t find a freeware tool that worked without lots of user interaction and that did the dupe test in a fashion I found reasonable.

So, I wrote a Python script that does what I need. Maybe others will find it useful.

The latest version can be found at http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py. It is a one-off that solved a problem, not an attempt to write the world’s best Python script.

It works by:

  • taking a set of directories to scan, passed on the command line
  • traversing all of those directories and grouping all files by size
  • scanning each set of same-size files and checksumming (MD5) the first 1024 bytes of each file
  • for all files whose first 1024 bytes have the same checksum, checksumming the whole file and collecting together all real duplicates
  • deleting all duplicates of any one file, leaving the first encountered file as the one remaining copy
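
For anyone who wants to see the shape of that approach without pulling down the script, here is a rough sketch of the same steps. It is not the dupinator.py source: the function names (`md5_of`, `find_duplicates`, `remove_duplicates`), the `prefix_bytes` parameter, the choice to skip symlinks, and the `dry_run` switch are all my own assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None, chunk_size=65536):
    """Return the MD5 hex digest of a file, or of only its first `limit` bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            block = handle.read(size)
            if not block:
                break
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def find_duplicates(directories, prefix_bytes=1024):
    """Return lists of paths with identical content, first-seen path first."""
    # Step 1: walk every directory and group files by size.
    by_size = defaultdict(list)
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(root, name)
                # Assumption: symlinks are skipped rather than followed.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        # Step 2: cheap test, checksum only the first prefix_bytes bytes.
        by_prefix = defaultdict(list)
        for path in paths:
            by_prefix[md5_of(path, limit=prefix_bytes)].append(path)
        for candidates in by_prefix.values():
            if len(candidates) < 2:
                continue
            # Step 3: expensive test, checksum the whole file.
            by_full = defaultdict(list)
            for path in candidates:
                by_full[md5_of(path)].append(path)
            duplicate_sets.extend(g for g in by_full.values() if len(g) > 1)
    return duplicate_sets


def remove_duplicates(duplicate_sets, dry_run=True):
    """Delete every file but the first in each set; return the affected paths."""
    removed = []
    for _keeper, *extras in duplicate_sets:
        for path in extras:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Calling `remove_duplicates(find_duplicates(["/some/dir"]))` with the default `dry_run=True` only reports what would be deleted, which seems like a sensible default for anything pointed at someone's photo library. The size and 1024-byte passes exist purely to avoid reading whole files that cannot possibly match.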

It is acceptably fast, processing 3.3 gigabytes of random files in only a few minutes (removing 1.2 gigabytes of duplicate files).


