Archive for the 'Code' Category

Removing duplicate files

Wednesday, December 29th, 2004

My sister received a new PowerBook for Christmas and I’m in the process of moving her world from an old iMac to the new PowerBook. I chose not to use the automatic migration tool because there are several years of garbage strewn across the old system’s hard drive.

Somehow, my sister managed to duplicate a huge number of the thousands of digital photos in her iPhoto library. Some are duped many times and there were a number of copies of the photo library from various efforts to back it up. Worse, some of the “backups” had photos that weren’t in the other libraries.

I needed something that could detect and remove duplicate files found within any random directory. I couldn’t find a freeware tool that would work without lots of user interaction or that did the dupe test in a fashion that I found reasonable.

So, I wrote a Python script that does what I need. Maybe others will find it useful.

The latest version is available online. It is a one-off that solved a problem, not an attempt to write the world’s best Python script.

It works as follows:

  • it is launched from the command line with a set of directories to be scanned
  • it traverses all directories and groups all files by size
  • for each set of files of one size, it checksums (md5) the first 1024 bytes
  • for all files that have the same checksum for the first 1024 bytes, it checksums the whole file and collects together all real duplicates
  • it deletes all duplicates of any one file, leaving the first encountered file as the one remaining copy
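The steps above can be sketched roughly as follows. This is not the original script, just a minimal reconstruction of the described approach; the function names (md5_of, group_by, find_duplicates, remove_duplicates) and the dry_run flag are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None, chunk_size=65536):
    """MD5 hex digest of the first `limit` bytes, or of the whole file if limit is None."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, 'rb') as f:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Group paths by key(path), keeping only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(directories):
    """Return lists of identical files; each list is in the order the files were encountered."""
    all_files = []
    for directory in directories:
        for root, _dirs, names in os.walk(directory):
            for name in sorted(names):
                path = os.path.join(root, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    all_files.append(path)
    duplicates = []
    # Cheap test first (size), then the first 1024 bytes, then the whole file.
    for same_size in group_by(all_files, os.path.getsize):
        for same_head in group_by(same_size, lambda p: md5_of(p, limit=1024)):
            duplicates.extend(group_by(same_head, md5_of))
    return duplicates


def remove_duplicates(directories, dry_run=True):
    """Delete every copy except the first encountered; with dry_run, only report."""
    removed = []
    for group in find_duplicates(directories):
        for path in group[1:]:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Calling remove_duplicates(['/some/dir']) with the default dry_run=True only lists what would be deleted, which is a sensible first pass before pointing anything like this at a photo library.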

It is acceptably fast, processing 3.3 gigabytes of random files in only a few minutes (removing 1.2 gigabytes of duplicate files).

Followup on World’s Most Annoying P2P

Saturday, December 18th, 2004

A couple of people pontificated on my original post that I didn’t get the point that the author was trying to make.

I most assuredly did get the point. I’m fully aware of the T-Shirt love-in that was the result of the DeCSS obfuscated code product and I thought it was both quite cool and quite effectively made the point it was trying to make.

Neat. Made a bit of a media splash and the response was generally ‘wow, they are trying to claim that that is rocket science protection technology?’. And it created a bunch of geek shirt / political activist combo paraphernalia.

But DeCSS is a different beast than P2P. CSS is a standard that an entire industry adopted. DeCSS effectively made that standard available to markets that could not otherwise consume CSS protected media.

P2P is not a standard. It is not a locked up technology that “BigCo” is trying to keep in their control.

P2P already has a very lively and open community that “BigCo” is trying to squash. The campaign is on to try and convince lawmakers and the public that P2P technology is über-evil nastiness that will destroy the world.

If you want to ensure that P2P technologies remain open and available, the best way to do that is to ensure that it is as easy as possible for relatively novice developers to toss together their own custom P2P solutions at whim.

Some random pile of obfuscated code makes a cute point, but it does not provide a building block for developers to perpetuate new protocols and solutions.

I would rather see the obviously talented developers of that particular script turn out a one-page bit of Python code that reads like pseudo-code for creating a simple P2P solution. Make the point that it is incredibly bloody easy to build P2P solutions at whim using existing technologies and protocols.

From that, there will hopefully be lots of developers, admins and, even, users that put together their own custom little P2P solution for their own purposes. For example, given a clearly written version of TinyP2P, I would turn around a custom little script that I could drop on various family computers to sync photos of the various kids around the family.
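To give a feel for how small such a thing could be, here is a rough sketch of a photo-syncing peer built on the XML-RPC support in the Python standard library. It is not TinyP2P and does not follow its protocol; the Peer class and the serve and sync_from functions are hypothetical names, and there is no authentication or peer discovery at all.

```python
import os
import threading
from xmlrpc.client import Binary, ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class Peer:
    """Shares the plain files in one directory with other peers."""

    def __init__(self, share_dir):
        self.share_dir = share_dir

    def list_files(self):
        return sorted(name for name in os.listdir(self.share_dir)
                      if os.path.isfile(os.path.join(self.share_dir, name)))

    def get_file(self, name):
        # basename() keeps a remote caller from escaping the shared directory.
        path = os.path.join(self.share_dir, os.path.basename(name))
        with open(path, 'rb') as f:
            return Binary(f.read())


def serve(share_dir, host='127.0.0.1', port=0):
    """Start sharing share_dir in a background thread; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(Peer(share_dir))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def sync_from(peer_url, dest_dir):
    """Fetch every file the remote peer has that dest_dir lacks; return the names fetched."""
    remote = ServerProxy(peer_url)
    fetched = []
    for name in remote.list_files():
        target = os.path.join(dest_dir, os.path.basename(name))
        if not os.path.exists(target):
            with open(target, 'wb') as f:
                f.write(remote.get_file(name).data)
            fetched.append(name)
    return fetched
```

Each machine would call serve() on its photo directory and periodically call sync_from() against the others. Whole files travel in a single XML-RPC response, so this only suits small collections on a trusted network.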

That kind of activity will very effectively demonstrate that P2P is here to stay. The more solutions available– the more developers whip off custom P2P solutions for their own personal needs– the harder it is to claim that P2P is a universally evil technology that must be squashed.

World’s most annoying P2P (& Suing BitTorrent)

Wednesday, December 15th, 2004

TinyP2P claims to be the world’s smallest p2p sharing application. It is implemented as 15 lines of highly obfuscated Python in an attempt to make the point that trying to suppress P2P technologies through lawsuits is futile.

I appreciate their attempt at making a point. Yet, I’m not really that impressed. 15 lines of highly obfuscated Python seems like just another useless geek wank cry for attention.

I would be far more impressed if it were implemented as less than a page of very clearly written Python. Not only would it still be impressively tiny, but it would provide a seed from which many new variants on P2P might arise. Instead of being a finger raised in defiance, it would be an opportunity to teach.

Speaking of futile attempts to squelch P2P technology. The MPAA is now pursuing BitTorrent as the Great Evil That Will Destroy Their Industry. I’m still of the camp that if Hollywood made movies that weren’t utter crap, did not bombard us with 30 minutes of ads prior to showing said crap, and didn’t charge us $9 for the “privilege” of watching the ads, then the crap, they wouldn’t be in such supposed dire straits.

The BitTorrent thing is certainly grabbing headlines.

I find it intensely irritating that BitTorrent is being called a “network” that Hollywood is trying to shut down. It isn’t a damned network already! Most articles actually do try– typically poorly– to describe the whole tracker concept and that the legal efforts are aimed at the tracker admins/hosts.

I have previously written on the topic of BitTorrent and how the lawsuits might play out. In that missive, I mentioned that it is impossible to prove that any one BT client uploaded an entire copy of any given piece of IP and it is actually quite provable that any one client could not possibly provide a whole copy of a piece of IP to any one other client. That may or may not work.

Regardless, there is likely something much more important at stake.

In particular, ISPs and the like have long been able to effectively defend themselves against the content passing through their networks, mailing lists, NNTP servers, web servers, proxies and the like by invoking a “common carrier” style defense. That is, the ISP actively chooses not to monitor or censor the content passing through its systems.

The trackers could employ a similar defense. Many trackers will accept any random torrents without censorship or filtration. That combined with the fact that trackers generally never host the content being distributed means that the tracker– and the tracker’s owner– never has the IP in their possession nor do they require any active awareness of what might be distributed by their site.
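A toy model may make that point more concrete. The sketch below is not the real BitTorrent tracker protocol (real trackers answer HTTP requests with bencoded responses); ToyTracker and announce are hypothetical names, and the only thing the model is meant to show is what kind of state a tracker holds.

```python
from collections import defaultdict


class ToyTracker:
    """Maps an opaque torrent identifier to the peers that announced it."""

    def __init__(self):
        # info_hash -> set of (host, port) pairs. No file names and no content.
        self._swarms = defaultdict(set)

    def announce(self, info_hash, peer_address):
        """Register a peer for a torrent and return the other peers already known."""
        swarm = self._swarms[info_hash]
        others = sorted(swarm - {peer_address})
        swarm.add(peer_address)
        return others
```

All the tracker ever sees is a hash and a list of addresses; the bits themselves only move between the peers.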

Now, in the email world, a mailing list operator that does not moderate content has significantly less liability than one that does. By choosing to moderate content, the operator is taking responsibility for each post that is broadcast to the list. By choosing not to moderate, the list is effectively a common carrier and the individual poster is the one responsible for the contents of their post. This has been used successfully to prevent ISPs from being liable for the many nasty things that pass through NNTP feeds, among other things.

Could this same defense be applied to trackers? It would appear to be a very similar situation. Obviously, those trackers that called themselves ‘warez depozt’ and made outright claims to have the largest collection of blah-blah-blah-illegal-blah have other problems.

If this kind of defense is successfully applied, then it will be very difficult for the MPAA or anyone else to continue to shut down torrent trackers through legal means.

If it is applied and fails in court, it would set a very ugly precedent. In particular, it would significantly further erode the “common carrier” defense. It could become the foundation for lawsuits that would effectively require ISPs, NSPs, hosting providers, etc. to very actively censor the content that passes through and is served from their systems. This would make it prohibitively expensive to publish anything on the Internet and would also cause the cost of simply passing traffic through your system to skyrocket.

Or, to look at it a slightly different way, it would make an ideal environment for corporations with very deep pockets to quite effectively squelch independent content producers while ensuring that their content is widely available on IP in a manner that is thoroughly protected by the courts.

This is exactly the situation that the MPAA, RIAA, and various participating content providers have been gunning for ever since they finally realized that the Internet is here to stay. TCP/IP is an incredibly cheap means of content distribution with the distinct advantage that the customer doesn’t, by default, receive a copy of your content that can be consumed repeatedly, forever, without pay-per-play. Yet, the current Internet environment, including the laws that protect it, makes it extremely difficult to control the content to the degree said organizations would like.

By no means do I think that the legal pursuit of BitTorrent will be the case that makes or breaks the Internet. It is just one small battle in the ongoing war for control of this relatively new IP distribution mechanism we call the Internet.

I will be watching this one closely for two reasons. First, I think it is a very interesting grey area between P2P style services and services that have been protected so far by fairly strong laws. Secondly, I have played an active, albeit very minor, role in the development of BitTorrent and, as a result, have a personal interest in the technology as well as having used said technology to distribute bits as a part of my professional career.

A month of programming languages…

Monday, December 13th, 2004

A month of articles about various programming languages is under way. It is a series of short articles, each of which explains the history of the subject language, provides a synopsis of its strengths and weaknesses, and provides some sample code.

So far, it includes 6 languages of which 3 are dialects of C. As BASH is an included “language”, it is clear that this language survey will contain a broad range of languages.

Interesting stuff.

PyObjC, py2app, & bundles

Friday, December 10th, 2004

PyObjC now has the ability to build NSBundles that can be dynamically loaded by any Objective-C application while the bundle is entirely implemented in Python. In other words, you can now use Python to implement plugins for any app that supports Objective-C plugins (NSBundle).

The PyObjC Subversion repository contains three examples; a screen saver, a preference pane, and an Interface Builder palette.

Most of the innovation comes from py2app. It is used to build the dynamically loadable bits within the NSBundle that then bootstrap the loading of the Python based class implementations.

py2app is also used within the build script of PyObjC. In particular, you can now run that script’s bdist_mpkg command with the --open option to build an Installer package containing PyObjC, py2app, the examples, and the Xcode templates. Once installed, invoking bdist_mpkg within any Python project that has a distutils build script will build an Installer package containing all the bits necessary to install that package.

If you want to build the latest PyObjC, you will need DocUtils. See below.

Bob Ippolito implemented all of the magic. Thanks, Bob — this stuff is incredible. Awesome stuff. A little over a decade since its inception, PyObjC is still a vibrant project that is growing in exciting new directions while still remaining focused on its core goals.

The pieces involved:

PyObjC bridges Objective-C to Python transparently enough that you can build full featured Cocoa Apps. PyObjC is rapidly approaching a 1.2 release and there are many significant refinements in this release.

py2app builds double-clickable Mac OS X .apps, including all dependencies. Actually, it does considerably more than that. It will also create installer packages from any Python based project that uses distutils. py2app can also rewrite Mach-O files.

DocUtils is a reStructuredText renderer. reStructuredText is a very simple ASCII format that compiles into nicely formatted HTML, PDF, Open Office, or other kinds of documents.

Living Code (Dethe Elza)

Thursday, December 2nd, 2004

While doing a google search, I stumbled across Dethe Elza’s Living Code site.

Dethe’s weblog documents his work with PyGame, PyObjC and Mac OS X.

Very interesting stuff. Dethe has abandoned Interface Builder in favor of Renaissance for UI generation.

I don’t agree entirely with his reasons for not using Interface Builder. However, I do find Renaissance to be quite interesting technology and Dethe’s weblog reads as a good tutorial on using Renaissance and Cocoa via Python/PyObjC.


There is also a Living Code SourceForge project that contains all of the examples and source from Dethe’s site.

Python, distutils & automatic packaging for Mac OS X

Thursday, December 2nd, 2004

Bob Ippolito (whose blog is currently down) refactored the PyObjC distutils recipe (the script) such that it uses his py2app to automatically package the build product of PyObjC into multiple Mac OS X packages contained in a single mpkg. End result: you can easily install PyObjC and you can select a subset of the packages to install, as well.

py2app also performs dependency analysis such that the resulting product (not just a package) includes all of the dependencies required by the project.

And it does this automatically for any Python project that uses distutils for packaging (which any modern Python package should).

I dropped a current build of PyObjC and a new build of the Python Readline support for Mac OS X onto my .mac file download page.

BumFiles download link

Friday, October 22nd, 2004

I added a Downloads link in the link list to the left. It will take you to a file sharing page on my .mac account into which I have posted various random bits of software that I find useful.

It has the Python BSD-DB and Readline modules, both source and Mac OS X binaries. You’ll also find the slides from my PyObjC talk at O’Reilly’s 2003 OS X conference. Unfortunately, I won’t be speaking this year, but I will be around intermittently.

Useful python bits.

Sunday, July 11th, 2004

I haven’t disappeared! I have just been incredibly busy with WWDC and related development work. Unfortunately, our stuff hasn’t been pushed outside of NDA’d channels, so I can’t start swamping my weblog with interesting code snippets related to my day job. I am building up quite the list of “to post” hacks, though.

I have been writing little bits of code to eliminate repetitive tedium from my world. I’ll share what I can. All scripts are available on my public iDisk. Mount in the Finder or visit my tedious file download page.

First up, if you use Python to bring sanity to shell scripting like tasks (a wonderful thing), then you likely need to execute lots of external commands in a shell like fashion. Process (formerly popen5) is an absolute god-send. It brings total sanity and a bit of security to invoking external commands from within Python:

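import glob, subprocess  # subprocess is what popen5 became; the call name below is assumed
pkgs = glob.glob('*.pkg')
for pkg in pkgs:
    subprocess.call(['/usr/sbin/installer', '-pkg', pkg, '-target', '/'])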

The above is a snippet from a script that, when invoked (typically via sudo python), installs all pkgs and tarballs in the current working directory (tarballs are just untarred into /).

Just copy the two scripts into a directory, drop any packages and tarballs into the directory and run the script. It is the first thing I run whenever I install a new system. That way, I’m always guaranteed to immediately have PyObjC, the Python bsddb and sqlite modules, Subversion, iPhoto, OCUnit, Keynote, the Python documentation and everything else I need regularly.
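For the curious, the overall shape of such a script might look something like the sketch below. It is a reconstruction from the description above and not the actual script: plan_install, install_all, the dry_run flag, the list of tarball suffixes and the tar invocation are all assumptions, and it uses the standard library subprocess module.

```python
import glob
import os
import subprocess

# Assumed set of suffixes to treat as tarballs.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2')


def plan_install(directory='.'):
    """Return the commands that would install everything found in directory."""
    commands = []
    for pkg in sorted(glob.glob(os.path.join(directory, '*.pkg'))):
        commands.append(['/usr/sbin/installer', '-pkg', pkg, '-target', '/'])
    for name in sorted(os.listdir(directory)):
        if name.endswith(TARBALL_SUFFIXES):
            # Tarballs are simply unpacked relative to the root directory.
            commands.append(['/usr/bin/tar', '-xf', os.path.join(directory, name), '-C', '/'])
    return commands


def install_all(directory='.', dry_run=True):
    """Print each command and, unless dry_run is set, execute it (needs root)."""
    for command in plan_install(directory):
        print(' '.join(command))
        if not dry_run:
            subprocess.check_call(command)
```

Separating the planning from the execution means the list of commands can be inspected without touching the system, which is handy for anything that is meant to run under sudo.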

Python, threading, remote debugging, etc…

Thursday, June 10th, 2004

Yeah, been quiet around here. We have a little show coming up and I’ve been just a little busy.

Fred has discovered the wonders of XML-RPC as a remote debugging tool. In particular, he is remotely debugging and/or monitoring a python based server process that is multithreaded.

Which spawned a second post discussing how to take snapshots of executing Python threads as a part of the remote debugging tools.

The resulting discussions are particularly interesting. In particular, they cover the threadframe module, which offers Python VM introspection of the various threads that may be present.
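As a rough idea of what this looks like in practice, here is a small sketch that exposes a thread snapshot over XML-RPC. It is not Fred’s code and it does not use threadframe; it swaps in sys._current_frames() from the standard library, which provides similar per-thread frame access. The names snapshot_threads and start_debug_server are made up for the example.

```python
import sys
import threading
import traceback
from xmlrpc.server import SimpleXMLRPCServer


def snapshot_threads():
    """Return a dict mapping 'thread name (id)' to that thread's current stack trace."""
    names = {t.ident: t.name for t in threading.enumerate()}
    snapshot = {}
    for thread_id, frame in sys._current_frames().items():
        label = '%s (%d)' % (names.get(thread_id, 'unknown'), thread_id)
        snapshot[label] = ''.join(traceback.format_stack(frame))
    return snapshot


def start_debug_server(host='127.0.0.1', port=0):
    """Serve snapshot_threads over XML-RPC from a background thread."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(snapshot_threads)
    threading.Thread(target=server.serve_forever, name='debug-rpc', daemon=True).start()
    return server
```

A client would then connect with xmlrpc.client.ServerProxy and call snapshot_threads() to see where every thread in the server currently is. Binding to 127.0.0.1 is deliberate; a hook like this should not be reachable from an untrusted network.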

Speaking of really bloody useful modules, the process module is a far superior mechanism for launching subshelled tasks. It avoids the security problems of system() and is vastly more elegant than execve() or popen().
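To illustrate the security point, here is a minimal sketch using the standard library subprocess module as a stand-in for the process module mentioned above (whose API is not shown here). The helper name run_echo is hypothetical. The command is passed as a list of arguments, so no shell ever parses it.

```python
import subprocess
import sys


def run_echo(argument):
    """Run a child Python process that prints its first argument, and return that output."""
    result = subprocess.run(
        [sys.executable, '-c', 'import sys; print(sys.argv[1])', argument],
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.rstrip('\n')
```

An argument such as 'photo; rm -rf ~' reaches the child as one literal string. Building the same command as a single string for system() would hand the semicolon to the shell.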