Removing large numbers of duplicate emails from Mac OS X Mail

I was faced with the problem of duplicate emails in some of my very large folders in Mail (some of my mailboxes have over 80,000 messages). There are AppleScripts for finding and removing duplicate messages, but they are very slow (~100 messages a minute), so they are not really viable on very large mailboxes. At this rate it would take more than 5 days to process one of my large email folders.

One simple solution is to archive the mailbox, use formail (a command-line tool from the procmail package) to strip out the duplicate messages, and then import the cleaned mbox back into Mail. This approach retains all the meta information like flags and attachments and is actually quite fast. The basic procedure is:

1. Rebuild the mailbox. This is probably a good idea anyway.
2. Archive the mailbox. This will save a copy of your mailbox in standard mbox format, which formail can then process.
3. Open a terminal window and cd to the folder containing the new mbox archive.
4. Run the following formail command. This will create a copy of the mbox (mbox_new) with the duplicate messages removed. formail recognizes duplicates by their Message-ID header, which it records in the idcache file. It takes around 5 minutes to process 80,000 messages.
formail -D 100000000 idcache -s < mbox > mbox_new
5. Open Mail and choose Import Mailboxes. Select the mbox_new mbox. Once it finishes importing, you will have a new folder under Import with all the non-duplicate messages. It is a good idea to rebuild the folder before moving it back to where you want it.
6. Delete the archive.
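
As a rough sketch of steps 3 and 4, the formail call can be wrapped in a small shell function that also prints message counts before and after, so you can see how many duplicates were dropped. The function names (count_messages, dedupe_mbox) and the cache file name are my own inventions for illustration, not part of formail or Mail. The count is based on "From " separator lines, so treat it as an approximate sanity check rather than an exact figure.

```shell
#!/bin/sh
# Sketch only: helper functions around the formail step above.
# Function names and the cache file name are hypothetical.

# Count messages in an mbox by counting its "From " separator lines.
count_messages() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c '^From ' "$1" || true
}

# Run the formail de-duplication with a fresh ID cache.
# Usage: dedupe_mbox INPUT_MBOX OUTPUT_MBOX
dedupe_mbox() {
    in=$1
    out=$2
    cache="$out.idcache"   # hypothetical name for the Message-ID cache file

    # A cache left over from an earlier run already holds every Message-ID,
    # so formail would treat all messages as duplicates. Start clean.
    rm -f "$cache"

    # Same command as in step 4, with the file names parameterized.
    formail -D 100000000 "$cache" -s < "$in" > "$out" || return 1

    echo "before: $(count_messages "$in")  after: $(count_messages "$out")"
}
```

With these functions loaded into your shell, running dedupe_mbox mbox mbox_new from the archive folder should produce the same mbox_new as the command in step 4.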

This approach is much, much faster and has the advantage of working on a copy of your email data, so if anything goes wrong you won’t mess up your current mailboxes.