How to de-duplicate your albums

16-05-2019

In this new recipe I will share how I de-duplicate my photo albums (the same approach also works for songs).

When I need to change my phone, or simply reset it, I take all the pictures from the camera and make a backup. But when I need to put all the backups together, I struggle to know which pictures are already in one folder or another, so in the end I have the same picture several times with different names in different folders.

I don't want to go through them one by one and check whether each already exists, so I use a script to help me with the job.

Stay tuned for the script … but just in case, here are the commands … I will tidy them up shortly.

First you need to have 2 folders:

one for the new pictures you want to add to the library; let's call it, creatively, new
another with the structure of your library, where you have everything in order: catalogue

Folder with the catalogue

We need to generate a file listing each picture with its corresponding checksum. For that I used the following command:

find ./catalogue/ -type f -exec md5 {} ';' > md5_catalogue_folder.txt

Folder with new images

For this folder we also need to create a similar file with the checksums, so we can compare it with the previous one:

find ./new/ -type f -exec md5 {} ';' > md5_new_folder.txt

Combine the two checksum files

Once you have both files, you need to combine and sort them. For each checksum, the first line will then be the original file in the catalogue, and the lines below it will be the duplicates of that image in the new folder. I use this command for that:

cat md5_catalogue_folder.txt md5_new_folder.txt | awk -F " = " '{print $2,$1}' | sort -k1,2 > sorted_list.txt

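NOTE: This works on a Mac, where md5 prints each line as "MD5 (path) = checksum", hence the field separator " = ". On Linux, md5sum prints "checksum  path" instead, so you will need to change the command there (and on Windows).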
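For Linux, a minimal sketch of the checksum and sorting steps could look like the block below. It assumes GNU coreutils `md5sum`, which prints the checksum first, then two spaces, then the path, so no awk reordering is needed. The function name `build_sorted_list` is a placeholder made up for this example. Keep in mind that the deletion command further down expects the Mac layout, so it would need adapting to this one.

```shell
#!/bin/sh
# Sketch only: assumes GNU md5sum, whose lines look like
#   <checksum>  ./catalogue/2018/beach.jpg
# build_sorted_list is a hypothetical helper name, not part of any tool.
build_sorted_list() {
  catalogue_dir=$1
  new_dir=$2
  out_file=$3
  {
    find "$catalogue_dir" -type f -exec md5sum {} +
    find "$new_dir" -type f -exec md5sum {} +
  } | LC_ALL=C sort > "$out_file"
  # Lines with the same checksum end up adjacent. The catalogue copy comes
  # first only because "./catalogue" sorts before "./new" alphabetically.
}

# Usage, with the same folder names as in this recipe:
# build_sorted_list ./catalogue ./new sorted_list.txt
```

Write the output file outside both folders, otherwise it would be picked up by find on the next run.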

Generate the deletion file based on duplicates

In this step we generate a file listing all the duplicates to delete. Once it is generated, you can check that the files in it really are duplicates.

awk '{ a[$1]++; if (a[$1] >= 2) print }' sorted_list.txt > to_delete.txt
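To see what this filter does, here is a small worked example on made-up data. The checksums and file names are hypothetical and shortened for readability, and `list_duplicates` is a placeholder name. It uses the shorter awk idiom `seen[$1]++`, which prints a line only when its first field has already appeared, the same effect as the counter in the command above.

```shell
#!/bin/sh
# list_duplicates (hypothetical name): print every line whose first field
# (the checksum) was already seen on an earlier line.
list_duplicates() {
  awk 'seen[$1]++' "$1"
}

# Made-up sample in the layout of sorted_list.txt on a Mac.
cat > sample_sorted_list.txt <<'EOF'
1111 MD5 (./catalogue/2018/beach.jpg)
1111 MD5 (./new/IMG_0001.jpg)
1111 MD5 (./new/IMG_0007.jpg)
2222 MD5 (./new/IMG_0002.jpg)
EOF

list_duplicates sample_sorted_list.txt
# Prints the two ./new lines with checksum 1111.
# The catalogue line and the unique 2222 line are left out.
```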

NOTE: if you want to check that the list doesn't contain a file from your catalogue, run:

cat to_delete.txt | grep -v catalogue
cat to_delete.txt | wc -l
cat to_delete.txt | grep -v catalogue | wc -l

The last two numbers should be the same. If they differ, the catalogue itself contains duplicates and some of its files have ended up in the list.
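A slightly stricter version of this check could match the path prefix instead of the bare word, so a new picture that merely has "catalogue" in its name is not miscounted. This is only a sketch: it assumes the Mac line layout `<checksum> MD5 (<path>)` and the folder name ./catalogue/, and `check_to_delete` is a hypothetical helper name.

```shell
#!/bin/sh
# check_to_delete (hypothetical name): report how many listed files live
# under ./catalogue/ and return non-zero if there are any.
check_to_delete() {
  total=$(wc -l < "$1" | tr -d ' ')
  # grep -c prints 0 and exits with status 1 when nothing matches.
  from_catalogue=$(grep -c -F '(./catalogue/' "$1" || true)
  echo "$from_catalogue of $total listed files are inside ./catalogue/"
  [ "$from_catalogue" -eq 0 ]
}

# Usage:
# check_to_delete to_delete.txt && echo "safe to continue"
```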

Finally deleting the files

To delete the files listed in to_delete.txt:

cat to_delete.txt | cut -d" " -f3- | awk '{print "\"" substr($0, 2, length($0)-2) "\""}' | xargs rm -f -v
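If you would like a preview before anything is removed, a more defensive variant might look like the block below. It is only a sketch: it assumes the Mac line layout `<checksum> MD5 (<path>)`, assumes the new pictures live under ./new/, and `delete_listed` is a hypothetical helper name. Anything outside ./new/ is skipped, so a catalogue file that slipped into the list is left alone.

```shell
#!/bin/sh
# delete_listed LIST [dry-run]  (hypothetical helper name)
# Reads lines like "<checksum> MD5 (./new/photo.jpg)" and removes each path.
# With "dry-run" as second argument it only prints what would be removed.
delete_listed() {
  list=$1
  mode=${2:-delete}
  while IFS= read -r line; do
    path=${line#*" MD5 ("}   # drop everything up to and including " MD5 ("
    path=${path%")"}         # drop the closing parenthesis
    case $path in
      ./new/*) ;;            # only paths under ./new/ are eligible
      *) echo "skipping: $path" >&2; continue ;;
    esac
    if [ "$mode" = "dry-run" ]; then
      echo "would delete: $path"
    else
      rm -f -v -- "$path"
    fi
  done < "$list"
}

# Usage: preview first, then delete for real.
# delete_listed to_delete.txt dry-run
# delete_listed to_delete.txt
```

Reading the list line by line keeps file names with spaces or quotes intact without any extra quoting.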

At this point, if you have been following the instructions, all the duplicated files should be deleted. To confirm, you can run the commands again, and the file to_delete.txt should be empty.

Buy me a Coffee

I hope you find this useful. If you have any questions, please find me on Twitter at @bigg_blog, and if you have a couple of pounds to spare, buy me a coffee.
G