Removing hundreds of duplicate images

Question

So I've been slowly learning the ways of the "snake" and trying to create THE script that will remove all duplicate images after one has appended, shall we say, an entire library of characters, buildings, vehicles, etc.

I have seen the topics of going into the Outliner and remapping the duplicates, but imagine having hundreds of these one needs to Box-Select, then try to find the beginning of this list of eyemask.png.xxx. I have also seen scripts where one would match the name and remap to the name without the extra numbers. However, things get a little complicated when one's library of objects has identically named texture files (like body.png.xxx) whose contents differ. And maybe the names got changed somehow so the first part doesn't quite match, yet they are exactly the same image (ever found duplicate textures in a game?).

I figured, why not remap the images to the first match Per Pixel? Yeah, that is definitely a lengthy process if the image is 4K... I have gone about the iteration in two ways:

# Purely based on if the images are sorted alphabetically and follow the .xxx pattern
import bpy
lastimg = False
lastpix = []
curpix = []
imgcnt = 0
for img in bpy.data.images:
    curimg = img
    curpix = curimg.pixels
    thesame = True
    if len(curpix) > 0 and len(lastpix) > 0 and len(curpix) == len(lastpix):
        for i in range(len(curpix)):
            if i % (len(curpix) / 4096) < 1:
                print(imgcnt,"/",len(bpy.data.images),"-",curimg.name,"-",((i/len(curpix)*100)),"%")
            if curpix[i] != lastpix[i]:
                thesame = False
                break
    else:
        thesame = False
    if thesame == True:
        curimg.user_remap(lastimg)
    else:
        lastimg = curimg
        lastpix = lastimg.pixels
    imgcnt = imgcnt + 1
print("DONE!")

# Double the iteration, making sure there are no other named textures that look the same
import bpy

imgcnt = 0

for img1 in bpy.data.images:
    for img2 in bpy.data.images:
        if img1.users > 0 and img2.users > 0 and img1.name != img2.name:
            pix1 = img1.pixels
            pix2 = img2.pixels
            thesame = True
            if len(pix1) > 0 and len(pix2) > 0 and len(pix1) == len(pix2):
                for i in range(len(pix1)):
                    if i % (len(pix1) / 4096) < 1:
                        print(imgcnt,"/",len(bpy.data.images),"-",img2.name,"-",((i/len(pix1)*100)),"%")
                    if pix1[i] != pix2[i]:
                        thesame = False
                        break
            else:
                thesame = False
            if thesame == True:
                img2.user_remap(img1)
    imgcnt = imgcnt + 1

print("DONE!")

I have also attempted to hashlib the array of pixels to see if it is any quicker, but I keep getting errors. I have also attempted to convert the array of pixels to a string, but that seems just as slow. Who knew Python couldn't handle 67MB in one second? :P

Is there any quicker way you guys could suggest to compare two images? Thinking that further down the line I could do the same for the materials (compare the nodes and their values and reduce duplicates).

quicker to access a copy img.pixels[:] https://blender.stackexchange.com/questions/3673/why-is-accessing-image-data-so-slow , or consider loading into numpy — batFINGER, Feb 07 '21 at 12:31
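A minimal sketch of what the comment above suggests, assuming each image exposes a `pixels` sequence as Blender images do. The helper name `same_pixels` is made up for illustration. The slice copies all pixel values once, so the comparison runs on plain tuples instead of indexing into the image one value at a time:

```python
def same_pixels(img_a, img_b):
    """Return True when both images hold identical, non-empty pixel data."""
    pixels_a = img_a.pixels[:]  # one bulk copy instead of per-index access
    pixels_b = img_b.pixels[:]
    return len(pixels_a) > 0 and pixels_a == pixels_b
```

Inside Blender this could be called as `same_pixels(bpy.data.images[0], bpy.data.images[1])`.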

icalvin102 · Accepted Answer · 2021-02-07T16:49:23.243

3

You do not need hashlib to generate hashes of the pixels; Python's built-in hash() function is enough for this.

After hashing the images it is only a matter of grouping the images by hash.

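import bpy
from pprint import pprint


def duplicates(images):
    # Hash a tuple copy of each image's pixels, paired with the image name.
    hashes = [(hash(i.pixels[:]), i.name) for i in images]
    grouped = {}
    # Group image names by hash; names that share a hash are likely duplicates.
    for h, n in hashes:
        grouped.setdefault(h, []).append(n)

    return grouped


pprint(duplicates(bpy.data.images))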

The output will look something like this:

{
 -6513983583712379323: ['Image1', 'Image2'],
 -2557438906662028752: ['Image3', 'Image4'],
 5740354900026072187: ['Render Result'],
 -3721407912412733414: ['Viewer Node'],
}

Note: The hash() function only works on hashable objects, which for the built-in containers means immutable ones such as tuples. As @batFINGER pointed out in the comments, image.pixels[:] is already a tuple (immutable), so we are good in this case. If, however, you want to create a hash from a list, you have to convert it to a tuple first, e.g. hash(tuple([1,2,3])).
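A small illustration of the note above, using made-up pixel values rather than real image data. Hashing a list raises TypeError, while the tuple built from that list hashes to the same value as the original tuple:

```python
# Made-up RGBA values standing in for image.pixels[:]
pixels_tuple = (0.0, 0.5, 1.0, 1.0)
pixels_list = list(pixels_tuple)

tuple_hash = hash(pixels_tuple)  # tuples of floats are hashable

try:
    hash(pixels_list)            # lists are mutable, so this raises TypeError
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Converting the list back to a tuple yields the same hash as the original.
same_hash = hash(tuple(pixels_list)) == tuple_hash
```

Keep in mind that equal hashes only make identical content very likely; comparing the two tuples directly settles it.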

edited Feb 07 '21 at 16:49

answered Feb 07 '21 at 15:49


AFAIK image.pixels[:] is already a tuple. – batFINGER Feb 07 '21 at 16:03
@batFINGER you are right it's already a tuple. I assumed it's a list because slicing on a list e.g [1,2][:] returns a list – icalvin102 Feb 07 '21 at 16:19
Well, this sure was faster. The only downside is the hashes = line. Would like to move the for loop around so that I print out each image name so I know things are happening (or show progress percentage). But that's a minor detail. – Edward Feb 08 '21 at 08:28
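The progress output asked for in the last comment could be sketched roughly like this. It is not the answer author's code: the function name `remap_duplicates` and the `report` parameter are hypothetical, and it assumes each image has `name`, `pixels` and `user_remap()` as Blender's image datablocks do. It hashes one image per loop iteration, reports progress, and remaps a duplicate to the first image with the same pixels after confirming the match by direct comparison:

```python
def remap_duplicates(images, report=print):
    """Remap users of pixel-identical images to the first image seen.

    Returns a list of (duplicate_name, kept_name) pairs.
    """
    images = list(images)  # snapshot so the total is stable for progress output
    first_seen = {}        # pixel hash -> first image with that hash
    remapped = []
    for index, img in enumerate(images, start=1):
        report(f"{index}/{len(images)} - {img.name}")
        pixels = img.pixels[:]  # one bulk copy into a tuple
        if not pixels:          # skip images without pixel data
            continue
        original = first_seen.setdefault(hash(pixels), img)
        if original is img:
            continue            # first image with this hash, keep it
        # Equal hashes do not prove equal content, so compare before remapping.
        if original.pixels[:] == pixels:
            img.user_remap(original)
            remapped.append((img.name, original.name))
    return remapped
```

In Blender it would be called as `remap_duplicates(bpy.data.images)`. The remapped images are left with zero users, so they still need to be purged afterwards.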
