Find indexes of duplicate items in python

Pretty common question when dealing with lists (arrays, vectors or whatever is used in you favorite programming language) is to find duplicates and list the as a hashmap.
Example (Python). From the list like this

lst = [5,5,4,4,4,3]

get a dict

dupls = {5 => [0,1], 4 => [2,3,4]} 

The question was raised on the SO here and got some standard solutions which basically do the job with O(n) time.

I wanted to suggest another solution, which is slower — O(n*log(n)) but it allows generating an interesting structure.
The solution for original problem is this:

def dupl_rbespal(c):
    alreadyAdded = False
    dupl_c = dict()
    sorted_ind_c = sorted(range(len(c)), key=lambda x: c[x]) # sort incoming list but save the indexes of sorted items
    for i in xrange(len(c) - 1): # loop over indexes of sorted items
        if c[sorted_ind_c[i]] == c[sorted_ind_c[i+1]]: # if two consecutive indexes point to the same value, add it to the duplicates
            if not alreadyAdded:
                dupl_c[ c[sorted_ind_c[i]] ] = [sorted_ind_c[i], sorted_ind_c[i+1]]
                alreadyAdded = True
            else:
                dupl_c[ c[sorted_ind_c[i]] ].append( sorted_ind_c[i+1] )
        else:
            alreadyAdded = False
    return dupl_c

Now what I want to get is a following structure:

From the list like this

lst = [5,5,4,4,4,3]

get a dict

dupls = {0 => 1, 2 => 3, 3 => 4 } 

That is I want to get a dict where key points to the next index with the same value.
With a slight change of the above algorithm we get:

def nextDuplicates(c):
    dupl_c = dict()
    sorted_ind_c = sorted(range(len(c)), key=lambda x: c[x])
    for i in xrange(len(c) - 1):
        if c[sorted_ind_c[i]] == c[sorted_ind_c[i+1]]:
            dupl_c[ sorted_ind_c[i] ] = sorted_ind_c[i+1]
    return dupl_c

So it’s even shorter than the original one.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.