
What's the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]: 
1    1
2    2
4    3
dtype: int64

But I want this:

In [4]: a.something()
Out[4]: 
1    1
2    2
4    3
5    2
dtype: int64

9 Answers


Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

The above uses a boolean mask: we compare the Series against itself shifted by -1 rows, and keep only the positions where the values differ.
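
To make the mask explicit, here is a small illustration of the snippet above (my own, not part of the original answer):

import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

mask = a.shift(-1) != a   # True wherever the NEXT value differs
# mask is True at index 1, 3, 4, 5 and False at index 2,
# so a.loc[mask] keeps the last element of each consecutive run
a.loc[mask]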

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.
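
One caveat worth noting (my own addition, not from the original answer): diff subtracts neighbouring values, so it only works on numeric or datetime data, whereas the shift comparison also handles strings:

s = pd.Series(["a", "a", "b"])
s.loc[s.shift() != s]   # works: keeps "a" and "b"
# s.diff() would raise a TypeError here, since strings cannot be subtracted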

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift(), since the default period is 1. This returns the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
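
In summary, the two variants differ only in which element of each consecutive run survives:

a.loc[a.shift() != a]    # keeps the FIRST of each run -> index 1, 2, 4, 5
a.loc[a.shift(-1) != a]  # keeps the LAST of each run  -> index 1, 3, 4, 5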

  • How should we do it if we want to do a groupby first and then drop consecutive duplicates? For example, df.groupby(['Col1','Col2']) and save it again as a dataframe?
    – DSaad
    Commented Feb 12, 2021 at 11:05
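
A hedged sketch addressing the groupby question above (my own, assuming a hypothetical frame with group keys Col1/Col2 and a value column val; the names are illustrative): groupby(...).shift() shifts within each group, so group boundaries are never treated as duplicates.

import pandas as pd

df = pd.DataFrame({
    "Col1": ["a", "a", "a", "b", "b"],
    "Col2": [1, 1, 1, 2, 2],
    "val":  [10, 10, 20, 20, 20],
})

# within each (Col1, Col2) group, keep rows whose value differs
# from the previous row of the same group; the result is still a DataFrame
deduped = df.loc[df["val"] != df.groupby(["Col1", "Col2"])["val"].shift()]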

Here is an update that will make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:

cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
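
For example (my own illustration, with a assumed to be a DataFrame):

import pandas as pd

a = pd.DataFrame({
    "col1": [1, 1, 2, 2],
    "col2": [5, 5, 6, 6],
    "col3": [0, 0, 0, 1],
})
cols = ["col1", "col2", "col3"]
# a row survives if ANY of the compared columns changed
a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
# keeps rows 0, 2 and 3; row 1 repeats row 0 in every column
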
  • Can you do this but keep the last row?
    – jonboy
    Commented May 20, 2022 at 7:29
  • It will only remove the last row if it is a duplicate.
    – johnml1135
    Commented May 20, 2022 at 13:54
  • @jonboy you need to use shift(-1); see the accepted answer. Commented Jan 31, 2023 at 22:01
  • I don't think so - the "Update" on the accepted answer affirms that shift() is the proper expression. In essence, if the current row is different than the next row (shift() or shift(1)), keep the current row. See pandas doc on shift.
    – johnml1135
    Commented Feb 1, 2023 at 12:34
  • de_dup = a[a.columns].loc[(a.shift() != a).any(axis=1)] is a little bit handier and will work as well.
    – kernstock
    Commented Dec 6 at 9:30

Since we are going for the most efficient way, i.e. performance, let's use the underlying array data to leverage NumPy. We take one-off slices and compare, similar to the shifting method discussed earlier in @EdChum's post. But with NumPy slicing we end up with an array that is one element shorter, so we need to concatenate a True element at the start to select the first element, giving an implementation like so -

import numpy as np

def drop_consecutive_duplicates(a):
    ar = a.values
    # leading True keeps the first element; the rest keep values that differ from their predecessor
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]

Sample run -

In [149]: a
Out[149]: 
1    1
2    2
3    2
4    3
5    2
dtype: int64

In [150]: drop_consecutive_duplicates(a)
Out[150]: 
1    1
2    2
4    3
5    2
dtype: int64

Timings on large arrays, compared against @EdChum's solution -

In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))

In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop

In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop

In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop

In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop

So, there's some improvement!

Get major boost for values only!

If only the values are needed, we can get a major boost by simply indexing into the array data, like so -

def drop_consecutive_duplicates(a):
    ar = a.values
    # same mask as before, but index the NumPy array directly, returning a plain ndarray
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]

Sample run -

In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])

Timings -

In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop

In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop
  • I do not understand why the timings for [147] and [175] differ. Can you explain what change you have made? I don't see any. Perhaps a typo?
    – Biarys
    Commented Feb 7, 2019 at 3:38
  • @Biarys [175] uses the modified version from the Get major boost for values only! section, hence the time difference. The original works on a pandas Series, whereas the modified one works on the underlying array, as listed in the post.
    – Divakar
    Commented Feb 7, 2019 at 8:38
  • Oh I see. It's hard to notice the change from return a[...] to return ar[...]. Does your function work for dataframes?
    – Biarys
    Commented Feb 8, 2019 at 13:01
  • @Biarys For dataframes, if you are looking for consecutive duplicate rows, we simply need to slice along the rows, ar[:-1] != ar[1:], along with an ALL reduction across the columns (keeping rows that are not all-equal to the previous row).
    – Divakar
    Commented Feb 8, 2019 at 15:32
  • Thanks. I'll try that
    – Biarys
    Commented Feb 9, 2019 at 17:30
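
A hedged sketch of the row-wise dataframe variant described in the comments above (my own interpretation: a row is a consecutive duplicate when it equals the previous row in every column, so we keep rows that differ in any column):

import numpy as np
import pandas as pd

def drop_consecutive_duplicate_rows(df):
    ar = df.values
    # keep a row if it differs from the previous row in any column;
    # the leading True always keeps the first row
    mask = np.concatenate(([True], (ar[:-1] != ar[1:]).any(axis=1)))
    return df[mask]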

Here is a function that handles both pd.Series and pd.DataFrame objects. You can mask or drop, choose the axis, and finally choose whether to drop rows with 'any' or 'all' NaN values. It is not optimized for computation time, but it has the advantage of being robust and fairly clear.

import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                             keep_first=True,
                                             axis=0, how='all'):

    '''
    Function built with the help of:
    1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates

    Input:
    df should be a pandas.DataFrame or a pandas.Series
    Output:
    df or ts with masked or dropped values
    '''
    
    # Mask keeping the first occurrence
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))

    # Drop the masked values (i.e. rows/columns are deleted)
    if drop:
        return df.dropna(axis=axis, how=how)
    # Only mask the values (i.e. they become NaN)
    else:
        return df

Here is test code to include in the script:


if __name__ == "__main__":
    
    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')], 
                    index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
    
    print("#Original ts:")    
    print(ts)

    print("\n## 1) Mask keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                   keep_first=True))

    print("\n## 2) Mask including the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                   keep_first=False))
    
    print("\n## 3) Drop keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                   keep_first=True))
    
    print("\n## 4) Drop including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                   keep_first=False))
    
    
    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23
        
    print("#Original df:")
    print(df)

    print("\n## 5) Mask keeping the first occurrence:") 
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                   keep_first=True))

    print("\n## 6) Mask including the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                   keep_first=False))
    
    print("\n## 7) Drop 'any' keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=True,
                                                   how='any'))
    
    print("\n## 8) Drop 'all' keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=True,
                                                   how='all'))
    
    print("\n## 9) Drop 'any' including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=False,
                                                   how='any'))

    print("\n## 10) Drop 'all' including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=False,
                                                   how='all'))

And here is the expected result:

With time series:

#Original ts:
0     1.0
1     1.0
2     2.0
3     2.0
4     3.0
5     2.0
6     6.0
7     6.0
8     NaN
9     6.0
10    6.0
11    NaN
12    NaN
dtype: float64

## 1) Mask keeping the first occurrence:
0     1.0
1     NaN
2     2.0
3     NaN
4     3.0
5     2.0
6     6.0
7     NaN
8     NaN
9     6.0
10    NaN
11    NaN
12    NaN
dtype: float64

## 2) Mask including the first occurrence:
0     NaN
1     NaN
2     NaN
3     NaN
4     3.0
5     2.0
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
dtype: float64

## 3) Drop keeping the first occurrence:
0    1.0
2    2.0
4    3.0
5    2.0
6    6.0
9    6.0
dtype: float64

## 4) Drop including the first occurrence:
4    3.0
5    2.0
dtype: float64
With dataframe:

#Original df:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5   40.000000  -0.470958 -0.339213
6   40.000000   1.613524  0.271641
7   40.000000  -1.810958 -1.568372
8   40.000000  22.000000  0.230000
9   -0.296557  22.000000  0.230000
10  -0.921238  22.000000  0.230000
11  -0.170195  22.000000  0.230000
12   1.460457  22.000000 -0.295418
13   0.307825  22.000000 -0.759131
14   0.287392  22.000000  0.378315

## 5) Mask keeping the first occurrence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 6) Mask including the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
8        NaN       NaN       NaN
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

## 7) Drop 'any' keeping the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4  40.000000  1.889781 -1.394573

## 8) Drop 'all' keeping the first occurrence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 9) Drop 'any' including the first occurrence:
          0         1         2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742  1.891365
2  1.009388  0.589445  0.927405
3  0.212746 -0.392314 -0.781851

## 10) Drop 'all' including the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

  • You could also avoid explicitly checking the value: if keep_first: is enough (and better style). Commented Mar 4, 2020 at 1:11

For other Stack explorers, building off johnml1135's answer above: this will remove the next duplicate across multiple columns without dropping any columns. When the dataframe is sorted, it will keep the first row but drop the second row if the "cols" match, even if there are more columns with non-matching information.

cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
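
To illustrate the point about non-matching extra columns (my own example): compare only a subset of columns, and a row whose cols match the previous row is dropped even though another column differs:

import pandas as pd

df = pd.DataFrame({
    "col1": [1, 1, 1],
    "col2": ["x", "x", "y"],
    "other": [10, 20, 30],   # not in cols, never compared
})
cols = ["col1", "col2"]
df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
# row 1 is dropped: it matches row 0 on col1 and col2, even though "other" differs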

Just another way of doing it:

a.loc[a.ne(a.shift())]

The method pandas.Series.ne is the not equal operator, so a.ne(a.shift()) is equivalent to a != a.shift(). Documentation here.
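
As a quick check (my own snippet), both spellings give the same result:

import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
assert a.loc[a.ne(a.shift())].equals(a.loc[a != a.shift()])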

  • This one is nice because people typically also want to drop if there are missing values but NaN == NaN evaluates to False. With .ne(fill_value=12356789.345326) you can still drop them. It just needs to be a float not in your data (easy if your data is ints or strings).
    – grofte
    Commented Jun 28, 2023 at 9:57

Here's a variant of EdChum's answer that treats consecutive NaNs as duplicates, too:

def remove_consecutive_duplicates_and_nans(s):
    # By default, `shift` uses NaN as a fill value, which breaks our
    # removal of consecutive NaNs. Hence we use a different sentinel
    # object instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[
        (shifted != s)
        & ~(shifted.isna() & s.isna())
    ]
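
A quick sanity check (my own example, not part of the original answer):

import numpy as np
import pandas as pd

s = pd.Series([1, 1, np.nan, np.nan, 2])
remove_consecutive_duplicates_and_nans(s)
# Because shift(-1) compares each value with the NEXT one, this keeps
# the last element of each run, with the NaN run collapsed to one NaN:
# 1    1.0
# 3    NaN
# 4    2.0
# dtype: float64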

Create a new column flagging rows that equal the previous row:

df['match'] = df.col1.eq(df.col1.shift())

Then:

df = df[df['match']==False]
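
The same filter can be spelled with negation instead of comparing against False (my suggestion, same behavior):

df = df[~df['match']]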

The approach recommended by @johnml1135 does not work for me.

I figured out a similar approach though:

cols = ['Position', 'Offset']
df = df[df[cols] != df[cols].shift(-1)].dropna()

shift(-1) will keep the last duplicated row, and shift(1) will keep the first row.
