I am getting a ValueError: cannot reindex from a duplicate axis when I try to set an index to a certain value. I tried to reproduce this with a simple example, but I could not.
Here is my session inside an ipdb trace. I have a DataFrame with a string index, integer columns, and float values. However, when I try to create a sums index holding the sum of all columns, I get the ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics but was not able to reproduce the problem. What could I be missing?
I don’t really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Knowing that might help me diagnose the problem, and it is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype="int64")
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014',
       u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027',
       u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040',
       u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053',
       u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063',
       u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075',
       u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089',
       u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100',
       u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111',
       u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122',
       ...], dtype="object")
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed:
In : import pandas as pd
In : import numpy as np
In : a = np.arange(35).reshape(5,7)
In : df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In : df.values.dtype
Out: dtype('int64')
In : df.loc['sums'] = df.sum(axis=0)
In : df
Out:
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
This error usually arises when you join or assign to a column while the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.
As others have said, you’ve probably got duplicate values in your original index. To find them do this:
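The snippet this answer refers to did not survive on this page. As a minimal sketch of one way to look for repeated labels (not necessarily the answerer's exact code), the following uses a small hypothetical frame whose index repeats '047', which is the one label that appears twice in the index printed in the question:

```python
import pandas as pd

# Hypothetical frame whose index repeats the label '047'
df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=["046", "047", "047", "048"])

# False means at least one label occurs more than once
index_is_unique = df.index.is_unique

# duplicated(keep=False) flags every occurrence of a repeated label
duplicate_rows = df[df.index.duplicated(keep=False)]

# The distinct labels that are repeated
duplicate_labels = df.index[df.index.duplicated()].unique().tolist()

print(index_is_unique)
print(duplicate_rows)
print(duplicate_labels)
```

The same checks work on df.columns if the duplicates are suspected on the column axis instead.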
Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don’t care about preserving the values of your index, and you want them to be unique values, when you concatenate the data, set
Alternatively, to overwrite your current index with a new one, instead of using
df.index = new_index
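The setting named in this answer is missing from the page. A hedged sketch, assuming it refers to the ignore_index=True option of pd.concat, with hypothetical frames part1 and part2:

```python
import pandas as pd

# Two hypothetical frames that share the label 'b'
part1 = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])
part2 = pd.DataFrame({"v": [3, 4]}, index=["b", "c"])

# Plain concatenation keeps both 'b' labels, so the index is no longer unique
stacked = pd.concat([part1, part2])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([part1, part2], ignore_index=True)

# Overwriting the index of an existing frame with fresh, unique labels
relabelled = stacked.copy()
relabelled.index = range(len(relabelled))

print(stacked.index.is_unique, renumbered.index.tolist(), relabelled.index.tolist())
```

Both approaches throw away the original labels, so they only fit when those labels carry no information you need later.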
For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:
df = df.loc[:,~df.columns.duplicated()]
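To see what that one-liner does, here is a small sketch on a hypothetical frame in which the column name 'x' appears twice:

```python
import pandas as pd

# Hypothetical frame with the column name 'x' used twice
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["x", "y", "x"])

# columns.duplicated() marks the second and later occurrences of a name
mask = df.columns.duplicated()

# Keep only the first occurrence of each column name
deduped = df.loc[:, ~mask]

print(mask.tolist())
print(deduped)
```

Note that this keeps the first column of each name and silently discards the rest, so it is worth checking that the discarded columns really are redundant.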
Simply skip the error by adding .values at the end:
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
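This works because a bare array carries no labels, so pandas has nothing to reindex and fills the values by position. A minimal sketch of the difference, using hypothetical data (to_numpy() is the newer spelling of .values):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])

# Hypothetical Series with the same labels in a different order
other = pd.Series([10, 20, 30], index=["r3", "r2", "r1"])

# Assigning the Series aligns on labels: 'r1' receives 30, not 10
df["aligned"] = other

# Assigning the raw array ignores labels and fills by position
df["positional"] = other.to_numpy()

print(df)
```

Positional assignment only makes sense when the array has the right length and is already in the intended order, so it hides the duplicate-label problem rather than fixing it.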
Try running this before grouping
Thank you to this GitHub comment for the solution.
inplace=True if you want it to return the dataframe.
I came across this error today when I wanted to add a new column like this:
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
I wanted to process the REMARK column of df_temp to return 1 or 0. However, I typed the wrong variable, df, and it returned an error like this:
----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   2417         else:
   2418             # set column
-> 2419             self._set_item(key, value)
   2420
   2421     def _setitem_slice(self, key, value):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
   2483
   2484         self._ensure_valid_index(value)
-> 2485         value = self._sanitize_column(key, value)
   2486         NDFrame._set_item(self, key, value)
   2487

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
   2633
   2634         if isinstance(value, Series):
-> 2635             value = reindexer(value)
   2636
   2637         elif isinstance(value, DataFrame):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
   2625                     # duplicate axis
   2626                     if not value.index.is_unique:
-> 2627                         raise e
   2628
   2629                     # other

ValueError: cannot reindex from a duplicate axis
As you can see, the correct code should be:
df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
df and df_temp have a different number of rows, so it returned ValueError: cannot reindex from a duplicate axis.
I hope this is understandable and helps other people debug their code.
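A minimal sketch of how this kind of mix-up can trigger the error, using hypothetical data. It assumes, as the is_unique check in the traceback suggests, that the wrongly used frame df has a repeated index label. notna() stands in for the lambda above, and the exact wording of the message varies between pandas versions:

```python
import pandas as pd

df_temp = pd.DataFrame({"REMARK": ["ok", None, "late"]}, index=[0, 1, 2])

# Hypothetical second frame whose index repeats the label 0
df = pd.DataFrame({"REMARK": ["a", "b", "c", "d"]}, index=[0, 0, 1, 2])

# Series built from the wrong frame; it inherits df's non-unique index
flags = df["REMARK"].notna().astype(int)

try:
    # pandas has to reindex flags to df_temp.index, which is ambiguous for label 0
    df_temp["REMARK_TYPE"] = flags
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Building the column from df_temp itself needs no reindexing
df_temp["REMARK_TYPE"] = df_temp["REMARK"].notna().astype(int)
print(df_temp)
```

When the Series and the target frame already share the same index, no reindexing happens, which is why the corrected line runs cleanly.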
In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a DataFrame: both had the same index, but the Series had fewer rows (missing the top few). The following worked for my purposes:
df.head()
                         SensA
date
2018-04-03 13:54:47.274  -0.45
2018-04-03 13:55:46.484  -0.42
2018-04-03 13:56:56.235  -0.37
2018-04-03 13:57:57.207  -0.34
2018-04-03 13:59:34.636  -0.33

series.head()
date
2018-04-03 14:09:36.577    62.2
2018-04-03 14:10:28.138    63.5
2018-04-03 14:11:27.400    63.1
2018-04-03 14:12:39.623    62.6
2018-04-03 14:13:27.310    62.5
Name: SensA_rrT, dtype: float64

df = series.to_frame().combine_first(df)

df.head(10)
                         SensA  SensA_rrT
date
2018-04-03 13:54:47.274  -0.45        NaN
2018-04-03 13:55:46.484  -0.42        NaN
2018-04-03 13:56:56.235  -0.37        NaN
2018-04-03 13:57:57.207  -0.34        NaN
2018-04-03 13:59:34.636  -0.33        NaN
2018-04-03 14:00:34.565  -0.33        NaN
2018-04-03 14:01:19.994  -0.37        NaN
2018-04-03 14:02:29.636  -0.34        NaN
2018-04-03 14:03:31.599  -0.32        NaN
2018-04-03 14:04:30.779  -0.33        NaN
2018-04-03 14:05:31.733  -0.35        NaN
2018-04-03 14:06:33.290  -0.38        NaN
2018-04-03 14:07:37.459  -0.39        NaN
2018-04-03 14:08:36.361  -0.36        NaN
2018-04-03 14:09:36.577  -0.37       62.2
I wasted a couple of hours on the same issue. In my case, I had to call reset_index() on the DataFrame before using the apply function.
Before merging with, or looking up values from, another indexed dataset, you need to reset the index, since one dataset can have only one index.
I got this error when I tried adding a column from a different table. Indeed, I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
This pointer might help someone in a similar situation.
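As a rough illustration of that pointer, here is a sketch with hypothetical tables left and right. Unlike column assignment, DataFrame.join matches rows on the index and accepts repeated labels in the other table, producing one output row per match:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Hypothetical other table: 'x' appears twice and 'y' is missing
right = pd.DataFrame({"b": [10, 11, 30]}, index=["x", "x", "z"])

# Left join on the index: 'x' yields two rows, 'y' gets NaN in column b
joined = left.join(right)

print(joined)
```

The row count grows when labels repeat, so it is worth checking len(joined) against len(left) after joining.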
This can also be a cause of this error :) I solved my problem like this:
It can also happen if you try to insert a DataFrame-typed column into a DataFrame. You can try this:
If you get this error after merging two DataFrames, removing the suffixes, and trying to write to Excel:
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being ‘_x’ on the left and ‘_y’ on the right.
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
    df,
    tempdf[what_i_care_about],
    on=['myid', 'myorder'],
    how='outer',
    suffixes=('', '_delete_suffix')  # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]
Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.
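A minimal sketch of that alternative, reusing the key names myid and myorder from the example above. The frames and the price and qty columns are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"myid": [1, 2], "myorder": [1, 1], "price": [9.5, 3.0]})
right = pd.DataFrame(
    {"myid": [1, 2], "myorder": [1, 1], "price": [0.0, 0.0], "qty": [4, 7]}
)

keys = ["myid", "myorder"]

# Columns present on both sides that are not merge keys would receive suffixes
clashing = [c for c in right.columns if c in left.columns and c not in keys]

# Drop them from the right side so the left-hand values survive unsuffixed
merged = pd.merge(left, right.drop(columns=clashing), on=keys, how="outer")

print(merged)
```

Because no suffixes are ever added, there is nothing to strip afterwards and no chance of ending up with two columns of the same name.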
Just add .to_numpy() to the end of the series you want to concatenate.
In my case it was caused by a mismatch in dimensions:
accidentally using a column from a different df during the