Programming
pandas numpy multiple-columns rename
Updated Wed, 24 Aug 2022 21:42:15 GMT

KeyError after renaming a column


I have a DataFrame:

import pandas as pd

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})
print (df)
   a  b  c
0  7  1  5
1  8  3  3
2  9  5  6

Then I rename the first column like this:

df.columns.values[0] = 'f'

Everything looks fine:

print (df)
   f  b  c
0  7  1  5
1  8  3  3
2  9  5  6
print (df.columns)
Index(['f', 'b', 'c'], dtype='object')
print (df.columns.values)
['f' 'b' 'c']

Selecting b works as expected:

print (df['b'])
0    1
1    3
2    5
Name: b, dtype: int64

But selecting a still works, and it returns the column that is now labelled f:

print (df['a'])
0    7
1    8
2    9
Name: f, dtype: int64

And selecting f raises a KeyError:

print (df['f'])
#KeyError: 'f'
print (df.info())
#KeyError: 'f'

What is the problem? Can somebody explain it, or is it a bug?




Solution

You aren't expected to alter the values attribute.

Try df.columns.values = ['a', 'b', 'c'] and you get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-e7e440adc404> in <module>()
----> 1 df.columns.values = ['a', 'b', 'c']
AttributeError: can't set attribute

That's because pandas detects that you are trying to set the attribute and stops you.
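A minimal sketch of that check, assuming a reasonably recent pandas where values is a read-only property of the Index (the exact error message varies by version). It also shows that assigning to df.columns itself, rather than to its values, is accepted and keeps lookups consistent:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})

# Assigning to the values property is rejected by pandas.
try:
    df.columns.values = ['f', 'b', 'c']
    blocked = False
except AttributeError:
    blocked = True

# Assigning to df.columns replaces the whole Index object,
# so label lookups are rebuilt from the new labels.
df.columns = ['f', 'b', 'c']
f_values = df['f'].tolist()
```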

However, it can't stop you from changing the underlying values object itself.

When you use rename, pandas follows up with several cleanup steps, such as clearing the item cache. I've pasted the source below.

Ultimately, what you've done is alter the values without triggering that cleanup. You can trigger it yourself with a follow-up call to _data.rename_axis (an example can be seen in the source below). Note that _data is a private attribute, so this may not work across pandas versions. The call forces the cleanup to run, after which you can access ['f']:

df._data = df._data.rename_axis(lambda x: x, 0, True)
df['f']
0    7
1    8
2    9
Name: f, dtype: int64

Moral of the story: it's probably not a great idea to rename a column this way.
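For comparison, here is a short sketch of the same rename done through the public rename method, which goes through the cleanup path discussed above. The variable names (renamed, f_values, old_label_present) are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
print(df)  # displaying the frame first is harmless here

# rename returns a new DataFrame whose column labels are mapped a -> f.
renamed = df.rename(columns={'a': 'f'})

f_values = renamed['f'].tolist()
old_label_present = 'a' in renamed.columns
```

Because rename builds a new set of column labels instead of mutating the existing array, the old label a no longer resolves and f does.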


But this story gets weirder

This is fine:

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})
df.columns.values[0] = 'f'
df['f']
0    7
1    8
2    9
Name: f, dtype: int64

This is not fine:

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})
print(df)
df.columns.values[0] = 'f'
df['f']
KeyError:

It turns out that we can modify the values attribute before displaying df, and pandas will apparently run all of its initialization on the first display. If you display df before changing the values attribute, the lookup of the new label errors out.

Weirder still

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})
print(df)
df.columns.values[0] = 'f'
df['f'] = 1
df['f']
   f  f
0  7  1
1  8  1
2  9  1

As if we didn't already know that this was a bad idea...


Source for rename

def rename(self, *args, **kwargs):
    axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
    copy = kwargs.pop('copy', True)
    inplace = kwargs.pop('inplace', False)
    if kwargs:
        raise TypeError('rename() got an unexpected keyword '
                        'argument "{0}"'.format(list(kwargs.keys())[0]))
    if com._count_not_none(*axes.values()) == 0:
        raise TypeError('must pass an index to rename')
    # renamer function if passed a dict
    def _get_rename_function(mapper):
        if isinstance(mapper, (dict, ABCSeries)):
            def f(x):
                if x in mapper:
                    return mapper[x]
                else:
                    return x
        else:
            f = mapper
        return f
    self._consolidate_inplace()
    result = self if inplace else self.copy(deep=copy)
    # start in the axis order to eliminate too many copies
    for axis in lrange(self._AXIS_LEN):
        v = axes.get(self._AXIS_NAMES[axis])
        if v is None:
            continue
        f = _get_rename_function(v)
        baxis = self._get_block_manager_axis(axis)
        result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
        result._clear_item_cache()
    if inplace:
        self._update_inplace(result._data)
    else:
        return result.__finalize__(self)
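The nested _get_rename_function helper in the source above turns a dict-like mapper into a plain function and passes a callable through unchanged. A standalone sketch of that idea follows; make_rename_function is a hypothetical name, and collections.abc.Mapping stands in for the (dict, ABCSeries) check as an assumption, so this is not the pandas implementation itself:

```python
from collections.abc import Mapping


def make_rename_function(mapper):
    """Hypothetical stand-in for the nested helper in the rename source."""
    if isinstance(mapper, Mapping):
        # Dict-like mapper: return the mapped label, or the label itself
        # when it has no entry in the mapping.
        def f(label):
            return mapper.get(label, label)
        return f
    # Anything else is assumed to already be a callable.
    return mapper


labels = ['a', 'b', 'c']
rename_from_dict = make_rename_function({'a': 'f'})
new_labels = [rename_from_dict(label) for label in labels]
upper_labels = [make_rename_function(str.upper)(label) for label in labels]
```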




Comments (5)

  • +4 – Very interesting research! - Apr 08, 2017 at 11:36
  • +2 – I am wondering how print can cause this difference. Do you have any idea why? I've never seen it before. - Apr 09, 2017 at 06:07
  • +0 – @jezrael my theory is that there is initialization that happens upon the first print. - Apr 09, 2017 at 15:45
  • +0 – But is it a bug, because of the influence of print? I think that is impossible, but maybe I am wrong. - Apr 09, 2017 at 15:47
  • +0 – @jezrael when print is called, it calls the repr method. At that point I'm guessing pandas runs some caching scripts if they haven't run before. - Apr 09, 2017 at 16:01

