python pandas dataframe select
Updated Sat, 23 Jul 2022 17:47:21 GMT

Selecting multiple columns in a Pandas dataframe


How do I select columns a and b from df, and save them into a new dataframe df1?

index  a   b   c
1      2   3   4
2      3   4   5

Unsuccessful attempts:

df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']



Solution

The column names (which are strings) cannot be sliced in the manner you tried.

Here you have a couple of options. If you know from context which columns you want, you can select just those columns by passing a list of their names into the __getitem__ syntax (the []'s).

df1 = df[['a', 'b']]
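
As a minimal sketch of the list-based selection, assuming a small frame built to mirror the table in the question (the construction of df here is an illustration, not part of the original post):

```python
import pandas as pd

# Hypothetical frame mirroring the question's table
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# A list of labels inside [] selects those columns, in the order listed
df1 = df[["a", "b"]]

print(list(df1.columns))  # ['a', 'b']
print(df1.shape)          # (2, 2)
```

Note the double brackets: the outer pair is the indexing operator and the inner pair is the list of column names.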

Alternatively, if you need to select columns by position rather than by name (say your code should pick the first two columns without knowing what they are called), you can do this instead:

df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
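
The positional slice can be sketched as follows, again assuming a hypothetical frame shaped like the one in the question. For comparison, the sketch also shows label-based slicing through .loc, where both endpoints are included:

```python
import pandas as pd

# Hypothetical frame mirroring the question's table
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# Positional slice: columns at positions 0 and 1 (stop position 2 is excluded)
by_position = df.iloc[:, 0:2]

# Label slice with .loc: every column from 'a' through 'b', endpoints included
by_label = df.loc[:, "a":"b"]

print(list(by_position.columns))  # ['a', 'b']
print(list(by_label.columns))     # ['a', 'b']
```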

Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).

Sometimes, however, an indexing operation in Pandas does not copy and instead gives you a new variable that refers to the same chunk of memory as the corresponding part of the original object. When this happens, changing what you think is the sliced object can alter the original object. This can happen with the second way of indexing, so add the .copy() method to get an independent copy. It is always good to be on the lookout for this.

df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
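
A short sketch of why the explicit copy is useful, assuming the same hypothetical frame as in the question: after .copy(), writing to df1 leaves df untouched.

```python
import pandas as pd

# Hypothetical frame mirroring the question's table
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# Take all rows and the first two columns, then force an independent copy
df1 = df.iloc[:, 0:2].copy()

# Modify the copy only
df1.loc[1, "a"] = 100

print(df1.loc[1, "a"])  # 100
print(df.loc[1, "a"])   # 2 (the original is unchanged)
```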

To use iloc, you need to know the column positions (or indices). Because column positions may change, instead of hard-coding them you can look them up with the get_loc method of the dataframe's columns attribute (an Index object) and build a dictionary that maps each column name to its position.

{c: df.columns.get_loc(c) for c in df.columns}

Now you can use this dictionary to access columns through names and using iloc.
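
A sketch of that lookup, assuming a dictionary keyed by column name with positions as values and unique column names (the variable name positions and the sample frame are hypothetical):

```python
import pandas as pd

# Hypothetical frame mirroring the question's table
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# Map each column name to its integer position
positions = {name: df.columns.get_loc(name) for name in df.columns}

# Use the looked-up positions with iloc instead of hard-coded integers
df1 = df.iloc[:, [positions["a"], positions["b"]]]

print(positions)          # {'a': 0, 'b': 1, 'c': 2}
print(list(df1.columns))  # ['a', 'b']
```

If column names are duplicated, get_loc does not return a single integer, so this pattern only fits frames with unique column names.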





Comments (5)

  • +0 – Note: df[['a','b']] produces a copy — Jul 08, 2012 at 17:54  
  • +1 – Yes this was implicit in my answer. The bit about the copy was only for use of ix[] if you prefer to use ix[] for any reason. — Jul 08, 2012 at 18:09  
  • +0 – ix accepts slice arguments, so you can also get columns. For example, df.ix[0:2, 0:2] gets the upper left 2x2 sub-array just like it does for a NumPy matrix (depending on your column names of course). You can even use the slice syntax on string names of the columns, like df.ix[0, 'Col1':'Col5']. That gets all columns that happen to be ordered between Col1 and Col5 in the df.columns array. It is incorrect to say that ix indexes rows. That is just its most basic use. It also supports much more indexing than that. So, ix is perfectly general for this question. — Oct 31, 2012 at 19:02  
  • +7 – @AndrewCassidy Never use .ix again. If you want to slice with integers use .iloc which is exclusive of the last position just like Python lists. — Jul 01, 2017 at 13:55  
  • +8 – @dte324 If your DataFrame is named df then use df.iloc[:, [1, 4]]. Usually if you want this type of access pattern, you'll already know these particular column names, and you can just use df.loc[:, ['name2', 'name5']] where 'name2' and 'name5' are your column string names for the respective columns you want, or look the names up with e.g. name2 = df.columns[1]. — Sep 12, 2019 at 12:19  

