This post describes different ways of dropping columns of rows from pandas dataframe. While performing any data analysis task you often need to remove certain columns or entire rows which are not relevant. So let’s learn how to remove columns or rows using pandas drop function.
Here I have taken CSV file of airbnb hosts. Mainly because, airbnb data is generally well understood by most of the people. You can download this data from this link.
At the end of this post, you will learn:
- Pandas drop columns using column name array
- Removing all columns with NaN Values
- Removing all rows with NaN Values
- Pandas drop rows by index
- Dropping rows based on index range
- Removing top x rows from dataframe
- Removing bottom x rows from dataframe
So Let’s get started….
Import Necessary Libraries
1 2 | import numpy as np import pandas as pd |
Create pandas dataframe from AirBnB Hosts CSV file
Here we are reading dataframe using pandas.read_csv() method.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | hosts_df = pd.read_csv('data/airbnb-hosts.csv', parse_dates=['host_since']) columns = ['host_name', 'host_since', 'host_location', 'host_response_time', 'country', 'host_listings_count', 'host_total_listings_count', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet','number_of_reviews'] host_details_df = hosts_df[columns] print(host_details_df.shape) print(host_details_df.columns) # Output: # #>> (547, 17) #>> Index(['host_name', 'host_since', 'host_location', 'host_response_time', #>> 'country', 'host_listings_count', 'host_total_listings_count', #>> 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', #>> 'beds', 'bed_type', 'amenities', 'square_feet', 'number_of_reviews'], #>> dtype='object') |
Summary of Data
As number of columns in our dataframe are more, we are transposing the summary dataframe for reading. We will learn more about pandas describe() method in other post.
1 2 | summary_df = host_details_df.describe(include='all') summary_df.T |
count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
host_name | 547 | 375 | Maarten | 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
host_since | 547 | 398 | 2012-02-27 00:00:00 | 6 | 2009-10-05 00:00:00 | 2015-12-07 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
host_location | 546 | 61 | Antwerp, Flanders, Belgium | 314 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
host_response_time | 521 | 4 | within an hour | 240 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
country | 547 | 1 | Belgium | 547 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
host_listings_count | 547 | NaN | NaN | NaN | NaN | NaN | 1.97989 | 3.33721 | 1 | 1 | 1 | 2 | 51 |
host_total_listings_count | 547 | NaN | NaN | NaN | NaN | NaN | 1.97989 | 3.33721 | 1 | 1 | 1 | 2 | 51 |
property_type | 547 | 13 | Apartment | 393 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
room_type | 547 | 3 | Entire home/apt | 366 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
accommodates | 547 | NaN | NaN | NaN | NaN | NaN | 3.00914 | 1.60867 | 1 | 2 | 2 | 4 | 16 |
bathrooms | 545 | NaN | NaN | NaN | NaN | NaN | 1.09908 | 0.426919 | 0 | 1 | 1 | 1 | 8 |
bedrooms | 547 | NaN | NaN | NaN | NaN | NaN | 1.20475 | 0.658931 | 0 | 1 | 1 | 1 | 5 |
beds | 547 | NaN | NaN | NaN | NaN | NaN | 1.79525 | 1.32672 | 1 | 1 | 1 | 2 | 16 |
bed_type | 547 | 3 | Real Bed | 533 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
amenities | 547 | 515 | {} | 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
square_feet | 33 | NaN | NaN | NaN | NaN | NaN | 498.091 | 524.259 | 0 | 0 | 431 | 861 | 2153 |
number_of_reviews | 547 | NaN | NaN | NaN | NaN | NaN | 23.1974 | 31.7685 | 1 | 5 | 11 | 25 | 236 |
Pandas drop columns using column name array
In order to remove certain columns from dataframe, we can use pandas drop function. To remove one or more columns one should simple pass a list of columns.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | ## Removing single column sample_df = host_details_df.drop(['country'], axis=1) print('Columns after removing "country" column') print(sample_df.columns) print() ## Removing multiple columns sample_df = host_details_df.drop(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities'], axis=1) print('Columns after removing all review related columns') print(sample_df.columns) # Output: # #>> Columns after removing "country" column #>> Index(['host_name', 'host_since', 'host_location', 'host_response_time', #>> 'host_listings_count', 'host_total_listings_count', 'property_type', #>> 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', #>> 'bed_type', 'amenities', 'square_feet', 'number_of_reviews'], #>> dtype='object') #>> #>> Columns after removing all review related columns #>> Index(['host_name', 'host_since', 'host_location', 'host_response_time', #>> 'country', 'host_listings_count', 'host_total_listings_count', #>> 'property_type', 'room_type', 'square_feet', 'number_of_reviews'], #>> dtype='object') |
Removing all columns with NaN Values
To remove all columns with NaN value we can simple use pandas dropna function. By simply specifying axis=1 the function will remove all columns which has atleast one row value is NaN.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | print(host_details_df.shape) print(host_details_df.columns) sample_df = host_details_df.dropna(axis=1) print() print(sample_df.shape) print(sample_df.columns) # Output: # #>> (547, 17) #>> Index(['host_name', 'host_since', 'host_location', 'host_response_time', #>> 'country', 'host_listings_count', 'host_total_listings_count', #>> 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', #>> 'beds', 'bed_type', 'amenities', 'square_feet', 'number_of_reviews'], #>> dtype='object') #>> #>> (547, 13) #>> Index(['host_name', 'host_since', 'country', 'host_listings_count', #>> 'host_total_listings_count', 'property_type', 'room_type', #>> 'accommodates', 'bedrooms', 'beds', 'bed_type', 'amenities', #>> 'number_of_reviews'], #>> dtype='object') |
As we can see in above output, pandas dropna function has removed 4 columns which had one or more NaN values.
Removing all rows with NaN Values
Similar to above example pandas dropna function can also remove all rows in which any of the column contain NaN value. By simply specifying axis=0 function will remove all rows which has atleast one column value is NaN.
1 2 3 4 5 6 7 8 | print(host_details_df.shape) sample_df = host_details_df.dropna(axis=0) print(sample_df.shape) # Output: # #>> (547, 17) #>> (26, 17) |
Looking at the shape of output dataframe, it seems that it has just kept 26 rows with not null values.
Pandas drop rows by index
Firstly, let’s take few columns from the hosts dataframe and check it.
1 2 3 4 | sample_df = host_details_df[['host_name', 'host_since', 'host_location', 'host_response_time', 'host_listings_count']] sample_df.head(10) |
host_name | host_since | host_location | host_response_time | host_listings_count | |
---|---|---|---|---|---|
0 | Maarten | 2014-04-01 | Antwerp, Flanders, Belgium | within a few hours | 1 |
1 | Fronk & Lieve | 2012-02-27 | Antwerpen, Vlaams Gewest, Belgium | within an hour | 7 |
2 | Elke | 2013-05-11 | Alveringem, Flanders, Belgium | a few days or more | 1 |
3 | Francis | 2012-04-17 | Antwerpen, Flemish Region, Belgium | within an hour | 2 |
4 | Kristof | 2012-10-13 | Antwerpen, Flanders, Belgium | within an hour | 1 |
5 | Liesbet | 2011-01-29 | Antwerp, Flemish Region, Belgium | within an hour | 1 |
6 | Chloé | 2014-03-15 | Antwerp, Flanders, Belgium | within a day | 1 |
7 | Klara | 2014-05-22 | Antwerp, Flanders, Belgium | within an hour | 1 |
8 | Maarten | 2012-07-24 | Antwerp, Flanders, Belgium | within a few hours | 1 |
9 | Kristina | 2014-02-17 | Antwerp, Flanders, Belgium | within an hour | 1 |
Now we can use pandas drop function to remove few rows. To remove one or more rows from a dataframe, we need to pass the array indexes for the rows which need to be removed. Also the argument axis=0 specifies that pandas drop function is being used to drop the rows.
1 2 | final_df = sample_df.drop([0,1,2,3], axis=0) final_df.head(10) |
host_name | host_since | host_location | host_response_time | host_listings_count | |
---|---|---|---|---|---|
4 | Kristof | 2012-10-13 | Antwerpen, Flanders, Belgium | within an hour | 1 |
5 | Liesbet | 2011-01-29 | Antwerp, Flemish Region, Belgium | within an hour | 1 |
6 | Chloé | 2014-03-15 | Antwerp, Flanders, Belgium | within a day | 1 |
7 | Klara | 2014-05-22 | Antwerp, Flanders, Belgium | within an hour | 1 |
8 | Maarten | 2012-07-24 | Antwerp, Flanders, Belgium | within a few hours | 1 |
9 | Kristina | 2014-02-17 | Antwerp, Flanders, Belgium | within an hour | 1 |
10 | Fred | 2014-01-24 | Antwerp, Flanders, Belgium | within an hour | 2 |
11 | Francis | 2012-04-17 | Antwerpen, Flemish Region, Belgium | within an hour | 2 |
12 | Katrien | 2011-03-09 | Antwerpen, Flemish Region, Belgium | NaN | 1 |
13 | Alexandra | 2012-05-05 | Antwerp/London | within a few hours | 1 |
Dropping rows based on index range
1 2 3 4 | final_df = sample_df.drop(host_details_df.index[0:3], axis=0) final_df.head(10) ## Output will be same as above |
Removing top x rows from dataframe
1 2 | final_df = sample_df.drop(host_details_df.index[:5], axis=0) final_df.head(10) |
host_name | host_since | host_location | host_response_time | host_listings_count | |
---|---|---|---|---|---|
5 | Liesbet | 2011-01-29 | Antwerp, Flemish Region, Belgium | within an hour | 1 |
6 | Chloé | 2014-03-15 | Antwerp, Flanders, Belgium | within a day | 1 |
7 | Klara | 2014-05-22 | Antwerp, Flanders, Belgium | within an hour | 1 |
8 | Maarten | 2012-07-24 | Antwerp, Flanders, Belgium | within a few hours | 1 |
9 | Kristina | 2014-02-17 | Antwerp, Flanders, Belgium | within an hour | 1 |
10 | Fred | 2014-01-24 | Antwerp, Flanders, Belgium | within an hour | 2 |
11 | Francis | 2012-04-17 | Antwerpen, Flemish Region, Belgium | within an hour | 2 |
12 | Katrien | 2011-03-09 | Antwerpen, Flemish Region, Belgium | NaN | 1 |
13 | Alexandra | 2012-05-05 | Antwerp/London | within a few hours | 1 |
14 | Koosje | 2013-01-08 | Antwerpen, Flanders, Belgium | within a few hours | 1 |
Removing bottom x rows from dataframe
1 2 3 4 | final_df = sample_df.drop(host_details_df.index[:-5], axis=0) final_df.head(10) # Last 5 rows will be removed |
So finally we learned to use pandas drop function to remove columns and rows.
In case you have any queries regarding this post, reach out to me. You can ask your questions in the comments section below and stay tuned…!!
Leave a Reply