Using pandas describe method to get dataframe summary

Data Analysts often use pandas describe method to get high level summary from dataframe. Pandas describe method plays a very critical role to understand data distribution of each column.

For descriptive summary statistics like average, standard deviation and quantile values we can use pandas describe function. Let’s understand this function with the help of some examples.

Import necessary libraries

In this post also, we will use AirBnB dataframe example to understand the statistics. Here is a small sample dataframe.

Create AirBnB dataframe

accommodateshost_namesjoining_yearlisting_countreview_count
02Maarten2014125.0
12Fronk & Lieve20127NaN
23Elke201314.0
32Francis2012259.0
42Kristof201213.0
54Liesbet2011189.0
62Chloé201413.0
75Klara201412.0
84Maarten201217.0
92Kristina20141NaN

Get summary using pandas describe method

accommodatesjoining_yearlisting_countreview_count
count10.00000010.00000010.0000008.000000
mean2.8000002012.8000001.70000024.000000
std1.1352921.1352921.88856232.745774
min2.0000002011.0000001.0000002.000000
25%2.0000002012.0000001.0000003.000000
50%2.0000002012.5000001.0000005.500000
75%3.7500002014.0000001.00000033.500000
max5.0000002014.0000007.00000089.000000

In above statistical summary, we can see different columns which are generally of interest for any Data Analyst. It shows us minimum, maximum, average, standard deviation as well as quantile values with respect to each numeric column.

Still there are certain summary columns like “count of unique values” which are not available in above dataframe. To get full summary, we should pass include=’all’ option to pandas describe method.

accommodateshost_namesjoining_yearlisting_countreview_count
count10.0000001010.00000010.0000008.000000
uniqueNaN9NaNNaNNaN
topNaNMaartenNaNNaNNaN
freqNaN2NaNNaNNaN
mean2.800000NaN2012.8000001.70000024.000000
std1.135292NaN1.1352921.88856232.745774
min2.000000NaN2011.0000001.0000002.000000
25%2.000000NaN2012.0000001.0000003.000000
50%2.000000NaN2012.5000001.0000005.500000
75%3.750000NaN2014.0000001.00000033.500000
max5.000000NaN2014.0000007.00000089.000000

Looking at above summary dataframe, we can see some additional columns. Following is the detail with respect to each row in above dataframe.

count:

  • Count of values excluding NaN
  • Computed for all columns

unique:

  • Count of unique values excluding NaN
  • Computed only for categorical (non numeric) type of columns (or series)

top:

  • Most commonly occuring value among all values in a column (or series)
  • Computed only for categorical (non numeric) type of columns (or series)

freq:

  • Frequency (or count of occurance) of most commonly occuring value among all values in a column (or series)
  • Computed only for categorical (non numeric) type of columns (or series)

mean:

  • Mean (Average) of all numeric values in a column (or series)
  • Computed only for numeric type of columns (or series)

std:

  • Standard Deviation of all numeric values in a column (or series)
  • Computed only for numeric type of columns (or series)

min:

  • Minimum value of all numeric values in a column (or series)
  • Computed only for numeric type of columns (or series)

25%, 50%, 75%:

  • Given percentile values (quantile 1, 2 and 3 respectively) of all numeric values in a column (or series)
  • Computed only for numeric type of columns (or series)

max:

  • Maximum value of all numeric values in a column (or series)
  • Computed only for numeric type of columns (or series)

We can simply use pandas transpose method to swap the rows and columns.

Transposed summary of a pandas dataframe

countuniquetopfreqmeanstdmin25%50%75%max
accommodates10NaNNaNNaN2.81.135292223.755
host_names109Maarten2NaNNaNNaNNaNNaNNaNNaN
joining_year10NaNNaNNaN2012.81.13529201120122012.520142014
listing_count10NaNNaNNaN1.71.8885611117
review_count8NaNNaNNaN2432.7458235.533.589

Pandas describe only Categorical or only Numeric Columns

Summary dataframe will only include numerical columns if we pass exclude=’O’ as parameter.  Moreover, if we are interested only in categorical columns, we should pass include=’O’.

host_names
count9
unique8
topMaarten
freq2

We just have host_name column as categorical or non numeric column so we just got that column in summary.

Include different quantiles for numeric columns in summary

In cases, data analysts are also interested in 10 as well as 90 percentile values. To add those in summary we can pass list of percentiles using ‘percentiles’ parameter.

accommodateshost_namesjoining_yearlisting_countreview_count
count10.000000910.00000010.0000008.000000
uniqueNaN8NaNNaNNaN
topNaNMaartenNaNNaNNaN
freqNaN2NaNNaNNaN
mean2.800000NaN2012.8000001.70000024.000000
std1.135292NaN1.1352921.88856232.745774
min2.000000NaN2011.0000001.0000002.000000
10%2.000000NaN2011.9000001.0000002.700000
25%2.000000NaN2012.0000001.0000003.000000
50%2.000000NaN2012.5000001.0000005.500000
75%3.750000NaN2014.0000001.00000033.500000
90%4.100000NaN2014.0000002.50000068.000000
max5.000000NaN2014.0000007.00000089.000000

Thanks for reading and stay tuned for more posts on Data Wrangling…!!!!!

 

Leave a Reply

Your email address will not be published. Required fields are marked *