Data Analysts often use pandas describe method to get high level summary from dataframe. Pandas describe method plays a very critical role to understand data distribution of each column.
For descriptive summary statistics like average, standard deviation and quantile values we can use pandas describe function. Let’s understand this function with the help of some examples.
Import necessary libraries
1 2 | import pandas as pd import numpy as np |
Create AirBnB dataframe
1 2 3 4 5 6 7 8 9 10 11 12 13 | host_names = ['Maarten','Fronk & Lieve','Elke','Francis','Kristof', 'Liesbet','Chloé','Klara','Maarten','Kristina'] listing_count = [1, 7, 1, 2, 1, 1, 1, 1, 1, 1] review_count = [25, np.nan, 4, 59, 3, 89, 3, 2, 7, np.nan] accommodates = [2, 2, 3, 2, 2, 4, 2, 5, 4, 2] year=[2014, 2012, 2013, 2012, 2012, 2011, 2014, 2014, 2012, 2014] data = {'joining_year': year, 'listing_count': listing_count, 'review_count': review_count,'accommodates': accommodates, 'host_names' : host_names} host_df = pd.DataFrame(data) host_df |
accommodates | host_names | joining_year | listing_count | review_count | |
---|---|---|---|---|---|
0 | 2 | Maarten | 2014 | 1 | 25.0 |
1 | 2 | Fronk & Lieve | 2012 | 7 | NaN |
2 | 3 | Elke | 2013 | 1 | 4.0 |
3 | 2 | Francis | 2012 | 2 | 59.0 |
4 | 2 | Kristof | 2012 | 1 | 3.0 |
5 | 4 | Liesbet | 2011 | 1 | 89.0 |
6 | 2 | Chloé | 2014 | 1 | 3.0 |
7 | 5 | Klara | 2014 | 1 | 2.0 |
8 | 4 | Maarten | 2012 | 1 | 7.0 |
9 | 2 | Kristina | 2014 | 1 | NaN |
Get summary using pandas describe method
1 | host_df.describe() |
accommodates | joining_year | listing_count | review_count | |
---|---|---|---|---|
count | 10.000000 | 10.000000 | 10.000000 | 8.000000 |
mean | 2.800000 | 2012.800000 | 1.700000 | 24.000000 |
std | 1.135292 | 1.135292 | 1.888562 | 32.745774 |
min | 2.000000 | 2011.000000 | 1.000000 | 2.000000 |
25% | 2.000000 | 2012.000000 | 1.000000 | 3.000000 |
50% | 2.000000 | 2012.500000 | 1.000000 | 5.500000 |
75% | 3.750000 | 2014.000000 | 1.000000 | 33.500000 |
max | 5.000000 | 2014.000000 | 7.000000 | 89.000000 |
In above statistical summary, we can see different columns which are generally of interest for any Data Analyst. It shows us minimum, maximum, average, standard deviation as well as quantile values with respect to each numeric column.
Still there are certain summary columns like “count of unique values” which are not available in above dataframe. To get full summary, we should pass include=’all’ option to pandas describe method.
1 | host_df.describe(include='all') |
accommodates | host_names | joining_year | listing_count | review_count | |
---|---|---|---|---|---|
count | 10.000000 | 10 | 10.000000 | 10.000000 | 8.000000 |
unique | NaN | 9 | NaN | NaN | NaN |
top | NaN | Maarten | NaN | NaN | NaN |
freq | NaN | 2 | NaN | NaN | NaN |
mean | 2.800000 | NaN | 2012.800000 | 1.700000 | 24.000000 |
std | 1.135292 | NaN | 1.135292 | 1.888562 | 32.745774 |
min | 2.000000 | NaN | 2011.000000 | 1.000000 | 2.000000 |
25% | 2.000000 | NaN | 2012.000000 | 1.000000 | 3.000000 |
50% | 2.000000 | NaN | 2012.500000 | 1.000000 | 5.500000 |
75% | 3.750000 | NaN | 2014.000000 | 1.000000 | 33.500000 |
max | 5.000000 | NaN | 2014.000000 | 7.000000 | 89.000000 |
Looking at above summary dataframe, we can see some additional columns. Following is the detail with respect to each row in above dataframe.
count:
- Count of values excluding NaN
- Computed for all columns
unique:
- Count of unique values excluding NaN
- Computed only for categorical (non numeric) type of columns (or series)
top:
- Most commonly occuring value among all values in a column (or series)
- Computed only for categorical (non numeric) type of columns (or series)
freq:
- Frequency (or count of occurance) of most commonly occuring value among all values in a column (or series)
- Computed only for categorical (non numeric) type of columns (or series)
mean:
- Mean (Average) of all numeric values in a column (or series)
- Computed only for numeric type of columns (or series)
std:
- Standard Deviation of all numeric values in a column (or series)
- Computed only for numeric type of columns (or series)
min:
- Minimum value of all numeric values in a column (or series)
- Computed only for numeric type of columns (or series)
25%, 50%, 75%:
- Given percentile values (quantile 1, 2 and 3 respectively) of all numeric values in a column (or series)
- Computed only for numeric type of columns (or series)
max:
- Maximum value of all numeric values in a column (or series)
- Computed only for numeric type of columns (or series)
We can simply use pandas transpose method to swap the rows and columns.
Transposed summary of a pandas dataframe
1 2 3 | host_df.describe(include='all').transpose() # OR you can also use # host_df.describe(include='all').T |
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
accommodates | 10 | NaN | NaN | NaN | 2.8 | 1.13529 | 2 | 2 | 2 | 3.75 | 5 |
host_names | 10 | 9 | Maarten | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
joining_year | 10 | NaN | NaN | NaN | 2012.8 | 1.13529 | 2011 | 2012 | 2012.5 | 2014 | 2014 |
listing_count | 10 | NaN | NaN | NaN | 1.7 | 1.88856 | 1 | 1 | 1 | 1 | 7 |
review_count | 8 | NaN | NaN | NaN | 24 | 32.7458 | 2 | 3 | 5.5 | 33.5 | 89 |
Pandas describe only Categorical or only Numeric Columns
Summary dataframe will only include numerical columns if we pass exclude=’O’ as parameter. Moreover, if we are interested only in categorical columns, we should pass include=’O’.
1 2 3 4 5 6 7 8 9 10 11 | # OR you can also use # host_df.describe(include='all').T # unique, top, freq # Returns a Summary dataframe for numeric columns only # output will be same as host_df.describe() host_df.describe(exclude='O') # Returns a Summary dataframe # for object type (or categorical) columns only host_df.describe(include='O') |
host_names | |
---|---|
count | 9 |
unique | 8 |
top | Maarten |
freq | 2 |
We just have host_name column as categorical or non numeric column so we just got that column in summary.
Include different quantiles for numeric columns in summary
In cases, data analysts are also interested in 10 as well as 90 percentile values. To add those in summary we can pass list of percentiles using ‘percentiles’ parameter.
1 2 | # Adding few more percentile values in summary host_df.describe(percentiles=[.1, .25, .5, .75, .9], include='all') |
accommodates | host_names | joining_year | listing_count | review_count | |
---|---|---|---|---|---|
count | 10.000000 | 9 | 10.000000 | 10.000000 | 8.000000 |
unique | NaN | 8 | NaN | NaN | NaN |
top | NaN | Maarten | NaN | NaN | NaN |
freq | NaN | 2 | NaN | NaN | NaN |
mean | 2.800000 | NaN | 2012.800000 | 1.700000 | 24.000000 |
std | 1.135292 | NaN | 1.135292 | 1.888562 | 32.745774 |
min | 2.000000 | NaN | 2011.000000 | 1.000000 | 2.000000 |
10% | 2.000000 | NaN | 2011.900000 | 1.000000 | 2.700000 |
25% | 2.000000 | NaN | 2012.000000 | 1.000000 | 3.000000 |
50% | 2.000000 | NaN | 2012.500000 | 1.000000 | 5.500000 |
75% | 3.750000 | NaN | 2014.000000 | 1.000000 | 33.500000 |
90% | 4.100000 | NaN | 2014.000000 | 2.500000 | 68.000000 |
max | 5.000000 | NaN | 2014.000000 | 7.000000 | 89.000000 |
Thanks for reading and stay tuned for more posts on Data Wrangling…!!!!!
Leave a Reply