In any data science/data analysis work, the first step is to read CSV file (with pandas library). Pandas read_csv function is popular to load any CSV file in pandas. In this post we’ll explore various options of pandas read_csv function.

Pandas read_csv function has various options which help us to take care of certain things like formatting, handling null values etc. Let’s explore those options step by step.
In my earlier post, we discussed various ways to create dataframes from Lists and Dictionaries. Assuming we have different data-sources in the form of CSV files, following are the ways to read csv files and create pandas dataframe.
Load CSV data using default parameters
Let’s say we have some sample csv files at our /data/ directory.
1 2 3 4 5 6 | # Import Pandas Library import pandas as pd # Default implementation profit_per_df = pd.read_csv('\data\profit-perc.csv') profit_per_df.head() |
COUNTRY | Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|---|
0 | US | 10.0 | 9.5 | 7.6 | 12.0 |
1 | UK | 11.2 | 3.8 | 6.9 | 9.0 |
2 | India | 9.6 | 7.3 | 8.3 | 11.0 |
3 | Singapore | 9.0 | 5.6 | 6.9 | 10.0 |
In case we want to consider first row or headers as a row we can specify header=None option
1 2 3 | # Load CSV with no Headers profit_per_df = pd.read_csv('\data\profit-perc-no-headers.csv', header=None) profit_per_df.head() |
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | US | 10.0 | 9.5 | 7.6 | 12.0 |
1 | UK | 11.2 | 3.8 | 6.9 | 9.0 |
2 | India | 9.6 | 7.3 | 8.3 | 11.0 |
3 | Singapore | 9.0 | 5.6 | 6.9 | 10.0 |
Specify column names while loading CSV files
In case there are no headers available in CSV file, we can specify list of column names to names parameter.
With above option, pandas read_csv method will consider first row in csv as a row rather than column.
1 2 3 4 5 | col_names = ['COUNTRY','Q1','Q2','Q3','Q4'] profit_per_df = pd.read_csv('\data\profit-perc-no-headers.csv', names=col_names) profit_per_df.head() |
COUNTRY | Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|---|
0 | US | 10.0 | 9.5 | 7.6 | 12.0 |
1 | UK | 11.2 | 3.8 | 6.9 | 9.0 |
2 | India | 9.6 | 7.3 | 8.3 | 11.0 |
3 | Singapore | 9.0 | 5.6 | 6.9 | 10.0 |
Setting a column as an index with pandas read_csv
1 2 3 4 5 6 | col_names = ['COUNTRY','Q1','Q2','Q3','Q4'] profit_per_df = pd.read_csv('\data\profit-perc-no-headers.csv', names=col_names, index_col=['COUNTRY']) profit_per_df.head() |
Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|
COUNTRY | ||||
US | 10.0 | 9.5 | 7.6 | 12.0 |
UK | 11.2 | 3.8 | 6.9 | 9.0 |
India | 9.6 | 7.3 | 8.3 | 11.0 |
Singapore | 9.0 | 5.6 | 6.9 | 10.0 |
Using different separator in while loading CSV file
Many times we end-up in getting CSV files with some other separator like ‘;’ or ‘|’ or any other special character. In pandas, read_csv method provides an option delimiter to specify a separator. Let’s say we have co2 emission data separated by ‘;’. We can load our CSV file following way.
1 2 | co2_emission_df = pd.read_csv('\data\co2-emission.csv', delimiter=';') co2_emission_df.head() |
Industry | Share | |
---|---|---|
0 | Agriculture & mining | 10.5 |
1 | Manufacturing and construction | 17.0 |
2 | Energy and water supply and waste treatment | 25.9 |
3 | Households | 22.2 |
4 | Services | 12.4 |
5 | Transport sector | 12.0 |
Parsing dates while loading CSV file in pandas
Sometimes we get date fields in csv file. When we load CSV files with default options, dtype of date columns remains to be Object. In case we want our date columns to be parsed as date we can use parse_date option of read_csv method.
Let’s understand this using one TSV file jira-issues.tsv
1 2 3 | # Loading TSV file jira_issues_df = pd.read_csv('\data\jira-issues.tsv', delimiter='\t') print(jira_issues_df.dtypes) |
PRIORITY object
STATUS object
OPEN_DATE object
CLOSE_DATE object
dtype: object
We can see in above result dtype of OPEN_DATE and CLOSE_DATE columns is Object.
Now let’s try to load it with parse_dates option.
1 2 3 4 5 6 7 | # Loading TSV file with parse_dates option jira_issues_df = pd.read_csv('\data\jira-issues.tsv', delimiter='\t', parse_dates=['OPEN_DATE', 'CLOSE_DATE']) print(jira_issues_df.dtypes) jira_issues_df .head() |
PRIORITY object
STATUS object
OPEN_DATE datetime64[ns] CLOSE_DATE datetime64[ns] dtype: object
ISSUE | PRIORITY | STATUS | OPEN_DATE | CLOSE_DATE | |
---|---|---|---|---|---|
0 | PD-1023 | High | Open | 2018-01-03 | NaT |
1 | PD-1162 | High | Closed | 2018-02-05 | 2018-02-07 |
2 | PD-1231 | Medium | Closed | 2018-02-27 | 2018-03-02 |
3 | PD-1345 | Low | Open | 2018-03-12 | NaT |
Read empty values as blank string instead of NaN
If our csv data source has null values in it and we want to replace them with blank while loading CSV file, we can use keep_default_na option for the same. Let’s have a look at our previous dataframe.
1 2 3 4 5 | # Loading TSV file by replacing NaN with blank jira_issues_df = pd.read_csv('\data\jira-issues.tsv', delimiter='\t', keep_default_na='') jira_issues_df.head() |
ISSUE | PRIORITY | STATUS | OPEN_DATE | CLOSE_DATE | |
---|---|---|---|---|---|
0 | PD-1023 | High | Open | 2018-01-03 | |
1 | PD-1162 | High | Closed | 2018-02-05 | 2018-02-07 |
2 | PD-1231 | Medium | Closed | 2018-02-27 | 2018-03-02 |
3 | PD-1345 | Low | Open | 2018-03-12 |
Skip rows while reading csv file
Let’s say we need to skip first two rows while reading csv file in pandas. This is entirely possible with skiprows option.
1 2 3 4 5 6 | # This will skip first two rows along with headers headers co2_emission_df = pd.read_csv('\data\co2-emission.csv', delimiter=';', skiprows=2) co2_emission_df.head() |
Manufacturing and construction | 17.0 | |
---|---|---|
0 | Energy and water supply and waste treatment | 25.9 |
1 | Households | 22.2 |
2 | Services | 12.4 |
3 | Transport sector | 12.0 |
That’s it for now. I hope you enjoyed this post. I’ll cover some more topics regards pandas dataframes in my upcoming posts.
Till that time Stay Tuned….!!!!!
Leave a Reply