Python Pandas – 10 Useful Data Cleaning Tricks

Share on facebook
Share on reddit
Share on twitter
Share on linkedin

In the post below, I will be providing an introduction to the Python Pandas and 10 useful data cleaning tricks that I have used in my previous data analytic projects. Many of these tricks will assist in common Excel or CSV automation.

What is Pandas?

  • One of the most important tools for data scientists and analysts.
  • Pandas provide tools for working with tabular data which is data that is organized into tables that have rows and columns.
  • Tabular data has similar functionality as SQL or Excel.

What could Pandas do?

  • Ingest, clean, and aggregate large quantities of data, and can use that data alongside other Python libraries.

Example of Pandas Use Cases:

  1. Clean data by removing missing values.
  2. Filter rows or columns by certain criteria.
  3. Identify the average, median, max, or min of each column.
  4. Perform mass excel automation (e.g. combining multiple spreadsheets).

Table of Contents

1. Create Pandas DataFrame from dictionary list

To create a DataFrame with Python Pandas, you can use a dictionary list. In the dictionary variable “data”, the dict keys are the column names and the dict values are the rows. In the example below, there are 3 columns and 4 rows. A dictionary list can be converted to a Pandas DataFrame by using the function “pd.DataFrame(dictList)”.

import pandas as pd

data = {
"Name": ['john','sam','ruby','jane'],
"Fav_Fruit": ['apple', 'banana', 'peach', 'banana'],
"Fav_Color": ['green','green', 'red', 'red']
}

items = pd.DataFrame(data)

print(items)
Output of Create Pandas DataFrame from List

2. Export Pandas DataFrame to CSV or XLSX

Pandas Dataframes can easily be converted into a CSV or XLSX file by calling the functions (“items.to_csv” or “items.to_excel). In this function, the index is set to none to prevent the export of index rows/columns. The index column can be seen in the screenshot above (e.g. 0, 1, 2, 3).

import pandas as pd

data = {
"Name": ['john','sam','ruby','jane'],
"Fav_Fruit": ['apple', 'banana', 'peach', 'banana'],
"Fav_Color": ['green','green', 'red', 'red']
}

items = pd.DataFrame(data)

print(items)

export = items.to_csv('item.csv',index=None)
export = items.to_excel('item.xlsx',index=None)

After running the Python script, two files were generated (item.csv, item.xlsx).

Python Pandas export to csv and xlsx

3. Create Pandas DataFrame from CSV File

You can run the “pd.read_csv” function to read the tabular csv/excel data and convert it into a Pandas Dataframe. If necessary, you can also specify the file path of the CSV file if the target file is in another directory.

import pandas as pd

dataframe = pd.read_csv('item.csv')

print(dataframe)

4. Replace all blanks with a certain value

Python Pandas identifies “blank cells” as “NaN” in the DataFrame. The function “fillna” allows you to replace all “blank cells” with the value of your choice. 

import pandas as pd

data = pd.read_csv('item.csv')

items = pd.DataFrame(data)

print(items)

items_filled_blanks = items.fillna('pineapple')

print("---")

print(items_filled_blanks)
Python Pandas replace blank cells with a value

5. Change the letter casing for all of the data

If the data within your excel sheet has different casings (e.g. uppercase, lowercase). You can standardize all of the casing with the “applymap” function.

import pandas as pd

data = pd.read_csv('item.csv')

items = pd.DataFrame(data)

print(items)

items_lower_str = items.applymap(str.lower)

print("---")

print(items_lower_str)

6. Drop all rows with missing value

If there are rows with blank cells, Pandas will resolve them as “NaN”. By running the function dropna(), this function will drop any rows containing “NaN” values.

import pandas as pd

data = pd.read_csv('item.csv')

items = pd.DataFrame(data)

print(items)

items_drop_na = items.dropna()

print("---")

print(items_drop_na)
Python Pandas drop row with missing value

7. Sorting for a single value in a certain column

If you’re looking to sort your data for a particular value in a certain column. You can use the function df[df[“column_name”] = “value”] to sort for the value.

import pandas as pd

data = pd.read_csv('item.csv')

items = pd.DataFrame(data)

print(items)

items_sort_value = items[items["Fav_Fruit"] == "banana"]

print("---")

print(items_sort_value)
Python Pandas sort column by a certain value

8. Find the unique values in a column

The “unique” function allows you to extract all of the unique values from a particular column in a DataFrame. After obtaining the unique values in a list format, you can use this list to create excel files for each of the value or continue parsing it with other Pandas functions.

import pandas as pd

data = pd.read_csv('item.csv')

items = pd.DataFrame(data)

print(items)

items_unique = items.Fav_Fruit.unique()

print("---")

print(items_unique)
Python Pandas Find Unique Value in Column

9. Combine multiple excel sheet into a single sheet

A neat Python Pandas Excel automation trick is combining multiple spreadsheet into a single document, which is possible with the glob library. The glob library allows you to list all of the files within a certain directory. In the example below, the glob function is looking into the specified directory for any excel files. After identifying any excel files, it will read the data using Pandas and append all of the data to a single excel sheet “Combined.xlsx”.

import glob
import pandas as pd

all_data = pd.DataFrame()

for name in glob.glob(r'C:\Users\User\Desktop\TheCyberByte\Python Pandas\New folder\*.xlsx'):
    print(name)
    df = pd.read_excel(name)
    all_data = all_data.append(df,ignore_index=True)

export = all_data.to_excel('Combined.xlsx',index=None)
Python Pandas/Glob Combine Multiple Excel Sheet into One Sheet

10. Compare the values in one column against the values in another column

If you have two sets of data that you want to compare from two different columns or two different excel sheet, you can use the “pd.merge” function to compare the two different DataFrames. The “pd.merge” function gives you SQL “join” abilities, in which you can specified the direction of the join by specifying one of the following in the “how” parameter  (‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’). 

In the example below, I am comparing the column “John_Fav_Fruit” and “Mary_Fav_Fruit”, I want to know which fruit is in John’s list, but not in Mary’s list. To complete this task, I performed a left join which will display the missing fruit in Mary’s column as a “NaN”. After identifying all of the “NaN”, I will sort the DataFrame with the function “isnull()” to identify any of the fruit only on John’s list.

import pandas as pd

data = pd.read_excel('unique_value.xlsx')

items = pd.DataFrame(data)

items_john = pd.DataFrame(items["John_Fav_Fruit"])
items_mary = pd.DataFrame(items["Mary_Fav_Fruit"])

print(items_john)
print("--")
print(items_mary)
print("--")
john_vs_mary = pd.merge(items_john, items_mary,  how='left', left_on=['John_Fav_Fruit'], right_on = ['Mary_Fav_Fruit'])
print(john_vs_mary)
print("--")
john_vs_mary_2 = john_vs_mary[john_vs_mary['Mary_Fav_Fruit'].isnull()]
print("--")
print(john_vs_mary_2)

Thanks for reading through this post. Post any questions or comments below!

Leave a Reply

Your email address will not be published.

Sign up for our Newsletter