In the post below, I will be providing an introduction to the Python Pandas and 10 useful data cleaning tricks that I have used in my previous data analytic projects. Many of these tricks will assist in common Excel or CSV automation.
What is Pandas?
- One of the most important tools for data scientists and analysts.
- Pandas provide tools for working with tabular data which is data that is organized into tables that have rows and columns.
- Tabular data has similar functionality as SQL or Excel.
What could Pandas do?
- Ingest, clean, and aggregate large quantities of data, and can use that data alongside other Python libraries.
Example of Pandas Use Cases:
- Clean data by removing missing values.
- Filter rows or columns by certain criteria.
- Identify the average, median, max, or min of each column.
- Perform mass excel automation (e.g. combining multiple spreadsheets).
Table of Contents
1. Create Pandas DataFrame from dictionary list
To create a DataFrame with Python Pandas, you can use a dictionary list. In the dictionary variable “data”, the dict keys are the column names and the dict values are the rows. In the example below, there are 3 columns and 4 rows. A dictionary list can be converted to a Pandas DataFrame by using the function “pd.DataFrame(dictList)”.
2. Export Pandas DataFrame to CSV or XLSX
Pandas Dataframes can easily be converted into a CSV or XLSX file by calling the functions (“items.to_csv” or “items.to_excel). In this function, the index is set to none to prevent the export of index rows/columns. The index column can be seen in the screenshot above (e.g. 0, 1, 2, 3).
After running the Python script, two files were generated (item.csv, item.xlsx).
3. Create Pandas DataFrame from CSV File
You can run the “pd.read_csv” function to read the tabular csv/excel data and convert it into a Pandas Dataframe. If necessary, you can also specify the file path of the CSV file if the target file is in another directory.
4. Replace all blanks with a certain value
Python Pandas identifies “blank cells” as “NaN” in the DataFrame. The function “fillna” allows you to replace all “blank cells” with the value of your choice.
5. Change the letter casing for all of the data
If the data within your excel sheet has different casings (e.g. uppercase, lowercase). You can standardize all of the casing with the “applymap” function.
6. Drop all rows with missing value
If there are rows with blank cells, Pandas will resolve them as “NaN”. By running the function dropna(), this function will drop any rows containing “NaN” values.
7. Sorting for a single value in a certain column
If you’re looking to sort your data for a particular value in a certain column. You can use the function df[df[“column_name”] = “value”] to sort for the value.
8. Find the unique values in a column
The “unique” function allows you to extract all of the unique values from a particular column in a DataFrame. After obtaining the unique values in a list format, you can use this list to create excel files for each of the value or continue parsing it with other Pandas functions.
9. Combine multiple excel sheet into a single sheet
A neat Python Pandas Excel automation trick is combining multiple spreadsheet into a single document, which is possible with the glob library. The glob library allows you to list all of the files within a certain directory. In the example below, the glob function is looking into the specified directory for any excel files. After identifying any excel files, it will read the data using Pandas and append all of the data to a single excel sheet “Combined.xlsx”.
10. Compare the values in one column against the values in another column
If you have two sets of data that you want to compare from two different columns or two different excel sheet, you can use the “pd.merge” function to compare the two different DataFrames. The “pd.merge” function gives you SQL “join” abilities, in which you can specified the direction of the join by specifying one of the following in the “how” parameter (‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’).
In the example below, I am comparing the column “John_Fav_Fruit” and “Mary_Fav_Fruit”, I want to know which fruit is in John’s list, but not in Mary’s list. To complete this task, I performed a left join which will display the missing fruit in Mary’s column as a “NaN”. After identifying all of the “NaN”, I will sort the DataFrame with the function “isnull()” to identify any of the fruit only on John’s list.
Thanks for reading through this post. Post any questions or comments below!