Essential Pandas You Should Know By Heart
Core Pandas Functionalities for Efficient Data Processing and Analysis
In the following article, you will learn essential Pandas constructs that every data person should know by heart:
- Creating DataFrames Manually: initialize with dictionaries.
- Importing Data: CSV, Excel, Parquet.
- Data Selection: loc and iloc.
- Filtering DataFrames: using brackets [ ] and .query().
- Joining DataFrames.
- Applying Functions: using the .apply() method with lambda functions.
- Aggregations: summarizing data using .agg().
- Exporting DataFrames to CSV and Excel.
1. Creating DataFrames Manually
When creating or importing data into a DataFrame, I always try to specify the dtype of each column to avoid complications later.
import pandas as pd

# Dict to build the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Explicit dtypes for each column
schema = {
    'Name': 'str',
    'Age': 'int',
    'City': 'str'
}

df = pd.DataFrame(data).astype(schema)
print(df)
2. Importing Data
Pandas can import data from many different formats, such as CSV, Excel, Parquet, SQL, and more. If a file is too large to fit comfortably in memory, use the chunksize parameter to read the data in smaller chunks. And if you are importing data with SQL queries, make sure the connection is closed properly when you are done.
import pandas as pd
# CSV
df_csv = pd.read_csv('data.csv', dtype={'Age': 'int', 'City': 'str'})
print(df_csv.head())
# Excel
df_excel = pd.read_excel('data.xlsx', dtype={'Age': 'int', 'City': 'str'})
print(df_excel.head())
# Parquet
df_parquet = pd.read_parquet('data.parquet')
print(df_parquet.head())
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', conn)
conn.close()
print(df_sql.head())
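The chunksize parameter mentioned above deserves its own sketch. A minimal example, using an in-memory buffer to stand in for a large file on disk (the per-chunk row count is a placeholder for real per-chunk processing):

```python
import io
import pandas as pd

# Simulate a large CSV with an in-memory buffer (stands in for a big file on disk)
csv_data = io.StringIO(
    "Age,City\n" + "\n".join(f"{20 + i % 40},City{i % 5}" for i in range(1000))
)

# chunksize makes read_csv return an iterator of smaller DataFrames
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total_rows += len(chunk)  # replace with real per-chunk processing

print(total_rows)  # 1000
```

Each chunk is a regular DataFrame, so any transformation you would apply to the whole frame can be applied chunk by chunk.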
3. Data Selection with loc and iloc
When slicing DataFrames, it is recommended to use loc or iloc:
- loc is for label-based selection.
- iloc is for position-based selection, when the exact positions of rows and columns are known.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Using loc with complex conditions
selected_loc = df.loc[(df['Age'] > 25) & (df['Salary'] > 50000), ['Name', 'City']]
print(selected_loc)
# Output
-----------------------
Name City
1 Bob Los Angeles
2 Charlie Chicago
# Using iloc for position-based selection
selected_iloc = df.iloc[1:3, 0:3]
print(selected_iloc)
# Output
-----------------------
Name Age City
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
4. Filtering DataFrames
We can filter DataFrames using different approaches: brackets [ ] and .query(). And of course, loc and iloc. 😛 But let's concentrate on [ ] and .query() in the following code block.
My advice is to use .query() for readability and when working with dynamic column names. Note: be careful with string conditions and make sure to escape quotes correctly.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
city = 'Chicago'
# Using brackets with multiple conditions
filtered_brackets = df[(df['Age'] > 25) & (df['City'] == city)]
print(filtered_brackets)
# Using .query() with the same condition as above
filtered_query = df.query('Age > 25 and City == @city')
print(filtered_query)
# Output
Name Age City Salary
2 Charlie 35 Chicago 70000
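The quoting advice above can trip people up in practice, so here is a small sketch (the column name and values are made up for illustration): backticks handle column names containing spaces, and mixing quote styles avoids clashes with apostrophes in the data.

```python
import pandas as pd

# Illustrative frame: a column name with a space and a value with an apostrophe
df = pd.DataFrame({
    'First Name': ['Alice', "O'Brien"],
    'Age': [25, 40],
})

# Backticks wrap the column name with a space; the string literal uses
# double quotes so the apostrophe in O'Brien needs no special handling
result = df.query('`First Name` == "O\'Brien" and Age > 30')
print(result)
```

If the quoting gets too tangled, falling back to bracket filtering is a perfectly reasonable choice.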
5. Joining DataFrames
Joining is a common task in any data-related field, especially when combining data from different sources. Note the _merge indicator; it tracks the source of each row in outer joins.
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'],
'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'],
'value2': [4, 5, 6]})
# Inner join
result_inner = pd.merge(df1, df2, on='key', how='inner')
print(result_inner)
# Output
key value1 value2
0 A 1 4
1 B 2 5
# Outer join
result_outer = pd.merge(df1, df2, on='key',
how='outer',
indicator=True)
print(result_outer)
# Output
key value1 value2 _merge
0 A 1.0 4.0 both
1 B 2.0 5.0 both
2 C 3.0 NaN left_only
3 D NaN 6.0 right_only
6. Applying Functions with .apply()
The .apply() method is one of the most powerful in Pandas, letting us apply functions across DataFrame columns or rows.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Single column
df['AgePlusTen'] = df['Age'].apply(lambda x: x + 10)
print(df)
# Multiple columns
df['Name_Age'] = df.apply(lambda row: f"{row['Name']} - {row['Age']}", axis=1)
print(df)
# Output
Name Age City Salary AgePlusTen Name_Age
0 Alice 25 New York 50000 35 Alice - 25
1 Bob 30 Los Angeles 60000 40 Bob - 30
2 Charlie 35 Chicago 70000 45 Charlie - 35
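One caveat worth keeping in mind: .apply() with a Python lambda runs row by row, so for simple arithmetic a vectorized operation is usually much faster. A sketch of the AgePlusTen column from above computed both ways:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# Row-by-row with .apply()
apply_result = df['Age'].apply(lambda x: x + 10)

# Vectorized: operates on the whole column at once, typically much faster
vectorized_result = df['Age'] + 10

print(apply_result.equals(vectorized_result))  # True
```

Reach for .apply() when the logic genuinely needs Python (string formatting, multi-column rules), and prefer vectorized operations for plain arithmetic.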
7. Aggregations with .agg()
Aggregations in data analysis are used to summarize data and derive insights. With the .agg() method, Pandas makes it easy to aggregate data across single or multiple columns.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Aggregating single column
age_agg = df['Age'].agg(['mean', 'min', 'max'])
print(age_agg)
# Output
mean 30.0
min 25.0
max 35.0
Name: Age, dtype: float64
# Aggregating multiple columns
summary = df.agg({
'Age': ['mean', 'min', 'max'],
'Salary': ['mean', 'min', 'max']
})
print(summary)
# Output
Age Salary
mean 30.000 60000.0
min 25.000 50000.0
max 35.000 70000.0
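A common extension, not shown above, is combining .agg() with .groupby() to summarize within groups rather than over the whole frame. A minimal sketch with made-up salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Salary': [50000, 70000, 55000, 72000],
})

# Per-group aggregation: mean and max salary per city
per_city = df.groupby('City')['Salary'].agg(['mean', 'max'])
print(per_city)
```

The result is indexed by the grouping key, with one column per aggregation function.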
8. Exporting DataFrames
Exporting data ensures you can share and store your analysis results. I always use index=False to avoid writing the DataFrame index to the file. Make sure the openpyxl library is installed to export data to Excel.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Export to CSV
df.to_csv('data.csv', index=False)
# Export to Excel
df.to_excel('data.xlsx', index=False)
Additional Resources
I strongly suggest everyone read the following articles to grasp the ideas behind the basic pandas features:
- Learn more about .query()
- Learn more about .apply()
Learning these functions of pandas will significantly enhance your data manipulation skills. 🚀🚀🚀
Have questions or need further clarification? Leave a comment below or reach out directly.
✅ Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.
Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated. 🚀