Essential Pandas You Should Know By Heart
Core Pandas Functionalities for Efficient Data Processing and Analysis
In the following article, you will learn essential Pandas constructs that every data person should know by heart:
- Creating DataFrames Manually: initialize with dictionaries.
- Importing Data: CSV, Excel, Parquet.
- Data Selection: loc and iloc.
- Filtering DataFrames: using brackets [ ] and .query().
- Joining DataFrames.
- Applying Functions: using the .apply() method with lambda functions.
- Aggregations: summarizing data using .agg().
- Exporting DataFrames to CSV and Excel.
1. Creating DataFrames Manually
When creating or importing data into a DataFrame, I always try to specify the dtype of each column to avoid complications later.
import pandas as pd

# Dict to build the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Explicit dtypes for each column
schema = {
    'Name': 'str',
    'Age': 'int',
    'City': 'str'
}

df = pd.DataFrame(data).astype(schema)
print(df)
2. Importing Data
Pandas can import data from many different formats, such as CSV, Excel, Parquet, SQL, and more. If a file is too large to fit comfortably in memory, use the chunksize parameter to read the data in smaller chunks. And if you are importing data with SQL queries, make sure the connection is closed properly when you are done.
import pandas as pd
# CSV
df_csv = pd.read_csv('data.csv', dtype={'Age': 'int', 'City': 'str'})
print(df_csv.head())
# Excel
df_excel = pd.read_excel('data.xlsx', dtype={'Age': 'int', 'City': 'str'})
print(df_excel.head())
# Parquet
df_parquet = pd.read_parquet('data.parquet')
print(df_parquet.head())
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', conn)
conn.close()
print(df_sql.head())
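The chunksize parameter mentioned above deserves its own sketch. A minimal example, using an in-memory buffer to stand in for a large file on disk (the per-chunk row count is a placeholder for real per-chunk processing):

```python
import io
import pandas as pd

# Simulate a large CSV with an in-memory buffer (stands in for a big file on disk)
csv_data = io.StringIO(
    "Age,City\n" + "\n".join(f"{20 + i % 40},City{i % 5}" for i in range(1000))
)

# chunksize makes read_csv return an iterator of smaller DataFrames
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total_rows += len(chunk)  # replace with real per-chunk processing

print(total_rows)  # 1000
```

Each chunk is a regular DataFrame, so any transformation you would apply to the whole frame can be applied chunk by chunk.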
3. Data Selection with loc and iloc
When slicing DataFrames, it is recommended to use loc or iloc:
- loc is for label-based selection.
- iloc is for position-based selection, when the exact positions of rows and columns are known.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Using loc with complex conditions
selected_loc = df.loc[(df['Age'] > 25) & (df['Salary'] > 50000), ['Name', 'City']]
print(selected_loc)
# Output
-----------------------
Name City
1 Bob Los Angeles
2 Charlie Chicago
# Using iloc for position-based selection
selected_iloc = df.iloc[1:3, 0:3]
print(selected_iloc)
# Output
-----------------------
Name Age City
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
4. Filtering DataFrames
We can filter DataFrames using different approaches: brackets [ ] and .query(). And of course, loc and iloc. 😛 But let's concentrate on [ ] and .query() in the following code block.
My advice is to use .query() for readability and when working with dynamic column names. Note: be careful with string conditions and make sure to escape quotes correctly.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
city = 'Chicago'
# Using brackets with multiple conditions
filtered_brackets = df[(df['Age'] > 25) & (df['City'] == city)]
print(filtered_brackets)
# Using .query() with the same condition as above
filtered_query = df.query('Age > 25 and City == @city')
print(filtered_query)
# Output
Name Age City Salary
2 Charlie 35 Chicago 70000
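The quoting advice above can trip people up in practice, so here is a small sketch (the column name and values are made up for illustration): backticks handle column names containing spaces, and mixing quote styles avoids clashes with apostrophes in the data.

```python
import pandas as pd

# Illustrative frame: a column name with a space and a value with an apostrophe
df = pd.DataFrame({
    'First Name': ['Alice', "O'Brien"],
    'Age': [25, 40],
})

# Backticks wrap the column name with a space; the string literal uses
# double quotes so the apostrophe in O'Brien needs no special handling
result = df.query('`First Name` == "O\'Brien" and Age > 30')
print(result)
```

If the quoting gets too tangled, falling back to bracket filtering is a perfectly reasonable choice.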
5. Joining DataFrames
Joining is a common task in any data-related field, especially when combining data from different sources. Note the _merge indicator; it tracks the source of each row in outer joins.
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'],
'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'],
'value2': [4, 5, 6]})
# Inner join
result_inner = pd.merge(df1, df2, on='key', how='inner')
print(result_inner)
# Output
key value1 value2
0 A 1 4
1 B 2 5
# Outer join
result_outer = pd.merge(df1, df2, on='key',
how='outer',
indicator=True)
print(result_outer)
# Output
key value1 value2 _merge
0 A 1.0 4.0 both
1 B 2.0 5.0 both
2 C 3.0 NaN left_only
3 D NaN 6.0 right_only
6. Applying Functions with .apply()
The .apply() method is one of the most powerful in Pandas, letting us apply functions across DataFrame columns or rows.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Single column
df['AgePlusTen'] = df['Age'].apply(lambda x: x + 10)
print(df)
# Multiple columns
df['Name_Age'] = df.apply(lambda row: f"{row['Name']} - {row['Age']}", axis=1)
print(df)
# Output
Name Age City Salary AgePlusTen Name_Age
0 Alice 25 New York 50000 35 Alice - 25
1 Bob 30 Los Angeles 60000 40 Bob - 30
2 Charlie 35 Chicago 70000 45 Charlie - 35
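One caveat worth keeping in mind: .apply() with a Python lambda runs row by row, so for simple arithmetic a vectorized operation is usually much faster. A sketch of the AgePlusTen column from above computed both ways:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# Row-by-row with .apply()
apply_result = df['Age'].apply(lambda x: x + 10)

# Vectorized: operates on the whole column at once, typically much faster
vectorized_result = df['Age'] + 10

print(apply_result.equals(vectorized_result))  # True
```

Reach for .apply() when the logic genuinely needs Python (string formatting, multi-column rules), and prefer vectorized operations for plain arithmetic.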
7. Aggregations with .agg()
Aggregations in data analysis are used to summarize data and derive insights. With the .agg() method, Pandas makes it easy to aggregate data across single or multiple columns.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Aggregating single column
age_agg = df['Age'].agg(['mean', 'min', 'max'])
print(age_agg)
# Output
mean 30.0
min 25.0
max 35.0
Name: Age, dtype: float64
# Aggregating multiple columns
summary = df.agg({
'Age': ['mean', 'min', 'max'],
'Salary': ['mean', 'min', 'max']
})
print(summary)
# Output
Age Salary
mean 30.000 60000.0
min 25.000 50000.0
max 35.000 70000.0
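A common extension, not shown above, is combining .agg() with .groupby() to summarize within groups rather than over the whole frame. A minimal sketch with made-up salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Salary': [50000, 70000, 55000, 72000],
})

# Per-group aggregation: mean and max salary per city
per_city = df.groupby('City')['Salary'].agg(['mean', 'max'])
print(per_city)
```

The result is indexed by the grouping key, with one column per aggregation function.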
8. Exporting DataFrames
Exporting data ensures you can share and store your analysis results. I always use index=False to avoid writing the DataFrame index to the file. Make sure the openpyxl library is installed to export data to Excel.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Export to CSV
df.to_csv('data.csv', index=False)
# Export to Excel
df.to_excel('data.xlsx', index=False)
Additional Resources
I strongly suggest everyone read the following articles to grasp the ideas behind the basic pandas features:
- Learn more about .query()
- Learn more about .apply()
Learning these functions of pandas will significantly enhance your data manipulation skills. 🚀🚀🚀
Have questions or need further clarification? Leave a comment below or reach out directly.
✅ Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.
Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated. 🚀