Generating Synthetic Data in Python

Realistic datasets for testing, development, and experimentation.

Jun 02, 2024

Generating Synthetic Data with the Faker Library in Python

Having access to realistic datasets is important for different testing, development, and experimentation tasks. However, obtaining such datasets can sometimes be challenging due to privacy concerns, legal restrictions, or the lack of available data. This is where the Faker library in Python comes into play.

Generate massive amounts of fake (but realistic) data for testing and development. - Faker

Simply put, Faker is a Python library that generates fake data for you. Faker can easily create it if you need names, addresses, phone numbers, or even entire user profiles.

Installing Faker and Basic Usage

This is how you install Faker:

pip install faker

Here’s a very basic example of how to use Faker to generate a fake name and address:

from faker import Faker

fake = Faker()

print(f"Name: {fake.name()}")
print(f"Address: {fake.address()}")

# Sample Output
Name: Patty Huerta
Address: 23935 Martin Viaduct Suite 251
North Caleb, GU 46542

Generating Different Types of Data

Faker can generate many different types of data. Let’s see the examples of some of the most common data types and how to generate them.

Personal Information

print(fake.name())
print(fake.address())
print(fake.phone_number())
print(fake.email())
print(fake.date_of_birth())

# Sample Output
Jeffrey Clark
91226 Bryce Point
South Sheilaborough, KY 80118
+1-527-580-7357
qbowen@example.net
1942-01-12

Geographic Data

print(fake.city())
print(fake.country())
print(fake.latitude(), fake.longitude())

# Sample Output
New Lauraville
El Salvador
-15.614543 113.909850

Financial Data

print(fake.credit_card_number())
print(fake.credit_card_expire())
print(fake.currency())

# Sample Output
3519791479787824
06/25
('EUR', 'Euro')

Internet Data

print(fake.url())
print(fake.ipv4())
print(fake.ipv6())

# Sample Output
https://www.aguirre.com/
46.199.29.72
81fa:83e2:5d9f:b014:8960:de39:29e2:2a3f

Company Data

print(fake.company())
print(fake.job())

# Sample Output
Thomas, Jackson and Sanders
Acupuncturist

Numeric Data

print(fake.random_int(min=1, max=100))
print(fake.pyfloat(left_digits=5, right_digits=2, positive=True))
print(fake.random_number(digits=10))

# Sample Output
69
51390.74
5940947220

Random Choice

choices = [10, 20, 30, 40, 50]
print(fake.random_element(elements=choices))

# Sample Output
50

Custom Data

We can also add custom data patterns using Faker’s providers feature.

from faker.providers import BaseProvider

class CustomProvider(BaseProvider):
    def custom_data(self):
        return f"custom-{self.random_int(1000, 9999)}"

fake.add_provider(CustomProvider)
print(fake.custom_data())

# Sample Output
custom-1969

Generating Bulk Data

import pandas as pd
from faker import Faker

# Initialize Faker
fake = Faker()

# Number of records
num_records = 1_000

# Generate synthetic data
def generate_data(num_records):
    data = {
        "name": [fake.name() for _ in range(num_records)],
        "address": [fake.address() for _ in range(num_records)],
        "email": [fake.email() for _ in range(num_records)],
        "date_of_birth": [fake.date_of_birth() for _ in range(num_records)],
        "credit_card_cumber": [fake.credit_card_number() for _ in range(num_records)],
        "salary": [fake.random_number(digits=5) for _ in range(num_records)]
    }
    return data

# Generate records
data = generate_data(num_records)

# Convert to Pandas DataFrame
df_pandas = pd.DataFrame(data)

# Save data to CSV
df_pandas.to_csv('synthetic_data_1K.csv')

Unique Values

Sometimes, we need to generate unique values for certain columns to mimic real-world scenarios. Faker makes it super easy just by using .unique property. Note: If Faker can’t generate unique values, for example, you need 1_000 records and use the .unique property on the birthday column, it will raise UniquenessException.

def generate_data(num_records):
    data = {
        "email": [fake.unique.email() for _ in range(num_records)],
        "credit_card_cumber": [fake.unique.credit_card_number() for _ in range(num_records)]
    }
    return data

data = generate_data(num_records)

df_pandas = pd.DataFrame(data)
print(df_pandas.head(5))

# Sample Output
                         email   credit_card_cumber
0     hansennathan@example.org     3568392173332924
1  barbaramcdonald@example.org       30395223643063
2    michellehurst@example.net         503892746200
3       hughesluis@example.com  4552168501638205097
4    jameshatfield@example.net       30040144295060

This post is public so feel free to share it.

Conclusion

I’m already using the Faker library extensively in my personal and work projects. The ability to generate practically any dataset in seconds lets me and my engineers start developing and testing data pipelines instantly without waiting for different access rights. All you need to know is the schema you’ll be working on in the production environment.

Have questions or need further clarification? Leave a comment below or reach out directly.

✅ Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.

Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated. 🚀

🔔 You can also follow my work on LinkedIn | Substack | X