Generating Synthetic Data in Python
Realistic datasets for testing, development, and experimentation.
Generating Synthetic Data with the Faker Library in Python
Having access to realistic datasets is important for testing, development, and experimentation. However, obtaining them can be difficult because of privacy concerns, legal restrictions, or a simple lack of available data. This is where the Faker library for Python comes in.
Generate massive amounts of fake (but realistic) data for testing and development. - Faker
Simply put, Faker is a Python library that generates fake data for you. If you need names, addresses, phone numbers, or even entire user profiles, Faker can create them easily.
Installing Faker and Basic Usage
This is how you install Faker:
pip install faker
Here’s a very basic example of how to use Faker to generate a fake name and address:
from faker import Faker
fake = Faker()
print(f"Name: {fake.name()}")
print(f"Address: {fake.address()}")
# Sample Output
Name: Patty Huerta
Address: 23935 Martin Viaduct Suite 251
North Caleb, GU 46542
Generating Different Types of Data
Faker can generate many different types of data. Let’s look at some of the most common data types and how to generate them. The snippets below reuse the fake instance created above.
Personal Information
print(fake.name())
print(fake.address())
print(fake.phone_number())
print(fake.email())
print(fake.date_of_birth())
# Sample Output
Jeffrey Clark
91226 Bryce Point
South Sheilaborough, KY 80118
+1-527-580-7357
qbowen@example.net
1942-01-12
Geographic Data
print(fake.city())
print(fake.country())
print(fake.latitude(), fake.longitude())
# Sample Output
New Lauraville
El Salvador
-15.614543 113.909850
Financial Data
print(fake.credit_card_number())
print(fake.credit_card_expire())
print(fake.currency())
# Sample Output
3519791479787824
06/25
('EUR', 'Euro')
Internet Data
print(fake.url())
print(fake.ipv4())
print(fake.ipv6())
# Sample Output
https://www.aguirre.com/
46.199.29.72
81fa:83e2:5d9f:b014:8960:de39:29e2:2a3f
Company Data
print(fake.company())
print(fake.job())
# Sample Output
Thomas, Jackson and Sanders
Acupuncturist
Numeric Data
print(fake.random_int(min=1, max=100))
print(fake.pyfloat(left_digits=5, right_digits=2, positive=True))
print(fake.random_number(digits=10))
# Sample Output
69
51390.74
5940947220
Random Choice
choices = [10, 20, 30, 40, 50]
print(fake.random_element(elements=choices))
# Sample Output
50
Custom Data
We can also add custom data patterns using Faker’s providers feature.
from faker.providers import BaseProvider
class CustomProvider(BaseProvider):
    def custom_data(self):
        return f"custom-{self.random_int(1000, 9999)}"
fake.add_provider(CustomProvider)
print(fake.custom_data())
# Sample Output
custom-1969
Generating Bulk Data
import pandas as pd
from faker import Faker
# Initialize Faker
fake = Faker()
# Number of records
num_records = 1_000
# Generate synthetic data
def generate_data(num_records):
    # Build one list per column; each list holds num_records fake values
    data = {
        "name": [fake.name() for _ in range(num_records)],
        "address": [fake.address() for _ in range(num_records)],
        "email": [fake.email() for _ in range(num_records)],
        "date_of_birth": [fake.date_of_birth() for _ in range(num_records)],
        "credit_card_number": [fake.credit_card_number() for _ in range(num_records)],
        "salary": [fake.random_number(digits=5) for _ in range(num_records)]
    }
    return data
# Generate records
data = generate_data(num_records)
# Convert to Pandas DataFrame
df_pandas = pd.DataFrame(data)
# Save data to CSV
df_pandas.to_csv('synthetic_data_1K.csv')
Unique Values
Sometimes we need unique values in certain columns to mimic real-world data. Faker makes this easy through the .unique property. Note: if Faker cannot produce enough distinct values, for example when you request 1_000 unique values from a provider with only a handful of possible outputs (such as fake.unique.boolean(), which has just two), it raises a UniquenessException.
def generate_data(num_records):
    # .unique guarantees no repeated values within each column
    data = {
        "email": [fake.unique.email() for _ in range(num_records)],
        "credit_card_number": [fake.unique.credit_card_number() for _ in range(num_records)]
    }
    return data
data = generate_data(num_records)
df_pandas = pd.DataFrame(data)
print(df_pandas.head(5))
# Sample Output
                         email   credit_card_number
0     hansennathan@example.org     3568392173332924
1  barbaramcdonald@example.org       30395223643063
2    michellehurst@example.net         503892746200
3       hughesluis@example.com  4552168501638205097
4    jameshatfield@example.net       30040144295060
Conclusion
I’m already using the Faker library extensively in my personal and work projects. The ability to generate practically any dataset in seconds lets me and my engineers start developing and testing data pipelines right away, without waiting for access to production data. All you need to know is the schema you’ll be working with in the production environment.
✅ Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.