Free Resources for Generating Realistic Fake Data
Generate mock and dummy data for your data science projects at zero cost
We all know that data is essential. The problem is that many times we do not have (enough of) it. As we develop data applications or pipelines, we need to test them with data that resembles what might be seen in production.
It is difficult to manually create realistic datasets of sufficient volume and variety (e.g., different data types, characteristics). Furthermore, hand-created data is prone to our subconscious and systematic biases.
Fortunately, there are free online resources that can generate realistic fake data for us to use for testing. Let’s take a look at some of them:
(1) Faker
(2) Mockaroo
(3) GenerateData
(4) JSON Schema Faker
(5) FakeStoreAPI
(6) Mock Turtle
(1) Faker
The term ‘Faker’ is synonymous with mock data generation, given that there are numerous Faker data mocking libraries for different programming languages (e.g., Node.js, Ruby, PHP, Perl). The one featured here is the Python version.
Faker is a Python library that helps you generate fake data. According to the documentation, it can be installed with the following command: pip install Faker
Once installed, we instantiate a Faker class object and call the various methods to generate the type of mock data we want:
Furthermore, Faker has its own pytest plugin, which provides a faker fixture you can use in your tests.
(2) Mockaroo
Mockaroo allows you to quickly and easily download large amounts of randomly generated test data based on the specifications you define.
The wonderful thing about this platform is that no programming is needed, and you can download the generated data in many different formats (e.g., SQL, XML, JSON, CSV) to be loaded directly into your test environment. Upon signing up, you can also create and save your schemas for future reuse.
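To give a sense of what loading an export into a test environment might look like, here is a small sketch that reads CSV text into an in-memory SQLite table. The CSV content, table name, and column names are placeholders made up for illustration, not actual Mockaroo output; in practice you would open the downloaded file instead of the inline string.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded CSV export; the columns and rows are made up for illustration
csv_export = io.StringIO(
    "id,first_name,email\n"
    "1,Alice,alice@example.com\n"
    "2,Bob,bob@example.com\n"
)

# Each row becomes a dict keyed by the header names
rows = list(csv.DictReader(csv_export))

# Load the rows into an in-memory SQLite table acting as the test environment
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, first_name, email) VALUES (:id, :first_name, :email)",
    rows,
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(row_count)
```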
With Mockaroo, you can also design your own mock APIs and deploy them in your private cloud by using the Mockaroo Docker image.
(3) GenerateData
GenerateData is a free, open-source tool that allows you to generate large volumes of custom data quickly. Like Mockaroo, the website offers an easy-to-use interface (with a quick-start feature) for creating numerous different types of data in various formats.
After testing out the demo on the main website, you can also download the free, fully functional, GNU-licensed version. If you require mock data beyond the maximum of 100 rows per run, a small $20 donation lets you generate and save up to 5,000 records at a time.
At the time of writing, the new version of GenerateData (v4) is close to a beta release, so do check out the GitHub repo for updates.
Links
- GenerateData website (Latest version – v4)
- GenerateData website (Old version – v3)
- GenerateData GitHub
(4) JSON Schema Faker
The JSON file format is one of the most popular ways of storing and transmitting data objects. Hence, generating both the fake data and the JSON schema that defines the data structure would be beneficial.
JSON Schema Faker combines the JSON Schema standard with fake data generators, allowing you to generate fake data that conforms to a given schema.
The website has a user interface for you to define the schema. Instead of manually writing the schema, you can select and build upon the list of Examples already prepared for you.
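For a sense of what such a schema looks like, the sketch below builds one as a Python dictionary and prints it as JSON text that could be pasted into a schema editor. The user record and its fields are hypothetical, and only standard JSON Schema keywords are used; this is not one of the tool's prepared Examples.

```python
import json

# Hypothetical schema for a user record, using only standard JSON Schema keywords
user_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        "name": {"type": "string", "minLength": 3},
        "email": {"type": "string", "format": "email"},
        "is_active": {"type": "boolean"},
    },
    "required": ["id", "name", "email"],
}

# Serialize the schema to JSON text
schema_text = json.dumps(user_schema, indent=2)
print(schema_text)
```

Any fake record generated against this schema would need an integer id of at least 1, a name of at least three characters, and an email-formatted string, while is_active remains optional.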
(5) FakeStoreAPI
By now, you have probably come across your fair share of generic, Lorem Ipsum-style mock data. This is where FakeStoreAPI switches things up.
It is a free online REST API for creating pseudo-real data for e-commerce or shopping use cases without running server-side code. The mock data will be highly relevant for projects that require retail-related data (e.g., products, carts, users, login tokens) in JSON format.
With just a few lines of code for the API call, the mock data can be readily created or modified:
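A minimal sketch of a read-only call using only the Python standard library. The helper names build_url and fetch are made up for this example, and the code assumes the public https://fakestoreapi.com endpoints (such as products) are reachable from your machine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://fakestoreapi.com"


def build_url(resource, limit=None):
    """Build the URL for a resource such as 'products', 'carts', or 'users'."""
    url = f"{BASE_URL}/{resource}"
    if limit is not None:
        # Optionally cap the number of records returned
        url += "?" + urlencode({"limit": limit})
    return url


def fetch(resource, limit=None):
    """Send a GET request and decode the JSON response body."""
    with urlopen(build_url(resource, limit), timeout=10) as response:
        return json.load(response)
```

With a working network connection, calling fetch("products", limit=3) should return a list of product dictionaries that you can feed straight into your tests.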
(6) Mock Turtle
Mock Turtle is a user-friendly GUI-based platform for users to generate fake data in a JSON schema.
The tool mimics a JSON tree structure, and you can directly see the changes in the schema upon each click.
Besides JSON schema parsing, it also allows for the generation of nested structures and large datasets at no cost.
Know of other excellent mock data generators? Let me know in the Comments section.
Before you go
I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun generating fake data!