In this article, learn one of the most sought-after skills for data scientists: how to generate random datasets. We will see why synthetic data generation is important and explore various Python libraries for generating synthetic data.
Introduction: Why Data Synthesis?
Testing a proof of concept
As a data scientist, you can benefit from data generation because it lets you experiment with various ways of exploring datasets, algorithms, and data visualization techniques, and validate assumptions about the behavior of a method against many different datasets of your choosing.
When you have to test a proof of concept, a tempting option is just to use real data. One small problem, though, is that production data is typically hard to obtain, even partially, and it is not getting easier with new European laws on privacy and security.
Data is indeed a scarce resource
The algorithms, programming frameworks, and machine learning packages (or even the tutorials and courses that teach these techniques) are not scarce resources, but high-quality data is. Hence the need to generate your own datasets.
Let me also be very clear that, in this article, I am only talking about generating data for learning purposes and not for running any commercial operation.
For a more extensive read on why generating random datasets is useful, head towards ‘Why synthetic data is about to become a major competitive advantage’.
The benefits of having a synthetic dataset
As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected by any real-life survey or experiment. Its main purpose, therefore, is to be flexible and rich enough to let an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms. Desired properties are:
- It can be numerical, binary, or categorical (ordinal or non-ordinal).
- The number of features and length of the dataset should be arbitrary.
- It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base this data on, i.e. the underlying random process can be precisely controlled and tuned.
- If it is used for classification algorithms, then the degree of class separation should be controllable to make the learning problem easy or hard.
- Random noise can be injected in a controllable manner (see the sketch after this list).
- For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
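To make these properties concrete, here is a minimal sketch using scikit-learn's built-in generators (scikit-learn comes up again in the SymPy section below); the parameter values are arbitrary choices that only illustrate the knobs for class separation, label noise, and regression noise.

```python
from sklearn.datasets import make_classification, make_regression

# Classification: tune class separation (class_sep) and label noise (flip_y)
# to make the learning problem easier or harder.
X_cls, y_cls = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=4,
    n_classes=3,
    class_sep=0.8,   # smaller values -> harder problem
    flip_y=0.02,     # fraction of labels flipped at random
    random_state=42,
)

# Regression: inject Gaussian noise with a chosen standard deviation.
X_reg, y_reg = make_regression(
    n_samples=500,
    n_features=5,
    noise=10.0,      # standard deviation of the added noise
    random_state=42,
)
```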
Python libraries to synthesize data
Faker
Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.
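For instance, here is a minimal sketch of Faker in use (the choice of fields is arbitrary):

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the generated records reproducible

# Print a handful of fake "customer" records
for _ in range(5):
    print(fake.name(), "|", fake.job(), "|", fake.email(), "|", fake.date_this_decade())
```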
Trumania
Trumania is a scenario-based random dataset generator library. It is built around scenarios in order to overcome the limitations of simple schema-based generators and produce more realistic datasets. As the scenario unfolds, various populations interact with each other, update their properties, and emit logs. In Trumania, the generated datasets are typically time series because they result from the execution of a scenario that unfolds over time.
Pydbgen
Pydbgen is a lightweight, pure-Python library for generating random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and saving them as a Pandas DataFrame, as an SQLite table in a database file, or in an MS Excel file.
SymPy
We can build upon the SymPy library to create functions similar to those available in scikit-learn, but which generate regression and classification datasets from symbolic expressions of arbitrarily high complexity.
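As a minimal sketch of the idea (the symbolic expression and noise level below are arbitrary choices), we can lambdify a SymPy expression and evaluate it on random inputs to obtain a non-linear regression dataset:

```python
import numpy as np
import sympy as sp

# Define a symbolic, non-linear generative process for the regression target.
x1, x2 = sp.symbols("x1 x2")
expr = sp.sin(x1) + sp.Rational(1, 2) * x2**2 + sp.exp(-x1 * x2)

# Turn the symbolic expression into a fast NumPy-callable function.
f = sp.lambdify((x1, x2), expr, modules="numpy")

# Sample random features and add controllable Gaussian noise to the target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = f(X[:, 0], X[:, 1]) + rng.normal(scale=0.1, size=200)
```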
Synthetic Data Vault (SDV)
The workflow of the SDV library is shown below. A user provides the data and the schema, then fits a model to the data. Finally, new synthetic data is sampled from the fitted model. Moreover, the SDV library allows the user to save a fitted model for future use.
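As an illustration of that workflow, here is a rough sketch using the single-table API of recent SDV releases (class and method names have changed across SDV versions, and the file names below are placeholders, so treat this as an approximation and check the documentation for the version you have installed):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# The real dataset provided by the user (hypothetical file name).
data = pd.read_csv("real_data.csv")

# Derive the schema (metadata) from the data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# Fit a model to the data...
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)

# ...sample new synthetic rows from it...
synthetic_data = synthesizer.sample(num_rows=1000)

# ...and save the fitted model for future use.
synthesizer.save("synthesizer.pkl")
```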
Check out this article to see SDV in action.
Many of these packages can generate plausible-looking data for a wide definition of "data", although they won't necessarily model the messiness of real data (any mess you build in will be a model of messy data, but not necessarily a realistic one). This is something to bear in mind when testing.
You should be particularly careful about how you use them if you are testing machine learning models against them, and expect weird things to happen if, Ouroboros-like, you use them to train models.
Conclusion
Synthetic data is a useful tool to safely share data for testing the scalability of algorithms and the performance of new software. It aims at reproducing specific properties of the data. Producing quality synthetic data is complicated because the more complex the system, the more difficult it is to keep track of all the features that need to be similar to real data.
We have synthesized a U.S. automobile dataset using the Faker Python library mentioned above. Here is a snippet of the dataset we generated:
This dataset is used to create the sales cube in Atoti. You can read the sales cube article here.
I hope you enjoyed reading this article and are all set to solve your data problem by synthesizing your own datasets. Let us know about your use case for generating synthetic data, or for creating a sales cube!