Synthetic data generation: test your proof of concept, without the data!

In this article, learn one of the most sought-after skills for data scientists: how to generate random datasets. We will see why synthetic data generation is important, and we will explore the various Python libraries that can generate synthetic data.

Introduction: Why Data Synthesis?

Testing a proof of concept

As a data scientist, you can benefit from data generation: it allows you to experiment with various ways of exploring datasets, algorithms, and data visualization techniques, and to validate assumptions about the behavior of a method against many different datasets of your choosing.

When you have to test a proof of concept, a tempting option is simply to use real data. One small problem, though: production data is typically hard to obtain, even partially, and it is not getting easier with European privacy and security regulations such as the GDPR.

Data is indeed a scarce resource

Algorithms, programming frameworks, and machine learning packages (and even tutorials and courses on how to learn these techniques) are not scarce resources, but high-quality data is. Hence the need to generate your own datasets.

Let me also be very clear: in this article, I am only talking about generating data for learning purposes, not for running any commercial operation.

For a more extensive read on why generating random datasets is useful, head over to ‘Why synthetic data is about to become a major competitive advantage’.

The benefits of having a synthetic dataset

As the name suggests, a synthetic dataset is a repository of data that is generated programmatically, rather than collected by any real-life survey or experiment. Its main purpose is to be flexible and rich enough to let an ML practitioner conduct experiments with various classification, regression, and clustering algorithms. Desired properties are (see the sketch after this list):

  • It can be numerical, binary, or categorical (ordinal or non-ordinal).
  • The number of features and the length of the dataset should be arbitrary.
  • It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base the data upon, i.e. the underlying random process can be precisely controlled and tuned.
  • If it is used for classification algorithms, the degree of class separation should be controllable, to make the learning problem easy or hard.
  • Random noise can be injected in a controllable manner.
  • For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
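
As a point of reference, scikit-learn's built-in generators already expose most of these knobs. A minimal sketch, with arbitrary parameter values:

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression

# Classification: arbitrary size, controllable class separation and label noise.
X_clf, y_clf = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_classes=3,
    class_sep=0.8,    # lower values make the classes harder to separate
    flip_y=0.05,      # fraction of labels flipped at random (label noise)
    random_state=42,  # the underlying random process is fully reproducible
)

# Regression: start from a linear process, then inject a non-linear term.
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=5, noise=10.0, random_state=42
)
y_reg = y_reg + 0.1 * X_reg[:, 0] ** 2

print(X_clf.shape, np.bincount(y_clf))
```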

Python libraries to synthesize data

Faker

Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.
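
A minimal sketch of typical Faker usage (seeding makes the fake output reproducible):

```python
from faker import Faker

fake = Faker()
Faker.seed(0)  # reproducible fake data

# Each call draws a new fake record.
for _ in range(3):
    print(fake.name(), "|",
          fake.address().replace("\n", ", "), "|",
          fake.credit_card_number())
```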

Trumania

Trumania is a scenario-based random dataset generator library. It produces more realistic datasets than row-by-row generators: as a scenario unfolds, various populations interact with each other, update their properties, and emit logs. In Trumania, the generated datasets are typically time series, because they result from the execution of a scenario that unfolds over time.

Pydbgen

It is a lightweight, pure-Python library to generate random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and save them in a Pandas DataFrame, as an SQLite table in a database file, or in an MS Excel file.
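
A minimal sketch, assuming the pydb API shown in pydbgen's README (the available field names may vary between versions):

```python
from pydbgen import pydbgen

myDB = pydbgen.pydb()
# Generate 25 rows with the requested fields as a Pandas DataFrame.
df = myDB.gen_dataframe(25, ["name", "city", "phone", "date"])
print(df.head())
```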

Sympy

Building on the SymPy library, we can create functions similar to those available in scikit-learn, but that generate regression and classification datasets from symbolic expressions of arbitrary complexity.
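
A minimal sketch of the idea: define a symbolic expression with SymPy, lambdify it into a NumPy function, and sample noisy regression data from it (the expression below is an arbitrary example):

```python
import numpy as np
from sympy import symbols, sin, lambdify

x1, x2 = symbols("x1 x2")
expr = sin(x1) + 0.5 * x2**2 + x1 * x2  # arbitrary non-linear generative process

f = lambdify((x1, x2), expr, "numpy")  # compile the expression to a NumPy function

rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(500, 2))
y = f(X[:, 0], X[:, 1]) + rng.normal(0, 0.1, size=500)  # add Gaussian noise
```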

Synthetic Data Vault (SDV)

The workflow of the SDV library is shown below. A user provides the data and the schema, then fits a model to the data. Finally, new synthetic data is obtained from the fitted model. The SDV library also allows the user to save a fitted model for future use.

Check out this article to see SDV in action.

The SDV workflow
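
A minimal sketch of this workflow, assuming SDV's pre-1.0 sdv.tabular API (newer releases expose the same steps through sdv.single_table.GaussianCopulaSynthesizer and an explicit metadata object):

```python
import pandas as pd
from sdv.tabular import GaussianCopula

# A toy stand-in for the real data a user would provide.
real_data = pd.DataFrame({
    "age": [23, 45, 31, 52, 37],
    "income": [32000, 64000, 41000, 88000, 52000],
})

model = GaussianCopula()
model.fit(real_data)           # fit the model to the data
synthetic = model.sample(100)  # obtain new synthetic rows

model.save("sdv_model.pkl")    # save the fitted model for future use
# restored = GaussianCopula.load("sdv_model.pkl")
```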

Many of these packages can generate plausible-looking data for a wide range of data types, although they won't necessarily model the mess of real data (any mess you build in will be a model of messy data, but not necessarily a realistic one). This is something to bear in mind when testing.

You should be particularly careful with how you use them if you are testing machine learning models against them, and expect weird things to happen if you make like Ouroboros and use them to train models.

Conclusion

Synthetic data is a useful tool to safely share data for testing the scalability of algorithms and the performance of new software. It aims at reproducing specific properties of the real data. Producing quality synthetic data is complicated: the more complex the system, the more difficult it is to keep track of all the features that need to be similar to real data.

We have synthesized a U.S. automobile dataset using the Faker Python library mentioned above.
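
A minimal sketch of how such a dataset can be put together with Faker; the column names below are illustrative assumptions, not the exact schema we used:

```python
import pandas as pd
from faker import Faker

fake = Faker("en_US")
Faker.seed(0)

# Column names are hypothetical; Faker's automotive and address
# providers supply the license plate and state fields.
rows = [
    {
        "license_plate": fake.license_plate(),
        "owner": fake.name(),
        "dealer": fake.company(),
        "sale_date": fake.date_between(start_date="-2y", end_date="today"),
        "state": fake.state_abbr(),
    }
    for _ in range(1000)
]
df = pd.DataFrame(rows)
print(df.head())
```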

This dataset is used to create the sales cube in atoti. You can read the sales cube article here.

I hope you enjoyed reading this article and that you are all set to sort out your data problems by synthesizing data. Let us know about your use case for generating synthetic data, or for creating a sales cube!
