Automated data sampling for fast application modeling

Having a snappy, interactive experience is very important when modeling an atoti application. In order to provide such an experience, atoti can automatically sample your data during the modeling phase, and seamlessly load the full data set when publishing the application to its users.


One of atoti's top features is that it provides speed-of-thought analytics on very large volumes of data. Some projects load multiple terabytes of data in memory on large machines with hundreds of cores and enjoy sub-second query response times thanks to our high-performance, multi-core columnar database.

Nonetheless, loading that much data during the modeling phase of the application is rarely a good idea. It requires a large, expensive machine, and even though atoti excels at loading data quickly into memory (a few minutes per terabyte), there is no reason to spend that time when modeling can be performed very efficiently on a subset of the data.

People therefore model their application either on their personal computer or on a cheap machine in the cloud. To do so, they used to extract a sample of their production data and model their application on that sample in a Jupyter notebook. Once this was done, they had to change their code to point to the actual data, sometimes running into unforeseen issues, and then verify that their model still matched the data before finally deploying the application to their users.

To ease and speed up this process, we have incorporated an automated sampling mechanism in the latest version of atoti. When modeling your application, you can write it against the actual production data, and the library will automatically sample that data and load only a subset of it. The sampling behavior is configured when creating the session.
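To make the idea concrete, here is a minimal sketch of first-rows sampling in plain Python. It is not atoti's API or implementation: the helper `load_first_rows` and the in-memory CSV are assumptions made up for illustration, showing only how a loader can point at the full source yet read a bounded number of rows.

```python
import csv
import io
from itertools import islice


def load_first_rows(source, max_rows):
    """Read at most max_rows data rows from a CSV text stream.

    Hypothetical helper for illustration; not part of atoti.
    """
    reader = csv.DictReader(source)
    # islice stops pulling from the reader after max_rows rows,
    # so the remainder of the source is never parsed.
    return list(islice(reader, max_rows))


# A small in-memory stand-in for a large production file (assumed data).
production_csv = "product,amount\n" + "".join(f"p{i},{i}\n" for i in range(1000))

# The code references the full source but only loads a 10-row sample.
sample = load_first_rows(io.StringIO(production_csv), max_rows=10)
print(len(sample), sample[0])
```

Because the limit is a parameter rather than a separate extract file, the same call can later be pointed at the full data simply by raising or removing the limit.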

Loading a subset of the production data keeps the modeling phase snappy, and because the code is written against the real data source, it does not need to change to handle the full data set. Once the application is ready to be consumed, the full data set is loaded by calling session.load_all_data().
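The two-phase workflow can be sketched in plain Python as follows. This is a toy model under stated assumptions, not atoti's code: `SampledTable` and `total_amount` are hypothetical names, and the method name `load_all_data` is borrowed from the text only to mirror the workflow it describes.

```python
import csv
import io
from itertools import islice


class SampledTable:
    """Hypothetical table that starts with a sample of its source."""

    def __init__(self, open_source, sample_rows):
        # open_source is a callable returning a fresh text stream each time.
        self._open_source = open_source
        self.rows = self._read(limit=sample_rows)

    def _read(self, limit=None):
        with self._open_source() as stream:
            # islice with a limit of None reads every row.
            return list(islice(csv.DictReader(stream), limit))

    def load_all_data(self):
        # Re-read the same source, this time without a row limit.
        self.rows = self._read()


def total_amount(table):
    # "Model" code: written once, run unchanged on sample or full data.
    return sum(int(row["amount"]) for row in table.rows)


# Assumed in-memory data standing in for a production source.
data = "product,amount\n" + "".join(f"p{i},{i}\n" for i in range(100))

table = SampledTable(lambda: io.StringIO(data), sample_rows=10)
sampled_total = total_amount(table)  # computed on the first 10 rows only

table.load_all_data()
full_total = total_amount(table)  # same function, now over all 100 rows
```

The point of the design is that `total_amount` never knows whether it is looking at the sample or the full data; only the loading step changes between modeling and publishing.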