Categories
Starting guides

Getting Started with atoti

atoti is a Python library for performing data aggregation and creating data-viz widgets such as pivot tables and charts in a Jupyter notebook. It comes with a business intelligence web application, enabling fuss-free dashboard creation and sharing.

Note: This article has been tested against atoti 0.5.2.

Installation

atoti can be installed with pip/Poetry or conda. Let’s keep it simple with pip:

pip install atoti[jupyterlab]

It is optional to install graphviz, however, it is recommended for this article as we will be displaying the schema of the cube.

Once this is done, you can start JupyterLab:

jupyter lab

Building and exploring a data model

For this guide, we’ll work on this popular Kaggle data set of trending YouTube video statistics. The goal will be to define some key metrics and create practical visualizations.

We start with some data prep with Pandas:

import pandas as pd

videos_df = pd.read_csv(
    "USvideos.csv",
    usecols=[
        "category_id",
        "channel_title",
        "title",
        "trending_date",
        "video_id",
        "views",
    ],
)

# Parse trending date and split it into year/month/day columns.
trending_date = pd.to_datetime(
    videos_df["trending_date"],
    format="%y.%d.%m"
)
videos_df["trending_date"] = trending_date.dt.date
videos_df["trending_year"] = trending_date.dt.year
videos_df["trending_month"] = trending_date.dt.month
videos_df["trending_day"] = trending_date.dt.day

videos_df.sample(5)
5 random lines of our videos DataFrame

import json
from pathlib import Path

# Parse JSON file holding mapping between a category ID and its title, and make a DataFrame out of it.
category_data = json.loads(Path("US_category_id.json").read_text())
data = [
    [int(item["id"]), item["snippet"]["title"]]
    for item in category_data["items"]
]
categories_df = pd.DataFrame(data, columns=["id", "category_title"])

categories_df.head()

First five categories with their ID and title

Now that the DataFrames are prepped, we can create the atoti analytical cube:

import atoti as tt
session = tt.create_session()
videos_store = session.read_pandas(
    videos_df, keys=["video_id", "trending_date"], store_name="videos"
)
categories_store = session.read_pandas(
    categories_df, keys=["id"], store_name="categories"
)
 
videos_store.join(categories_store, mapping={"category_id": "id"})

cube = session.create_cube(videos_store, "YouTube", mode="manual")
h, l, m = cube.hierarchies, cube.levels, cube.measures
cube

In this case, we choose the manual cube creation mode to shape the cube later. By default, however, the cube structure is inferred from the types of the stores’ columns.

Run cube.schema to visualize the schema of the cube created:

Install graphviz if the schema is not getting displayed.

We also create analytical hierarchies — extra available axes in pivot tables or charts:

# A channel has multiple videos and each video can be renamed so it can have multiple titles.
h["video"] = [
    videos_store["channel_title"],
    videos_store["video_id"],
    videos_store["title"]
]

# The trending date can also be organized with multiple levels.
h["trending_date"] = [
    videos_store["trending_year"],
    videos_store["trending_month"],
    videos_store["trending_day"],
    videos_store["trending_date"]
]

# The category hierarchy has a single level: the category title.
h["category"] = [categories_store["category_title"]]
cube

The cube structure is returned as a JSON tree that JupyterLab displays nicely

From there, we can create visualizations to get a sense of the data set. The visualize method on Session instances outputs an interactive widget that can be built with mouse & keyboard inputs — no code needed.

A widget showing that there are almost always 200 trending videos per day

Building a widget showing the 10 channels with the most accumulated trending days

Drilling down the trending_date hierarchy while showing the numbers of trending videos per category

We’ve created these widgets without defining any specific metrics but one of the strengths of atoti is for building a data model with aggregated indicators:

views_max = tt.agg.max(videos_store["views"])
m["views"] = tt.agg.sum(
    views_max, scope=tt.scope.origin(l["video_id"])
)

Adding the views metric tocube.measuresmakes it directly available in the atoti JupyterLab extension:

Drilling down on the video hierarchy to see the most viewed channels and their corresponding videos

Let’s define another metric that will give us the aggregated distinct count of trending videos:

m["trending_videos"] = tt.agg.count_distinct(videos_store["video_id"])
Sorting categories by amount of trending videos

Let’s make one more widget:

Plotting the amount of views Vs. amount of trending videos per channel

Sharing our insights

We can publish all the widgets we’ve built in JupyterLab in the atoti dashboarding app:

Publishing a widget to the app and opening it from the web app

Widgets published in the app can be added to dashboards with additional features such as quick filters and filtering on multi-selection:

Filtering a dashboard in the app by category and then by channel

The dashboarding application is a “safe” environment: all the queries are read-only so there is no risk of breaking the model or tampering with its data.

You can share a link to your atoti app to show it to other people.

If you would like to know more, head over to the documentation.