Categories
Articles

Top count in Jupyter notebook: atoti vs Pandas + Matplotlib

How frequent do you perform a top count while doing some data analysis? In this article, we do a quick comparison between 2 python libraries – atoti and Pandas – to see how we can achieve the below tasks with the Google Play Store Apps data from Kaggle:

  • Visualize top 10 apps with the highest rating in table
  • Visualize top 10 categories with most number of apps rated 5 in Pie chart
  • Visualize top 10 apps within each category in subplots

Note that data has been processed to convert strings with millions and thousands abbreviations into numeric data.

Visualize top 10 apps with the highest rating in table

Pandas

We can make use of nlargest function to return the top count:

Note that the above method is equivalent to the following:

df_playstore.sort_values(["Rating"], ascending=False).head(10)

atoti

Assuming we have the data loaded into the cube, we can run the command “cube.visualize()” to create a visualization.

As we can see from the GIF above, atoti requires minimum coding and users can add levels and measures to the visualization from atoti’s editor, a Jupyter extension.

Quick Compare

Visualize top 10 categories with most number of apps rated 5 in Pie chart

Computing number of apps rated 5

Before visualizing, let’s have a look at how we can derive the number of apps rated 5 for each category with the two libraries:

Pandas
cat_app_rated_5 = (
    df_playstore[df_playstore["Rating"] == 5].groupby(["Category"])["App"].count()
)

With the above code, we get a list of categories with the number of apps rated 5 which is sufficient for our data visualization.

atoti
m["Count with rating 5"] = tt.agg.sum(
    tt.where(m["Rating.MEAN"] == 5, m["contributors.COUNT"], 0),
    scope=tt.scope.origin(l["Category"], l["App"]),
)

Essentially the above code snippet is similar to the one from Pandas, in the sense we need to select only the apps with “Rating” equals to 5. However, atoti aggregates the count at scope of the “App” and “Category” level. This means that we can visualize the number of apps with rating 5 at any level:

Caption: The short snippet creates a measure that allows us to drill down any levels under category and the count gets computed on the fly.

Visualizing top 10 categories

Now let’s see how we can visualize the top 10 categories with the most number of apps rated 5.

Pandas

pandas.DataFrame.plot.pie is a wrapper over “matplotlib.pyplot.pie()”. We can use it to plot a pie chart directly from the dataframe.

Using the function as it is gives us the pie chart on the left.

Just as how we would configure the pie chart with Matplotlib, we can also customize the pie to display the value of each slice.

To do that, we create a function to compute the value:

def func(pct, allvals):
    return int(pct/100.*np.sum(allvals))

Caption: Alternatively, set autopct=’%1.1f%%’ if we want to display the percentage composition.

atoti

As we have seen earlier on, with “cube.visualize()”, we can create visualization in Jupyter notebook interactively via the atoti editor.

You have to see the below GIF to know why we (or rather I personally) absolutely like with this visualization:

Check out how fluid the resizing of the chart is! In the event “Text” option is turned off, we can always mouse over the pie chart slice to see the number of apps that fall under the category.

Quick Compare

Visualize top 10 apps within each category in subplots

Pandas

In this use case, we apply the equivalent of “nlargest” where we perform sorting before we obtain the first 10 rows using the “head” function. By sorting the metrics in ascending order, we get the bottom count (since the smallest value is at the top). Conversely, sorting the metrics in descending order (ascending=False) gives the top count. 

df_playstore.sort_values(
    ["Category", "Rating"], ascending=False, inplace=True
)
top10_by_cat = df_playstore.groupby("Category").head(10)

Note that we have to perform a “groupby” by the category before applying the “head” function since we wanted the top 10 of each category.

What is more cumbersome is the visualization with subplots. Although Pandas support subplot in their charting, each column translates to a subplot against a different metric. In this use case, each category translates to a subplot that is against the number of reviews, split by apps. It is easier to plot using Matplotlib instead as it gives us more control over the plot.

Control is a double edged sword. 

Pros:

  • We can control the number of rows and columns we want to display the subplots over
  • We can control the size of the plots

Cons:

  • As with any libraries, we need to have the “know-how”. Even then, we have to keep adjusting the style such as the size, until it turns out the way we want.

For instance, we get all the subplots overlapping each other with the below code snippet as we do not configure the “figsize”.

The labels didn’t turn out that well with autoscale() function enabled, which is a common problem with charting. It is possible to find alternative solutions such as using tooltip to show the label instead. However, we won’t be exploring that in this article.

Even with “figsize”, it can be a hassle to adjust the size of the figures so that the charts can all fit nicely within the notebook. As there are 33 categories available, we are going to select just a few categories so that the display can appear nicely:

data = top10_by_cat.loc[
    top10_by_cat["Category"].isin(
        ["BUSINESS", "FAMILY", "LIFESTYLE", "MEDICAL", "TOOLS", "GAME"]
    )
][["Category", "App", "Reviews"]].copy()

With figsize=(25, 10), the subset of charts are rendered nicely.

atoti

Select “Chart” type visualization after running “cube.visualize()”. Click on the ellipsis as shown below to show the subplot controls.

We can now add “Category” under horizontal subplot, with “Reviews.MEAN” as the value and sliced by “App” to achieve the charts below:

Next, let’s apply the TopCount filter as before on the “Rating.MEAN” for the “App” level for each category:

Now we get 6 charts out of the 33 categories. Tooltip is available over the slices to show the relevant metrics. 

We can use filters to select a subset of categories to view.

Alternative, access the web application that is included in the atoti library using:

We can build a dashboard that allows use to perform quick filtering:

Note that while it may seem that users have less control over atoti’s charts, we can actually access its underlying state and modify its properties. atoti’s charts are built on top of the plotly library, hence we can configure the chart’s properties as per plotly’s figure reference.

Quick Compare

Did I miss something? How do you handle top count in a notebook? Drop us a note! Otherwise, you can check out both implementations here