Exploring OLAP with atoti and Pokemon

Analyzing Pokemon Fight Data — Which is the strongest Pokemon of them all?

The Beginning

My goal at the start of the project was to learn OLAP (Online Analytical Processing), preferably with an interesting dataset that wasn’t related to products, stores and sales. So I decided on Pokemon. After all, Pokemon is a franchise with quite a lot of data publicly available, and has sufficient depth for data analysis. There are hundreds of unique Pokemon with individual stats such as Speed, Attack… Furthermore, each Pokemon has one or more types, which affect skill damage depending on their opponent based on the type multiplier system.

So my first quest was to find a base dataset to work with that would allow me to perform sufficient data analytics on. I discovered this dataset of 50 thousand Pokemon combat matches on Kaggle, and decided it would serve my needs if I combined it with basic Pokemon stats.

 

Pokemon Combat Dataset
Pokemon Stats Dataset (partial)

The Vision

atoti, the OLAP tool that I am using, is a free python library that can be used to create a full BI analytic platform. It allows us to create interactive dashboards with data that can be aggregated on the fly.

Perfect, because I decided that the product of my data analysis would be a one-stop shop for a Pokemon trainer. After aggregating all the combat data and Pokemon stats, the web application would provide advice on which Pokemon to send out for a certain opponent, historical match-up stats… Anything else I could think of.

 

Individual Pokemon stats page (Select your Pokemon)

 

So let’s see how I got there!


Data transformation

atoti allows data loading from CSV or Pandas DataFrame, which was great since my base data was in CSV format. This flexibility allowed me to either load my CSV directly or perform my ETL in Pandas before loading the resulting DataFrames straight into the atoti tables.

#  ETL in Pandas 

def calc_views_change(l):
    return (l[1] - l[0]) / l[0]

combat_views_df["Live Views Change"] = combat_views_df.apply(
    lambda row: calc_views_change(row["Live Views (Start - End)"]), axis=1
)
# Loading Pandas DataFrame into atoti table

pokemon_t = session.read_pandas(
    pokemon_df,
    table_name="Pokemon",
    keys=["Pokedex"]
)

Expanding Our World

As I progressed, I realised that I did not have enough data to simulate the more practical, real-world scenarios that I might need to know. After all, my goal was to learn OLAP for use in the course of work. So I decided to enrich my cube with price data at the least; I added price info, ticket sales, and stadium costs to this imaginary world. I also generated random location and regional data, and separated the original combat info into five seasons so that I could track changes over time.

 

Location Data
Ticket Sales / Season Data

Creating the Cube

Here is my final cube schema:

 

 

Next, I cleaned up the cube structure. In atoti, each table is automatically created as a dimension, with string type columns created as hierarchies. The dimensions are used to organize the hierarchies.

Here I am adding a custom comparator to an existing level to sort the Pokemon type in a specific order:

l["Pokemon Type"].order = tt.CustomOrder(
    first_elements=[
        "Normal",
        "Fire",
        "Water",
        "Electric",
        "Grass",
        "Ice",
        "Fighting",
        "Poison",
        "Ground",
        "Flying",
        "Psychic",
        "Bug",
        "Rock",
        "Ghost",
        "Dragon",
        "Dark",
        "Steel",
        "Fairy",
    ]
)

Creating a multi-level hierarchy and deleting the original single-level hierarchies:

h["Location"] = [l["Region"], l["Settlement"]]
del h["Region"]
del h["Settlement"]

The Strongest Pokemon is…

Now that the cube has been built, on to the fun part!

In order to get the information I needed, I had to first create some metrics. In atoti, metrics are created by defining measures that aggregate the data from the cube.

I was most interested in the most powerful Pokemon, so I started off with Win Rate as my first measure. First, I created a measure to give me the number of unique matches using the count_distinct function…

m["No. of Matches"] = tt.agg.count_distinct(combat_t["Combat ID"])

Next, I created measures that return the number of winning and losing matches, by filtering that original measure based on the results.

m["Win"] = tt.where(
    tt.filter(m["No. of Matches"], l["Result"] == "WIN") == None,
    0,
    tt.filter(m["No. of Matches"], l["Result"] == "WIN"),
)

m["Lose"] = tt.where(
    tt.filter(m["No. of Matches"], l["Result"] == "LOSE") == None,
    0,
    tt.filter(m["No. of Matches"], l["Result"] == "LOSE"),
)

Finally, I was able to combine the winning and losing measures to calculate the win rate.

m["Win Rate"] = m["Win"] / (m["Win"] + m["Lose"])

Now, a simple table can be created with the dimension Pokemon and measure Win Rate, to find our strongest Pokemon.

 

 

The strongest Pokemon turns out to be Cyndaquil. Cyndaquil fans rejoice!

 


But What About…

I found the most powerful Pokemon, and got a better idea of how to use measures and dimensions. Now I could start on that one-stop shop for Pokemon trainers. But… I really wanted to know how Chikorita, was performing. So I did that first.

I created a measure for trend win rate over the seasons. It tracks the change in win rate from the previous season, so a positive figure will indicate that the win rate has improved from the previous season, and a negative figure will indicate the opposite.

First, I got the previous season’s win rate by shifting the season hierarchy by an offset of -1. Next, I subtracted it from the current season’s win rate to get the number. Lastly, I hid the trend win rate for season 1.

m["Trend Win Rate"] = tt.where(
    l["Season"] != "1",
    m["Win Rate"] - tt.shift(m["Win Rate"], l["Season"], offset=-1),
)

I also created a measure to compare the percentage of matches played in each season.

m["Match Season Ratio"] = m["No. of Matches"] / tt.parent_value(
    m["No. of Matches"], degrees={h["Season"]: 1}
)

Then I combined it all into a glorious Chikorita dashboard page.

 

 

As you can see, Chikorita started off weak but ended up at a 67% win rate in Season 5!

I also added a configurable filter to the page, so technically this isn’t really a Chikorita page, it’s a ‘select a Pokemon and see how it’s doing’ page. This can be found in the ‘trainer’ dashboard in my Jupyter notebook. How did your favorite Pokemon do?


As An Enthusiastic Pokemon Trainer…

Finally, I started building the Pokemon trainer dashboard. First, I created a page to help the trainer select their tournament team. The trainer can select all the Pokemon they own and compare their win rates. I already had the win rate measure, so it was just a matter of creating the charts and putting them together.

 

 

Looks good to me. Next, in order to help the trainer make decisions if they are unsure of a particular Pokemon, I added a page that gives more details on a specific Pokemon. Coincidentally, I already had a page like that, the Chikorita page. Fantastic.

Moving on, I added a page to look at the selected team’s overall win rate against each opponent type, to find out if the team has any weak links. Then, a page with the best Pokemon to counter a specific opponent element so the trainer can round off the team. Finally, a page on locations and a page with each team’s Pokemon individual win rate against a specific opponent Pokemon.

 

 

The complete dashboard can be found in my Jupyter notebook. Have a go at picking your team!


As A Worn-Down Tournament Organizer…

I reached my initial goal of creating a dashboard for Pokemon trainers, but it surprisingly didn’t include much in terms of measures. Since I also generated sales and location data, I decided to create another dashboard, this time for the organizers of this tournament!

Surely they would want to know which stadiums are the most popular… and if there are any that might be better off torn down.

So I started off looking at the number of matches held per stadium.

 

 

I also created the match location ratio, which shows the percentage of matches held in each location.

 

m["Match Location Ratio"] = m["No. of Matches"] / tt.parent_value(
    m["No. of Matches"], degrees={h["Location"]: 1}
)

 

Next, using the generated ticket sales data and stadium capacity, I created a measure to get the percentage of stadium capacity filled during each match. Note that I am applying the formula `Tickets Sold / Stadium Capacity` at the `Combat ID` level by using the scope here because I want the stadium capacity to be calculated for each match. For any queries above the match level, the mean value of the level below will be returned.

 

m["Avg Stadium Capacity"] = tt.agg.mean(
    tt.value(combat_info_t["Tickets Sold"]) / tt.value(location_t["Stadium Capacity"]),
    scope=tt.scope.origin(l["Combat ID"]),
)

 

Now I can compare the stadiums by the average stadium capacity filled during matches.

 

 

It looks similar to the chart with match count, which is to be expected.

What about matches held at full capacity? I was curious, so I created a measure to sum up the matches held at full capacity. This is done by comparing the ticket sales with the stadium capacity. If they match, I return 1. If they don’t, I return 0. Then we sum up the result. And as always, it is scoped by each match.

m["No. of Matches at Full Stadium Capacity"] = tt.agg.sum(
    tt.where(
        tt.value(combat_info_t["Tickets Sold"])
        == tt.value(location_t["Stadium Capacity"]),
        1,
        0,
    ),
    scope=tt.scope.origin(l["Combat ID"]),
)

With this measure, I can now find the percentage of matches held at full stadium capacity by comparing it to the No. of Matches measure.

m["Percentage of Matches at Full Stadium Capacity"] = (
    m["No. of Matches at Full Stadium Capacity"] / m["No. of Matches"]
)

Here is the result.

 

 

About half these locations have almost no instances where they’re filled to max capacity, but Viridian City is actually at full stadium capacity more than 50% of the time!


It’s All About The Pokemon Money

No sales data is complete without the dollars and cents, and I’m sure the tournament organizers would agree. So next for their dashboard I created the profit measure and the profit location ratio measure.

m["Profit"] = tt.agg.sum(
    (tt.value(combat_info_t["Tickets Sold"]) * tt.value(location_t["Ticket Price"]))
    - tt.value(location_t["Stadium Cost"]),
    scope=tt.scope.origin(l["Combat ID"]),
)

m["Profit Location Ratio"] = m["Profit"] / tt.parent_value(
    m["Profit"], degrees={h["Location"]: 1}
)

These are the results.

 

 

Interesting… but what if in this fictitious world, Viridian City decides to increase the cost of renting out their stadium? After all, they are at max capacity more than half the time. I decided to have some fun with this since atoti supports scenario simulation.

So, I modified the existing data in pandas and set the stadium cost for Viridian City to 28,000.

location_s1 = location_df.copy()

location_s1["Stadium Cost"] = np.where(
    location_s1["Location ID"] == "5", 28000, location_s1["Stadium Cost"]
)

Next, I loaded the data as a scenario into the existing location table, which contains the stadium cost data.

location_t.scenarios["Viridian What-if"].load_pandas(location_s1)

Now, in the atoti web application, the source simulation hierarchy appears and I can examine the result.

 

 

Obviously, overall profit dropped. Let’s repeat the process and increase the ticket price by 5 dollars. That seems like a reasonable response to increased costs.

location_s1["Ticket Price"] = np.where(
    location_s1["Location ID"] == "5", 25, location_s1["Ticket Price"]
)

location_t.scenarios["Viridian What-if"].load_pandas(location_s1)

 

Hmm, then ticket sales will probably drop, by perhaps 20%?

combat_info_s1 = combat_info_df.copy()

combat_info_s1["Tickets Sold"] = np.where(
    combat_info_s1["Location ID"] == "5",
    round(combat_info_s1["Tickets Sold"] * 0.8),
    combat_info_s1["Tickets Sold"],
)

combat_info_t.scenarios["Viridian What-if"].load_pandas(combat_info_s1)

 

Surprisingly, or not, there is a negative impact on the overall profit.

In any case, let’s leave the tournament organizers here. As usual, their dashboard is in my Jupyter notebook.


Back To Reality

Lastly, I was interested in the impact of Pokemon stats and type multipliers on the match outcome. So, I started off by creating basic measures to get the average HP/ATK/etc for the Pokemon in each match.

m["Avg HP"] = tt.agg.mean(
    tt.value(pokemon_t["HP"]), scope=tt.scope.origin(l["Registration ID"])
)

m["Avg Def"] = tt.agg.mean(
    tt.value(pokemon_t["Defense"]), scope=tt.scope.origin(l["Registration ID"])
)

...

I realized that the difference in stats seemed negligible, but every single one of them was just slightly higher on the winning side.

 

 

Coincidence? Hmm.

I also created a fun measure to identify the various Pokemon whose win rate never went above 30% for any particular season. This involved first creating a measure which stores a Pokemon’s highest win rate out of all the seasons.

m["Max Win Rate (Season)"] = tt.agg.max(
    m["Win Rate"], scope=tt.scope.origin(l["Season"])
)
m["Max Win Rate (Season)"].formatter = "DOUBLE[0.00%]"

With that, we can create a measure that displays “BOOST” if the Pokemon’s win rate is under 30% for every season.

m["Boost this Pokemon"] = tt.where(
    (m["Max Win Rate (Season)"] < 0.3) & (l["Season"] == None), "BOOST", ""
)

Hopefully, the Pokemon marked with “BOOST” get boosted in the near future!

 


Ending Our Journey

I had a blast learning about multi-dimensional databases and OLAP by playing with this Pokemon dataset and coming up with various far-fetched scenarios for it. 10/10 would recommend. Next, simulating Pokemon fights based on this data would be fun…

Latest posts

Contributing to atoti notebook gallery
Share your atoti use case with our community If you haven’t heard...
atoti: Guide to implementing lookup in a multidimensional cube
A simple example of converting your analytics into any referenced currency Today...
Integrating an External Portfolio Optimizer with atoti
Given a portfolio and an optimizer, can we find a ‘better’ portfolio...

Join our Community

    Like this post ? Please share

    Documentation
    Information
    Follow Us

    atoti Free Community Edition is developed and brought to you by ActiveViam. Learn more about ActiveViam at activeviam.com.