Reddit data analytics trilogy #2 — NLP with spaCy

Hui Fang Yeo
September 21, 2020

This is the second article of my Reddit trilogy. If you haven’t read the first article and you are interested in Reddit data scraping, do navigate over for a quick read. While I was intrigued by the idea of playing with Reddit data, I wanted to see how far I could go with Atoti compared to the article I read, “Data Table using Data from Reddit”. That was when I was introduced to spaCy by Philip Vollet, who commented that I could play around with named entities extracted from the posts.

spaCy is pretty easy to use, with plentiful examples around. In particular, I referenced this article from Real Python to kickstart my first Natural Language Processing (NLP) project!

Quickstart with spaCy

As with any Python library, spaCy has to be installed first. I also need to download the models and data for the English language, assuming the Reddit posts that I scraped are all in English.

Note: Run the following commands once for the setup

pip install praw spacy
python -m spacy download en_core_web_sm

In the Jupyter notebook, let’s load an instance of the language model in spaCy.

import spacy
nlp = spacy.load("en_core_web_sm")

Text preprocessing

Before starting on the extraction of named entities, text preprocessing is necessary to clean and prepare the data into a predictable and analyzable format.

During preprocessing, I normalize the text by:

  • lowercasing it
  • removing stop words (words that don’t add much meaning to the sentence) and punctuation symbols
  • lemmatizing each token.

This way, the named entities extracted can easily be grouped to compute the number of times they are mentioned.
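
To illustrate why normalization makes grouping easier, here is a minimal sketch. The mentions are made up for the example and do not come from the scraped posts; only the standard library is used.

```python
from collections import Counter

# Made-up mentions, purely for illustration (not taken from the scraped posts).
raw_mentions = ["Japan", "japan", "JAPAN", "Toyota", "toyota"]

# Without normalization, every spelling variant is counted separately.
raw_counts = Counter(raw_mentions)

# After lowercasing, the variants collapse into a single key per entity.
normalized_counts = Counter(mention.lower() for mention in raw_mentions)

print(len(raw_counts))                  # 5 distinct keys
print(normalized_counts.most_common())  # [('japan', 3), ('toyota', 2)]
```

Lowercasing alone already merges the spelling variants; lemmatization plays the same role for singular and plural forms.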

Calling the “nlp” object on a text generates a Doc object, which holds the sequence of tokens for that post. I can then iterate through the tokens to perform the clean-up. The example below shows the output after a given text is preprocessed:

[Image: output of a sample text after preprocessing]
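
Since the original example is only available as an image, here is a rough sketch of what such a clean-up could look like. The function names `clean_tokens` and `preprocess` are my own and hypothetical; the token attributes `is_stop`, `is_punct`, `is_space` and `lemma_` are standard spaCy token attributes. The pipeline is passed in as a parameter, assuming it was loaded as in the quickstart.

```python
from typing import Callable, Iterable, List


def clean_tokens(doc: Iterable) -> List[str]:
    """Return lowercased lemmas, skipping stop words, punctuation and whitespace.

    `doc` is expected to be a spaCy Doc; only the token attributes
    `is_stop`, `is_punct`, `is_space` and `lemma_` are used.
    """
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


def preprocess(nlp: Callable, text: str) -> str:
    """Run the pipeline on `text` and join the cleaned tokens back into a string."""
    return " ".join(clean_tokens(nlp(text)))


# Usage, assuming `nlp = spacy.load("en_core_web_sm")` from the quickstart
# (the sentence is a made-up example):
#   preprocess(nlp, "Japan is moving away from fossil fuels.")
```

The exact output depends on the model version, so it is worth printing a few posts before and after preprocessing to check that the result looks sensible.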

Named entity extraction

spaCy supports a long list of named entity types. Instead of extracting all of them, I am going to focus only on the following types from each Reddit post:

  • Organizations
  • People
  • Geographical locations
  • Events
  • Products
  • NORP (nationalities, religious or political groups)

To do that, I again call the “nlp” object on the preprocessed text to obtain a Doc for each post. The Doc exposes the detected entities through its “ents” attribute (each token also carries its entity type in “ent_type_”). Each entity has various attributes, one of which is the label, which I check against the entity types listed above:

[Image: named entities extracted from a sample post]
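
As the original code is only available as an image, the following is a minimal sketch of such a filter rather than the exact code I used. The label names are spaCy's identifiers for the English models; mapping “Geographical locations” to GPE is an assumption (LOC could be added as well), and `extract_entities` is a hypothetical helper name.

```python
from typing import FrozenSet, List, Tuple

# Assumed mapping of the entity types listed above to spaCy label names.
# GPE is used here for geographical locations; LOC could be added too.
WANTED_LABELS: FrozenSet[str] = frozenset(
    {"ORG", "PERSON", "GPE", "EVENT", "PRODUCT", "NORP"}
)


def extract_entities(doc, wanted=WANTED_LABELS) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for the entities in `doc.ents` with a wanted label.

    `doc` is expected to be a spaCy Doc; only `doc.ents`, `ent.text`
    and `ent.label_` are used.
    """
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in wanted]


# Usage, assuming `nlp = spacy.load("en_core_web_sm")` from the quickstart
# and `cleaned_text` holding a preprocessed post:
#   extract_entities(nlp(cleaned_text))
```

Keeping the wanted labels in one set makes it easy to widen or narrow the selection later without touching the extraction logic.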

It’s a pity that it didn’t pick up “fossil fuel” as a product. Nonetheless, “japanese” has been classified correctly under NORP.

Now, that wasn’t too difficult. I will just have to integrate them nicely and I can perform some simple data analytics. Have a peek at what I aim to achieve next with Atoti!

[Image: preview of the Atoti analysis planned for the next article]

Stay tuned for the last part of my trilogy — Data exploration with Atoti. Can’t wait to see the effect of real-time dashboarding with Atoti!
