Keyword Extraction and Topic Analysis - Reddit

Analyzing top performing keywords and topics from a popular subreddit thread, utilizing a variety of NLP techniques.

Purpose

In the online media industry, understanding the best performing content lanes under whichever platform/medium channel of interest is valuable knowledge when considering content creation and redistribution. Furthermore, continuously keeping one's finger on the pulse as to how topics of interest may change at any given time is critical when looking to maintain an agile, sustainable, and effective content strategy.

The current analysis leverages text analysis - particularily keyword extraction and correlation - to understand the most popular and top performing keywords of a popular subreddit thread, filtered by recency. By leveraging keyword correlaton - i.e understanding how said keywords of interest relate to one another - subcategories or subtopics can then be deduced from the information available, providing a richer understanding of content lane performance.

Method

Overview

Access Reddit data at scale via the Python Reddit API Wrapper library (PRAW).
Analyze reddit post titles via 2 competing methods and assess the ouputs of each:
- Standard tokenization + lemmatization
- KeyBERT keyword extraction model (see reference for more details).
Calculate average metric scores of keywords outputted by the model
Calculate a correlation matrix of the keywords to determine a richer understanding of top performing keywords and topics

Data Extraction

See Reddit_Topic_Analysis.ipynb for details

PRAW library
Using the top() function with parameters, the top 1000 posts within the year are accessed, looping through each post to get metrics of interest - see code below:

#Function for retrieving top posts
def get_posts(subreddit_name):

  #Lists
  title_list = []
  score_list = []
  num_comments_list = []
  url_list = []
  up_list = []
  down_list = []
  upvote_ratio_list = []

  subreddit = reddit.subreddit(subreddit_name)
  # Get the top posts from the subreddit - can toggle between top vs hot option (for our case, we want to cast a larger net and therefore go for top)
  top_posts = subreddit.top(limit=1000, time_filter='year')
  #top_posts = subreddit.hot(limit=1000)

  # Process top posts
  for post in top_posts:
    # Accessing different metrics of interest
    title = post.title
    score = post.score
    num_comments = post.num_comments
    upvote_ratio = post.upvote_ratio
    ups = post.ups
    downs = post.downs

    # post conditionals
    if score >= 10000 and num_comments >= 100:
      title_list.append(title)
      score_list.append(score)
      num_comments_list.append(num_comments)
      url_list.append(post.url)
      up_list.append(ups)
      down_list.append(downs)
      upvote_ratio_list.append(upvote_ratio)

  
  #Create dataframe
  aita_df = pd.DataFrame({'post': title_list,
                         'score': score_list,
                         'number_of_comments': num_comments_list,
                         'url': url_list,
                         "ups": up_list,
                         "downs": down_list,
                         'upvote_ratio': upvote_ratio_list
                         })
  
  return aita_df

Tokenization vs KeyBERT

Tokenization + Lemmatization
- Subreddit post titles are broken down via tokenization, lemmatization, and filtered further by removing both stopwords and non-alpha characters, with a limit of 3 letters per word.
- Lemmatization is preferred over stemming here as we're looking for the dictionary-based morphological root of the words of interest rather than the base root - the dictionary-based representation is typically easier for general interpretation and more suitable for this analytical case.

# Lemmatizing/Tokenizing Functions
#Configure
def get_wordnet_pos(word):
# Map tag to the first character lemmatize() accepts
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

#Tokenizing titles - only keep tokens with >2 character length
def tokens(tag):
    tag = remove_stopwords(tag) # remove stopwords with Gensim

    lemmatizer = WordNetLemmatizer()
    tokenized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(tag)]

    # remove left over stop words with nltk
    tokenized = [token for token in tokenized if token not in stopwords.words("english")]

    # remove non-alpha characters and keep the words of length >2 only
    tokenized = [token for token in tokenized if token.isalpha() and len(token)>2]

    return tokenized

KeyBERT
- The KeyBERT technique is an easy to use framework, leveraging BERT embeddings - i.e a bi-directional transformer model utilizing semantic similarity for keywords and phrases - to extract said keywords or phrases from pieces of text.
- Here we specify our extraction technique based on a number of factors:
  - N-gram range of 1,1 (i.e singular words instead of bi or tri-grams).
  - Removing stopwords (similar to above).
  - Maximal Marginal Relevance
    - Leveraging cosine similarity to first, find keywords with maximum relevance to the entire text and second, to iteratively choose new cadidates that are both similar to the text and not similar to the rest of the chosen keywords
  - Experimenting with ranges of keywords to include - a large value is used to include all keywords captured.

kb = KeyBERT()
bert_list_cleaned = []
for name in keyword_df['lower']:
  bert_keys = kb.extract_keywords(name, keyphrase_ngram_range=(1, 1), stop_words='english', use_mmr=True, diversity=0.7)[:8]
  bert_list_cleaned.append(bert_keys)

Results