
Insurance News Topic Modeling and Clustering

Click HERE to see the full and detailed script

Table of Contents

  1. Project Overview
  2. Objectives
  3. Data Preparation
  4. Topic Modeling
  5. Evaluation Metrics
  6. Advanced Modeling Techniques
  7. Results
  8. Applications in the Insurance Industry

Project Overview

This project applies NLP to categorize and understand insurance news articles scraped from a news website, with the goal of deriving actionable insights and trends that can inform business decisions. LDA (Latent Dirichlet Allocation) captures underlying themes in the data that might not be immediately apparent; these themes can be crucial for understanding behavior or trends that are predictive of certain outcomes. Advanced analysis was conducted to improve upon the baseline LDA model, first by adding K-means clustering and later a BERT model.

Objectives

  1. Web-scrape news articles and store them in Amazon S3.
  2. Perform topic modeling with an LDA model and improve upon the baseline using K-means clustering and BERT.
  3. Evaluate model performance using metrics such as topic coherence and silhouette score.
  4. Visualize word importance per topic and summarize each article using BART.

Data Preparation

Data and Web Scraping
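
The scraping code itself lives in the full script; below is a minimal sketch of the approach, assuming `requests` and `BeautifulSoup`, with a hypothetical listing URL and CSS selectors standing in for the real news site.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing URL -- the real script targets a specific
# insurance news website.
LISTING_URL = "https://example.com/insurance-news"

def scrape_articles(url: str) -> list[dict]:
    """Fetch a listing page and pull the title and body of each article."""
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    articles = []
    for link in soup.select("a.article-link"):  # assumed selector
        detail = requests.get(link["href"], timeout=30)
        detail_soup = BeautifulSoup(detail.text, "html.parser")
        articles.append({
            "title": detail_soup.select_one("h1").get_text(strip=True),
            "text": " ".join(p.get_text(strip=True)
                             for p in detail_soup.select("article p")),
        })
    return articles

articles = scrape_articles(LISTING_URL)
```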

Data Storage
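
Scraped articles are stored in Amazon S3 per Objective 1. A minimal sketch using `boto3`, with a hypothetical bucket and key; the actual names are project-specific.

```python
import json

import boto3

s3 = boto3.client("s3")

def store_articles(articles: list[dict],
                   bucket: str = "insurance-news-articles"):
    """Serialize scraped articles as JSON and upload them to S3."""
    body = json.dumps(articles).encode("utf-8")
    s3.put_object(Bucket=bucket, Key="raw/articles.json", Body=body)
```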

Data Cleaning and Vectorization
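
A sketch of the cleaning and vectorization step, assuming NLTK for stopword removal and lemmatization and scikit-learn's `CountVectorizer` for the bag-of-words counts that LDA expects; the exact preprocessing choices in the original script may differ.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# Requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, keep letters only, drop stopwords, and lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t not in stop_words and len(t) > 2)

# `articles` comes from the scraping step above.
texts = [clean(a["text"]) for a in articles]

# LDA works on raw term counts, so CountVectorizer is used rather than TF-IDF.
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
doc_term_matrix = vectorizer.fit_transform(texts)
```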

Topic Modeling

LDA (Latent Dirichlet Allocation)
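
A sketch of the baseline model using scikit-learn's `LatentDirichletAllocation` on the count matrix from the previous step; 17 topics matches the number selected in the Results section, and the original script may use gensim instead.

```python
from sklearn.decomposition import LatentDirichletAllocation

# 17 topics, per the model selection described under Results.
lda = LatentDirichletAllocation(n_components=17, random_state=42)

# Rows of `doc_topic` are per-document topic probability distributions.
doc_topic = lda.fit_transform(doc_term_matrix)

# Show the top words for each topic.
feature_names = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top = [feature_names[i] for i in component.argsort()[::-1][:8]]
    print(f"Topic {k}: {', '.join(top)}")
```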

Evaluation Metrics
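
The project reports topic coherence, topic diversity, perplexity, and (for clustering) silhouette score. A sketch of how these might be computed, using gensim's `CoherenceModel` for c_v coherence and a simple unique-word ratio for diversity; the exact coherence measure used in the project is an assumption.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def topic_diversity(topics: list[list[str]]) -> float:
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Top-10 words per topic from the fitted LDA model above.
topics = [[feature_names[i] for i in comp.argsort()[::-1][:10]]
          for comp in lda.components_]

tokenized = [doc.split() for doc in texts]
dictionary = Dictionary(tokenized)
coherence = CoherenceModel(topics=topics, texts=tokenized,
                           dictionary=dictionary,
                           coherence="c_v").get_coherence()

print("coherence: ", coherence)
print("diversity: ", topic_diversity(topics))
print("perplexity:", lda.perplexity(doc_term_matrix))
```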

Advanced Modeling Techniques

K-means Clustering: Applied K-means clustering to the LDA document-topic matrix to further segment the articles, refining the LDA topics into more nuanced groupings.
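
A minimal sketch with scikit-learn, clustering the rows of the document-topic matrix and scoring the result with the silhouette metric reported under Results; the cluster count here is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Cluster documents by their topic distributions (17 clusters is an assumption).
kmeans = KMeans(n_clusters=17, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(doc_topic)

print("silhouette:", silhouette_score(doc_topic, cluster_labels))
```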

BERT (Bidirectional Encoder Representations from Transformers): Applied BERT and combined its embeddings with the LDA topics and K-means clusters to create richer, more contextual document representations.
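
A sketch of the combination step, assuming the `sentence-transformers` package with a small pretrained encoder as a stand-in for the BERT variant actually used:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# The model name is a stand-in; any BERT-family encoder would fit here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
bert_embeddings = encoder.encode(texts)  # shape: (n_docs, embedding_dim)

# Concatenate contextual embeddings with the LDA topic proportions and
# one-hot K-means cluster labels into a single feature matrix.
cluster_onehot = np.eye(kmeans.n_clusters)[cluster_labels]
combined = np.hstack([bert_embeddings, doc_topic, cluster_onehot])
```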

UMAP (Uniform Manifold Approximation and Projection): Used UMAP to reduce the feature space of the BERT embeddings for easier computation and analysis, tuning the number of components to find the most useful representation for clustering.
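
A sketch with `umap-learn`; `n_components=5` is a placeholder for the value found during tuning.

```python
import umap

# n_components was hyper-parameter tuned in the project; 5 is a placeholder.
reducer = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.1,
                    random_state=42)
reduced_embeddings = reducer.fit_transform(combined)
```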

BART (Bidirectional and Auto-Regressive Transformers): Used BART to generate a summary of each article for easier understanding and verification.
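
A sketch using the Hugging Face `transformers` summarization pipeline; `facebook/bart-large-cnn` is a common BART checkpoint and an assumption here.

```python
from transformers import pipeline

# facebook/bart-large-cnn is an assumed checkpoint for this sketch.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

for article in articles[:3]:
    # Truncate long articles to stay within the model's input limit.
    result = summarizer(article["text"][:3000],
                        max_length=130, min_length=30, do_sample=False)
    print(result[0]["summary_text"])
```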

Results

Topic coherence, topic diversity, and perplexity were computed for LDA across a range of topic counts, and 17 proved a reasonable number of topics to choose: the coherence and diversity scores peak, and the perplexity score reaches its minimum, at 17 topics.

To improve upon the baseline LDA model, several other models were layered on top. The following model combinations were tested:

  1. LDA
  2. LDA + K-means
  3. LDA + K-means + BERT

The metrics were as follows:

| Model | Topic Coherence | Topic Diversity | Silhouette Score |
| --- | --- | --- | --- |
| LDA | 0.4064 | 0.739 | N/A |
| LDA + K-means | 0.4135 | 0.7417 | 0.4928 |
| LDA + K-means + BERT | 0.4934 | 0.7725 | 0.4034 |

Results showed that combining K-means and BERT with the LDA model improved the topic coherence and diversity scores. However, adding BERT to the LDA + K-means model actually lowered the silhouette score; further exploration is needed to determine why.

One caution about K-means clustering: it is a distance-based method, while each row of the document-topic matrix is a probability distribution over topics rather than an arbitrary vector. Standard K-means may therefore fail to capture the structure of the documents well; a clustering method based on a distance between probability distributions, such as Jensen-Shannon divergence, may be better suited.
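
A sketch of that alternative: pairwise Jensen-Shannon distances between the topic distributions, fed to agglomerative clustering, which (unlike K-means) accepts a precomputed distance matrix. SciPy's `jensenshannon` returns the square root of the JS divergence, which is a proper metric.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import AgglomerativeClustering

# Pairwise Jensen-Shannon distances between document-topic distributions.
n_docs = doc_topic.shape[0]
distances = np.zeros((n_docs, n_docs))
for i in range(n_docs):
    for j in range(i + 1, n_docs):
        distances[i, j] = distances[j, i] = jensenshannon(doc_topic[i],
                                                          doc_topic[j])

# Agglomerative clustering with a precomputed metric respects the
# distribution-based geometry that Euclidean K-means ignores.
agg = AgglomerativeClustering(n_clusters=17, metric="precomputed",
                              linkage="average")
js_labels = agg.fit_predict(distances)
```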

Applications in the Insurance Industry

The topics generated from this project can be invaluable for business strategy in the insurance sector (and many others). Understanding what themes and issues are prevalent in insurance news can guide companies in their marketing strategies, product development, and customer engagement efforts.