Case Study: Text Mining and Topic Modelling on Unstructured, Descriptive Data

The Client

  • The client is one of the largest healthcare insurance companies in the Nordics, providing coverage to nearly half a million customers across multiple segments.
  • The client wanted to use the large volume of descriptive, unstructured data it collected to build an early warning system that detects high-risk, costly claims early enough to manage them properly. It also wanted to improve operational reporting and the performance of its re-insurance models.

The Challenges

  • Limited internal data science capability to process client information and convert messy unstructured data into statistically meaningful, structured data.
  • The client’s data was keyed in manually by individual healthcare professionals. It was poorly structured, inconsistent and riddled with abbreviations, shorthand and empty fields.
  • Inefficient data structures required excessive on-site storage.

Solution Delivered

  • Applied text pre-processing techniques to cleanse and tag medical abbreviations, then used stemming and lemmatisation to reconcile and normalise disparate inputs.
  • Deployed a proprietary natural language processing (NLP) engine for topic modelling, which surfaced and segmented the most relevant subjects across more than 500k descriptions spanning more than four years of data inputs.
  • Derived sentiment scores and other propensity scores from the previously unstructured data, giving the client a clear categorisation of conditions, severity levels and the likelihood of specific outcomes.
  • Back-tested the data derivation approach to independently validate outcomes across multiple vintages, geographies and customer profiles.
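
As an illustration only, the pre-processing step described above (abbreviation expansion followed by stemming) might look something like the Python sketch below. It is not the engine used in the engagement: the abbreviation map, the suffix list and the function names are all hypothetical, and the crude suffix stripper merely stands in for a proper stemmer or lemmatiser.

```python
import re

# Hypothetical abbreviation map; a real one would be curated with clinicians.
ABBREVIATIONS = {
    "fx": "fracture",
    "pt": "patient",
    "hx": "history",
    "l": "left",
    "r": "right",
}

# Crude suffix list standing in for a real stemmer or lemmatiser (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalise(description):
    """Lowercase, tokenise, expand abbreviations, then stem each token."""
    tokens = re.findall(r"[a-z]+", description.lower())
    expanded = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return [crude_stem(tok) for tok in expanded]
```

With this sketch, a note such as "Pt w/ L wrist fx" and a note such as "patient with left wrist fractures" share the tokens for "patient", "left" and "wrist", and both fracture variants reduce to the same stem, which is the kind of reconciliation a downstream topic model relies on.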

Results & Benefits

  • Opened a previously untapped data source for the client to incorporate into its operational reporting, strategy setting and cost optimisation processes.
  • Minimised the impact on internal teams and avoided a potentially costly development cycle for staff not experienced in working with complex unstructured data.
  • Developed topic-specific regression models to quantify cost estimates for new customers’ claims.
  • Topics and severity scores were later used in underwriting new insurance schemes and readjusting premiums for future vintages across different customer segments.
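
To make the idea of topic-specific regression concrete, here is a minimal Python sketch that fits one straight line per topic, relating a severity score to claim cost. The case study does not say which features or model form were used, so the single severity predictor, the function names and the toy rows are all assumptions made purely for illustration.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_topic(claims):
    """Fit one cost-versus-severity line per topic label."""
    grouped = defaultdict(list)
    for topic, severity, cost in claims:
        grouped[topic].append((severity, cost))
    models = {}
    for topic, rows in grouped.items():
        xs, ys = zip(*rows)
        models[topic] = fit_line(xs, ys)
    return models


def estimate_cost(models, topic, severity):
    """Predict the cost of a new claim from its topic and severity score."""
    intercept, slope = models[topic]
    return intercept + slope * severity


# Toy rows (topic, severity score, claim cost); the values are made up.
claims = [
    ("fracture", 1, 1000.0), ("fracture", 2, 1500.0), ("fracture", 3, 2000.0),
    ("back pain", 1, 400.0), ("back pain", 3, 800.0),
]
models = fit_per_topic(claims)
```

Fitting a separate model per topic lets each condition group have its own cost curve, at the price of needing enough claims in every topic to estimate it reliably.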