Ensemble Learning to find TV Series Ratings Using Text Scripts

Access the GitHub repository for the project here

Note: This was developed as a final project for the graduate-level course CSCI-6443 Data Mining under Prof. Bellaachia. The following is the abstract from the final term paper submission. Access the full-text version of the article here. View a brief presentation here.

TV series are arguably one of the most lucrative sectors within the broader entertainment media industry. In the past decade, with the advent of online streaming services, the number of series available to audiences has grown rapidly. Although it is a consumer-based industry, economic profitability lies at the center of this sector: for media houses, higher revenue means more resources for improved production quality. Thus, optimizing the performance of a series, and by extension the resulting monetary gain, is a prime consideration. While various non-computational techniques have been used in the past to estimate series performance, they are prone to bias, creating a need for computational, quantitative models that serve the same purpose.

Statistical and predictive techniques are thus increasingly being applied in the entertainment industry to predict a production's success and to better direct the high investment it requires. This article introduces an approach that uses text analytics, natural language processing, and data mining techniques to extract quantitative features from the script corpus of the popular television series “Friends.” These features are then fed into ensemble learning models, including stacked architectures, Gradient Boosting, and AdaBoost, among others, to predict the IMDb rating of each episode from its script. Positioned within the ongoing technological evolution of the media production industry, this project combines diverse data sources and employs advanced data mining methods to provide a robust and objective tool for estimating how an episode will be received by consumers.
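
To make the feature-extraction step more concrete, here is a minimal sketch of how a few numeric features might be computed from a single episode script. It is not the project's actual feature set: the function name `extract_features`, the chosen features, the `Speaker: dialogue` line format, and the sample text with the speakers "Alice" and "Bob" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed line format: "Speaker: dialogue". Stage directions such as
# "[Scene: ...]" do not match this pattern and are skipped.
SPEAKER_LINE = re.compile(r"^([A-Za-z .']+):\s*(.*)$")


def extract_features(script: str) -> dict:
    """Compute a few illustrative numeric features for one script (hypothetical helper)."""
    speaker_counts = Counter()
    words = []
    for raw_line in script.splitlines():
        match = SPEAKER_LINE.match(raw_line.strip())
        if match is None:
            continue  # blank line, stage direction, or unrecognized format
        speaker, dialogue = match.groups()
        speaker_counts[speaker.strip().lower()] += 1
        # Lowercased alphabetic tokens (apostrophes kept for contractions)
        words.extend(re.findall(r"[a-z']+", dialogue.lower()))

    n_lines = sum(speaker_counts.values())
    n_words = len(words)
    return {
        "n_lines": n_lines,
        "n_speakers": len(speaker_counts),
        "n_words": n_words,
        "words_per_line": n_words / n_lines if n_lines else 0.0,
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
    }


# Made-up sample text, not taken from any real script
sample = "Alice: Hello there, hello!\nBob: Hi Alice.\n[Scene: a coffee shop]\nAlice: Coffee?"
features = extract_features(sample)
```

Feature dictionaries like this one, computed per episode and paired with the episode's rating, could then be arranged into a table and passed to an ensemble regressor, for example scikit-learn's `StackingRegressor` with `GradientBoostingRegressor` and `AdaBoostRegressor` as base estimators. That pairing is one plausible setup rather than a description of the models used in the paper.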