Evaluation of IBM Watson Natural Language Processing Service to predict influenza-like illness outbreaks from Twitter data
Social media has opened the gates for collecting big data that can be used to monitor epidemic trends in real time. We evaluate whether Watson NLP service can be used to reliably predict infectious disease such as influenza-like illness (ILI) outbreaks using Twitter data during the period of the main influenza season. Watson’s performance is evaluated by computing Pearson correlation between the number of tweets classified by Watson as ILI and the number of ILI occurrences recovered from traditional epidemic surveillance system of the Centers for Disease Control and Prevention (CDC). Achieved correlation was 0.55. Furthermore, a 12 week discrepancy was found between peak occurrences of ILI predicted by Watson and CDC reported data. Additionally, we developed a scoring method for ILI prediction from Twitter posts using a simple formula with the ability to predict ILI two weeks ahead of CDC reported ILI data. The method uses Watson’s sentiment and emotion scores together with identified ILI features to analyze influenza-related posts in real time. Due to Watson's high computational costs of sentiment and emotion analysis, we tested if machine learning approach can be used to predict influenza using only identified ILI keywords as influenza predictors. All three evaluated methods (Random Forest, Logistic Regression, K-NN), achieved overall accuracy of ~68.2% and 97.5% respectively, when Watson and the developed formula are used as medical experts. The obtained results suggest that data found within social media can be used to supplement the traditional surveillance of influenza outbreaks with the help of intelligent computations.