Date of Award




Document Type

Master's Thesis

Degree Name

Master of Science (MS)


Department

Department of Electrical and Computer Engineering

Content Description

1 online resource (vii, 49 pages) : illustrations (chiefly color)

Dissertation/Thesis Chair

Gary J Saulnier

Committee Members

Daphney-Stavroula Zois, Mohammed Agamy


Keywords

Deep Learning, Emotion, LSTM, Machine Learning, Multimodal emotion, Emotion recognition, Interactive multimedia, Machine learning, Computer software, Speech processing systems

Subject Categories

Artificial Intelligence and Robotics | Computer Engineering | Psychology


Abstract

Emotion forecasting is the task of predicting the future emotion of a speaker, i.e., the emotion label of a future speaking turn, based on the speaker's past and current audio-visual cues. Emotion forecasting systems require new problem formulations that differ from traditional emotion recognition systems. In this thesis, we first explore two types of forecasting windows (i.e., analysis windows for which the speaker's emotion is being forecasted): utterance forecasting and time forecasting. Utterance forecasting is based on speaking turns and forecasts what the speaker's emotion will be after one, two, or three speaking turns. Time forecasting predicts what the speaker's emotion will be after a certain range of time, such as 3–8, 8–13, and 13–18 seconds. We then investigate the benefit of using the past audio-visual cues in addition to the current utterance. We design emotion forecasting models using deep learning. We compare the performances of fully connected deep neural networks (FC-DNNs), deep long short-term memory networks (D-LSTMs), and deep bidirectional LSTMs (D-BLSTMs), which allows us to examine the benefit of modeling dynamic patterns in emotion forecasting tasks. Our experimental results on the IEMOCAP benchmark dataset demonstrate that D-BLSTM and D-LSTM outperform FC-DNN by up to 2.42% in unweighted recall. When using both the current and past utterances, deep dynamic models show an improvement of up to 2.39% compared to their performance when using only the current utterance. We further analyze the benefit of using current and past utterance information compared to using the current and a randomly chosen utterance's information, and we find the performance improvement rises to 7.53%. The novelty of this study comes from its formulation of emotion forecasting problems and the understanding of how current and past audio-visual cues reveal future emotional information.
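The two window formulations in the abstract can be made concrete with a small sketch of how forecasting targets would be built from a labeled dialogue. This is an illustrative sketch only, not the thesis's actual code: the function names, the example turns, and their timestamps and labels are all hypothetical, and the time-window boundaries (3–8 seconds) are one of the ranges named above.

```python
# Illustrative sketch (hypothetical, not the thesis implementation):
# building forecasting targets from a sequence of labeled speaking turns.

def utterance_targets(labels, horizon):
    """Utterance forecasting: for turn t, the target is the emotion
    label of turn t + horizon (horizon = 1, 2, or 3 speaking turns)."""
    return [(t, labels[t + horizon]) for t in range(len(labels) - horizon)]

def time_targets(turns, lo, hi):
    """Time forecasting: turns is a list of (start_time, label).
    For each turn, the target is the label of the first later turn
    that starts between lo and hi seconds afterward."""
    out = []
    for i, (start, _label) in enumerate(turns):
        for later_start, later_label in turns[i + 1:]:
            if lo <= later_start - start <= hi:
                out.append((i, later_label))
                break
    return out

# Hypothetical example dialogue: turn-level emotion labels.
labels = ["neutral", "happy", "angry", "sad", "neutral"]
print(utterance_targets(labels, 1))
# -> [(0, 'happy'), (1, 'angry'), (2, 'sad'), (3, 'neutral')]

# Hypothetical (start_time, label) pairs; 3-8 second forecasting window.
turns = [(0.0, "neutral"), (4.0, "happy"), (10.0, "angry"), (16.0, "sad")]
print(time_targets(turns, 3, 8))
# -> [(0, 'happy'), (1, 'angry'), (2, 'sad')]
```

In both formulations the model's input is the current (and optionally past) utterance's audio-visual features, while the target label is deliberately drawn from a later point in the dialogue, which is what distinguishes forecasting from conventional emotion recognition.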