Emotions play an important role in human social interaction, and it is often said that emotions are what separate us from machines. Spoken words can be interpreted differently depending on how they are uttered: the same sentence can carry different meanings under different emotional states, and the human brain resolves these meanings by perceiving the underlying emotion in speech. Extracting emotional content from speech signals is therefore desirable, as it is a step toward teaching emotional intelligence to computers. Speech emotion recognition (SER) is an important field of study with applications ranging from emotionally intelligent robots and audio surveillance to web-based e-learning and computer games. The objective of this paper is to identify the emotional state of a speaker from audio using deep learning models, namely a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). To this end, the RAVDESS dataset (Ryerson Audio-Visual Database of Emotional Speech and Song) is used. For the experiments, we used approximately 1,247 speech and song audio files covering eight emotion classes. The experimental results show that the CNN-based model performed best, with an accuracy of 74.57%, while the RNN model reached only 55.47%. In future work, we will extend this study with different RNN variants and other deep neural networks (DNNs) such as autoencoders. Audio is a complex signal carrying both linguistic and paralinguistic features, and our longer-term goal is to combine these features with different neural network architectures to develop improved SER systems.
Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Deep Neural Network (DNN)
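The abstract does not give implementation details, so the following is only a minimal sketch, not the authors' code, of the kind of CNN classifier the paper describes: a 1D convolutional network over MFCC features predicting the eight RAVDESS emotion classes. The MFCC front end, layer sizes, and all hyperparameters are assumptions made for illustration.

```python
# Hypothetical sketch of a CNN-based SER classifier (architecture assumed,
# not taken from the paper): 1D convolutions over MFCC frames, eight classes.
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_mfcc=40, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time so clip length can vary
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):              # x: (batch, n_mfcc, time_frames)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)      # logits over the 8 emotion classes

# Example: a batch of 4 clips, 40 MFCC coefficients, 300 frames each.
model = EmotionCNN()
logits = model(torch.randn(4, 40, 300))
print(logits.shape)  # torch.Size([4, 8])
```

In practice, MFCCs could be extracted with a library such as librosa, and a matching RNN baseline would replace the convolutional feature extractor with a recurrent layer over the same frame sequence.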