I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorial are well explained, easy to understand and follow.
I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.
I understand that video input is nothing but a sequence of images (pixel intensities) displayed in a period of time (ex. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be feed into the network.
I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is inputed to the network.
Moreover, I understand from my studies that multi-modality representation is the way our brains really work as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously without us knowing about it through (joint representation). A simple example would be, if we hear a lion roar we instantly compose a mental image of a lion, feel danger and vice-versa. Multiple neural patterns are fired in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, etc.
The above mentioned is my ultimate goal, but for the time being I'm breaking down my problem for the sake of simplicity.
I would really appreciate if anyone can shed light on how audio is dissected and then later on represented in a convolutional neural network. I would also appreciate your thoughts with regards to multi-modal synchronisation, joint representations, and what is the proper way to train a CNN with multi-modal data.
EDIT: I have found out the audio can be represented as spectrograms. It as a common format for audio and is represented as a graph with two geometric dimensions where the horizontal line represents time and the vertical represents frequency.
Is it possible to use the same technique with images on these spectrograms? In other words can I simply use these spectrograms as input images for my convolutional neural network?
We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.
Plain convolutional networks do not capture the temporal characteristics, so for example in this work the output of the convolutional network was fed to a time-delay neural network. But our experiments show that even without additional elements convolutional networks can perform well at least on some tasks when the inputs have similar sizes.