Deep learning has proven effective for
multimodal speech recognition using frontal face images. In
this paper, we propose a new deep learning method, a trimodal
deep autoencoder, which takes as input not only audio signals
and face images but also depth images of faces. We collected
continuous speech data from 20 speakers with Kinect 2.0 and
used them for our evaluation. The experimental results at
10 dB SNR showed that our method reduced the error rate by
30% relative, from 34.6% with audio-only speech recognition
to 24.2%. In particular, it is effective for recognizing
consonants such as /k/ and /t/.
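
For concreteness, the following is a minimal sketch of the trimodal deep autoencoder idea: three per-modality encoders feed a shared representation, from which each modality is reconstructed. All layer sizes, feature dimensions, and the PyTorch framing are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class TrimodalDeepAutoencoder(nn.Module):
    """Joint autoencoder over audio, RGB-face, and depth-face features.

    Dimensions below are placeholders; the paper does not fix them here.
    """

    def __init__(self, audio_dim=39, rgb_dim=1024, depth_dim=1024,
                 hidden=256, shared_dim=128):
        super().__init__()
        # Per-modality encoders map each input to a common hidden size.
        self.enc_audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.enc_rgb = nn.Sequential(nn.Linear(rgb_dim, hidden), nn.ReLU())
        self.enc_depth = nn.Sequential(nn.Linear(depth_dim, hidden), nn.ReLU())
        # Shared layer fuses the three modalities into one representation.
        self.shared = nn.Sequential(nn.Linear(3 * hidden, shared_dim), nn.ReLU())
        # Per-modality decoders reconstruct each input from the shared code.
        self.dec_audio = nn.Linear(shared_dim, audio_dim)
        self.dec_rgb = nn.Linear(shared_dim, rgb_dim)
        self.dec_depth = nn.Linear(shared_dim, depth_dim)

    def forward(self, audio, rgb, depth):
        fused = torch.cat([self.enc_audio(audio),
                           self.enc_rgb(rgb),
                           self.enc_depth(depth)], dim=-1)
        h = self.shared(fused)
        return self.dec_audio(h), self.dec_rgb(h), self.dec_depth(h), h


def trimodal_loss(model, audio, rgb, depth):
    """Reconstruction loss summed over the three modalities."""
    a_hat, r_hat, d_hat, _ = model(audio, rgb, depth)
    mse = nn.functional.mse_loss
    return mse(a_hat, audio) + mse(r_hat, rgb) + mse(d_hat, depth)
```

In a setup like this, the shared code `h` would serve as the fused feature passed to a downstream speech recognizer; training on noisy audio while reconstructing clean targets is one common way such multimodal autoencoders gain robustness at low SNR.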