Speech intention classification with multimodal deep learning

Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, Ivan Marsic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

We present a novel multimodal deep learning architecture that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted by two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a softmax decision layer. We tested the proposed model in an actual medical setting, using speech recordings and their transcribed logs. Our model achieved 83.10% average accuracy in detecting six different intentions. We also found that our model, which uses automatically extracted features for intention classification, outperformed existing models that rely on hand-crafted features.
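The abstract describes a two-branch late-fusion design: one CNN per modality, concatenation into a joint representation, and a softmax decision layer over the six intention classes. Below is a minimal PyTorch sketch of that structure. All dimensions, kernel sizes, and the class name `MultimodalIntentClassifier` are illustrative assumptions, not the published hyperparameters.

```python
import torch
import torch.nn as nn

class MultimodalIntentClassifier(nn.Module):
    """Hypothetical sketch of the two-branch architecture described in
    the abstract: independent 1-D CNNs over word embeddings and acoustic
    frames, concatenated into a joint representation, then a softmax
    decision layer. Layer sizes are assumptions for illustration."""

    def __init__(self, embed_dim=300, acoustic_dim=34, num_classes=6):
        super().__init__()
        # Textual branch: convolution over the word-embedding sequence
        self.text_cnn = nn.Sequential(
            nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool to a sentence-level vector
        )
        # Acoustic branch: convolution over frame-level acoustic features
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(acoustic_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Joint textual-acoustic representation fed to the decision layer
        self.classifier = nn.Linear(128 + 128, num_classes)

    def forward(self, text, audio):
        # text: (batch, embed_dim, num_words)
        # audio: (batch, acoustic_dim, num_frames)
        t = self.text_cnn(text).squeeze(-1)
        a = self.audio_cnn(audio).squeeze(-1)
        joint = torch.cat([t, a], dim=1)  # late fusion by concatenation
        return self.classifier(joint)     # logits; softmax applied in the loss

# Example forward pass with random tensors standing in for real data
model = MultimodalIntentClassifier()
logits = model(torch.randn(8, 300, 40), torch.randn(8, 34, 200))
print(logits.shape)  # torch.Size([8, 6])
```

Training such a model with `nn.CrossEntropyLoss` applies the softmax implicitly, which is why the sketch returns raw logits rather than probabilities.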

Original language: English (US)
Title of host publication: Advances in Artificial Intelligence - 30th Canadian Conference on Artificial Intelligence, Canadian AI 2017, Proceedings
Editors: Philippe Langlais, Malek Mouhoub
Publisher: Springer Verlag
Pages: 260-271
Number of pages: 12
ISBN (Print): 9783319573502
DOIs
State: Published - 2017
Event: 30th Canadian Conference on Artificial Intelligence, AI 2017 - Edmonton, Canada
Duration: May 16 2017 - May 19 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10233 LNAI

Other

Other: 30th Canadian Conference on Artificial Intelligence, AI 2017
Country: Canada
City: Edmonton
Period: 5/16/17 - 5/19/17

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Keywords

  • Convolutional neural network
  • Multimodal intention classification
  • Textual-acoustic feature representation
  • Trauma resuscitation
