The research focuses on improving spoken language understanding (SLU) systems' ability to handle unseen intents. A generalized zero-shot audio-to-intent classification framework is proposed that requires only a few sample text sentences per intent. First, a supervised audio-to-intent classifier is trained on top of a self-supervised pre-trained model. A neural audio synthesizer is then used to create audio embeddings for the sample text utterances, and unseen intents are classified by cosine similarity between the input utterance's audio embedding and these synthesized embeddings (see the sketch below). A multimodal training strategy is also proposed that incorporates lexical information into the audio representation, further improving zero-shot performance. This multimodal approach improves zero-shot classification accuracy on unseen intents by 2.75% on SLURP and by 18.2% on an internal goal-oriented dialog dataset.
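As a rough illustration of the zero-shot classification step described above, here is a minimal Python sketch. It assumes embeddings are already available; `synthesize_embedding` stands in for a TTS-plus-audio-encoder pipeline and is a hypothetical placeholder, not the paper's implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_intent_prototypes(sample_texts_per_intent, synthesize_embedding):
    """Average the synthesized audio embeddings of the few sample
    sentences for each intent to form one prototype per intent.
    `synthesize_embedding` is a hypothetical callable standing in
    for the paper's neural audio synthesizer + audio encoder."""
    prototypes = {}
    for intent, texts in sample_texts_per_intent.items():
        embeddings = np.stack([synthesize_embedding(t) for t in texts])
        prototypes[intent] = embeddings.mean(axis=0)
    return prototypes

def classify_zero_shot(audio_embedding, prototypes):
    """Assign an utterance to the unseen intent whose prototype has
    the highest cosine similarity with its audio embedding."""
    return max(prototypes,
               key=lambda i: cosine_similarity(audio_embedding, prototypes[i]))
```

In this sketch, each unseen intent is represented by the mean of its synthesized sample embeddings; averaging is one reasonable choice, though the paper may score each sample individually.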

Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.02482