FAMuS: Frames Across Multiple Sources
The article introduces FAMuS (Frames Across Multiple Sources), a new corpus of Wikipedia passages reporting on events, paired with diverse source articles for the same event. It aims to tackle…
This section focuses on computational approaches to understanding, processing, and generating human languages.
The article introduces Fast Language-Audio Pre-training (FLAP), a self-supervised learning approach that learns to align audio and language representations through masking, contrastive learning, and reconstruction. FLAP randomly drops audio spectrogram…
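The masking and contrastive ideas named in the FLAP summary can be illustrated with a toy sketch. It assumes two things not stated in the summary: that "dropping" means keeping a random subset of spectrogram patches, and that the contrastive objective is a symmetric InfoNCE-style loss as used in CLIP-like models. The names `drop_patches` and `symmetric_contrastive_loss`, the `keep_ratio` parameter, and the 0.07 temperature are all hypothetical and illustrative. Reconstruction is not shown.

```python
import math
import random


def drop_patches(patches, keep_ratio, rng):
    """Keep a random subset of spectrogram patches, preserving their order.

    Hypothetical stand-in for random patch dropping; keep_ratio is an
    illustrative parameter, not a value from the article.
    """
    n_keep = max(1, int(len(patches) * keep_ratio))
    kept = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in kept]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _row_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss (assumed): audio i and text i form the positive pair."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    logits = [[dot(ai, tj) / temperature for tj in t] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (_row_cross_entropy(logits) + _row_cross_entropy(logits_t))


rng = random.Random(0)
patches = [[float(i), 1.0] for i in range(8)]  # toy patch features
kept = drop_patches(patches, keep_ratio=0.5, rng=rng)
print(len(kept))  # 4

audio = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(audio, matched) < symmetric_contrastive_loss(audio, shuffled))  # True
```

Dropping patches shortens the sequence the audio encoder has to process, which is one plausible reason such a step speeds up pre-training; the loss is low when each audio embedding is closest to its own caption and high when the pairs are shuffled.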
The paper explores the efficiency of self-supervised pre-trained audio models. It posits that these models can achieve comparable inference efficiency to more complex models that use speech transformer encoders. These…
The article discusses the application of multi-task learning (MTL) in end-to-end speech translation (ST). It investigates the consistency between different tasks in MTL and their effect on the ST task…
The article discusses the problem of relevance ranking: sorting objects according to a given criterion. This is a challenge because different users may prefer different criteria, so the ranking…
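To make the problem concrete, here is a small illustration (not the article's method) of why a single ranking cannot satisfy everyone: the same items sorted under two different user criteria come out in different orders. The `rank` function, the hotel data, and the feature names are all hypothetical.

```python
def rank(items, weights):
    """Return item names sorted by descending weighted score.

    items: mapping of item name -> feature dict (hypothetical features)
    weights: mapping of feature name -> weight, i.e. one user's criterion
    """
    def score(name):
        features = items[name]
        return sum(w * features.get(f, 0.0) for f, w in weights.items())

    return sorted(items, key=score, reverse=True)


hotels = {
    "A": {"price_score": 0.9, "rating": 0.4},
    "B": {"price_score": 0.3, "rating": 0.95},
    "C": {"price_score": 0.6, "rating": 0.7},
}
budget_user = {"price_score": 1.0}   # cares only about price
quality_user = {"rating": 1.0}       # cares only about rating

print(rank(hotels, budget_user))   # ['A', 'C', 'B']
print(rank(hotels, quality_user))  # ['B', 'C', 'A']
```

The two users get exactly reversed orders from the same data, which is the kind of conflict a relevance-ranking method has to resolve.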
The article presents GateLoop, a sequence model that enhances linear recurrent models such as S4, S5, LRU, and RetNet by using data-controlled state transitions. GateLoop surpasses existing models in auto-regressive…
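The difference between a fixed and a data-controlled state transition can be sketched with a scalar linear recurrence. This is a toy under stated assumptions, not GateLoop's actual parameterization (which works with vector or matrix states and learned projections): here the gate `a_t = sigmoid(w * c_t + b)` is computed from a hypothetical per-step control input `c_t`, and the weights `w` and `b` are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_recurrence(values, gates):
    """Scan h_t = a_t * h_{t-1} + v_t from h_0 = 0 and return every h_t."""
    h = 0.0
    states = []
    for v, a in zip(values, gates):
        h = a * h + v
        states.append(h)
    return states


def time_invariant_scan(values, a=0.9):
    # Fixed transition: the same decay a is applied at every step.
    return linear_recurrence(values, [a] * len(values))


def data_controlled_scan(values, controls, w=4.0, b=0.0):
    # Hypothetical data-controlled transition: a_t = sigmoid(w * c_t + b)
    # depends on the input, so each step decides how much of the previous
    # state to keep (a_t near 1) or forget (a_t near 0).
    gates = [sigmoid(w * c + b) for c in controls]
    return linear_recurrence(values, gates)


values = [1.0, 1.0, 1.0, 1.0]
controls = [5.0, 5.0, -5.0, 5.0]  # a strongly negative control nearly resets the state
fixed = time_invariant_scan(values)
gated = data_controlled_scan(values, controls)
print(fixed)
print(gated)
```

With a fixed decay the state can only fade at a constant rate, while the data-controlled gate lets the third input almost erase the accumulated state. Because the update is linear in `h`, recurrences of this form can also be evaluated with a parallel associative scan instead of the sequential loop shown here.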
The paper introduces a new task called multimodal planning problem specification aimed at generating a problem description (PD), a machine-readable file used by planners to find a plan. The authors…
The researchers propose a novel model called Dual-Phase Audio Transformer for Denoising (DPATD) to address the challenges of time-domain speech enhancement systems. DPATD splits the audio input into smaller chunks,…
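Splitting a waveform into chunks and stitching it back together can be sketched as follows. The summary is cut off before describing what DPATD does with the chunks, so the overlap (`hop` smaller than `chunk_size`), the zero-padding of the last chunk, and the `inter_chunk_view` transpose (the across-chunk axis used by dual-path models in general) are assumptions for illustration; all function names are hypothetical.

```python
def split_into_chunks(signal, chunk_size, hop):
    """Split a 1-D signal into overlapping chunks, zero-padding the last one."""
    if not 0 < hop <= chunk_size:
        raise ValueError("require 0 < hop <= chunk_size so no samples are skipped")
    chunks = []
    start = 0
    while True:
        chunk = list(signal[start:start + chunk_size])
        chunk += [0.0] * (chunk_size - len(chunk))  # pad a short final chunk
        chunks.append(chunk)
        if start + chunk_size >= len(signal):
            break
        start += hop
    return chunks


def inter_chunk_view(chunks):
    # Transpose so each row collects the same in-chunk position across chunks
    # (the "across chunks" axis a dual-path model would process).
    return [list(col) for col in zip(*chunks)]


def overlap_add(chunks, hop, length):
    """Reassemble chunks, averaging samples covered by more than one chunk."""
    size = (len(chunks) - 1) * hop + len(chunks[0])
    total = [0.0] * size
    counts = [0] * size
    for i, chunk in enumerate(chunks):
        for j, x in enumerate(chunk):
            total[i * hop + j] += x
            counts[i * hop + j] += 1
    return [s / c for s, c in zip(total[:length], counts[:length])]


signal = [float(n) for n in range(10)]
chunks = split_into_chunks(signal, chunk_size=4, hop=2)
print(len(chunks))  # 4
print(overlap_add(chunks, hop=2, length=len(signal)) == signal)  # True
```

Short chunks keep each attention window small on long waveforms; the round trip through `overlap_add` recovers the input exactly when the chunks are left unmodified.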
The article presents Distil-Whisper, a distilled variant of the Whisper model for speech recognition. To address the challenges of running large models in low-latency or resource-constrained environments, the researchers used…
The article discusses the importance of syllable stress in English pronunciation and the potential misunderstandings that can arise from incorrect syllable stress. The authors present a self-attention model for the…