The Columbia Speech Lab is Looking for Data Annotators

The Columbia Speech Lab is looking for data annotators for a computational linguistics project in Fall 2025. See below for the project description. Knowledge of Praat is a plus. If interested, please contact Julia Hirschberg ([email protected]) and Run Chen ([email protected]).

Project paper:
https://www.cs.columbia.edu/speech/PaperFiles/2024/iwsds24_raswda_paper.pdf

The Switchboard Dialog Act (SwDA) corpus has been widely used for dialog act prediction and generation tasks. However, due to misalignment between the text and speech data in this corpus, models incorporating prosodic information have shown poor performance. We report the misalignment issues present in the SwDA corpus caused by previous automatic alignment methods and introduce a re-aligned, improved version called RASwDA (Re-Aligned Switchboard Dialog Act Corpus). Our goal is to create the largest publicly available two-speaker dialog act corpus that has correctly aligned transcripts and speech. Through manual realignment and validation of 537.5 conversations completed so far, we have exceeded the state-of-the-art dialog act recognition results trained on SwDA. As we continue to expand RASwDA by re-aligning the remaining conversations from SwDA, we anticipate further improvements in model performance, facilitated by a larger and more accurate dataset.
