Code-switching — mixing two languages within a single conversation — is ubiquitous in multilingual communities worldwide. In India, Hindi-English (Hi-En) mixing is especially common in everyday chat, social media, and customer support conversations. Yet, despite being so prevalent, code-switched text remains largely underserved in NLP.
Our paper at EMNLP 2021, GupShup: Summarizing Open-Domain Code-Switched Conversations, addresses this gap.
The Problem
Automatic summarization of conversations has made great strides for English. But when conversations mix Hindi and English — sometimes mid-sentence — standard summarization models struggle. The vocabulary is mixed, grammar is hybrid, and training data is essentially nonexistent.
What We Built
We introduce GupShup, the first dataset for abstractive summarization of Hi-En code-switched conversations:
- 6,800+ conversations sourced from social media and chat platforms
- Human-annotated summaries in both English and Hi-En
- Detailed annotation guidelines we developed from scratch
- Code-switching statistics and linguistic analysis
Key Findings
We benchmarked multiple summarization approaches. Multilingual mBART and a multi-view seq2seq model achieved the best performance. The results also surface interesting failure modes — models sometimes default to a single language even when the input and expected summary are mixed.
Why It Matters
As LLMs become more capable, handling multilingual and code-switched text is increasingly important for real-world deployments. This dataset provides a foundation for that research.