slap-situational
SLAP-Situational is a dual-encoder model specialized for situational (utterance-level) speech style attributes including emotion and speaking style.
This is part of the SLAP model family from the paper:
SLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining. Anuj Diwan, Eunsol Choi, David Harwath. Interspeech 2026.
Model Details
- Architecture: WavLM-Large (speech encoder) + Granite Embedding 278M (text encoder) with projection to 768-dim shared space
- Speech encoder: `microsoft/wavlm-large` (317M params)
- Text encoder: `ibm-granite/granite-embedding-278m-multilingual` (278M params)
- Embedding dimension: 768
- Training data: ParaSpeechCaps (situational-tag subset, 298 hours)
- Training objective: InfoNCE contrastive loss
- Training: 4500 steps, Adam optimizer, lr=1e-5, 4x NVIDIA A40 GPUs, batch size 128
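The InfoNCE contrastive objective listed above can be sketched in plain PyTorch. This is a minimal illustration of the standard symmetric formulation (paired speech/text embeddings as positives, all other in-batch pairs as negatives), not the repository's actual implementation; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired speech/text embeddings (B, D)."""
    # L2-normalize so the dot product is cosine similarity
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(s.size(0))         # matched pairs lie on the diagonal
    # average the speech->text and text->speech cross-entropy terms
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

With perfectly matched, mutually orthogonal embeddings the loss approaches zero; mismatched batches are penalized.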
Usage
```python
import torch
from slap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

# Instantiate the dual-encoder with the same backbones used during training
model = CLAP(
    speech_name="microsoft/wavlm-large",
    text_name="ibm-granite/granite-embedding-278m-multilingual",
    embedding_dim=768,
)

# Load the released checkpoint; strict=False tolerates non-essential keys
state_dict = torch.load("slap-situational.pth.tar", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
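Once speech and text are embedded into the shared 768-dim space, retrieval reduces to cosine similarity. The sketch below is self-contained and uses random placeholder tensors standing in for the model's outputs; the `rank_captions` helper is illustrative, not part of the `slap` package.

```python
import torch
import torch.nn.functional as F

def rank_captions(speech_emb: torch.Tensor, text_embs: torch.Tensor):
    """Rank N candidate captions against one speech clip by cosine similarity."""
    s = F.normalize(speech_emb, dim=-1)   # (768,)
    t = F.normalize(text_embs, dim=-1)    # (N, 768)
    sims = t @ s                          # (N,) cosine similarities
    return sims.argsort(descending=True), sims

# Placeholder embeddings standing in for model outputs in the shared space
speech = torch.randn(768)
captions = torch.randn(5, 768)
order, sims = rank_captions(speech, captions)
```

The same dot-product ranking works in either direction (speech-to-text or text-to-speech retrieval), since both encoders project into the same space.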
See the GitHub repository for full usage examples.
Citation
```bibtex
@inproceedings{diwan2026slap,
  title={{SLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
  author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
  booktitle={Proc. Interspeech},
  year={2026}
}
```