slap-situational
SLAP-Situational is a dual-encoder model specialized for situational (utterance-level) speech style attributes including emotion and speaking style.
This is part of the SLAP model family from the paper:
SLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining. Anuj Diwan, Eunsol Choi, David Harwath. Interspeech 2026.
Model Details
- Architecture: WavLM-Large (speech encoder) + Granite Embedding 278M (text encoder) with projection to 768-dim shared space
- Speech encoder: `microsoft/wavlm-large` (317M params)
- Text encoder: `ibm-granite/granite-embedding-278m-multilingual` (278M params)
- Embedding dimension: 768
- Training data: ParaSpeechCaps (situational-tag subset, 298 hours)
- Training objective: InfoNCE contrastive loss
- Training: 4500 steps, Adam optimizer, lr=1e-5, 4x NVIDIA A40 GPUs, batch size 128
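The InfoNCE contrastive objective listed above can be sketched in plain PyTorch. This is a minimal illustration of the standard symmetric formulation (paired speech/text embeddings as positives, all other in-batch pairs as negatives), not the repository's actual implementation; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired speech/text embeddings (B, D)."""
    # L2-normalize so the dot product is cosine similarity
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(s.size(0))         # matched pairs lie on the diagonal
    # average the speech->text and text->speech cross-entropy terms
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

With perfectly matched, mutually orthogonal embeddings the loss approaches zero; mismatched batches are penalized.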
Usage
```python
import torch
from slap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

# Instantiate the dual-encoder with the same backbones used during training
model = CLAP(
    speech_name="microsoft/wavlm-large",
    text_name="ibm-granite/granite-embedding-278m-multilingual",
    embedding_dim=768,
)

# Load the released checkpoint; strict=False tolerates non-essential keys
state_dict = torch.load("slap-situational.pth.tar", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
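Once speech and text are embedded into the shared 768-dim space, retrieval reduces to cosine similarity. The sketch below is self-contained and uses random placeholder tensors standing in for the model's outputs; the `rank_captions` helper is illustrative, not part of the `slap` package.

```python
import torch
import torch.nn.functional as F

def rank_captions(speech_emb: torch.Tensor, text_embs: torch.Tensor):
    """Rank N candidate captions against one speech clip by cosine similarity."""
    s = F.normalize(speech_emb, dim=-1)   # (768,)
    t = F.normalize(text_embs, dim=-1)    # (N, 768)
    sims = t @ s                          # (N,) cosine similarities
    return sims.argsort(descending=True), sims

# Placeholder embeddings standing in for model outputs in the shared space
speech = torch.randn(768)
captions = torch.randn(5, 768)
order, sims = rank_captions(speech, captions)
```

The same dot-product ranking works in either direction (speech-to-text or text-to-speech retrieval), since both encoders project into the same space.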
See the GitHub repository for full usage examples.
Citation
```bibtex
@inproceedings{diwan2026slap,
  title={{SLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
  author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
  booktitle={Proc. Interspeech},
  year={2026}
}
```