slap-situational

SLAP-Situational is a dual-encoder speech-text model specialized for situational (utterance-level) speech style attributes such as emotion and speaking style.

This is part of the SLAP model family from the paper:

SLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining. Anuj Diwan, Eunsol Choi, David Harwath. Interspeech 2026.

Model Details

  • Architecture: WavLM-Large (speech encoder) + Granite Embedding 278M (text encoder) with projection to 768-dim shared space
  • Speech encoder: microsoft/wavlm-large (317M params)
  • Text encoder: ibm-granite/granite-embedding-278m-multilingual (278M params)
  • Embedding dimension: 768
  • Training data: ParaSpeechCaps (situational-tag subset, 298 hours)
  • Training objective: InfoNCE contrastive loss
  • Training: 4500 steps, Adam optimizer, lr=1e-5, 4x NVIDIA A40 GPUs, batch size 128
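The InfoNCE objective pulls each matched speech-text pair together while pushing apart the mismatched pairs within the same batch. The following is a minimal symmetric sketch of that loss, not the paper's exact implementation (the function name and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(speech_emb, text_emb, temperature=0.07):
    # L2-normalize both sides so dot products become cosine similarities
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits for all in-batch speech/text combinations
    logits = speech_emb @ text_emb.t() / temperature
    # The i-th speech clip matches the i-th caption
    targets = torch.arange(logits.size(0))
    # Symmetric cross-entropy: speech-to-text and text-to-speech directions
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Minimizing this loss drives the diagonal (matched-pair) similarities above all off-diagonal ones in both directions.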

Usage

import torch
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

from slap.model import CLAP

# Instantiate the dual-encoder architecture
model = CLAP(
    speech_name="microsoft/wavlm-large",
    text_name="ibm-granite/granite-embedding-278m-multilingual",
    embedding_dim=768,
)

# Load the released checkpoint; strict=False tolerates keys
# that are not present in the checkpoint
state_dict = torch.load("slap-situational.pth.tar", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Preprocessors for raw audio (speech encoder) and style captions (text encoder)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-embedding-278m-multilingual")

See the GitHub repository for full usage examples.
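Because both encoders project into the same 768-dim space, ranking candidate style descriptions against a speech clip reduces to cosine similarity over the embeddings. A self-contained sketch of that retrieval step, assuming you have already extracted the embeddings (the `rank_texts` helper is hypothetical, not part of the SLAP API):

```python
import torch
import torch.nn.functional as F

def rank_texts(speech_emb, text_embs):
    """Return text indices ordered from most to least similar to the speech clip.

    speech_emb: (768,) embedding of one speech clip
    text_embs:  (N, 768) embeddings of N candidate style descriptions
    """
    # Cosine similarity between the clip and every candidate caption
    sims = F.cosine_similarity(speech_emb.unsqueeze(0), text_embs, dim=-1)
    return sims.argsort(descending=True)

# Illustrative call with random stand-in embeddings
order = rank_texts(torch.randn(768), torch.randn(5, 768))
```

The top-ranked index identifies the caption the model judges closest in style to the input speech.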

Citation

@inproceedings{diwan2026slap,
  title={{SLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
  author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
  booktitle={Proc. Interspeech},
  year={2026}
}