🎬 RoBERTa Sentiment Analysis — Rotten Tomatoes

A fine-tuned RoBERTa-base model for binary sentiment classification of movie reviews. Upgraded from a DistilBERT baseline (82%) to achieve 88.5% test accuracy with balanced precision and recall.

Property	Value
Model	RoBERTa-base (124.6M parameters)
Task	Binary sentiment classification
Dataset	Rotten Tomatoes (full 8,530 training samples)
Test Accuracy	88.5%
Macro F1	0.8846
License	MIT

Model Description

This model is a fine-tuned version of FacebookAI/roberta-base trained on the full Rotten Tomatoes movie review dataset for binary sentiment classification.

Architecture: RoBERTa-base (12-layer, 768-hidden, 12-heads, 124.6M parameters) with a sequence classification head
Task: Predicting whether a movie review expresses a positive or negative sentiment
Output labels: POSITIVE (1) and NEGATIVE (0)
Primary use case: Analyzing sentiment in movie reviews and similar short-form text

RoBERTa (Robustly optimized BERT approach) improves on BERT with dynamic masking, larger batches, longer training on more data, and removal of the next-sentence prediction objective. Liu et al. (2019) report that these changes improve downstream performance on GLUE, SQuAD, and RACE.

Improvement Over Baseline

Model	Test Accuracy	Test F1	Improvement
DistilBERT (baseline, 2K samples)	82.0%	0.8199	—
RoBERTa-base (this model, 8.5K samples)	88.5%	0.8846	+6.5 pp

Quickstart — Inference

Using the `pipeline` API (Recommended)

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="keerthi1515/roberta-sentiment-rotten-tomatoes"
)

# Single prediction
result = classifier("This movie was absolutely fantastic! A must-watch.")
print(result)
# e.g. [{'label': 'POSITIVE', 'score': 0.997}] (the exact score may differ)

# Batch prediction
texts = [
    "A beautifully crafted film with outstanding performances.",
    "Terrible script and wooden acting throughout.",
    "An average movie, nothing special but not terrible either.",
]
results = classifier(texts)
for text, res in zip(texts, results):
    print(f"{res['label']} ({res['score']:.2%}): {text}")

Using model and tokenizer directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("keerthi1515/roberta-sentiment-rotten-tomatoes")
model = AutoModelForSequenceClassification.from_pretrained("keerthi1515/roberta-sentiment-rotten-tomatoes")

text = "One of the best films I have ever seen!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()

label_map = {0: "NEGATIVE", 1: "POSITIVE"}
print(f"Prediction: {label_map[predicted_class]} (confidence: {probabilities[0][predicted_class]:.2%})")

Dataset

Rotten Tomatoes Movie Reviews

Property	Details
Source	cornell-movie-review-data/rotten_tomatoes
Domain	Movie reviews
Task	Binary sentiment classification
Classes	Negative (0), Positive (1)
Total size	10,662 samples

Data Splits

Split	Samples	Class Balance
Train	8,530	50% positive / 50% negative
Validation	1,066	50% positive / 50% negative
Test	1,066	50% positive / 50% negative

Class Imbalance Assessment

The dataset is perfectly balanced at 50/50 across all splits (4,265 per class in train), so no class weighting, oversampling, or loss rebalancing is needed. Standard cross-entropy loss is sufficient, and accuracy is a meaningful headline metric.

Preprocessing

Tokenization: Byte-Pair Encoding (RoBERTa's default, 50,265 vocabulary)
Max sequence length: 128 tokens (with truncation)
Padding: Dynamic padding via DataCollatorWithPadding (pads to longest in each batch)
Training data: Full dataset used (8,530 training samples)
No additional text cleaning — raw review text fed directly to the tokenizer
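
To make the dynamic-padding step concrete, here is a minimal pure-Python sketch of what padding to the longest sequence in a batch means. It is an illustration, not the DataCollatorWithPadding implementation: `pad_batch` is a hypothetical helper, the inner token ids are made up, and the pad id of 1 is assumed from RoBERTa's vocabulary rather than read from the tokenizer.

```python
from typing import Dict, List

PAD_TOKEN_ID = 1  # RoBERTa's <pad> id; assumed here rather than read from the tokenizer


def pad_batch(batch: List[List[int]], pad_id: int = PAD_TOKEN_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch (not to max_length)."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two short reviews (0 = <s>, 2 = </s>).
batch = [[0, 713, 16, 372, 2], [0, 2387, 2]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[0, 713, 16, 372, 2], [0, 2387, 2, 1, 1]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because the batch is padded only to its own longest member, a batch of short reviews never pays for the full 128-token limit.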

Training Details

Training Configuration

Hyperparameter	Value
Base model	`FacebookAI/roberta-base` (124.6M params)
Learning rate	2e-5
LR scheduler	Linear warmup + linear decay
Warmup steps	100
Batch size (train)	16
Batch size (eval)	32
Max epochs	5 (with early stopping)
Early stopping	Patience=2, threshold=0.001
Optimizer	AdamW (fused) with β₁=0.9, β₂=0.999, ε=1e-8
Weight decay	0.01
Gradient clipping	max_grad_norm=1.0
Mixed precision	Disabled (CPU training)
Max sequence length	128 tokens
Seed	42
Steps per epoch	534
Framework	Transformers 5.7.0, PyTorch 2.11.0
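
The step count and learning-rate schedule in the table can be checked with a few lines of arithmetic. The sketch below assumes a single device, no gradient accumulation, and that all 5 epochs run; `linear_warmup_decay` is a hypothetical helper that follows the usual linear warmup plus linear decay rule, not code taken from the training script.

```python
import math

TRAIN_SAMPLES = 8530
BATCH_SIZE = 16
EPOCHS = 5
WARMUP_STEPS = 100
PEAK_LR = 2e-5

# The last, partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # 534
total_steps = steps_per_epoch * EPOCHS                   # 2670


def linear_warmup_decay(step: int) -> float:
    """Learning rate at an optimizer step: linear ramp to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)


for step in (0, 50, 100, 1385, 2670):
    print(step, f"{linear_warmup_decay(step):.2e}")
```

Under these assumptions the rate reaches its 2e-5 peak at step 100, is back to half of that at step 1385, and reaches zero at step 2670.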

Training Method

Full supervised fine-tuning using the Hugging Face Trainer API, with learning-rate warmup, gradient clipping, and early stopping:

Loaded pre-trained RoBERTa-base with a randomly initialized classification head (2 output classes)
Fine-tuned all 124.6M parameters (full fine-tuning, not LoRA/adapter-based)
Applied linear warmup over 100 steps followed by linear decay
Gradient clipping at norm 1.0 to prevent exploding gradients
Evaluated on validation set after each epoch
Early stopping with patience=2 monitors accuracy to prevent overfitting
Best model selected automatically via load_best_model_at_end=True
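
The configuration and steps above could be expressed roughly as the keyword arguments below. This is a sketch, not the author's training script: the keys are assumed to match `transformers.TrainingArguments` and `EarlyStoppingCallback` in the installed version (for example, `eval_strategy` was called `evaluation_strategy` in older releases), and `output_dir="out"` is a placeholder.

```python
# Keyword arguments mirroring the configuration table.
# Assumption: the names match transformers.TrainingArguments in the installed
# version; check them against that version's documentation before use.
training_kwargs = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 100,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "optim": "adamw_torch_fused",
    "eval_strategy": "epoch",          # evaluate on the validation set each epoch
    "save_strategy": "epoch",          # must match eval_strategy to reload the best model
    "load_best_model_at_end": True,
    "metric_for_best_model": "accuracy",
    "greater_is_better": True,
    "seed": 42,
}

# Assumption: argument names of transformers.EarlyStoppingCallback.
early_stopping_kwargs = {
    "early_stopping_patience": 2,
    "early_stopping_threshold": 0.001,
}

# Intended use (not executed here, requires transformers):
#   args = TrainingArguments(output_dir="out", **training_kwargs)
#   callbacks = [EarlyStoppingCallback(**early_stopping_kwargs)]
```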

Training Progress

Epoch	Training Loss	Validation Loss	Val Accuracy	Val Precision	Val Recall	Val F1
1	0.3014	0.3126	87.9%	0.8401	0.9362	0.8855
2	0.2930	0.3687	88.3%	0.8542	0.9231	0.8873
3	0.1685	0.4477	89.0%	0.8741	0.9118	0.8926
4	0.1041	0.5700	88.3%	0.8630	0.9099	0.8858
5	0.0613	0.5546	88.5%	0.8714	0.9024	0.8866

Note: Validation loss rises from 0.31 at epoch 1 to 0.57 at epoch 4 (dipping slightly to 0.55 at epoch 5) while training loss falls to 0.06, a typical overfitting pattern. Validation accuracy peaked at epoch 3 (89.0%), and load_best_model_at_end=True restored that best checkpoint rather than the final one.
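
To see how patience=2 and threshold=0.001 interact with the validation accuracies in the table above, the following sketch replays them through a simplified patience rule. `simulate_early_stopping` is a hypothetical helper written for this illustration; it approximates, but is not, the Transformers early-stopping callback.

```python
from typing import List, Tuple

# Validation accuracy per epoch, copied from the training progress table.
val_accuracy = [0.879, 0.883, 0.890, 0.883, 0.885]


def simulate_early_stopping(
    scores: List[float], patience: int = 2, threshold: float = 0.001
) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed."""
    best_score = float("-inf")
    best_epoch = 0
    stale = 0  # consecutive evaluations without a sufficient improvement
    for epoch, score in enumerate(scores, start=1):
        if score - best_score > threshold:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


best_epoch, last_epoch = simulate_early_stopping(val_accuracy)
print(f"best epoch: {best_epoch}, training ended after epoch: {last_epoch}")
# best epoch: 3, training ended after epoch: 5
```

In this run the patience limit and the 5-epoch cap coincide: epochs 4 and 5 both fail to beat epoch 3, so training would have stopped at epoch 5 either way.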

Evaluation Results

Test Set Performance (1,066 samples)

All metrics computed on the full test split (1,066 samples, perfectly balanced at 533 per class).

Metric	NEGATIVE	POSITIVE	Macro Average
Precision	0.8839	0.8853	0.8846
Recall	0.8856	0.8837	0.8846
F1-Score	0.8847	0.8845	0.8846
Support	533	533	1,066

	Overall
Accuracy	0.8846 (88.5%)
Weighted F1	0.8846
Test Loss	0.5103

Confusion Matrix

	Predicted NEGATIVE	Predicted POSITIVE
Actual NEGATIVE	472 (TN)	61 (FP)
Actual POSITIVE	62 (FN)	471 (TP)

True Positives: 471 — correctly identified positive reviews
True Negatives: 472 — correctly identified negative reviews
False Positives: 61 — negative reviews misclassified as positive
False Negatives: 62 — positive reviews misclassified as negative
Total misclassified: 123 / 1,066 (11.5%)
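
The per-class metrics reported above follow directly from the four confusion-matrix counts. The sketch below recomputes them in plain Python as a consistency check; `prf` is a helper name introduced here for illustration.

```python
# Counts from the confusion matrix above (POSITIVE is the positive class).
TN, FP, FN, TP = 472, 61, 62, 471


def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


pos = prf(TP, FP, FN)
# For the NEGATIVE class the roles swap: TN are its hits,
# FN are its false alarms, and FP are its misses.
neg = prf(TN, FN, FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
macro_f1 = (pos[2] + neg[2]) / 2

print(f"POSITIVE  P={pos[0]:.4f} R={pos[1]:.4f} F1={pos[2]:.4f}")
print(f"NEGATIVE  P={neg[0]:.4f} R={neg[1]:.4f} F1={neg[2]:.4f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

The printed values match the table: 0.8853 / 0.8837 / 0.8845 for POSITIVE, 0.8839 / 0.8856 / 0.8847 for NEGATIVE, and 0.8846 for both accuracy and macro F1.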

Error Analysis

Error Patterns

Error Type	Count	Avg Review Length
False Positives (neg→pos)	61	20.5 words
False Negatives (pos→neg)	62	22.8 words
Correct predictions	—	balanced

The errors are split almost evenly between false positives and false negatives, which indicates that the model does not systematically favor either class on this test set.

Common Misclassification Patterns

Irony / backhanded praise: "at its worst, the movie is pretty diverting; the pity is that it rarely achieves its best." (labeled POSITIVE, predicted NEGATIVE). The negative framing masks the positive intent
Mixed sentiment with negative language: "the film feels uncomfortably real, its language and locations bearing the unmistakable stamp of authority." (labeled POSITIVE, predicted NEGATIVE). "uncomfortably" triggers a negative signal
Understated or ambiguous reviews: "shiner can certainly go the distance, but isn't world championship material" (labeled POSITIVE, predicted NEGATIVE). Subtle conditional praise
Sarcastic inversions: Reviews using negative vocabulary for ironic positive effect
Short, cryptic reviews: Minimal context for inference
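
A breakdown like the error table above can be produced by grouping misclassified examples and averaging their lengths. The sketch below is one way to do that, using the three misclassified reviews quoted above as input. `summarize_errors` is a hypothetical helper, and counting words by whitespace splitting is an assumption about how the reported lengths were measured.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (text, true_label, predicted_label); 1 = POSITIVE, 0 = NEGATIVE.
# These three records are the misclassified reviews quoted above.
records: List[Tuple[str, int, int]] = [
    ("at its worst, the movie is pretty diverting; the pity is that it rarely achieves its best.", 1, 0),
    ("the film feels uncomfortably real, its language and locations bearing the unmistakable stamp of authority.", 1, 0),
    ("shiner can certainly go the distance, but isn't world championship material", 1, 0),
]


def summarize_errors(rows: List[Tuple[str, int, int]]) -> Dict[str, Dict[str, float]]:
    """Group misclassified rows into FP/FN and report count and mean word count."""
    groups: Dict[str, List[int]] = {"false_positive": [], "false_negative": []}
    for text, true, pred in rows:
        if true == pred:
            continue  # correct predictions are not part of the error summary
        key = "false_positive" if pred == 1 else "false_negative"
        groups[key].append(len(text.split()))  # assumption: whitespace word count
    return {
        name: {"count": len(lengths), "avg_words": mean(lengths) if lengths else 0.0}
        for name, lengths in groups.items()
    }


summary = summarize_errors(records)
print(summary)
```

Run over the full test predictions instead of these three examples, the same grouping would yield the counts and average lengths for both error types.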

Results Analysis

What 88.5% Accuracy Means

The model correctly classifies approximately 9 out of every 10 movie reviews — a 6.5 percentage point improvement over the DistilBERT baseline (82%). On a balanced binary task, this represents a 38.5 pp improvement over random chance (50%).
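
Since the test split has only 1,066 examples, it can help to put a rough uncertainty band around the headline number. The sketch below uses the normal-approximation (Wald) interval for a binomial proportion. That choice of interval is an assumption made here for illustration and is not part of the original evaluation.

```python
import math

correct, total = 943, 1066  # 472 TN + 471 TP out of the full test split

accuracy = correct / total
# Normal-approximation (Wald) standard error for a binomial proportion.
std_err = math.sqrt(accuracy * (1 - accuracy) / total)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(f"accuracy = {accuracy:.4f}")
print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
print(f"gain over 50% chance = {100 * (accuracy - 0.5):.1f} pp")
```

Under this approximation the interval is about 86.5% to 90.4%, so differences of a point or less between runs on this test set should be read cautiously.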

Why RoBERTa Outperforms DistilBERT

Factor	DistilBERT	RoBERTa
Parameters	67M	124.6M
Layers	6	12
Pre-training data	BookCorpus + Wikipedia	+ CC-News, OpenWebText, Stories
Training samples used	2,000 (23%)	8,530 (100%)
Test accuracy	82.0%	88.5%

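The gains likely come from three sources: (1) RoBERTa's larger capacity and richer pre-training, (2) training on the full dataset instead of a 2,000-sample subset, and (3) the training setup (warmup, early stopping, gradient clipping). Because the two runs differ in both model and training-set size, the comparison does not isolate how much each factor contributes.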

Limitations & Bias

Dataset Bias

Domain-specific: Trained exclusively on Rotten Tomatoes — a particular style of English-language film criticism
English only: No multilingual support
Temporal bias: Reviews from a specific time period; may not capture evolving language
Binary oversimplification: Neutral, mixed, or conditional sentiments forced into pos/neg

Model Limitations

Max input length: 128 tokens (roughly 90-100 English words). Longer reviews are truncated
No aspect-level analysis: Single overall sentiment, not per-aspect (acting, plot, etc.)
No explanation: Outputs label + confidence without reasoning
Domain transfer: Performance degrades on non-movie-review text without adaptation

Ethical Considerations

Not for high-stakes decisions without human review
Potential for misuse: Review manipulation, surveillance, opinion filtering
Representation gaps: Training data may not equally represent all demographics/dialects
Feedback loops: Automated sentiment filtering can suppress legitimate negative opinions

Use Cases

Recommended Applications

Movie review classification — Categorize user reviews as positive/negative
Customer feedback triage — Route feedback by sentiment polarity
Content analysis — Track sentiment trends in entertainment media
Baseline model — Strong starting point before investing in larger models
Educational tool — Demonstrate NLP fine-tuning and evaluation concepts

Not Recommended For

Medical, legal, or financial sentiment analysis
Real-time social media monitoring (without domain adaptation)
Multi-language or code-switched text
Fine-grained sentiment (1-5 stars, emotion detection)

Future Improvements

Improvement	Expected Impact (rough estimates, not measured)
RoBERTa-large (355M params)	+2-3 pp: more capacity for nuanced language
Hyperparameter sweep (lr, warmup, epochs)	+1-2 pp: systematic optimization
Data augmentation (back-translation, paraphrase)	+1-2 pp: more training diversity
Increase max_length (128→256)	+0.5-1 pp: capture full review context
Multi-dataset training (SST-2, IMDB, Amazon)	+2-4 pp: domain generalization
Knowledge distillation	Maintain accuracy with fewer params
GPU + fp16 training	10-20× faster training, enabling larger sweeps

Framework Versions

Component	Version
Transformers	5.7.0
PyTorch	2.11.0
Datasets	4.8.5
Tokenizers	0.22.2
Python	3.12

Citation

@article{liu2019roberta,
  title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
  author={Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1907.11692},
  year={2019}
}

@inproceedings{pang2005seeing,
  title={Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales},
  author={Pang, Bo and Lee, Lillian},
  booktitle={Proceedings of the ACL},
  year={2005}
}
