Collections including paper arxiv:2307.03109

- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
  Paper • 2310.11324 • Published • 1
- Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
  Paper • 2509.01790 • Published • 7
- POSIX: A Prompt Sensitivity Index For Large Language Models
  Paper • 2410.02185 • Published
- A Survey on Evaluation of Large Language Models
  Paper • 2307.03109 • Published • 43
- A Survey on Evaluation of Large Language Models
  Paper • 2307.03109 • Published • 43
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
  Paper • 2401.17072 • Published • 25
- LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
  Paper • 2402.10524 • Published • 23
- Levels of AGI for Operationalizing Progress on the Path to AGI
  Paper • 2311.02462 • Published • 36
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  Paper • 2206.04615 • Published • 6
- A Survey on Evaluation of Large Language Models
  Paper • 2307.03109 • Published • 43
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
  Paper • 2306.13651 • Published • 16
- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
  Paper • 2211.04325 • Published • 1
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 27
- On the Opportunities and Risks of Foundation Models
  Paper • 2108.07258 • Published • 2
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
  Paper • 2204.07705 • Published • 2
- Instruction-Following Evaluation for Large Language Models
  Paper • 2311.07911 • Published • 22
- HuggingFaceH4/mt_bench_prompts
  Viewer • Updated • 80 • 7.32k • 25
- vectara/hallucination_evaluation_model
  Text Classification • Updated • 96.3k • 350
- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 248
- Attention Is All You Need
  Paper • 1706.03762 • Published • 121
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- Learning to summarize from human feedback
  Paper • 2009.01325 • Published • 4
- Training language models to follow instructions with human feedback
  Paper • 2203.02155 • Published • 24