matlok's Collections

Papers - Fine-tuning - RLHF - Direct Nash Optimization (DNO)

Reward expressed as win rates under general preferences
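
As a rough illustration of the note above, a win-rate reward can be read as the probability that a response is preferred over a response drawn from a reference policy. The sketch below assumes a small, made-up preference matrix and policy; the names `win_rate_rewards`, `pref`, and `policy` are hypothetical and are not taken from the DNO paper or any library.

```python
def win_rate_rewards(pref, policy):
    """Expected win rate of each response against a sample from `policy`.

    pref[i][j] is the (assumed) probability that response i is preferred
    over response j; policy[j] is the probability of sampling response j.
    """
    return [sum(p * w for p, w in zip(policy, row)) for row in pref]


# Hypothetical pairwise preference probabilities for three responses.
pref = [
    [0.5, 0.8, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.7, 0.5],
]
# Hypothetical sampling distribution over the same three responses.
policy = [0.5, 0.25, 0.25]

rewards = win_rate_rewards(pref, policy)
print(rewards)
```

Because the reward depends only on pairwise preference probabilities, this form does not require the preferences to be explained by a single scalar score per response.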