sDPO: Don't Use Your Data All at Once
Upstage AI, South Korea

We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. Rather than using the available preference data all at once, sDPO divides it into chunks and uses them in a stepwise manner.
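
A minimal sketch of this stepwise schedule, assuming a user-supplied `dpo_train(target, reference, data)` callable that runs one ordinary round of DPO; the helper names and chunking scheme here are illustrative, not the paper's code:

```python
from copy import deepcopy

def split_into_chunks(data, num_chunks):
    """Partition the preference dataset D into roughly equal chunks D_1..D_T."""
    size = (len(data) + num_chunks - 1) // num_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def sdpo(sft_model, preference_data, num_steps, dpo_train):
    """Stepwise DPO: use the preference data in stages, not all at once.

    `dpo_train(target, reference, data)` is assumed to run one round of
    standard DPO and return the updated target model.
    """
    target = sft_model  # the initial SFT model S plays the role of M_0
    for chunk in split_into_chunks(preference_data, num_steps):
        # The model aligned at step t-1 becomes the frozen reference for
        # step t, so the reference is progressively more aligned instead
        # of staying fixed at the SFT model.
        reference = deepcopy(target)
        target = dpo_train(target=target, reference=reference, data=chunk)
    return target
```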

DPO training involves two models: a reference model, which is kept frozen and supplies baseline log probabilities, and a target model, which is the model actually being aligned.
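
For context, the standard DPO objective (Rafailov et al., 2023) makes the two roles explicit: the target policy $\pi_\theta$ is trained, while the frozen reference policy $\pi_{\text{ref}}$ only supplies the baseline log probabilities in the denominators:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right) \right]
```

Here $(x, y_w, y_l)$ is a prompt with its chosen and rejected responses, and $\beta$ controls how far the target may drift from the reference.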

Adopting open source models as reference models could be dangerous: the datasets used to train those models are often unclear and could unintentionally overlap with the preference datasets.

Initializing the target model as S, the initial SFT model, creates a differential between the reference and target models at every step, which may be amplified as the steps progress.
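
A small sketch of the two target-initialization choices at each step; the function name and signature are hypothetical, but the contrast mirrors the argument above:

```python
from copy import deepcopy

def init_step(prev_aligned, sft_model, reinit_from_sft=False):
    """Illustrative: target-initialization choices at step t of sDPO.

    `prev_aligned` is M_{t-1}, the model aligned at the previous step;
    `sft_model` is S. The reference at step t is always M_{t-1}.
    """
    reference = deepcopy(prev_aligned)  # frozen reference = M_{t-1}
    if reinit_from_sft:
        # Alternative: restart the target from S at every step. The
        # reference keeps advancing while the target resets, so the gap
        # between the two policies widens as the steps progress.
        target = deepcopy(sft_model)
    else:
        # sDPO's choice: the target starts identical to the reference,
        # so no initial differential is introduced at any step.
        target = deepcopy(prev_aligned)
    return target, reference
```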