sDPO: Don’t Use Your Data All at Once - Upstage AI, South Korea
We present stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning.
DPO training involves two models: a reference model and a target model.
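As a refresher on how the two models interact, here is a minimal sketch of the per-example DPO loss in pure Python. It assumes precomputed log-probabilities of the chosen and rejected responses under each model; the function name and argument names are illustrative, not from the paper.

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss.

    pi_logp_*  : log-prob of the chosen (w) / rejected (l) response
                 under the target model.
    ref_logp_* : same quantities under the (frozen) reference model.
    beta       : temperature controlling deviation from the reference.
    """
    # Margin between the target-vs-reference log-ratios of the
    # chosen and rejected responses.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the target model equals the reference model the margin is zero and the loss is log 2; preferring the chosen response more than the reference does drives the loss below that value.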
Adopting open-source models as reference models can be risky: the datasets used to train those models are often unclear and may unintentionally overlap with the preference datasets.
Initializing the target model as S creates a gap between the reference and target models, which may be amplified as the steps progress.
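The stepwise procedure can be sketched as a simple loop: the preference data is split into chunks, and the aligned model produced at each step serves as both the reference model and the initialization for the next step, so the two models start each step together rather than drifting apart. The `train_dpo` callback below is a hypothetical placeholder for a full DPO training run, not an API from the paper.

```python
def sdpo(initial_model, data_chunks, train_dpo):
    """Sketch of stepwise DPO.

    initial_model : the starting (e.g. SFT) model.
    data_chunks   : preference data split into per-step subsets.
    train_dpo     : callable(target_init, reference, data) -> aligned model;
                    stands in for one full DPO training run.
    """
    reference = initial_model
    for chunk in data_chunks:
        # The target starts from the current reference, so each step's
        # reference acts as a progressively stronger lower bound.
        target = train_dpo(target_init=reference, reference=reference, data=chunk)
        # The freshly aligned model becomes the next step's reference.
        reference = target
    return reference
```

Each step thus consumes only part of the data, which is the sense in which the data is not used all at once.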