BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
PRODID:UW-Madison-Physics-Events
BEGIN:VEVENT
SEQUENCE:3
UID:UW-Physics-Event-6889
DTSTART:20220323T160000Z
DTEND:20220323T171500Z
DTSTAMP:20260414T153725Z
LAST-MODIFIED:20220321T190309Z
LOCATION:Online Seminar: Please sign up for our mailing list at
  www.physicsmeetsml.org for the Zoom link. We will also livestream
  the talk in Chamberlin 5280.
SUMMARY:Tuning Large Neural Networks via Zero-Shot Hyperparameter Tran
 sfer\, Physics ∩ ML Seminar\, Greg Yang\, Microsoft Research
DESCRIPTION:You can't train GPT-3 on a single GPU\, much less tune its
  hyperparameters (HPs)...or so it seems. I'm here to tell you this is
  not true: you *can* tune its HPs on a single GPU even if you can't
  train it that way!\nIn the first half of this talk\, I'll describe
  how\, in the so-called maximal update parametrization (abbreviated
  µP)\, narrow and wide neural networks share the same set of optimal
  HPs. This lets us tune any large model by just tuning a small
  version of it\, which we call *µTransfer*. In particular\, this
  allowed us to tune the 6.7 billion parameter version of GPT-3 using
  only 7% of its pretraining compute budget and\, with some asterisks\,
  obtain performance comparable to the original GPT-3 model with twice
  the parameter count.\nIn the second half of this talk\, I'll discuss
  the theoretical reason µP has this special property and its
  connection to the study of infinite-width neural networks and\, more
  generally\, the theory of Tensor Programs.\nThe first half targets
  general practitioners and empirical researchers in machine learning\,
  while the second half targets those who are more theoretically
  curious. This talk is based on http://arxiv.org/abs/2011.14522
URL:https://www.physics.wisc.edu/events/?id=6889
END:VEVENT
END:VCALENDAR
