22 comments

  • spenrose 3 hours ago

    Something we need is no more papers titled " ... All You Need"

    • 0xdeadbeefbabe 2 hours ago

      "All you need is love" can be a recipe for producing offspring. I've been wondering about AI parallels.

    • joshdavham 2 hours ago

      Yeah it’s way too much of a cliché at this point.

      • etiam an hour ago

        I think we're more at the point (or beyond) of it being deliberately obnoxious as a failed attempt at humor. But maybe I'm underestimating just how idolized that original paper is.

        At any rate, by now I'm erring on the side of not promoting or citing them.

        • unnah 12 minutes ago

          Based on a quick googling, apparently the original paper is "One kitchen is all you need" by Sister Eudocia, Isabelle DeVerneil, and Jane Hildebrandt, published in Modern Hospital vol. 79 issue 3, pages 120-122, 1952. https://pubmed.ncbi.nlm.nih.gov/12992940/

  • eden-u4 7 hours ago

    Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model, while this one didn't achieve anything useful in 20.

    Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?

    • johndough 5 hours ago

      I tried it on a CNN-based CIFAR10 classifier, which worked well (only a tiny bit worse than Adam, but the difference might go away with hyperparameter tuning), but the optimizer totally failed (loss -> infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 so it would not explode, but that made it very slow to converge (roughly the settings sketched below).

      My summary is that the memory savings might be great if it works, but it does not work everywhere.
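
      For concreteness, this is roughly what I ended up with. The optimizer class/import is a placeholder (I'm only assuming the repo follows the usual torch.optim interface); the point is the eps/lr values.

          import torch
          import torch.nn as nn

          # Tiny stand-in model; my real one was a U-Net for segmentation.
          model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))

          # Hypothetical name for the paper's optimizer; with default eps/lr the loss
          # blew up on segmentation, and these values kept it stable but very slow:
          # optimizer = PaperOptimizer(model.parameters(), lr=1e-3, eps=1e-4)

          # Adam baseline for comparison (fine with defaults on the CIFAR10 CNN too):
          optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)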

      • jszymborski 2 hours ago

        Yeah, I mean, that's the rub with SGD... you need to spend a non-trivial compute budget on hyperparameter tuning, and the tuned result sometimes beats Adam.

        Adam, on the other hand, generally gets you pretty good results without futzing too much with the hyperparameters.

      • eden-u4 5 hours ago

        ah, numerical instability in the warmup stage might be the issue then?

    • akos23 4 hours ago

      More likely a bogus paper, neither their mathematical reasoning nor their experiments seem to hold up if you look at them closely.

      • Der_Einzige an hour ago

        A single main conference publication at a top AI conference has ROI in the millions for the first author. I watched someone in the middle of their undergrad with a single ACL workshop publication get a 150K starting offer. It's remarkable that anything real at all is published given how perverse the incentives are to blatantly make shit up.

    • cma 6 hours ago

      Did you set them to use the same memory budget? Adam holds more state (a rough accounting of that state is sketched at the end of this comment).

      They do say it consistently matches or outperforms despite its simplicity, and I think that statement is at the lower memory budget of their approach. But if it is at least promising, a fairer comparison would take advantage of the lower memory requirement and add more params to their version in the comparison.

      Also, under limitations, the paper notes slow initial convergence:

      > Moreover, our methods ensure a steady and stable update during training, allowing the model to converge better in a given task with sufficient training steps. Thus, we might observe that the convergence speed is relatively lower than Adam's in the early stage of training; as our primary focus is to investigate the effectiveness of the SaI approach, we left the acceleration of convergence speed in future work.
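
      Re the memory budget point above, a quick way to see how much extra state Adam keeps, using standard torch.optim (nothing specific to this paper's method; the toy model is just a placeholder):

          import torch

          model = torch.nn.Linear(1000, 1000)   # toy model, ~1M params
          n_params = sum(p.numel() for p in model.parameters())

          opt = torch.optim.Adam(model.parameters())
          model(torch.randn(8, 1000)).sum().backward()
          opt.step()                            # Adam allocates its buffers on the first step

          # Count elements held in optimizer state (exp_avg + exp_avg_sq per parameter).
          state_elems = sum(
              v.numel()
              for s in opt.state.values()
              for v in s.values()
              if torch.is_tensor(v) and v.dim() > 0
          )
          print(n_params, state_elems)          # state is roughly 2x the parameter count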

    • dist-epoch 7 hours ago

      Was the toy model a transformer?

      Maybe it's just way too small, you wouldn't use Karatsuba multiplication to do 3*5.

      • eden-u4 5 hours ago

        That's a wrong simile, given that you would get the same end result in both cases.

        I'm not using a transformer, just a plain feedforward network with ReLU and dropout for a simple classifier (roughly the sketch below).

        I don't know, I could be wrong. I hope some toy experiment shows that even at low parameter counts it works as well as Adam.
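
        To be concrete, the kind of model I mean is roughly this sketch (layer and input/output sizes are made up, just picked to land near 10k parameters):

            import torch.nn as nn

            # Plain feedforward classifier with ReLU and dropout; sizes are arbitrary.
            model = nn.Sequential(
                nn.Linear(32, 96),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(96, 64),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(64, 10),
            )
            print(sum(p.numel() for p in model.parameters()))  # 10026, i.e. ~10k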

  • akos23 4 hours ago

    I don't find this very convincing, both from a mathematical and experimental standpoint.

    It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "Signal-to-Noise ratio" they use is just gSNR = norm(g)/RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For a zero-mean iid random gradient, mean(g) ≈ 0 elementwise, and a similar argument probably holds for arbitrary (not completely random) high-dimensional gradients. In that case gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network (quick numerical check at the end of this comment).

    It also seems the optimal value in their hyperparameter sweeps sits at the edge of the search range in almost every case, and a granularity of 10x for the learning rate and weight decay is too coarse to make direct comparisons anyway.
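
    As a quick numerical check of the gSNR ≈ sqrt(d) claim, plain NumPy with the formula as written above (the random gradients are of course an idealization):

        import numpy as np

        rng = np.random.default_rng(0)
        for d in (100, 10_000, 1_000_000):
            g = rng.standard_normal(d)  # stand-in for an iid gradient tensor with d elements
            gsnr = np.linalg.norm(g) / np.sqrt(np.mean((g - g.mean()) ** 2))
            print(d, round(gsnr, 1), round(np.sqrt(d), 1))  # gSNR tracks sqrt(d) closely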

  • amunozo 7 hours ago

    It's time to stop the "All You Need" titles. This one does not even sound good.

    • v3ss0n 5 hours ago

      Need to write an article: `"All You Need" Considered Harmful`.

    • cuuupid 4 hours ago

      It's one of the most irritating snowclones because most of the time the papers are not presenting some dramatic leap forward like attention.

  • rob_c 8 hours ago

    Interesting take but:

    After a reread, it's nice to see the optimizer is faster, but how much time is actually spent in the optimizer? And can AdamW be tuned for a low-memory environment, given that it's greedy in trying to reduce the impact of statistical noise on gradient calculations?

    Note that when training on ImageNet-1k it only becomes comparable to AdamW after many epochs and in fact performs measurably worse for most of the training run. (How significant that is is up for debate, and depends on model/task/data.)

    Why not incorporate second-order changes into AdamW directly?

    The lower memory footprint is nice, but it's not immediately clear why this is the case. Is the batch size reduced? The model changed? I'll reread this after a second coffee and see if it is more obvious...

    Still promising if true.

    • yobbo 8 hours ago

      I haven't read more than the abstract of this particular paper, but it is expected that training behaves differently with/without Adam.

      The problem with Adam is that it keeps one more statistic (the same size as the model parameters) in memory. It also adds a little computation.

      The way to deal with it otherwise is to tune the momentum parameters and clip/limit the gradient in various ways (rough sketch below).
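
      Something like this, in standard PyTorch (just illustrating the momentum-plus-clipping recipe, not anything from this paper):

          import torch
          import torch.nn as nn

          model = nn.Linear(64, 10)        # stand-in model
          loss_fn = nn.CrossEntropyLoss()

          # One momentum buffer per parameter, instead of Adam's two moment estimates.
          opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

          x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
          for _ in range(10):
              opt.zero_grad()
              loss_fn(model(x), y).backward()
              # Clip the gradient to limit the noise that Adam's scaling would otherwise absorb.
              torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
              opt.step()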