GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

(twitter.com)

18 points | by laxmena an hour ago

5 comments

cadamsdotcom 5 minutes ago
Transformers scale poorly with context window size and parameter count.
Which means the results look really impressive when those N’s are small!
I’m but a pundit in this area, so I don’t know much. But one wonders if there’s a future in burning larger models into FPGAs: whether big enough FPGAs exist (or can be built), and whether locating specialized compute right next to the memory it needs can speed things up.
It would likely need a lot of algorithmic parallelism work, which would also translate back to CPUs/GPUs.
genxy 26 minutes ago
The context window is 16 characters. Talking about tokens per second is meaningless.
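
For scale, the headline numbers can be turned into a per-token cycle budget. This is only a back-of-the-envelope sketch using the two figures in the title; reading "56k" as exactly 56,000 tokens per second is an assumption, and the function name is made up for illustration.

```python
def cycles_per_token(clock_hz: float, tokens_per_second: float) -> float:
    """Average number of clock cycles available per generated token."""
    return clock_hz / tokens_per_second


CLOCK_HZ = 80e6            # 80 MHz, from the title
TOKENS_PER_SECOND = 56e3   # "56k" from the title, assumed to mean 56,000

budget = cycles_per_token(CLOCK_HZ, TOKENS_PER_SECOND)
print(f"{budget:.0f} cycles per token")
```

That works out to roughly 1,400 cycles per token, which says nothing about how the same design would behave with a longer context or a larger model.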
amelius an hour ago
See also:
https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-...
TL;DR: The CPU implementation was 71x faster than the FPGA.
Note: model has only 4192 parameters.
- cyanydeez 25 minutes ago
  Yeah, then there's prompt loading too.
  But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.
  - wmf 9 minutes ago
    That just sounds like a 3090.