A GPT-3-class LLM on your desktop?

The FlexGen paper by Ying Sheng et al. shows how to bring the hardware requirements of generative AI down to the scale of a commodity GPU.

github.com/FMInference/FlexGen

Paper on GitHub - authors at Stanford / Berkeley / ETH / Yandex / HSE / Meta / CMU

They run OPT-175B (a GPT-3 equivalent trained by Meta) on a single Nvidia T4 GPU (~$2,300) and achieve a throughput of 1 token/s (roughly 45 words per minute). Not cheap, but on the order of a high-end gaming rig.
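The words-per-minute figure follows from a common rule of thumb (not stated in the post) that one token corresponds to roughly 0.75 English words for GPT-style tokenizers. A quick sanity check:

```python
# Back-of-the-envelope check of the 45 words/minute claim.
# Assumption (not from the post): ~0.75 words per token, a common
# rule of thumb for GPT-style BPE tokenizers.
tokens_per_second = 1.0
words_per_token = 0.75  # assumed conversion factor

words_per_minute = tokens_per_second * 60 * words_per_token
print(words_per_minute)  # 45.0
```

So 1 token/s works out to about 45 words per minute, roughly the pace of a slow typist.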

The implications of personalized LLMs are amazing.

Qoto Mastodon