FPGAs don't scale; if they did, all GPUs would've been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn't make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here is TPUs, taking just the most efficient parts of a "GPU" when it comes to these workloads but still relying on memory access in every step of the computation.
I thought it was because the number of logic elements in a GPU is orders of magnitude higher than in an FPGA, rather than just processing speed. And GPU processing is inherently parallel, so the GPU beats the FPGA just based on transistor count.
It's not certain this is the future. The obvious trade-off is lack of flexibility: not only when a new model comes out, but also with varying demand in the data centers - one day people want more LLM queries, another day more diffusion queries.
Aaand, this blocks the holy grail of self-improving models, beyond in-context learning.
A realistic use case? More efficient vision-based drone targeting in Ukraine/Taiwan/whatever's next. That's the place where energy efficiency, processing speed, and also weight are most critical. Not sure how heavy ASICs are, though, but they should be proportional to the model size.
I heard many complaints about onboard AI 'not being there yet', and this may change it.
Not listing the Middle East, as there is no serious jamming problem there.
In a not-too-distant future (5 years?) small LLMs will be good enough to be used as generic models for most tasks. And if you have a dedicated ASIC small enough to fit in an iPhone, you have a truly local AI device, with the bonus point that you get something really new to sell in every new generation (i.e. access to an even more powerful model).
Yes but not in five years. The chips will be dirt cheap by then. Weâll get âintelligentâ washing machines that will discuss the amount of detergent and eventually berate us. Toasters with voice input. And really annoying elevators. Also bugs that keep an extremely low RF profile (only phoning home when the target is talking business).
Perceptible latency is somewhere between 10 and 100ms. Even if an LLM was hosted in every AWS region in the world, latency would likely be annoying if you were expecting near-realtime responses (for example, if you were using an LLM as autocomplete while typing). If, say, Apple had an LLM on a chip that any app could use via some SDK, it could feasibly unlock a whole bunch of use cases that would be impractical with a network call.
Also, offline access is still a necessity for many use cases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
The real benefit, to a very particular type of mind, is that the alignment will be baked in (presumably a lot more robust than today) and wrongthink will be eliminated once and for all. It will also help with flagging anyone who would need anything as dangerous as custom, uncensored models. Win/win.
To your point, it's neat tech, but the limitations are obvious since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
It doesn't have to be true for all models to be useful. Thinking about small models running on phones or edge devices deployed in the field, that would be a perfect use case for a "printed model".
This is a ridiculous mindset. Llama 3.1 8B can do lots of things today and it'll still be able to do those things tomorrow.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it will still be able to do that when Llama 4 or 5 or 6 comes out.
The point is that the GP's mindset is not very ridiculous if you value things by a price/utility ratio. Software and hardware advancements will lead to buyer's remorse faster than people get an ROI from local inference.
SW and HW advancements will bring this topic into the "good enough for the vast majority" territory, thus making GP's point moot. You don't care if your LLM ASIC chip is not the latest one, because it works for the use you purchased it for.
The highly dynamic nature of LLMs themselves will make part of the advantage of upgradable software not that interesting anymore. [1]
[1] although security might be a big enough reason for upgrades to still be required
Doesn't Google have custom TPUs that are kind of a halfway point between Taalas' approach and a generic GPU? I wonder if that kind of hardware will reach consumers. It probably will, though as I understand them NPUs aren't quite it.
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There's a bunch of fun infra to build to make this process cheaper/faster and I imagine MoE will bring some challenges.
> Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D Computer Graphics.
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".
Like the chip-software in Gibson's Sprawl, from the micro-soft to the ROM cowboy to the Aleph, the endgame of computer-tool distribution is via single-use chunks of quasi-biological computronium.
This would be a hell of a hot power bank. It uses about as much power as my oven. So probably more like inside a huge cooling device outside the house. Or integrated into the heating system of the house.
I haven't seen one, but I also don't tend to use it for anything other than a power supply, so I wouldn't know. Since the standard supports it, though, it's just a matter of the market needing a device like that.
The only product they've announced at the moment [0] is a PCI-e card. It's more like a small power bank than a big thumb drive.
But sure, the next generation could be much smaller. It doesn't require battery cells, (much) heat management, or ruggedization, all of which put hard limits on how much you can miniaturise power banks.
I wouldn't call that size a small power bank. That chip is in the same ballpark as gaming GPUs, and based on the VRMs in the picture it probably draws about as much power.
But as you said, the next generations are very likely to shrink (especially with them saying they want to do top of the line models in 2 generations), and with architecture improvements it could probably get much smaller.
That's the kind of hardware I am rooting for, since it'll encourage open-weight models and would be much more private.
In fact, I was thinking: what if robots of the future could have such slots, where they can use different models depending on the task they're given? Like a hardware MoE.
Is this accurate? I don't know enough about hardware, but perhaps someone could clarify: how hard would it be to reverse engineer this to "leak" the model weights? Is it even possible?
There are some labs that sell access to their models (mistral, cohere, etc) without having their models open. I could see a world where more companies can do this if this turns out to be a viable way. Even to end customers, if reverse engineering is deemed impossible. You could have a device that does most of the inference locally and only "call home" when stumped (think alexa with local processing for intent detection and cloud processing for the rest, but better).
It's likely possible to extract model weights from the chip's design, but you'd need tooling at the level of an Intel R&D lab, not something any hobbyist could afford.
I doubt anyone would have the skills, wallet, and tools to RE one of these and extract model weights to run them on other hardware. Maybe state actors like the Chinese government or similar could pull that off.
This is what I've been wanting! Just like those eGPUs you would plug into your Mac. You would have a big model or device capable of running a top-tier model under your desk. All local, completely private.
A cartridge slot for models is a fun idea. Instead of one chip running any model, you get one model or maybe a family of models per chip at (I assume) much better perf/watt. Curious whether the economics work out for consumer use or if this stays in the embedded/edge space.
So how does this Taalas chip work? Analog compute by putting the weights/multipliers on the cross-bars? Transistors in the sub-threshold region? Something else?
If we can print ASICs at low cost, this will change how we work with models.
Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphics cards again.
I hope lots of vendors will take note. Open weight models are abundant now. Even at a few thousand tokens/second, low buying cost and low operating cost, this is massive.
Super low latency inference might be helpful in applications like quant trading.
However, in an era where a frontier model becomes outdated after 6 months, I wonder how useful it can be.
I wonder how well this works with MoE architectures?
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.
At that point we are back to a chiplet approach...
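To put rough numbers on that mismatch (illustrative round figures only, not any specific model or Taalas' design):

    # Dense vs. MoE: what fraction of the stored weights is touched per token.
    # Hypothetical numbers, purely to illustrate the MAC-to-memory mismatch.
    dense_total, dense_active = 8e9, 8e9     # dense 8B: every weight used on every token
    moe_total, moe_active = 120e9, 5e9       # hypothetical MoE: only the routed experts are used

    print(dense_active / dense_total)        # 1.0  -> every stored weight sits next to a busy MAC
    print(moe_active / moe_total)            # ~0.04 -> most of the die is idle weight storage

In the dense case every stored weight earns its silicon on every token; in the MoE case the vast majority of weight storage is idle at any moment, which is exactly when a dedicated memory process starts to look better than mask ROM next to MACs.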
For comparison, I wanted to write about how Google handles MoE architectures with its TPUv4 architecture.
They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.
The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses.
Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.
from someone who runs AI inference pipelines for video production -- the cost per inference is what actually matters to me, not raw speed. right now i'm paying ~$0.003 per image generation and ~7 cents per 10-second animation clip. a full video costs under $2 in compute.
if dedicated ASICs can drop that by 10x while keeping latency reasonable, that changes the economics of the whole content creation space. you could afford to generate way more variations and iterate more, which is where the real quality gains come from. the bottleneck isn't speed, it's cost per creative iteration.
I can imagine this becoming a mainstream PCIe expansion card. Like back in the day we had a separate graphics card, audio card, etc. Now an AI card. So to upgrade the PC to the latest model, we could buy a new card, load up the drivers and boom, intelligence upgrade of the PC. This would be so cool.
This is exactly what's going to happen. Assuming no civilization-crippling or Great Filter events, anyway. At this point I fail to see how it could go any other way. The path has already been traveled, and governments (along with many other large organizations) will demand this functionality for themselves, which will eventually have a consumer market as well.
Another commenter mentioned how we keep cycling between local and server-based compute/storage as the dominant approach, and the cycle itself seems to be almost a law of nature. Nonetheless, regardless of where we're currently at in the cycle, there will always be both large and small players who want everything on-prem as much as possible.
I'm just wondering how this translates to computer manufacturers like Apple. Could we have these kinds of chips built directly into computers within three years? With insanely fast, local on-demand performance comparable to today's models?
When output is good enough, other considerations become more important. Most people on this planet cannot afford even an AI subscription, and the cost of tokens is prohibitive to many low-margin businesses. Privacy and personalization matter too, data sovereignty is a hot topic. Besides, we already see how focus has shifted to orchestration, which can be done on CPU and is cheap - software optimizations may compensate for hardware deficiencies, so it's not going to be frozen. I think the market for local hardware inference is bigger than for clouds, and it's going to repeat the Android vs iOS story.
This is the same justification that was used to ship the (now almost entirely defunct) NPUs on Apple and Android devices alike.
The A18 iPhone chip has 15b transistors for the GPU and CPU; the Taalas ASIC has 53b transistors dedicated to inference alone. If it's anything like NPUs, almost all vendors will bypass the baked-in silicon to use GPU acceleration past a certain point. It makes much more sense to ship a CUDA-style flexible GPGPU architecture.
Why are you thinking about phones specifically? Most heavy users are on laptops and workstations. On smartphones there might be a few more innovations necessary (low latency AI computing on the edge?)
Many laptops and workstations also fell for the NPU meme, which in retrospect was a mistake compared to reworking your GPU architecture. Those NPUs are all dark silicon now, just like these Taalas chips will be in 12-24 months.
Dedicated inference ASICs are a dead end. You can't reprogram them, you can't finetune them, and they won't keep any of their resale value. Outside cruise missiles it's hard to imagine where such a disposable technology would be desirable.
Bake in a Genius Bar employee, trained on your model's hardware, whose entire reason for existence is to fix your computer when it breaks. If it takes an extra 50 cents of die space but saves Apple a dollar of support costs over the lifetime of the device, it's worth it.
Is progress still exponential? Feels like it's flattening to me. It is hard to quantify, but if you could get Opus 4.2 to work at the speed of the Taalas demo and run locally, I feel like I'd get an awful lot done.
Yeah, the space moves so quickly that I would not want to couple the hardware with a model that might be outdated in a month. There are some interesting talking points, but a general-purpose programmable ASIC makes more sense to me.
I wonder if you could use the same technique (RAM models as ROM) for something like Whisper Speech-to-text, where the models are much smaller (around a Gigabyte) for a super-efficient single-chip speech recognition solution with tons of context knowledge.
Right now I have to wait 10 minutes at a time for the 2+ hour long transcriptions I've uploaded to Voxstral to process. The speed up here could be immense and worthwhile to so many customers of these products.
the LoRA on-chip SRAM angle is interesting but also where this gets hard. the whole pitch is that weights are physical transistors, but LoRA works by adding a low-rank update to those weights at inference time. so you're either doing it purely in SRAM (limited by how much you can fit) or you have to tape out a new chip for each fine-tune. neither is great. might end up being fast but inflexible -- good for commodity tasks, not for anything that needs customization per customer.
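to make the split concrete, here's a minimal sketch (my own illustration, not Taalas' implementation) of why only the low-rank factors would need writable storage:

    # LoRA-style inference: frozen base weights W (as if baked into ROM) plus a
    # small low-rank correction B @ (A @ x) whose factors could live in SRAM.
    import numpy as np

    d, r = 4096, 16                    # hidden size and LoRA rank (example values)
    W = np.random.randn(d, d)          # base weights: fixed, "printed" into silicon
    A = np.random.randn(r, d) * 0.01   # down-projection: small and updatable
    B = np.zeros((d, r))               # up-projection: small and updatable

    x = np.random.randn(d)             # one activation vector
    y = W @ x + B @ (A @ x)            # frozen path + low-rank correction

    print(W.size)                      # ~16.8M base parameters per matrix
    print(A.size + B.size)             # ~131K LoRA parameters at rank 16, under 1% of the base

so the SRAM-only path is plausible at modest ranks, but anything beyond that (full fine-tunes, new layers) would indeed mean a new tape-out.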
Yeah, I had written the blog to wrap my head around the idea of 'how would someone even be printing Weights on a chip?' 'Or how to even start to think in that direction?'.
I didn't explore the actual manufacturing process.
Frankly the most critical question is if they can really take shortcuts on DV etc, which are the main reasons nobody else tapes out new chips for every model. Note that their current architecture only allows some LORA-Adapter based fine-tuning, even a model with an updated cutoff date would require new masks etc. Which is kind of insane, but props to them if they can make it work.
From some announcements 2 years ago, it seems like they missed their initial schedule by a year, if that's indicative of anything.
For their hardware to make sense a couple of things would need to be true:
1. A model is good enough for a given usecase that there is no need to update/change it for 3-5 years. Note they need to redo their HW-Pipeline if even the weights change.
2. This application is also highly latency-sensitive and benefits from power efficiency.
3. That application is large enough in scale to warrant doing all this instead of running on last-gen hardware.
Maybe some edge-computing and non-civilian use-cases might fit that, but given the lifespan of models, I wonder if most companies wouldn't consider something like this too high-risk.
But maybe some non-text applications, like TTS, audio/video gen, might actually be a good fit.
TTS, speech recognition, OCR/document parsing, vision-language-action models, vehicle control, things like that do seem to be the ideal applications. Latency constraints limit the utility of larger models in many applications.
> It took them two months, to develop chip for Llama 3.1 8B. In the AI world where one week is a year, it's super slow. But in a world of custom chips, this is supposed to be insanely fast.
Llama 3.1 is like 2 years old at this point. Taking two months to convert a model that only updates every 2 years is very fast.
2 months of design work is fast, but how much time does fabrication, packaging, testing add? And that just gets you chips, whatever products incorporate them also need to be built and tested.
Does this mean computer boards will someday have one or more slots for an AI chip? Or peripheral devices containing AI models, which can be plugged into computer's high speed port?
It doesn't even need to be high speed. A minimal chip would have four pins: VCC, GND, TX, and RX. Even one-dollar microcontrollers can handle megabit-speed serial connections, which is fast enough for LLM communication.
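A rough back-of-envelope (my numbers, not from the thread) on why even a modest serial link is enough:

    # Text throughput of a 1 Mbit/s serial link vs. a fast token generation rate.
    link_bits_per_s = 1_000_000
    bytes_per_token = 4                     # rough average, ~4 characters of text per token
    link_tokens_per_s = link_bits_per_s / (8 * bytes_per_token)
    print(link_tokens_per_s)                # 31250.0 tokens/s over the wire

    generation_tokens_per_s = 10_000        # assumed very fast generation rate
    print(link_tokens_per_s > generation_tokens_per_s)   # True: the link is not the bottleneck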
The 6.5 transistors per coefficient ratio is fascinating. At 3-bit quantization you're already losing a lot of model quality, so the real question is whether the latency gains from running directly on silicon make up for the accuracy loss.
For inference-heavy edge deployments (think always-on voice assistants or real-time video processing), this could be huge even with degraded accuracy. You don't need GPT-4 quality for most embedded use cases. But for anything that needs to be updated or fine-tuned, you're stuck with a new chip fab cycle, which kind of defeats the purpose of using neural nets in the first place.
ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:
"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]
"Mask Programmable ROM Using Shared Connections" [3]
The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.
Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815 mm² die.
If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.
Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.
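A toy sketch of the multiplication-by-routing reading (my interpretation of [2], not confirmed by Taalas): compute input times value once for each of the 16 possible 4-bit codes, then every weight position just selects its pre-computed product.

    # Multiplication by routing: 16 shared multiplies per input, then per-weight selection.
    import numpy as np

    rng = np.random.default_rng(0)
    x = int(rng.integers(-128, 128))            # one input activation
    weights = rng.integers(0, 16, size=4096)    # 4-bit weight codes fanning out from this input
                                                # (code-to-value mapping simplified to identity here)

    products = np.array([x * v for v in range(16)])   # shared multiplier bank: 16 multiplies total
    contributions = products[weights]                 # per weight: just an indexed select / routed wire
    # each entry would then be accumulated into its own output's running sum downstream

    assert np.array_equal(contributions, x * weights)  # matches one-multiply-per-weight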
LSI Logic and VLSI Systems used to do such things in 1980s -- they produced a quantity of "universal" base chips, and then relatively inexpensively and quickly customized them for different uses and customers, by adding a few interconnect layers on top. Like hardwired FPGAs. Such semi-custom ASICs were much less expensive than full custom designs, and one could order them in relatively small lots.
Taalas of course builds base chips that are already closely tailored for a particular type of models. They aim to generate the final chips with the model weights baked into ROMs in two months after the weights become available. They hope that the hardware will be profitable for at least some customers, even if the model is only good enough for a year. Assuming they do get superior speed and energy efficiency, this may be a good idea.
It could simply be bit serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
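A toy sketch of the bit-serial idea (illustrative only, not their design):

    # Bit-serial multiply: at most four shift-and-add steps for a 4-bit weight.
    def bit_serial_mul(x: int, w4: int) -> int:
        acc = 0
        for bit in range(4):          # one step per weight bit, LSB first
            if (w4 >> bit) & 1:
                acc += x << bit       # add the appropriately shifted input
        return acc

    assert bit_serial_mul(37, 0b1011) == 37 * 11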
You still need to do a forward pass per token. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle but clearly they aren't doing that.
I thought about this exact question yesterday. Curious to know why we couldn't, if it isn't feasible. Would allow one to upgrade to the next model without fabricating all new hardware.
FPGAs have really low density so that would be ridiculously inefficient, probably requiring ~100 FPGAs to load the model. You'd be better off with Groq.
Not sure what you're on, but I think what you said is incorrect. You can use a high-density HBM-enabled FPGA with (LP)DDR5 and a sufficient number of logic elements to implement the inference. The reason we don't see it in action is most likely the fact that such FPGAs are insanely expensive and not as available off-the-shelf as GPUs are.
Yeah, FPGA+HBM works but it has no advantage over GPU+HBM. If you want to store weights in FPGA LUTs/SRAM for insane speed you're going to need a lot of FPGAs because each one has very little capacity.
How feasible would it be to integrate a neural video codec into the SoC/GPU silicon?
There would be model size constraints and what quality they can achieve under those constraints.
Would be interesting if it didn't make sense to develop traditional video codecs anymore.
The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.
Edit: reading the below it looks like I'm quite wrong here but I've left the comment...
The single transistor multiply is intriguing.
I'd assume they are layers of FMA operating in the log domain.
But everything tells me that would be too noisy and error prone to work.
On the other hand my mind is completely biased to the digital world.
If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.
Mulling it over, actually the noise probably doesn't matter. It'll average to 0.
It's essentially compute and memory baked together.
I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
The document referenced in the blog does not say anything about the single transistor multiply.
However, [1] provides the following description: "Taalas' density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."
It'll be different gates on the transistor for the different bits, and you power only one set depending on which bit of the result you wish to calculate.
Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...
That, or a resistor ladder with 4 bit branches connected to a single gate, possibly with a capacitor in between, representing the binary state as an analogue voltage, i.e. an analogue-binary computer. If it works for flash memory it could work for this application as well.
I'd expect this is analog multiplication with voltage levels being ADC'd out for the bits they want. If you think about it, it makes the whole thing very analog.
So if we assume this is the future, the useful life of many semiconductors will fall substantially. What part of the semiconductor supply chain would have pricing power in a world of producing many more different designs?
It might be not that bad. "Good enough" open-weight models are almost there, the focus may shift to agentic workflows and effective prompting. The lifecycle of a model chip will be comparable to smartphones, getting longer and longer, with orchestration software being responsible for faster innovation cycles.
"Good enough" open weights models were "almost there" since 2022.
I distrust the notion. The bar of "good enough" seems to be bolted to "like today's frontier models", and frontier model performance only ever goes up.
I don't see why. Today, frontier models are already 2 generations ahead of good enough. For many users they did not offer substantial improvement; sometimes things got even worse. What is going to happen within 1 year that will make users desire something beyond an already working solution? LLMs are reaching maturity faster than smartphones, which now are good enough to stay on the same model for at least 5-6 years.
Different skills and context. Llama 3.1 8B has just 128k context length, so packing everything into it may not be a great idea. You may want one agent analyzing the requirements and designing architecture, one writing tests, another one writing the implementation and a third one doing code review. With LLMs it also matters not just what you have in context, but what is absent, so that the model will not overthink it.
EDIT: just in case, I define an agent as an inference unit with specific preloaded context; in this case, at this speed, they don't have to be async - they may run in sequence in multiple iterations.
Just me, or does this seem incredibly frightening to anyone else? Imagine printing a misaligned LLM this way and never being able to update the HW to run a different (aligned) model.
It frightens me no more than the possibility of building a flawed airplane or a computer that overheats (looking at you, NVIDIA 12-pin) and "never being able to update the HW". Product recalls and redesigns exist for a reason.
If this happens, womp womp, recall the misaligned LLMs and learn from the mistake. It's part of running a hardware business as opposed to a software one.
I can't imagine they'd go for a full production run before at least testing a couple chips and finding issues.
Since model size determines die size, and die size has absolute limits as well as a correlation with yield, eventually it hits physical and economic limits. There was also some discussion about ganging chips.
From what I read here, the required chip size would scale linearly with the number of model weights. That alone puts a ceiling on the size of model.
Also the defect rate grows as the chip grows. It seems like there might be room for innovation in fault tolerance here, compared to a CPU where a randomly flipped bit can be catastrophic.
Imagine a Framework* laptop with these kinds of chips that could be swapped out as models get better over time
*Framework sells laptops and parts such that in theory users can own a ~~ship~~ laptop of Theseus over time without having to buy a whole new laptop when something breaks or needs upgrade.
Hmm, I guess you'll get this pile of used boards, which is not great waste-wise; but I guess they will get reused for a few generations.
A problem is it doesn't seem to be just the chips that would be thrown out but the whole board, which gets silly.
Few customers value tokens anywhere near what it costs the big API vendors. When the bubble pops the only survivors will be whoever can offer tokens at as close to zero cost as possible. Also whoever is selling hardware for local AI.
To those of us who use AI to get real work done in real products we build: we very much appreciate the value of each token, given how much operational overhead it offsets. A bubble pop, if one does indeed happen, would at best be as disruptive as the dot-com bust.
New GPUs come out all the time. New phones come out (if you count all the manufacturers) all the time. We do not need to always buy the new one.
Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.
hm yeah I guess if they stick to shitty models it works out, I was talking about the models people use to actually do things instead of shitposting from openclaw and getting reminders about their next dentist appointment.
The trick with small models is what you ask them to do. I am working on a data extraction app (from emails and files) that works entirely local. I applied for Taalas API because it would be awesome fit.
dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18
Considering that enamel regrowth is still experimental (only curodont exists as a commercial product), those dentist appointments are probably the most important routine healthcare appointments in your life. Pick something that is actually useless.
It all depends on how cheap they can get.
And another interesting thought: what if you could stack them? For example you have a base model module, then new ones come out that can work together with the old ones and expanding their capabilities.
> To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090
In full precision, yes. But this Taalas chip uses a heavily quantized version (the article calls it "3/6 bit quant", probably similar to Q4_K_M). You don't even need a GPU to run that with reasonable performance; a CPU is fine.
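Rough footprint math (my estimates; bit widths assumed) on why the quantized version is so much easier to host:

    # Approximate weight storage for an 8B-parameter model at different bit widths.
    params = 8e9
    fp16_gb = params * 16  / 8 / 1e9   # ~16 GB: needs a 16+ GB GPU
    q45_gb  = params * 4.5 / 8 / 1e9   # ~4.5 GB: roughly Q4_K_M territory
    q35_gb  = params * 3.5 / 8 / 1e9   # ~3.5 GB: a 3/6-bit mix, assuming ~3.5 bits/weight average
    print(round(fp16_gb), round(q45_gb, 1), round(q35_gb, 1))   # 16 4.5 3.5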
You obviously don't believe that AGI is coming in two release cycles, and you also don't seem to have much faith in the new models containing massive improvements over the last ones. So the answer to who is going to pay for these custom chips seems to be you.
>HOW NVIDIA GPUs process stuff? (Inefficiency 101)
Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly at making computation more efficient (low power/high throughput).
Then it proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there...
Arguably DRAM-based GPUs/TPUs are quite inefficient for inference compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but they still lose to different architectures that are better suited for inference.
The way modern Nvidia GPUs perform inference is that they have a processor (tensor memory accelerator) that directly performs tensor memory operations which directly concedes that GPGPU as a paradigm is too inefficient for matrix multiplication.
Ohh neat! A generalized version of this was the topic of my PhD dissertation:
https://kilthub.cmu.edu/articles/thesis/Modern_Gate_Array_De...
And they are likely doing something similar to put their LLMs in silicon. I would believe a 10x power-efficiency boost along with it being much faster.
The idea is that you can create a sea of generalized standard cells and it makes for a gate array at the manufacturing layer. This was also done 20 or so years ago, it was called a "structured ASIC".
I'd be curious to see if they use the LUT design of traditional structured ASICs or figured out what I did: you can use standard cells to do the same thing and use regular tools/PDKs to make it.
I think their "4-bit multiplier with a single transistor" bit is hinting at them using transistors in the sub-threshold regime.
So something that you can do with PDKs is add your own custom standard cell and tell the EDA tools to use them. This is actually pretty smart, this way you can use most of the foundry cells (which have been extensively validated) and focus on things like this "magic multiplier", that you will have to manually validate. This also makes porting across tech nodes easier if you manage only a handful of custom cells versus a completely custom design.
(I have my guesses as to what that is, but I admittedly don't know enough about that particular part of the field to give anything but a guess).
8B coefficients are packed into 53B transistors, about 6.5 transistors per coefficient. A two-input NAND gate takes 4 transistors and a register takes about the same. One coefficient gets processed (multiplied, and the result added to a sum) with less than two two-input NAND gates' worth of transistors.
I think they used block quantization: one can enumerate all possible blocks for all (sorted) permutations of coefficients and for each layer place only those blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
Matrices in Llama 3.1 are 4096x4096, 16M coefficients. They can be compressed into only 330 blocks, if we assume that all coefficient permutations are present, plus a network applying the correct permutations of inputs and outputs.
Assuming that blocks are the most area-consuming part, we have a per-block transistor budget of about 250 thousand transistors, or about 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
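Quick check of the 330-blocks figure above (my verification of the commenter's claim, counting distinct sorted blocks of 4 coefficients drawn from 8 possible 3-bit values):

    # Number of multisets of size 4 over 8 symbols = C(8 + 4 - 1, 4) = C(11, 4).
    from itertools import combinations_with_replacement
    from math import comb

    blocks = list(combinations_with_replacement(range(8), 4))
    print(len(blocks))          # 330
    print(comb(8 + 4 - 1, 4))   # 330, the closed form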
I'm looking forward to the model.toVHDL() method in PyTorch.
Ugh, quick, everyone start panic-buying FPGAs now.
The largest FPGAs have on the order of tens of millions of logic cells/elements. They're not even remotely big enough to emulate these designs except to validate small parts of them at a time, and unlike memory chips or GPUs, companies don't need millions of them to scale infrastructure.
(The chips also cost tens of thousands of dollars each)
They also aren't power friendly.
Deep Differentiable Logic Gate Networks
Is this a thing?
I gave a short talk about compiling PyTorch to Verilog at Latte '22. Back then we were just looking at a simple dot product operation, but the approach could theoretically scale up to whole models.
https://capra.cs.cornell.edu/latte22/paper/2.pdf
https://www.youtube.com/watch?v=QxwZpYfD60g
They mentioned that they're using strong quantization (IIRC 3-bit) and that the model was degraded by that. Also, they don't have to use transistors to store the bits.
gpt-oss is FP4 - they're saying they'll next try a mid-size one, I'm guessing gpt-oss-20b, then a large one, I'm guessing gpt-oss-120b, as their hardware is FP4-friendly.
I think they are talking about the transistors that apply the weights to the inputs.
What's the theoretical full-wafer-scale model they could produce?
The form factor discussion is fascinating but I think the real unlock is latency. Current cloud inference adds 50-200ms of network overhead before you even start generating tokens. A dedicated ASIC sitting on PCIe could serve first token in microseconds.
For applications like real-time video generation or interactive agents that need sub-100ms response loops, that difference is everything. The cost per inference might be higher than a GPU cluster at scale, but the latency profile opens up use cases that simply aren't possible with current architectures.
Curious whether Taalas has published any latency benchmarks beyond the throughput numbers.
latency and control, and reliability of bandwidth and associated costs - however this isn't just the pull for specialised hardware but for local computing in general, specialised hardware is just the most extreme form of it
there are tasks that inherently benefit from being centralised away, like say coordination of peers across a large area - and there are tasks that strongly benefit from being as close to the user as possible, like low latency tasks and privacy/control-centred tasks
simultaneously, there's an overlapping pull to either side caused by the monetary interests of corporations vs users - corporations want as much as possible under their control, esp. when it's monetisable information but most things are at volume, and users want to be the sole controller of products esp. when they pay for them
we had dumb terminals already being pushed in the 1960s, the "cloud", "edge computing" and all forms of consolidation vs segregation periods across the industry, it's not going to stop because there's money to be made from the inherent advantages of those models and even the industry leaders cannot prevent these advantages from getting exploited by specialist incumbents
once leaders consolidate, inevitably they seek to maximise profit and in doing so they lower the barrier for new alternatives
ultimately I think the market will never stop demanding just having your own *** computer under your control and hopefully own it, and only the removal of this option will stop this demand; while businesses will never stop trying to control your computing, and providing real advantages in exchange for that, only to enter cycles of pushing for growing profitability to the point average users keep going back and forth
As scary as it sounds today, a lightning-quick zero latency non-networked local LLM could provide value in an application like a self-driving car. It would be a level below Waymo's remote human support, so if the car couldn't figure out how to deal with a weird situation, it could ask the LLM what to do, hopefully avoiding the need to phone home (and perhaps handling cases where it couldn't phone home).
Waymo already has on-board NPU(s) with Transformer model(s) that are cheaper than Taalas.
The network latency bit deserves more attention. I've been trying to find out where AI companies are physically serving LLMs from but it's difficult to find information about this. If I'm sitting in London and use Claude, where are the requests actually being served?
The ideal world would be an edge network like Cloudflare for LLMs so a nearby POP serves your requests. I'm not sure how viable this is. On classic hardware I think it would require massive infra buildout, but maybe ASICs could be the key to making this viable.
> The network latency bit deserves more attention. I've been trying to find out where AI companies are physically serving LLMs from but it's difficult to find information about this. If I'm sitting in London and use Claude, where are the requests actually being served?
Unfortunately, as with most of the AI providers, it's wherever they've been able to find available power and capacity. They have contracts with all of the large cloud vendors, and lack of capacity is a significant enough issue that locality isn't really part of the equation.
The only thing they're particular about locality for is the infrastructure they use for training runs, where they need lots of interconnected capacity with low-latency links.
Inference is wherever, whenever. You could be having your requests processed halfway around the world, or right next door, from one minute to the next.
>You could be having your requests processed halfway around the world, or right next door, from one minute to the next
Wow, any source for this? It would explain why they vary between feeling really responsive and really delayed.
No, not in milliseconds if you have longish context. Prefill is very compute heavy, compared to inference.
I'd assume the next step is a small reasoning model that would demo whether inference speed can fill some intelligence gaps. Combine that with some RAG to see if there's a tension between intrinsic reasoning and pattern recognition.
This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.
Other than the obvious costs (but Taalas seems to be bringing back the structured ASIC era, so costs shouldn't be that high [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models, but as the models further improve, I can totally see this inside fully local + ultrafast + ultra-efficient processors.
[1] https://en.wikipedia.org/wiki/Structured_ASIC_platform
> I'm curious why this isn't getting much attention from larger companies.
I can see two potential reasons:
1) Most of the big players seem convinced that AI is going to continue to improve at the rate it did in 2025; if their assumption is somehow correct, by the time any chip entered mass production it would be obsolete.
2) The business model of the big players is to sell expensive subscriptions, and train on and sell the data you give it. Chips that allow for relatively inexpensive offline AI aren't conducive to that.
Apple would love to sell new iPhones with new llm models bound to the hardware/chip. One more reason to upgrade.
> I'm curious why this isn't getting much attention from larger companies
I would be shocked if Google isn't working on this right now. They build their own TPUs, this is an extremely obvious direction from there.
(And there are plenty of interesting co-design questions that only the frontier labs can dabble with; Taalas is stuck working around architectural quirks like "top-8 MoE", Google can just rework the architecture hyperparameters to whatever gets best results in silico.)
Well even programmable ASICs like Cerebras and Groq give many-multiples speedup over GPUs and the market has hardly reacted at all.
Seems both Nvidia (Groq) and OpenAI (Codex Spark) are now invested in the ASIC route one way or another.
The problem with Groq was they only allowed LoRA on Llama 8B and 70B, and you had to have an enterprise contract; it wasn't self-service.
> market has hardly reacted at all
Guess who acqui-hired Groq to push this into GPUs?
The name GPU has been an anachronism for a couple of years now.
Cerebras gives a many multiple speedup but it's also many multiples more expensive.
Apple should have done this yesterday. A local AI on my phone/Macbook is all I really want from this tech.
The cloud-based AIs (OpenAI, etc.) are today's AOL.
https://developer.apple.com/documentation/FoundationModels
They did do it yesterday.
And it produced fake headlines and summaries including the threat of lawsuits from involved person(s).
Apple usually waits until somebody else has refined a technology to "invent" it, but I guess they couldn't wait for this one.
The die size is huge. This isn't the kind of chip that would go into your MacBook, let alone an iPhone.
It's for cloud-based servers.
And computers used to be the size of a room. I think they can get it to iPhone size in the future, this is an early prototype.
That's the part that people are missing: it won't get smaller. It already required heroic optimization to get 8B on one megachip. Taalas is more expensive but faster. It is cheaper per token when running 24x7 but not cheap to buy. It will never be small and never be cheap.
Well, there's a limit to how small we can make transistors with our current technology. As I understand it, Intel is already running into those limits with their new CPUs (they had to redesign the fins IIRC). I can imagine that without an actual breakthrough in chip manufacturing the size could stay large. That's not to say that a breakthrough won't happen, though.
The hardware isn't there yet. Apple's neural engine is neat and has some uses but it just isn't in the same league as Claude right now. We'll get there.
> I'm curious why this isn't getting much attention from larger companies.
Time is money, and when you're competing with multiple companies with little margin for error, you'll focus all your effort on releasing things quickly.
This chip is "only" a performance boost. It will unlock a lot of potential, but startups can't divide their attention like this. Big companies like Google are surely already investigating this avenue, but they might lack hardware expertise.
I'm surprised people are surprised. Of course this is possible, and of course this is the future. This has been demonstrated already: why do you think we even have GPUs at all?! Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D computer graphics. And these LLMs are practically the same math; it's all just obvious and inevitable if you pay attention to how we got the hardware we have today.
I believe this is a CPU/GPU vs ASIC comparison, rather than CPU vs GPU. They have always(ish) coexisted, being optimized for different things: ASICs have cost/speed/power advantages, but the design is more difficult than writing a computer program, and you can't reprogram them.
Generally, you use an ASIC to perform a specific task. In this case, I think the takeaway is the LLM functionality here is performance-sensitive, and has enough utility as-is to choose ASIC.
It reminds me of the switch from GPUs to ASICs in bitcoin mining. I've been expecting this to happen.
But the BTC mining algorithm has not and will not change. That's the only reason ASICs at least make a bit of sense for crypto.
AI being static weights is already challenged by the frequent model updates we see - and it may even become a relic once we find a new architecture.
We can expect the model landscape to consolidate some day. Progress will become slower, innovations will become smaller. Not tomorrow, not next year, but the time will come.
And then it'll increasingly make sense to build such a chip into laptops, smartphones, wearables. Not for high-end tasks, but to drive the everyday bread-and-butter tasks.
The world continues to evolve in a way that requires flexibility - not more constraints. I just fail to see a future where we want fewer general-purpose computers and more hard-wired ones. Would be interesting to be proven wrong, though!
Sounds to me like there's potential to use these for established models to provide a cost/scale advantage while frontier models run in the existing setup.
IME Llama et al. require LoRA or fine-tuning to be usable. That's their real value vs closed-source massive models, and their small size makes this possible, appealing, and doable on a recurring basis as things evolve. Again, rendering ASICs useless.
Read the blog post. It mentions that their chip has a small SRAM which can store LoRA.
Neither the blog nor Taalas' original post specifies what speed to expect when using the SRAM in conjunction with the baked-in weights. To be taken seriously, that really needs a detailed explanation rather than a passing mention.
Heh, I said this exact thing in another thread the other day. Nice to see I wasn't the only one thinking it.
The middle ground here would be an FPGA, but I believe you would need a very expensive one to implement an LLM on it.
FPGAs would be less efficient than GPUs.
FPGAs don't scale; if they did, all GPUs would've been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn't make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here is TPUs: they take the most efficient parts of a "GPU" for these workloads but still rely on memory access in every step of the computation.
I thought it was because the number of logic elements in a GPU is orders of magnitude higher than in an FPGA, rather than just processing speed. And GPU processing is inherently parallel, so the GPU beats the FPGA just based on transistor count.
"This has been demonstrated alreadyâŚ"
I think burning the weights into the gates is kinda new.
("Weights to gates." "Weighted gates"? "Gated weights"?)
Is this not effectively the same thing as a Bitcoin ASIC?
Geights? Wates?
gweights
Not really new; this is the '80s-'90s Neuron MOS Transistor.
It's also not that different from how TPUs work, where they have special registers in their PEs for weights.
It's not certain this is the future: the obvious trade-off is lack of flexibility, not only when a new model comes out, but also with varying demand in the data centers - one day people want more LLM queries, another day more diffusion queries. Aaand, this blocks the holy grail of self-improving models, beyond in-context learning. A realistic use case? More efficient vision-based drone targeting in Ukraine/Taiwan/whatever's next. That's the place where energy efficiency, processing speed, and also weight are most critical. Not sure how heavy ASICs are, though, but they should be proportional to the model size. I've heard many complaints about onboard AI 'not being there yet', and this may change it. Not listing the Middle East as there is no serious jamming problem there.
In a not-too-distant future (5 years?) small LLMs will be good enough to be used as generic models for most tasks. And if you have a dedicated ASIC small enough to fit in an iPhone, you have a truly local AI device, with the bonus that you get something really new to sell in every new generation (i.e. access to an even more powerful model).
The Taalas approach is much more expensive than the NPU that phones already have.
Yes, but not in five years. The chips will be dirt cheap by then. We'll get "intelligent" washing machines that will discuss the amount of detergent and eventually berate us. Toasters with voice input. And really annoying elevators. Also bugs that keep an extremely low RF profile (only phoning home when the target is talking business).
No, Taalas requires more silicon which will always cost more than storing weights in DRAM.
It doesn't need to go in the phone if it only takes a few milliseconds to respond and is cheap.
Perceptible latency is somewhere between 10 and 100 ms. Even if an LLM was hosted in every AWS region in the world, latency would likely be annoying if you were expecting near-realtime responses (for example, if you were using an LLM as autocomplete while typing). If, say, Apple had an LLM on a chip that any app could access via an SDK, it could feasibly unlock a whole bunch of use cases that would be impractical with a network call.
Also, offline access is still a necessity for many use cases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
https://www.cloudping.co/
It does if you care about who can access your tokens.
The real benefit, to a very particular type of mind, is that the alignment will be baked in (presumably a lot more robust than today) and wrongthink will be eliminated once and for all. It will also help flag anyone who would need anything as dangerous as custom, uncensored models. Win/win.
To your point, it's neat tech, but the limitations are obvious since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
It doesn't have to be true for all models to be useful. Thinking about small models running on phones or edge devices deployed in the field, that would be a perfect use case for a "printed model".
I'd be kind of shocked if Nvidia isn't playing with this.
I don't expect it's like super commercially viable today, but for sure things need to trend to radically more efficient AI solutions.
These are chips that become e-waste the second a better model comes out, and Nvidia is already limited by TSMC capacity.
This is a ridiculous mindset. Llama 3.1 8B can do lots of things today and it'll still be able to do those things tomorrow.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it will still be able to do that when Llama 4 or 5 or 6 comes out.
If you pay $1,500 for a Mistral ASIC that is beaten by a $15 Qwen ASIC that comes out six months later, you'd be feeling pretty dang ridiculous.
I'm equally capable of making up numbers to support my perspective but I don't see the point.
The point is that the GP's mindset is not very ridiculous if you value things by a price/utility ratio. Software and hardware advancements will lead to buyer's remorse faster than people get an ROI from local inference.
SW and HW advancements will bring this into "good enough for the vast majority" territory, thus making the GP's point moot. You don't care if your LLM ASIC chip is not the latest one, because it works for the use you purchased it for. The highly dynamic nature of LLMs themselves will make part of the advantage of upgradable software not that interesting anymore. [1]
[1] although security might be a big enough reason for upgrades to still be required
They'll be perfect for an appliance like the Rick and Morty butter robot.
These aren't made for general chatbot use.
Only in VC backed funding land.
In the real world, there are talking refrigerators that don't need to know how to recite Shakespeare.
On the upside, Shakespeare isn't going to change soon.
So you're saying we should burn Shakespeare onto a chip? /s
Doesn't Google have custom TPUs that are kind of a halfway point between Taalas' approach and a generic GPU? I wonder if that kind of hardware will reach consumers. It probably will, though as I understand them NPUs aren't quite it.
Are people surprised?
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There's a bunch of fun infra to build to make this process cheaper/faster, and I imagine MoE will bring some challenges.
> Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D Computer Graphics.
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".
Job-specific ASICs are as old as time.
> Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.
Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.
That slot is called USB-C. I can fully imagine inference ASICs coming in powerbank form factor that you'd just plug and play.
Like the chip-software in Gibson's Sprawl, from the micro-soft to the ROM cowboy to the Aleph, the endgame of computer-tool distribution is via single-use chunks of quasi-biological computronium.
Michael Bay just read "computronium" and spawned an 8 movie franchise in his head.
This would be a hell of a hot power bank. It uses about as much power as my oven. So probably more like inside a huge cooling device outside the house. Or integrated into the heating system of the house.
(Still compelling!)
*The whole server uses 2.2 kW or whatever, not a single board. I think that was for 8 boards or something.
Oh does it? Thanks for the clarification then. Their home page said 2.5kW so I assumed that's what it is.
To be fair, 2.5kW does sound too much for a single 3x3cm chip, it would probably melt.
More powwwwaaa!
Yeah, though I suppose once we get properly 3D silicon I would not be surprised at that power rating; 3 cm^3 would be something to behold.
> USB-C
With these speeds you can run it over USB2, though maybe power is limiting.
You would likely need external power anyway.
USB-C is just a form factor and has nothing to do with which protocol you run at which speeds.
I wasn't talking about the form factor.
Not if you need 200 W of power to run inference.
USB-C can do up to 240W. These days I power all my devices with a USB hub, even my Lipo charger.
Have you seen a device that can supply 240 W and act as a data host? Or is 240 W only available from dedicated chargers?
I haven't seen one, but I also don't tend to use it for anything other than a power supply, so I wouldn't know. Since the standard supports it, though, it's just a matter of the market needing a device like that.
Pretty sure it'd just be a thumbdrive. Are the Taalas chips particularly large in surface area?
The only product they've announced at the moment [0] is a PCI-e card. It's more like a small power bank than a big thumb drive.
But sure, the next generation could be much smaller. It doesn't require battery cells, (much) heat management, or ruggedization, all of which put hard limits on how much you can miniaturise power banks.
[0] https://taalas.com/the-path-to-ubiquitous-ai/
I wouldn't call that size a small power bank. That chip is in the same ballpark as gaming GPUs, and based on the VRMs in the picture it probably draws about as much power.
But as you said, the next generations are very likely to shrink (especially with them saying they want to do top of the line models in 2 generations), and with architecture improvements it could probably get much smaller.
I'm old enough to remember your typical computer filling warehouse-sized buildings.
Nowadays, your average cellphone has more computing power than those behemoths.
I have a micro SD card with 256GB capacity, and I think they are up to 2TB. On a device the size of a fingernail.
That is all definitely amazing, but data storage is a fundamentally different process with far fewer constraints than continuous computation.
It all uses the same miniaturization techniques, though.
800 mm2, about 90mm per side, if imagined as a square. Also, 250 W of power consumption.
The form factor should be anything but thumbdrive.
mmmhhhhh 800mm2 ~= (30mm)2, which is more like a (biggish) thumb drive.
Thanks!
I haven't had my coffee yet. ;)
Shit happens :D
always after the coffee :)
the radiator wouldn't be though
That's the kind of hardware I am rooting for, since it'll encourage open-weights models and would be much more private.
In fact, I was thinking robots of the future could have such slots, where they can use different models depending on the task they're given. Like a hardware MoE.
> Since it'll encourage open-weights models
Is this accurate? I don't know enough about hardware, but perhaps someone could clarify: how hard would it be to reverse engineer this to "leak" the model weights? Is it even possible?
There are some labs that sell access to their models (mistral, cohere, etc) without having their models open. I could see a world where more companies can do this if this turns out to be a viable way. Even to end customers, if reverse engineering is deemed impossible. You could have a device that does most of the inference locally and only "call home" when stumped (think alexa with local processing for intent detection and cloud processing for the rest, but better).
It's likely possible to extract model weights from the chip's design, but you'd need tooling at the level of an Intel R&D lab, not something any hobbyist could afford.
I doubt anyone would have the skills, wallet, and tools to RE one of these and extract model weights to run them on other hardware. Maybe state actors like the Chinese government or similar could pull that off.
This is what I've been wanting! Just like those eGPUs you would plug into your Mac. You would have a big model or device capable of running a top-tier model under your desk. All local, completely private.
A cartridge slot for models is a fun idea. Instead of one chip running any model, you get one model or maybe a family of models per chip at (I assume) much better perf/watt. Curious whether the economics work out for consumer use or if this stays in the embedded/edge space.
Plug it into the skull bone. Neuralink + a slot for a model that you can buy in a grocery store instead of a prepaid Netflix card.
We'd better solve the energy usage and cooling first; otherwise that will be a very spicy body mod.
Would somewhat work except for the power usage.
I doubt it would scale linearly, but for home use 170 tokens/s at 2.5 W would be cool; 17 tokens/s at 0.25 W would be awesome.
On the other hand, this may be a step towards positronic brains (https://en.wikipedia.org/wiki/Positronic_brain)
Yeah maybe you can call it PCIe.
The next frontier is power efficiency.
So how does this Taalas chip work? Analog compute by putting the weights/multipliers on the cross-bars? Transistors in the sub-threshold region? Something else?
If we can print ASICs at low cost, this will change how we work with models.
Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphics cards again.
I hope lots of vendors will take note. Open weight models are abundant now. Even at a few thousand tokens/second, low buying cost and low operating cost, this is massive.
Super low latency inference might be helpful in applications like quant trading. However, in an era where a frontier model becomes outdated after 6 months, I wonder how useful it can be.
Also, quant trading probably cares more about embedding the content than generating output tokens.
I wonder how well this works with MoE architectures?
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.
At that point we are back to a chiplet approach...
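To make the mismatch concrete, here's a back-of-envelope sketch in Python. The dense figure uses Llama 3.1 8B; the 16-expert, top-2 MoE is a made-up example for illustration, not any specific model:

    def utilization(total_params, active_params_per_token):
        """Fraction of stored weights that participate in one forward pass."""
        return active_params_per_token / total_params

    dense = utilization(8e9, 8e9)            # Llama 3.1 8B: every weight is used per token
    moe   = utilization(8e9, 8e9 * 2 / 16)   # hypothetical 16-expert, top-2 MoE of the same size

    print(f"dense: {dense:.0%} of weights active per token")   # 100%
    print(f"moe:   {moe:.0%} of weights active per token")     # ~12%

When only ~12% of the stored weights participate in a given token, most of a weights-in-logic die sits idle as ROM rather than doing MACs, which is exactly the memory-lookup situation described above.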
For comparison, I wanted to write about how Google handles MoE architectures with its TPUv4 architecture.
They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.
The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores which specialize handling high-bandwidth, non-contiguous memory accesses.
Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.
*ed: SpareCubes to SparseCubes
If each of the expert models were etched in silicon, it would still give a massive speed boost, wouldn't it?
I feel printing the ASIC is the main blocker here.
From someone who runs AI inference pipelines for video production -- the cost per inference is what actually matters to me, not raw speed. Right now I'm paying ~$0.003 per image generation and ~7 cents per 10-second animation clip. A full video costs under $2 in compute.
If dedicated ASICs can drop that by 10x while keeping latency reasonable, that changes the economics of the whole content creation space. You could afford to generate way more variations and iterate more, which is where the real quality gains come from. The bottleneck isn't speed, it's cost per creative iteration.
Quick! We have to approve all the nuclear plants for AI now, before efficiency from optimization shows up
I can imagine this becoming a mainstream PCIe expansion card. Like back in the day we had a separate graphics card, audio card, etc. Now an AI card. So to upgrade the PC to the latest model, we could buy a new card, load up the drivers, and boom, an intelligence upgrade for the PC. This would be so cool.
This is exactly what's going to happen. Assuming no civilization-crippling or Great Filter events, anyway. At this point I fail to see how it could go any other way. The path has already been traveled, and governments (along with many other large organizations) will demand this functionality for themselves, which will eventually have a consumer market as well.
Another commenter mentioned how we keep cycling between local and server-based compute/storage as the dominant approach, and the cycle itself seems to be almost a law of nature. Nonetheless, regardless of where we're currently at in the cycle, there will always be both large and small players who want everything on-prem as much as possible.
I'm just wondering how this translates to computer manufacturers like Apple. Could we have these kinds of chips built directly into computers within three years? With insanely fast, local on-demand performance comparable to today's models?
Is it possible to supplement the model with a diff for updates stored in modular memory, or would it severely impact perf?
I imagine you could do something like a LoRA.
This design, at ~7 transistors per weight, is 99.9% burnt into the silicon forever.
And run an outdated model for 3 years while progress is exponential? What is the point of that?
When output is good enough, other considerations become more important. Most people on this planet cannot afford even an AI subscription, and the cost of tokens is prohibitive to many low-margin businesses. Privacy and personalization matter too, and data sovereignty is a hot topic. Besides, we already see how focus has shifted to orchestration, which can be done on a CPU and is cheap - software optimizations may compensate for hardware deficiencies, so it's not going to be frozen. I think the market for local hardware inference is bigger than for clouds, and it's going to repeat the Android vs iOS story.
Taalas is more expensive than NPUs, not less. You have a GPU/NPU at home; just use it.
This is the same justification that was used to ship the (now almost entirely defunct) NPUs on Apple and Android devices alike.
The A18 iPhone chip has 15b transistors for the GPU and CPU; the Taalas ASIC has 53b transistors dedicated to inference alone. If it's anything like NPUs, almost all vendors will bypass the baked-in silicon to use GPU acceleration past a certain point. It makes much more sense to ship a CUDA-style flexible GPGPU architecture.
Why are you thinking about phones specifically? Most heavy users are on laptops and workstations. On smartphones there might be a few more innovations necessary (low latency AI computing on the edge?)
Many laptops and workstations also fell for the NPU meme, which in retrospect was a mistake compared to reworking your GPU architecture. Those NPUs are all dark silicon now, just like these Taalas chips will be in 12-24 months.
Dedicated inference ASICs are a dead end. You can't reprogram them, you can't finetune them, and they won't keep any of their resale value. Outside cruise missiles it's hard to imagine where such a disposable technology would be desirable.
Bake in a Genius Bar employee, trained on your model's hardware, whose entire reason for existence is to fix your computer when it breaks. If it takes an extra 50 cents of die space but saves Apple a dollar of support costs over the lifetime of the device, it's worth it.
Is progress still exponential? It feels like it's flattening to me; it's hard to quantify, but if you could get Opus 4.2 to work at the speed of the Taalas demo and run locally, I feel like I'd get an awful lot done.
Yeah, the space moves so quickly that I would not want to couple the hardware with a model that might be outdated in a month. There are some interesting talking points, but a general-purpose programmable ASIC makes more sense to me.
It won't stay exponential forever.
> what is the point of that
Planned obsolescence? /s
Jokes aside, they can make the "LLM chip" removable. I know almost nothing is replaceable in MacBooks, but this could be an exception.
I wonder if you could use the same technique (RAM models as ROM) for something like Whisper Speech-to-text, where the models are much smaller (around a Gigabyte) for a super-efficient single-chip speech recognition solution with tons of context knowledge.
Right now I have to wait 10 minutes at a time for the 2+ hour long transcriptions I've uploaded to Voxstral to process. The speed up here could be immense and worthwhile to so many customers of these products.
The LoRA on-chip SRAM angle is interesting but also where this gets hard. The whole pitch is that weights are physical transistors, but LoRA works by adding a low-rank update to those weights at inference time. So you're either doing it purely in SRAM (limited by how much you can fit) or you have to tape out a new chip for each fine-tune. Neither is great. It might end up being fast but inflexible -- good for commodity tasks, not for anything that needs customization per customer.
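To make that concrete, here's a minimal numpy sketch of the split: the base matrix stays frozen (it could be baked into the die) and only the two small low-rank matrices would need to live in writable SRAM. The 4096 hidden size matches Llama-class models, but the rank of 16 and the scaling are arbitrary assumptions:

    import numpy as np

    d, r = 4096, 16                          # hidden size, LoRA rank (assumed values)
    rng = np.random.default_rng(0)

    W = rng.standard_normal((d, d))          # frozen base weights ("in the metal")
    A = rng.standard_normal((r, d)) * 0.01   # LoRA down-projection, lives in SRAM
    B = rng.standard_normal((d, r)) * 0.01   # LoRA up-projection, lives in SRAM

    x = rng.standard_normal(d)
    y = W @ x + B @ (A @ x)                  # base path + low-rank correction

    print(A.size + B.size, "adapter params vs", W.size, "baked-in params")
    # 131072 vs 16777216 -> well under 1% of the layer needs to be writable

The adapter is under 1% of the layer's parameters, which is why a small SRAM sidecar is plausible; the catch is that everything outside that low-rank update stays frozen in the metal.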
I would appreciate some clarification on the "store 4 bits of data with one transistor" part.
This doesn't sound remotely possible, but I am here to be convinced.
They declined to say: https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
Except they say it's fully digital, so not an analog multiplier
Fully digital, no analog, 4 bits fit into one transistor. Hmm. In one clock cycle?
Note that this doesn't answer the question in the title, it merely asks it.
Yeah, I had written the blog to wrap my head around the idea of 'how would someone even print weights on a chip?' and 'how do you even start to think in that direction?'.
I didn't explore the actual manufacturing process.
You should add an RSS feed so I can follow it!
I don't post blogs often, so haven't added RSS there, but will do. I mostly post to my linkblog[1], hence have RSS there.
[1] https://www.anuragk.com/linkblog
Frankly, the most critical question is whether they can really take shortcuts on DV etc., which are the main reasons nobody else tapes out new chips for every model. Note that their current architecture only allows some LoRA-adapter-based fine-tuning; even a model with an updated cutoff date would require new masks etc. Which is kind of insane, but props to them if they can make it work.
From some announcements 2 years ago, it seems like they missed their initial schedule by a year, if that's indicative of anything.
For their hardware to make sense a couple of things would need to be true: 1. A model is good enough for a given usecase that there is no need to update/change it for 3-5 years. Note they need to redo their HW-Pipeline if even the weights change. 2. This application is also highly latency-sensitive and benefits from power efficiency. 3. That application is large enough in scale to warrant doing all this instead of running on last-gen hardware.
Maybe some edge-computing and non-civilian use-cases might fit that, but given the lifespan of models, I wonder if most companies wouldn't consider something like this too high-risk.
But maybe some non-text applications, like TTS, audio/video gen, might actually be a good fit.
TTS, speech recognition, ocr/document parsing, Vision-language-action models, vehicle control, things like that do seem to be the ideal applications. Latency constraints limit the utility of larger models in many applications.
> It took them two months, to develop chip for Llama 3.1 8B. In the AI world where one week is a year, it's super slow. But in a world of custom chips, this is supposed to be insanely fast.
Llama 3.1 is like 2 years old at this point. Taking two months to convert a model that only updates every 2 years is very fast.
It only looks that way because Llama failed. Good models like Qwen are shipping every 6 months.
2 months of design work is fast, but how much time does fabrication, packaging, testing add? And that just gets you chips, whatever products incorporate them also need to be built and tested.
Does this mean computer boards will someday have one or more slots for an AI chip? Or peripheral devices containing AI models, which can be plugged into computer's high speed port?
It doesn't even need to be high speed. A minimal chip would have four pins: VCC, GND, TX, and RX. Even one-dollar microcontrollers can handle megabit-speed serial connections, which is fast enough for LLM communication.
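As a rough sanity check (assuming ~4 bytes of UTF-8 text per token and standard 8N1 framing, both assumptions on my part), even modest serial links keep up with readable output:

    def tokens_per_second(baud, bytes_per_token=4, bits_per_byte=10):
        # 8N1 framing: 8 data bits + start + stop = 10 bits per byte on the wire
        return baud / (bytes_per_token * bits_per_byte)

    print(tokens_per_second(1_000_000))   # ~25,000 tokens/s over a 1 Mbaud link
    print(tokens_per_second(115_200))     # ~2,880 tokens/s over plain 115200 baud

So the link wouldn't be the bottleneck; power delivery and cooling would be.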
Probably more like either a USB sidecar or a PCIe drop-in. I don't think they'll return to a world of dedicated coprocessors.
Unless someone finds a way to turn these things into a BIOS module.
The 6.5 transistors per coefficient ratio is fascinating. At 3-bit quantization you're already losing a lot of model quality, so the real question is whether the latency gains from running directly on silicon make up for the accuracy loss.
For inference-heavy edge deployments (think always-on voice assistants or real-time video processing), this could be huge even with degraded accuracy. You don't need GPT-4 quality for most embedded use cases. But for anything that needs to be updated or fine-tuned, you're stuck with a new chip fab cycle, which kind of defeats the purpose of using neural nets in the first place.
ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:
"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]
"Mask Programmable ROM Using Shared Connections" [3]
The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.
Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815mm2 die.
If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.
Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so they could be off. But the abstracts and public descriptions line up well. (A toy sketch of the routing idea follows the reference list below.)
[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
[2] https://patents.google.com/patent/WO2025147771A1/en
[3] https://patents.google.com/patent/WO2025217724A1/en
[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
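To illustrate the hypothesis above, here's a toy Python sketch of "multiplication by routing": compute the 16 possible products of each activation once in a shared bank, then let every weight position merely select one of them. This is only a model of the patent reading, not a claim about Taalas's actual circuit:

    import numpy as np

    def routed_matvec(weights_4bit, x):
        """weights_4bit: integer matrix with values 0..15; x: activation vector."""
        out = np.zeros(weights_4bit.shape[0])
        for j, xj in enumerate(x):
            products = xj * np.arange(16)        # shared multiplier bank: all 16 possible products
            out += products[weights_4bit[:, j]]  # per-weight selection ("routing"), no multiply
        return out

    rng = np.random.default_rng(1)
    W = rng.integers(0, 16, size=(8, 8))
    x = rng.standard_normal(8)
    assert np.allclose(routed_matvec(W, x), W @ x)   # matches an ordinary matvec

In hardware, the per-weight lookup would be the hardwired mesh plus an access transistor. The point is that no per-weight multiplier is needed, and the cost of the shared bank grows as 2^bits, which is why 4-bit weights matter so much more than 8-bit.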
LSI Logic and VLSI Systems used to do such things in 1980s -- they produced a quantity of "universal" base chips, and then relatively inexpensively and quickly customized them for different uses and customers, by adding a few interconnect layers on top. Like hardwired FPGAs. Such semi-custom ASICs were much less expensive than full custom designs, and one could order them in relatively small lots.
Taalas of course builds base chips that are already closely tailored for a particular type of models. They aim to generate the final chips with the model weights baked into ROMs in two months after the weights become available. They hope that the hardware will be profitable for at least some customers, even if the model is only good enough for a year. Assuming they do get superior speed and energy efficiency, this may be a good idea.
It could simply be bit serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
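For what it's worth, the bit-serial idea is easy to sketch (purely illustrative; nothing Taalas has confirmed):

    def bit_serial_mul(activation, weight_4bit):
        acc = 0
        for bit in range(4):                  # one cycle per weight bit
            if (weight_4bit >> bit) & 1:
                acc += activation << bit      # conditional shift-and-add
        return acc

    assert bit_serial_mul(7, 13) == 7 * 13    # 13 = 0b1101

Four conditional shift-and-adds per weight keep the per-weight logic tiny, at the cost of more clock cycles per MAC.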
So why only 30,000 tokens per second?
If the chip is designed as the article says, they should be able to do 1 token per clock cycle...
And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...
You still need to do a forward pass per token. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle but clearly they aren't doing that.
More aggressive pipelining will probably be the next step.
Reading from and to memory alone takes much more than a clock cycle.
Could we all get bigger FPGAs and load the model onto it using the same technique?
You could [1], but it is not very cheap -- the 32GB development board with the FPGA used in the article used to cost about $16K.
[1] https://arxiv.org/abs/2401.03868
FPGAs aren't very power-efficient. You could do it, but the numbers wouldn't add up for anything but prototyping.
I thought about this exact question yesterday. Curious to know why we couldn't, if it isn't feasible. Would allow one to upgrade to the next model without fabricating all new hardware.
FPGAs have really low density so that would be ridiculously inefficient, probably requiring ~100 FPGAs to load the model. You'd be better off with Groq.
Not sure what you're on, but I think what you said is incorrect. You can use a high-density HBM-enabled FPGA with (LP)DDR5 and a sufficient number of logic elements to implement the inference. The reason we don't see it in action is most likely that such FPGAs are insanely expensive and not as available off-the-shelf as GPUs are.
Yeah, FPGA+HBM works but it has no advantage over GPU+HBM. If you want to store weights in FPGA LUTs/SRAM for insane speed you're going to need a lot of FPGAs because each one has very little capacity.
How feasible would it be to integrate a neural video codec into the SoC/GPU silicon?
There would be model size constraints and what quality they can achieve under those constraints.
Would be interesting if it didn't make sense to develop traditional video codecs anymore.
The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.
Edit: reading the below it looks like I'm quite wrong here but I've left the comment...
The single transistor multiply is intriguing.
I'd assume they are layers of FMA operating in the log domain.
But everything tells me that would be too noisy and error prone to work.
On the other hand my mind is completely biased to the digital world.
If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.
Mulling it over, actually the noise probably doesn't matter. It'll average to 0.
It's essentially compute and memory baked together.
I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
The document referenced in the blog does not say anything about the single transistor multiply.
However, [1] provides the following description: "Taalas' density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."
[1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
It'll be different gates on the transistor for the different bits, and you power only one set depending on which bit of the result you wish to calculate.
Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...
That, or a resistor ladder with 4 bit branches connected to a single gate, possibly with a capacitor in between, representing the binary state as an analogue voltage, i.e. an analogue-binary computer. If it works for flash memory it could work for this application as well.
That's much more informative, I think my original comment is quite off the mark then.
I'd expect this is analog multiplication with voltage levels being ADC'd out for the bits they want. If you think about it, it makes the whole thing very analog.
Note: reading further down, my speculation is wrong.
Does this offer truly "deterministic" responses when temperature is set to zero?
(Of course excluding any cosmic rays / bit flips)?
I didn't see an editable temperature parameter on their chatjimmy demo site -- only a top-k.
Very nice read, thank you for sharing; it's very well written.
So if we assume this is the future, the useful life of many semiconductors will fall substantially. What part of the semiconductor supply chain would have pricing power in a world of producing many more different designs?
Perhaps mask manufacturers?
It might not be that bad. "Good enough" open-weight models are almost there; the focus may shift to agentic workflows and effective prompting. The lifecycle of a model chip will be comparable to smartphones, getting longer and longer, with orchestration software being responsible for faster innovation cycles.
"Good enough" open weights models were "almost there" since 2022.
I distrust the notion. The bar of "good enough" seems to be bolted to "like today's frontier models", and frontier model performance only ever goes up.
The generation of frontier models from H1 2025 is the good enough benchmark.
Flash forward one year and it'll be H1 2026.
I don't see why. Today's frontier models are already 2 generations ahead of good enough. For many users they did not offer substantial improvement; sometimes things got even worse. What is going to happen within 1 year that will make users desire something beyond an already working solution? LLMs are reaching maturity faster than smartphones, which are now good enough to stay on the same model for at least 5-6 years.
If you're running at 17k tokens/s, what is the point of multiple agents?
Different skills and context. Llama 3.1 8B has just 128k context length, so packing everything into it may not be a great idea. You may want one agent analyzing the requirements and designing the architecture, one writing tests, another writing the implementation, and a third doing code review. With LLMs it also matters not just what you have in context but also what is absent, so that the model will not overthink it.
EDIT: just in case, I define an agent as an inference unit with specific preloaded context; in this case, at this speed, they don't have to be async - they may run in sequence over multiple iterations.
Just me, or does this seem incredibly frightening to anyone else? Imagine printing a misaligned LLM this way and never being able to update the HW to run a different (aligned) model.
It frightens me no more than the possibility of building a flawed airplane or a computer that overheats (looking at you, NVIDIA 12-pin) and "never being able to update the HW". Product recalls and redesigns exist for a reason.
If this happens, womp womp, recall the misaligned LLMs and learn from the mistake. It's part of running a hardware business as opposed to a software one.
I can't imagine they'd go for a full production run before at least testing a couple chips and finding issues.
The S in IoT is for security.
Is Taalas' approach scalable to larger models?
The top comment on Friday's discussion does some math on die size. https://news.ycombinator.com/item?id=47086634
Since model size determines die size, and die size has absolute limits as well as a correlation with yield, eventually it hits physical and economic limits. There was also some discussion about ganging chips.
From what I read here, the required chip size would scale linearly with the number of model weights. That alone puts a ceiling on the size of model.
Also the defect rate grows as the chip grows. It seems like there might be room for innovation in fault tolerance here, compared to a CPU where a randomly flipped bit can be catastrophic.
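A back-of-envelope extrapolation from the 815 mm^2 / 8B-parameter figures quoted elsewhere in the thread (the reticle limit is an approximation, and this ignores yield, redundancy, and any future density improvements):

    DIE_MM2, PARAMS = 815, 8e9          # figures quoted for the Llama 3.1 8B chip
    RETICLE_MM2 = 858                   # rough single-die reticle limit (assumption)

    area_per_param_um2 = DIE_MM2 * 1e6 / PARAMS
    for params in (8e9, 70e9, 400e9):
        area = params * area_per_param_um2 / 1e6
        print(f"{params/1e9:>4.0f}B params -> ~{area:,.0f} mm^2 "
              f"(~{area/RETICLE_MM2:.1f} reticle-sized dies)")

So anything much beyond ~8B at this density means multiple reticle-sized dies, which is where the ganging/chiplet discussion comes in.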
Imagine a Framework* laptop with these kinds of chips that could be swapped out as models get better over time
*Framework sells laptops and parts such that in theory users can own a ~~ship~~ laptop of Theseus over time without having to buy a whole new laptop when something breaks or needs upgrade.
Hmm, I guess you'll get this pile of used boards, which is not great waste-wise; but I guess they will get reused for a few generations. A problem is that it doesn't seem to be just the chips that would be thrown out but the whole board, which gets silly.
If model makers adopt an LTS model with an extended EOL for certain model versions, these chips would make that very affordable.
Thank god, I hope this reduces prices of RAM and GPUs
Few customers value tokens anywhere near what it costs the big API vendors. When the bubble pops the only survivors will be whoever can offer tokens at as close to zero cost as possible. Also whoever is selling hardware for local AI.
Those of us who use AI to get real work done in real products very much appreciate the value of each token, given how much operational overhead it offsets. A bubble pop, if one does indeed happen, would at best be as disruptive as the dot-com bust.
Who's going to pay for custom chips when they shit out new models every two weeks and their deluded CEOs keep promising AGI in two release cycles?
New GPUs come out all the time. New phones come out (if you count all the manufacturers) all the time. We do not need to always buy the new one.
Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.
Hm, yeah, I guess if they stick to shitty models it works out. I was talking about the models people use to actually do things, instead of shitposting from openclaw and getting reminders about their next dentist appointment.
The trick with small models is what you ask them to do. I am working on a data extraction app (from emails and files) that works entirely locally. I applied for the Taalas API because it would be an awesome fit.
dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18
https://github.com/brainless/dwata
Considering that enamel regrowth is still experimental (only curodont exists as a commercial product), those dentist appointments are probably the most important routine healthcare appointments in your life. Pick something that is actually useless.
It all depends on how cheap they can get. And another interesting thought: what if you could stack them? For example you have a base model module, then new ones come out that can work together with the old ones and expanding their capabilities.
To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090.
Taalas promises 10x higher throughput while being 10x cheaper and using 10x less electricity.
Looks like a good value proposition.
> To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090
In full precision, yes. But this Taalas chip uses a heavily quantized version (the article calls it a "3/6-bit quant", probably similar to Q4_K_M). You don't even need a GPU to run that with reasonable performance; a CPU is fine.
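Rough math behind that (the ~10% overhead for scales and other non-weight tensors is a loose assumption):

    PARAMS = 8e9

    def footprint_gb(bits_per_weight, overhead=1.1):   # ~10% extra for scales etc. (assumption)
        return PARAMS * bits_per_weight / 8 / 1e9 * overhead

    print(f"fp16 : {footprint_gb(16):.1f} GB")    # ~17.6 GB -> needs a 24 GB-class GPU
    print(f"q4   : {footprint_gb(4.5):.1f} GB")   # ~5 GB   -> runs fine from CPU RAM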
What do you do with 8B models? They can't even reliably create a .txt file or do any kind of tool calling.
Re-read Brave New World. Deltas and Epsilons have their place, even if Alphas and Betas got smarter overnight.
Roof! Roof!
You obviously don't believe that AGI is coming in two release cycles, and you also don't seem to have much faith in the new models containing massive improvements over the last ones. So the answer to who is going to pay for these custom chips seems to be you.
Why would I buy chips to run handicapped models when the 10+ LLM players all offer free-tier access to their 1T+ parameter models?
Do you think the free gravy train will run forever?
Not all applications are chatbots. Many potential uses for LLMs/VLAMs are latency constrained.
I'm guessing this development will make the fabrication of custom chips cheaper.
Exciting times.
Probably the datacenters that serve those models?
Almost all LLM companies have some sort of free tier that does nothing but lose them money.
>HOW NVIDIA GPUs process stuff? (Inefficiency 101)
Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly when it comes to making computation more efficient (low power/high throughput).
It then proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there...
Hey, can you please point out the inaccuracies in the article?
I wrote this post to get a higher-level understanding of traditional vs Taalas's inference, so it does abstract away lots of things.
Arguably DRAM-based GPUs/TPUs are quite inefficient for inference compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but they still lose to different architectures that are better suited for inference.
The way modern Nvidia GPUs perform inference is that they have a processor (tensor memory accelerator) that directly performs tensor memory operations which directly concedes that GPGPU as a paradigm is too inefficient for matrix multiplication.
This read itself is slop, lol; it literally dances around the term "printing" as if it's some inkjet printer.
Isn't the highly connected nature of the model layers problematic to build into a physical layer?