I did just that: got the 32GB RAM one so I could run Qwen.
It might still be early days. I'm trying to use the model to sort my local notes, but honestly it seems only a little faster and still unusable, even though I downloaded the lighter Qwen model as recommended.
Again, it's early days and maybe I'm doing something wrong, but I did manage to get it to parse one note after about 15 minutes.
gpt-oss-20b eats too much RAM to use for anything other than an overnight task; maybe 3 tok/s.
I've been playing around with the 8B versions of Qwen and DeepSeek. They seem usable so far. YMMV; I'm just messing around in chat at the moment and haven't really had them do any tasks for me.
Probably something like a small logistic regression or a tiny GPT-2 variant (117M parameters) on a small dataset—anything beyond that will choke on RAM, VRAM, or time. Five minutes on a laptop = toy models, not miracles.
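For anyone curious what "toy model" means in practice, here is a minimal sketch of a wall-clock-budgeted training loop for a tiny character-level transformer in PyTorch. It is not the article's setup: the corpus file name, hyperparameters, and layer choices below are placeholders picked for illustration.

```python
# Minimal sketch: train whatever tiny char-level LM you can inside a fixed time budget.
# corpus.txt and all hyperparameters are placeholders, not the article's configuration.
import time
import torch
import torch.nn as nn

text = open("corpus.txt").read()                 # any plain-text file you have around
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

vocab, d_model, ctx = len(chars), 128, 128

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(ctx, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, idx):
        t = idx.shape[1]
        x = self.emb(idx) + self.pos(torch.arange(t, device=idx.device))
        # causal mask: True marks positions a token is NOT allowed to attend to
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=idx.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = TinyLM().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch(bs=32):
    ix = torch.randint(len(data) - ctx - 1, (bs,))
    x = torch.stack([data[i:i + ctx] for i in ix])
    y = torch.stack([data[i + 1:i + ctx + 1] for i in ix])
    return x.to(device), y.to(device)

deadline = time.time() + 5 * 60                  # the five-minute budget
while time.time() < deadline:
    x, y = get_batch()
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", loss.item())
```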
I'd be interested in what implementation of D3PM was used (and failed). Diffusion models are more data-efficient than their AR LLM counterparts but less compute-efficient at training time, so it'd be interesting to know whether, with more time to converge, the diffusion approach does succeed. I guess I'll try :)
But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.
I'd be happy with an AI that can just "train" on me: Just see what I do, learn from the repetitive tasks I do, and then do them quicker. An agent that is basically me x 10.
Start blank with no corporate-controlled/crippled state and just become me.
In fact, that might be the only way to let computers appear to grow faster into the future, even if their internal hardware only gets minor incremental improvements: Have your shit done before you sit down to do it.
You could train an unbeatable tic-tac-toe ai on your laptop in five minutes. It doesn’t get any stronger than that.
—
I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…
But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:
These days the meaning of the phrase ai has changed from the classical definition (all algorithms welcome), and now ai usually means LLMs and their derivatives.
I’m actually working on just this. What’s the smallest training data set required to learn tic-tac-toe? A 5yo doesn’t need much training to learn a new game, but a transformer needs millions of samples.
It’s a glib analogy, but the goal remains the same. Today’s training sets are immense. Is there an architecture that can learn something with tiny training sets?
Maybe ZephApp, when it's actually released.
But would be interesting to record day-to-day conversations (face-to-face using voice recognition) to train a virtual doppelganger of myself and use it to find uncommon commonalities between myself and others.
What would someone do with a year's worth of recorded conversations? Would the other parties be identified? How would it be useful, if at all? How about analyzing the sounds/waveform rather than words? (eg BioAcousticHealth / vocal biomarkers)
Perhaps typing into a text-field is the problem right now? Maybe have a HUD in a pair of glasses. Better than getting a brain chip! Most recent or most repeated conversations most important. Could lead to a reduction in isolation within societies, in favor for "AI training parties." Hidden questions in oneself answered by a robot guru as bedtime story-telling but related to the real-world and real-events.
I'm certainly not challenging anything you're writing, because I only have a very distant understanding of deep learning, but I do find the question interesting.
Isn't there a bit of a defining line between something like tic-tac-toe that has a finite (and pretty limited for a computer) set of possible combinations where it seems like you shouldn't need a training set that is larger than said set of possible combinations, and something more open-ended where the impact of the size of your training set mainly impacts accuracy?
I am using AI to write full projects with complete code generation, and I haven't found any model that comes close to Gemini 2.5 Pro in code reasoning and generation.
While other models like Qwen3 and GLM promise a lot, in real code writing they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with an empty response, especially around this time of day.
Then they are not the best. Most users aren't prompt engineers and grew up expecting to enter search terms into Google and get a result. If it's the case that OpenAI or Anthropic are best able to interpret user intent, there's a good argument to be made that they are the best.
If the model trusts the user, and the user is wrong, the model will "weigh" the user's input much too highly and end up with flawed code.
If the model is more independent, it will find the right solution. If you just want a model that says yes to everything and follows you even when you're not right, then you'll never end up with a good solution except by luck.
Optimized small model training is not only important for availability but also for the scientific study of LLMs. It’s like the use of simple organisms like yeast for biological studies - we also need to study the simplest possible transformers that exhibit behaviors of interest from the larger models if we hope to ever understand LLMs and have more control over their behavior.
Totally agree. One of the most interesting podcasts I've listened to in a while was, a couple of years ago, on the TinyStories paper and dataset (the author used that dataset), which focuses on stories that contain only simple words and concepts (like bedtime stories for a 3-year-old), but which can be used to train smaller models to produce coherent English, with grammar, diversity, and reasoning.
The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.
As an aside: I don't know what the dataset is in the biological analogy, maybe the agar plate. A super simple and controlled environment in which to study simple organisms.
For ref:
- Podcast ep: https://www.cognitiverevolution.ai/the-tiny-model-revolution...
- TinyStories paper: https://arxiv.org/abs/2305.07759
I like the agar plate analogy. Of course, the yeast is the star of the show, but so much work goes into prepping the plate.
As someone in biotech, 90% of the complaints I hear over lunch are not about bad results, but about bad mistakes during the experiment. E.g., someone didn't cover their mouth while pipetting and now the plates are unusable.
(there are also lots of private company datasets like e.g. user purchase history that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
Unfortunately, as things stand, it’s well-known that behaviors and optimizations in small scale models fail to replicate in larger models.
Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
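The mechanics of that approach are simple enough to sketch: sweep the hyperparameter at several small model sizes, fit a power law, and extrapolate to the size you can only afford to train once. The numbers below are fabricated for illustration and are not taken from the linked paper.

```python
# Sketch of fitting a scaling law to small-model sweep results and extrapolating.
# The sweep results here are made-up placeholder values.
import numpy as np

n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # small-model sizes we swept
best_lr  = np.array([3.0e-3, 2.0e-3, 1.2e-3, 8.0e-4, 5.0e-4])  # best LR found at each size

# fit log(lr) = log(a) + b * log(n), i.e. lr = a * n^b, in log space for robustness
b, log_a = np.polyfit(np.log(n_params), np.log(best_lr), 1)

target = 1e9                                           # hypothetical 1B-parameter model
predicted_lr = np.exp(log_a) * target ** b
print(f"predicted optimal LR at 1B params: {predicted_lr:.2e} (exponent b = {b:.3f})")
```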
The problem is that the eval processes don't really work here if you believe in "Emergent Abilities": https://arxiv.org/abs/2206.07682
Which we probably should not, at least not the "sudden" emergence that those researchers claimed to see.
https://arxiv.org/abs/2304.15004
Good article about why here; this helped me understand a lot:
https://www.wired.com/story/how-quickly-do-large-language-mo...
Which in itself is very interesting and requires study.
It mostly has to do with sparsity in high-dimensional space. When you scale things to the extreme, everything is very far away from everything else, the space is sparse, random vectors have a very high chance of being nearly orthogonal, etc. All of this makes optimization incredibly slow and difficult. Just another facet of the so-called "curse of dimensionality".
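The near-orthogonality claim is easy to check numerically: the cosine similarity between two random Gaussian vectors concentrates around zero roughly like 1/sqrt(d).

```python
# Quick numerical check that random high-dimensional vectors are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 100, 10_000, 1_000_000):
    a, b = rng.standard_normal((2, d))
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"d={d:>9}: cos(angle) = {cos:+.4f}")   # shrinks toward 0 roughly like 1/sqrt(d)
```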
That's not broadly true. E.g., the GPT-4 tech report pointed out that nearly all of their experiments were done on models 1000x smaller than the final model.
Well-known but not well-understood
But why? If we don't know why then how do we figure it out?
What the author is doing here is pre-training. This is something that usually only model makers like Google and Meta need to do. Most businesses are much better off doing fine-tuning or, to a lesser extent, continued pre-training. The author is doing this for academic reasons.
I'm interested in one that can run fast on a laptop, but training can take a few days (maybe even longer) on the same laptop.
I've been annoyed for a while that people don't use a common parameter-count/compute budget for benchmarking in papers.
That said, it does make it easier to claim progress...
https://github.com/KellerJordan/modded-nanogpt is pretty great in that respect
It’s a fun analogy because the data “environment” of the model being trained matters a great deal
Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.
It's something I keep thinking about when I see all these deep dives by Anthropic on the "genetics" of LLMs. I see the emergent properties of LLMs as inseparable from their data environment. If the organization/prevalence of text online were different, I think Anthropic would see different "genetics". As the amount of LLM-generated text grows, I think it will become more clear that the "fundamental unit" is their relationship.
Enough with big data! Who's working on small data? https://www.youtube.com/watch?v=eDr6_cMtfdA&pp=ygUKc21hbGwgZ...
Thanks - that's one of the most interesting comments I've seen about LLMs.
Makes me want to try training a model to sing "Daisy, Daisy..."
Instead of time it should be energy. What is the best model you can train with a given budget in joules? Then the MBP and the H100 are on a more even footing.
it's not about efficiency - it's about availability
An H100 is not an everyday product. A laptop is.
H100s are almost-instantly available to anyone with a credit card and access to the internet. Without even having to lift their butt from the seat. And you get plenty more than five minutes of compute for the price of an M4.
For the orgs where I've worked the important thing isn't availability of compute it's security. Using what we have on our local network is much easier from a governance and approval standpoint than whatever is available on the internet.
Many orgs have no problems using cloud envs for most things. The usual suspects offer just as secure compute envs as everything else.
Anyway, I was assuming personal use, like the messing-around experimenting that the article is about. (Or who knows, maybe it was part of the author’s job.)
While I love cloud computing, you're comparing the cost of renting a GPU for a fixed amount of time to the purchase of an asset which can be used for years. Not a useful comparison IMHO.
Disagree, equity of access matters a lot. Not everyone benefits from exposure to the entire hardware lifecycle, the same way that buying housing is not the best financial decision for everyone regardless of affordability. I might have unlimited budget but if I only need access to state of the art hardware intermittently or under irregular circumstances the cost of renting may be efficient for my needs. Also consider the costs of supporting hardware that is fully owned: if you own the hardware but underutilize it, that is inefficiency and the owner bears that cost. The unusual way that silicon depreciates means that the value of your "asset" is not static and rapidly depreciates as silicon manufacturing improves.
Your argument is not related to my statement. You're arguing something else.
And yet just about any intro-to-programming tutorial gets something running on your local machine, and local machine development continues to be the default for most people, even though devving on a cloud machine is eminently reasonable.
"Pull out credit card, sign up for some thing and pay a bit of money" is a non-trivial bit of friction! Extremely non-trivial!
Especially in a corporate context - you have to get the expense approved. It's not clear if you can put company data onto the machine. Whereas generally running local things on corporate laptops is far less controversial.
"Download this tool and run it." is still an extremely powerful pitch. Pretty much the only thing that beats it is "go to this website which you can use without any signup or payment".
Sure, if you already have said local machine. Which I guess in HN’s context many/most do.
I already have an M4 so the cost of running it is tiny.
Yeah, all you need then is a large server rack to run those H100s. But realistically, the majority of people have a PC with a consumer-grade GPU, or more likely a laptop with... a laptop-grade GPU.
Cloud H100s don't count because you need a lawyer to review the ToS and other agreements.
no org will let you send their data to a random online h100...
Many orgs happily use Google’s everything. And Google offers secure compute envs just like it offers secure cloud everything.
Anyway, I thought the context was doing stuff for personal use/fun, not work.
Frankly I think a lot of full-time-employed technical people are largely experimenting for fun in the context of things that might eventually be useful to their employer. AI is cool and fascinating stuff and when I have a few idle minutes at the end of my workweek I love catching up and experimenting with the latest and greatest, but with an eye towards company problems and on company time, and sometimes using company datasets. That means company vendor approval and financing of my efforts.
In my personal life, when it's time for fun, I close the laptop and go do some gardening.
Also, my laptop running Linux and its outputs are probably mine and private. If I use cloud GPUs, I need to be a lawyer to be sure what they can or can't do with my data or models.
There's also no overages or hidden charges with a laptop. Past simply breaking it. You know the replacement cost ahead of time, though.
Still, I don't think the m4 is going to be far off from the h100 in terms of energy efficiency.
edit: fixed typo
What efficiency did you have in mind? Bandwidth-wise M4 is ~10x to ~30x lower.
ah, i mistyped. I meant energy efficiency, not memory efficiency.
At this point, given how many H100s there are in existence, it’s basically an everyday product.
I envy you if $25k is an everyday product cost.
Maybe not to buy one, but to rent one. Like how barista-made coffee is an everyday product even though most people can't afford a fancy professional coffee machine.
Reasonably high quality coffee machines are very widespread. Or you can do pour-over. I don’t think the cost of a machine is a limiting factor for many people, it is just convenience.
Maybe an analogy could be made to espresso, nice espresso machines get costlier. But, you can still get quite good results out of a manual machine like a Flair.
I think this is why the suggestion to rent a machine is not too helpful. In this analogy we're on BaristaNews; we all know about the industrial machines, and lots of folks use them at work. But the topic of what sort of things you can do on your manual machine at home has come up.
> Reasonably high quality coffee machines are very widespread. Or you can do pour-over. I don’t think the cost of a machine is a limiting factor for many people
No, reasonably-priced coffee machines is an enabling factor for many people.
If coffee machines weren't reasonably priced, they would not be "very widespread".
I’m not sure I follow your deeper meaning here, sorry.
For what it's worth, most of the world can't afford an M4 Macbook either.
And renting an H100 for an hour is a lot easier than renting an M4 MacBook for an hour.
The Mac is more competitive on power consumption though, since it's never pulling as much as an Nvidia GPU, is my understanding.
On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model outcome you can train in under an hour?
> you can rent an H100 for an hour for under $10
Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.
I couldn't remember the exact amount offhand, but figured it was worth noting that under $10 is still impressive for one high-end GPU for an entire hour.
It depends. If you're bottlenecked by memory speed, the Mac typically comes out on top.
In terms of compute efficiency though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple was putting up a real fight.
Yeah, this is correct. My 3080 will render quicker than my M4 but my M4 will outcompete on being able to load larger models.
They're all good. Being somewhat arbitrary isn't a bad thing.
> Instead of time it should be energy (...) Then the MBP and H100 are on a more even footing.
What exactly is your point? That instead of expressing workloads in terms of what a laptop could do, you prefer to express them in terms of what a MacBook Pro could do?
The point is that "best model you can train in 5 minutes" is hardware dependent, the answer will be different depending on the hardware available. So it's necessarily a single-player game.
"Best model you can train with X joules" is a fairer contest that multiple people could take part in even if they have different hardware available. It's not completely fair, but it's fair enough to be interesting.
Training models with an energy limit is an interesting constraint that might lead to advances. Currently LLMs implement online learning by having increasingly large contexts that we then jam "memories" into. So there is a strict demarcation between information learned during pre-training and during use. New more efficient approaches to training could perhaps inform new approaches to memory that are less heterogenous.
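A rough sketch of how one might enforce a joule budget on the NVIDIA side, assuming the pynvml bindings are available; an Apple-silicon laptop would need a different power source (e.g. the OS's own power reporting), so this is only one half of the comparison. train_one_step is a placeholder for a real training step.

```python
# Sketch: turn "train for X minutes" into "train for X joules" on an NVIDIA GPU,
# by integrating the reported power draw over time. pynvml is assumed available.
import time
import threading
import pynvml

class GpuEnergyMeter:
    """Estimates energy spent (joules) by polling GPU power draw in a background thread."""
    def __init__(self, index=0, interval=0.1):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        self.interval = interval
        self.joules = 0.0
        self._stop = threading.Event()

    def _poll(self):
        last = time.time()
        while not self._stop.is_set():
            time.sleep(self.interval)
            now = time.time()
            watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
            self.joules += watts * (now - last)
            last = now

    def __enter__(self):
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

def train_one_step():
    time.sleep(0.05)   # placeholder: replace with a real forward/backward/optimizer step

BUDGET_J = 50_000      # roughly 5-6 minutes at a sustained 150 W
with GpuEnergyMeter() as meter:
    while meter.joules < BUDGET_J:
        train_one_step()
print(f"stopped after ~{meter.joules:.0f} J")
```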
tl;dr: more dimensionally correct
Bro, why not both?
We can / should benchmark and optimize this to death on all axes
Let the AI efficiency olympics begin!
On a laptop, on a desktop, on a phone?
Train for 5 minutes, an hour, a day, a week?
On a boat? With a goat?
> With a goat?
I think you meant Llama.
The rhymes are admittedly more limited, unless you have a Boston accent.
I do not like green eggs and ham. I do not like them, Sam-I-Am.
Dr Seuss ftw
Vernor Vinge has a story line where humans build their own portable chess computers and utilize them as assistants in human chess matches.
I still think this would be kinda cool. I could see a tournament providing the power source in addition to the chess clock. Then gamesmanship where you play moves you hope are expensive for the opponent but not for your own AI.
On a maxxxed out Mac Studio M3 Ultra 512GB.
That boat will float your goat!
goats have too many parameters, they are like GPT-4
GO4-T
I’d pay for GoatLM
Honestly, AI is a trick to make us buy new expensive computers. I'm writing this from one that's over 10 years old, and the computers offered in a leaflet from the nearby electronics store aren't much better.
Anyone who remembers the 90s and 2000s, where your computer hardware was out of date within months, might disagree. If you want to do bleeding edge things like running 70b+ LLMs locally or doing training, you need bleeding edge hardware. No different than if you want to play the newest AAA games. There are plenty of games you can play with old hardware, and plenty of small LLMs. When you can use ChatGPT or a bunch of other services, it isn’t a trick that some people want to host their own or do training, but you need a system that can do that.
Oh no! I thought that was Windows 11
I mean, gaming is the big pusher of new hardware these days, and web is basically the reason you can use a 90s computer in the modern day. I happily survived on roughly 10 year old components all the way through university because I wasn't playing AAA games
My parents bought a new laptop for their general household use and to watch YouTube via HDMI on their TV. It was so annoying, weird, and not even fast that they returned it to Costco for the $800 within 90 days.
I set up a 10-year-old computer for them instead, running Linux Mint MATE, and it's perfect.
> Paris, France is a city in North Carolina. It is the capital of North Carolina, which is officially major people in Bhugh and Pennhy. The American Council Mastlandan, is the city of Retrea. There are different islands, and the city of Hawkeler: Law is the most famous city in The Confederate. The country is Guate.
I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?
This is not true. I watched the clip. She referred to AI as AI. When she said A1 she was very clearly referring to America First.
Snopes confirmed that McMahon began by referring correctly to “AI development,” but in the same response, twice said “A1 teaching,” clearly meaning artificial intelligence. Not steak sauce. Multiple outlets including Gizmodo, Newser, Yahoo News, Mediaite, and Cybernews all reported the slip-up as genuine: she erroneously said “A1” when she meant “AI”.
Did you watch the clip yourself? I assume not, so here you go:
https://www.youtube.com/live/lxrg28zBv94?t=7562s
She was the chair of the board of the America First Policy Institute. She's not talking about AI, she's talking about pumping ultra-nationalist, Nazi-adjacent propaganda into Red state education.
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0], at minimum Muon, better init and carefully tuning learning rate.
[0]: https://github.com/KellerJordan/modded-nanogpt
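For readers who haven't seen it, the core of Muon is small enough to sketch: keep a momentum buffer per 2-D weight matrix and orthogonalize it with a Newton-Schulz iteration before applying the update. The coefficients below follow the modded-nanogpt repo, but this is a simplified illustration that drops the Nesterov variant, per-layer scaling, and scheduling, so consult the repo for the real implementation.

```python
# Stripped-down sketch of Muon's core idea: orthogonalize each 2-D weight's momentum
# with a Newton-Schulz iteration, then step in that direction. Simplified for clarity.
import torch

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # quintic Newton-Schulz iteration; coefficients as used in modded-nanogpt
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(params, momenta, lr=0.02, beta=0.95):
    """One simplified Muon update over a list of 2-D weight matrices with populated .grad."""
    for p, m in zip(params, momenta):
        m.mul_(beta).add_(p.grad)          # momentum buffer
        update = newton_schulz_orth(m)     # approximately orthogonal update direction
        p.add_(update, alpha=-lr)
```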
Feels like there should be value in building smaller, more specialized models - maybe even doing so on-demand. I don’t always want a model that knows Polish and astrophysics and Shakespeare, I want one that runs really fast and is laser-focused on the domain that I’m working on.
I want to be able to say to a large general purpose LLM: “write a script that trains a model that is optimized for <useful task>” and then run that model.
Edit: well gosh darn. Within the edit window for this comment, Google goes and launches Gemma 3 270M.
one of the trends of machine learning though is that generalists outperform specialists on those specialists' tasks!
But I’d happily accept some of that bitter lesson if the “worse specialist” ran way faster (or at all, given memory limits).
Am I missing where the GitHub link is for this, or did the author not release sources? It'd be fun to reproduce this on a different machine, and play around with other architectures and optimizers that weren't mentioned in the article...
This is evocative of “cramming”, a paper from a few years ago, where the author tried to find the best model they could train for a day on a modern laptop: https://arxiv.org/abs/2212.14034
How far can you go by improving the curriculum? Start simple. Find a shorter and shorter sequence of examples that gives you the best result. What is the shortest sequence to get to some perplexity? Why?
AI is sorely lacking a demoscene
At which point does a simple Markov chain do the same or better?
I can't find references to HMM-based large language models. Small HMM language models generate gibberish very similar to this.
An HMM consists of a state space, a state transition matrix, and an output probability matrix. A token space of 50k and a state space of something like 60k would have seemed impossible 10-20 years ago. It has only recently become viable.
Training using Baum-Welch on a big enough text data set would be interesting. It should be much faster than back-propagation with a transformer-model.
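For a sense of the moving parts, here is a toy HMM with the scaled forward algorithm that Baum-Welch builds on, in plain numpy. The state and vocabulary sizes are deliberately tiny; a 60k-state, 50k-token model would need sparse or GPU-backed versions of these same recursions.

```python
# Toy HMM pieces in numpy: the scaled forward algorithm underlying Baum-Welch.
# Sizes are tiny placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
S, V = 8, 20                       # hidden states, token vocabulary
A  = rng.dirichlet(np.ones(S), S)  # state transition matrix, rows sum to 1
B  = rng.dirichlet(np.ones(V), S)  # emission probabilities, rows sum to 1
pi = np.full(S, 1.0 / S)           # initial state distribution

def log_likelihood(tokens):
    """Scaled forward algorithm: log P(tokens | A, B, pi)."""
    alpha = pi * B[:, tokens[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in tokens[1:]:
        alpha = (alpha @ A) * B[:, t]   # propagate state beliefs, then emit
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

print(log_likelihood(rng.integers(0, V, size=30)))
```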
The output text turns into word salad every few words. You can't scale n-gram counting enough to make it work.
You might find https://arxiv.org/abs/2401.17377v3 interesting..
Only if you have access to corporate-level hardware:
"It took us 48 hours to build the suffix array for RedPajama on a single node with 128 CPUs and 1TiB RAM"
It's okay-ish. Considering that 64GB to 128GB of RAM is available to (nerd) high-end consumers, you're only off by a factor of about 5 (if we can squeeze out a little more performance).
That is pretty astonishing, in my opinion.
Not exactly every few words in my experience; I would say every 100 words or so, if you make your Markov chain sophisticated enough (n-gram = 3 at minimum, a good tokenizer tailored to the training data, a large training set (500 KB or more), intelligent fallback instead of random choice, etc.).
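For comparison, the naive core of such a word-level trigram chain fits in a few lines; the refinements described above (a proper tokenizer, smarter fallback) are left out of this sketch, and corpus.txt is a placeholder for whatever training text you have.

```python
# Bare-bones word-level trigram Markov chain, for comparison with a 5-minute transformer.
import random
from collections import defaultdict

def train(words, n=3):
    model = defaultdict(list)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n - 1])      # previous n-1 words
        model[context].append(words[i + n - 1])  # possible next word
    return model

def generate(model, seed, length=50, n=3):
    out = list(seed)
    for _ in range(length):
        nxt = model.get(tuple(out[-(n - 1):]))
        if not nxt:               # naive fallback: restart from the seed context
            out.extend(seed)
            continue
        out.append(random.choice(nxt))
    return " ".join(out)

words = open("corpus.txt").read().split()        # placeholder corpus
model = train(words)
print(generate(model, seed=words[:2]))
```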
It is the other way around.
Neural-type models have long passed the point where markov chains made any sense by many orders of magnitude.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
All other neural architectures, then, are simply sparser arrangements that bring compute demands down, where the sparseness is fit to the type of problem.
Sparseness can mean deeper but narrower information flows (thus "deep" learning), or fewer weights relative to weight applications (i.e., shared weights, like convolutions).
AI is a broad term, the zero-to-hero series by Karpathy trains one in a Jupyter notebook. You can make some pretty powerful networks to de-duplicate database rows right in your laptop too. Data de-duplication and general MDM is pretty useful in large businesses.
Perhaps grimlock level:
https://m.youtube.com/shorts/4qN17uCN2Pg
"Hadn't thought of that …"
"You're absolutely right!"
Readers: I'm looking for toy, quick AI exercises that can be trained on a laptop, and help the doer increase their confidence in AI concepts (learning by doing, and all that).
The OP fits the bill.
If you can suggest other such exercises, please share in reply to this post.
Thank you.
I like this scenario for a future James Bond movie. Bond has to have an AI in chat pretend to be him to stall the bad guys while he is sneaking around the back, but the state of the art Bond persona bot that Q gave him in its own hardware enclosure has been smashed up in the previous fight scene.
Bond has only minutes to train a strong enough AI model to pretend to be him and fool his targets long enough for him to gain entry to their impregnable fortress. Can he do it?!?
But...they need to show him "training" it by smashing away at the keys frantically. A touch of sweat rolling down his face while a progress meter inches across the screen to suspenseful music.
no that is a cliche from lesser brands, Bond will get drunk while it trains and shoot somebody with amazing accuracy.
We’re gonna need a montage.
"Paris, France is a city in North Carolina. It is the capital of North Carolina."
If only we had a technology that didn't hallucinate and reported "I don't know". Then small models would be far more useful. Part of the need for insanely huge LLM models is to get coverage so broad that they don't have to make up stuff.
It would be nice to be able to train a customer service bot on a laptop in a reasonable length of time. But it will screw up badly outside its area of competence, which will happen frequently.
I don’t think we should use an AI trained in 5 minutes on a laptop to infer what small models are capable of…
Sure they still have massive problems with hallucination, but this article doesn’t give us any more insight into that I don’t think!
Why not? And I'm not being flippant, but like....isn't that the whole point of small models?
For one thing, the model is trained on a language modelling task, not a question-answering task?
As I understand it, the most effective small models are synthesized from larger models.
I looked up the most expensive laptop with an RTX 5090: https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
$5599.00 https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
Although you can get them with fewer specs and the same GPU for $3,899.99
https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
The same SKU on a GPU can perform differently depending on how the manufacturer powers and cools it [0], and nVidia's naming shenanigans don't help either [1].
[0] https://www.digitaltrends.com/computing/laptop-gpu-power-lim... [1] https://coconote.app/notes/4c75b7a0-eb41-435d-85ee-55ae2dd8d...
And even worse, it's surprisingly hard to find out from spec sheets what power budget is assigned to the GPU, the CPU, or the two combined.
But supposing you have a real specific need to train, is the training speed still relevant? Or do the resources spent on gathering and validating the data set dwarf the actual CPU/GPU usage?
If training is trivially fast that allows you to iterate on architecture choices, hyperparameters, choices which data to include, etc
Of course that only works if the trial runs are representative of what your full scale model will look like. But within those constraints optimising training time seems very valuable
Not the point of the exercise obviously, but at five minutes' training I wonder how this would compare to a Markov chain bot.
I love seeing explorations like this, which highlight that easily accessible hardware can do better than most people think with modern architectures. For many novel scientific tasks, you really don't need an H100 to make progress using deep learning over classical methods.
The most powerful Macbook Pro currently has 16 CPU cores, 40 GPU cores, and 128 GB of RAM (and a 16-core “neural engine” specifically designed to accelerate machine learning). Technically, it is a laptop, but it could just as well be a computer optimized for AI.
The Mac Studio has: https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...
That's a well-made page, describing nice hardware, but it doesn't seem to be a laptop.
I think the point is that laptops are more limited than other form factors. I'm reading it as a response to the comment that MacBooks are computers optimized for AI and only technically laptops (which is a pretty ridiculous statement imo). Apple's architecture happens to be very good at a lot of compute-heavy tasks, especially where total available GPU RAM and low-latency handoff between the CPU and the GPU are concerned. This happens to be very well suited to LLM workloads.
From https://opendata.blender.org/ :
Apple M3 Ultra (GPU - 80 cores) scores 7235.31
NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
Note that NVIDIA's memory constraints are not like Apple silicon's, which also tends to be less I/O-constrained. YMMV
https://www.youtube.com/watch?v=d8yS-2OyJhw
https://www.youtube.com/watch?v=Ju0ndy2kwlw
Apple m3/m4 silicon is certainly good in some ways, but the bottleneck is often a lack of CUDA software support and price (could buy >4 times the GPU raw performance on a dual rtx 5090 desktop.) =3
Not just GPU performance -- the M3 Ultra has memory bandwidth of ~800GBps vs ~1,800GBps for the 5090.
I would wager that Apple recognizes the value prop for the mac to be used for AI and will up their memory bandwidth to stay in the game.
This is awesome - thanks for sharing. Appreciate the small-scale but comprehensive studies testing out different architectures, model sizes and datasets.
Would be curious to see a version of your model size comparison chart but letting the training continue until perplexity plateaus / begins to overfit. For example: are your larger models performing worse because they are overfitting to a small dataset, or because you are comparing model sizes at a fixed 5 minute computation time - so that the large models just don't get to learn very much in that time.
(Also interesting would be learning curve comparisons between architecture/param count)
If AI models were trained to connect to data sources (SQL) and use them to answer some questions, instead of just training on the data, it could reduce model size a lot.
That's what tools are for. (see MCP: https://modelcontextprotocol.io/docs/getting-started/intro)
Would RAG also be an approach here? My intuition from some small investigation is that RAG is more formal and structured to set up, but more efficient, whereas MCP you can just point an LLM at an MCP server and tell it to figure shit out (and also MCP can be used to _do_ stuff, not just to acquire more information).
> Would RAG also be an approach here?
For sure! If the RAG context includes "Raleigh is the capital city of the U.S. state of North Carolina" somewhere in whatever you feed it, one would hope that you'd get an accurate answer to that question.
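A minimal sketch of that flow, using TF-IDF retrieval from scikit-learn; ask_llm is a placeholder standing in for whatever small local model or API you actually call.

```python
# Minimal RAG sketch: retrieve the best-matching fact with TF-IDF and stuff it into
# the prompt. The fact list is toy data and ask_llm is a placeholder, not a real API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Raleigh is the capital city of the U.S. state of North Carolina.",
    "Paris is the capital of France.",
    "The Mississippi River flows into the Gulf of Mexico.",
]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)

def retrieve(question, k=1):
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vecs)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

question = "What is the capital of North Carolina?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)   # placeholder: plug in your small local model here
print(prompt)
```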
Thank you!
What about overnight on a desktop with a higher-end Nvidia gaming GPU? Asking for a friend.
The idea of tracking and optimizing this reminds me of similar efforts a few years ago especially for image models via DAWNBench.
https://dawnd9.sites.stanford.edu/dawnbench
The bigger question, or maybe even realization, is that with this architecture there is no way to build a capable model that runs on a laptop or phone, which means there will never be meaningful local compute and servers become ever more important. In general, thinking about how ML itself works, reducing model size while retaining capability will just never happen.
This post is about training, not inference.
The lesson here is that you can't use a laptop to train a useful model - at least not without running that training for probably decades.
That doesn't mean you can't run a useful model on a laptop that was trained in larger hardware. I do that all the time - local models hit really good this year.
> reducing model size while retaining capability will just never happen.
Tell that to Qwen3-4B! Those models are remarkably capable.
It's always a question of "compared to what?"
Local models are nowhere near as capable as frontier big models.
While a small model might be fine for your use case, it cannot replace Sonnet 4 for me.
Sure, Qwen-3-4B - a 4GB download - is nowhere near as capable as Claude Sonnet 4.
But it is massively more capable than the 4GB models we had last year.
Meanwhile, recent models that are within the same ballpark of capabilities as Claude Sonnet 4 - like GLM 4.5, Kimi K2, and the largest of the Qwen 3 models - can just about fit on a $10,000 Mac Studio with 512GB of RAM. That's a very notable trend.
It doesn't feel like the gap is closing at all.
The local models could get 10x as good next year; it won't matter to me if the frontier models are still better.
And even though we can run those models (heavily quantized, and thus less capable), they are unusably slow on that $10k of dead-weight hardware.
El Capitan being much faster than my desktop doesn't mean that my desktop is useless. Same with LLMs.
I've been using Mistral Small 3.x for a bunch of tasks on my own PC and it has been very useful, especially after I wrote a few custom tools with llama.cpp to make it more "scriptable".
I would be interested in hearing about those custom tools
It depends, actually... The data and training-time requirements seem to increase exponentially for linear gains in performance. As a result, you can often trade a 10x reduction in training time for a model with 90+% of the capability of the real deal. And as we accumulate more architecture and efficiency tricks, the ceiling on what you can do locally goes up commensurately (rough numbers sketched below).
There's also a whole world of data curation to improve training, which is likely to be great for small models and seems still underexplored.
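To put rough numbers on that trade-off (purely illustrative; the power-law form and the exponent below are assumptions for the sketch, not measured values):

  # If validation loss roughly follows a power law in compute, L(C) = a * C**(-alpha),
  # then cutting compute by 10x raises loss by a factor of 10**alpha.
  alpha = 0.05                      # assumed exponent, chosen only for illustration
  print(10 ** alpha)                # ~1.12: about 12% worse loss for 10x less compute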
Here's an Obfuscated C Contest entry that trains a toy model using LSTM:
https://www.ioccc.org/2019/mills/index.html
I suppose if you only have 5 minutes this is probably about the level you'd get.
Depends on how much weight you can support on your lap
There was https://sortbenchmark.org, and now we need something similar for AI - best per joule, per 1 cent, per minute.
An idea worth exploring: if specialized models can be trained quickly on specific datasets, they could be used as tools by bigger models.
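One way that could look (sketch only): wrap the small specialist behind a plain function and describe it with a JSON-schema tool spec in the common function-calling style. The specialist here is a stub sentiment scorer; score_sentiment and the surrounding host loop are hypothetical.

  def score_sentiment(text: str) -> float:
      """Stand-in for a small, quickly trained specialist model."""
      positive = {"great", "good", "love", "excellent"}
      negative = {"bad", "terrible", "hate", "awful"}
      words = text.lower().split()
      return (sum(w in positive for w in words) - sum(w in negative for w in words)) / max(len(words), 1)

  SENTIMENT_TOOL = {
      "name": "score_sentiment",
      "description": "Return a sentiment score in [-1, 1] for a piece of text.",
      "parameters": {
          "type": "object",
          "properties": {"text": {"type": "string"}},
          "required": ["text"],
      },
  }

  # A larger model would receive SENTIMENT_TOOL in its tool list, emit a call like
  # {"name": "score_sentiment", "arguments": {"text": "..."}}, and the host would
  # run score_sentiment and feed the result back into the conversation.
  print(score_sentiment("I love this great little model"))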
Any reason to upgrade an M2 16GB MacBook to an M4 ..GB (or 2026 M5) for local LLMs? I'm due an upgrade soon, and perhaps it would be educational to run these things more easily locally?
For LLMs, VRAM is requirement number one. Since MacBooks have unified RAM you can use up to 75% of it for the LLM, so a higher-RAM model would open more possibilities, but those are much more expensive (of course).
As an alternative you might consider a Ryzen 395+ like in the Framework desktop or HP ZBook G1a, but the 128GB versions are still extremely expensive. The Asus Flow Z13 is a tablet with the Ryzen 395+ but is hardly available with 128GB. (A rough fit check is sketched below.)
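Back-of-envelope way to sanity-check whether a quantized model fits in unified memory (rule-of-thumb numbers, not guarantees; the 75% usable fraction and 2 GB overhead are assumptions):

  def fits_in_unified_ram(params_b: float, bits_per_weight: float,
                          total_ram_gb: float, usable_fraction: float = 0.75,
                          overhead_gb: float = 2.0) -> bool:
      # Weights need roughly params * bytes-per-weight; add some slack for
      # KV cache and the runtime, then compare against the GPU-usable RAM.
      weights_gb = params_b * bits_per_weight / 8
      return weights_gb + overhead_gb <= total_ram_gb * usable_fraction

  # A 32B model at 4-bit on a 64 GB machine: ~18 GB needed vs ~48 GB budget.
  print(fits_in_unified_ram(params_b=32, bits_per_weight=4, total_ram_gb=64))   # True
  # A 70B model at 4-bit on a 32 GB machine: ~37 GB needed vs ~24 GB budget.
  print(fits_in_unified_ram(params_b=70, bits_per_weight=4, total_ram_gb=32))   # False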
I did just that, got the 32GB RAM one so I could run Qwen.
It might still be early days. I'm trying to use the model to sort my local notes, but I don't know, man, it seems only a little faster and still unusable, and I downloaded the lighter Qwen model as recommended.
Again, it's early days and maybe I'm being an idiot, but I did manage to get it to parse one note after about 15 minutes.
Have a 16GB one, just set up Ollama yesterday.
gpt-oss-20b eats too much RAM to use for anything other than an overnight task. Maybe 3 tok/s.
Been playing around with the 8B versions of Qwen and DeepSeek. Seems usable so far. YMMV, I'm just messing around in chat at the moment, haven't really had it do any tasks for me.
I would've liked to see some xLSTMs.
Probably something like a small logistic regression or a tiny GPT-2 variant (117M parameters) on a small dataset—anything beyond that will choke on RAM, VRAM, or time. Five minutes on a laptop = toy models, not miracles.
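For concreteness, here's roughly what a five-minute budget looks like in code: a tiny character-level transformer in PyTorch with a hard wall-clock cutoff. This is a sketch, not the author's setup; the corpus file and hyperparameters are placeholders.

  import time
  import torch
  import torch.nn as nn

  text = open("tiny_corpus.txt").read()          # any plain-text file (assumed to exist)
  chars = sorted(set(text))
  stoi = {c: i for i, c in enumerate(chars)}
  data = torch.tensor([stoi[c] for c in text])

  d_model, ctx, vocab = 128, 64, len(chars)

  class TinyLM(nn.Module):
      def __init__(self):
          super().__init__()
          self.emb = nn.Embedding(vocab, d_model)
          self.pos = nn.Embedding(ctx, d_model)
          layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=256, batch_first=True)
          self.blocks = nn.TransformerEncoder(layer, num_layers=2)
          self.head = nn.Linear(d_model, vocab)

      def forward(self, x):
          pos = torch.arange(x.size(1), device=x.device)
          h = self.emb(x) + self.pos(pos)
          mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
          return self.head(self.blocks(h, mask=mask))

  model = TinyLM()
  opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
  deadline = time.time() + 5 * 60                # the five-minute budget

  while time.time() < deadline:
      ix = torch.randint(0, len(data) - ctx - 1, (32,))
      xb = torch.stack([data[i:i + ctx] for i in ix])
      yb = torch.stack([data[i + 1:i + ctx + 1] for i in ix])
      loss = nn.functional.cross_entropy(model(xb).reshape(-1, vocab), yb.reshape(-1))
      opt.zero_grad(); loss.backward(); opt.step()

  print("final training loss:", loss.item())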
This would be more interesting if it wasn't about (L)LMs
I'd be interested in what implementation of D3PM was used (and failed). Diffusion models are more data-efficient than their AR LLM counterparts but less compute-efficient at training time, so it'd be interesting to know whether, with more time to converge, the diffusion approach does succeed. I guess I'll try :)
Siri.
A useful trick would be to start from an existing model instead of trying to train one from a random starting point.
Now imagine what you could do in 6 minutes!
But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.
It would have been useful to see the exact steps needed to replicate the result.
Which laptop, though?
I'd be happy with an AI that can just "train" on me: Just see what I do, learn from the repetitive tasks I do, and then do them quicker. An agent that is basically me x 10.
Start blank with no corporate-controlled/crippled state and just become me.
In fact, that might be the only way to let computers appear to grow faster into the future, even if their internal hardware only gets minor incremental improvements: Have your shit done before you sit down to do it.
You could train an unbeatable tic-tac-toe AI on your laptop in five minutes (minimax sketch below). It doesn’t get any stronger than that.
—
I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…
But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:
These days the meaning of the phrase “AI” has shifted from the classical definition (all algorithms welcome); now it usually means LLMs and their derivatives.
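For scale, the classical unbeatable tic-tac-toe AI mentioned above is a few dozen lines of minimax and needs no training data at all. A minimal sketch (board as a 9-character list, no dependencies):

  WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

  def winner(b):
      for a, c, d in WINS:
          if b[a] != " " and b[a] == b[c] == b[d]:
              return b[a]
      return None

  def minimax(b, player):
      """Return (score, move) from `player`'s perspective: +1 win, 0 draw, -1 loss."""
      w = winner(b)
      if w:
          return (1 if w == player else -1), None
      moves = [i for i, v in enumerate(b) if v == " "]
      if not moves:
          return 0, None
      other = "O" if player == "X" else "X"
      best = (-2, None)
      for m in moves:
          b[m] = player
          score, _ = minimax(b, other)
          b[m] = " "
          if -score > best[0]:
              best = (-score, m)
      return best

  # X to move: minimax blocks O's top row at square 2 and goes on to win.
  print(minimax(list("OO  X   X"), "X"))   # -> (1, 2)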
I’m actually working on just this. What’s the smallest training data set required to learn tic-tac-toe? A 5yo doesn’t need much training to learn a new game, but a transformer needs millions of samples.
> A 5yo doesn’t need much training to learn a new game
A 5yo also has... 5 years of cumulative real world training. I'm a bit of an AI naysayer but I'd say the comparison doesn't seem quite accurate.
It’s a glib analogy, but the goal remains the same. Today’s training sets are immense. Is there an architecture that can learn something with tiny training sets?
Maybe ZephApp, when it's actually released. But would be interesting to record day-to-day conversations (face-to-face using voice recognition) to train a virtual doppelganger of myself and use it to find uncommon commonalities between myself and others.
What would someone do with a year's worth of recorded conversations? Would the other parties be identified? How would it be useful, if at all? How about analyzing the sounds/waveform rather than words? (eg BioAcousticHealth / vocal biomarkers)
Perhaps typing into a text-field is the problem right now? Maybe have a HUD in a pair of glasses. Better than getting a brain chip! Most recent or most repeated conversations most important. Could lead to a reduction in isolation within societies, in favor for "AI training parties." Hidden questions in oneself answered by a robot guru as bedtime story-telling but related to the real-world and real-events.
Smart Glasses --> Smart Asses
Vibe Coding --> Tribe Loading
Everything Probable --> Mission Impossible
I'm certainly not challenging anything you're writing, because I only have a very distant understanding of deep learning, but I do find the question interesting.
Isn't there a bit of a defining line between something like tic-tac-toe, which has a finite (and, for a computer, pretty limited) set of possible combinations, where it seems like you shouldn't need a training set larger than that set of combinations, and something more open-ended, where the size of the training set mainly impacts accuracy?
Assuming you don't account for reflections, rotations, and 'unreachable' game states where a player wins and you continue to mark boxes.
It's just 3^9, right? 9 boxes, each either X, O, or blank? That's only 19,683 board states, and we'd trim down from there if we account for the cases above.
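A quick script makes the trimming concrete: enumerating only positions reachable by legal alternating play (and stopping once someone has won) cuts 19,683 down to 5,478 states, before even folding symmetries.

  WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

  def winner(b):
      return next((b[a] for a, c, d in WINS if b[a] != " " and b[a] == b[c] == b[d]), None)

  def reachable_states():
      seen, frontier = {" " * 9}, [" " * 9]
      while frontier:
          b = frontier.pop()
          if winner(b):
              continue                     # no further moves once someone has won
          player = "X" if b.count("X") == b.count("O") else "O"
          for i, v in enumerate(b):
              if v == " ":
                  nb = b[:i] + player + b[i + 1:]
                  if nb not in seen:
                      seen.add(nb)
                      frontier.append(nb)
      return seen

  print(len(reachable_states()))           # 5478 reachable positions, including the empty board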
Exactly, but then we may as well say "don't solve this with an LLM" which sort of kills the conversation altogether and that's not my goal. :)
Oh, im sorry! I was just trying to give a quick perspective of how small that tic-tac-toe data-set actually is. Not suggest against the idea!
Oh no worries at all. :)
And hundreds of millions of years of evolutionary intelligence.
Next step in AI: teaching an LLM to think like a trilobite!
A trilobite was obviously better at being a trilobite than an LLM would be, if only by definition.
Was the six million dollar man not a better man?
This sounds super interesting. Will you be sharing your work anywhere? :)
Right now, Qwen3 4B
Thanks.
The best LLMs on the planet right now are Gemini Pro 2.5 and Gemini Flash 2.5; nothing comes close to them.
Once you set up a good system prompt on these, nothing really compares.
Most of the models you see with high benchmarks are not even comparable on real tasks.
Qwen3 or DeepSeek R1 aren't even 1/10 as good as Gemini Pro 2.5.
> not even comparable on real tasks.
Care to elaborate on how Gemini completed this task successfully and how the other models fumbled?
I am using AI to write full projects with complete code generation, and I haven't found any model that comes close to Gemini Pro 2.5 in code reasoning and generation.
While other models like Qwen3 and GLM promise big, in real code writing they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with empty responses, especially around this time.
Then they are not the best. Most users aren't prompt engineers and grew up expecting to enter search terms into Google and get a result. If it's the case that OpenAI or Anthropic are best able to interpret user intent, there's a good argument to be made that they are the best.
This is something people do not understand.
If the model trusts the user, and the user is wrong, the model will weigh the user's input much too highly and end up with flawed code.
If the model is more independent, it will find the right solution. If you just want a model that says yes to everything and follows you even when you're not thinking clearly, you'll never end up with a good solution except by luck.