> If AI can diminish some of the monotony of research, perhaps we can spend more time thinking, writing, playing piano, and taking walks — with other people.
Whenever any progress is made, this is the logical conclusion. And yet those who decide how your time is used have an opposing view.
If there is one job at a university and ten researchers applying, and one of them used this improvement in research speed to do more research while the other nine took the chance to play more piano and take more walks, then most likely that one will get the job. This competitive nature is what has driven society forward rather than keeping us just above subsistence agriculture.
I feel that we’re reaching a limit to our context switching. Any further process improvements or optimizations will be bottlenecked on humans. And I don’t think AI will help here, as jobs will account for that, and we’ll have to context switch across even broader and more complex scopes.
I think the limit has been exceeded. That's the primary reason everything sort of sucks now. There is no time to slow down and do things right (or better).
IMO, cybersecurity, for example, will have to become a government mandate with real penalties for non-compliance (like seat belts in cars were mandated) in order to force organizations to slow down and make sure systems are built carefully and as correctly as possible to protect data.
This is in conflict with the hurtling pace of the garbage-in/garbage-out AI-generated stuff we see today.
Maybe akin to how faster computers bred programs that are slower than before.
Better “thinking” computers will breed worse thinking people, huh?
I confess this largely surprises me, for reasons that I think should not surprise me. I would expect current AI to be best at guessing what some writing says based on expectations built from other things it has managed to "read." As such, I would think it would not be much better at handwriting than any other tool.
Yet it occurs to me that "guess and check" is exactly what I'm doing when trying to read my 6-year-old's writing. Often I will do a pass to detect the main sounds, but then I start thinking of what was currently on his mind and see if I can make a match. Not surprisingly, I often do.
Great post and amazing progress in this field! However, I have to wonder if some of these letters were part of the training data for Gemini, since they are well-known and someone has probably already done the painstaking work of transcribing them...
Most likely, and probably inferring the structure from texts with "similar" writing forms. I tried with my handwriting (in Italian) and the performance wasn't that stellar. More annoyingly, it is still an LLM and not a "pure" OCR, so some sentences were partially rephrased with different words than the ones in the text. This is especially problematic if these models are to be used to transcribe historical documents.
> Tried with my handwriting (in italian) and the performance wasn't that stellar.
Same here, for diaries/journals written in mixed Swedish/English/Spanish and with absolutely terrible handwriting.
I'd love for the day when the writing is on the wall for handwriting recognition, which is something I bet on when I started my journals, but it seems that day has yet to come. I'm eager to get there though, so I can archive all of it!
So it doesn't work is what you're saying, right?
Are you sure you used the Gemini 3.0 Pro model? Maybe try increasing the media resolution in AI Studio if the text is small.
I have a personal corpus of letters between my grandparents in WW2, my grandfather fighting in Europe and my grandmother in England. The ability of Claude and ChatGPT to transcribe them is extremely impressive, though I haven’t worked on them in months, so this was with older models. At that time, neither system could properly organize pages, and ChatGPT would sometimes skip a paragraph.
I've also been working on half a dozen crates of old family letters. ChatGPT does well with them and is especially good at summarizing the letters. Unfortunately, all the output still has to be verified because it hallucinates words and phrases and drops lines here and there. So at this point, I still transcribe them by hand, because the verification process is actually more tiresome than just typing them up in the first place. Maybe I should just have ChatGPT verify MY transcriptions instead.
It helps when you can see the confidence of each token, which downloadable weights usually give you. Then, whenever you (or your software) detect a low-confidence token, run over that section multiple times to generate alternatives, and either go with the highest-confidence one or manually review the suggestions. Easier than having to manually transcribe those parts, at least.
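For the curious, here's a minimal sketch of pulling per-token confidence out of downloadable weights, using TrOCR via Hugging Face as the example; the 0.85 threshold is arbitrary, and the re-sampling step is left as a commented hint:

    import torch
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    image = Image.open("line.png").convert("RGB")  # one line of handwriting
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    out = model.generate(pixel_values, output_scores=True,
                         return_dict_in_generate=True, max_new_tokens=64)

    # Per-token confidence: softmax over the logits at each decoding step.
    confs = [torch.softmax(s, dim=-1).max().item() for s in out.scores]
    tokens = processor.tokenizer.convert_ids_to_tokens(out.sequences[0].tolist())[1:]

    LOW = 0.85  # arbitrary threshold for flagging tokens
    for tok, conf in zip(tokens, confs):
        flag = "  <-- re-run or review" if conf < LOW else ""
        print(f"{conf:.2f}  {tok}{flag}")

    # To generate alternatives for the flagged spans, re-run with sampling, e.g.:
    # model.generate(pixel_values, do_sample=True, temperature=0.7, ...)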
Possibly, but given it can also read my handwriting- which is much, MUCH worse than Boole’s - with better accuracy than any human I’ve given it to- that’s probably not the explanation.
Shhhhh no one cares about data contamination anymore.
Then write something down yourself and upload a picture to gemini.google.com or ChatGPT. Hell, combine it: make yourself a quick math test, print it, solve it with a pen, and ask these models to correct it.
They're very good at it.
For that to be relevant to this post, they would need to write with secretary hand.
Quite certain my doctor can still produce writing that the models don't stand a chance of recognizing.
Anecdata inbound, but my PCP, thankfully, used Nuance's speech-to-text platform remarkably well for adding his own commentary on things. It was a refreshing thing to see, and I hope my clinicians use it.
I'm just excited that I may finally be able to decipher my meeting notes from yesterday!
Any self-hosted open source solution? I would like to digitize my paper notebooks but I do not want to use anything proprietary or that uses external services. What is the state of the art on the FOSS side?
Ideally something that I can train with my own handwriting. I had a look at Tesseract, wondering if there’s anything better out there.
For regular handwriting, there are many.
For historical handwriting, Gemini 3 is the only one which gave a decent result, on 19th-century minutes from a town court in Northern Norway (Danish gothic handwriting with bleed-through). I'm not 100% sure it's correct, but that's because it's so dang hard to read that it's hard to verify. At least I see it gets many names, dates, and locations right.
I've been waiting a long time for this.
> For regular handwriting, there are many.
Please share. I am out of the loop, and my searches have not pointed me to the state of the art. The field has seen major steps forward in the past 3 or 4 years, but most of it seems to be closed or attached to larger AI products.
Is it even still called OCR?
The best open-source OCR models for handwriting in my experience are surya-v2 and nougat; which is better really depends on the docs. Each got about 90% accuracy (cosine similarity) in my tests. I have not tried DeepSeek-OCR, but I mean to at some point.
Totally not what you asked, but making an OCR model is a classic learning exercise for AI research students. Using the Kaggle-hosted dataset https://www.kaggle.com/datasets/landlord/handwriting-recogni... and a tutorial, e.g. https://pyimagesearch.com/2020/08/17/ocr-with-keras-tensorfl... you can follow along and train your own OCR model!
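For a flavor of what that tutorial builds, here's a rough Keras sketch of the character classifier; the layer sizes are illustrative, not the tutorial's exact architecture:

    from tensorflow.keras import layers, models

    # Character classifier over 28x28 grayscale crops, in the spirit of the
    # tutorial above. 36 classes = A-Z plus the digits 0-9.
    NUM_CLASSES = 36

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: (N, 28, 28, 1) float32 in [0, 1]; y_train: (N,) int labels,
    # prepared from the Kaggle dataset linked above.
    # model.fit(x_train, y_train, validation_split=0.1, epochs=20)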
Try various downloadable weights that have vision support; they're all good at different examples, and running multiple ones, then something to aggregate/figure out the right answer, usually does the trick. Some recent ones to keep on the list: ministral-3-14b-reasoning, qwen3-vl-30b, magistral-small-2509, gemma-3-27b.
Personally, I found magistral-small-2509 to be the most accurate overall, but it completely fails on some samples, while qwen3-vl-30b doesn't struggle at all with those same samples. So it seems the training data is really uneven depending on what exactly you're trying to OCR.
And the trade-off, of course, is that these are LLMs, so they're not exactly lightweight or fast on consumer hardware, but at least with the multi-model approach you greatly increase the accuracy.
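A rough sketch of that multi-model pass, assuming the models are served locally through Ollama's HTTP API; the model tags and the consensus rule are illustrative:

    import base64
    import difflib
    import requests

    # Illustrative tags; match whatever your local runtime calls these models.
    MODELS = ["qwen3-vl-30b", "magistral-small-2509", "gemma-3-27b"]
    PROMPT = "Transcribe the handwriting in this image exactly. Output only the text."

    def ocr_once(model, image_b64):
        # Ollama's documented /api/generate endpoint accepts base64 images.
        r = requests.post("http://localhost:11434/api/generate", json={
            "model": model,
            "prompt": PROMPT,
            "images": [image_b64],
            "stream": False,
        })
        r.raise_for_status()
        return r.json()["response"].strip()

    def consensus_transcript(path):
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        transcripts = [ocr_once(m, image_b64) for m in MODELS]
        # Crude aggregation: keep the transcript most similar to the others
        # (a "medoid"), on the theory that hallucinations rarely agree.
        def total_similarity(i):
            return sum(difflib.SequenceMatcher(None, transcripts[i], t).ratio()
                       for j, t in enumerate(transcripts) if j != i)
        best = max(range(len(transcripts)), key=total_similarity)
        return transcripts[best]

    print(consensus_transcript("page_001.jpg"))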
I thought handwriting recognition is on the wall because no one knows how to do it any more...
Maybe for English; for the other human languages I use, it is still kind of hit and miss, just like speech recognition. Even with English, it suffices to have an accent that is off the standard TV one.
Agreed. I've had successes with 18th-century Dutch, but again quite a few failures and mistakes.
As always, this depends on the amount of training data available. Japanese is another success story: https://digitalorientalist.com/2020/02/18/cursive-japanese-a...
Interesting, thanks for sharing.
ee lay vhen!
They don't do Scottish accents!
Indeed, has it improved anything in 14 years?
https://www.youtube.com/watch?v=BOUTfUmI8vs
I became convinced of this after the release of KuroNet: https://arxiv.org/pdf/1910.09433 (High-quality OCR of Japanese manuscripts, which look almost impossible to read.)
> "transmitted": In the second line of the body, the word "transmitted" is crossed out in the original text
Am I nuts or is this wrong, not “perfect”?
It doesn’t look crossed out at all to me in the image, just some bleeding?
Still very impressive, of course
It's painful to see that the beautiful handwriting of the past is now pretty much extinct. For me, a person's handwriting says a lot about them, not just their mind, but their physical state as well.
Call me when it can do Russian Cursive.
Seems to do an OK job:
https://g.co/gemini/share/e173d18d1d80
This is a random image from Twitter with no transcript or English translation provided, so it's not going to be in the training data.
No, the transcription has nothing to do with the written text; it guessed a few words here and there but not even the general topic. That's a doctor's note about a patient visit, beginning with "Прием: состояние удовл., t*, но кашель" ("Patient visit: condition is OK, t (temperature normal?) but coughing"). But unreadable doctors' handwriting is a meme...
That's Gemini 2.5 Flash btw
The result from Gemini 3 Pro using the default media resolution (the medium one):

"(Заголовок / Header): Арсеньев (Фамилия / Surname - likely "Arsenyev")
Состояние удовл., t N, кожные покровы чистые, [л/у не увел.]
В зеве умерен. [умеренная] гипер. [гиперемия]
В легких дыхание жесткое, хрипов нет. Тоны серд[ца] [ритм]ичные.
Живот мягкий, б/б [безболезненный]. мочеисп. [мочеиспускание] своб. [свободное]
Ds: ОРЗ [или ОРВИ]"

And with the translation:

"Arsenyev
Condition satisfactory. Temp normal, skin coverings [skin] are clean, lymph nodes not enlarged. In the throat [pharynx], moderate hyperemia [redness]. In the lungs, breathing is rigid [hard], no rales [crackles/wheezing]. Heart tones are rhythmic. Abdomen is soft, painless. Urination is free [unhindered]. Diagnosis: ARD (Acute Respiratory Disease)."

This is a historical church document from the 19th century, and Gemini got the common words right but completely hallucinated the names of the village and the people.
https://gemini.google.com/share/f98de1d5ac55
Right, it can do modern writing, but anything older than a century (church records and census) and it produces garbage. Yandex Archives figured that out and has a CER in the single digits, but they have the resources to collect immense data for training. I'm slowly building a dataset for fine-tuning a TrOCR model, and the best it can do is 18% CER... which is sort of readable.
How do you do, fellow TrOCR fine-tuner?
I'm using TrOCR because it's a smaller model that I can fine-tune on a consumer card, but the age of the model and its resources certainly make it a challenge. The official notebook for fine-tuning hasn't been updated in years and has several errors due to the march of progress in the primary packages.
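In case it saves anyone a fight with the outdated notebook, here's a minimal fine-tuning skeleton against recent transformers APIs; the hyperparameters are illustrative, and the (image, transcription) pairs are your own data:

    import torch
    from transformers import (TrOCRProcessor, VisionEncoderDecoderModel,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments,
                              default_data_collator)

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    # Wiring the old notebook also does, just against current APIs:
    model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
    model.config.pad_token_id = processor.tokenizer.pad_token_id
    model.config.vocab_size = model.config.decoder.vocab_size

    class HandwritingDataset(torch.utils.data.Dataset):
        """Wraps a list of (PIL image, transcription string) pairs."""
        def __init__(self, pairs, max_len=64):
            self.pairs, self.max_len = pairs, max_len
        def __len__(self):
            return len(self.pairs)
        def __getitem__(self, i):
            image, text = self.pairs[i]
            pixel_values = processor(images=image, return_tensors="pt").pixel_values[0]
            ids = processor.tokenizer(text, padding="max_length",
                                      max_length=self.max_len,
                                      truncation=True).input_ids
            # Mask pad tokens so the loss ignores them:
            labels = [t if t != processor.tokenizer.pad_token_id else -100 for t in ids]
            return {"pixel_values": pixel_values, "labels": torch.tensor(labels)}

    args = Seq2SeqTrainingArguments(
        output_dir="trocr-finetuned",
        per_device_train_batch_size=8,   # the base model fits on a consumer card
        fp16=torch.cuda.is_available(),
        predict_with_generate=True,
        num_train_epochs=10,
        eval_strategy="epoch",
    )

    # train_pairs / eval_pairs: your own (image, transcription) data.
    # trainer = Seq2SeqTrainer(model=model, args=args,
    #                          data_collator=default_data_collator,
    #                          train_dataset=HandwritingDataset(train_pairs),
    #                          eval_dataset=HandwritingDataset(eval_pairs))
    # trainer.train()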
> Here’s Transkribus’s best guess at George’s letter to Maryann, above:
Transkribus has a new model architecture around the corner, and the results look impressive. Not only for trivial cases like plain text, but also for table structures and layout.
Best of all, you can train it on your own corpus of text to support obscure languages and handwriting systems.
Really looking forward to it.
Anyone know how the models do on Russian cursive?
Ultimate Test
Surely the true prize is to be able to ditch computers altogether and just write with pencil on paper.
I am writing on paper with the hope that one day I can digitize everything painlessly with 99.99% accuracy.
Keyboards are faster.
Don't worry, handwriting itself has diminished throughout the decades since the introduction of computers and especially smartphones.
Ah, maybe I'll pick up Qin seal script when I retire, if I retire.
If I went back in time to the 90s when I was doing my PhD I would absolutely blow my mind with how well handwriting OCR works now.
My question for OCR automation is always: which digits within the numbers being read are allowed to be incorrect?
It feels unbelievable that in Europe the literacy rate could be 10% or lower. Then I look at documents even as young as 150 years old... fraktur, blackletter, elaborate handwriting. I guess I'm illiterate now.
Hopefully next generations will feel the same about legal contracts, law in general, and Java code bases. They're incomprehensible not because of fonts but because of unfathomable complexity.
Which Europe and which century do you live in where literacy rate is below 10%?
I'm speaking about past centuries.
You can learn fraktur or blackletter in a day and Cyrillic in a few days, if you already know the Latin alphabet.
> learn fraktur or blackletter in a day and Cyrillic in a few days
Not a chance, sorry.
Why? The former are just different typefaces (I learned to read them by myself when I was 10 while looking at our old books) and the latter I sort of picked up while travelling through Serbia and Bulgaria (I don't speak the languages).
The writing is on the wall for handwriting. Zoomers use speech recognition or touchscreen keyboards, millennials use keyboards, boomers use pens.
Silly comment. Handwriting is proven to correlate with much better memory retention, which ultimately means a much greater degree of association with existing memories and the creation of novel ideas.
"The comparison between handwriting and typing reveals important differences in their neural and cognitive impacts. Handwriting activates a broader network of brain regions involved in motor, sensory, and cognitive processing, contributing to deeper learning, enhanced memory retention, and more effective engagement with written material. Typing, while more efficient and automated, engages fewer neural circuits, resulting in more passive cognitive engagement. These findings suggest that despite the advantages of typing in terms of speed and convenience, handwriting remains an important tool for learning and memory retention, particularly in educational contexts."
https://pmc.ncbi.nlm.nih.gov/articles/PMC11943480/
You are literally handicapping yourself by not thinking with pen and paper, or keeping paper notes.
The future is handwriting with painless digitization for searchability, until we invent a better input device for text that leverages our motor-memory facilities in the brain.
This paper just says that handwriting requires more cognitive load?
Which is exactly my experience with handwriting through my school years. When handwriting notes during lectures, all focus goes to getting the words down, and it becomes impossible to actually focus on the meaning behind them.
I invoke the Lindy effect. Handwriting survived printed characters, typewriters, and the last 50-70 years of computers and keyboards; it will survive this too.
I love how you fit right into the current meme that Gen-X never gets mentioned.