Wow this sucks so bad.
A person posted on /r/datahoarder that they have created an archive of the Epstein files with added metadata, such as the people mentioned. But everything is LLM-generated.... Including the "full text" of the documents.
Rather than OCRing them, they fed them to ChatGPT with a system prompt telling it that it was an expert at OCR.
-
@jonny I've played around with some AI-based OCR and it sucks ass. Just use PaddleOCR
-
It's so cool how LLMs are always finding new ways to do misinformation, even when you think you are creating very meticulous, archive-grade information.
-
@mschfr
Ah, is that the new Tesseract? I can't find benchmarks (thanks, AI search results), but in your experience it's more accurate?
-
@jonny I have a blog post (in German) about detecting text in photos, and PaddleOCR performed much better than Tesseract there. I haven't checked it with strangely scanned PDFs, but I suspect it might also do well.
https://schmalenstroer.net/blog/2025/03/bastelstunde-vision-model-basierte-bildverschlagwortung/
-
"AI" is just so convenient. Finally we are no longer beholden to traditional OCR which requires a few clicks or commands to yield highly accurate text with predictable failure modes.
Now all you have to do is explain the entire nature of what OCR is, how "reading" works, and how "documents" as a representation of language work. If you remember to insist repeatedly that the position in which text is laid out in two dimensions impacts its representation as a string, you may yield ?????
-
y'all I am done for
-
@jonny obligatory xkcd
-