jonny@neuromatch.social

Wow this sucks so bad.

A person posted on /r/datahoarder that they have created an archive of the Epstein files with added metadata, like the people mentioned, etc. But everything is LLM-generated... including the "full text" of the documents.

Rather than being OCRed, they were fed to ChatGPT with a system prompt that told it that it was an expert at OCR.

https://github.com/epstein-docs/epstein-docs.github.io/blob/b92183bb667afd636872d9f854de8154d61f68b4/process_images.py#L102

mschfr@mastodon.social

@jonny I've played around with some AI-based OCR and it sucks ass. Just use PaddleOCR

jonny@neuromatch.social

It's so cool how LLMs are always finding new ways to do misinformation, even when you think you are creating very meticulous, archive-grade information.

jonny@neuromatch.social

@mschfr
Ah, is that the new Tesseract? Can't find benchmarks (thanks, AI search results), but in your experience it's more accurate?

mschfr@mastodon.social

@jonny I have a blog post (in German) about detecting text in photos and PaddleOCR performed much better than Tesseract there. Haven't checked with strangely scanned PDFs, but I suspect that it also might do well

https://schmalenstroer.net/blog/2025/03/bastelstunde-vision-model-basierte-bildverschlagwortung/

jonny@neuromatch.social

"AI" is just so convenient. Finally we are no longer beholden to traditional OCR which requires a few clicks or commands to yield highly accurate text with predictable failure modes.

Now all you have to do is explain the entire nature of what OCR is, how "reading" works, and how "documents" as a representation of language work. If you remember to insist repeatedly that the position in which text is laid out in two dimensions impacts its representation as a string, you may yield ?????

jonny@neuromatch.social

y'all I am done for

eliocamp@mastodon.social

@jonny obligatory xkcd

aburka@hachyderm.io

@eliocamp @jonny 2025 edit
