📚 From Scans to Structured Data: Wrangling a Bibliography with OCR, OpenRefine & Zotero

We started with a simple goal: turn a scanned list of publications by W.W. Judd into something useful—searchable, sortable, and structured.

What we had: a PDF of scanned pages.
What we wanted: a clean, organized list of citations we could work with in tools like Zotero, Excel, or BibTeX.

Step 1: OCR Isn’t Magic (But It Helps)

The first pass used basic OCR, which was… not great. So we re-ran the scans using ABBYY FineReader, a more robust tool that does a better job recognizing text. Still, OCR can’t fix everything—misread characters, weird line breaks, and inconsistencies in punctuation are common.

🔍 Why OCR Isn’t Perfect: Real Examples

Even with powerful tools like ABBYY FineReader, OCR can struggle with older fonts, smudges, or inconsistent formatting in scanned documents. Here are a few examples of the kinds of errors we encountered:

  • Misrecognized words:
    • tppearance instead of appearance
    • Jniversity instead of University
    • speci aens instead of specimens
    • T 1rtricidae instead of Tortricidae
    • Unic •lor instead of unicolor
  • Garbled headers or artifacts:
    • -fur -ďż˝G:SB Ql2S- likely junk or bleed-through from another layer of the scan
  • Character substitutions and stray spacing inside citations:
    • Am1als of the Entomological Society instead of Annals of the Entomological Society
    • Macrobasis unic •lor Kirby instead of Macrobasis unicolor Kirby

These small errors add up—especially when trying to split citation metadata into columns like journal name, volume, issue, and page numbers. That’s why cleaning and refining is such a key step.
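One low-tech way to handle recurring misreads like the ones listed above is a lookup table applied to the whole text. The sketch below is only an illustration: the table name and function name are made up for this post, and the entries are simply the errors we happened to spot, so a real table would grow as more turn up.

```python
# Hypothetical correction table built from the misreads listed above.
# Keys are the OCR output, values are the intended text.
CORRECTIONS = {
    "tppearance": "appearance",
    "Jniversity": "University",
    "speci aens": "specimens",
    "T 1rtricidae": "Tortricidae",
    "unic •lor": "unicolor",
    "Am1als": "Annals",
}


def fix_known_misreads(text: str) -> str:
    """Replace each known OCR misread with its corrected form."""
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text


print(fix_known_misreads("Macrobasis unic •lor Kirby"))
print(fix_known_misreads("Am1als of the Entomological Society"))
```

A table like this only fixes errors someone has already noticed, and the match is case-sensitive, so it complements manual review rather than replacing it.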

Step 2: Let’s Try BibTeX?

We considered going straight to BibTeX, thinking we could load that into Zotero and export clean records from there. But BibTeX needs structure, and OCR errors (especially around punctuation) made import tricky—volume, issue, and page numbers didn’t always land in the right place.
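To show what "BibTeX needs structure" means in practice, here is a rough sketch of turning one already-split record into a BibTeX entry. The function name, the citation key, and every value except the author's name are placeholders invented for illustration; they are not taken from the actual bibliography.

```python
def to_bibtex(key: str, fields: dict) -> str:
    """Format one record as a BibTeX @article entry, skipping empty fields."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        if value:  # leave out fields the OCR pass failed to recover
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder record: illustrative values only, not a real citation.
record = {
    "author": "Judd, W. W.",
    "title": "Placeholder title",
    "journal": "Placeholder Journal",
    "volume": "12",
    "number": "",  # issue number missing in this example
    "pages": "45--67",
}

entry = to_bibtex("judd_placeholder", record)
print(entry)
```

Each field has to be isolated before an entry like this can be written, which is exactly where stray punctuation from OCR gets in the way.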

Step 3: OpenRefine to the Rescue

At this point, we changed tactics and brought the semi-cleaned CSV into OpenRefine. This tool lets you split, clean, cluster, and transform messy data. It helped a lot in teasing apart journal names, volume/issue info, and page ranges—even when the OCR inconsistencies made things tricky.
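We did this splitting inside OpenRefine, but the idea can be sketched in plain Python. The example assumes a citation tail shaped like "Journal Name 12(3): 45-67", which is an assumption for illustration and may not match every entry in the scans. The pattern deliberately accepts a few punctuation variants, since OCR often swaps a colon for a semicolon or a period.

```python
import re

# Assumed tail format: "Journal Name 12(3): 45-67" (issue is optional).
TAIL = re.compile(
    r"^(?P<journal>.+?)[,.]?\s+"       # journal name, optional trailing comma or period
    r"(?P<volume>\d+)\s*"              # volume number
    r"(?:\((?P<issue>\d+)\))?\s*"      # optional issue in parentheses
    r"[:;.,]\s*"                       # separator, tolerating OCR punctuation swaps
    r"(?P<first>\d+)\s*[-–]\s*(?P<last>\d+)\.?$"  # page range with hyphen or en dash
)


def split_tail(tail: str):
    """Return a dict of citation parts, or None if the tail does not fit the pattern."""
    match = TAIL.match(tail.strip())
    return match.groupdict() if match else None


print(split_tail("Placeholder Journal 12(3): 45-67"))
print(split_tail("Placeholder Journal, 12; 45–67."))
print(split_tail("-fur G:SB Ql2S-"))
```

Anything that returns None is a good candidate for the manual cleanup pile.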

Some manual cleanup is still needed, but we’re getting closer every time.

🧰 Tools That Made a Difference

  • Zotero: A free citation manager to store, organize, and export bibliographies. Great for importing BibTeX or RIS files, editing metadata, and generating citations. (It can export CSV, but it does not import CSV directly.)
  • OpenRefine: A powerful tool for cleaning messy data—splitting columns, fixing inconsistencies, clustering duplicates, and more.
  • Google Sheets / Excel: Perfect for quick sorting, filtering, and editing when a visual layout helps.
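For a feel of what "clustering duplicates" does, here is a simplified sketch loosely modeled on the fingerprint key-collision idea described in OpenRefine's documentation. It is not OpenRefine's actual code, and the function names are our own: each value is lowercased, stripped of punctuation and accents, split into words, de-duplicated, and sorted, and values that end up with the same key are grouped together.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: ASCII, lowercase, no punctuation, sorted unique words."""
    ascii_text = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = ascii_text.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = [
    "Annals of the Entomological Society",
    "annals of the entomological society.",
    "Annals of the  Entomological Society,",
    "Placeholder Journal",
]
print(cluster(names))
```

This catches differences in case, spacing, and punctuation. It will not catch a misread like Am1als, which changes the letters themselves; those need a fuzzier method or a manual fix.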

🎓 Advice for the Next Steps

If you’re continuing this project (hello student collaborator!), don’t worry if you haven’t used these tools before. You don’t have to master them overnight—just knowing they exist and what they’re good at is half the battle.

We’re happy to support you:

  • Need help with Zotero? The library has great guides and staff to help.
  • Curious about OpenRefine? It’s free to install and worth exploring.
  • Never heard of BibTeX? No problem—we can walk you through it.

🌟 Final Thoughts

This kind of digital cleanup is never one-and-done. It’s iterative, a bit finicky, and oddly satisfying. Every round of improvement makes the data more useful—and a little closer to the polished, accessible resource we want it to be.

So whether you're fixing a field or learning a new tool, you're helping make scholarship more findable, usable, and shareable.

Written with ChatGPT because I'm in a hurry