Online meeting - using ML to extract images from periodicals and newspapers

In recent decades, millions of pages from historical newspapers, periodicals, and magazines have been digitized. Alongside the text, which OCR has made machine-readable, these publications contain millions of images. Until recently, it was relatively hard to extract, explore, and analyze this treasure trove of historical visual culture. This online meeting aims to bring together projects and scholars interested in the first step of studying these images at scale: extracting them from historical sources. In addition to a presentation by Shannon Shen on Layout Parser, an easy-to-use toolkit for deep learning-based Document Image Analysis (DIA), we will discuss best practices and pitfalls.

About our speaker and Layout Parser: Shannon Shen is a Ph.D. student at MIT’s CSAIL. He was previously the lead developer of Layout Parser, an easy-to-use toolkit for deep learning-based Document Image Analysis.

This meeting is part of the project ‘Multimodal AI, Image Analysis, and the Illustrated Periodical Press’ by Thomas Smits, Paul Fyfe, Ben Lee, Irene Testini, and Julia Thomas (funded by the RSVP’s 2023 Field Development Grant).

Images on a front page of L'Illustration identified using Layout Parser.