Lecture 11.1 -- Decluttering Slides

Jill Naiman & Sharon Comstock

Data Storytelling - Semester - Fall 2025

notes: Our corpus of interest is the historical astrophysics literature. The initial dataset contains ~50k scanned pages, but the potential total corpus encompasses about 3 million articles of 5-10 pages each. This data covers the years from about 1850 to 1997, i.e. the “pre-digital” era, in which text and other page objects are not easily extracted from the PDF. We want to “digitize” these documents, and by “digitizing” here we mean extracting specific page objects such as tables, math formulas, figures, and figure captions.

---


# Version \#2

---

notes: Our corpus of interest is the historical astrophysics literature. The initial dataset contains ~50k scanned pages, but the potential total corpus encompasses about 3 million articles of 5-10 pages each. This data covers the years from about 1850 to 1997, i.e. the “pre-digital” era, in which text and other page objects are not easily extracted from the PDF.

---

notes: We want to “digitize” these documents. And by “digitizing” here we mean extracting specific page objects including things like…

---


## What are some differences that you noticed between Version \#1 and \#2?



