Network's integration year: 2019
Areas: 1 "digitization" and 2 "exploration and search"
Fields of study: French language and literature, linguistics, information technology
This project aims to exploit a corpus of around 5,500 documents from the mid-17th century. These documents, which deal with the regency of Cardinal Mazarin, form the so-called "mazarinades" corpus. The digital humanities project we propose combines NLP and data mining techniques to give researchers from various fields (historians, linguists, literary scholars, etc.) better access to this corpus.
Our first contribution will be to improve the quality of the textual data obtained via OCR, in particular by taking advantage of deep learning methods. Character recognition needs to be fine-tuned for texts of this age: populating library databases with better-quality text (not just digitized images of the documents) will be of great help to many researchers. In this respect, the mazarinades, often considered the first press campaign in French history, are particularly context-dependent: they need to be related to many other sources, for which automatically obtaining machine-readable text is crucial. Other leads may also be followed to meet these objectives: crowdsourcing, annotation, named entity recognition, etc.
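One common way to quantify the OCR quality discussed above is the character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the length of the reference. The following is a minimal sketch, assuming a manual reference transcription is available for a sample of pages. The function names and the sample strings are illustrative assumptions, not part of the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings for illustration only, not taken from the corpus.
    reference = "le cardinal"
    ocr_output = "le cardina1"
    print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

Computing this score before and after a change to the OCR model, on the same sample of pages, gives a simple baseline for judging whether the recognition has improved.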
Our second contribution will be to gather various NLP components dedicated to this corpus. Tasks such as text dating, authorship attribution, and clustering will be pursued. We first want to apply these components to the raw text, and then to evaluate how much better OCR improves the quality of the subsequent automated analysis.
Finally, one of the objectives is to propose data visualizations (dataviz) that give meaning to these raw data. The political and polemical texts we are working on only make sense in relation to each other: with several thousand documents, a synthetic visualization stands in for the human eye, which cannot take in such a mass of material at once.