Publications

7 peer-reviewed publications in NLP, named entity recognition, Chinese historical text processing, OCR and transfer learning. Full profile: Google Scholar · HAL

HistText: An Application for Leveraging Large-Scale Historical Textbases

JDMDH 2024 — Journal of Data Mining & Digital Humanities

This paper introduces HistText, a pioneering tool devised to facilitate large-scale data mining in historical documents, specifically targeting Chinese sources. Developed in response to the challenges posed by the massive Modern China Textual Database, HistText emerges as a solution to efficiently extract and visualize valuable insights from billions of words spread across millions of documents. With a user-friendly interface, advanced text analysis techniques, and powerful data visualization capabilities, HistText offers a robust platform for digital humanities research. Available at histtext.enpchina.eu.

A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers

B. Blouin, C. Armand, C. Henriot

LREC-COLING 2024, Torino, Italy

In this study, we present a novel historical Chinese dataset for named entity recognition, entity linking, coreference and entity relations. We use data from Chinese newspapers from 1872 to 1949 and multilingual bibliographic resources from the same period. The period and the language are the main strengths of the present work: the resource covers different styles and language uses, and it is the largest historical Chinese NER dataset with manual annotations from this transitional period. After detailing the selection and annotation process, we present the very first results that can be obtained from this dataset. Texts and annotations are freely downloadable from the GitHub repository.

Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts

B. Blouin, H.-H. Huang, C. Henriot, C. Armand

NLP4DH 2023, Tokyo, Japan

This paper addresses NLP tokenization for transitional Chinese (early 20th century), using articles from the Shenbao newspaper as a study base. After evaluating existing word segmentation tools, a custom model was developed specifically for historical data. The final model achieves over 83% accuracy with an F-score 35% higher than existing tools. The results show that transitional Chinese is more closely related to ancient Chinese than contemporary Mandarin, necessitating language models specifically trained on historical data. The newly created annotated dataset paves the way for further performance improvements.
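
As an illustration of how a word segmentation F-score is typically computed, here is a minimal sketch. It assumes the standard span-based definition (a predicted word counts as correct only if both of its boundaries match the gold segmentation); the paper's exact evaluation protocol is not stated here, and the function names and the example sentence are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold, predicted):
    """Span-based F1 between a gold and a predicted segmentation of the same text."""
    g, p = word_spans(gold), word_spans(predicted)
    correct = len(g & p)  # words whose two boundaries both match
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction wrongly splits the last word in two.
gold = ["上海", "是", "大城市"]
predicted = ["上海", "是", "大", "城市"]
score = segmentation_f1(gold, predicted)
```

In this toy case two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3.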

Simulation d'erreurs d'OCR dans les systèmes de TAL pour le traitement de données anachroniques [Simulating OCR errors in NLP systems for processing anachronistic data]

B. Blouin, B. Favre, J. Auguste

TALN 2022 (JEP-TALN-RECITAL), Avignon

Information extraction opens new perspectives for historical research. However, most research in this area is carried out on contemporary data. Despite the steady improvement of OCR systems, historical texts produced by OCR still contain many errors. The authors quantify the impact of OCR errors on three information extraction tasks with Transformer architectures, and propose an approach that reduces this impact by more than 50% without requiring specialized historical resources.
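
To give an idea of what simulating OCR errors on clean text can look like, here is a minimal sketch. It is not the method from the paper: the confusion table, the error rate and the three edit operations are all assumptions chosen for illustration.

```python
import random

# Hypothetical confusion table: visually similar characters that an OCR
# engine might mix up. A real table would be estimated from OCR output.
CONFUSIONS = {"l": "1", "O": "0", "e": "c", "m": "rn", "s": "5"}


def simulate_ocr_noise(text, rate=0.1, seed=0):
    """Return a copy of `text` with random character-level OCR-like errors."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            # Falls back to the original character if no confusion is listed.
            out.append(CONFUSIONS.get(ch, ch))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        # "delete": append nothing
    return "".join(out)


clean = "The mayor of Shanghai met the delegation on Monday."
noisy = simulate_ocr_noise(clean, rate=0.15, seed=42)
```

Noise of this kind can be applied to contemporary training data so that a model sees OCR-like input during training, which is the general idea the abstract describes.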

Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step?

B. Blouin, B. Favre, J. Auguste, C. Henriot

NLP4DH 2021, Silchar, India

Named entity recognition is of high interest to digital humanities, in particular when mining historical documents. Although the task is mature in the field of NLP, results of contemporary models are not satisfactory on challenging documents corresponding to out-of-domain genres, noisy OCR output, or old variants of the target language. In this paper, we study how model transfer methods, in the context of the aforementioned challenges, can improve historical named entity recognition according to how much effort is allocated to describing the target data, manually annotating small amounts of text, or matching pre-training resources. We perform extensive experiments with the transformer architecture on the LitBank and HIPE historical datasets. They show that annotating 250 sentences can recover 93% of the full-data performance when models are pre-trained, that the choice of self-supervised and target-task pre-training data is crucial in the zero-shot setting, and that OCR errors can be handled by simulating noise on pre-training data and resorting to recent character-aware transformers.

Creating Biographical Networks from Chinese and English Wikipedia

B. Blouin, N. van den Bosch, P. Magistry

Journal of Historical Network Research, Vol. 5, No. 1

With the rise of digital humanities, historians are exploring how to intellectually engage with textual sources given the computational tools available today. The ENP-China project employs Natural Language Processing methods to tap into sources on an unprecedented scale, with the goal of studying the transformation of elites in Modern China (1830–1949). A large corpus of 228,144 Chinese and 110,713 English Wikipedia biographies is enriched with metadata recording every mentioned person, organization, geopolitical entity and location, linked across languages. This data structure allows researchers to analyze relationships via shared biographical contents and compare networks in different language settings. An online interface built on a bipartite graph structure allows querying and exploring the dataset.
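
The bipartite structure mentioned above (biographies on one side, mentioned entities on the other) can be sketched as follows. The data, identifiers and function names are invented for illustration and do not reflect the project's actual schema or interface.

```python
from collections import defaultdict
from itertools import combinations

# Made-up toy data: each biography maps to the entities it mentions.
mentions = {
    "bio_A": {"person_X", "org_Y"},
    "bio_B": {"org_Y", "place_Z"},
    "bio_C": {"place_Z"},
}


def entity_index(mentions):
    """Invert the mapping: entity -> set of biographies mentioning it."""
    index = defaultdict(set)
    for bio, entities in mentions.items():
        for entity in entities:
            index[entity].add(bio)
    return index


def shared_entity_links(mentions):
    """Project the bipartite graph onto biographies.

    Two biographies are linked when they mention at least one common
    entity; the link stores the set of shared entities.
    """
    links = defaultdict(set)
    for entity, bios in entity_index(mentions).items():
        for a, b in combinations(sorted(bios), 2):
            links[(a, b)].add(entity)
    return dict(links)


links = shared_entity_links(mentions)
```

Here `bio_A` and `bio_B` end up linked through `org_Y`, and `bio_B` and `bio_C` through `place_Z`, which is the kind of relationship via shared biographical content that the abstract refers to.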

Contextual Characters with Segmentation Representation for Named Entity Recognition in Chinese

B. Blouin, P. Magistry

PACLIC 34, Hanoi, Vietnam

Named Entity Recognition (NER) is a typical sequence labeling task. It remains challenging for Chinese, partly because of the lack of clear typographic word boundaries. Recent approaches have shown that character-based models lack information about larger units (words) useful for NER, while word-based models may suffer from word segmentation errors and a higher rate of Out-of-Vocabulary (OOV) tokens. In this paper, we propose a new representation of sinograms (Chinese characters) enriched with word boundary information, for which different types of embeddings can be built. Experiments show that our solution outperforms other state-of-the-art models. The fully retrainable pipeline does not rely on pretrained models and can be trained in a few days on common hardware.
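
One common way to attach word boundary information to characters is to pair each character with a BMES position tag (Begin, Middle, End, Single). The sketch below assumes that scheme purely for illustration; whether the paper uses exactly this encoding is not stated here, and the function names are hypothetical.

```python
def bmes_tags(words):
    """Return one BMES tag per character for an already segmented sentence."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # First character, any middle characters, last character.
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def enrich_characters(words):
    """Pair each character with its position tag, e.g. '上_B'."""
    chars = [c for w in words for c in w]
    return [f"{c}_{t}" for c, t in zip(chars, bmes_tags(words))]


tokens = enrich_characters(["上海", "是", "大城市"])
```

Each enriched token can then be treated as its own vocabulary item when building embeddings, so the same character receives a different vector depending on where it sits within a word.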