NLP May 2025

nlp@dsv.su.se


Invited speech: June 3, at 14-15.00 Open access to research data and automatic pseudonymization. Two years with Mormor Karl project.

by Hercules Dalianis

Dear NLP group,

Invited speech by Professor Elena Volodina, Språkbanken Text; Dept. of Swedish, Multilingualism, Language Technology, University of Gothenburg.

Time: Tuesday, June 3, at 14-15.00
Place: Lilla Hörsalen
Title: Open access to research data and automatic pseudonymization. Two years with Mormor Karl project.

Abstract: This talk will be devoted to the challenges of working with data that contains personal information. I will describe a set of experiments with automatic pseudonymization that we have performed within the Mormor Karl project<https://mormor-karl.github.io/>. These include experiments with detection and labeling of personal categories using BERT models (Szawerna et al. 2024, 2025), attempts at using LLMs to "fill in the blanks" when substituting personal information with pseudonyms (as yet unpublished), and a study on whether pseudonyms can provoke biased automated classifications (Muñoz Sánchez et al. 2024).

The choice of models for our experiments is currently dictated by the sensitive nature of our data. To extend the choice from open source to proprietary models, we are currently collecting a "pseudo-corpus" with fictitious personal information that we will be able to share freely for future research (you are welcome to contribute to the pseudo-corpus collection<https://forms.gle/t4ynDJwqfmFXYitPA> as well).

Finally, in this talk I will name several strategies to unify the research on automatic pseudonymization, and outline further challenges, needs for standardization and a proposal for a shared task.

* Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, and Elena Volodina. 2025. The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling<https://hdl.handle.net/10062/107263>. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025).
* Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann, and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish Learner Essays<https://aclanthology.org/2024.caldpseudo-1.7/>. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology.
* Ricardo Muñoz Sánchez, Simon Dobnik, Maria Irena Szawerna, Therese Lindström Tiedemann, and Elena Volodina. 2024. Did the Names I Used within My Essay Affect My Score? Diagnosing Name Biases in Automated Essay Scoring<https://aclanthology.org/2024.caldpseudo-1.10/>. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology.

Warm welcome
Hercules

_________________________________________________________________________
Dr. Hercules Dalianis, Professor
Department of Computer and Systems Sciences
DSV/Stockholm University
P.O. Box 7003, 164 07 Kista
Stockholm, Sweden
ph: +46 8 16 16 16
mobile ph: +46 70 568 13 59
email: hercules(a)dsv.su.se
www: http://www.dsv.su.se/hercules/
_________________________________________________________________________

