Dear NLP group,
Invited talk by Professor Elena Volodina, Språkbanken Text,
Department of Swedish, Multilingualism, Language Technology, University of Gothenburg.
Time: Tuesday, June 3, 14:00-15:00
Place: Lilla Hörsalen
Title: Open access to research data and automatic pseudonymization. Two years with Mormor Karl project.
Abstract: This talk is devoted to the challenges of working with data that contains personal information. I will describe a set of experiments with automatic pseudonymization that we have performed within the Mormor Karl project<https://mormor-karl.github.io/>. These include experiments with the detection and labeling of categories of personal information using BERT models (Szawerna et al. 2024, 2025), attempts at using LLMs to "fill in the blanks" when substituting personal information with pseudonyms (as yet unpublished), and a study on whether pseudonyms can provoke biased automated classifications (Muñoz Sánchez et al. 2024).
The choice of models for our experiments is currently dictated by the sensitive nature of our data. To extend the choice from open-source to proprietary models, we are collecting a "pseudo-corpus" with fictitious personal information that we will be able to share freely for future research (you are welcome to contribute to the pseudo-corpus collection<https://forms.gle/t4ynDJwqfmFXYitPA> as well).
Finally, I will name several strategies for unifying research on automatic pseudonymization and outline further challenges, needs for standardization, and a proposal for a shared task.
* Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, and Elena Volodina. 2025. The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling<https://hdl.handle.net/10062/107263>. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025).
* Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann, and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish Learner Essays<https://aclanthology.org/2024.caldpseudo-1.7/>. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Computational Linguistics.
* Ricardo Muñoz Sánchez, Simon Dobnik, Maria Irena Szawerna, Therese Lindström Tiedemann, and Elena Volodina. 2024. Did the Names I Used within My Essay Affect My Score? Diagnosing Name Biases in Automated Essay Scoring<https://aclanthology.org/2024.caldpseudo-1.10/>. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Computational Linguistics.
A warm welcome,
Hercules
_________________________________________________________________________
Dr. Hercules Dalianis, Professor
Department of Computer and Systems Sciences
DSV/Stockholm University
P.O. Box 7003, 164 07 Kista
Stockholm, Sweden
ph: +46 8 16 16 16
mobile ph: +46 70 568 13 59
email: hercules(a)dsv.su.se
www: http://www.dsv.su.se/hercules/
_________________________________________________________________________