[NLP@DSV] Invited speech: June 3, at 14-15.00 Open access to research data and automatic pseudonymization. Two years with Mormor Karl project.

5 May 2025

Dear NLP group,
Invited speech by Professor Elena Volodina, Språkbanken Text,
Department of Swedish, Multilingualism, Language Technology, University of Gothenburg.
Time: Tuesday, June 3, 14:00-15:00
Place: Lilla Hörsalen
Title: Open access to research data and automatic pseudonymization. Two years with Mormor
Karl project.
Abstract: This talk will be devoted to the challenges of working with data that contains
personal information. I will describe a set of experiments with automatic pseudonymization
that we have performed within the Mormor Karl project <https://mormor-karl.github.io/>.
These include experiments on detecting and labeling categories of personal information using
BERT models (Szawerna et al. 2024, 2025), attempts at using LLMs to "fill in the
blanks" when substituting personal information with pseudonyms (as yet unpublished), and
a study on whether pseudonyms can provoke biased automated classifications (Muñoz Sánchez
et al. 2024).
The choice of models for our experiments is currently dictated by the sensitive nature of
our data. To extend the choice from open-source to proprietary models, we are currently
collecting a "pseudo-corpus" with fictitious personal information that we will
be able to share freely for future research (you are welcome to contribute to the
pseudo-corpus collection <https://forms.gle/t4ynDJwqfmFXYitPA> as well).
Finally, in this talk I will name several strategies to unify the research on automatic
pseudonymization and outline further challenges, needs for standardization, and a proposal
for a shared task.
  *   Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, and Elena Volodina. 2025. The
Devil’s in the Details: the Detailedness of Classes Influences Personal Information
Detection and Labeling <https://hdl.handle.net/10062/107263>. In Proceedings of the
Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference
on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025).
  *   Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström
Tiedemann, and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish
Learner Essays <https://aclanthology.org/2024.caldpseudo-1.7/>. In Proceedings of the
EACL workshop Computational Approaches to Language Data Pseudonymization
(CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology.
  *   Ricardo Muñoz Sánchez, Simon Dobnik, Maria Irena Szawerna, Therese Lindström
Tiedemann, and Elena Volodina. 2024. Did the Names I Used within My Essay Affect My Score?
Diagnosing Name Biases in Automated Essay
Scoring <https://aclanthology.org/2024.caldpseudo-1.10/>. In Proceedings of the
EACL workshop Computational Approaches to Language Data Pseudonymization
(CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology.
A warm welcome,
Hercules
_________________________________________________________________________
Dr. Hercules Dalianis, Professor
Department of Computer and Systems Sciences
ph:        +46 8 16 16 16        DSV/Stockholm University
mobile ph: +46 70 568 13 59   P.O. Box 7003, 164 07 Kista
email:     hercules(a)dsv.su.se   Stockholm, Sweden
www:       http://www.dsv.su.se/hercules/
_________________________________________________________________________