Siganalogies: millions of morphological analogies in more than 80 languages
The Siganalogies dataset was developed as part of a research project on the automatic processing of analogies in morphology, in particular with deep learning approaches. Morphology is well suited to the study of analogies in linguistics (due, among other things, to the presence of both regularities and irregularities), and datasets such as Sigmorphon offer a collection of morphological transformations in many languages. This diversity makes it possible to analyze the performance of automatic analogy processing approaches on a wide variety of data of a similar nature.
Siganalogies (for "Analogies in Sigmorphon") contains analogies between four words A, B, C and D, based on their morphology (prefix, suffix, ...), in 82 languages. An analogy is written "A : B :: C : D", meaning "the relation from A to B is the same as the one from C to D". In morphology, these relations are often conjugation, agreement or declension relations. For example, in English, "stash : stashes :: search : searches" is an analogy where the relation is "the conjugation to the 3rd person singular of the present tense". The goal of Siganalogies is to provide a large number of analogies in many languages in a standardized way. Based on Siganalogies, several deep learning approaches have been developed for manipulating analogies in morphology [1, 2, 3].
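As an illustration of the "A : B :: C : D" notation, here is a minimal Python sketch that parses such a string into its four words and writes it back out. The names `Analogy` and `parse_analogy` are hypothetical and chosen for this example only; they are not taken from the Siganalogies code.

```python
from typing import NamedTuple


class Analogy(NamedTuple):
    """Four words such that the relation a -> b matches the relation c -> d."""
    a: str
    b: str
    c: str
    d: str

    def __str__(self) -> str:
        # Render in the "A : B :: C : D" notation.
        return f"{self.a} : {self.b} :: {self.c} : {self.d}"


def parse_analogy(text: str) -> Analogy:
    """Parse a string written as "A : B :: C : D" (hypothetical helper)."""
    # "::" separates the two pairs, ":" separates the words inside a pair.
    left, right = text.split("::")
    a, b = (word.strip() for word in left.split(":"))
    c, d = (word.strip() for word in right.split(":"))
    return Analogy(a, b, c, d)


analogy = parse_analogy("stash : stashes :: search : searches")
print(analogy.a, analogy.d)
print(analogy)
```

This sketch assumes the words themselves contain no colon; data where that can happen would need a different separator or a structured format.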
Siganalogies is built from three datasets: Sigmorphon2016 (10 languages), Sigmorphon2019 (44 languages with many and 44 with few analogies) and the Japanese Bigger Analogy Test Set (Japanese only). Counted per dataset, this gives 99 languages; because some of them appear in several datasets, there are 82 distinct languages in total. In Sigmorphon2016 and Sigmorphon2019, pairs of words linked by a morphological transformation are available in each language, e.g. "stash stashes V;3;SG;PRS" (where "V;3;SG;PRS" represents the 3rd person singular conjugation of the present tense). The data from the Japanese Bigger Analogy Test Set has been converted to the same format to facilitate further processing. If two pairs correspond to the same morphological transformation, they can be combined into an analogy: "stash stashes V;3;SG;PRS" and "search searches V;3;SG;PRS" allow us to create "stash : stashes :: search : searches".
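The pairing rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the extraction code shipped with the dataset: the function name `build_analogies` is hypothetical, each entry is assumed to be a whitespace-separated "word form features" triple as in the example above (the actual source files may use a different separator or field order), and only one ordering of each pair of entries is produced.

```python
from collections import defaultdict
from itertools import combinations


def build_analogies(lines):
    """Combine entries that share the same feature tag into analogies.

    Hypothetical helper: assumes each line reads "word form features",
    separated by whitespace, with no spaces inside a word.
    """
    pairs_by_feature = defaultdict(list)
    for line in lines:
        word, form, feature = line.split()
        pairs_by_feature[feature].append((word, form))

    analogies = []
    for pairs in pairs_by_feature.values():
        # Any two pairs with the same transformation yield A : B :: C : D.
        for (a, b), (c, d) in combinations(pairs, 2):
            analogies.append((a, b, c, d))
    return analogies


entries = [
    "stash stashes V;3;SG;PRS",
    "search searches V;3;SG;PRS",
    "walk walked V;PST",
]
for a, b, c, d in build_analogies(entries):
    print(f"{a} : {b} :: {c} : {d}")
```

With these three entries, only the two pairs tagged "V;3;SG;PRS" can be combined; "walk walked V;PST" has no partner with the same tag and yields no analogy. Because the number of pairs grows quadratically with the number of entries per tag, a large language can produce a very large number of analogies.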
The Siganalogies dataset contains:
- the original data from Sigmorphon2016, Sigmorphon2019 and Japanese Bigger Analogy Test Set;
- the Python code to extract the analogies from the source datasets and manipulate them;
- pre-computed analogies in most of the dataset languages;
- some utility functions related to the use of analogies.
References:
[1] Safa Alsaidi, Amandine Decker, Puthineath Lay, Esteban Marquer, Pierre-Alexandre Murena, Miguel Couceiro. A Neural Approach for Detecting Morphological Analogies. DSAA 2021: 1-10. https://hal.inria.fr/hal-03313556
[2] Esteban Marquer, Safa Alsaidi, Amandine Decker, Pierre-Alexandre Murena, Miguel Couceiro. A Deep Learning Approach to Solving Morphological Analogies. To appear in ICCBR 2022. https://hal.inria.fr/hal-03660625
[3] Kevin Chan, Shane Peter Kaszefski-Yaschuk, Camille Saran, Esteban Marquer, Miguel Couceiro. Solving Morphological Analogies Through Generation. To appear in IARML@IJCAI 2022. https://hal.inria.fr/hal-03674913
Links:
Dataset: https://dorel.univ-lorraine.fr/dataset.xhtml?persistentId=doi:10.12763/MLCFIE
Data description: https://dorel.univ-lorraine.fr/file.xhtml?persistentId=doi:10.12763/MLCFIE/CJLSWX
Most up-to-date code: https://github.com/EMarquer/siganalogies