Open Multilingual Wordnet (OMW)

Discussion about the Panlexia project

Post by pandunia-guru »

I found another goldmine! Open Multilingual Wordnet is a resource that bundles together wordnets in different languages. It is essentially a collection of files with tab-separated values. Here's a sample from the Indonesian file.

Code:

00019613-n	ind:def	0	masalah fisik yang nyata
00019613-n	ind:lemma	inti
00019613-n	ind:lemma	unsur

Each row consists of an 8-digit concept identifier from Princeton WordNet 3.0, a word class marker ("n" = noun), a 3-letter language ID ("ind" = Indonesian), and either a definition ("def") or a word ("lemma"). Definition rows also carry a numeric index (0 here) before the text. The example above has the definition for chemical element, i.e. "physical base matter", and two words for it, "inti" and "unsur".
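
To make the row layout concrete, here is a minimal sketch of how such rows could be read in Python. It is only an illustration under a few assumptions: the function name parse_omw is made up for this post, the sample rows are the three quoted above, and the code assumes that header or comment lines in the real files start with "#" and that the text is always the last tab-separated field.

```python
from collections import defaultdict

# Sample rows copied from the Indonesian file quoted above (tab-separated).
SAMPLE = (
    "00019613-n\tind:def\t0\tmasalah fisik yang nyata\n"
    "00019613-n\tind:lemma\tinti\n"
    "00019613-n\tind:lemma\tunsur\n"
)


def parse_omw(lines):
    """Collect lemmas and definitions per (concept id, language) pair.

    Hypothetical helper: assumes the text is the last field of each row
    and that lines starting with '#' are headers or comments.
    """
    lemmas = defaultdict(list)
    definitions = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row in the expected layout
        concept, kind, text = fields[0], fields[1], fields[-1]
        # "ind:lemma" -> language "ind", field type "lemma"
        lang, _, field = kind.partition(":")
        if field == "lemma":
            lemmas[(concept, lang)].append(text)
        elif field == "def":
            definitions[(concept, lang)].append(text)
    return dict(lemmas), dict(definitions)


lemmas, definitions = parse_omw(SAMPLE.splitlines())
print(lemmas[("00019613-n", "ind")])       # ['inti', 'unsur']
print(definitions[("00019613-n", "ind")])  # ['masalah fisik yang nyata']
```

Keying on the concept ID plus the language code keeps data from several language files apart if they are loaded into the same dictionaries.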

OMW is included in the Natural Language Toolkit (NLTK) for Python; see https://www.nltk.org/howto/wordnet.html. Here are some example Python commands and their output.

Code:

>>> from nltk.corpus import wordnet
>>> wordnet.synset('dog.n.01').lemma_names('ita')
['Canis_familiaris', 'cane']

>>> sorted(wordnet.synset('dog.n.01').lemmas('por'))
[Lemma('dog.n.01.cachorra'), Lemma('dog.n.01.cachorro'), Lemma('dog.n.01.cadela'), Lemma('dog.n.01.c\xe3o')]

So this interface uses the English lemma augmented with part of speech and sense number, dog.n.01, as the identifier. Unfortunately the results are not immediately usable. The request for Italian gives two results, Canis_familiaris and cane, and only the second one is relevant. The request for Portuguese gives four results: cachorro is the usual dictionary form, cachorra and cadela mean 'female dog', and c\xe3o is an escaped way of writing cão, the formal word for 'dog'.
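
One possible cleanup, sketched below, is to drop any name that also appears among the English lemmas of the same synset and to turn underscores into spaces. This is only a heuristic and not something NLTK provides: clean_lemmas is a made-up helper, the lists are typed in by hand instead of fetched from NLTK, and the rule would also throw away genuine loanwords that happen to be spelled the same as in English.

```python
def clean_lemmas(names, english_names):
    """Drop names shared with the English synset; underscores become spaces.

    Hypothetical helper, not part of NLTK. The comparison ignores case.
    """
    english = {name.lower() for name in english_names}
    return [
        name.replace("_", " ")
        for name in names
        if name.lower() not in english
    ]


# English lemma names of dog.n.01 in WordNet 3.0, typed in by hand here.
english = ["dog", "domestic_dog", "Canis_familiaris"]
# Italian result from the session above.
italian = ["Canis_familiaris", "cane"]

print(clean_lemmas(italian, english))  # ['cane']

# '\xe3' is only an escape sequence for the letter ã (U+00E3),
# so the string holds three characters and prints as a normal word.
portuguese = "c\xe3o"
print(portuguese, len(portuguese))
```

Choosing between cachorro, cachorra, cadela and cão would still need a human decision or extra data such as word frequency, because the lemma list itself does not mark which word is the default.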

So it looks like we can't use OMW data in the simplest possible way, but cleaning it up doesn't look too hard either. It is wonderful that this resource is freely available, because it gives access to words and definitions in more languages. Also, WordNet/OMW has many more definitions than Concepticon.