Open Multilingual Wordnet (OMW)

Discussion about the Panlexia project
pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Open Multilingual Wordnet (OMW)

Post by pandunia-guru »

I found another goldmine! Open Multilingual Wordnet is a resource that bundles together wordnets in different languages. It is essentially a collection of files with tab-separated values. Here's a sample from the Indonesian file.

Code:

00019613-n	ind:def	0	masalah fisik yang nyata
00019613-n	ind:lemma	inti
00019613-n	ind:lemma	unsur

Each row consists of an 8-digit concept identifier from Princeton WordNet 3.0, a word class marker ("n" = noun), a 3-letter language id ("ind" = Indonesian), a field type that is either definition ("def") or word ("lemma"), and the text itself. The example above has the definition for chemical element, i.e. "physical base matter", and two words for it, "inti" and "unsur".
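To make the row format concrete, here is a minimal Python sketch that groups lemmas and definitions by concept id. It assumes that every field is separated by a tab, that definition rows carry one extra numeric field before the text, and that lines starting with "#" are headers to be skipped. The helper name parse_omw_rows is made up for this illustration and is not part of OMW or nltk.

```python
from collections import defaultdict

# Sample rows copied from the Indonesian file quoted above.
# Assumption: all fields are separated by tabs.
SAMPLE = (
    "00019613-n\tind:def\t0\tmasalah fisik yang nyata\n"
    "00019613-n\tind:lemma\tinti\n"
    "00019613-n\tind:lemma\tunsur\n"
)


def parse_omw_rows(text):
    """Group lemmas and definitions by concept id (hypothetical helper)."""
    concepts = defaultdict(lambda: {"lemmas": [], "defs": []})
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row
        # The first field is the concept id, the second is "lang:kind",
        # and the last field is the word or the definition text.
        concept_id, label, value = fields[0], fields[1], fields[-1]
        lang, _, kind = label.partition(":")
        if kind == "lemma":
            concepts[concept_id]["lemmas"].append((lang, value))
        elif kind == "def":
            concepts[concept_id]["defs"].append((lang, value))
    return dict(concepts)


entries = parse_omw_rows(SAMPLE)
print(entries["00019613-n"])
```

Taking the last field as the value means the same loop handles both the three-field lemma rows and the four-field definition rows.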

OMW is included in the Natural Language Toolkit for Python. See https://www.nltk.org/howto/wordnet.html. Here are some example Python commands and their output.

Code:

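>>> from nltk.corpus import wordnet  # needs the 'wordnet' and 'omw-1.4' data, e.g. via nltk.download()
>>> wordnet.synset('dog.n.01').lemma_names('ita')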
['Canis_familiaris', 'cane']

>>> sorted(wordnet.synset('dog.n.01').lemmas('por'))
[Lemma('dog.n.01.cachorra'), Lemma('dog.n.01.cachorro'), Lemma('dog.n.01.cadela'), Lemma('dog.n.01.c\xe3o')]

So this interface uses the synset name, dog.n.01 (the English lemma augmented with the word class and a sense number), as the identifier. Unfortunately the results are not immediately usable. The request for Italian gives two results, Canis_familiaris and cane, and only the second one is relevant. The request for Portuguese gives four results: cachorro is the usual dictionary form, cachorra and cadela mean 'female dog', and c\xe3o is an escaped way to write cão, the formal word for 'dog'.
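Part of the clean-up is mechanical. The sketch below works on the lemma names copied from the outputs above, so it runs without nltk. The helper name tidy is made up for this illustration. It only replaces underscores with spaces and shows that the escaped c\xe3o is the same Python string as cão. Choosing which word is the relevant one still needs human judgment.

```python
# Lemma names copied from the outputs quoted above.
italian = ['Canis_familiaris', 'cane']
portuguese = ['cachorra', 'cachorro', 'cadela', 'c\xe3o']


def tidy(names):
    """Replace WordNet's underscores with spaces (hypothetical helper)."""
    return [name.replace('_', ' ') for name in names]


print(tidy(italian))
print(tidy(portuguese))

# '\xe3' is only an escape sequence for the character 'ã',
# so both spellings denote the same string.
print('c\xe3o' == 'cão')
```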

So it looks like we can't use OMW data in the simplest possible way, but it doesn't look too hard either. It is wonderful that this resource is freely available, because it gives access to words and definitions in more languages. Also, WordNet/OMW has many more definitions than Concepticon.

Re: Open Multilingual Wordnet (OMW)

Post by pandunia-guru »

I added WordNet offset and PoS columns to the master.tsv in this change. A Concepticon source file (Borin-2015-1532.tsv) offered 1373 ready-made ids, but the rest we have to fill in by hand. Together, the offset and the PoS form the id that gives us access to words in many languages via OMW.

In Python, we can use the Natural Language Toolkit (nltk). It lets us get to the words with the combined WordNet id, i.e. the PoS and the offset, like this:
wordnet.synset_from_pos_and_offset(pos, offset)

For example, for the word wagon, the offset is 04543158 and the PoS is n (noun).

Code:

from nltk.corpus import wordnet as wn
wn.synset_from_pos_and_offset('n', 4543158)  # the offset is passed as an integer, without the leading zero
# Synset('wagon.n.01')

One can also use the WordNet offset and PoS online as the "synset" id in the web interface of OMW, for example like this:
https://compling.upol.cz/ntumc/cgi-bin/ ... 04543158-n
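The OMW files and the web interface use the combined form (04543158-n), while nltk wants the PoS and an integer offset separately. Here is a minimal sketch of converting between the two. The helper names split_wordnet_id and join_wordnet_id are made up for this illustration, and the nltk call is left as a comment so that the snippet runs without the WordNet data installed.

```python
def split_wordnet_id(wordnet_id):
    """Split an id such as '04543158-n' into (pos, offset) (hypothetical helper)."""
    offset_text, pos = wordnet_id.split("-")
    return pos, int(offset_text)


def join_wordnet_id(pos, offset):
    """Build the combined id, padding the offset to 8 digits (hypothetical helper)."""
    return f"{offset:08d}-{pos}"


pos, offset = split_wordnet_id("04543158-n")
print(pos, offset)
print(join_wordnet_id(pos, offset))

# With nltk and its WordNet data installed, the pair can be passed on:
# from nltk.corpus import wordnet as wn
# wn.synset_from_pos_and_offset(pos, offset)
```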