I found another goldmine!
Open Multilingual Wordnet is a resource that bundles together wordnets in different languages. It is essentially a collection of files with tab-separated values. Here's a sample from the Indonesian file.
Code:
00019613-n ind:def 0 masalah fisik yang nyata
00019613-n ind:lemma inti
00019613-n ind:lemma unsur
Each row consists of an 8-digit concept identifier from Princeton WordNet 3.0, a word class marker ("n" = noun), a 3-letter language id ("ind" = Indonesian), and a type marker, "def" for a definition or "lemma" for a word, followed by the text itself (definition rows also carry a sequence number, 0 here). The example above has the definition for chemical element, i.e. "physical base matter", and two words for it, "inti" and "unsur".
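To make the row layout concrete, here is a rough Python sketch that groups the sample rows by concept identifier. The function name parse_omw and the SAMPLE string are my own, the sample is retyped with explicit tabs, and I am assuming that lines starting with "#" are headers to skip; treat it as an illustration rather than a complete reader for the real files.

```python
from collections import defaultdict

# The three sample rows from above, retyped with explicit tab separators.
SAMPLE = (
    "00019613-n\tind:def\t0\tmasalah fisik yang nyata\n"
    "00019613-n\tind:lemma\tinti\n"
    "00019613-n\tind:lemma\tunsur\n"
)


def parse_omw(text):
    """Group lemmas and definitions by concept identifier (hypothetical helper)."""
    entries = defaultdict(lambda: {"lemmas": [], "defs": []})
    for line in text.splitlines():
        # Assumption: blank lines and lines starting with "#" carry no data.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row in the layout described above
        synset_id, relation = fields[0], fields[1]
        _lang, _, kind = relation.partition(":")
        if kind == "lemma":
            entries[synset_id]["lemmas"].append(fields[-1])
        elif kind == "def":
            # Definition rows have a sequence number before the text,
            # so the text is taken from the last field.
            entries[synset_id]["defs"].append(fields[-1])
    return dict(entries)


entries = parse_omw(SAMPLE)
print(entries["00019613-n"])
```

Keying on the concept identifier means the same dictionary could be filled from several language files and compared entry by entry.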
OMW is included in the Natural Language Toolkit (NLTK) for Python. See
https://www.nltk.org/howto/wordnet.html. Here are some example Python commands and their output.
Code:
>>> from nltk.corpus import wordnet
>>> wordnet.synset('dog.n.01').lemma_names('ita')
['Canis_familiaris', 'cane']
>>> sorted(wordnet.synset('dog.n.01').lemmas('por'))
[Lemma('dog.n.01.cachorra'), Lemma('dog.n.01.cachorro'), Lemma('dog.n.01.cadela'), Lemma('dog.n.01.c\xe3o')]
So this interface uses the augmented English lemma, dog.n.01, as the identifier. Unfortunately the results are not immediately usable. The request for Italian gives two results, Canis_familiaris and cane, and only the second one is relevant. The request for Portuguese gives four results: cachorro is the usual dictionary form, cachorra and cadela mean 'female dog', and c\xe3o is an escaped way of writing cão, the formal word for 'dog'.
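One possible way to weed out results like Canis_familiaris is to drop any name that also appears among the English lemma names of the same concept. The sketch below is only a heuristic of my own (clean_lemmas is a made-up helper, and the two lists are typed in by hand rather than fetched from NLTK), so take it as an idea to test, not a finished solution.

```python
def clean_lemmas(foreign, english):
    """Drop names shared with the English entry and turn underscores into spaces."""
    english_set = {name.lower() for name in english}
    return [
        name.replace("_", " ")
        for name in foreign
        if name.lower() not in english_set
    ]


# Hand-typed lists standing in for lemma_names('ita') and lemma_names().
italian = ["Canis_familiaris", "cane"]
english = ["dog", "domestic_dog", "Canis_familiaris"]
print(clean_lemmas(italian, english))

# In Python 3 the escape \xe3 is just another notation for the character ã.
print("c\xe3o" == "cão")
```

The obvious weakness is that a genuine loanword spelled exactly as in English would be thrown away too, so the filter would need a closer look before being used on a whole language.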
So it looks like we can't use OMW data in the simplest possible way, but it doesn't look too hard either. It is wonderful that this resource is freely available, because it gives access to words and definitions in more languages. Also, WordNet/OMW has many more definitions than Concepticon.