Creating concept ids by semantic categorization turned out to be too much work and, in some sense, a futile goal. Semantic and thematic categorization of concepts is interesting in itself, but it's probably better to tackle it as a separate project. We need the concept ids only to link definitions and translations together. That work has already been done in WOLD, Concepticon and, to the greatest extent, in Wordnet.
I propose that we base the new Panlexia concept ids on Princeton Wordnet ids, which consist of a numeric offset and a part-of-speech identifier. New concept definitions can also be taken directly from Wordnet, for example through the Open Multilingual Wordnet web interface. If you search for the word "house" in the web interface, you get many results, of which the correct one is:
03544360-n house 'a dwelling that serves as living quarters for one or more families'
If you click the id, a new page will open with more information about the concept. The URL address of that page includes the same id: wn-gridx.cgi?gridmode=ntumcgrid&synset=03544360-n.
We can simply copy this "synset id" to the PWN_id column in master.tsv and fetch the corresponding human-readable "synset name" with a Python program. The synset name is then used as the Panlexia concept id in the following format: PWN:house.n.01.
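The lookup step could look roughly like this. It is only a minimal sketch, assuming that NLTK and its WordNet 3.0 data are installed; the helper names (parse_synset_id, synset_name, concept_id) are made up for illustration and are not taken from the actual Panlexia script.

```python
import re

# A synset id as shown in the web interface: 8-digit offset, dash, PoS letter.
SYNSET_ID = re.compile(r"^(\d{8})-([nvars])$")


def parse_synset_id(synset_id):
    """Split a synset id such as '03544360-n' into (offset, pos)."""
    match = SYNSET_ID.match(synset_id.strip())
    if match is None:
        raise ValueError(f"not a synset id: {synset_id!r}")
    return int(match.group(1)), match.group(2)


def synset_name(synset_id):
    """Look up the human-readable synset name (needs NLTK and the WordNet data)."""
    # Imported lazily so that the pure helpers work without NLTK installed.
    from nltk.corpus import wordnet as wn
    offset, pos = parse_synset_id(synset_id)
    return wn.synset_from_pos_and_offset(pos, offset).name()


def concept_id(name):
    """Prefix a synset name to get a Panlexia concept id."""
    return "PWN:" + name


# Usage, with NLTK available:
#   concept_id(synset_name("03544360-n"))
```

The offset is kept only in the PWN_id column; the concept id itself is built from the synset name.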
We use the PWN prefix because Wordnet doesn't include all the concepts that are in Concepticon, WOLD, ULD and NorthEuraLex. We can have additional ids that are specific to Panlexia in a similar format: PLX:some term-a. Here "PLX" stands for Panlexia, and what comes after the dash (-) is the PoS identifier, which may also include additional word classes, like prepositions and pronouns, which are understandably excluded from the Princeton Wordnet because they are semantically almost empty.
New concept ids
Re: New concept ids
I don't get it.
PWN:house.n.01
Which "house" concept will get the number 1, and why?
And about PLX:anyword.pos, what do you do when you have duplicate words?
I agree the work on semantic fields was futile, but it had the advantage of avoiding duplicates. I'm curious to know what solutions you find for that.
But for now, have a nice 31 December evening, and see you next year.
Re: New concept ids
From Panlexia's point of view the number is opaque, i.e. we don't know and we don't care! But if you really want to know, the number comes from Wordnet. Let's use the Python terminal to find out. First, let's list all Wordnet synsets for the word "house":
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('house')
[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12'), Synset('house.v.01'), Synset('house.v.02')]
This word has many meanings in English. We can get the first definition (in programming terms, the one at position 0) with the following query:
>>> wn.synsets('house')[0].definition()
'a dwelling that serves as living quarters for one or more families'
It matches our expectation, so the first synset (whose identifier is house.n.01) is the one that we wanted.
I just created a script that extracts Indonesian, Italian and Japanese definitions for Panlexia concepts from Open Multilingual Wordnet (OMW). OMW uses the Wordnet ids, like house.n.01, so we can use them to get the definitions. For example, there is now an Italian definition for PWN:house.n.01 in concepts/I/ita-definition.tsv.
Actually, I believe that 99% of concepts will come from Wordnet, and their ids use running numbers starting from 1. Those ids have the prefix "PWN:". Imperfections of the Princeton Wordnet index are fixed in the Collaborative Interlingual Index (CILI), and maybe we can adopt it later. Then those ids would use the prefix "CILI:". However, I haven't studied OMW version 2 yet. It's not available in the Natural Language Toolkit (NLTK) for Python, so I will stick to version 1.4 for now. So far it has been enough to use the "PLX:" prefix for certain word classes that Princeton Wordnet omits, like pronouns and prepositions, and for a few special terms that are in WOLD or Concepticon.
The old ids were more precise, but I can live with the new ids. Connecting Panlexia to Wordnet is a big boost. It gives us direct access to Chinese, Arabic, Persian, Japanese, Italian, Indonesian and other languages; see the complete list here. Those translations are probably better than what we have collected so far for the old Pandunia dictionary.
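To illustrate how the prefixes keep the id sources apart, here is a small sketch that splits a concept id into its prefix and local part. The function names and the set of accepted prefixes are assumptions for illustration only; "CILI" is included merely as the possible future prefix mentioned above.

```python
# Prefixes discussed in this thread; CILI is only a possible later addition.
KNOWN_PREFIXES = {"PWN", "PLX", "CILI"}


def split_concept_id(concept_id):
    """Split 'PWN:house.n.01' into ('PWN', 'house.n.01')."""
    prefix, sep, local = concept_id.partition(":")
    if not sep or not local:
        raise ValueError(f"missing prefix or local part: {concept_id!r}")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix, local


def is_wordnet_concept(concept_id):
    """True if the id comes from Princeton Wordnet rather than Panlexia itself."""
    return split_concept_id(concept_id)[0] == "PWN"
```

The local part is treated as opaque, so the sketch makes no assumption about how the Panlexia-specific ids are spelled after the colon.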
Happy new year!
Re: New concept ids
What do we do with the two PLX:that.pro entries in eng-definition.tsv?
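One way to spot such duplicates automatically is to count how often each id occurs in a definition file. The sketch below assumes that the concept id is in the first column of a tab-separated file, which may not match the real layout of eng-definition.tsv; the sample rows are invented for the demonstration.

```python
import csv
import io
from collections import Counter


def duplicate_ids(tsv_file, id_column=0):
    """Return {concept id: count} for every id that occurs more than once."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row[id_column] for row in reader if row)
    return {cid: n for cid, n in counts.items() if n > 1}


# Made-up sample rows standing in for a real definition file.
sample = io.StringIO(
    "PLX:that.pro\tdefinition A\n"
    "PWN:house.n.01\ta dwelling\n"
    "PLX:that.pro\tdefinition B\n"
)
print(duplicate_ids(sample))  # {'PLX:that.pro': 2}
```

For a real file, the same function can be called on open("eng-definition.tsv", newline="", encoding="utf-8").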