Mapping ids (Concepticon, WOLD, NELex)

pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Mapping ids (Concepticon, WOLD, NELex)

Post by pandunia-guru »

Panlexia uses Concepticon as the initial source for concept definitions. It provides almost 4000 definitions for a variety of concepts. Each concept belongs to a semantic field and has a unique id number and gloss. We build initial Panlexia ids based on Concepticon data. A Panlexia id has the form Terminology:term.word-class, which can be built from Concepticon by mapping the semantic field to the terminology, the gloss to the term and the ontological category to the word class. This is only an initial id that goes through manual quality control and adaptation before it is final.
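
As a rough sketch (not the actual src/map_ids.py), an initial id could be assembled like this. The word-class table is the one quoted from the program further down in this thread; the normalization (lowercasing, underscores for spaces) and the sample semantic field are assumptions made only for illustration.

```python
# Ontological category -> word-class tag, as quoted from src/map_ids.py in this thread.
WORD_CLASS_FROM = {
    "Action/Process": "V",
    "Person/Thing": "N",
    "Property": "A",
    "Number": "NUM",
    "Classifier": "CLF",
    "Other": "?",
}


def initial_panlexia_id(semantic_field: str, gloss: str, category: str) -> str:
    """Build a draft id of the form Terminology:term.word-class."""
    # Assumption: lowercase everything and replace spaces with underscores.
    # The real program may normalize differently.
    terminology = semantic_field.strip().lower().replace(" ", "_")
    term = gloss.strip().lower().replace(" ", "_")
    # Unknown categories fall back to "?", like "Other".
    word_class = WORD_CLASS_FROM.get(category, "?")
    return f"{terminology}:{term}.{word_class}"


# Hypothetical input values, for illustration only.
print(initial_panlexia_id("The physical world", "WORLD", "Person/Thing"))
```

The result is only a draft; as said above, every id still goes through manual checking.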

I wrote a program, src/map_ids.py. It maps WOLD ids to Panlexia ids via Concepticon, using the "CONCEPTICON_ID" and "WOLD_ID" columns of the Haspelmath-2009-1460.tsv file. It also maps NorthEuraLex ids to Panlexia ids via Concepticon, using the "CONCEPTICON_ID" and "NELEX_ID" columns of Dellert-2017-1016.tsv (which is an updated version of northeuralex-0.9-concept-data.tsv).
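
The two-step mapping can be sketched roughly as follows. This is a simplified illustration, not the code of src/map_ids.py: the function names, the sample rows and the Panlexia id are made up, and only the column names come from the concept lists mentioned above.

```python
import csv
import io


def read_concept_list(tsv_file, foreign_column):
    """Return {foreign id: Concepticon id} from a Concepticon concept list."""
    mapping = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Skip rows that lack either id; they cannot be linked.
        if row["CONCEPTICON_ID"] and row[foreign_column]:
            mapping[row[foreign_column]] = row["CONCEPTICON_ID"]
    return mapping


def map_to_panlexia(foreign_to_concepticon, concepticon_to_panlexia):
    """Chain foreign id -> Concepticon id -> Panlexia id."""
    return {
        foreign_id: concepticon_to_panlexia[concepticon_id]
        for foreign_id, concepticon_id in foreign_to_concepticon.items()
        if concepticon_id in concepticon_to_panlexia
    }


# Made-up sample rows standing in for Haspelmath-2009-1460.tsv.
sample = io.StringIO("CONCEPTICON_ID\tWOLD_ID\n101\t1-1\n\t1-99\n102\t1-21\n")
# Hypothetical Concepticon -> Panlexia table (normally built in an earlier step).
concepticon_to_panlexia = {"101": "nature:world.N"}

wold_to_panlexia = map_to_panlexia(
    read_concept_list(sample, "WOLD_ID"), concepticon_to_panlexia
)
print(wold_to_panlexia)
```

Passing "NELEX_ID" instead of "WOLD_ID" would handle the NorthEuraLex list in the same way.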


I haven't uploaded the resulting mapping files yet. I will do it after I have finished checking and adapting the Panlexia ids. Then we can start mining word lists from WOLD and NorthEuraLex for our database.
Last edited by pandunia-guru on 2024-10-08 10:27, edited 1 time in total.
seweli
Posts: 50
Joined: 2024-09-29 20:49

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by seweli »

That's great news.

I will try to understand the steps.

First, we have https://github.com/concepticon/concepti ... pticon.tsv

concepticon.tsv
  • ID (it's an incremental integer)
  • GLOSS (it's the SET, an uppercase English word)
  • SEMANTICFIELD (it looks a lot like the one of WOLD but it's not exactly the same)
  • DEFINITION (every row has a clear definition. It's gold)
  • ONTOLOGICAL_CATEGORY (more or less noun, verb...)
  • REPLACEMENT_ID (to correct an error. Be careful to remove the incorrect rows)
So IDs of rows that are marked in the REPLACEMENT_ID column shouldn't be taken into account.
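
A minimal sketch of that filtering step, under the assumption that a non-empty REPLACEMENT_ID marks a row that has been superseded (this should be checked against the Concepticon documentation). The sample rows are made up.

```python
import csv
import io


def active_concepts(tsv_file):
    """Yield concepticon.tsv rows that are not marked as replaced."""
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Assumption: a non-empty REPLACEMENT_ID means the row is obsolete.
        if (row.get("REPLACEMENT_ID") or "").strip():
            continue
        yield row


# Made-up rows with the columns listed above.
sample = io.StringIO(
    "ID\tGLOSS\tSEMANTICFIELD\tDEFINITION\tONTOLOGICAL_CATEGORY\tREPLACEMENT_ID\n"
    "1\tFIRST\tField\tA definition.\tPerson/Thing\t\n"
    "2\tSECOND\tField\tAn obsolete entry.\tPerson/Thing\t1\n"
    "3\tTHIRD\tField\tAnother definition.\tProperty\t\n"
)
kept_ids = [row["ID"] for row in active_concepts(sample)]
print(kept_ids)
```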
seweli
Posts: 50
Joined: 2024-09-29 20:49

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by seweli »

Then we have
https://github.com/concepticon/concepti ... 9-1460.tsv

Haspelmath-2009-1460.tsv
(It's WOLD!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)

ID
CONCEPTICON_ID
CONCEPTICON_GLOSS
NUMBER
ENGLISH
WOLD_ID
URL
IDS_ID
DESCRIPTION
CONTEXT
PART_OF_SPEECH
SEMANTIC_FIELD
AGE_SCORE
BORROWING_SCORE
SIMPLICITY_SCORE
REPRESENTATION
RANK

The CONTEXT column is nice, and it should be used in a file English-context.

But the DESCRIPTION column is often empty and rarely better than the Concepticon definition. Let's drop it without regrets.

The WOLD_ID column is the precious "open sesame" to mini-dictionaries for 41 languages from around the world. It is called the LWT code on their website (Loanword Typology). It is composed of the number of the semantic field and the number of the word, but we don't actually need to care about that anymore.

A big question remains: how to get the files with the WOLD translations, and how to use them.
seweli
Posts: 50
Joined: 2024-09-29 20:49

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by seweli »

Looking at your code.

It seems you map "property" to "A".

word_class_from = {
    "Action/Process": "V",
    "Person/Thing": "N",
    "Property": "A",
    "Number": "NUM",
    "Classifier": "CLF",
    "Other": "?"
}

You may keep the WOLD column "Property Category" when WOLD is available, since it distinguishes Adjective and Adverb. Or not?
pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by pandunia-guru »

seweli wrote: 2024-10-07 23:21 It seems you map "property" to "A".
...
You may keep the WOLD column "Property Category" when WOLD is available, since it distinguishes Adjective and Adverb. Or not?
That program reads only Concepticon data, so it can access only the "Property" category. I would need to write more code to get the "Adverb" information from WOLD. I can do it, though.
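
The extra step might look something like this sketch. It is not written yet in the real program; the "ADV" tag, the function name and the exact WOLD value "Adverb" are assumptions.

```python
def refine_word_class(concepticon_class: str, wold_part_of_speech: str) -> str:
    """Split Concepticon's "A" (Property) into adjective and adverb using WOLD."""
    # Hypothetical rule: only override when Concepticon says "A"
    # and WOLD explicitly says the word is an adverb.
    if concepticon_class == "A" and wold_part_of_speech.strip().lower() == "adverb":
        return "ADV"  # "ADV" is an assumed tag, not an established Panlexia one
    return concepticon_class


print(refine_word_class("A", "Adverb"))
print(refine_word_class("A", "Adjective"))
```

When no WOLD data is available, the Concepticon-based class is simply kept.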
pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by pandunia-guru »

seweli wrote: 2024-10-07 22:40 A big question remains: how to get the files with the WOLD translations, and how to use them.
WOLD translations are in data/WOLD/forms.csv. The first four columns are ID, Language_ID, Parameter_ID and Form. Language_ID identifies the language according to data/WOLD/languages.csv. For example, 1 is Swahili and 123 is French. Parameter_ID identifies the concept according to data/WOLD/parameters.csv. For example, 1-1 means 'the world', 1-21 means 'the land' and so on.

First I will sort that file by language id, so that all translations for each language form one continuous section. Then it will be easier to extract all words for each language at once, first Swahili, then Iraqw, etc. I will write a program that iterates through all languages one by one, extracts the words for each language and writes them to a file, for example swh.tsv and irk.tsv. Each file will be made up of rows that consist of a Panlexia concept id, a tab and the word for the concept in that language. The program will be a little longer and a little more complex than the Python programs that I have uploaded so far. It's maybe one night of work for me.
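
A rough sketch of the extraction, assuming the four columns named above. Instead of sorting the file first, this version collects the rows per language in a dictionary, which is a simpler alternative for illustration and not necessarily what the final program will do. The Panlexia id and the language-code table are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_forms_by_language(forms_file, wold_to_panlexia):
    """Return {Language_ID: [(Panlexia id, form), ...]} from WOLD forms data."""
    words = defaultdict(list)
    for row in csv.DictReader(forms_file):
        panlexia_id = wold_to_panlexia.get(row["Parameter_ID"])
        if panlexia_id is not None:  # skip concepts without a mapping
            words[row["Language_ID"]].append((panlexia_id, row["Form"]))
    return words


def write_language_file(out_file, entries):
    """Write one 'Panlexia id<TAB>word' row per entry."""
    for panlexia_id, form in entries:
        out_file.write(f"{panlexia_id}\t{form}\n")


# Sample rows in the layout of data/WOLD/forms.csv (the ID values are made up).
forms = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form\n"
    "a,1,1-1,dunia\n"
    "b,1,1-1,ulimwengu\n"
    "c,2,1-1,yaamu\n"
)
wold_to_panlexia = {"1-1": "nature:world.N"}  # hypothetical Panlexia id
language_codes = {"1": "swh", "2": "irk"}  # normally read from languages.csv

grouped = group_forms_by_language(forms, wold_to_panlexia)
outputs = {}
for language_id, entries in grouped.items():
    # A real run would open f"{language_codes[language_id]}.tsv" for writing.
    buffer = io.StringIO()
    write_language_file(buffer, entries)
    outputs[language_codes[language_id]] = buffer.getvalue()
print(outputs["swh"])
```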

I can use the same program with small modifications for extracting dictionaries from NorthEuraLex. The data file (northeuralex-0.9-forms.tsv) is already organized so that each language forms one section in the file. The file also includes IPA transcriptions, which will be written to a separate file.
seweli
Posts: 50
Joined: 2024-09-29 20:49

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by seweli »

Thanks a lot. I'm beginning to understand. I need to see the file that contains the Swahili translations to get my neurons to connect.

Just to say: during the night, for a human, there's rarely something more precious than sleeping. No databasing at night. Take care of yourself ✊
seweli
Posts: 50
Joined: 2024-09-29 20:49

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by seweli »

Thanks for the explanation.

In
https://wold.clld.org/meaning
we see clearly
LWT code Meaning
1.1 the world
1.21 the land
etc.

If we click on Meaning 1.1
https://wold.clld.org/meaning/1-1#2/24.3/-4.8
we see
1 Swahili dunia
1 Swahili ulimwengu
2 Iraqw yaamu
3 Gawwada ʔalame
4 Hausa dúuníyàa
etc.

And we find the same information in your files
https://github.com/barumau/panlexia/blo ... /forms.csv
https://github.com/barumau/panlexia/blo ... guages.csv

What I don't understand is where you found your files,
because I didn't find them in
https://github.com/clld/wold2/blob/master/data/data.zip

Furthermore, I notice that the zip contains a file translation.csv with words only for French, Spanish, German and Russian, and that these words are not in your file forms.csv.
pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by pandunia-guru »

I downloaded the WOLD data from https://wold.clld.org/download. I think that it's the same data that is shown on the website.

Now I have also downloaded the zip file that you mentioned. Unfortunately translation.csv doesn't include WOLD ids. How silly! :D But the numbers in the meaning_pk column match the numbers in the pk column of the parameter.csv file, which includes a WOLD id column. So with a little programming we can now get translations for all WOLD words in French, Spanish, German and Russian.
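
The join could be sketched like this. Only pk and meaning_pk are column names taken from the files; "id", "lang" and "name" are assumed column names, and the sample rows are invented.

```python
import csv
import io


def wold_ids_by_pk(parameter_file):
    """Return {pk: WOLD id} from parameter.csv."""
    # Assumption: the WOLD id is stored in a column called "id".
    return {row["pk"]: row["id"] for row in csv.DictReader(parameter_file)}


def translations_with_wold_ids(translation_file, pk_to_wold):
    """Yield (WOLD id, language, word) for every translation that can be linked."""
    for row in csv.DictReader(translation_file):
        wold_id = pk_to_wold.get(row["meaning_pk"])
        if wold_id is not None:
            # Assumption: "lang" and "name" hold the language and the word.
            yield wold_id, row["lang"], row["name"]


# Invented sample rows standing in for parameter.csv and translation.csv.
parameters = io.StringIO("pk,id\n10,1-1\n11,1-21\n")
translations = io.StringIO(
    "meaning_pk,lang,name\n10,fra,le monde\n11,deu,das Land\n99,spa,sin enlace\n"
)

linked = list(translations_with_wold_ids(translations, wold_ids_by_pk(parameters)))
print(linked)
```

Rows whose meaning_pk has no match in parameter.csv are silently skipped here; a real program might want to report them.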

Thanks! :D
pandunia-guru
Posts: 62
Joined: 2024-09-28 09:28

Re: Mapping ids (Concepticon, WOLD, NELex)

Post by pandunia-guru »

I did it and uploaded the file to data/WOLD/forms_fra_spa_deu_rus.tsv.