Mapping ids (Concepticon, WOLD, NELex)
-
- Posts: 62
- Joined: 2024-09-28 09:28
Panlexia uses Concepticon as the initial source for concept definitions. It provides almost 4000 definitions for a wide variety of concepts. Each concept belongs to a semantic field and has a unique id gloss and number. We build initial Panlexia ids from Concepticon data.
The form of a Panlexia id, Terminology:term.word-class, can be built from Concepticon by mapping semantic field to terminology, id gloss to term and ontological category to word class. This is only an initial id; it goes through manual quality control and adaptation before it becomes final.
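A minimal sketch of how such an initial id could be assembled from one Concepticon row. The word-class table is the one from src/map_ids.py as quoted further down in this thread. The function name `initial_panlexia_id` and the normalization (lowercasing the gloss, replacing spaces with underscores) are assumptions for illustration and may not match what the program actually does.

```python
# Word-class table as quoted from src/map_ids.py in this thread.
WORD_CLASS_FROM = {
    "Action/Process": "V",
    "Person/Thing": "N",
    "Property": "A",
    "Number": "NUM",
    "Classifier": "CLF",
    "Other": "?",
}


def initial_panlexia_id(semantic_field, gloss, ontological_category):
    """Build a draft id of the form Terminology:term.word-class."""
    # Assumption: spaces become underscores; the real rules may differ.
    terminology = semantic_field.strip().replace(" ", "_")
    term = gloss.strip().lower().replace(" ", "_")
    # Unknown categories fall back to "?", like "Other" in the table.
    word_class = WORD_CLASS_FROM.get(ontological_category, "?")
    return f"{terminology}:{term}.{word_class}"


# Sample values for illustration only.
print(initial_panlexia_id("The physical world", "WORLD", "Person/Thing"))
```

The result is only a draft; the manual quality control described above still decides the final id.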
I wrote a program, src/map_ids.py. It maps WOLD ids to Panlexia ids via Concepticon, using the "CONCEPTICON_ID" and "WOLD_ID" columns in the file Haspelmath-2009-1460.tsv. It also maps NorthEuraLex ids to Panlexia ids via Concepticon, using the "CONCEPTICON_ID" and "NELEX_ID" columns in Dellert-2017-1016.tsv (which is an updated version of northeuralex-0.9-concept-data.tsv).
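The mapping step can be sketched roughly like this. It is not the code of src/map_ids.py, only an illustration under assumptions: the function name `load_id_map`, the dictionary `panlexia_by_concepticon` and the sample rows are made up, and only the column names come from the description above.

```python
import csv
import io


def load_id_map(tsv_file, source_column, panlexia_by_concepticon):
    """Return {source id: Panlexia id} for rows with a known Concepticon id."""
    mapping = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        concepticon_id = row["CONCEPTICON_ID"]
        source_id = row[source_column]
        # Skip rows that have no Concepticon link or no source id.
        if source_id and concepticon_id in panlexia_by_concepticon:
            mapping[source_id] = panlexia_by_concepticon[concepticon_id]
    return mapping


# Made-up sample standing in for Haspelmath-2009-1460.tsv.
sample = io.StringIO("CONCEPTICON_ID\tWOLD_ID\n965\t1-1\n\t1-99\n")
# Hypothetical Panlexia id, keyed by Concepticon id.
panlexia_by_concepticon = {"965": "Nature:world.N"}
print(load_id_map(sample, "WOLD_ID", panlexia_by_concepticon))
```

The same function would serve NorthEuraLex by passing "NELEX_ID" as the source column.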
I haven't uploaded the resulting mapping files yet. I will do so once I have finished checking and adapting the Panlexia ids. Then we can start mining word lists from WOLD and NorthEuraLex into our database.
Last edited by pandunia-guru on 2024-10-08 10:27, edited 1 time in total.
Re: Mapping ids (Concepticon, WOLD, NELex)
That's great news.
I will try to understand the steps.
First, we have https://github.com/concepticon/concepti ... pticon.tsv
concepticon.tsv
- ID (it's an incremental integer)
- GLOSS (it's the SET, an uppercase English word)
- SEMANTICFIELD (it looks a lot like the one of WOLD but it's not exactly the same)
- DEFINITION (every row has a clear definition. It's gold)
- ONTOLOGICAL_CATEGORY (more or less noun, verb...)
- REPLACEMENT_ID (to correct an error. Be careful to remove the incorrect rows)
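A minimal sketch of reading concepticon.tsv while leaving out the replaced rows. It assumes that a non-empty REPLACEMENT_ID marks a row that has been superseded; the function name `load_concepts` and the sample rows are made up for illustration.

```python
import csv
import io


def load_concepts(tsv_file):
    """Return {ID: row} for all rows that have not been replaced."""
    concepts = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Assumption: a filled-in REPLACEMENT_ID means "do not use this row".
        if (row.get("REPLACEMENT_ID") or "").strip():
            continue
        concepts[row["ID"]] = row
    return concepts


# Made-up sample rows with the columns listed above.
sample = io.StringIO(
    "ID\tGLOSS\tSEMANTICFIELD\tDEFINITION\tONTOLOGICAL_CATEGORY\tREPLACEMENT_ID\n"
    "1\tEXAMPLE ONE\tField\tA definition.\tPerson/Thing\t\n"
    "2\tEXAMPLE TWO\tField\tAn outdated row.\tPerson/Thing\t1\n"
)
concepts = load_concepts(sample)
print(sorted(concepts))
```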
Re: Mapping ids (Concepticon, WOLD, NELex)
Then we have
https://github.com/concepticon/concepti ... 9-1460.tsv
Haspelmath-2009-1460.tsv
(It's WOLD!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)
ID
CONCEPTICON_ID
CONCEPTICON_GLOSS
NUMBER
ENGLISH
WOLD_ID
URL
IDS_ID
DESCRIPTION
CONTEXT
PART_OF_SPEECH
SEMANTIC_FIELD
AGE_SCORE
BORROWING_SCORE
SIMPLICITY_SCORE
REPRESENTATION
RANK
The column CONTEXT is nice, and it should be used in a file English-context.
But the column DESCRIPTION is often empty and rarely better than the Concepticon definition. Let's drop it without regret.
The column WOLD_ID is the precious "open sesame" to mini-dictionaries for 41 languages from around the world. It is called the LWT code on their website (Loanword Typology). It is composed of the number of the lexical field and the number of the word, but we don't really need to care about that anymore.
A big question remains: how to get the files with the WOLD translations, and how to use them.
Re: Mapping ids (Concepticon, WOLD, NELex)
Looking at your code.
It seems you map "property" to "A".
word_class_from = {
"Action/Process" : "V",
"Person/Thing" : "N",
"Property" : "A",
"Number" : "NUM",
"Classifier" : "CLF",
"Other" : "?"
}
Maybe you could keep the WOLD column "Property Category" when WOLD data is available, since it distinguishes adjectives from adverbs. Or not?
Re: Mapping ids (Concepticon, WOLD, NELex)
That program reads only Concepticon data, so it can access only the "Property" category. I would need to write more code to get the "Adverb" value from WOLD. I can do that, though.
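The extra code could look something like the sketch below. Everything in it is an assumption: the function name `refine_word_class`, the "ADV" tag, and the idea that the WOLD value for adverbs contains the word "adverb". It has not been checked against the actual WOLD data.

```python
def refine_word_class(word_class, wold_part_of_speech=None):
    """Split the Concepticon class "A" into adjective and adverb if WOLD says so."""
    # Only "Property" concepts (mapped to "A") are ambiguous.
    if word_class == "A" and wold_part_of_speech:
        # Assumption: WOLD marks adverbs with a label containing "adverb".
        if "adverb" in wold_part_of_speech.lower():
            return "ADV"  # hypothetical tag for adverbs
    # Without WOLD information the Concepticon class is kept as it is.
    return word_class


print(refine_word_class("A", "Adverb"))
print(refine_word_class("A"))
```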
Re: Mapping ids (Concepticon, WOLD, NELex)
The WOLD translations are in data/WOLD/forms.csv. The first four columns are ID, Language_ID, Parameter_ID and Form. Language_ID identifies the language according to data/WOLD/languages.csv. For example, 1 is Swahili and 123 is French. Parameter_ID identifies the concept according to data/WOLD/parameters.csv. For example, 1-1 means 'the world', 1-21 means 'the land' and so on.
First I will sort that file by language id, so that all translations for each language form one continuous section. Then it will be easier to extract all words for each language at once, first Swahili, then Iraqw, etc. I will write a program that iterates through the languages one by one, extracts the words for each language and writes them to a file, for example swh.tsv and irk.tsv. Each file will be made up of rows that consist of a Panlexia concept id, a tab and the word for the concept in that language. The program will be a little longer and a little more complex than the Python programs that I have uploaded so far. It's maybe one night's work for me.
I can use the same program with small modifications to extract dictionaries from NorthEuraLex. The data file (northeuralex-0.9-forms.tsv) is already organized so that each language forms one section in the file. The file also includes IPA transcriptions, which will be written to a separate file.
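A rough sketch of the planned extraction, assuming the four columns named above. The function name `split_by_language`, the dictionary `panlexia_by_parameter` and the Panlexia id in the sample are hypothetical; the sample words are the ones shown for meaning 1-1 on the WOLD website. The sketch keeps the result in memory instead of writing files.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def split_by_language(forms_file, panlexia_by_parameter):
    """Return {Language_ID: ["panlexia_id<TAB>form", ...]}."""
    rows = list(csv.DictReader(forms_file))
    # Sort so that each language forms one continuous section.
    rows.sort(key=itemgetter("Language_ID"))
    result = {}
    for language_id, group in groupby(rows, key=itemgetter("Language_ID")):
        lines = []
        for row in group:
            concept = panlexia_by_parameter.get(row["Parameter_ID"])
            # Concepts without a Panlexia id are skipped.
            if concept is not None:
                lines.append(f"{concept}\t{row['Form']}")
        result[language_id] = lines
    return result


# Small sample in the shape of data/WOLD/forms.csv (row ids are made up).
sample = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form\n"
    "a,1,1-1,dunia\n"
    "b,2,1-1,yaamu\n"
    "c,1,1-1,ulimwengu\n"
)
by_language = split_by_language(sample, {"1-1": "Nature:world.N"})
for language_id, lines in by_language.items():
    print(language_id, lines)
```

Each list could then be written to its own file (swh.tsv, irk.tsv, ...) by looking up the language code in data/WOLD/languages.csv.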
Re: Mapping ids (Concepticon, WOLD, NELex)
Thanks a lot. I'm beginning to understand. I need to see the file that contains the Swahili translations to connect my neurons.
Just to say: at night, for a human, there's rarely anything more precious than sleep. No databasing at night. Take care of yourself.
Re: Mapping ids (Concepticon, WOLD, NELex)
Thanks for the explanation.
In
https://wold.clld.org/meaning
we see clearly
LWT code Meaning
1.1 the world
1.21 the land
etc.
If we click on Meaning 1.1
https://wold.clld.org/meaning/1-1#2/24.3/-4.8
we see
1 Swahili dunia
1 Swahili ulimwengu
2 Iraqw yaamu
3 Gawwada ʔalame
4 Hausa dúuníyàa
etc.
And we find the same information in your files
https://github.com/barumau/panlexia/blo ... /forms.csv
https://github.com/barumau/panlexia/blo ... guages.csv

What I don't understand is where you found your files.
Because I didn't find them in
https://github.com/clld/wold2/blob/master/data/data.zip
Furthermore, I notice that the zip contains a file translation.csv with words only for French, Spanish, German and Russian, and that those words are not in your file forms.csv.
Re: Mapping ids (Concepticon, WOLD, NELex)
I downloaded the WOLD data from https://wold.clld.org/download. I think it's the same data that is shown on the website.
Now I have also downloaded the zip file that you mentioned. Unfortunately translation.csv doesn't include WOLD ids. How silly!
But the numbers in the meaning_pk column match the numbers in the pk column of parameter.csv, which includes a WOLD id column. So with a little programming we can now get translations for all WOLD words in French, Spanish, German and Russian.
danke!
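The "little programming" amounts to a join on meaning_pk = pk. A minimal sketch follows; the column names `meaning_pk` and `pk` are from the files, while the names of the WOLD id column ("id") and of the word column ("name"), the function name `wold_translations` and the sample rows are assumptions that need to be checked against the real headers.

```python
import csv
import io


def wold_translations(parameter_file, translation_file,
                      wold_id_column="id", word_column="name"):
    """Return (WOLD id, word) pairs by joining on meaning_pk = pk."""
    # Look-up table from the internal pk number to the WOLD id.
    wold_id_by_pk = {row["pk"]: row[wold_id_column]
                     for row in csv.DictReader(parameter_file)}
    pairs = []
    for row in csv.DictReader(translation_file):
        wold_id = wold_id_by_pk.get(row["meaning_pk"])
        # Translations that point to an unknown pk are skipped.
        if wold_id is not None:
            pairs.append((wold_id, row[word_column]))
    return pairs


# Made-up rows; the column names "id" and "name" are assumptions.
parameters = io.StringIO("pk,id\n7,1-1\n")
translations = io.StringIO("meaning_pk,name\n7,le monde\n8,orphan\n")
print(wold_translations(parameters, translations))
```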
Re: Mapping ids (Concepticon, WOLD, NELex)
I did it and uploaded the file to data/WOLD/forms_fra_spa_deu_rus.tsv.