by pandunia-guru » 2024-10-08 17:14
[quote=seweli post_id=43 time=1728340808 user_id=60]
It remains a big question: how to get the files of the WOLD translations, and how to use it.
[/quote]
WOLD translations are in [url=https://raw.githubusercontent.com/barumau/panlexia/refs/heads/master/data/WOLD/forms.csv]data/WOLD/forms.csv[/url]. The first four columns are [i]ID, Language_ID, Parameter_ID, Form[/i]. Language_ID identifies the language according to [url=https://github.com/barumau/panlexia/blob/master/data/WOLD/languages.csv]data/WOLD/languages.csv[/url]; for example, 1 is Swahili and 123 is French. Parameter_ID identifies the concept according to [url=https://github.com/barumau/panlexia/blob/master/data/WOLD/parameters.csv]data/WOLD/parameters.csv[/url]; for example, 1-1 means 'the world', 1-21 means 'the land', and so on.
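To make the relationship between the three files concrete, here is a minimal sketch that joins them with Python's csv module. The rows are made-up stand-ins, not copied from the real files, and I am assuming that languages.csv and parameters.csv each have an ID column and a Name column; check the real headers before relying on this.

```python
import csv
import io

# Illustrative stand-ins for the three CSV files (made-up rows, not real data).
FORMS = """ID,Language_ID,Parameter_ID,Form
f1,1,1-1,dunia
f2,123,1-1,monde
f3,1,1-21,nchi
"""
# Assumption: both lookup files have "ID" and "Name" columns.
LANGUAGES = """ID,Name
1,Swahili
123,French
"""
PARAMETERS = """ID,Name
1-1,the world
1-21,the land
"""

def load_lookup(text, key="ID", value="Name"):
    """Map one column to another, e.g. language ID -> language name."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}

def describe_forms(forms_text, languages, parameters):
    """Return (language name, concept name, form) for every row of the forms table."""
    return [
        (languages[row["Language_ID"]], parameters[row["Parameter_ID"]], row["Form"])
        for row in csv.DictReader(io.StringIO(forms_text))
    ]

rows = describe_forms(FORMS, load_lookup(LANGUAGES), load_lookup(PARAMETERS))
for language, concept, form in rows:
    print(f"{language}\t{concept}\t{form}")
```

With the real files you would pass open file handles to csv.DictReader instead of the in-memory strings.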
First I will sort that file by language ID, so that all translations for each language form one continuous section. Then it will be easier to extract all the words for each language at once: first Swahili, then Iraqw, etc. I will write a program that iterates through the languages one by one, extracts the words for each language and writes them to a file, for example swh.tsv and irk.tsv. Each file will consist of rows containing a Panlexia concept id, a tab and the word for that concept in that language. The program will be a little longer and a little more complex than the Python programs that I have uploaded so far. It's maybe one night of work for me.
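The plan above could be sketched roughly as follows. This is only an outline under assumptions, not the finished program: the two mapping dictionaries (WOLD language ID to file name, WOLD Parameter_ID to Panlexia concept id) are hypothetical and would have to be built from languages.csv and the Panlexia concept list, the concept ids shown are placeholders, and the sample forms are made up.

```python
import csv
import io
import tempfile
from itertools import groupby
from pathlib import Path

def extract_dictionaries(forms_file, lang_codes, concept_ids):
    """Return {language code: [(concept id, form), ...]} from WOLD-style rows."""
    # Sort by language so that each language forms one continuous section.
    # Assumes every Language_ID is an integer, as in the examples 1 and 123.
    rows = sorted(csv.DictReader(forms_file), key=lambda r: int(r["Language_ID"]))
    dictionaries = {}
    for language_id, section in groupby(rows, key=lambda r: r["Language_ID"]):
        if language_id not in lang_codes:
            continue  # no file name known for this language
        entries = [
            (concept_ids[r["Parameter_ID"]], r["Form"])
            for r in section
            if r["Parameter_ID"] in concept_ids  # skip unmapped concepts
        ]
        dictionaries[lang_codes[language_id]] = entries
    return dictionaries

def write_dictionaries(dictionaries, out_dir):
    """Write one <code>.tsv per language: concept id, tab, word."""
    out_dir = Path(out_dir)
    for code, entries in dictionaries.items():
        with open(out_dir / f"{code}.tsv", "w", encoding="utf-8", newline="") as f:
            for concept, form in entries:
                f.write(f"{concept}\t{form}\n")

# Made-up sample rows, deliberately out of language order.
SAMPLE = """ID,Language_ID,Parameter_ID,Form
a,2,1-1,irk_world
b,1,1-1,sw_world
c,1,1-21,sw_land
"""
LANG_CODES = {"1": "swh", "2": "irk"}           # hypothetical ID -> file name mapping
CONCEPT_IDS = {"1-1": "world", "1-21": "land"}  # placeholder Panlexia concept ids

dictionaries = extract_dictionaries(io.StringIO(SAMPLE), LANG_CODES, CONCEPT_IDS)
with tempfile.TemporaryDirectory() as tmp:
    write_dictionaries(dictionaries, tmp)
    written = {p.name: p.read_text(encoding="utf-8") for p in sorted(Path(tmp).iterdir())}
print(written)
```

Python's sort is stable, so the words of each language stay in the order they had in the original file. The demo writes into a temporary directory; pointing write_dictionaries at a real folder would leave swh.tsv and irk.tsv there.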
I can use the same program, with small modifications, to extract dictionaries from NorthEuraLex. The data file ([url=https://raw.githubusercontent.com/barumau/panlexia/refs/heads/master/data/NorthEuraLex/northeuralex-0.9-forms.tsv]northeuralex-0.9-forms.tsv[/url]) is already organized so that each language forms one section in the file. The file also includes IPA transcriptions, which will be written to a separate file.
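For the NorthEuraLex case, the modification might look something like the sketch below: no sorting step, and the words and the IPA transcriptions are collected separately so that they can go to separate files. The column names Language_ID, Concept_ID, Word_Form and IPA are my assumption about the header of the TSV file and should be checked against it; the sample rows are placeholders, not real data.

```python
import csv
import io
from itertools import groupby

def split_words_and_ipa(forms_file, word_col="Word_Form", ipa_col="IPA"):
    """Return two dicts keyed by language: (concept, word) rows and (concept, IPA) rows."""
    # QUOTE_NONE: treat quote characters as ordinary text in the TSV.
    reader = csv.DictReader(forms_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    words, ipa = {}, {}
    # No sorting: groupby walks the sections in file order, one language at a time.
    for language, section in groupby(reader, key=lambda r: r["Language_ID"]):
        section = list(section)
        # setdefault/extend keeps earlier rows if a language ever appears twice.
        words.setdefault(language, []).extend(
            (r["Concept_ID"], r[word_col]) for r in section
        )
        ipa.setdefault(language, []).extend(
            (r["Concept_ID"], r[ipa_col]) for r in section
        )
    return words, ipa

# Placeholder rows; the column names are assumed, the values are made up.
SAMPLE = (
    "Language_ID\tConcept_ID\tWord_Form\tIPA\n"
    "fin\tc1\tw1\ti1\n"
    "fin\tc2\tw2\ti2\n"
    "krl\tc1\tw3\ti3\n"
)

words, ipa = split_words_and_ipa(io.StringIO(SAMPLE))
print(words)
print(ipa)
```

Each of the two dictionaries can then be written out in the same concept id, tab, value row format as the WOLD dictionaries, once the NorthEuraLex concept ids are mapped to Panlexia concept ids.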