Introduction to Panlexia

Discussion about the Panlexia project
pandunia-guru
Posts: 27
Joined: 2024-09-28 09:28

Introduction to Panlexia

Post by pandunia-guru »

Panlexia is a system for translating ideas between different languages. It is a database of files that can serve as the source material for word lists and dictionaries between any pair of languages that have words for the meanings in the database. All data files are publicly and freely accessible at https://github.com/barumau/panlexia.

The database consists of TSV (tab-separated values) files. Each row in each file has a unique identifier followed by a word, a definition or other data. The identifier links each definition to a word in each language, and it links each word in one language to its translations in other languages. There can also be other data, such as word class and pronunciation in IPA or another transcription system, and these can be linked to the word in the same way.

Below are excerpts of a definition file in English and data files in English (eng), German (deu) and Swahili (swa). The data files cover translation, transcription (IPA), grammatical gender or noun class ("genus") and style ("stilus").


file: eng_definition.tst
 anatomia:facies.n      the front part of the head (esp. of a human)
 anatomia:bucca.n       the soft skin on each side of the face, below the eyes

file: eng.tsv
 anatomia:facies.n      face
 anatomia:facies.n-2    visage
 anatomia:facies.n-3    mug
 anatomia:facies.a      facial

file: eng-IPA.tsv
 anatomia:facies.n      feɪs
 anatomia:facies.n-2    ˈvɪzɪd͡ʒ
 anatomia:facies.n-3    mʌɡ
 anatomia:facies.a      ˈfeɪ.ʃəl

file: eng-stilus.tsv
 anatomia:facies.n-2    lit.
 anatomia:facies.n-3    sl.

file: deu.tsv
 anatomia:facies.n      Gesicht
 anatomia:facies.n-2    Visage
 anatomia:facies.n-3    Fresse
 anatomia:facies.a      Gesichts-

file: deu-genus.tsv
 anatomia:facies.n      n
 anatomia:facies.n-2    f
 anatomia:facies.n-3    f

file: deu-stilus.tsv
 anatomia:facies.n-2    sl.
 anatomia:facies.n-3    sl.

file: swa.tsv
 anatomia:facies.n      uso
 anatomia:facies.n-2    sura
 anatomia:facies.n-3    wajihi
 anatomia:facies.a      -a uso

file: swa-genus.tsv
 anatomia:facies.n      11/10
 anatomia:facies.n-2    9/10
 anatomia:facies.n-3    9/10

file: swa-stilus.tsv
 anatomia:facies.n-2    lit.
 anatomia:facies.n-3    lit.

Scripts or programs could combine selected data from the above files for different purposes. For example, one could create an English-to-German dictionary with entries like this:

face n. Gesicht n, (sl.) Visage f, (sl.) Fresse f
facial adj. Gesichts-
mug n. (Slangwort für Gesicht) Visage f, Fresse f
visage n. Gesicht n
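
To make the combination step a little more concrete, here is a minimal Python sketch that joins the English and German excerpts above by identifier. The function names (parse_tsv, build_entries) and the output layout are assumptions made for illustration, not Panlexia's actual scripts. The sketch only pairs words that share exactly the same identifier, so it is simpler than the sample entries above, which also group synonyms under one headword.

```python
def parse_tsv(text):
    """Read two-column TSV text into a dict: identifier -> value."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.strip().partition("\t")
        table[key.strip()] = value.strip()
    return table


# Sample rows copied from the excerpts above.
ENG_TSV = (
    "anatomia:facies.n\tface\n"
    "anatomia:facies.n-2\tvisage\n"
    "anatomia:facies.n-3\tmug\n"
    "anatomia:facies.a\tfacial\n"
)
DEU_TSV = (
    "anatomia:facies.n\tGesicht\n"
    "anatomia:facies.n-2\tVisage\n"
    "anatomia:facies.n-3\tFresse\n"
    "anatomia:facies.a\tGesichts-\n"
)
DEU_GENUS_TSV = (
    "anatomia:facies.n\tn\n"
    "anatomia:facies.n-2\tf\n"
    "anatomia:facies.n-3\tf\n"
)
DEU_STILUS_TSV = (
    "anatomia:facies.n-2\tsl.\n"
    "anatomia:facies.n-3\tsl.\n"
)


def build_entries(source, target, genus, stilus):
    """Pair source and target words that share an identifier.

    The style label and gender of the target word are added when present.
    """
    entries = []
    for key, word in source.items():
        if key not in target:
            continue  # no translation for this meaning yet
        parts = []
        if key in stilus:
            parts.append(f"({stilus[key]})")
        parts.append(target[key])
        if key in genus:
            parts.append(genus[key])
        entries.append(f"{word}: {' '.join(parts)}")
    return sorted(entries)


entries = build_entries(
    parse_tsv(ENG_TSV),
    parse_tsv(DEU_TSV),
    parse_tsv(DEU_GENUS_TSV),
    parse_tsv(DEU_STILUS_TSV),
)
for entry in entries:
    print(entry)
```

Swapping the source and target tables (and passing the genus and style tables of the new target language) would give the reverse dictionary from the same function.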

This is only one possibility. The same program that generated the English–German dictionary could be reused to generate German–English, Swahili–German, German–Swahili, Swahili–English and English–Swahili dictionaries. When someone adds translations for a new language, like Quechua, Esperanto or Pandunia, to the database, one could generate new dictionaries that may never have been made before, like Quechua–Swahili and Esperanto–Quechua. So everyone's contribution to the Panlexia database benefits everybody else.

The meaning identifiers are a key ingredient in this project. They are structured so that they give a mini definition of the meaning in some scientific terminology or sometimes in English: <terminology>:<term>.<word_class>. For example:
anatomia:facies.n
anatomia:bucca.n
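
As a rough sketch of how a script might take such an identifier apart, the Python below splits it into its three parts. The optional numeric suffix (as in anatomia:facies.n-2) is inferred from the excerpts earlier in this post, and the names MeaningId and parse_meaning_id are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class MeaningId(NamedTuple):
    terminology: str
    term: str
    word_class: str
    variant: Optional[int]  # e.g. 2 for "...n-2"; None when there is no suffix


# <terminology>:<term>.<word_class>, optionally followed by -<number>.
# The term itself may contain spaces (and dots); the last dot separates
# the word class.
_ID_PATTERN = re.compile(
    r"^(?P<terminology>[^:]+):(?P<term>.+)\.(?P<word_class>[A-Za-z]+)"
    r"(?:-(?P<variant>\d+))?$"
)


def parse_meaning_id(identifier):
    """Split an identifier into terminology, term, word class and variant."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a valid meaning identifier: {identifier!r}")
    variant = match.group("variant")
    return MeaningId(
        match.group("terminology"),
        match.group("term"),
        match.group("word_class"),
        int(variant) if variant else None,
    )


print(parse_meaning_id("anatomia:facies.n"))
print(parse_meaning_id("anatomia:facies.n-2"))
```

A check like this could also be run over new contributions to catch malformed identifiers early.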

Here is a possible division of meaning categories (from Loanwords in the World's Languages, a Comparative Handbook) that we could use initially in Panlexia.
  1. The physical world
  2. Kinship
  3. Animals
  4. Anatomy
  5. Food and drink
  6. Clothing and grooming
  7. The house
  8. Agriculture and vegetation
  9. Basic actions and technology
  10. Motion
  11. Possession
  12. Spatial relations
  13. Quantity
  14. Time
  15. Sense perception
  16. Emotions and values
  17. Cognition
  18. Speech and language
  19. Social and political relations
  20. Warfare and hunting
  21. Law
  22. Religion and belief
  23. Modern world
  24. Miscellaneous function words
Let's use standardized terminologies when possible. For example, the identifiers of anatomical terms should be created from the Latin terms in Terminologia Anatomica (see https://ifaa.unifr.ch/Public/EntryPage/ViewTA98New.html). When we use established standards, we can avoid misunderstandings by referring to the standard. Also, anybody who wants to add a word whose meaning is not yet listed in the database can create a new identifier in the correct way.

This is how Panlexia should work in principle. In practice, there is a lot of work to do before everything is settled.
Last edited by pandunia-guru on 2024-10-04 08:42, edited 1 time in total.
seweli2

Re: Introduction to Panlexia

Post by seweli2 »

Let's start with anatomy to make it a little concrete.

https://www.pandunia.info/eng/eng-pandunia.html
alveolar ridge (gum ridge) n dante kume
Well, I have to do some searching on this topic to get the Latin term.
Seweli2

Re: Introduction to Panlexia

Post by Seweli2 »

So


file: eng_definition.tst
4:anatomia:facies.n      the front part of the head (esp. of a human)
4:anatomia:bucca.n       the soft skin on each side of the face, below the eyes
Or we don't need to keep the number?
pandunia-guru
Posts: 27
Joined: 2024-09-28 09:28

Re: Introduction to Panlexia

Post by pandunia-guru »

seweli2 wrote: 2024-09-30 10:07 Well, I have to do some searching on this topic to get the Latin term.
I can't find this English term in https://ifaa.unifr.ch/Public/EntryPage/ ... C%20EN.htm, so I can't use it to get to the Latin term. (That's what I normally do.)

So let's make things easier for us and use English terms in anatomical identifiers. Let's keep compatibility with the English index of Terminologia Anatomica when the term is found there and let's use common sense when it's not found. So the id of this term shall be anatomia:alveolar ridge.n.

The idea is that the ids (or "keys") should single out meanings unambiguously. Standardized terminologies do exactly that. That's why I want to use reference works like Terminologia Anatomica. For example, it's much better to use binomial names for species, because colloquial names can be ambiguous. Some examples:

Animalia:Felis catus.n 'cat'
Plantae:Pinus sylvestris.n 'pine'
Fungi:Agaricus bisporus.n 'champignon, button mushroom'.
Seweli2 wrote: 2024-09-30 12:19 Or we don't need to keep the number?
I don't know where that number came from, but there should be no number in the beginning of keys.
seweli
Posts: 25
Joined: 2024-09-29 20:49

Re: Introduction to Panlexia

Post by seweli »

Or let's use Google Translate: alveolaris iugum
It looks good enough to begin.
seweli
Posts: 25
Joined: 2024-09-29 20:49

Re: Introduction to Panlexia

Post by seweli »

The number four comes from your above list, extracted from
Loanwords in the World's Languages, a Comparative Handbook
By the way, it's not really important for now, but if someone finds the PDF...
pandunia-guru
Posts: 27
Joined: 2024-09-28 09:28

Re: Introduction to Panlexia

Post by pandunia-guru »

seweli wrote: 2024-10-01 18:18 The number four comes from your above list, extracted from
Loanwords in the World's Languages, a Comparative Handbook
Aha! That gave me an idea! While that number is not needed in the Panlexia id, we can create a file that maps Panlexia ids to that number (it is called the "LWT code" in the World Loanword Database at https://wold.clld.org). Then we can get dictionary files from WOLD, replace the numerical LWT codes with Panlexia ids and insert the files into our database. So our word list database will grow even more. :D There are word lists for 41 languages in WOLD, and the data is free to use under a CC Attribution license.
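
A minimal Python sketch of that replacement step could look like the following. The column order of the mapping file (Panlexia id first, LWT code second), the function names and the sample codes are all assumptions for illustration; the codes below are placeholders and have not been checked against WOLD.

```python
def load_code_to_id(map_text):
    """Read 'panlexia_id<TAB>lwt_code' lines and return lwt_code -> panlexia_id."""
    code_to_id = {}
    for line in map_text.splitlines():
        if not line.strip():
            continue
        panlexia_id, _, lwt_code = line.partition("\t")
        code_to_id[lwt_code.strip()] = panlexia_id.strip()
    return code_to_id


def relabel_rows(rows, code_to_id):
    """Swap the LWT code of each (code, word) row for its Panlexia id.

    Rows whose code has no mapping yet are reported separately instead of
    being silently dropped.
    """
    relabeled, unmapped = [], []
    for code, word in rows:
        if code in code_to_id:
            relabeled.append((code_to_id[code], word))
        else:
            unmapped.append(code)
    return relabeled, unmapped


# Placeholder data: the LWT codes here are illustrative only.
id_map = load_code_to_id("anatomia:facies.n\t4.204\n")
wold_rows = [("4.204", "uso"), ("99.999", "unknown")]

relabeled, unmapped = relabel_rows(wold_rows, id_map)
print(relabeled)
print(unmapped)
```

Collecting the unmapped codes gives a to-do list of meanings that still need a Panlexia id.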

Now we have a general plan for what to do. I can now start populating the database and start programming scripts for data extraction and basic dictionary building.
seweli
Posts: 25
Joined: 2024-09-29 20:49

Re: Introduction to Panlexia

Post by seweli »

To see the 41 languages:
https://wold.clld.org/vocabulary

Swahili, Hausa, Mandarin, Japanese, Indonesian...
But not yet French, German, Russian and Spanish.
pandunia-guru
Posts: 27
Joined: 2024-09-28 09:28

Re: Introduction to Panlexia

Post by pandunia-guru »

I have now created the initial list of concept ids for Panlexia. The file is at https://github.com/barumau/panlexia/blo ... id_map.tsv. The ids are generated from data in WOLD and NorthEuraLex by a Python program that I wrote. I will check the ids manually, merge duplicates, improve terms and improve semantic categorization. The id map is the key for extracting words from WOLD and NorthEuraLex.

I uploaded raw data from NorthEuraLex and WOLD to the Panlexia repository. WOLD provides about 1460 terms in 41 languages from around the world. NorthEuraLex (http://northeuralex.org) provides about 1016 terms in 107 languages of Northern Eurasia. Eight languages appear in both WOLD and NorthEuraLex: Romanian, Dutch, English, Kildin Saami, Ket, Sakha (Yakut), Japanese and Mandarin Chinese. The word lists partly overlap and partly complement each other.

Altogether we now have 1000 or more words in 140 different languages (41 + 107 - 8 shared) in Panlexia to start from!

The next steps:
  • Finalize concept ids. (Must be done first!)
  • Write a program to extract words from NorthEuraLex data.
  • Write a program to extract words from WOLD data.
  • Write a program to create a word list in Markdown format. It should take two ISO language codes as parameters, e.g. create_dictionary fin jpn would create a Finnish-Japanese dictionary.
  • Improve the word list creation program so that it can create dictionaries in other formats (PDF, HTML look-up dictionary, etc.).
  • Insert concept ids into the multilingual Pandunia spreadsheet dictionary and extract words for Pandunia, Esperanto and the other languages that are there.
  • Keep on improving output quality of dictionary generation until it looks professional.
  • Keep on adding more words and more languages.
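
The Markdown word list step in the list above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it supposes that each language has a two-column file named <code>.tsv in one directory, and the --data-dir option, the function names and the Markdown layout are all hypothetical rather than the planned implementation.

```python
import argparse
from pathlib import Path


def read_words(path):
    """Read 'identifier<TAB>word' lines into a dict (first occurrence wins)."""
    words = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.rstrip("\n").partition("\t")
            if sep and value.strip():
                words.setdefault(key.strip(), value.strip())
    return words


def create_dictionary(source_words, target_words, source_code, target_code):
    """Return a Markdown word list pairing words that share an identifier."""
    lines = [f"# {source_code}-{target_code} dictionary", ""]
    pairs = sorted(
        (word, target_words[key])
        for key, word in source_words.items()
        if key in target_words
    )
    lines.extend(f"- **{word}**: {translation}" for word, translation in pairs)
    return "\n".join(lines) + "\n"


def main(argv=None):
    parser = argparse.ArgumentParser(prog="create_dictionary")
    parser.add_argument("source", help="code of the source language, e.g. fin")
    parser.add_argument("target", help="code of the target language, e.g. jpn")
    parser.add_argument("--data-dir", default=".",
                        help="directory holding the <code>.tsv files (assumed layout)")
    args = parser.parse_args(argv)

    data_dir = Path(args.data_dir)
    source_words = read_words(data_dir / f"{args.source}.tsv")
    target_words = read_words(data_dir / f"{args.target}.tsv")
    print(create_dictionary(source_words, target_words, args.source, args.target),
          end="")
```

Calling main(["fin", "jpn"]) would then correspond to the command create_dictionary fin jpn, provided that fin.tsv and jpn.tsv exist in the data directory.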
seweli
Posts: 25
Joined: 2024-09-29 20:49

Re: Introduction to Panlexia

Post by seweli »

Interesting file.

After reading it, I thought the first steps were:
* Install git and type: "git clone"
* Complete the missing definitions
* Do a pull request
* Duplicate the file and keep only the left column, then add the Pandunia words in a second column when possible
* Do a pull request
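
The "keep only the left column" step could also be done with a few lines of Python instead of by hand. This is just a sketch: it assumes that the left column of the id map holds the Panlexia id, and the function name make_template is made up for the example.

```python
def make_template(id_map_text):
    """Keep only the first column of TSV text and add an empty second column."""
    out = []
    for line in id_map_text.splitlines():
        if not line.strip():
            continue
        key = line.split("\t", 1)[0].strip()
        out.append(f"{key}\t")  # the empty second column is filled in later
    return "\n".join(out) + "\n"


sample = "anatomia:facies.n\tcolumn2\tcolumn3\nanatomia:bucca.n\tcolumn2\n"
print(make_template(sample), end="")
```

The Pandunia words would then be typed in after the tab on each line where a translation exists.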