Turkish NLP, A Gentle Introduction
In this article, I’ll cover some basics of Turkish orthography, morphology and syntax so that vectorizing Turkish text makes more sense. After that, you’ll most probably give up the idea of processing Turkish text immediately 😅 (no worries, I’m here anyway).
The main beauty (and the trouble) of the Turkish language is its word formation. One can derive and inflect, and while these morphological processes are happening, some orthographic changes might also occur. These are typical Turkish words:
kağıt paper
kağıda to the paper
kağıttan from the paper
kağıtçılık process of producing paper, paper industry
kağıtsız without a paper
eklem joint
eklemleşmek to form a joint; to go additive
eklemleştirmek to make sth form a joint
eklemleşmiş ossified
gittim I went
gittin you went
gittiniz you(Plu) went
gitmedim I didn’t go
gittiğimde when I went
gittiğinde when you go
gitmediğimde when/if I didn’t go
gitmezsem if I don’t go
gitmişsiniz you(Plu) have gone (apparently)
Yes, these are word forms from daily spoken language (sorry foreigners 😁). The vast majority of our words that are not monosyllabic are complex. Our main word formation process is suffixation, i.e. forming new words by adding affix(es) to the right of the root. Prefixation also occurs, but it is very limited and mostly found in loan words. As you see, we Turks are quite additive 😀.
Turkish morphotactics is quite complicated. If you’re not intimidated up to this point 😄, you should see what happens to whom in detail:
Vowel Harmony
Vowel harmony restricts which vowels can follow the first syllable vowel in the rest of the word. We have two harmony types, fronting harmony and rounding harmony. As a result, only the following sequences are feasible:
‘a’ can only be followed by ‘a’ or ‘ı’
‘ı’ can only be followed by ‘a’ or ‘ı’
‘o’ can only be followed by ‘a’ or ‘u’
‘u’ can only be followed by ‘a’ or ‘u’
‘e’ can only be followed by ‘e’ or ‘i’
‘i’ can only be followed by ‘e’ or ‘i’
‘ö’ can only be followed by ‘e’ or ‘ü’
‘ü’ can only be followed by ‘e’ or ‘ü’
Then which sequences of these vowels are feasible?
Erik, erek, aramak, buluşmak, gülüşmek, görüşmek, barışmak, oturak, otlamak → YES
Urlaub, Chloe, autonomy, Olek, peach, Führung, coin, initial, David → NO
el, kol, palm → The rule is not applicable to monosyllabic words
Quick Question: Are there exceptions? Sure, loan words and a few native suffixes violate the rule.
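The table above is small enough to turn into a quick check. Here is a minimal Python sketch, not a real phonology tool: the names ALLOWED_NEXT, turkish_lower and obeys_vowel_harmony are made up for illustration, and the check only looks at vowel letters in the written form, so it knows nothing about loan words or the exceptional suffixes.

```python
# Which vowels may follow each vowel, copied from the list above.
ALLOWED_NEXT = {
    "a": "aı", "ı": "aı",
    "o": "au", "u": "au",
    "e": "ei", "i": "ei",
    "ö": "eü", "ü": "eü",
}


def turkish_lower(text):
    """Lowercase with the Turkish I/ı and İ/i pairs handled explicitly."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def obeys_vowel_harmony(word):
    """True if every vowel is allowed to follow the vowel before it.

    Words with fewer than two vowels pass trivially, which mirrors
    "not applicable for monosyllabic words".
    """
    vowels = [ch for ch in turkish_lower(word) if ch in ALLOWED_NEXT]
    return all(nxt in ALLOWED_NEXT[prev] for prev, nxt in zip(vowels, vowels[1:]))


print(obeys_vowel_harmony("görüşmek"))  # True
print(obeys_vowel_harmony("Führung"))   # False
```

A check like this can be handy as a cheap feature for spotting loan words or foreign tokens in Turkish text, as long as you remember that it will also flag the legitimate exceptions.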
Morphology: Welcome to the Wordland
Suffixation
Almost all suffixes have more than one form due to vowel and consonant alternations. The initial consonant in some suffixes and the vowels in almost all suffixes depend on the consonants/vowels that precede them. In other words, Turkish allomorphy means that the surface phonetic realization of a morpheme is bound to the phonological environment in which it occurs. For instance, take the plural inflection lAr:
ev house, evler houses
ayak foot, ayaklar feet
while the perfective suffix “-DI” has 8 forms: geldi (came), gitti (left), gördü (saw), durdu (stayed), koptu (broke off), kaldı (remained)…
The choice between /e/ and /a/ is up to the stem vowel. You will see this allomorph notation in many Turkish NLP papers: lAr, DIr… Vowel alternations are due to vowel harmony, and consonant alternations are due to voiceless/voiced pairs.
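To make the allomorph notation concrete, here is a small Python sketch that spells out a suffix written with the capital letters A, I and D for a given stem. It is an illustration under simplifying assumptions, not a morphological generator: the function name realize and the helper tables are hypothetical, stems are expected in lowercase, and buffer consonants, stem changes and exceptional suffixes are ignored.

```python
FRONT_VOWELS = "eiöü"
VOICELESS = "fstkçşhp"

# High vowel picked for "I", given the closest vowel to its left
# (fronting harmony + rounding harmony).
HIGH_VOWEL = {
    "a": "ı", "ı": "ı", "o": "u", "u": "u",
    "e": "i", "i": "i", "ö": "ü", "ü": "ü",
}


def last_vowel(stem):
    """Return the last vowel letter of the stem."""
    for ch in reversed(stem):
        if ch in HIGH_VOWEL:
            return ch
    raise ValueError(f"no vowel in {stem!r}")


def realize(stem, suffix):
    """Attach a suffix in allomorph notation (A, I, D) to a lowercase stem."""
    word = stem
    vowel = last_vowel(stem)
    for symbol in suffix:
        if symbol == "A":    # e after front vowels, a after back vowels
            vowel = "e" if vowel in FRONT_VOWELS else "a"
            word += vowel
        elif symbol == "I":  # ı, i, u or ü
            vowel = HIGH_VOWEL[vowel]
            word += vowel
        elif symbol == "D":  # t after a voiceless consonant, d otherwise
            word += "t" if word[-1] in VOICELESS else "d"
        else:                # plain letters are copied as they are
            word += symbol
            if symbol in HIGH_VOWEL:
                vowel = symbol
    return word


print(realize("ev", "lAr"))   # evler
print(realize("kop", "DI"))   # koptu
```

Keeping track of the most recent vowel while the suffix is spelled out is what lets a longer sequence such as lAr followed by I harmonize with the suffix vowel rather than with the stem.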
Quick Q: Can stems go through alternations? Yes, the last consonant of the stem can also alternate during suffixation. Basically, a stem-final voiceless consonant alternates with its voiced counterpart when it meets a vowel:
kitap + a → kitaba
ağaç + ı → ağacı
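The two examples above can be sketched in a few lines of Python. Treat this as a toy: the function attach and its alternates flag are hypothetical names, and the flag stands in for a lexicon lookup, because not every stem alternates (at "horse" gives ata, not ada). Cases such as renk → rengi, where k becomes g after n, are not covered.

```python
# Voiceless stem-final consonants and their voiced counterparts.
VOICED = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}
VOWELS = "aeıioöuü"


def attach(stem, suffix, alternates=True):
    """Concatenate stem and suffix, voicing the stem-final consonant
    when the suffix starts with a vowel.

    `alternates` is an assumption of this sketch: a real system would
    look up in a lexicon whether the stem undergoes the alternation.
    """
    if alternates and suffix[0] in VOWELS and stem[-1] in VOICED:
        stem = stem[:-1] + VOICED[stem[-1]]
    return stem + suffix


print(attach("kitap", "a"))    # kitaba
print(attach("ağaç", "ı"))     # ağacı
print(attach("kağıt", "tan"))  # kağıttan (suffix starts with a consonant)
```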
Derivation
The derivational morphemes change the meaning and possibly the word category. It’s possible to derive verb →noun, noun → verb, adj →verb, verb →adj, noun →noun, verb →verb, adj →adj, adj →adv, verb →adv (also see the examples above). Again some examples are:
kitap book
kitap-lık shelf
kitap-çı bookstore employee
böl-mek to divide
böl-üş-mek to share
böl-üş-tür-mek to distribute shares to the others
böl-ü division sign, /
böl-üm episode, part
The derivational morphemes carry meaning and are the main source of semantic analogies. For instance:
deniz sea, denizci sailor; göz eye, gözcü eye doctor
anlam meaning, anlamsız pointless, meaningless; göz eye, gözlük eyeglasses, gözlüksüz without eyeglasses
Free morphemes and bound morphemes both exist in Turkish. However, Turkish bound morphemes carry “more meaning” than their counterparts in other languages (see the next article’s MT parts). In the examples above, “-sIz” corresponds to “-less” / “without”, and “-CI” to “the person who performs it”. Some bound morpheme sequences indeed correspond to individual words in other languages:
-tim: I went (DI+m 1PPast)
-medim: I didn’t go (mA+DI+m 1PNegPast)
-miyordum: I wasn’t going (mA+yor+DI+m 1PNegProgPast)
-mezken: while without (mA+z+iken NegPresAdv)
This is what makes Turkish hard to align with many languages for MT purposes. As you see, one subword can correspond to many words, even without the root. Especially the tense, aspect, mood trio can lead to epic combinations 😉
Inflection
The inflectional morphemes mark the case, person, number, tense…
There are 7 cases in Turkish: nominative, accusative, dative, locative, ablative, genitive and comitative/instrumental.
ev house (nominative)
evi the house (accusative)
eve to the house (dative)
evde at the house (locative)
evden from the house, out of the house (ablative)
evin of the house (genitive)
evle with the house (comitative/instrumental)
Quick Comparison: Cases exist? English: NO, German: YES, Turkish: YES.
Gender Inflection? English: NO, German: YES, Turkish: NO.
Grammatical Gender? English: NO (a bit left: he, she…), German: YES, Turkish: NO.
Natural Gender? Turkish: YES (kral, kraliçe), German: A BIT (profession names and animals; grammatical gender is more important), English: YES.
Quick Notice: We don’t (have to) use personal pronouns, because person markers exist: the verb carries the person information. This can be a headache in chatbot NLU, since identifying the subject of a sentence requires a full morphological analysis of the verb.
Pazartesi gel-di-m. I arrived on Monday.
Pazartesi gel-di-n. You arrived on Monday.
Pazartesi gel-di-k. We arrived on Monday.
Personal pronouns are rarely used in simple sentences. They are included only when one wants to stress who performed the action.
Dün eve geç geldi, ben de nerdeydin diye sordum. Yesterday he came home late, then I asked where he was.
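To give a feeling for why the subject has to be dug out of the verb, here is a deliberately naive Python sketch that reads the person marker off a simple past (-DI) verb form. Everything in it is an assumption for illustration: the function name subject_of_past_verb, the regular expression and the person labels are mine, and it handles only this one tense. It will also happily misfire on nouns such as kedi (cat), which is exactly why real systems use a full morphological analyzer.

```python
import re

# stem + past marker (-DI in its 8 forms) + optional person ending
_PAST = re.compile(r"^(.+?)([dt][ıiuü])(m|n|k|n[ıiuü]z|l[ae]r)?$")

_PERSON = {"m": "1sg", "n": "2sg", "k": "1pl"}


def subject_of_past_verb(verb):
    """Guess the subject person/number of a simple past verb form.

    Returns a label like "1sg", or None if the form does not look
    like stem + DI (+ person ending).
    """
    match = _PAST.match(verb)
    if match is None:
        return None
    ending = match.group(3)
    if ending is None:
        return "3sg"          # bare -DI: he/she/it
    if ending.startswith("l"):
        return "3pl"          # -lAr
    if len(ending) == 3:
        return "2pl"          # -nIz
    return _PERSON[ending]


for form in ["geldim", "geldin", "geldik", "gittiniz", "geldi"]:
    print(form, subject_of_past_verb(form))
```

For the five forms in the loop this prints 1sg, 2sg, 1pl, 2pl and 3sg respectively, matching the Pazartesi examples above.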
Tense, Aspect and Modality
Long story 😄 As you may have noticed above, the long story is told by the suffixes.
Tense: Annem pazartesi gel-di (PF). Annem pazartesi gel-ecek (FUT).
Aspect: Elma ye-di-m. (PF) Elma yi-yor-dum. (IMPF-P.COP)
Modality: Berk bu konu-yu araştır-abil-ir. Berk can/could/may investigate this matter. (this matter-ACC investigate-PSB-AOR)
Berk bu konuyu araştır-mış-tır. Probably Berk investigated this matter. (investigate-PF-GM)
Most of our verbs and gerunds are long compared to their counterparts in other languages, as we need to inflect them a lot 😁 :
Sen okula gitmiyorken de maddi durumumuz kötüydü. Our financial situation was bad even when you were not going to school.
Gitmiş ya da gitmemiş olması önemli değil, önemli olan niyet. It does not matter whether he went there or not; what matters is the intention.
Morpheme Ordering
Morpheme ordering is also important. Not every morpheme can attach to every root; derivational morphemes usually come right after the root and inflectional ones follow them; the negation morpheme always stays close to the root… There’s definitely a language model (LM) of morpheme ordering (keep an eye on this for language modelling purposes) 😉
The inflectional morphemes contribute to the syntax, while the derivational morphemes contribute to the semantics mostly.
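As a tiny taste of the "language model of morpheme ordering" idea, the Python sketch below counts which morpheme follows which in a handful of hyphen-segmented words taken from this article. It is only a counting exercise under my own assumptions: the function name morpheme_bigrams and the boundary markers <w> and </w> are hypothetical, the words are assumed to be segmented already, and six words are of course far too few to model anything.

```python
from collections import Counter


def morpheme_bigrams(segmented_words):
    """Count adjacent morpheme pairs in hyphen-segmented words."""
    counts = Counter()
    for word in segmented_words:
        morphemes = ["<w>"] + word.split("-") + ["</w>"]
        counts.update(zip(morphemes, morphemes[1:]))
    return counts


words = ["gel-di-m", "gel-di-n", "gel-di-k",
         "böl-mek", "böl-üş-mek", "böl-üş-tür-mek"]
counts = morpheme_bigrams(words)

print(counts[("gel", "di")])  # 3: the past marker follows the root
print(counts[("di", "m")])    # 1: the person marker follows the past marker
print(counts[("m", "di")])    # 0: the reverse order never occurs
```

Orders that never show up in the counts, such as a person marker before the tense marker, are exactly the kind of signal a morpheme-level language model can pick up.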
Quick Q: Does Turkish have compounds? Yes, but their number is limited. We really don’t go German: Insurance Company / Sigorta Şirketi / Versicherungsunternehmen (we write them separately).
A bit of Syntax
We use subordinate clauses a lot in daily written/spoken language. See the following noun phrase as the complement of the postposition için/for:
Bunları [Türkiye’nin son yıllarda izlediği politikayı daha iyi anlamak isteyenler] için yazdım. I wrote all of this for [the people who want to understand Turkey’s recent policies better].
Turkish is also a complement-drop language. I mentioned this above, but I want to stress it again:
Sen eve hızlı gittin. You went home fast.
Eve hızlı gittin. You went home fast.
Hızlı gittin. You went fast.
NP: In Turkish, the head of the NP is always the last element of the NP. In this way, the structure of the NP is similar to its German and English counterparts.
Relative Clauses: NPs can also be qualified by relative clauses. In this case, the noun comes in the final position, after the RC. Here, the structure is quite different from the English counterpart.
Arkadaşım Istanbul’da çalışıyor. My friend works in Istanbul. / Mein Freund arbeitet in Istanbul.
Istanbul’da çalışan arkadaşım. My friend who works in Istanbul / Mein Freund, der in Istanbul arbeitet
PP: Okay, the structure of the postpositional phrase is very different in Turkish. First of all, WE DON’T HAVE PREPOSITIONS; instead we use case-marked nouns and postpositions. See:
masanın altında under the table
Cumadan sonra after Friday
senden önce, senden sonra before you, after you
bensiz, sensiz, annemsiz without me, without you, without my mom
evden from the house
evde at the house
evin çiçekleri flowers of the house
arkadaşıyla with her friend / mit ihrer Freundin
kötü havaya rağmen despite the bad weather / trotz schweren Unwetters
German also has postpositions, but their number is quite limited compared to its prepositions. In Turkish, a PP consists of a postposition as the head and an NP. The postposition always takes the final position. Postpositions can be used with nominative, accusative, genitive, dative and ablative complements.
Here are some more postposition examples:
Nominative:
Sarmaşık bahçe duvarı boyunca devam ediyor. The ivy runs along the garden wall.
Benim gibi uzun boylu ve inceydi. She was tall and thin just like me.
Sizler kadar akıllı olamadık. We could not be as intelligent as you!
Genitive
Her şeyi senin için yaptım. I did it all for you.
Accusative
Basın açıklamasını müteakip gazeteciler sorularını yöneltti. The reporters asked their questions after the press statement.
Dative
Işığa doğru yürüdüm. I walked towards the light.
Bugüne kadar kimse böyle bir şey görmemişti. Nobody had seen anything like this until today.
Ablative
Bu nedenlerden dolayı anlaşmayı iptal ettik. We canceled the contract for these reasons.
Dünden beri konuşmuyoruz. We haven’t spoken to each other since yesterday.
Quick Remark: We don’t have postposition attachment ambiguity. In Turkish, morphosyntactic cues signal the attachment. Compare:
Kanepedeki gazeteyi okudum. I read [the newspaper on the couch]. / Couch-LOC-REL newspaper-ACC read-Past1Per
Kanepede gazeteyi okudum. I read the newspaper [on the couch]. / Couch-LOC newspaper-ACC read-Past1Per
I ate the cake with a spoon. Keki kaşıkla yedim.
I ate the cake with icing. Kremalı keki yedim.
Verb Infinitive Forms: We form the infinitive of a verb by attaching the -mAk suffix.
Doktora gitmek yerine tatile çık. Go on vacation instead of going to the doctor. / doctor-DAT go-INF instead of vacation-DAT go-PERS
Case suffixes are allowed after the infinitive form except the genitive case.
Oturmaktan bıktım. I got sick and tired of sitting. / sit-INF-ABL tire-TENSE-PERS
VP with Adverbial Use: Verb stems can function as adverbs with proper suffixation. See:
Okula kadar koşarak gittim. I went to school running. / school-DAT until run go-TENSE-PERS
Ödevimi evde unutup gelmişim. I came, having forgotten my homework at home. / homework-POSS-ACC home-LOC forget come-TENSE-PERS
Sen gelince göle gideriz. We will go to the lake when you come. / You come lake-DAT go-TENSE-PERS
Negation: Negation in Turkish is also mostly achieved by suffixation. Gerundive forms, as well as verbs, can take the negation suffix mA. Notice how we can move the negation in the last 2 sentences and still achieve the same semantics:
Oraya gittim. [go+Past1Per] I went there.
Oraya gitmedim. [go+NegPast1Per] I did not go there.
Oraya gidemedim. [go+AbleNegPast1Per] I could not go there.
Okula gitmeyip evde kalmış. [go+Neg+Ger] school+DAT go+NEG+GER home+LOC stay+TENSE+PERS / He stayed home instead of going to school.
Oraya gidip gitmemek sana kalmış. [go+Neg+Inf] there go+GER go+NEG+INF you+DAT stay+TENSE+PERS / It’s up to you whether to go there or not.
Bunları ona söylememeni istiyorum. [tell+Neg+Ger+ACC] these+ACC him+DAT tell+NEG+GER want+TENSE+PERS / I want you not to tell him these.
Bunları ona söylemeni istemiyorum. [want+Neg+Prog+1Per] these+ACC him+DAT tell+GER want+NEG+TENSE+PERS / I do not want you to tell him these.
The negation suffix has the same surface form as the gerundive suffix. Also, both suffixes go directly after the root. The two can be told apart from the morphology and the context.
Okuman düzelmiş. [Oku+GER+POSS] Your reading got better.
In this example, the possessive suffix comes after the suffix ma. Hence one can deduce that there should be something noun-like before the possessive suffix, so ma here is not the negation particle; it’s the gerund-maker.
Dear reader, if you reached the end of this article, I’d like to congratulate you on your patience 😄. You successfully read some highlights of Turkish morphology and syntax. Turkish might look intimidating at first, but according to my own natural language theories 😄, no such thing as an easy language exists. All natural languages are equally difficult: sometimes it is the syntax, sometimes the morphology, sometimes the phonology, but one cannot find a language in which all 3 of these elements are “easy”. Turkish has lots of great material by great researchers and definitely offers its own beauty to NLP researchers.
Now you are ready to read the “how to process, vectorize and model Turkish” article and continue to have a great time. Remember, it takes patience and subwords to process Turkish text!