Turkish NLP, A Gentle Introduction

Duygu ALTINOK
Jun 18, 2018

In this article, I’ll cover some basics of Turkish orthography, morphology and syntax, as background for vectorizing Turkish text. After that, you’ll most probably give up the idea of processing Turkish text immediately 😅 (no worries, I’m here anyway).

The main beauty (and the trouble) of the Turkish language is its word formation. Words can be derived and inflected, and while these morphological processes happen, some orthographic changes might also occur. These are typical Turkish words:

kağıt paper

kağıda to the paper

kağıttan from the paper

kağıtçılık process of producing paper, paper industry

kağıtsız without a paper

eklem joint

eklemleşmek to form a joint; to go additive

eklemleştirmek to make sth form a joint

eklemleşmiş ossified

gittim I went

gittin you went

gittiniz you (plural) went

gitmedim I didn’t go

gittiğimde when I went

gittiğinde when you go

gitmediğimde when/if I didn’t go

gitmezsem if I don’t go

gitmişsiniz you (plural) have gone (apparently)

Yes, these are word forms from the daily spoken language (sorry foreigners 😁). The vast majority of our words that are not monosyllabic are complex. Our main word formation process is suffixation, i.e. forming new words by adding affix(es) to the right of the root. Prefixation also occurs, but it is very limited and mostly found in loan words. As you see, we Turks are quite additive 😀.

Turkish morphotactics is quite complicated. If you’re not intimidated up to this point 😄, you should see what happens to whom in detail:

Vowel Harmony

Vowel harmony restricts which vowels can follow the vowel of the first syllable in the rest of the word. We have two harmony types, fronting harmony and rounding harmony. As a result, only the following sequences are feasible:

‘a’ can only be followed by ‘a’ or ‘ı’
‘ı’ can only be followed by ‘a’ or ‘ı’
‘o’ can only be followed by ‘a’ or ‘u’
‘u’ can only be followed by ‘a’ or ‘u’
‘e’ can only be followed by ‘e’ or ‘i’
‘i’ can only be followed by ‘e’ or ‘i’
‘ö’ can only be followed by ‘e’ or ‘ü’
‘ü’ can only be followed by ‘e’ or ‘ü’

Then which sequences of these vowels are feasible?

Erik, erek, aramak, buluşmak, gülüşmek, görüşmek, barışmak, oturak, otlamak → YES

Urlaub, Chloe, autonomy, Olek, peach, Führung, coin, initial, David → NO

el, kol, palm → The rule is not applicable to monosyllabic words
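
The table above can be turned into a few lines of Python. This is only a minimal sketch: the names `ALLOWED_NEXT` and `obeys_vowel_harmony` are made up for illustration, the check looks at vowel sequences only, and it knows nothing about the loan words and exceptional suffixes that break the rule in real text.

```python
# Allowed successors for each vowel, copied from the list above.
ALLOWED_NEXT = {
    "a": "aı", "ı": "aı",
    "o": "au", "u": "au",
    "e": "ei", "i": "ei",
    "ö": "eü", "ü": "eü",
}


def obeys_vowel_harmony(word: str) -> bool:
    """Return True if every vowel-to-vowel transition in `word` is allowed.

    Monosyllabic words have no transition to check, so they pass trivially.
    Caveat: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı"),
    so this sketch assumes input that lowercases safely.
    """
    vowels = [ch for ch in word.lower() if ch in ALLOWED_NEXT]
    return all(nxt in ALLOWED_NEXT[prev] for prev, nxt in zip(vowels, vowels[1:]))


if __name__ == "__main__":
    for w in ["erik", "buluşmak", "otlamak", "Urlaub", "coin", "el"]:
        print(w, obeys_vowel_harmony(w))
```

Run on the word lists above, the function accepts every word in the YES group and rejects every word in the NO group; the monosyllabic ones pass simply because there is nothing to compare.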

Quick Question: Are there exceptions? Sure, loan words and a few native suffixes violate the rule.

Morphology: Welcome to the Wordland

Suffixation

Almost all suffixes have more than one form due to vowel and consonant alternations. The initial consonant of some suffixes and the vowels of almost all suffixes depend on the consonants/vowels that precede them. Hence Turkish allomorphy is phonologically conditioned: the surface realization of a morpheme is determined by the phonological environment in which it occurs. For instance, take the plural suffix lAr:

ev house, evler houses

ayak foot, ayaklar feet

while the perfective suffix “-DI” has 8 forms: geldi (came), gitti (left), gördü (saw), durdu (stayed), koptu (broke off), kaldı (remained)…

The choice between /e/ and /a/ is up to the stem vowel. You will see this allomorph notation in many Turkish NLP papers (lAr, DIr…), where a capital letter stands for a sound that alternates. Vowel alternations are due to vowel harmony, and consonant alternations are due to voiceless/voiced pairs.
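
To make the lAr and DI notation concrete, here is a small Python sketch that picks the right allomorph for a stem. The helper names (`last_vowel`, `plural`, `past`) are hypothetical, and the rules are deliberately naive: they only look at the last vowel and the last letter of the stem, so irregular stems and loan words are out of scope.

```python
VOWELS = "aıoueiöü"
BACK_VOWELS = "aıou"
VOICELESS = "çfhkpsşt"

# "I" in DI: a high vowel chosen by the last vowel of the stem.
HIGH_VOWEL = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def last_vowel(stem: str) -> str:
    """Return the last vowel of the stem (it drives vowel harmony)."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {stem!r}")


def plural(stem: str) -> str:
    """-lAr: 'lar' after a back vowel, 'ler' after a front vowel."""
    return stem + ("lar" if last_vowel(stem) in BACK_VOWELS else "ler")


def past(stem: str) -> str:
    """-DI: 't' after a voiceless consonant, otherwise 'd', plus a harmonized vowel."""
    d = "t" if stem[-1] in VOICELESS else "d"
    return stem + d + HIGH_VOWEL[last_vowel(stem)]


if __name__ == "__main__":
    print(plural("ev"), plural("ayak"))
    print([past(s) for s in ["gel", "git", "gör", "dur", "kop", "kal"]])
```

Two choices for D times four choices for I is where the 8 surface forms of “-DI” come from.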

Quick Q: Can stems go through alternations? Yes, the last consonant of the stem can also alternate during suffixation (this is tied to final devoicing). Basically, a stem-final voiceless consonant alternates with its voiced counterpart when it meets a vowel:

kitap + a → kitaba

ağaç + ı → ağacı
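
The alternation can be sketched in Python as follows. The `attach` function and the `VOICED` table are illustrative names, not part of any library. The big assumption is that every stem alternates, which is not true: many stems (a lot of monosyllabic ones, for instance) keep their final consonant, so a real system needs a lexicon that marks which stems alternate.

```python
VOWELS = "aıoueiöü"

# Voiceless stem-final consonant -> its counterpart before a vowel.
VOICED = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}


def attach(stem: str, suffix: str) -> str:
    """Attach a suffix, voicing the stem-final consonant before a vowel.

    Naive on purpose: it assumes that every stem alternates.
    """
    if suffix and suffix[0] in VOWELS and stem[-1] in VOICED:
        stem = stem[:-1] + VOICED[stem[-1]]
    return stem + suffix


if __name__ == "__main__":
    print(attach("kitap", "a"))    # vowel-initial suffix: the consonant alternates
    print(attach("kağıt", "tan"))  # consonant-initial suffix: the stem is unchanged
```

The same function reproduces the kağıt examples from the beginning of the article: the dative suffix triggers the alternation, the ablative one does not.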

Derivation

The derivational morphemes change the meaning and possibly the word category. It’s possible to derive verb → noun, noun → verb, adj → verb, verb → adj, noun → noun, verb → verb, adj → adj, adj → adv, verb → adv (also see the examples above). Again, some examples are:

kitap book

kitap-lık shelf

kitap-çı bookstore employee

böl-mek to divide

böl-üş-mek to share

böl-üş-tür-mek to distribute shares to the others

böl-ü division sign, /

böl-üm episode, part

The derivational morphemes carry meaning and are the main source of semantic analogies. For instance:

deniz sea, denizci sailor; göz eye, gözcü eye doctor

anlam meaning, anlamsız pointless, meaningless; göz eye, gözlük eyeglasses, gözlüksüz without eyeglasses

Free morphemes vs bound morphemes also exist in Turkish. However, Turkish bound morphemes carry “more meaning” than their counterparts in other languages (see the next article’s MT parts). In the examples above, “-siz” corresponds to “-less”/“without” and “-ci” to “the person who performs”. Some bound morpheme sequences indeed correspond to whole words in other languages:

-tim: I went (DI+m 1PPast)

-medim: I didn’t go (mA+DI+m 1PNegPast)

-miyordum: I wasn’t going (mA+yor+DI+m 1PNegProgPast)

-mezken: while without (mA+z+iken NegPresAdv)

This is what makes Turkish hard to align with many languages for MT purposes. As you see, one subword can correspond to many words, even without the root. Especially the tense, aspect, mood trio can lead to epic combinations 😉

Inflection

The inflectional morphemes mark the case, person, number, tense…

There are 7 cases in Turkish: nominative, accusative, dative, locative, ablative, genitive and comitative/instrumental.

ev house nominative

evi house accusative

eve to the house

evde at the house

evden from the house, out of the house

evin of the house

evle with the house
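
The case paradigm can be generated with the same harmony rules as before. Below is a minimal Python sketch; `case_forms` is a hypothetical helper, and it assumes a consonant-final stem that does not alternate (no buffer consonants, no kitap → kitab-), which is enough for ev but far from a full generator.

```python
VOWELS = "aıoueiöü"
BACK_VOWELS = "aıou"
VOICELESS = "çfhkpsşt"
HIGH_VOWEL = {
    "a": "ı", "ı": "ı", "e": "i", "i": "i",
    "o": "u", "u": "u", "ö": "ü", "ü": "ü",
}


def case_forms(stem: str) -> dict:
    """Return the 7 case forms of a consonant-final, non-alternating stem."""
    if stem[-1] in VOWELS:
        raise ValueError("this sketch only handles consonant-final stems")
    v = next(ch for ch in reversed(stem) if ch in VOWELS)  # last stem vowel
    A = "a" if v in BACK_VOWELS else "e"         # two-way vowel (a/e)
    I = HIGH_VOWEL[v]                            # four-way vowel (ı/i/u/ü)
    D = "t" if stem[-1] in VOICELESS else "d"    # d/t alternation
    return {
        "nominative": stem,
        "accusative": stem + I,
        "dative": stem + A,
        "locative": stem + D + A,
        "ablative": stem + D + A + "n",
        "genitive": stem + I + "n",
        "instrumental": stem + "l" + A,
    }


if __name__ == "__main__":
    for case, form in case_forms("ev").items():
        print(f"{case:13} {form}")
```

Trying it on a back-vowel stem such as okul shows harmony at work in every suffix at once.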

Quick Comparison: Cases exist? English: NO, German: YES, Turkish: YES.

Gender Inflection? English: NO, German: YES, Turkish: NO.

Grammatical Gender? English: NO (a bit is left: he, she…), German: YES, Turkish: NO.

Natural Gender? Turkish: YES (kral, kraliçe), German: A BIT (profession names and animals, grammatical gender is more important), English: YES.

Quick Notice: We don’t (have to) use personal pronouns, because person markers exist; the verb carries the person information. This can be a headache in chatbot NLU: understanding the subject of a sentence requires a full morphological analysis of the verb.

Pazartesi gel-di-m. I arrived on Monday.

Pazartesi gel-di-n. You arrived on Monday.

Pazartesi gel-di-k. We arrived on Monday.
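
To see why this is a headache, here is a toy Python sketch that guesses the subject from a simple past verb form by pattern matching on the suffixes. `guess_subject` and the regular expression are my own illustration, not a real analyzer: the pattern only covers the past suffix -DI followed by a person ending.

```python
import re

# stem + past suffix (-DI) + optional person ending.
PAST_FORM = re.compile(
    r"(?P<stem>.+?)(?P<past>[dt][ıiuü])(?P<person>m|n|k|n[ıiuü]z|l[ae]r)?"
)

SINGLE_LETTER = {"m": "I", "n": "you", "k": "we"}


def guess_subject(verb: str):
    """Guess the subject of a simple past verb form, or return None."""
    match = PAST_FORM.fullmatch(verb.lower())
    if match is None:
        return None
    person = match.group("person")
    if person is None:
        return "he/she/it"  # third person singular has no overt ending
    if person.endswith("z"):
        return "you (plural)"
    if person in ("lar", "ler"):
        return "they"
    return SINGLE_LETTER[person]


if __name__ == "__main__":
    for verb in ["geldim", "geldin", "geldik", "gittiniz", "geldi"]:
        print(verb, "->", guess_subject(verb))
```

The weakness of such surface matching shows up quickly: a noun like kedim (my cat) also ends in something that looks like -di-m, and the function happily reports a first person subject for it. That is exactly why a full morphological analysis is needed instead of suffix stripping.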

Personal pronouns are rarely used in simple sentences. They are included only if one wants to stress who performs the action.

Dün eve geç geldi, ben de nerdeydin diye sordum. Yesterday he came home late, then I asked where he was.

Tense, Aspect and Modality

Long story 😄 As you may have noticed above, the long story is told by the suffixes.

Tense: Annem pazartesi gel-di (PF). Annem pazartesi gel-ecek (FUT).

Aspect: Elma ye-di-m. (PF) Elma yi-yor-dum. (IMPF-P.COP)

Modality: Berk bu konu-yu araştır-abil-ir. Berk can/could/may investigate this matter. (this matter-ACC investigate-PSB-AOR)

Berk bu konuyu araştır-mış-tır. Probably Berk investigated this matter. (investigate-PF-GM)

Most of our verbs and gerunds are long compared to their counterparts in other languages, as we need to inflect them a lot 😁 :

Sen okula gitmiyorken de maddi durumumuz kötüydü. Our financial situation was bad even when you were not going to school.

Gitmiş ya da gitmemiş olması önemli değil, önemli olan niyet. It does not matter whether he went there or not; what matters is the intention.

Morpheme Ordering

Morpheme ordering is also important. Not every morpheme can attach to every root; derivational morphemes usually come right after the root and inflectional ones follow them; the negation morpheme always stays close to the root… There is definitely a language model (LM) of morpheme ordering (keep an eye on this for language modelling purposes) 😉
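
As a toy illustration of what an LM of morpheme ordering could look like, here is a bigram model over suffix tags in Python. Everything here is an assumption made for the sketch: the function names are hypothetical, the three training sequences are just the glosses used earlier in this article, and the estimates are plain unsmoothed counts. A real model would be trained on a morphologically analyzed corpus.

```python
from collections import Counter, defaultdict

# Suffix sequences taken from the glosses earlier in this article.
SEQUENCES = [
    ["DI", "m"],               # -tim
    ["mA", "DI", "m"],         # -medim
    ["mA", "yor", "DI", "m"],  # -miyordum
]


def train_bigrams(sequences):
    """Count which morpheme follows which, with root/end markers."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<root>"] + list(seq) + ["<end>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def sequence_prob(counts, seq):
    """Unsmoothed bigram probability of a suffix sequence."""
    padded = ["<root>"] + list(seq) + ["<end>"]
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        followers = counts.get(prev)
        if not followers:
            return 0.0  # this context was never seen in training
        prob *= followers[nxt] / sum(followers.values())
    return prob


if __name__ == "__main__":
    model = train_bigrams(SEQUENCES)
    print(sequence_prob(model, ["mA", "DI", "m"]))  # attested order
    print(sequence_prob(model, ["DI", "mA", "m"]))  # negation too far from the root
```

Even with three training examples, the model gives a positive probability to the attested order and zero to the order that moves the negation away from the root.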

The inflectional morphemes contribute to the syntax, while the derivational morphemes contribute to the semantics mostly.

Quick Q: Does Turkish have compounds? Yes, but the number is limited. We really don’t go German: Insurance Company / Sigorta Şirketi / Versicherungsunternehmen (we write the words separately).

A bit of Syntax

We use subordinate clauses a lot in daily written/spoken language. See the following noun phrase as the complement of the postposition için/for:

Bunları [Türkiye’nin son yıllarda izlediği politikayı daha iyi anlamak isteyenler] için yazdım. I wrote all of this for [those who want to better understand the policy Turkey has followed in recent years].

Turkish is also a complement-drop language. I mentioned this above, but I want to stress it again:

Sen eve hızlı gittin. You went home fast.

Eve hızlı gittin. You went home fast.

Hızlı gittin. You went fast.

NP: In Turkish, the head of the NP is always the last element of the NP. In this way, the structure of the NP is similar to its German & English counterparts.

Relative Clauses: NPs can also be qualified by relative clauses. In this case, the head noun takes the final position, after the RC. Here, the structure is quite different from its English counterpart.

Arkadaşım Istanbul’da çalışıyor. My friend works in Istanbul. / Mein Freund arbeitet in Istanbul.

Istanbul’da çalışan arkadaşım. My friend who works in Istanbul / Mein Freund, der in Istanbul arbeitet

PP: Okay, the structure of the postpositional phrase is very different in Turkish. First of all, WE DON’T HAVE PREPOSITIONS; instead we use case-marked nouns and postpositions. See:

masanın altında under the table

Cumadan sonra after Friday

senden önce, senden sonra before you, after you

bensiz, sensiz, annemsiz without me, without you, without my mom

evden from the house

evde at the house

evin çiçekleri flowers of the house

arkadaşıyla with her friend / mit ihrer Freundin

kötü havaya rağmen despite the bad weather / trotz schweren Unwetters

German also has postpositions, but their number is quite limited compared to the prepositions. In Turkish, a PP consists of a postposition as the head and an NP. The postposition always takes the final position. Postpositions can be used with nominative, accusative, genitive, dative and ablative complements.

Here are some more postposition examples:

Nominative

Sarmaşık bahçe duvarı boyunca devam ediyor. The ivy runs along the garden wall.

Benim gibi uzun boylu ve inceydi. She was tall and thin just like me.

Sizler kadar akıllı olamadık. We could not be as intelligent as you!

Genitive

Her şeyi senin için yaptım. I did it all for you.

Accusative

Basın açıklamasını müteakip gazeteciler sorularını yöneltti. The reporters asked their questions after the press statement.

Dative

Işığa doğru yürüdüm. I walked towards the light.

Bugüne kadar kimse böyle bir şey görmemişti. Nobody had seen anything like this until today.

Ablative

Bu nedenlerden dolayı anlaşmayı iptal ettik. We canceled the contract for these reasons.

Dünden beri konuşmuyoruz. We haven’t spoken to each other since yesterday.

Quick Remark: We don’t have postposition attachment ambiguity. In Turkish, morphosyntactic cues signal the attachment. Compare:

Kanepedeki gazeteyi okudum. I read [the newspaper on the couch]. / Couch-LOC-REL newspaper-ACC read-Past1Per

Kanepede gazeteyi okudum. I read the newspaper [on the couch]. / Couch-LOC newspaper-ACC read-Past1Per

I ate the cake with a spoon. Keki kaşıkla yedim.

I ate the cake with icing. Kremalı keki yedim.

Verb Infinitive Forms: We form the infinitive of a verb by attaching the -mAk suffix.

Doktora gitmek yerine tatile çık. Go on vacation instead of going to the doctor. / doctor-DAT go-INF instead of vacation-DAT go-PERS

Case suffixes are allowed after the infinitive form except the genitive case.

Oturmaktan bıktım. I got sick and tired of sitting. / sit-INF-ABL tire-TENSE-PERS

VP with Adverbial Use: Verb stems can function as adverbs with proper suffixation. See:

Okula kadar koşarak gittim. I went all the way to school running. / school-DAT until run go-TENSE-PERS

Ödevimi evde unutup gelmişim. I came, having forgotten my homework at home. / homework-POSS-ACC home-LOC forget come-TENSE-PERS

Sen gelince göle gideriz. We’ll go to the lake when you come. / You come lake-DAT go-TENSE-PERS

Negation: Negation in Turkish is also achieved mostly by suffixation. Gerundive forms, as well as verbs, can take the negation suffix mA. Notice how we can move the negation from one verb to the other in the last 2 sentences and still get roughly the same semantics:

Oraya gittim. [go+Past1Per] I went there.

Oraya gitmedim. [go+NegPast1Per] I did not go there.

Oraya gidemedim. [go+AbleNegPast1Per] I could not go there.

Okula gitmeyip evde kalmış. [go+Neg+Ger] school+DAT go+NEG+GER home+LOC stay+TENSE+PERS / He stayed home instead of going to school.

Oraya gidip gitmemek sana kalmış. [go+Neg+Inf] there go+GER go+NEG+INF you+DAT stay+TENSE+PERS / It’s up to you whether to go there or not.

Bunları ona söylememeni istiyorum. [tell+Neg+Ger+ACC] these+ACC him+DAT tell+NEG+GER want+TENSE+PERS / I want you not to tell him these.

Bunları ona söylemeni istemiyorum. [want+Neg+Prog+1Per] these+ACC him+DAT tell+GER want+NEG+TENSE+PERS / I do not want you to tell him these.

The negation suffix has the same surface form as the gerund suffix, and both attach directly after the root. The two can be told apart from the surrounding morphology and the context.

Okuman düzelmiş. [Oku+GER+POSS] Your reading got better.

In this example, the possessive suffix comes after the suffix ma. A possessive suffix needs a noun-like element before it, so one can deduce that ma here is not the negation suffix but the gerund-maker.

Dear reader, if you have reached the end of this article, I’d like to congratulate you on your patience 😄. You have successfully read some highlights of Turkish morphology and syntax. Turkish might look intimidating at first, but according to my own natural language theories 😄, no such thing as an easy language exists. All natural languages are equally difficult; sometimes it is the syntax, sometimes the morphology, sometimes the phonology, but one cannot find a language in which all 3 of these elements are “easy”. Turkish has lots of great material by great researchers and definitely offers its own beauty to NLP researchers.

Now you are ready to read the “how to process, vectorize and model Turkish” article and continue to have a great time. Remember, it takes patience and subwords to process Turkish text!

Duygu ALTINOK

Senior NLP Engineer from Berlin. Deepgram member, spaCy contributor. Enjoys quality code, in love with grep. Youtube: https://www.youtube.com/c/NLPwithDuygu