Morphology with CRP/HDP

This page turns a dense Bayesian NLP paper into something you can walk through. Instead of treating morphology as a wall of symbols, it shows how words break apart, how reusable pieces cluster, and how a generative story becomes a runnable model on English, Finnish, and Turkish data.

Why This Matters

Agglutinative languages create many word forms, so word-level vocabularies become sparse. Splitting words into morphemes is a practical fix.

What DP/HDP Adds

A Dirichlet Process lets new morphemes appear naturally, without fixing the vocabulary size in advance. An HDP adds contextual dependency: unigram, bigram, and trigram transition distributions tied together with principled smoothing.

What CRP Shows

Each new token either joins an existing table or opens a new one. α controls this novelty pressure and affects segment reuse.

DP, CRP, and HDP: Theory to Model Design

Dirichlet Process (DP). Let G ∼ DP(α, G₀). This defines a distribution over distributions: the concentration parameter α controls novelty and G₀ is the base distribution. The posterior predictive has a closed-form mixture:

P(\theta_{n+1}\in A\mid\theta_{1:n})=\frac{\alpha}{\alpha+n}\,G_0(A)+\sum_k\frac{n_k}{\alpha+n}\,\delta_{\phi_k}(A)

What this means in plain terms. You do not fix the number of morpheme types in advance. The model can always create a new type, but doing so becomes less likely as data accumulates. Early in training, novelty is easy; later, reuse dominates unless the evidence strongly supports a new morpheme.
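
That predictive rule is easy to simulate. The sketch below draws repeatedly from the posterior predictive, with random three-letter strings standing in for a base distribution over morpheme shapes (everything here is illustrative, not the paper's model):

```python
import random

def dp_predictive_draw(draws, alpha, base_sampler, rng):
    """One draw from the DP posterior predictive: a fresh atom from the
    base distribution G0 with probability alpha/(alpha+n), otherwise a
    uniform pick from past draws (i.e. reuse proportional to counts)."""
    n = len(draws)
    if n == 0 or rng.random() < alpha / (alpha + n):
        return base_sampler()
    return rng.choice(draws)

# Toy base distribution: random 3-letter strings stand in for G0.
rng = random.Random(0)
base = lambda: "".join(rng.choice("abcdefgh") for _ in range(3))

draws = []
for _ in range(300):
    draws.append(dp_predictive_draw(draws, alpha=1.5, base_sampler=base, rng=rng))

# Reuse dominates: far fewer distinct atoms than draws.
print(len(draws), len(set(draws)))
```

After 300 draws with α = 1.5, only a handful of distinct atoms survive, exactly the "rich get richer" behavior described above.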

Chinese Restaurant Process (CRP). The CRP is the exchangeable partition induced by the DP: P(z_{n+1} = k | z_{1:n}) = n_k/(n+α) for an existing table k, and P(new) = α/(n+α). The expected number of tables grows approximately logarithmically with data size:

\mathbb{E}[K_n]\approx\alpha\log\left(\frac{n+\alpha}{\alpha}\right)

Worked CRP Example

Assume n = 9, α = 1.5, and current table counts (5, 3, 1). Then:

  • table 1: 5/(9+1.5) ≈ 0.476
  • table 2: 3/(9+1.5) ≈ 0.286
  • table 3: 1/(9+1.5) ≈ 0.095
  • new table: 1.5/(9+1.5) ≈ 0.143

This is exactly what your visualization animates: most customers join dense tables, but some probability mass remains for new ones.
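
The same arithmetic, in code:

```python
alpha, counts = 1.5, [5, 3, 1]
n = sum(counts)                        # 9 customers already seated
denom = n + alpha
# One entry per existing table, plus the new-table probability at the end.
probs = [c / denom for c in counts] + [alpha / denom]
print([round(p, 3) for p in probs])    # → [0.476, 0.286, 0.095, 0.143]
```

The four probabilities always sum to one, so α directly trades reuse mass against novelty mass.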

Hierarchical DP (HDP). For sequential morphology, each context c gets a context-specific distribution that shares global atoms: G₀ ∼ DP(γ, H), G_c ∼ DP(α, G₀). This creates principled backoff: if a trigram context is sparse, probability mass is inherited through the bigram and unigram levels instead of collapsing to zero.

Why hierarchy matters. In language, the next morpheme depends on context. For Turkish, after a plural + possessive pattern, case suffix probabilities are very different from the global average. HDP lets that local context specialize without overfitting, because each context still shares global atoms.
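
A compact way to see the backoff effect is a two-level predictive that interpolates context counts with a global distribution. This is a simplified stand-in for full HDP inference (no seating arrangements are tracked), and all counts below are invented for illustration:

```python
def backoff_prob(m, context_counts, global_counts, alpha, gamma, vocab_size):
    """Two-level predictive probability of morpheme m: context-level counts
    back off to global counts, which back off to a uniform base over
    vocab_size morpheme types."""
    g_total = sum(global_counts.values())
    p_global = (global_counts.get(m, 0) + gamma / vocab_size) / (g_total + gamma)
    c_total = sum(context_counts.values())
    return (context_counts.get(m, 0) + alpha * p_global) / (c_total + alpha)

# Made-up counts: "-den" is common after a plural+possessive context,
# rarer globally.  Numbers are illustrative, not from the paper.
global_counts = {"-ler": 130, "-de": 92, "-den": 40}
rich_ctx = {"-den": 8, "-de": 1}   # observed after (plural, possessive)
empty_ctx = {}                     # an unseen trigram context

p_rich = backoff_prob("-den", rich_ctx, global_counts, alpha=2.0, gamma=1.0, vocab_size=50)
p_empty = backoff_prob("-den", empty_ctx, global_counts, alpha=2.0, gamma=1.0, vocab_size=50)
print(round(p_rich, 3), round(p_empty, 3))
```

A context with evidence specializes sharply toward "-den", while an unseen context falls back to the global distribution rather than assigning zero probability.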

Morphological segmentation mapping. A word is decoded as a morpheme sequence m_{1:T}. The unigram model assumes P(m_t); the bigram uses P(m_t | m_{t-1}); the trigram uses P(m_t | m_{t-2}, m_{t-1}). The HDP prior smooths all three, enabling rare but legal morphotactic patterns.

Example: English

unhappiness can be segmented as un + happi + ness. A unigram model can often find frequent pieces, but bigram/trigram context helps choose boundary positions that create linguistically plausible chains.
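
A toy scorer makes this concrete. The probabilities below are invented for illustration (a trained DP/HDP model would supply real ones):

```python
import math

# Invented toy probabilities, not from any trained model.
unigram = {"un": 0.05, "happi": 0.01, "ness": 0.04,
           "unhappi": 0.02, "happiness": 0.005}
bigram = {("<s>", "un"): 0.30, ("un", "happi"): 0.20,
          ("happi", "ness"): 0.40, ("<s>", "unhappi"): 0.01,
          ("unhappi", "ness"): 0.05, ("un", "happiness"): 0.02}

FLOOR = 1e-8  # crude floor for unseen events (HDP smoothing would do better)

def unigram_score(seg):
    """Log-probability assuming independent morpheme choices."""
    return sum(math.log(unigram.get(m, FLOOR)) for m in seg)

def bigram_score(seg):
    """Log-probability with each morpheme conditioned on its predecessor."""
    prev, total = "<s>", 0.0
    for m in seg:
        total += math.log(bigram.get((prev, m), FLOOR))
        prev = m
    return total

candidates = [["un", "happi", "ness"], ["unhappi", "ness"], ["un", "happiness"]]
best = max(candidates, key=bigram_score)
print(best)
```

With these toy numbers the unigram scorer actually prefers unhappi + ness, while bigram context recovers un + happi + ness, which is the boundary-placement effect described above.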

Example: Turkish

evlerimizden can be segmented as ev + ler + imiz + den. The sequence constraints between plural, possessive, and ablative morphemes are exactly the signal captured better by HDP than independent unigram choices.

Interpretation for your results. Turkish and Finnish benefit because their morpheme ordering constraints are stronger. Moving from unigram DP to higher-order HDP captures these constraints, increasing boundary-level F1 while remaining Bayesian and interpretable.

How to read this page while experimenting

  • Increase α in the simulation: more new tables appear.
  • Switch model order in Train/Test: compare unigram vs bigram vs trigram metrics.
  • Check Boundary F1: this is the core segmentation quality indicator.
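
Boundary F1 compares the character positions where the gold and predicted segmentations place splits. A minimal sketch:

```python
def boundaries(segments):
    """Set of internal split positions (character offsets) in a segmentation."""
    pos, cuts = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, pred):
    """Harmonic mean of boundary precision and recall."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["ev", "ler", "imiz", "den"]      # evlerimizden, gold analysis
pred = ["ev", "ler", "imizden"]          # under-segmented prediction
print(round(boundary_f1(gold, pred), 2))  # → 0.8
```

Here the prediction finds two of the three gold boundaries and proposes no spurious ones, so precision is 1.0, recall is 2/3, and F1 is 0.8.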

Visualization Controls

Higher α creates more new tables.

Base Distribution Snapshot

  • -lar/-ler: 130
  • -de/-da: 92
  • -in/-ın: 88
  • -lik/-lık: 64
  • -ci/-cı: 44