Zipf's Law
The law #
The Zipf-Mandelbrot law says that the frequency of the $n$th most common word in a sufficiently long sample of natural language follows a shifted power law: $f \propto (n + \beta)^\alpha.$ Empirically, it’s found that $\beta \approx 2.7$ and $\alpha \approx -1$ across all human languages.
(1)Lest you should worry that this distribution is not normalizable, remember that no natural language has infinitely many words.
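For concreteness, here’s a minimal sketch of the distribution the law predicts, with the vocabulary size chosen arbitrarily by me:

```python
import numpy as np

def zipf_mandelbrot(vocab_size, alpha=-1.0, beta=2.7):
    """Normalized Zipf-Mandelbrot frequencies for ranks 1..vocab_size."""
    ranks = np.arange(1, vocab_size + 1)
    weights = (ranks + beta) ** alpha   # f ∝ (n + β)^α
    return weights / weights.sum()      # finite vocabulary, so this normalizes

freqs = zipf_mandelbrot(50_000)
print(freqs[:3])          # the few most common words dominate
print(freqs[:100].sum())  # the top 100 ranks carry a large share of the total mass
```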
Piantadosi (§2) points out that the traditional way of demonstrating ZM is statistically flawed. Suppose you take a single corpus of natural language, as Zipf did, and you calculate the observed frequency $\widehat{f}$ for each word in the corpus. You then use these observed frequencies to compute observed frequency ranks $\widehat{n}.$ But wait—we don’t intrinsically care about $\widehat{f}$ and $\widehat{n}$ for the particular corpus we happen to have chosen. $\widehat{f}_i$ is just an estimate for the true frequency $f_i$ of word $i$ in the language, and it’s subject to some amount of sampling error. The issue is that the error of $\widehat{f}_i$ is correlated with the error of $\widehat{n}_i$: if by pure chance we overestimate a word’s frequency, we will also underestimate its rank, and if we underestimate a word’s frequency, we will overestimate its rank. So even if all words were really equiprobable in the language, Zipf’s methodology would produce a spurious negative correlation between frequency and rank.
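To see the problem concretely, here’s a quick simulation (mine, not Piantadosi’s) of a “language” whose words really are equiprobable. Estimating ranks and frequencies from the same sample still produces a negative rank-frequency slope:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# A "language" in which every word really is equiprobable.
vocab = [f"w{i}" for i in range(1000)]
corpus = rng.choice(vocab, size=20_000)

# Zipf-style analysis: ranks and frequencies from the SAME sample.
counts = Counter(corpus)
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Fit log f = a * log n + b. The slope comes out negative even though
# the true distribution is flat, purely from correlated sampling error.
slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(slope)  # reliably < 0
```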
The solution to this issue is to use two separate corpora: one to estimate frequency ranks and another to estimate frequencies. This way, your errors in $\widehat{f}$ and $\widehat{n}$ will be uncorrelated, and the residuals on your fit will be interpretable. Piantadosi does this analysis and finds that ZM is accurate but not fully precise. To first order, $f \propto 1/n$, but there are significant higher-order corrections.
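A minimal sketch of the two-corpus estimator, assuming you’re content to just split one corpus in half (the helper and the split are my own illustration, not Piantadosi’s code):

```python
import numpy as np
from collections import Counter

def rank_vs_freq(tokens_a, tokens_b):
    """Rank words using corpus A, estimate their frequencies from corpus B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    ranked = [w for w, _ in counts_a.most_common()]   # ranks come from A only
    freqs = np.array([counts_b[w] for w in ranked], dtype=float)
    ranks = np.arange(1, len(ranked) + 1)
    keep = freqs > 0        # drop words that never show up in corpus B
    return ranks[keep], freqs[keep]

# e.g. split one corpus into halves and fit on the decorrelated estimates:
# ranks, freqs = rank_vs_freq(tokens[::2], tokens[1::2])
# slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
```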
There’s evidence that the ZM law also applies to nonhuman languages. Arnon et al. took several years’ worth of transcribed humpback whale song and calculated frequencies for all the bigrams observed. They then segmented the songs into word-like units by inferring that whenever the probability of some bigram was less than 0.425 times the probability of the previous bigram, the second bigram spanned a word boundary. This makes intuitive sense: if you take the string “previousbigram” letter pair by letter pair, they all look pretty reasonable until you get to “sb”, which is the 344th most common bigram in the Google corpus. Further, we have independent evidence (2)Saffran, Aslin, & Newport, “Statistical Learning by 8-Month-Old Infants” that human infants use something close to this algorithm to learn word boundaries in speech, lending Arnon et al.’s approach biological plausibility.
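Here’s roughly what that segmentation rule looks like in code. This is my own sketch of the heuristic as described, not the authors’ implementation; the 0.425 threshold is theirs, while the function name and the way bigram probabilities are estimated are assumptions:

```python
from collections import Counter

def segment(units, threshold=0.425):
    """Cut a sequence of song units into word-like chunks wherever a bigram's
    probability falls below `threshold` times the previous bigram's probability."""
    bigrams = list(zip(units, units[1:]))
    total = len(bigrams)
    prob = {bg: c / total for bg, c in Counter(bigrams).items()}

    words, current = [], [units[0]]
    for i in range(1, len(units)):
        if i >= 2 and prob[(units[i - 1], units[i])] < threshold * prob[(units[i - 2], units[i - 1])]:
            words.append(current)   # the boundary falls inside the low-probability bigram
            current = [units[i]]
        else:
            current.append(units[i])
    words.append(current)
    return words

# e.g. words = segment(transcribed_song_units)
```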
Once the whale song is chopped up into words, you can calculate the frequency of each word observed and fit the resulting distribution to a power law. The authors give an $R^2$ value (0.93) for this fit, and they provide a log-log plot that looks vaguely straight, but frustratingly, they don’t report the best-fit parameters $\alpha$ and $\beta$ anywhere in their paper or in the supplementary materials. Just going by eyeball, it looks like $\alpha \approx -0.8$, which isn’t far off from the observed value for human languages.
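The paper doesn’t say exactly how the fit was done. A conventional choice would be least squares in log-log space, something like the sketch below (the $R^2$ here is computed on log frequencies, which may or may not match the authors’ procedure, and $\beta$ is left as a fixed offset rather than fit):

```python
import numpy as np

def fit_power_law(counts, beta=0.0):
    """Least-squares fit of log f = alpha * log(rank + beta) + c, plus R^2 in log-log space."""
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    x, y = np.log(ranks + beta), np.log(freqs)
    alpha, c = np.polyfit(x, y, 1)
    r2 = 1 - ((y - (alpha * x + c)) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return alpha, r2

# e.g. alpha, r2 = fit_power_law(list(word_counts.values()))
```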
Why a power law? #
The ZM law is surprising for two reasons. (1) Why is there a right tail of uncommon words at all? Language should presumably be optimized for information density, so you might expect word frequencies to follow an entropy-maximizing uniform distribution: all words should be equally frequent in order to make the average word maximally surprising. (2) Even if we accept that word frequencies can’t follow a uniform distribution for some reason, why should they follow this specific distribution? What’s so special about the ZM power law? It seems like a crazy coincidence that mutually isolated cultures all over the world should all have invented languages whose word frequencies go as $1/n$.
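To put a rough number on point (1): a uniform distribution over a 50,000-word vocabulary carries $\log_2 50{,}000 \approx 15.6$ bits per word, while a Zipfian distribution over the same vocabulary carries noticeably less. A quick check (the vocabulary size is an arbitrary assumption of mine):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a (possibly unnormalized) distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

V = 50_000
uniform = np.ones(V) / V
zipfian = 1.0 / (np.arange(1, V + 1) + 2.7)   # f ∝ (n + 2.7)^(-1)

print(entropy_bits(uniform))  # log2(V) ≈ 15.6 bits per word, the theoretical maximum
print(entropy_bits(zipfian))  # noticeably lower: the average word is less surprising
```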
There are a ton of hypothesized explanations for the ZM law floating around:
To give a brief picture of the range of explanations that have been worked out, such distributions have been argued to arise from random concatenative processes, mixtures of exponential distributions, scale-invariance, (bounded) optimization of entropy or Fisher information, the invariance of such power laws under aggregation, multiplicative stochastic processes, preferential re-use, symbolic descriptions of complex stochastic systems, random walks on logarithmic scales, semantic organization, communicative optimization, random division of elements into groups, first- and second-order approximation of most common (e.g. normal) distributions, optimized memory search, among many others. (3)Piantadosi §1
Note that all of these explanations were put forward before there was good evidence for ZM in non-human languages. Now that we do have such evidence, all explanations that rely on strict assumptions about how human brains form and parse language should get a bit less credence.
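To make one of these mechanisms concrete, the first item on that list (random concatenative processes, i.e. “monkey typing”) is easy to simulate. This toy version is mine, with an arbitrary four-letter alphabet:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Uniformly random characters drawn from a small alphabet plus a space.
text = "".join(rng.choice(list("abcd "), size=1_000_000))
words = [w for w in text.split(" ") if w]

freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(slope)  # a steep negative slope, roughly Zipf-like, with no language involved
```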
Questions #
- Does ZM hold even approximately for any artificial languages?
- Can we sanity-check Arnon et al.’s procedure for inferring word boundaries in whale song by running it on human language?
Reading #
- Piantadosi, “Zipf’s word frequency law in natural language”
- Arnon et al., “Whale song shows language-like statistical structure” ✔
- Lavi-Rotbain & Arnon, “The learnability consequences of Zipfian distributions in language”
Last updated 18 February 2025