Estimating an author’s vocabulary

How would you estimate the size of an author’s vocabulary? Suppose you have analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have known words he never used in writing, or at least not in the works you have access to.
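As a rough sketch of how n and x might be counted, the following treats a word as a run of lowercase letters and apostrophes after lowercasing. That tokenization rule and the function name are assumptions made for illustration; a real study would need to decide how to handle inflections, hyphenation, and proper names.

```python
import re

def count_tokens_and_types(text):
    """Return (n, x): the total word count and the number of unique words.

    Assumption: a "word" is a maximal run of ASCII letters and
    apostrophes, compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

n, x = count_tokens_and_types("The cat saw the other cat.")
print(n, x)
```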

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

$$\sum_{j=0}^{x-1} \left(1 - \frac{j}{N}\right)^{-1} = n$$

The left side is a decreasing function of N, so you could solve the equation by finding values of N that make the sum smaller and larger than n, then applying a bisection algorithm.
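Here is a minimal sketch of that bisection, assuming the left side of the equation is the sum over j = 0, ..., x-1 of 1/(1 - j/N). The function names and the tolerance are illustrative choices, not part of Brainerd’s paper. The search starts at N = x, since the vocabulary is at least x, and doubles the upper bound until the sum drops below n.

```python
import math

def brainerd_sum(N, x):
    """Assumed left side of the equation: sum of 1/(1 - j/N) for j = 0..x-1."""
    return sum(1.0 / (1.0 - j / N) for j in range(x))

def estimate_vocabulary(n, x, rel_tol=1e-9):
    """Solve brainerd_sum(N, x) = n for N by bisection.

    n is the total word count, x the number of unique words.
    """
    if not 1 <= x <= n:
        raise ValueError("need 1 <= x <= n")
    if x == n:
        return math.inf  # no repetition observed
    lo = float(x)  # the vocabulary is at least x
    if brainerd_sum(lo, x) <= n:
        # Assumption: when the root would fall below x, report x itself.
        return lo
    # The sum is decreasing in N, so double hi until the sum is at most n.
    hi = 2.0 * lo
    while brainerd_sum(hi, x) > n:
        hi *= 2.0
    # Invariant: brainerd_sum(lo) > n >= brainerd_sum(hi).
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if brainerd_sum(mid, x) > n:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

N = estimate_vocabulary(1000, 400)
print(N, brainerd_sum(N, 400))
```

Each evaluation of the sum costs x operations, so for a large corpus it may be worth vectorizing the sum, but the plain version keeps the logic visible.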

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞. If you haven’t seen any repetition, then for all you know the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.
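To see why, assume the left side of the equation is the sum of 1/(1 - j/N) over j = 0, ..., x-1. Each term can be split into 1 plus a positive excess:

$$\sum_{j=0}^{x-1} \frac{1}{1 - j/N} = \sum_{j=0}^{x-1} \frac{N}{N-j} = x + \sum_{j=1}^{x-1} \frac{j}{N-j}$$

For finite N the last sum is positive when x is at least 2, and it shrinks to zero as N grows. So the total can equal x only in the limit N = ∞, and a larger gap n - x, meaning more repetition, requires a larger excess and therefore a smaller N.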

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokens in literary text, J. Appl. Prob. 9, pp. 507-5
