We use thousands of words every day, with all kinds of meanings and belonging to very different grammatical categories. However, not all of them are used with the same frequency. Depending on how important they are to the structure of the sentence, some words recur more often than others.

Zipf’s law is a postulate that accounts for this phenomenon: it specifies how likely a word is to be used based on its position in the ranking of all the words used in a language. We will now go into more detail about this law.

Zipf’s Law

George Kingsley Zipf (1902-1950) was an American linguist, born in Freeport, Illinois, who came across a curious phenomenon in his studies of comparative philology. While carrying out statistical analyses, he found that the most commonly used words seemed to follow a pattern of occurrence. This observation gave rise to the law that bears his name.

According to Zipf’s law, in the vast majority of cases, if not always, the words used in a written text or in a conversation follow this pattern: the most used word, which occupies first place in the ranking, is used twice as often as the second most used, three times as often as the third, four times as often as the fourth, and so on.

In mathematical terms, this law would be:

$P_n \approx 1/n^{a}$

where $P_n$ is the frequency of the word at rank $n$ and the exponent $a$ is approximately 1.
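As a rough illustration of the formula, the following Python sketch turns the 1/n^a weights into probabilities for a finite vocabulary. The function name `zipf_probabilities` and the normalization over a fixed number of ranks are assumptions made for this example; the text itself only states the proportionality.

```python
def zipf_probabilities(num_ranks: int, a: float = 1.0) -> list[float]:
    """Return normalized Zipf probabilities for ranks 1..num_ranks.

    Hypothetical helper: the finite vocabulary size and the normalization
    are assumptions of this sketch, not part of the law as stated.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    # Unnormalized weights follow the 1/n^a form.
    weights = [1.0 / (n ** a) for n in range(1, num_ranks + 1)]
    total = sum(weights)  # normalizing constant so the values sum to 1
    return [w / total for w in weights]


probs = zipf_probabilities(5)
for rank, p in enumerate(probs, start=1):
    print(f"rank {rank}: {p:.3f}")
```

With an exponent of exactly 1, the first probability is twice the second and three times the third, which is the pattern described in words above.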

George Zipf was not the only one to observe this regularity in the frequency of the most used words in many languages, both natural and artificial. In fact, there are records of others who noticed it, such as the stenographer Jean-Baptiste Estoup and the physicist Felix Auerbach.

Zipf studied this phenomenon with texts in English, and it appears to hold. If we take the original version of The Origin of Species by Charles Darwin (1859), we see that the most used word in the first chapter is “the”, which appears about 1,050 times, while the second is “and”, appearing about 400 times, and the third is “to”, appearing about 300 times. The fit is not exact, but the second word appears roughly half as often as the first and the third roughly a third as often.
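To see how close the Darwin figures come to the ideal pattern, here is a small Python sketch that compares the approximate counts quoted above with what an exponent of exactly 1 would predict. The helper `zipf_expected_count` is a hypothetical name introduced for this example, and the counts are the rounded ones from the text.

```python
def zipf_expected_count(top_count: float, rank: int, a: float = 1.0) -> float:
    """Count predicted for a given rank, scaled from the top-ranked word."""
    return top_count / (rank ** a)


# Approximate counts quoted in the text for the first chapter.
observed = {"the": 1050, "and": 400, "to": 300}

comparison = []
for rank, (word, count) in enumerate(observed.items(), start=1):
    expected = zipf_expected_count(observed["the"], rank)
    comparison.append((word, count, expected))
    print(f"{rank}. {word!r}: observed {count}, predicted {expected:.0f}")
```

The predicted values for the second and third words are 525 and 350, so the observed 400 and 300 fall somewhat short of the ideal, in line with the remark that the fit is only approximate.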

The same happens in Spanish. If we take the original Spanish version of this article as an example, the word “de” is used 85 times, making it the most used, while the word “la”, the second most used, appears 57 times.

Seeing that this phenomenon occurs in other languages, it becomes interesting to think about how the human brain processes language. Although many cultural factors shape the use and meaning of words, language itself being a cultural product, the way in which we use the most frequent words seems to be independent of culture.

Frequency of function words

Let’s look at the following ten words, given here in English translation: ‘what’, ‘from’, ‘no’, ‘to’, ‘the’, ‘the’, ‘is’, ‘and’, ‘in’ and ‘it’ (‘the’ appears twice because Spanish has separate forms, ‘el’ and ‘la’). What do they all have in common? They are words without meaning by themselves but, ironically, they are the 10 most used words in the Spanish language.

By saying that they are meaningless, we mean that a sentence with no noun, adjective, verb or adverb conveys nothing. For example:

… and … … in … … one … from … … to … from … …

On the other hand, if we replace the dots with words with meaning, we can have a sentence like the following.

Miguel and Ana have a little brown table in their house next to their bed.

These widely used words are known as function words, and they give the sentence its grammatical structure. They are not limited to the 10 words we have seen: there are dozens of them, and they are all among the hundred most used words in Spanish.

Although they lack meaning on their own, they cannot be omitted from any sentence that is meant to make sense. To transmit a message efficiently, human beings need words that build the structure of the sentence. For this reason they are, curiously, the most commonly used.

Research

Despite what George Zipf observed in his studies of comparative philology, until relatively recently it had not been possible to test the postulates of the law empirically. This was not because it was materially impossible to analyze all conversations or texts in English, or any other language, but because the task was titanic and demanded enormous effort.

Fortunately, thanks to modern computing and software, it has been possible to investigate whether the law holds in the form that Zipf originally proposed or whether there are variations.
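The kind of counting that software makes practical can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the method used in any particular study; the function name `rank_word_frequencies` and the very simple tokenization (lowercase runs of letters, so contractions are split) are assumptions of this example.

```python
import re
from collections import Counter


def rank_word_frequencies(text: str, top: int = 10) -> list[tuple[int, str, int]]:
    """Return (rank, word, count) for the most frequent words in a text."""
    # Simplified tokenization: lowercase runs of letters only.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(top), start=1)
    ]


sample = "the cat and the dog and the bird saw the fish"
for rank, word, count in rank_word_frequencies(sample, top=3):
    print(rank, word, count)
```

Applied to a large corpus instead of a toy sentence, the resulting rank and count columns are exactly what is compared against the 1/n pattern.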

One case is the research carried out by the Centre for Mathematical Research (CRM; in Catalan, Centre de Recerca Matemàtica), linked to the Autonomous University of Barcelona. The researchers Álvaro Corral, Isabel Moreno García and Francesc Font Clos carried out a large-scale analysis of thousands of digitised texts in English to see how well Zipf’s law holds.

Their work, which analyzed an extensive corpus of nearly 30,000 volumes, yielded a law equivalent to Zipf’s, in which the most used word appeared twice as often as the second, and so on.

Zipf’s law in other contexts

Although Zipf’s law was originally used to explain the frequency of the words used in each language, comparing their rank with their actual frequency in texts and conversations, it has also been extrapolated to other situations.

A rather striking case is the number of people living in the largest cities of the United States. According to Zipf’s law, the most populated American city would have twice as many people as the second most populated, and three times as many as the third most populated.

If you look at the 2010 population census, this is roughly consistent. New York had a total population of 8,175,133 people, Los Angeles was the next most populated city with 3,792,621, and the following cities in the ranking, Chicago, Houston and Philadelphia, had 2,695,598, 2,100,263 and 1,526,006, respectively.
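The census figures above can be checked against the ideal ratios (2, 3, 4, 5) with a short Python sketch. The helper name `rank_ratios` is hypothetical; the populations are the ones quoted in the text.

```python
def rank_ratios(sizes: dict[str, float]) -> list[tuple[int, str, float]]:
    """Return (rank, name, largest / size), ordered from largest to smallest."""
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    return [
        (rank, name, largest / size)
        for rank, (name, size) in enumerate(ordered, start=1)
    ]


# 2010 census populations as quoted in the text.
populations_2010 = {
    "New York": 8_175_133,
    "Los Angeles": 3_792_621,
    "Chicago": 2_695_598,
    "Houston": 2_100_263,
    "Philadelphia": 1_526_006,
}

for rank, city, ratio in rank_ratios(populations_2010):
    print(f"{rank}. {city}: New York is {ratio:.2f} times as large")
```

Under an exact Zipf pattern the ratio for each city would equal its rank, so the closer each printed ratio is to the rank beside it, the better the fit.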

This can also be seen in the most populated cities in Spain, where Zipf’s law is not completely fulfilled but the populations do correspond, to a greater or lesser extent, to the rank that each city occupies. Madrid, with a population of 3,266,126, has twice the population of Barcelona, with 1,636,762, while Valencia, with about 800,000 inhabitants, has closer to a quarter of Madrid’s population than the third the law would predict.

Another observable case of Zipf’s law involves web pages. Cyberspace is very large, with about 15 billion web pages created. With about 6.8 billion people in the world, that is roughly two web pages per person; if visits were spread evenly, each page would receive a similar share of traffic, which is not the case.

The ten most visited pages at present are: Google (60.49 million monthly visits), YouTube (24.31 million), Facebook (19.98 million), Baidu (9.77 million), Wikipedia (4.69 million), Twitter (3.92 million), Yahoo (3.74 million), Pornhub (3.36 million), Instagram (3.21 million) and Xvideos (3.19 million). Looking at these numbers, you can see that Google is more than twice as visited as YouTube, three times as visited as Facebook, more than four times as visited as Baidu…
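One way to summarize how steeply figures like these fall off is to estimate the exponent from the data. The Python sketch below does this with an ordinary least-squares line through the points (log rank, log value). It is a rough illustration under that assumption, not a rigorous estimator (maximum-likelihood methods are generally preferred for power laws), and the function name `fit_zipf_exponent` is hypothetical.

```python
import math


def fit_zipf_exponent(values: list[float]) -> float:
    """Estimate the exponent a from a least-squares fit in log-log space."""
    ordered = sorted(values, reverse=True)
    if len(ordered) < 2 or ordered[-1] <= 0:
        raise ValueError("need at least two strictly positive values")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(value) for value in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the best-fit line; Zipf's law predicts a slope of about -a.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Monthly visits (in millions) as quoted in the text.
visits_millions = [60.49, 24.31, 19.98, 9.77, 4.69, 3.92, 3.74, 3.36, 3.21, 3.19]
print(f"estimated exponent a = {fit_zipf_exponent(visits_millions):.2f}")
```

An estimate close to 1 would match the classic form of the law; with only ten data points the result should be read as indicative at best.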

Bibliographic references:

  • Font-Clos, F., Boleda, G. and Corral, Á. (2013). A scaling law beyond Zipf’s law and its relation to Heaps’ law. New Journal of Physics, 15. doi.org/10.1088/1367-2630/15/9/093033.
  • Montemurro, M. A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300: 567–578.