# Analyzing Outliers and Significance: Insights into Dunbar’s Number and Genetic Conditions in Human Evolution

## Abstract

This article introduces the Square Root Law of Outliers and provides its derivation, highlighting its significance and the context of its application. It explores the use of the Inverse Herfindahl-Hirschman Index (IHHI) in analyzing data distributed across bins in a normal distribution, and establishes a natural level of significance for the Student’s t-test in the limit of large sample sizes. This approach leads to the identification of a critical value $t_c = \sqrt{\pi} \approx 1.77$ and a two-sided natural level of significance $\alpha = \mathrm{erfc}(\sqrt{\pi/2}) \approx 0.076$. By combining the Square Root Law of Outliers with this natural level of significance in an analysis of how humans make friends, and by applying them to a “likability” measure, the article formulates an analytical expression for Dunbar’s Number. This offers a mathematical explanation for the cognitive limits of human social networks. As a concrete application of the Square Root Law of Outliers, and by interpreting Dunbar’s Number as the typical size of a hunter-gatherer community, the study analyzes the prevalence rates of genetic conditions such as color-blindness, autism, and bipolar disorder. It paves the way for a novel hypothesis: that these genetic conditions were integrated into the human gene pool approximately 10,000 years ago in early agricultural societies, serving as evolutionary advantages in warfare. The article also discusses some other potential areas of application for the Square Root Law of Outliers.

## Introduction

The “Square Root Law” presents itself in many situations. The most famous example is the so-called “Birthday Problem” [1]. This problem consists of estimating the number of people needed for at least two of them to share a birthday with a probability greater than 0.5. The correct answer is 23 people, far fewer than the number of days in a year and closer to $\sqrt{365} \approx 19$. Another well-known situation where a square root law appears is the Central Limit Theorem [2], which states that given $N$ i.i.d. (independent and identically distributed) random variables, each with mean $\mu$ and standard deviation $\sigma$, the distribution of their average will approach a normal distribution with the same mean $\mu$ but with a much narrower standard deviation $\sigma/\sqrt{N}$. A square root law also shows up in Brownian motion [3], where the displacement is expected to increase as the square root of time. Square root laws have also been applied empirically in various contexts, such as (a) inventory management [4], where the law states that the total inventory needed grows only as the square root of the number of warehouses; (b) the analysis of productivity, where it is known as Price’s Square Root Law [5] and states that 50% of the results will be achieved by a number of workers equal to the square root of the size of the total group; and (c) pharmaceutical inspection, where it is known as the rule of the “Square Root of $N$ Plus One” [6] and dictates how many boxes to sample and inspect.
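The birthday figure quoted above is easy to check numerically. The following is a minimal Python sketch (the function names are illustrative and not taken from any library) that computes the exact collision probability under the usual assumption of 365 equally likely, independent birthdays, and prints $\sqrt{365}$ for comparison.

```python
import math


def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday,
    assuming `days` equally likely and independent birthdays."""
    p_all_distinct = 1.0
    for k in range(people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group_for_even_odds(days: int = 365) -> int:
    """Smallest group size whose collision probability exceeds 0.5."""
    people = 1
    while birthday_collision_probability(people, days) <= 0.5:
        people += 1
    return people


print(smallest_group_for_even_odds())   # group size needed for better-than-even odds
print(math.sqrt(365))                   # square-root scale for comparison
```

The exact answer and the square root of the number of days differ by a factor of order one, which is the sense in which the problem follows a square root law.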

There appear to be no published results on a Square Root Law pertaining to the size of outlier groups, and this article represents an effort to provide a context and justification for its applicability in this new domain.

Establishing a Square Root Law of Outliers aids in situations where attention comes at a premium. For instance, in statistical analysis, continuous variables are often binned into classes for ease of discussion. An example is the use of tax brackets. Also, in the plotting of histograms, often the last category is an exceptional category like “\$100 or above.” The use of the Square Root Law of Outliers helps to find an appropriate cutoff point for defining outliers.

This law also contributes to the justification of the distribution of resources in policymaking. For instance, many diseases warrant research studies. However, governments and human societies have limited resources. Therefore, it becomes a matter of practical necessity to fund only those research studies on diseases that impact a significant portion of the population. Similarly, a political representative (e.g., a city council member or a congressperson) cannot possibly address all the concerns of their individual constituents and must focus only on those matters that are of general concern. The Square Root Law also helps to establish selection criteria that separate exceptional members from regular members. For instance, there are almost 200 countries in the world, which gives a square root value of $\sqrt{200} \approx 14$. Therefore, it is not surprising that a forum like the G8 (The Group of Eight) or G7 (The Group of Seven) has been criticized as being too narrow, and a more inclusive forum like the G20 (The Group of Twenty) has emerged to allow for a more comprehensive dialogue on issues related to the global economy.

In the context of evolution, genetic variations may either bring about novel subspecies/species or cause genetic anomalies. The Square Root Law of Outliers helps to separate normal variations from anomalies and may assist in the choice of action between accommodation and treatment.

## The Square Root Law of Outliers

This new square root law applies to competition between two teams: team I and team II. First, assume team I is composed of $N$ i.i.d. team members, each capable of making a contribution that follows a distribution with mean $\mu$ and standard deviation $\sigma$. By the Central Limit Theorem, the average member contribution in team I would approach a normal distribution with mean $\mu$, but with a standard deviation approaching $\sigma/\sqrt{N}$, where $N$ is the size of team I.

Now, consider a second team II with the same sample size $N$, but with two different types of team members:

• $N - n$ i.i.d. team members just like those in team I, and
• $n$ i.i.d. team members of a different kind that make a contribution following a distribution with a mean $\mu'$ and a standard deviation $\sigma'$.

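Now, the two teams will compete. The question is: which team will win? Obviously, team II has an advantage because $\mu' > \mu$. But in real life, a better team may still lose to a worse team, due to random factors. The question is then: at what size of $n$ would it be most clear that team II will win in a game against team I? This question mimics the Student’s t-test [7] for comparing two means. Essentially, the new mean should differ from the old one by an amount larger than the standard error coming from statistical fluctuations. In this test, one defines a t-statistic to be:

$$t = \frac{\bar{X}_{II} - \bar{X}_{I}}{\sqrt{\dfrac{s_{I}^2}{N} + \dfrac{s_{II}^2}{N}}},$$

where a sample of size $N$ for each team has been assumed, as represented by their respective team members. In that case, the following formulas result (in expectation):

$$\bar{X}_{I} = \mu, \qquad \bar{X}_{II} = \frac{(N-n)\,\mu + n\,\mu'}{N} = \mu + \frac{n}{N}(\mu' - \mu),$$

$$s_{I}^2 = \sigma^2, \qquad s_{II}^2 = \frac{(N-n)\,\sigma^2 + n\,\sigma'^2}{N} + \frac{n(N-n)}{N^2}(\mu' - \mu)^2.$$

From there, it can be inferred that

$$t = \frac{\dfrac{n}{N}(\mu' - \mu)}{\sqrt{\dfrac{s_{I}^2 + s_{II}^2}{N}}}.$$

In the limit $N \to \infty$ with $n \ll N$, $s_{II}^2 \to \sigma^2$ and the statistic becomes

$$t \approx \frac{n\,(\mu' - \mu)}{\sigma\sqrt{2N}}.$$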

In the limit of a large number of degrees of freedom in the $t$-test, the Student’s $t$-distribution approaches a normal distribution $\mathcal{N}(0, 1)$. Thus, one simply looks up the inverse cumulative distribution function of $\mathcal{N}(0, 1)$ to obtain a desired critical value $t_c$. (In the SciPy Python package, this is given by the `scipy.stats.norm.ppf` method.) For instance, at the single-sided 5% significance level, $t_c \approx 1.645$. Using the critical value for a given significance level, one can then infer from the above formula the value of $n$ necessary to more or less guarantee team II winning a match against team I. That is:

$$n = t_c\,\frac{\sqrt{2}\,\sigma}{\mu' - \mu}\,\sqrt{N}.$$

As seen in the next section, there is a natural choice for the critical value given by $t_c = \sqrt{\pi} \approx 1.77$, corresponding to a single-sided significance level of about 0.038. Choosing this value of $t_c$,

$$n = \left(\frac{\sqrt{2\pi}\,\sigma}{\mu' - \mu}\right)\sqrt{N}.$$

Assuming the factor inside the parentheses to be around 1, a general rule is created:

$$n \approx \sqrt{N}.$$

That is, a subpopulation of high-performing outliers will only become statistically significant if their size is larger than the square root of the total population. Incidentally, by taking the natural choice of $t_c = \sqrt{\pi}$, if the multiplicative factor in the square root law were to be 1, that would mean:

$$\mu' - \mu = \sqrt{2\pi}\,\sigma \approx 2.5\,\sigma.$$

That is, the new mean would lie about two and a half standard deviations above the mean of the distribution of contributions of a member of team I. This is a reasonable requirement for the new type of team member to excel over the existing type. Of course, if this is not the case, the multiplicative factor in the square root law will need to be adjusted accordingly.

Obviously, this analysis carries through for the case of underperforming outliers as well. In general, in a population of $N$ members, due to the square root law and under fairly general conditions, a subgroup’s differential contribution will usually stand out only if the size of the subgroup is larger than $\sqrt{N}$, save for a multiplicative factor that is typically of order 1.
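As a numerical illustration of this rule, the sketch below assumes the large-sample form of the statistic, $t \approx n(\mu' - \mu)/(\sigma\sqrt{2N})$, where $N$ is the team size, $n$ the number of members of the new kind, $\mu$ and $\sigma$ the mean and standard deviation of a regular member's contribution, and $\mu'$ the mean contribution of the new kind. The function names and example values are hypothetical, and Python's standard-library `statistics.NormalDist` is used here as a stand-in for a SciPy lookup of the normal inverse cumulative distribution function.

```python
import math
from statistics import NormalDist


def critical_value(alpha_one_sided: float) -> float:
    """One-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha_one_sided)


def large_n_t_statistic(n, N, mu, sigma, mu_new):
    """Assumed large-N limit of the two-team t-statistic (n << N)."""
    return n * (mu_new - mu) / (sigma * math.sqrt(2.0 * N))


def required_outliers(N, mu, sigma, mu_new, t_c):
    """Number of new-kind members needed for the statistic to reach t_c."""
    return t_c * math.sqrt(2.0) * sigma / (mu_new - mu) * math.sqrt(N)


# Hypothetical example: a team of 10,000 whose new members sit
# sqrt(2*pi) ~ 2.5 standard deviations above the regular mean.
N, mu, sigma = 10_000, 0.0, 1.0
mu_new = mu + math.sqrt(2.0 * math.pi) * sigma
t_c = math.sqrt(math.pi)

n_needed = required_outliers(N, mu, sigma, mu_new, t_c)
print(n_needed, math.sqrt(N))             # both are about 100
print(large_n_t_statistic(n_needed, N, mu, sigma, mu_new), t_c)
print(critical_value(0.05))               # conventional 5% one-sided value
```

With a smaller gap between the two means, `required_outliers` returns a proportionally larger multiple of $\sqrt{N}$, which is the multiplicative factor mentioned above.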

## Natural Level of Significance

Whenever one performs a Student’s $t$-test, one must specify a level of significance $\alpha$. Popular choices are $\alpha = 0.10$, $0.05$, and $0.01$. One may wonder whether there is a more natural way of automatically selecting a level of significance that is suitable for most typical situations. By applying the *Inverse Herfindahl-Hirschman Index* (IHHI) [8] to a binned normal distribution, a natural value for the level of significance can be obtained.

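The continuous distribution of samples that follows a normal distribution is given by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Upon binning with bin width $\Delta x$, the population in each bin, as a fraction of the total (using a left Riemann sum), is given by:

$$p_i = f(x_i)\,\Delta x.$$

The “effective number of bins” $N_{\text{eff}}$ can be obtained by calculating the Inverse Herfindahl-Hirschman Index (IHHI), which is also commonly used in the analysis of market concentration and monopoly situations [9]. It has also been applied to the effective number of political parties in parliaments.

$$N_{\text{eff}} = \frac{1}{\sum_i p_i^2}.$$

In the continuous limit $\Delta x \to 0$,

$$N_{\text{eff}} = \frac{1}{\Delta x \int_{-\infty}^{\infty} f(x)^2\,dx} = \frac{2\sqrt{\pi}\,\sigma}{\Delta x}.$$

When assigning these bins to the central area of the normal distribution, the total horizontal span of the shaded area would be $N_{\text{eff}}\,\Delta x = 2\sqrt{\pi}\,\sigma$. This means that the shaded area extends from $\mu - \sqrt{\pi}\,\sigma$ to $\mu + \sqrt{\pi}\,\sigma$.

The effective fraction of the total population (the shaded area) would be:

$$\int_{\mu - \sqrt{\pi}\sigma}^{\mu + \sqrt{\pi}\sigma} f(x)\,dx = \mathrm{erf}\left(\sqrt{\pi/2}\right) \approx 0.9237,$$

where $\mathrm{erf}$ is the error function. The two-sided level of significance can be expressed in terms of the complement of the error function $\mathrm{erfc}$:

$$\alpha = 1 - \mathrm{erf}\left(\sqrt{\pi/2}\right) = \mathrm{erfc}\left(\sqrt{\pi/2}\right) \approx 0.0763,$$

whereas the single-sided level of significance would be given by

$$\frac{\alpha}{2} = \frac{1}{2}\,\mathrm{erfc}\left(\sqrt{\pi/2}\right) \approx 0.0382.$$

The corresponding critical value for a Student’s $t$-statistic in the limit of large $N$ is given by:

$$t_c = \sqrt{\pi} \approx 1.77.$$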

One may wonder why one should bother choosing a natural level of significance or a natural critical value of the $t$-statistic. This question is similar to asking why one should choose Euler’s number $e$ as the base of the natural logarithm when any other base would do. This article aims to show that the special analytic properties of these natural choices are what ultimately lend support to their usage.
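The chain of steps in this section can be checked numerically. The sketch below uses only Python's standard library and illustrative function names. It bins a normal distribution with a left Riemann sum, computes the inverse Herfindahl-Hirschman index of the bin populations, and compares the resulting span with the closed-form value $2\sqrt{\pi}\,\sigma$ obtained in the continuous limit. It then evaluates the corresponding significance levels with `math.erfc`. The bin width and the truncation at ten standard deviations are arbitrary choices made for the illustration.

```python
import math
from statistics import NormalDist


def effective_span(sigma: float = 1.0, dx: float = 0.01,
                   half_width_in_sigmas: float = 10.0) -> float:
    """Effective number of bins (IHHI) times the bin width for a
    zero-mean normal distribution binned with a left Riemann sum."""
    dist = NormalDist(0.0, sigma)
    n_bins = int(round(2.0 * half_width_in_sigmas * sigma / dx))
    start = -half_width_in_sigmas * sigma
    populations = [dist.pdf(start + i * dx) * dx for i in range(n_bins)]
    total = sum(populations)
    ihhi = total ** 2 / sum(p * p for p in populations)
    return ihhi * dx


span = effective_span()
half_span_in_sigmas = span / 2.0                       # close to sqrt(pi)
alpha_two_sided = math.erfc(half_span_in_sigmas / math.sqrt(2.0))
alpha_one_sided = alpha_two_sided / 2.0
t_c = NormalDist().inv_cdf(1.0 - alpha_one_sided)      # recovers the half-span

print(span, 2.0 * math.sqrt(math.pi))
print(alpha_two_sided, alpha_one_sided, t_c)
```

Because the span depends on the bin width only through the product of the effective number of bins and the width, the result is insensitive to the particular `dx` chosen, as long as it is small compared with $\sigma$.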

## An Analytical Expression for Dunbar’s Number

In the social sciences, Dunbar’s Number [10] is widely used as a suggested cognitive limit to the number of people with whom one can maintain stable social relationships. It was first introduced by the British anthropologist Robin Dunbar, who also analyzed the typical size of hunter-gatherer societies. Dunbar’s number has been proposed to lie between 100 and 250, with a commonly used value of 150 [11].

Although Robin Dunbar reached his number by studying relations between primate brain sizes and their average social group size, one may wonder whether there is a more fundamental mathematical mechanism behind this number. This section represents such an attempt.

The creation of a measure of “likability” of a person’s friends serves as a foundational step in this approach. A friend may score high if they are a candidate for a romantic relationship. Another friend may score high if they are a good co-worker. However, romantic relationships are typically discouraged in a workplace environment. That is, a person who is likable in one dimension may not be likable in another, and no one person can score high in all dimensions simultaneously. There is a multitude of dimensions along which to measure a person’s likability. A Principal Component Analysis (PCA) [12] can be applied to reduce all these dimensions to a single dimension. Once the usual variable centering and z-scaling [13] are performed, the distribution of the people that a person regularly interacts with is expected to more or less follow a normal distribution along the direction of the first principal component. It is important to point out that people whom a person likes a lot may not reciprocate, since they themselves may prefer to spend time with other friends whom they like better.

By applying the “Inverse Herfindahl-Hirschman Index” analysis of the previous section, it can be inferred that out of the $N$ friends that a person regularly interacts with, two groups may emerge, where $\alpha = \mathrm{erfc}(\sqrt{\pi/2}) \approx 0.0763$ is the two-sided natural level of significance:

• Stable friends: this group consists of $(1-\alpha)N$ friends with whom a person has stronger attachments, and these friends are thus within the “effective bins” in the normal distribution. They would fall under the “exploitation” category.
• Outlier friends: this group consists of $\alpha N$ friends with whom a person has weaker attachments, and these friends are thus outside the “effective bins” of the normal distribution. They would fall under the “exploration” category.

If the square root law from the first section of this article is further applied to this dichotomy of friends, the size of the outlier group can be estimated to be $\sqrt{N}$, which is the number of friends that a person doesn’t particularly like (left-tail region of the normal distribution), or that they are not particularly liked by (right-tail region of the normal distribution). That is, the fraction of these outlier friends will be:

$$\alpha = \frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}},$$

which means that the number of friends that simultaneously satisfy the IHHI criterion and the square root law of outliers is given by

$$N = \frac{1}{\alpha^2} = \frac{1}{\mathrm{erfc}^2\left(\sqrt{\pi/2}\right)} \approx 171.7.$$

In particular, the size of the group of stable friends (the “exploitation” group), which shall be identified as Dunbar’s Number $D$, is given by

$$D = (1-\alpha)\,N = \frac{\mathrm{erf}\left(\sqrt{\pi/2}\right)}{\mathrm{erfc}^2\left(\sqrt{\pi/2}\right)} \approx 158.6.$$

The numerical value obtained here is fairly close to the value 150 that is commonly used for Dunbar’s number.
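The arithmetic behind this value can be reproduced in a few lines. The sketch below assumes the reading used in this section: the outlier fraction equals the two-sided natural level of significance, $\mathrm{erfc}(\sqrt{\pi/2})$, and the number of outlier friends equals the square root of the total number of friends. The function and variable names are illustrative.

```python
import math


def dunbar_estimate():
    """Return (total friends, outlier friends, stable friends)."""
    alpha = math.erfc(math.sqrt(math.pi / 2.0))   # two-sided natural level
    total = 1.0 / alpha ** 2                      # from sqrt(total) / total = alpha
    outliers = math.sqrt(total)                   # square root law
    stable = total - outliers                     # equals (1 - alpha) * total
    return total, outliers, stable


total, outliers, stable = dunbar_estimate()
print(round(total, 1), round(outliers, 1), round(stable, 1))
```

The estimate is quite sensitive to the level of significance, since the total scales as the inverse square of the outlier fraction, so small changes in the assumed cutoff move the result noticeably.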

How should one interpret the origin of this number? The present paper obviously has not looked into the size of primate brains; it has relied only on a mathematical analysis. Yet, this analysis finds a number that is in the general vicinity of the commonly used value. Dunbar’s number may be a natural consequence if the human brain manages relationships in two layers: first at an individual level, and then at a group level. This is akin to how a country like the U.S.A. is organized: it has a federal government, but the country is also divided into states. This two-level split may ultimately be responsible for the occurrence of Dunbar’s Number. That is, if a person correlates and associates their friends into groups so they can better manage their relationships inside their brain, then Dunbar’s Number could arise as a natural consequence. As to why other primates do not reach the level of Dunbar’s Number in the number of their relationships, a possible explanation is that they have not fully developed this two-layer split inside their brains.

## An Application

The Square Root Law of Outliers is applicable to any situation of competition. For instance, in the context of evolution, the genetic expressions of individuals differ because of recombination, natural mutations, and epigenetic effects. Unusual genetic conditions may be interpreted as Mother Nature doing exploration instead of exploitation. This raises the question: which genetic conditions should be considered normal variations, and which should be considered outliers? The Square Root Law of Outliers provides one simple rule: if a genetic subgroup’s prevalence rate is above $1/\sqrt{N}$, then the condition is considered to be a part of normal variation, whereas if its prevalence rate is below $1/\sqrt{N}$, then it can be regarded as an outlier. Here, $N$ represents the size of a collective group of a population that competes with other collective groups. Within each collective group, members are assumed to reproduce randomly with other members. In human society, the unit of competition is typically the country, because within the same country people can move freely and form families more easily. In antiquity, the unit of competition could be human clans or tribes. Additionally, wars often happen between different clans, which provides a force of natural selection. The typical size of a country today is around 10 million people. That would mean that any genetic subgroup with a prevalence rate above $1/\sqrt{10^7} \approx 0.03\%$ should be considered a significant contributor. That is, such a subgroup probably carries some evolutionary reason for its existence.

Autism has a prevalence rate of 1 in 36 children [14], or about 3%. This prevalence rate is two orders of magnitude higher than the threshold value of 0.03%. Therefore, there are probably some good evolutionary reasons why Mother Nature has included this subgroup in humans. The same is true of color blindness, which affects about 2% to 4% of the general population (4% to 8% of boys) [15], and of bipolar disorder (lifetime prevalence rate of 4.4% among U.S. adults) [16]. On the other hand, congenital hypothyroidism has an incidence rate of 1:3000 to 1:4000 [17], or roughly 0.025% to 0.033%, which places it at or just below the threshold of about 0.03% and makes it suitable to be considered an anomaly. A condition such as Phelan-McDermid Syndrome has an occurrence rate of 2.5 to 10 per million births [18], or around 0.0005%, so it should be considered an anomaly as well. This separation between normal variations and anomalies can guide the approach to handling these different types of population subgroups. In fact, in the case of color blindness, modern society has largely accepted and accommodated color-blind citizens. In the case of autism, instead of “treatments,” perhaps one should try to explore the reasons behind Mother Nature’s design and provide alternative development paths for children with this genetic condition.
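The rule of thumb used in this section can be written as a small classifier. The sketch below is only illustrative: the function names and the "borderline" label for ranges that straddle the threshold are assumptions added here, the prevalence figures are the ones quoted in the text (using the 2% to 4% range for color blindness), and the population size of 10 million is the article's round number for a typical country.

```python
import math


def prevalence_threshold(population: float) -> float:
    """Square-root-law threshold 1/sqrt(N) for a competing group of size N."""
    return 1.0 / math.sqrt(population)


def classify(low: float, high: float, population: float) -> str:
    """Classify a prevalence range against the 1/sqrt(N) threshold."""
    threshold = prevalence_threshold(population)
    if low > threshold:
        return "normal variation"
    if high < threshold:
        return "anomaly"
    return "borderline"   # label introduced here for ranges straddling the threshold


COUNTRY_SIZE = 10_000_000   # typical country size used in the text
conditions = {
    "autism": (1 / 36, 1 / 36),
    "color blindness": (0.02, 0.04),
    "bipolar disorder": (0.044, 0.044),
    "congenital hypothyroidism": (1 / 4000, 1 / 3000),
    "Phelan-McDermid syndrome": (2.5e-6, 10e-6),
}

print(f"threshold = {prevalence_threshold(COUNTRY_SIZE):.4%}")
for name, (low, high) in conditions.items():
    print(f"{name}: {classify(low, high, COUNTRY_SIZE)}")
```

With these inputs, congenital hypothyroidism straddles the threshold (1 in 3,000 is slightly above it, 1 in 4,000 slightly below), which is a reminder that the law is a rule of thumb with a multiplicative factor of order one.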

The prevalence rates of color-blindness, autism, and bipolar disorder all fall in the general vicinity of 3-4%. One may wonder whether this is more than a coincidence. If Dunbar’s Number of 150 is indeed the size of a typical hunter-gatherer community, and if the square root law is applied to the inverse of the prevalence rate, then the critical population size for the three genetic conditions (color-blindness, autism, and bipolar disorder) would be around $(1/0.04)^2 \approx 600$ to $(1/0.03)^2 \approx 1{,}100$, which is larger than a hunter-gatherer community and about the size of early agricultural villages.
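The population scale implied by a given prevalence rate follows from inverting the threshold: a subgroup with prevalence $p$ reaches the square-root-law threshold once the competing group has roughly $(1/p)^2$ members. The sketch below, with illustrative function names, evaluates this for prevalence rates of 3% and 4% and, for comparison, gives the threshold prevalence for a community of 150.

```python
import math


def critical_population(prevalence: float) -> float:
    """Group size N at which the prevalence equals the 1/sqrt(N) threshold."""
    return (1.0 / prevalence) ** 2


def threshold_prevalence(population: float) -> float:
    """Square-root-law threshold 1/sqrt(N) for a group of size N."""
    return 1.0 / math.sqrt(population)


for p in (0.03, 0.04):
    print(p, round(critical_population(p)))
print(threshold_prevalence(150))
```

For a community of 150 the threshold is about 8%, so conditions with a prevalence of 3% to 4% would fall below it, whereas they reach the threshold in groups of several hundred to about a thousand people.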

This conjecture may be testable in the future, if more DNA samples from early humans can be recovered. Even without historical DNA evidence, studies on modern humans may also shed light on the plausibility of this conjecture. In the case of color-blindness, studies conducted a few decades ago on surviving hunter-gatherer societies [22] generally found that color blindness is rare in these populations. Similarly, there is some evidence that bipolar disorder may be rare among hunter-gatherers [23]. These findings lend support to this conjecture. It is perhaps not as easy today to conduct similar research regarding autism, due to the dwindling number of hunter-gatherer societies worldwide. Nonetheless, this remains an interesting topic to explore, as many researchers are looking at these genetic conditions from the perspective of evolution.

## Conclusion

In conclusion, this paper has introduced a Square Root Law applicable to estimating the size of outlier groups and stated the context of its derivation. By combining this law with the application of the Inverse Herfindahl-Hirschman Index to the binned normal distribution, an analytic expression for Dunbar’s Number is derived. The Square Root Law of Outliers is proposed as a simple rule of thumb to distinguish normal variations from anomalies for subgroups of genetic conditions. By applying this analysis to genetic conditions such as color blindness, autism, and bipolar disorder, this paper proposes that these conditions may have entered the human gene pool about 10,000 years ago to provide advantages in warfare. The application of the Square Root Law of Outliers extends beyond human genetics. It can be applied to general situations where limited attention or resources pose a constraint, such as prioritizing disease research studies or addressing political concerns from constituents. It can also be used to find an appropriate cutoff value for the definition of exceptional categories.

## References

1. Wikipedia, Birthday problem. https://en.wikipedia.org/wiki/Birthday_problem (2023).
2. Wikipedia, Central limit theorem. https://en.wikipedia.org/wiki/Central_limit_theorem (2023).
3. Wikipedia, Brownian motion. https://en.wikipedia.org/wiki/Brownian_motion (2023).
4. D. H. Maister, Centralisation of Inventories and the “Square Root Law”, International Journal of Physical Distribution, 6, 124-134 (1976).
5. Derek John de Solla Price, Little Science, Big Science – and Beyond (1986).
6. Hewa Saranadasa, The Square Root of N Plus One Sampling Rule. Pharmaceutical Technology, 27, 50-62 (2003).
7. Wikipedia, Student’s t-test. https://en.wikipedia.org/wiki/Student%27s_t-test (2023).
8. Wikipedia, Effective number of parties. https://en.wikipedia.org/wiki/Effective_number_of_parties (2023).
9. Wikipedia, Herfindahl–Hirschman index. https://en.wikipedia.org/wiki/Herfindahl%E2%80%93Hirschman_index (2023).
10. R. I. M. Dunbar, Neocortex size as a constraint on group size in primates. Journal of Human Evolution. 22, 469–493 (1992).
11. Wikipedia, Dunbar’s number. https://en.wikipedia.org/wiki/Dunbar%27s_number (2023).
12. I. Jolliffe, J. Cadima, Principal component analysis: A review and recent developments. Phil Trans. R. Soc. 374:20150202 (2016).
13. Wikipedia, Standard score. https://en.wikipedia.org/wiki/Standard_score (2023).
14. Matthew J. Maenner et al., Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years — Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2020. Surveillance Summaries. 72, 1–14 (2023).
15. Jennifer Birch, Worldwide prevalence of red-green color deficiency. Journal of the Optical Society of America A. 29, 313-320 (2012).
16. Kathleen R. Merikangas, Hagop S. Akiskal, Jules Angst, et al., Lifetime and 12-Month Prevalence of Bipolar Spectrum Disorder in the National Comorbidity Survey Replication. Arch Gen Psychiatry 64, 543-552 (2007).
17. Liu, L., He, W., Zhu, J. et al., Global prevalence of congenital hypothyroidism among neonates from 1969 to 2020: a systematic review and meta-analysis. Eur J Pediatr. 182, 2957–2965 (2023).
18. F. Cammarata-Scalisi et al., Clinical and Genetic Aspects of Phelan-McDermid Syndrome: An Interdisciplinary Approach to Management. Genes (Basel). 13,504. (2022).
19. M. J. Morgan, A. Adam and John Dixon Mollon, Dichromats detect colour-camouflaged objects that are not detected by trichromats. Proc. R. Soc. Lond. B. 248, 291-295 (1992).
20. Rachel Loomes, Laura Hull, William Polmear Locke Mandy, What Is the Male-to-Female Ratio in Autism Spectrum Disorder? A Systematic Review and Meta-Analysis. Am Acad Child Adolesc Psychiatry. 56:466-474 (2017).
21. Eugene Bergers, George Israel, Charlotte Miller, Brian Parkinson, Nadejda Williams, History 101: World History I (Malick and Gurian). https://human.libretexts.org/Courses/Harrisburg_Area_Community_College/History_101%3A_World_History_I_(Malick_and_Gurian)/01%3A_Prehistory/1.05%3A_Agriculture_and_the_Neolithic_Revolution (2023).
22. Alan E. Stark, On Random and Systematic Variation in the Prevalence of Defective Color Vision. Twin Res Hum Genet. 23, 278-282 (2020).
23. Markus J. Rantala, Severi Luoto, Javier I. Borráz-León, Indrikis Krams, Bipolar disorder: An evolutionary psychoneuroimmunological approach, Neuroscience & Biobehavioral Reviews. 122, 28-37 (2021).