The Missing Denominator: Should HIEs and Public Health use Benford’s Law to improve data quality and accuracy?

What if I told you that, in almost any large, naturally occurring data set, I could predict the frequency of leading digits to a fairly accurate degree? “Malarkey!” you might yell, and go back to liking articles on LinkedIn about AI. However, as mystical as it sounds, this is a numerical phenomenon called Benford’s Law.

Okay, let's back up a tad for those who “know” but would like a refresher.

In many large data sets that span several orders of magnitude, be it financial statements, historical weather almanacs, or even the geographical distribution of a species, a pattern occurs. This pattern, as depicted in Figure 1, is known as Benford’s Law. It states that the leading digit of a number in such a data set is more likely to be small, like 1, 2, or 3, than large, like 8 or 9.

Figure 1. Sample Frequency of Leading Digits
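For readers who want the numbers behind the curve, Benford’s Law gives the expected share of values with leading digit d as log10(1 + 1/d). The short Python sketch below just evaluates that formula for the digits 1 through 9. The function name `benford_expected` is my own label for illustration, not something from a particular library.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the theoretical distribution, one digit per line.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

Run as written, this shows roughly 30% of values starting with 1 and under 5% starting with 9, with the nine shares adding up to 100%.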

The applications for this natural number phenomenon are quite broad. Auditors and government agencies use it to flag possible fraud in financial statements, insurance claims, and tax returns. It has also been applied to detecting bot activity in social networks, locating potential election interference, and examining disease rates in populations.

Some of the more healthcare-applicable studies conducted with Benford’s Law involve cancer or other disease registries. Figure 2 depicts a study conducted across 43 cancer registries with more than 140,000 patients. The data demonstrated that incidence rates follow this pattern regardless of location and type. The red line depicts the Benford distribution, and the bars depict the observed leading digits in each registry. For charts where the bars stray from the line, the mismatch could indicate higher rates of incomplete data, rounding issues, or undefined data types.

Figure 2. Cancer incidence rates by registry. Theoretical (line) and observed distributions (columns) of first digits for all the analyzed incidence rates, by registries and populations (reg). 1: Malawi, Blantyre; 2: Argentina, Tierra del Fuego; 3: Brazil, São Paulo; 4: Ecuador, Quito; 5: USA, Virginia, Asian and Pacific Islanders; 6: USA, Nebraska, Black; 7: USA, Ohio; 8: USA, Vermont; 9: USA, Montana; 10: USA, Michigan; 11: USA, Georgia; 12: USA, Indiana, White; 13: USA, Missouri, White; 14: USA, NPCR (National Program of Cancer Registries, including 42 States); 15: USA, Colorado, Asian and Pacific Islanders; 16: USA, Arkansas, Black; 17: USA, Alabama, White; 18: USA, Arkansas, White; 19: USA, California, Asian and Pacific Islanders; 20: USA, Connecticut, Black; 21: USA, Virginia, Black; 22: USA, California; 23: India, Karunagappally; 24: Singapore, Malay; 25: Turkey, Edirne; 26: Israel, Jewish; 27: Japan, Hiroshima Prefecture; 28: Japan, Fukui Prefecture; 29: Israel; 30: France, Isère; 31: Germany, North Rhine – Westphalia; 32: France, Hérault; 33: UK, England; 34: Estonia; 35: Switzerland, St Gall-Appenzell; 36: Bulgaria; 37: Malta; 38: Ukraine; 39: Spain, Navarra; 40: Italy, Sondrio; 41: Germany, Brandenburg; 42: New Zealand, Other; 43: USA, Hawaii.
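To make the "observed" bars in a chart like this less abstract, here is a rough Python sketch of how first digits could be tallied from a list of incidence rates. It is an illustration only, not the method used in the study: the helper names `leading_digit` and `first_digit_shares` and the sample rates are made up for this example.

```python
from collections import Counter
from decimal import Decimal


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero/NaN/inf."""
    # Going through str() and Decimal avoids float rounding surprises.
    digits = Decimal(str(value)).as_tuple().digits
    return next((d for d in digits if d != 0), None)


def first_digit_shares(values):
    """Share of values starting with each digit 1-9 (zeros are skipped)."""
    counts = Counter(d for d in map(leading_digit, values) if d is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-zero values to analyze")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical incidence rates per 100,000, invented for illustration.
sample_rates = [12.4, 1.8, 0.93, 27.1, 3.3, 140.2, 1.1, 18.7, 2.6, 0.15]
for digit, share in first_digit_shares(sample_rates).items():
    print(f"{digit}: {share:.0%}")
```

Ten values are far too few to say anything about Benford conformity; a real registry check would run the same tally over thousands of rates.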

How can this approach be leveraged more effectively in health technology? The challenge with large healthcare data sets, particularly public health data, is that the true size of the population being analyzed is never known with certainty.

Do you ever really know the true percentage of a county's population that is unhoused, or the true number of infections in a given region?

Unfortunately, the real number is obscured by interoperability challenges, health access issues, incomplete data entry, and varying coding interpretations. So, from a practical standpoint, we must assume this quantity will never really be known.

But this is exactly the problem that Benford's Law can help address. While we may never really know the 'denominator' when assessing the incidence rate in a population, we can gauge whether we are close.

By employing Benford's Law in a Health Information Exchange (HIE), we can evaluate the quality of data collected for a given region, such as a county, state, or health network. By employing it in Public Health organizations, we can compare regions and networks for disparities, enabling organizations to address gaps and correct reporting practices.
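As a rough sketch of what such a regional comparison might look like, the Python below scores each region's values against the Benford distribution with a chi-square goodness-of-fit statistic and flags regions that exceed the usual 5% critical value for 8 degrees of freedom (15.507). Everything here is an assumption for illustration: the region names, the function names `benford_chi_square` and `flag_regions`, the choice of chi-square as the measure, and the synthetic data (powers of 2, which are known to track Benford closely, and an evenly spread range of numbers, which does not).

```python
import math
from collections import Counter
from decimal import Decimal

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
CHI2_CRITICAL_8DF_05 = 15.507  # chi-square critical value, 8 df, alpha = 0.05


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero/NaN/inf."""
    digits = Decimal(str(value)).as_tuple().digits
    return next((d for d in digits if d != 0), None)


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's Law."""
    counts = Counter(d for d in map(leading_digit, values) if d is not None)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no non-zero values to analyze")
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def flag_regions(values_by_region, threshold=CHI2_CRITICAL_8DF_05):
    """Map each region to (statistic, flagged_for_review)."""
    results = {}
    for region, values in values_by_region.items():
        stat = benford_chi_square(values)
        results[region] = (stat, stat > threshold)
    return results


# Synthetic stand-ins for two regions' reported figures (not real data).
values_by_region = {
    "region_a": [2**k for k in range(1, 201)],  # close to Benford
    "region_b": list(range(100, 1000)),         # evenly spread first digits
}
results = flag_regions(values_by_region)
for region, (stat, flagged) in results.items():
    print(f"{region}: chi-square = {stat:.1f}, review = {flagged}")
```

A flag here is only a prompt to look closer, not proof of a problem. Chi-square also grows more sensitive as sample sizes increase, so very large data sets may warrant a different measure or threshold.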

However, methods to structure and analyze data according to a Benford distribution are not self-evident. While tax companies and financial institutions have developed indicators for fraud and data quality, these methods are less developed, if even present, in the health technology industry.

As data sharing becomes more prominent and widespread across states and regions, tools that can improve quality and address disparities will be essential to the utility of our national health infrastructure.

Lastly, Benford's Law is not meant to solve the complex data challenges inherent in Public Health organizations or HIEs, but rather to serve as a pelorus (a simplified compass used in marine navigation). It is merely a tool that can point you to something that may require further investigation. However, even a simple course change at sea can be the difference between finding port and remaining aimlessly adrift. So, if you find yourself at the helm of a large data set, perhaps this mathematical instrument can help navigate your organization out of the doldrums of data disparities.


Note: The mathematical application of Benford's Law requires a solid knowledge of statistics and is beyond the scope of this article. However, the organizations I reviewed that do apply Benford's Law use common tools like Excel or R to gather insights.

