U.S. residents who fill out the census questionnaire every 10 years are told their answers will remain confidential. But that promise from the U.S. Census Bureau comes with a statistical caveat: The stronger the privacy shield, the less accurate the data.
That trade-off, long familiar to researchers, became a hot-button issue after the government adopted a new approach to protecting privacy for the 2020 census. Many demographers worried the new method would degrade the quality of census data, which are used not just for academic studies, but also for drawing congressional districts and allocating federal funds. But neither the agency nor its critics knew by how much, kicking off a fierce debate over how the new approach compared with the previous method.
Now, a study appearing this week in Science Advances provides the first independent hard numbers to answer that question. Experts say the findings could help the Census Bureau do a better job on the next census in 2030. They could also influence high-stakes legal battles over voting rights.
Here’s a guide to understanding this highly technical analysis and why it matters.
Why are census data important?
Researchers use census data to paint a picture of the country’s changing demographics. Government officials at all levels use the information to determine the population eligible for federal programs ranging from health, nutrition, and housing benefits to highway construction.
But the key reason the federal government has conducted a census every 10 years since 1790 is to meet a requirement in the U.S. Constitution that seats in the now 435-member U.S. House of Representatives be allocated according to the population of each state. To draw districts of roughly equal size, election officials need to know exactly where people live, down to the smallest voting unit. Census data are also used to ensure compliance with federal statutes such as the 1965 Voting Rights Act, which requires that election maps don’t discriminate against certain minority groups or unfairly dilute their ballot clout.
What prompted this study?
Demographers have always known about the trade-off between accuracy and privacy. But they learned to work around the distortions introduced by the privacy method the Census Bureau used in the 1990, 2000, and 2010 censuses.
That approach is called swapping. It selects a resident's responses to questions about age, race, ethnicity, and household characteristics in one block and exchanges them for similar responses from someone in another block. (The bureau divides the country into some 11 million geographic units, called census blocks, which have a median size of 23 people, and then aggregates blocks to tally up larger areas.) Researchers assumed that swapping was applied to people with unusual demographic characteristics that make them more vulnerable to being identified. But Census officials never revealed how often they used the tool.
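The key property of swapping is that it shuffles attributes between blocks without changing how many people each block contains. The bureau has never published its actual matching criteria or swap rates, so the following is only a toy sketch of the idea; the function name, data layout, and the notion of a flagged "at-risk" list are illustrative assumptions, not the bureau's method.

```python
import random

def swap_households(blocks, at_risk):
    """Toy sketch of record swapping (NOT the Census Bureau's algorithm).

    blocks:  dict mapping block name -> list of household attribute records
    at_risk: list of (block, index) pairs flagging records assumed to be
             identifiable (e.g., demographically unusual for their block)

    Pairs of at-risk records are exchanged across blocks. Each block's
    record COUNT is untouched, so block-level population totals survive.
    """
    swapped = {name: list(records) for name, records in blocks.items()}
    pairs = list(at_risk)
    random.shuffle(pairs)  # pair flagged records at random for the sketch
    for (b1, i1), (b2, i2) in zip(pairs[::2], pairs[1::2]):
        # Exchange the attribute records; no record is added or removed
        swapped[b1][i1], swapped[b2][i2] = swapped[b2][i2], swapped[b1][i1]
    return swapped

# Illustrative data: two blocks, two records flagged as at-risk
blocks = {"A": ["hispanic", "white"], "B": ["black", "asian", "white"]}
result = swap_households(blocks, [("A", 0), ("B", 1)])
```

Note that the overall multiset of responses is also preserved: swapping moves records around but never invents or deletes one, which is why larger-area tallies stay exact.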
In preparing for the 2020 census, however, agency officials decided that swapping wasn’t good enough to preserve the promise of absolute confidentiality. They said their research had shown that a determined data hacker might be able to figure out an individual’s identity by combining census data with information from other publicly available databases.
So, Census officials replaced swapping with another statistical tool, called differential privacy. It adds varying amounts of statistical “noise” to every piece of data based on its perceived vulnerability, with more vulnerable data getting more noise.
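In its textbook form, differential privacy works by adding random noise drawn from a known distribution, scaled so that stronger privacy (a smaller privacy parameter, usually written epsilon) means more noise. The sketch below uses the classic Laplace mechanism for illustration; the 2020 census actually used a discrete Gaussian mechanism inside its TopDown algorithm, and the parameter values here are arbitrary, not the bureau's.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism sketch: return a differentially private count.

    sensitivity: how much one person's data can change the count (1 here).
    epsilon:     the privacy parameter; smaller epsilon = stronger
                 privacy guarantee = larger noise scale.
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(0.0, scale)

# A block of 23 people released under two privacy levels (illustrative)
loose = noisy_count(23, epsilon=1.0)   # less noise, weaker privacy
strict = noisy_count(23, epsilon=0.1)  # more noise, stronger privacy
```

Because the noise is drawn from a mean-zero distribution, errors average out across many blocks; the trouble described later in this article arises at individual small blocks, where a single draw can swamp the true count.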
How was the study done?
In a bid to better understand how the switch to differential privacy might affect data quality, a group of researchers asked the bureau in 2021 to release its so-called noisy measurement file (NMF) from the 2020 census. The NMF contains tweaks made to the raw data by applying differential privacy, but before the bureau cleaned up the data and released a “postprocessed” file. The bureau refused, so in October 2021 the researchers sued.
The lawsuit was dismissed in April 2023 after Census officials agreed to hand over data files from the 2010 census in which swapping had been applied, as well as files from a series of experiments they did to learn what happens to census data when differential privacy is used. (The tests were run on files from 2010, because the 2020 census hadn’t been carried out.)
Analyzing those files allowed a team of scholars from Harvard, New York, and Yale universities to measure the impacts of each method on the accuracy of the census data as well as to compare the two methods. The study represents “the first comprehensive, public analysis of the impact of both approaches to disclosure avoidance on the quality of the data,” says Harvard political scientist Stephen Ansolabehere, who was not involved in the work.
What did they find?
The study concludes that differential privacy and swapping were equally effective when it came to preserving the accuracy of data on larger population groups—such as at the state level. But at the smallest geographic unit, the census block, differential privacy resulted in larger errors and greater variation. The impact was most severe among Hispanic residents and multiracial populations, with the magnitude of the error occasionally exceeding the total number of minorities in those units. For example, a block with three Hispanic residents might appear to have zero or six Hispanic people after statisticians applied differential privacy.
“What they have shown is that the true cost of what the Census Bureau is doing now compared to [using] swapping comes at the block level,” Ansolabehere explains.
Differential privacy distorts the smallest groups more because the two methods are applied differently. Census officials decided to retain the overall and voting-age populations of the affected geographic unit when swapping responses, providing certainty for both demographers and election officials redrawing voting districts. That is, a block with 23 people still has 23 people after swapping.
But they didn’t impose that requirement for differential privacy. That meant the noise added could affect the total population of the block. It can also produce such illogical results as a negative number of residents, groups of children living without an adult, or occupants on a block with no recorded housing units.
To prevent a public backlash to such oddities, Census officials eliminate those nonsensical outliers, such as by changing negative numbers to zero, before releasing a revised file to the public. But the study shows such adjustments actually magnify the distortion.
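The mechanism behind that magnified distortion is easy to demonstrate: raw noise is mean-zero and averages out, but clipping negative counts to zero removes only the downward errors, pushing small counts upward on average. A toy simulation, using an arbitrary Laplace noise scale rather than the bureau's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

true_count = 3  # e.g., three Hispanic residents on one block
noise = rng.laplace(0.0, 2.0, size=100_000)  # illustrative scale

raw = true_count + noise        # unbiased: over many draws, errors cancel
clipped = np.maximum(0.0, raw)  # post-processing: no negative residents

print(round(raw.mean(), 2))      # ≈ 3.0, the true count
print(round(clipped.mean(), 2))  # > 3.0: truncation biases small counts up
```

Large counts are barely affected, because noise almost never drives them below zero; it is precisely the smallest counts, like the three-person example above, where the clipped average drifts away from the truth.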
One advantage of differential privacy over swapping is that the random noise it injects “has very well-known statistical properties,” explains California Institute of Technology social scientist Jonathan Katz. “That was supposed to be better than swapping, where the distortions are harder to correct. And Census officials would have been right if they hadn’t truncated values that were negative.”
“The paper describes how bad things can get” when those values are truncated, he adds. “And the answer is, pretty bad.”
The study did show that the impacts of differential privacy were no worse than errors produced by other factors, such as chronic undercounting of hard-to-reach groups, ambiguous answers, or mistakes in entering data. (The 2010 census missed an estimated 16 million people, for example, some 5% of the total U.S. population, and 10 million people were double counted.)
But that finding is hardly reassuring, says study co-author and statistician Cory McCartan, who will move to Pennsylvania State University next month. “It’s fair to say [that] at … the national, state, and county level the undercount errors are much bigger than anything introduced by differential privacy,” he says. “But at the block level, certainly the privacy errors are now as big or bigger than these other sources. And Census introduced” those errors, he adds.
What impact might those errors have?
Ansolabehere, who studies election systems, worries inaccuracies at the smallest geographic units could affect how courts decide whether voting rights laws that require equal protection for minority groups have been met when policymakers draw election boundaries at all levels. “If I go into court and the opposing side says the data at the precinct level are messy and can’t be trusted, that could undermine my case,” he says.
One surefire way to improve accuracy is for the Census Bureau to release less data from the decennial census. That would reduce the amount of noise that must be injected into every response to prevent reidentification. But it would also restrict the ability of researchers to understand demographic changes by sifting through the trillions of data points now available.
Releasing less data would also be a blow to anyone doing surveys drawn from a representative sample of U.S. residents, Ansolabehere says. “The entire survey research industry depends on census data to define what its samples should look like,” he explains. “If too few numbers are released, I have to fill in the gaps. And I may not get it right.”
Is the study likely to settle the debate over the two methods?
Probably not, because supporters of each method claim it backs their positions.
John Abowd, a former chief scientist of the Census Bureau who spearheaded the agency’s adoption of differential privacy, says, “We never expected [differential privacy] to be better in the sense of both more accurate and more protective of confidentiality, because that’s impossible” from a statistical perspective. But what the agency did expect, he says, is that the data “would be fit for [its congressionally mandated] purpose: giving accurate estimates of the overall population and the percentage of that population from large minority groups” that can be used to reapportion House seats every decade.
But one leading critic of differential privacy, University of Minnesota demographer Steven Ruggles, says the study supports his view that the agency should return to swapping. “It does minimal damage to accuracy, preserves accurate counts of the number of people and the number of adults at every level of geography, and can be effectively targeted to focus on people at high risk of disclosure,” he explains in a recent paper.
What might the study mean for the 2030 census?
The Census Bureau has until 2028 to decide what type of privacy system it will use in the 2030 census. And it’s not assuming the status quo is fine. Although Census Bureau officials declined to comment on the study itself, a spokesperson says the agency is conducting research aimed at both “improving the existing disclosure avoidance system … and evaluating alternative approaches that might outperform” what was used in 2020.
Harvard statistician Kosuke Imai, a co-author of the new study, would like to see the Census Bureau release the NMF “in a more usable format” than what his team had to work with. More hands would generate more information for the bureau to incorporate into its decision-making, he says.
Abowd, who retired from federal service last year and is an emeritus professor at Cornell University, agrees that "there are things that can be done to make those data easier to use." But, he adds, "given its limited resources, the Census Bureau doesn't have any obligation to fund those improvements."
Overall, the new study highlights how much is at stake, says Katz, who studies redistricting. “Unless you’re a nerd like me who uses the Census data all the time, most Americans don’t even know this fight has been going on,” he says. “But the outcome could have a major impact on their lives.”
