IPUMS provides demographic data for international COVID-19 research

By Lara Cleveland

Since the onset of the COVID-19 outbreak, researchers across the globe have been accessing census microdata from IPUMS International for COVID-19-related research. Scholars at universities from the U.S. to Nepal, Colombia to Belgium, Nigeria to China, and elsewhere have used IPUMS data to assess population dynamics contributing to COVID-19 vulnerability or spread. Divisions of the United Nations, World Bank, and other policy research institutes have similarly accessed IPUMS census data for COVID-19 response and relief efforts.

IPUMS International harmonizes and disseminates household-level microdata census samples from more than 100 countries. Access to microdata is essential for rapid response in new areas because of its analytic flexibility. Researchers needing to build custom tables or construct variables for complex modeling suited to specific research questions can only do that with microdata. Of particular interest for research on population dynamics of COVID-19 is information about the age structure of the population, household living arrangements (household size, intergenerational co-residence, etc.), indicators of health vulnerability (age, work status, housing conditions, disability, etc.), healthcare workforce distribution, and migration patterns. IPUMS International census samples also include valuable subnational geographic identifiers at the first and second administrative levels, which are especially useful for highlighting particular regions or localities of vulnerability.
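As a rough illustration of the kind of custom variable that only microdata allow, the sketch below derives household size and a simple intergenerational co-residence flag from person-level records. The record layout, the age cutoffs (65 and 20), and the function name are assumptions made for this example; they are not IPUMS variable names or official definitions.

```python
from collections import defaultdict

# Toy person records: (household id, age). Hypothetical data, not an IPUMS extract.
persons = [(1, 70), (1, 35), (1, 8), (2, 68), (2, 66), (3, 30), (3, 4)]


def household_summary(persons, old_age=65, young_age=20):
    """Return household size and an intergenerational co-residence flag.

    A household is flagged when it contains at least one person aged
    `old_age` or above and at least one person younger than `young_age`.
    Both cutoffs are illustrative assumptions.
    """
    ages_by_household = defaultdict(list)
    for household_id, age in persons:
        ages_by_household[household_id].append(age)

    summary = {}
    for household_id, ages in ages_by_household.items():
        has_older = any(a >= old_age for a in ages)
        has_younger = any(a < young_age for a in ages)
        summary[household_id] = {
            "size": len(ages),
            "intergenerational": has_older and has_younger,
        }
    return summary


summary = household_summary(persons)
for household_id, info in sorted(summary.items()):
    print(household_id, info)
```

A published table of household sizes could not answer the co-residence question, because the flag depends on the ages of specific people living together.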

The surge in new or renewing IPUMS International data user applications listing COVID-related research topics provides only a glimpse of how IPUMS is being used to aid in pandemic response efforts. Many existing IPUMS users are also turning their research focus toward pandemic response. For example, long-time IPUMS colleagues Esteve et al.1 analyzed how national age and co-residence patterns shape COVID-19 vulnerability using data entirely from IPUMS International (see figure below). The paper, published in the Proceedings of the National Academy of Sciences, simulates a COVID-19 outbreak in 10% of the population to investigate mortality vulnerability by country based on national age and co-residence patterns.

From Esteve et al., 2020: Estimated number of direct (dark) and indirect (light) deaths per 100,000 individuals if primary infections of specific age groups are avoided. Data are from 2010 census round. Individuals from each age group who were selected in the 10% random draw are recoded as not infected before calculating direct deaths and simulating within household transmission.

Over the summer, researchers at IPUMS calculated a series of population-based indicators from census microdata for the UNFPA’s COVID-19 Population Vulnerability Dashboard2, which maps demographic characteristics contributing to COVID-19 vulnerability (outbreak, spread, or mortality). The resulting dashboard went live in July and includes additional data layers from WorldPop, the World Health Organization (WHO), and the Johns Hopkins University Coronavirus COVID-19 Global Cases Dashboard. The WHO measures of healthcare workforce prevalence also relied, in part, on occupational information from the census samples in IPUMS International. The interactive dashboard aims to provide health workers, policy makers, and the public with important information about vulnerable populations to aid in preparedness and response to COVID-19.

UNFPA’s COVID-19 Population Vulnerability Dashboard

We would love to hear about your COVID-19-related research! Send us a note (ipums@umn.edu), and make sure to submit your work to our bibliography. It is particularly difficult for us to track down work on dashboards, indicators, and policy briefs using IPUMS data. We depend upon you to let us know how you have used the data. All of your research products help us give back to our national data partners and secure future funding.

Census microdata samples from IPUMS International belong to the countries that partner with us. IPUMS adds standardization, harmonization, and documentation work in order to save researchers countless hours of data preparation. We are grateful to the many countries who choose to share their data so that we can all use it for good!

  1. Esteve, Albert, Iñaki Permanyer, Diederik Boertien, and James Vaupel. 2020. “National age and coresidence patterns shape COVID-19 vulnerability.” Proceedings of the National Academy of Sciences, July 14, 2020, 117 (28): 16118-16120; first published June 23, 2020 https://doi.org/10.1073/pnas.2008764117 (https://www.pnas.org/content/117/28/16118)
  2. UNFPA. 2020. COVID-19 Population Vulnerability Dashboard (https://covid19-map.unfpa.org).

Cite us! Seriously though…

By Renae Rodgers and Kari Williams

Hi there IPUMS users! Let’s talk about citations. When using our datasets in your insightful, groundbreaking, interesting work, please cite us! 

Seriously though. 

Cite us. 

You wouldn’t steal a car, you wouldn’t rob a little old lady of her handbag, you wouldn’t base work on that of a colleague and not put their paper(s) in your reference section, right?!? Then don’t use IPUMS data and fail to mention it! 

To help you on your way, here are some answers to frequently asked questions:

Q:   Do I have to though? 

A:   Yes. Properly citing IPUMS data is part of the user agreement. Before you ever submitted your first extract, you agreed to do this!

Screenshot of citation agreement

Q:   I’ve mentioned IPUMS in the caption of my figures and tables, so I am good to go, right?

A:   Nope. Putting our URL in a footnote, endnote, or caption is insufficient. Name-checking IPUMS in your “Data and Methods” section is not enough. Just for good measure, we will mention that citing a paper by IPUMS staff about IPUMS data is not the same thing as citing a dataset.


Q:   What about talking about how much I love IPUMS on Twitter or naming my firstborn after this amazing data provider? Is that an appropriate substitute?

A:   [public radio voice] If you appreciate the resources that IPUMS provides, using the data and citing it is the best way to support us. Our ability to continue to provide this service is dependent on capable and intelligent users like you citing our datasets! Seriously, a core part of our funding depends on our ability to prove that the data infrastructure IPUMS offers is being used. If you want IPUMS to keep offering the latest data and developing new tools, we need you to cite us so we can demonstrate to our funders that IPUMS is useful. Citing us is the best way to support us (though we are keen to hear about your children with middle names based on your favorite IPUMS variables).


Q:   How do I cite IPUMS properly?

A:   We are so glad you asked! When you receive an email notification that your custom dataset from IPUMS is ready to download, it includes the citation! Each IPUMS data product has its own citation – be sure to use the citation associated with the IPUMS data that you used. If you use more than one IPUMS data product, cite all of them!

Q:   Okay, wait! I have one more. 

A:   Go for it.

Q:   What if…I deleted the extract email and didn’t make note of the citation? 

A:   Not a problem! For your convenience, we just happen to have this handy link of all the current IPUMS dataset citations with DOIs. You can also find each IPUMS dataset’s citation on the left menu of the homepage.

To those users who are diligent about citing IPUMS datasets, we thank you! If you have used IPUMS without citing it or committed one of the other faux pas above in the past, we hope you now have the instruction and incentive to do better going forward!

Overview of NHIS Data Collection, 1997-2018

By Julia A. Rivera Drew, Kari C.W. Williams, and Natalie Del Ponte

The IPUMS NHIS project offers integrated versions of the National Health Interview Survey (NHIS) data, the leading source of nationally representative information on the health of the U.S. population. The National Center for Health Statistics (NCHS) collects the NHIS data through face-to-face interviews covering information about health, health insurance coverage, health care utilization, socioeconomic characteristics, and demographics of all household members. The survey is representative of the civilian, non-institutionalized U.S. population, with annual samples ranging from 30,000 to 50,000 households and 75,000 to 100,000 people. NCHS has collected the NHIS annually since 1957 (with digital copies of the data available going back to 1963), making it the longest-running annual survey of health in the world.

Periodically, aspects of data collection – such as the sampling frame, oversampled populations, or questionnaire content – are updated to better capture the most pressing health concerns of Americans or shifts in the demographic makeup of Americans and where they reside within the U.S. Most of these changes are modest, reflecting changes in U.S. population composition and distribution detected in the most recent decennial census. However, 2019 heralded the largest change in NHIS data collection since 1997. In fall 2020, NCHS will release the 2019 public use data files, the first data collected under the newly redesigned NHIS. The upcoming release of the 2019 data warrants a look back at how NCHS collected the NHIS data over the 1997-2018 period.

1997-2018 at a Glance

The data collection design of the 1997-2018 NHIS was largely comparable over time. There were a few minor changes during this period, the largest taking place between 2005 and 2006 to update the sampling frame to reflect the 2000 Census and add an oversample of Asian persons. Most oversamples were discontinued in 2016 (see the IPUMS NHIS note on Sample Design for more information). Under the 1997-2018 design (illustrated in Figure 1), the NHIS was a sample of households, where each household could potentially contain multiple families. One representative from each family, the family respondent, provided demographic, health status, and health insurance coverage information about all family members. In addition to the data collected for all family members, interviewers randomly sampled one adult and one child per family to complete additional interviews (the “sample adult” and “sample child” questionnaires, respectively). Through this mechanism, the NHIS collected further information on topics such as Body Mass Index, mental health, access to health care, health behaviors, and (for adults) sexual orientation and details about paid employment. NCHS releases standalone data files for each of these content areas (households, families, family members, sample adults, and sample children) every year. IPUMS NHIS allows users to review variables from all content areas and include them in a single data extract.

Figure 1. 1997-2018 NHIS Data Collection

Illustration of sampling of data for NHIS

For IPUMS NHIS users interested in combining information collected on different parts of the survey, understanding the NHIS data collection process is important for two reasons. First, when users design analyses of the NHIS data, they must take into account the extent to which the overlap of topical supplements collected for sample adults and sample children varies by subject area and over time. Second, which variables analysts combine determines which sampling weight is most appropriate for analyses that utilize data from these different content areas.

Overlapping Sample Adult and Sample Child Content

Users interested in the rich topical content of NHIS may wish to design analyses that take advantage of the occasional and recurring supplements asked of sample adults and sample children. However, it is important to note that the items collected by the sample adult questionnaire are not necessarily also part of the sample child questionnaire, and vice versa. Even when similar topics are covered, the two questionnaires may not include identical measures. IPUMS NHIS combines sample adult and sample child measures into a single integrated variable wherever they overlap to make it easier for users interested in looking at both groups.

Additionally, because NCHS fields some supplements only in certain years, some topical combinations are not possible: certain supplements were never asked in the same year (e.g., the balance problems supplement never overlaps with the complementary and alternative medicine supplement). IPUMS NHIS users who add variables of topical interest to their data requests without confirming that they are available for all the relevant years may be confused to find missing values where they did not expect any.
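One way to guard against this is to check, before building an analysis, which years contain every variable of interest. The sketch below is a minimal illustration; the variable names and year lists are hypothetical placeholders, not actual IPUMS NHIS availability, which should be confirmed on the IPUMS NHIS website.

```python
def common_years(availability, variables):
    """Return the sorted years in which every requested variable is available."""
    year_sets = [set(availability[name]) for name in variables]
    if not year_sets:
        return []
    return sorted(set.intersection(*year_sets))


# Hypothetical availability table: variable name -> years fielded.
availability = {
    "supplement_a_item": [2008, 2016],
    "supplement_b_item": [2002, 2007, 2012],
    "core_item": list(range(1997, 2019)),
}

# Two supplements that were never fielded together share no years.
print(common_years(availability, ["supplement_a_item", "supplement_b_item"]))

# A supplement item combined with a core item is limited to the supplement years.
print(common_years(availability, ["supplement_a_item", "core_item"]))
```

An empty result signals that the combination cannot be analyzed jointly, no matter how the extract is built.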

Selection of Appropriate Sampling Weight

As described above, NHIS is a complex, multistage probability sample. Users must make use of sampling weights to produce population-representative point estimates. For information on producing correct standard errors and statistical tests, see the IPUMS NHIS user note on variance estimation. Because NCHS releases standalone data files for each content area, the original NCHS files offer multiple weight variables (at least one for each file). Most person-level analyses using IPUMS NHIS will use PERWEIGHT or SAMPWEIGHT.

The IPUMS NHIS variable PERWEIGHT corresponds to WTFA in the original NCHS data files. PERWEIGHT is appropriate for analyses that use variables collected for all family members. The IPUMS NHIS variable SAMPWEIGHT combines two separate weights, one for sample adults and one for sample children, from the original NCHS data files. SAMPWEIGHT reports the sample adult weight only if the person is the selected sample adult and the sample child weight only if the person is the selected sample child. SAMPWEIGHT is 0 for all other persons. SAMPWEIGHT is appropriate for analyses that include variables collected as part of the sample adult or sample child content of the questionnaire. In cases where both types of variables are included, users should apply the more restrictive of the two weights (SAMPWEIGHT in this case).
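The weight-selection rule above can be sketched in a few lines. This is an illustrative example only: PERWEIGHT and SAMPWEIGHT are the IPUMS NHIS weights described here, but the analysis variable names, the toy records, and the helper functions are hypothetical, and the weighted mean shown is a point estimate that does not account for the complex sample design needed for standard errors.

```python
def choose_weight(variables, sample_content_vars):
    """Pick SAMPWEIGHT if any variable comes from sample adult/child content."""
    if any(name in sample_content_vars for name in variables):
        return "SAMPWEIGHT"
    return "PERWEIGHT"


def weighted_mean(records, value_key, weight_key):
    """Weighted mean over records with a positive weight."""
    in_scope = [r for r in records if r[weight_key] > 0]
    total_weight = sum(r[weight_key] for r in in_scope)
    if total_weight == 0:
        raise ValueError("no records with positive weight")
    return sum(r[value_key] * r[weight_key] for r in in_scope) / total_weight


# Hypothetical family: one sample adult, one other adult, one sample child.
# "family_item" is asked about everyone; "sample_item" only of sample persons.
records = [
    {"family_item": 40, "sample_item": 1, "PERWEIGHT": 1000, "SAMPWEIGHT": 3000},
    {"family_item": 38, "sample_item": None, "PERWEIGHT": 1000, "SAMPWEIGHT": 0},
    {"family_item": 10, "sample_item": 0, "PERWEIGHT": 1500, "SAMPWEIGHT": 2000},
]

sample_content_vars = {"sample_item"}
weight = choose_weight(["family_item", "sample_item"], sample_content_vars)
print(weight)
print(weighted_mean(records, "sample_item", weight))
```

Because SAMPWEIGHT is 0 for persons who were not selected as a sample adult or sample child, filtering on a positive weight also drops the records where sample content was never collected.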

Look for a future post describing the 2019 NHIS redesign after it is released in the fall of 2020. Until then, you may be interested in these IPUMS NHIS user notes on Sample Design, Sampling Weights, and Variance Estimation in NHIS data.