Guidance for Pooling Multiple Years of NHIS Data

By Julia A. Rivera Drew

Introduction

Analysts commonly pool multiple years of National Health Interview Survey (NHIS) data to increase the sample sizes of particular subpopulations of interest, such as bisexual adults, immigrants, or pregnant women. The complex design of the NHIS, however, requires additional steps to correctly construct and analyze pooled NHIS datasets. Moreover, the planned changes to the NHIS design implemented in 2019, as well as changes made in response to the COVID-19 pandemic, require further special handling when combining multiple years of NHIS data. The objectives of this blog post are to: (1) share tips to correctly construct and analyze pooled NHIS datasets and (2) identify resources for more information.

Tips to Correctly Construct and Analyze Pooled NHIS Datasets

1. Create a pooled sampling weight to use with your pooled dataset.

In general, when pooling multiple years of NHIS data, you will need to create a new sampling weight to use with the pooled sample. To create this new sampling weight, divide the appropriate sampling weight by the number of years pooled from the same sample design period. The distinct NHIS sample design periods are 1963-1974, 1975-1984, 1985-1994, 1995-2005, 2006-2015, 2016-2018, and 2019-present. For example, to estimate the number of children living in families with low or very low food security (FSSTAT) using pooled 2020-2021 NHIS data (similar to this report), one would divide SAMPWEIGHT, the sampling weight identified under the “weights” tab for FSSTAT, by the number of years pooled from the same sample design period (in this case, two). The sum of the pooled weights then represents the average annual population size for the pooled time period, rather than the cumulative population size across the pooled years. For any given combination of variables, refer to the “weights” tab for each variable included in your analysis to help select the appropriate sampling weight.
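As a rough illustration of the weight adjustment described above, here is a minimal Python sketch. It assumes the extract has been loaded as a list of dictionaries keyed by IPUMS variable names, and that all pooled years fall in the same sample design period. The function name, the output name POOLEDWT, and the toy weight values are invented for this example; they are not part of the NHIS data.

```python
def add_pooled_weight(records, weight_var="SAMPWEIGHT", year_var="YEAR"):
    """Return copies of the records with POOLEDWT = weight / number of pooled years.

    Assumes every year in `records` belongs to the same sample design period.
    """
    n_years = len({r[year_var] for r in records})
    return [{**r, "POOLEDWT": r[weight_var] / n_years} for r in records]


# Toy records with made-up weights (not real NHIS values).
records = [
    {"YEAR": 2020, "SAMPWEIGHT": 8000.0},
    {"YEAR": 2020, "SAMPWEIGHT": 12000.0},
    {"YEAR": 2021, "SAMPWEIGHT": 9000.0},
    {"YEAR": 2021, "SAMPWEIGHT": 11000.0},
]

pooled = add_pooled_weight(records)
total = sum(r["POOLEDWT"] for r in pooled)
print(total)  # 20000.0
```

In this toy data each year's weights sum to 20,000, and the pooled weights also sum to 20,000: the average annual population size rather than the two-year cumulative total of 40,000.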

2. Do not pool 2018 and earlier data with 2019 and later data.

Due to significant changes introduced in 2019 to NHIS data collection, do not combine 2018 and earlier data with 2019 and later data. The changes introduced in 2019 (see our user note on the 2019 redesign) reduced or eliminated the ability to compare estimates using the 2019 and later data with estimates based on earlier years of data. Changes were made to the sampling, the questionnaire, and the approach to constructing sampling weights; these changes make it impossible to distinguish real change in phenomena under study between 2018 and 2019-forward from observed changes that arise from changes in survey methodology (for more information, please see our user note on the 2018 Bridge Test).
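One way to guard against accidentally pooling across the redesign is a small check run before building the dataset. The sketch below is only an illustration; the function name and error message are hypothetical, and the only fact taken from the guidance above is that 2018 and earlier data should not be combined with 2019 and later data.

```python
REDESIGN_YEAR = 2019  # first survey year under the redesigned NHIS


def check_poolable(years):
    """Return the sorted distinct years, or raise if they span the 2019 redesign."""
    years = sorted(set(years))
    if years[0] < REDESIGN_YEAR <= years[-1]:
        raise ValueError(
            "Do not pool 2018 and earlier data with 2019 and later data: "
            f"requested years {years[0]}-{years[-1]} span the 2019 redesign."
        )
    return years


print(check_poolable([2019, 2020, 2021]))  # [2019, 2020, 2021]
```

Calling `check_poolable([2017, 2018, 2019])` would raise a `ValueError` instead of returning a list.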

3. Any pooled sample that includes both 2019 and 2020 needs additional special handling.

To better account for COVID-related nonresponse in the weighting and estimation techniques for the 2020 NHIS sample, the National Center for Health Statistics (NCHS) opted to introduce a one-time longitudinal sample made up of sample adults who had also completed the 2019 NHIS interview. Roughly half of the sample adults interviewed in calendar quarters 3 and 4 of 2020 also completed the 2019 interview, a total of 10,415 adults.

The 2019-2020 longitudinal sample has two implications for researchers who wish to analyze a pooled sample including both the 2019 and 2020 samples. First, to correctly construct the pooled sample, analysts must drop the 2020 records for longitudinal sample members (those with a value of 1 on SALNGPRTFLG), thereby retaining only their 2019 records and eliminating duplicate records. Second, analysts should create a new sampling weight for 2020 before dividing by the number of years to produce the pooled weight. Specifically, the new 2020 sampling weight should equal the value of PARTWEIGHT for sample adults and the value of SAMPWEIGHT for sample children. Analysts should then divide each year's sampling weight, using the new weight for the 2020 records, by the total number of years pooled.
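The two steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not a definitive implementation: records are dictionaries keyed by IPUMS variable names; sample adults are identified as AGE of 18 or older (a simplification chosen for this example); POOLEDWT and the function name are invented; and all weights and the non-1 value of SALNGPRTFLG in the toy data are placeholders.

```python
def build_pooled_sample(records):
    """Drop 2020 longitudinal records, then attach a pooled weight to the rest."""
    # Step 1: drop the 2020 records of longitudinal sample members
    # (SALNGPRTFLG == 1), keeping only their 2019 records.
    kept = [
        r for r in records
        if not (r["YEAR"] == 2020 and r.get("SALNGPRTFLG") == 1)
    ]

    # Step 2: choose the base weight, then divide by the number of pooled years.
    n_years = len({r["YEAR"] for r in kept})
    pooled = []
    for r in kept:
        if r["YEAR"] == 2020 and r["AGE"] >= 18:
            base = r["PARTWEIGHT"]   # 2020 sample adults
        else:
            base = r["SAMPWEIGHT"]   # 2020 sample children and all other years
        pooled.append({**r, "POOLEDWT": base / n_years})
    return pooled


# Toy records; weights are made up, and 0 is a placeholder for "not flagged".
records = [
    {"YEAR": 2019, "AGE": 40, "SAMPWEIGHT": 9000.0},
    {"YEAR": 2020, "AGE": 41, "SALNGPRTFLG": 1, "SAMPWEIGHT": 7000.0, "PARTWEIGHT": 0.0},
    {"YEAR": 2020, "AGE": 35, "SALNGPRTFLG": 0, "SAMPWEIGHT": 7000.0, "PARTWEIGHT": 12000.0},
    {"YEAR": 2020, "AGE": 10, "SALNGPRTFLG": 0, "SAMPWEIGHT": 6000.0},
    {"YEAR": 2021, "AGE": 52, "SAMPWEIGHT": 9000.0},
]

pooled = build_pooled_sample(records)
weights = [r["POOLEDWT"] for r in pooled]
print(len(pooled), weights)
```

The duplicate 2020 record is removed before the years are counted, so the divisor here is three (2019, 2020, and 2021).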

Below is an illustration of how to create a pooled sample of the 2019-2021 NHIS data for a study examining the frequency of feeling depressed in the last year (DEPFREQ) by sexual orientation. To construct the dataset, pool together the 2019-2021 NHIS data to increase the sample size of adults who self-identify as lesbian, gay, or bisexual (values of 1 or 3 on SEXORIEN). If the 2019-2020 longitudinal sample is not accounted for, the initial pooled sample, which retains those with missing values on SEXORIEN, would include 978 adults who self-identified as lesbian, gay, or bisexual in the 2020 sample (see Table 1).

Table 1. Incorrect pooled 2019-2021 NHIS sample of adults 18+ by sexual orientation

Survey Year   Lesbian, Gay, or Bisexual   Straight/Heterosexual   Missing‡   Total
2019          865                         30,205                  927        31,997
2020          978                         29,765                  825        31,568
2021          1,178                       27,198                  1,106      29,482
Total         3,021                       87,168                  2,858      93,047

‡Persons reporting “something else” and “I don’t know the answer” on SEXORIEN were recoded to missing

However, as noted above, the 2020 sample includes 10,415 sample adults who were also interviewed in 2019. When pooling 2019 and 2020 for the same analysis, NCHS recommends that analysts exclude the 2020 responses for members of the longitudinal sample. The 2020 records of these people can be identified as those records in 2020 whose value of SALNGPRTFLG is 1. When we exclude these people, our actual (correct) sample size of adults self-identifying as lesbian, gay, or bisexual in 2020 is 667 (see Table 2).

Table 2. Correct pooled 2019-2021 NHIS sample of adults 18+ by sexual orientation

Survey Year   Lesbian, Gay, or Bisexual   Straight/Heterosexual   Missing‡   Total
2019          865                         30,205                  927        31,997
2020          667                         19,711                  775        21,153
2021          1,178                       27,198                  1,106      29,482
Total         2,710                       77,114                  2,808      82,632

‡Persons reporting “something else” and “I don’t know the answer” on SEXORIEN were recoded to missing

The actual (correct) pooled sample has 311 fewer adults self-identifying as lesbian, gay, or bisexual and 10,054 fewer adults self-identifying as straight/heterosexual than the initial (incorrect) pooled sample. All of the dropped records come from 2020, so these differences amount to 31.8% and 33.8%, respectively, of the initial 2020 counts.
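The differences between the two tables can be checked with a few lines of arithmetic. The short Python sketch below uses only the 2020 counts reported in Tables 1 and 2; the dictionary keys are labels chosen for this example.

```python
# 2020 row of Table 1 (incorrect) and Table 2 (correct).
incorrect_2020 = {"lgb": 978, "straight": 29765, "missing": 825}
correct_2020 = {"lgb": 667, "straight": 19711, "missing": 775}

# Records dropped per category, and their share of the initial 2020 count.
dropped = {k: incorrect_2020[k] - correct_2020[k] for k in incorrect_2020}
pct_dropped = {k: round(100 * dropped[k] / incorrect_2020[k], 1) for k in dropped}
total_dropped = sum(dropped.values())

print(dropped["lgb"], pct_dropped["lgb"])            # 311 31.8
print(dropped["straight"], pct_dropped["straight"])  # 10054 33.8
print(total_dropped)                                 # 10415
```

The total number of dropped records, 10,415, matches the size of the 2019-2020 longitudinal sample noted earlier.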

Note that there are other special considerations for analysts who wish to produce single-year estimates from the 2020 data and to analyze COVID variables introduced in calendar quarters 3 and 4 of 2020.

Additional Resources for More Information

Below is a compilation of useful in-depth IPUMS NHIS summaries and NCHS reports that provide more information on the appropriate analysis of pooled data, the 2019 redesign, and adjustments to the NHIS in response to the COVID-19 pandemic.

Relevant IPUMS NHIS user notes:

NHIS Documentation from NCHS: