Data Projects – Use It for Good

IPUMS MLP: Revolutionizing Linked Data

July 17, 2025 by mpcblog

By Etienne Breton

As researchers, we often ask questions we cannot answer due to lack of data. More intriguingly, however, there are questions we only think of asking once we encounter data that may answer them. Good data address existing problems; great data inspire new questions. The latest iteration of the IPUMS Multigenerational Longitudinal Panel (MLP) project, which links together records from the full count US census data, fits this description. Visit our data browser and project description for inspiration on research questions you did not know could be asked – and answered.

Full count census data offer unprecedented opportunities for social scientific research. Once harmonized, these data enable precise measurement of key demographic, economic, and social patterns across time and space. Researchers can observe entire populations over long periods and produce estimates virtually free of sampling error. Estimates can also be produced down to the smallest geographical units, allowing researchers to define and observe communities with an outstanding level of detail.

Perhaps even more powerfully, full count data have opened the possibility of automated record linkages across census years to construct millions of individual life histories and trace millions of families over multiple generations. These linked data speak compellingly to core research questions in the social sciences, including intergenerational mobility and the intergenerational transmission of socioeconomic characteristics; exhaustive descriptions of individual and family trajectories; internal migration patterns within small geographic units; long-term outcomes of early-life conditions; and many more.

IPUMS disseminates full count census enumerations for ten census years from 1850 to 1950. These data, covering over 800 million individual records, are the fruit of collaboration between IPUMS and the world’s two largest genealogical organizations — Ancestry.com and FamilySearch — to leverage genealogical data for scientific purposes. IPUMS MLP now offers longitudinal links between individuals and households enumerated in those ten censuses. As shown in the figure below, we offer 645 million links between census pairs in MLP’s current iteration. This amounts to more than 175 million people linked over two or more censuses.

Figure 1: Case Counts for Linked Census Pairs

Grid of decennial census pairs, 1850-1950. Cells in grid show number of links. — The IPUMS USA access system allows users to see a detailed count of the number of links between census pairs in the latest version of the MLP data. See the IPUMS MLP data description for more information about links.

Full count and linked census data can further be merged with datasets outside the census, including administrative data and surveys from the present day, to multiply research opportunities. Researchers have already been drawing on these new data to tackle ambitious questions. These include studies on the life-course impacts of New Deal policies on individual wellbeing and economic outcomes; on the role of declining kin availability in birth rates; on the late-life consequences of early-life exposure to lead; on the intergenerational mobility of immigrants over the long run; and many others.

Figure 2. Intergenerational mobility in the US from Connor et al. (2025)

Choropleth maps of continental US. Blue counties denote more upward mobility; yellow denotes lower mobility. — This figure shows two maps of the continental US taken from a recent publication by Connor et al. (2025). On the top map, the authors use MLP data to construct county-level estimates of father-to-son intergenerational mobility between 1904 and 1950. On the lower map, they calculate the same estimates from other data sources, for years 1978 to 2015. Comparing these two maps, the authors find that “[w]hile states in the central and northern regions exhibit rising upward mobility rates relative to the rest of the country, much of the South continues to perform poorly”.

Despite the enormous potential of these linked data to advance research, however, users must also remain aware of their limitations. Record linkage is inherently probabilistic and subject to non-random omissions, raising concerns about selection bias and generalizability. Yet these challenges also present important methodological opportunities – not only for causal and multilevel modelling, but also for improving the methods needed to create the links themselves.
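One common way to gauge the selection concern is to compare link rates across subgroups and to reweight linked records by the inverse of their group's link rate. The sketch below is a minimal illustration on invented records, not the MLP project's own method; the field names `group` and `linked` are hypothetical and are not MLP variables.

```python
from collections import defaultdict

def link_rates(records):
    """Return the share of baseline records that were linked, by group."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for rec in records:
        totals[rec["group"]] += 1
        if rec["linked"]:
            linked[rec["group"]] += 1
    return {g: linked[g] / totals[g] for g in totals}

def inverse_rate_weights(records):
    """Pair each linked record with the weight 1 / (its group's link rate)."""
    rates = link_rates(records)
    return [
        (rec, 1.0 / rates[rec["group"]])
        for rec in records
        if rec["linked"]
    ]

# Invented baseline population: renters are linked less often than owners.
records = (
    [{"group": "owner", "linked": True}] * 6
    + [{"group": "owner", "linked": False}] * 4
    + [{"group": "renter", "linked": True}] * 2
    + [{"group": "renter", "linked": False}] * 8
)

rates = link_rates(records)
weighted = inverse_rate_weights(records)

# Summing the weights within each group recovers the baseline group sizes.
weighted_totals = defaultdict(float)
for rec, weight in weighted:
    weighted_totals[rec["group"]] += weight
```

Unequal link rates across groups are a sign that the linked sample is not representative of the baseline population. Weights of this kind only correct for the characteristics used to define the groups, so they reduce the concern rather than remove it.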

We have only scratched the surface of the potential of full count and linked census data. These data provide fertile ground for formulating new and original research questions. They also allow researchers to revisit foundational research questions, which can now be answered with added depth, granularity and exhaustiveness. The field of publication opportunities using these data remains wide open. Studies published in major journals already attest to the data’s significance for scholarly contributions. In both academic and applied research settings, there is also great demand and momentum for improving the methods – chiefly machine learning models – needed to create these data.

If any of these possibilities piques your curiosity, consider yourself cordially invited to visit the IPUMS USA website for more information on full count census data, as well as the MLP project page for more information on the latest updates to MLP data and for the project’s next steps.

Working with Subnational Geographies in IPUMS Global Health

June 17, 2025 by mpcblog

By Divya Pandey and Anna Bolgrien

In a research project combining data from IPUMS MICS and IPUMS DHS, IPUMS Global Health staff examined trends in the relationship between open defecation and high infant mortality rates (IMR) in the Eastern Indo-Gangetic Plains. The project focused on selected bordering regions in Nepal, Bangladesh, and India. By analyzing these environmentally and agro-climatically comparable regions, the study aimed to isolate the impact of national and local policies on open defecation and infant mortality rates.

Figure 1: Regions included in the study

A map of the border and surrounding regions of Nepal, Bangladesh, and India that highlights the sub-national regions included in the study.

The study pooled data from IPUMS MICS and IPUMS DHS to look at trends over almost two decades. IPUMS DHS includes data for all three countries, and IPUMS MICS provides additional years of data for Nepal and Bangladesh. Since the study focused on selected bordering geographies, the authors worked with data from lower administrative levels—divisions in Bangladesh, states in India, and regions in Nepal. Leveraging the geography resources provided by IPUMS, the team used both spatially harmonized and sample-specific geography variables (learn more about IPUMS DHS geography variables and IPUMS MICS geography variables). Spatially harmonized geography variables identify geographic regions using a consistent spatial footprint to allow for the comparison of the same physical space over time. Sample-specific geography variables are not harmonized across time; as their name suggests, they use the geographic boundaries that are sample-specific or contemporaneous to the survey-year in a country.
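The difference between the two kinds of geography variables can be sketched with a small crosswalk. The example below is an illustration under assumptions: the region codes, the field names `geo_sample` and `geo_harmonized`, and the values are all invented, and the mean is unweighted where a real analysis would apply survey weights.

```python
from collections import defaultdict

# Hypothetical crosswalk: (survey year, sample-specific code) -> harmonized unit.
# Region 10 is split into regions 11 and 12 by the later survey, so all three
# codes map to the same harmonized unit "A".
CROSSWALK = {
    (2005, 10): "A",
    (2005, 20): "B",
    (2015, 11): "A",
    (2015, 12): "A",
    (2015, 20): "B",
}

def harmonize(rows):
    """Add a harmonized unit to each row, keeping the sample-specific code."""
    return [
        dict(row, geo_harmonized=CROSSWALK[(row["year"], row["geo_sample"])])
        for row in rows
    ]

def mean_by_unit_year(rows, value_key):
    """Unweighted mean of value_key for each (harmonized unit, year)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row["geo_harmonized"], row["year"])
        sums[key] += row[value_key]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Invented region-level values for two survey years.
rows = [
    {"year": 2005, "geo_sample": 10, "value": 0.8},
    {"year": 2005, "geo_sample": 20, "value": 0.5},
    {"year": 2015, "geo_sample": 11, "value": 0.6},
    {"year": 2015, "geo_sample": 12, "value": 0.4},
    {"year": 2015, "geo_sample": 20, "value": 0.3},
]

trend = mean_by_unit_year(harmonize(rows), "value")
```

Grouping on the harmonized unit compares the same physical area in both years, while the sample-specific code remains available for analyses that need the boundaries in force at the time of each survey.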

What is going on with the weighted counts in the January 2025 CPS?

May 6, 2025 (updated May 21, 2025) by mpcblog

By Kari Williams & Sarah Flood

The signature activity of IPUMS is data harmonization, or making variables interoperable across time, to facilitate pooling of multiple months or years of data, as well as comparative and trend analyses. It’s easy to get carried away in the magic of not needing to perform routine data cleaning and having documentation organized at the variable level, and perhaps miss some bigger picture considerations. The Current Population Survey (CPS) annual population controls adjustment is an excellent example.

Each January, the Census Bureau revises the CPS weights to incorporate new population controls, based on the Census Bureau’s updated population estimates. However, the Census Bureau doesn’t re-release previous weights for the CPS based on the new population controls. If you look at trendlines of weighted count estimates using CPS monthly data, you might notice a discontinuity between each December and January – these are the annual population control adjustments at work. In January 2025, the shift is particularly abrupt; this is because the 2024 vintage population estimates (i.e., the population controls for the 2025 CPS) reflect an improvement in the Census Bureau’s methodology for measuring net international migration.
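A quick way to see these breaks in a series of weighted counts is to compute month-to-month changes and compare the December-to-January change with the typical change in other months. The sketch below assumes a list of `(year, month, weighted_count)` tuples with no missing months; the numbers are invented for illustration and are not CPS estimates.

```python
import statistics

def monthly_changes(series):
    """Change in weighted count from the previous month.

    series: list of (year, month, weighted_count), sorted by date,
    with no missing months.
    """
    return [
        (curr[0], curr[1], curr[2] - prev[2])
        for prev, curr in zip(series, series[1:])
    ]

def january_breaks(series):
    """Keep only the December-to-January changes."""
    return [(y, m, d) for (y, m, d) in monthly_changes(series) if m == 1]

# Invented weighted counts (in thousands) around a new year.
series = [
    (2024, 10, 1000),
    (2024, 11, 1002),
    (2024, 12, 1004),
    (2025, 1, 1030),
    (2025, 2, 1032),
]

breaks = january_breaks(series)
# Typical change in months that are not affected by new population controls.
typical = statistics.median(
    [d for (_, m, d) in monthly_changes(series) if m != 1]
)
```

A January change far larger than the typical monthly change points to the population control adjustment rather than to a real shift in the population.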

Line chart showing a general upward trend from 2020-2025 with disruptions each January

Figure from Jed Kolko’s Population adjustments will cause the next jobs report to be misinterpreted and misconstrued.

IPUMS CPS Checks on Basic Monthly Data

April 9, 2025 by mpcblog

By Sarah Flood, Renae Rodgers, and Kari Williams

Federal data are critical for understanding much about the US population, from its size and composition to its health and employment. The Current Population Survey (CPS) is our nation’s official source of information about the labor force. At the beginning of each month, we eagerly await the first Friday when the Employment Situation Summary (aka the monthly jobs report) will be released (it isn’t just us, right??). The monthly snapshot of the US labor force serves as a bellwether for how our economy is faring.

The Wednesday after the jobs report is released, we at IPUMS clear the decks in preparation for the release of the CPS Basic Monthly Survey (BMS) by the Census Bureau. The CPS BMS is the individual-level data from which the jobs report is generated. Our goal is always to process these data as soon as they’re released by the Census Bureau so that we can deliver them to IPUMS CPS users as quickly as possible. Those who rely on CPS BMS data each month might be familiar with coping strategies while waiting for the data – obsessive page refreshing, some nervous pacing, maybe wondering why they haven’t yet been released (iykyk).

While quickly processing CPS Basic Monthly data is a priority, so, too, is ensuring data quality. Each month, we carefully inspect CPS BMS data at several points in our process. First, we review all of the variables for codes that are undocumented or have suspicious frequencies. Second, we rely on a suite of tools during our integration process that alert us to any codes in the data that we haven’t accounted for in our variable-level harmonizations. After harmonization, we compare univariate statistics from the newest month of data with those from the previous month. Generally, we expect very little change across months, and we have built tools that flag variable-level differences above a certain threshold as well as new codes on either end of the distribution.
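The month-over-month comparison can be sketched as follows. This is a simplified illustration, not the IPUMS tooling: the function names and the 0.05 threshold are assumptions, and the check reports new codes anywhere in the distribution rather than only at its ends.

```python
from collections import Counter

def shares(values):
    """Share of records falling in each code of a variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def compare_months(prev_values, new_values, threshold=0.05):
    """Report new codes, dropped codes, and codes whose share moved
    by more than `threshold` between two months."""
    prev, new = shares(prev_values), shares(new_values)
    return {
        "new_codes": sorted(set(new) - set(prev)),
        "dropped_codes": sorted(set(prev) - set(new)),
        "flagged": {
            code: (prev[code], new[code])
            for code in sorted(set(prev) & set(new))
            if abs(new[code] - prev[code]) > threshold
        },
    }

# Invented values of one variable in two consecutive months.
previous_month = [1] * 60 + [2] * 40
newest_month = [1] * 50 + [2] * 40 + [9] * 10

report = compare_months(previous_month, newest_month)
```

In this toy example, code 9 is reported as new and code 1 is flagged because its share moved from 0.6 to 0.5, while code 2 passes unflagged.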

Unlocking Spatial and Social Data with R: Introducing the R Spatial Notebook Series

April 4, 2025 (updated May 14, 2025) by mpcblog

By Kate Vavra-Musser

Introduction: What is the R Spatial Notebooks Project?

The R Spatial Notebooks Project is a series of R code notebooks, structured like a textbook, designed to guide users through the intricacies of data extraction, integration, cleaning, analysis, and visualization using R. The notebooks are specifically tailored for social science research and applications using spatial data. The modular textbook-style structure is designed for comprehensive skill development by working through sequences of notebooks. The project was developed through a partnership between the Institute for Social Research and Data Innovation (ISRDI), which houses IPUMS, and the Institute for Geospatial Understanding through an Integrated Discovery Environment (I-GUIDE). IPUMS provides census and survey data from around the world integrated across time and space. I-GUIDE is cyberinfrastructure that combines distributed geospatial data with computing for researchers, students, and policymakers.

The initial R Spatial Notebooks release includes roughly 20 freely-available notebooks on topics including IPUMS data extraction via API, accessing open-source data, data cleaning, foundational spatial data principles, exploratory data analysis, and mapping.

New Data! IPUMS International Spring 2025 Data Release

March 17, 2025 by mpcblog

By Derek Burk, Lara Cleveland, Jane Lee, Rodrigo Lovaton, and Sula Sarkar


Great news for IPUMS International (IPUMSI) users! Our ever-expanding census and survey data collection has just released new harmonized census samples from Honduras (2013), Kenya (2019), Malawi (2018), Mongolia (2010, 2020), and Mozambique (2017). We now have an average of 4.5 censuses per country. The Kenya census collection now spans 50 years!

This release also includes a large series of quarterly Labor Force Surveys from the Philippines (1997-2019). The 91 waves of the Philippines Labor Force Survey contain a total of 18 million person records.

Many thanks to the National Statistical Office partners in these countries for their ongoing contributions.

Tools for Combining Data Across IPUMS Global Health Surveys

March 3, 2025 (updated May 14, 2025) by mpcblog

By Miriam King, Devon Kristiansen, and Anna Bolgrien

IPUMS Global Health includes integrated data from three international health surveys: Demographic and Health Surveys (IPUMS DHS), Multiple Indicator Cluster Surveys (IPUMS MICS), and Performance Monitoring for Action (IPUMS PMA). All three surveys are nationally representative, primarily focus on low- and middle-income countries, and address issues related to the health and well-being of women and young children. These commonalities make combining integrated data across these data collections appealing. As Figure 1 shows, IPUMS DHS and IPUMS MICS cover different countries; combining them extends the geographic coverage of harmonized versions of data covering similar topics. Researchers can also combine data for those countries included in both IPUMS DHS and IPUMS MICS to provide additional observation points for time-series analyses.

Figure 1: Countries covered by IPUMS DHS and IPUMS MICS

World map with the countries included in IPUMS MICS and IPUMS DHS shaded in

Researchers who want to carry out cross-survey analyses face practical challenges. IPUMS imposes consistent variable names and codes within one kind of survey (DHS, MICS, or PMA); harmonized variable names and codes differ between these surveys. On each project’s website, the documentation for each variable highlights comparability issues to keep in mind when combining multiple samples, either within one type of survey or across survey types. IPUMS users must make separate customized data files from each database and merge those files. And subtle differences in question wording, skip patterns, geographic boundaries, and sampling procedures—such as MICS’ taking reports on child health from caretakers other than the biological mother—can introduce inconsistencies and inadvertent errors.
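The merge step described above can be sketched as renaming each extract's variables to shared names, recoding categories to a common scheme, and stacking the rows with a survey indicator. Everything named below is hypothetical: the variable names (`dhs_age`, `mics_toilet`, and so on) and the category codes are invented for illustration and are not actual IPUMS DHS or IPUMS MICS variables.

```python
# Hypothetical mappings from each extract's variable names to shared names.
RENAME = {
    "dhs": {"dhs_age": "age", "dhs_toilet": "toilet"},
    "mics": {"mics_age": "age", "mics_toilet": "toilet"},
}

# Hypothetical recodes so that category codes agree across surveys.
RECODE = {
    "dhs": {"toilet": {10: "improved", 20: "unimproved", 30: "none"}},
    "mics": {"toilet": {1: "improved", 2: "unimproved", 3: "none"}},
}

def to_common(rows, survey):
    """Rename and recode one extract's rows, tagging each with its survey."""
    out = []
    for row in rows:
        new_row = {"survey": survey}
        for old_name, value in row.items():
            name = RENAME[survey].get(old_name)
            if name is None:
                continue  # drop variables with no shared counterpart
            # Recode when a mapping exists; otherwise keep the value as-is.
            new_row[name] = RECODE[survey].get(name, {}).get(value, value)
        out.append(new_row)
    return out

# Invented rows standing in for two separately downloaded extracts.
dhs_rows = [{"dhs_age": 2, "dhs_toilet": 30, "dhs_only_var": 1}]
mics_rows = [{"mics_age": 3, "mics_toilet": 1}]

pooled = to_common(dhs_rows, "dhs") + to_common(mics_rows, "mics")
```

Keeping the `survey` indicator on every pooled row makes it possible to check whether a result is driven by differences in question wording or sampling between the surveys rather than by a real difference across countries or years.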