Use It for Good

Does 1 + 2 = 8? Automating QA/QC for Tabular Data

By Tracy Kugler and Tsu Zhu

The problem with OCR and numbers

To extract data tables from census reports only available as print documents, IPUMS IHGIS uses optical character recognition (OCR) software to automate the conversion of scanned images into digital representations of letters and numbers. OCR software has made great strides in accuracy for textual information by using dictionaries of known words to interpret uncertain letters. However, dictionaries do not help in distinguishing uncertain numerical digits. While a dictionary can suggest that the third character in “wh_t” should be an ‘a’ and not an ‘o’, there is no simple way to tell whether the third digit in “45_” should be a 3 or an 8. To ensure that IHGIS data are accurate, we must have confidence that each number has been recognized correctly and matches the number in the source document.

To address this gap, we developed an R package that leverages IHGIS structured metadata to identify logical relationships between cell counts and row/column totals and determine where cells don’t add up as expected. Often, a given cell participates in multiple relationships, which allows the package to use patterns among discrepancies to pinpoint and correct errors. The package can automatically identify and correct up to 95% of error cells, depending on the structure of relationships.
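As a rough illustration of how overlapping sum relationships can pinpoint a misread digit, here is a minimal Python sketch. It is not the IHGIS R package: the function name and the toy table are hypothetical, and it assumes the printed totals were themselves recognized correctly and that at most one cell is wrong.

```python
def find_and_fix_single_error(cells, row_totals, col_totals):
    """Locate one misread cell from row/column sum discrepancies.

    cells: list of rows of integers as read by OCR (hypothetical input shape).
    row_totals, col_totals: totals as printed in the source table.
    Returns ((row, col), corrected_cells), or (None, cells) if no single
    cell explains the discrepancies.
    """
    # Discrepancy = printed total minus the sum of the recognized cells.
    row_diff = [total - sum(row) for row, total in zip(cells, row_totals)]
    col_diff = [total - sum(col) for col, total in zip(zip(*cells), col_totals)]

    bad_rows = [i for i, d in enumerate(row_diff) if d != 0]
    bad_cols = [j for j, d in enumerate(col_diff) if d != 0]

    # One bad row and one bad column with the same discrepancy point to
    # the cell at their intersection.
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        if row_diff[i] == col_diff[j]:
            fixed = [row[:] for row in cells]
            fixed[i][j] += row_diff[i]
            return (i, j), fixed
    return None, cells


# Toy example: the source says 453, but OCR read the last digit as an 8.
cells = [[120, 458], [310, 87]]
row_totals = [573, 397]
col_totals = [430, 540]

location, fixed = find_and_fix_single_error(cells, row_totals, col_totals)
print(location, fixed[0][1])
```

In this toy table, row 0 and column 1 are both off by the same amount, so the cell where they meet is the only candidate, and subtracting the discrepancy restores the value. Real tables with several errors or misread totals need more careful pattern matching than this sketch attempts.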

Identifying relationships from structured metadata

The R package currently relies on structured metadata generated by earlier stages in the IHGIS data processing pipeline to identify sum and total relationships among rows and columns. After tables are OCR’ed from source documents, we use a customized markup framework to generate metadata. We then convert the marked-up files into CSV files with a standard structure, which serve as input to the quality assurance/quality control (QA/QC) process. The CSV files include hierarchical labels for categories on the columns and geographic units on the rows. Within the labels, blanks are used to indicate totals. The package identifies a column/row with a blank header cell as the sum of other columns/rows that share the same non-blank label(s) and have sub-category labels corresponding to the blank.
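The blank-label rule can be sketched in a few lines of Python. This is an illustrative reading of the rule described above, not the package's actual logic: the function name, the tuple representation of header labels, and the example categories are all assumptions.

```python
def find_sum_relationships(headers):
    """Infer which columns are totals of which other columns.

    headers: one tuple of hierarchical labels per column, with "" marking
    a total at that level (hypothetical representation of the CSV labels).
    Returns a list of (total_index, blank_level, component_indexes).
    """
    relationships = []
    for total_idx, labels in enumerate(headers):
        for level, label in enumerate(labels):
            if label != "":
                continue
            # Components match the total at every other level and carry a
            # non-blank label at the level where the total is blank.
            parts = [
                idx
                for idx, other in enumerate(headers)
                if other[level] != ""
                and other[:level] == labels[:level]
                and other[level + 1:] == labels[level + 1:]
            ]
            if parts:
                relationships.append((total_idx, level, parts))
    return relationships


# Hypothetical two-level column header: sex, then age group.
headers = [
    ("", ""),
    ("Male", ""),
    ("Male", "0-14"),
    ("Male", "15+"),
    ("Female", ""),
    ("Female", "0-14"),
    ("Female", "15+"),
]

for total_idx, level, parts in find_sum_relationships(headers):
    print(headers[total_idx], "= sum of", [headers[p] for p in parts])
```

Because a column can be blank at more than one level, the same column may appear in several relationships, which is what lets discrepancies be cross-checked against each other.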

Continue reading…

IPUMS Announces 2025 Research Award Recipients

IPUMS is excited to announce the winners of its annual IPUMS Research Awards. These awards honor both published research and nominated graduate student papers from 2025 that use IPUMS data to advance or deepen our understanding of social and demographic processes.

The 2025 competition awarded prizes for both published research and graduate student research (published or unpublished) in eight categories:

  • IPUMS USA: data from the U.S. decennial censuses (including full count data for 1850-1950) and American Community Survey Data
  • IPUMS CPS: monthly data from the Current Population Survey (back to 1976) and Annual Social and Economic supplement (back to 1962)
  • IPUMS International: harmonized data from censuses and labor force surveys around the world, contributed by more than 100 international statistical office partners, for 1960-forward
  • IPUMS Health Surveys: harmonized data from the U.S. National Health Interview Survey (NHIS) for 1963 onward and Medical Expenditure Panel Survey (MEPS) for 1996 onward
  • IPUMS Spatial: Census summary tables and GIS data from the U.S. (IPUMS NHGIS) and around the world (IPUMS IHGIS), and measures of contextual determinants of health (IPUMS CDOH)
  • IPUMS Global Health: harmonized health survey data from around the world, including harmonized versions of the Demographic and Health Surveys (IPUMS DHS), Multiple Indicator Cluster Surveys (IPUMS MICS), and the Performance Monitoring for Action (IPUMS PMA)
  • IPUMS Time Use: time diary data from the American Time Use Survey (IPUMS ATUS), historical and contemporary time use data from the U.S. (IPUMS AHTUS), and around the world (IPUMS MTUS)
  • IPUMS Excellence in Research: The IPUMS mission of democratizing data is strengthened by broad representation among our data users and the research that we highlight. This award was created to recognize the diversity of scholars doing innovative research with IPUMS data. This category includes submissions from all IPUMS data collections.

The award committees received and reviewed hundreds of nominations for our 2025 competition. From these nominations, the committees selected the 2025 honorees.

Continue reading…

IPUMS DHS Goes Global

By Miriam L. King and Sula Sarkar

IPUMS DHS now includes integrated variables for 84 countries (up from 51) and nearly 350 samples (up from 233), including new data from Latin America, Eastern Europe, Oceania, the Caribbean, and Central and East Asia. Providing DHS data in a form that facilitates micro-analyses across countries is one of IPUMS’ greatest strengths, so researchers will be excited to learn that they can now do even more! Our latest data release expands the scope of IPUMS DHS beyond its initial coverage of Africa, the Middle East, and South Asia and adds the latest samples for 12 countries previously in the database. Figure 1 shows the full geographic scope of IPUMS DHS and highlights newly added countries and previously included countries with new samples.

Figure 1: Countries included in IPUMS DHS

World map with countries that are new to IPUMS DHS, have new samples in IPUMS DHS, or have no new samples in IPUMS DHS filled in

Continue reading…

Adjust Monetary Values for IPUMS CPS

By Kari Williams with support from former IPUMS research staff member Danika Brockman

We love to extend useful functionality across multiple IPUMS data collections, so we were delighted to extend the Adjust Monetary Values (AMV) feature, which adjusts dollar values for inflation and was first developed for IPUMS USA, to IPUMS CPS. The initial release of the AMV feature in IPUMS CPS in 2023 provided adjustment for a limited number of variables. Late last year, we extended the feature to cover variables from the ASEC as well. This blog post provides a quick introduction to the AMV tool and step-by-step guidance for using the tool in IPUMS CPS – for full details on the feature, see our IPUMS working paper on the AMV feature.

The Basics

The AMV feature allows users to adjust the monetary variables in a customized dataset from IPUMS into constant dollars, so that all monetary variables for months and years of data in your downloaded data file are in comparable units. IPUMS CPS variables are adjusted to 2010 dollars using the Consumer Price Index for All Urban Consumers (CPI-U). When you add an inflation-adjusted version of a variable to your data extract, the IPUMS data access system applies the appropriate CPI-U adjustment factor for each year to the variable(s) you’ve selected and includes both the original variable and an inflation-adjusted version of that variable in your extract. The adjustment factor is only applied to codes that represent monetary values in the original variable. All missing data codes (e.g., NIU, “Refused”, “Don’t Know”, and “No response”) from the original variable are combined into a single NIU code consisting entirely of 9s in the adjusted version of the variable (which is two digits wider than the original variable).
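The logic described above can be sketched in Python. This is only an illustration under stated assumptions, not the IPUMS implementation: the function name, the missing-data codes, the rounding behavior, and the adjustment factor for 2000 are hypothetical placeholders (only the base-year factor of 1.0 for 2010 follows from the definition of 2010 dollars). Use the published CPI-U factors for real work.

```python
def adjust_monetary(value, year, factors, missing_codes, width):
    """Return an inflation-adjusted value, or the all-9s NIU code.

    value: original monetary value or missing-data code.
    factors: year -> multiplier converting that year's dollars to
             base-year dollars (hypothetical placeholder values below).
    missing_codes: codes in the original variable that are not dollar amounts.
    width: digit width of the original variable; the adjusted variable
           is two digits wider.
    """
    niu_code = int("9" * (width + 2))
    # Missing codes, and years without a published factor, all collapse
    # to the single NIU code in the adjusted variable.
    if value in missing_codes or year not in factors:
        return niu_code
    # Rounding to whole dollars is an assumption made for this sketch.
    return round(value * factors[year])


# Placeholder factors: 2010 is the base year; 1.25 for 2000 is made up.
factors = {2000: 1.25, 2010: 1.0}
# Hypothetical missing-data codes for a 7-digit income variable.
missing_codes = {9999997, 9999998, 9999999}

print(adjust_monetary(40000, 2000, factors, missing_codes, width=7))
print(adjust_monetary(9999998, 2000, factors, missing_codes, width=7))
print(adjust_monetary(40000, 2026, factors, missing_codes, width=7))
```

Keeping the original variable alongside the adjusted one, as the extract system does, means the distinct missing-data codes are still recoverable even though the adjusted version merges them.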

Note that IPUMS only adjusts monetary variables for years with a final published CPI-U. Final CPI-U values for a given year are typically published early in the next year (e.g., the 2025 CPI-U values are published in 2026). Notably, for basic monthly CPS data, current-year samples will not be available for adjustment because the final CPI-U will not yet have been published. However, the reference period for the ASEC is the previous calendar year (e.g., 2024 is the reference year for the 2025 ASEC); the adjustment factor for the reference year has been published by the time the Census Bureau releases the ASEC data each September and we integrate them into IPUMS CPS. One quick way to check whether the CPI-U value has been published for a given year is to consult the IPUMS CPS CPI99 documentation. We also update the IPUMS CPS revision history to note when we have extended the AMV tool to cover an additional year of data. Any adjusted variables in your extract for samples that are not yet available for monetary adjustment will consist entirely of the adjusted variable NIU code (i.e., a string of 9s).

Continue reading…

Linking children and adolescents to their mothers using IPUMS MICS

By Anna Bolgrien

IPUMS MICS offers hundreds of harmonized variables related to children’s health and wellbeing that allow for rich and innovative research. From the IPUMS MICS website, users can browse variables and create custom data extracts within a selected unit of analysis. In order to conduct many analyses, however, users will want to combine and link datasets relating to different units of analysis available in MICS.

IPUMS MICS menu of units of analysis for data browsing

For example, to investigate how child characteristics are related to characteristics of their mother, users will need to download and link data between the Children (either 0-4 or 5-17) unit of analysis and the Women unit of analysis.

IPUMS MICS provides instructions for linking across units of analysis as a user note. This user note lists the variables available as linking keys for each unit of analysis, and is a general guide for linking across the units, such as linking household characteristics with individual person records.

In this blog post, we provide more detailed information on how to link children and adolescents to their mothers. Similar logic can be applied to link children to fathers or other caregivers in the household. As IPUMS MICS requires Stata to conduct harmonization, we provide example code in Stata syntax.
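To make the linking idea concrete, here is a small Python sketch of a child-to-mother merge. It is not the Stata code from the post, and every field name in it (`sample`, `cluster`, `household`, `line`, `mother_line`) is a hypothetical stand-in; the actual linking keys are listed in the IPUMS MICS user note.

```python
def link_children_to_mothers(children, women):
    """Attach each child's mother's characteristics to the child record.

    children, women: lists of dicts, one per record. Field names are
    hypothetical placeholders for the real IPUMS MICS linking keys.
    """
    household_keys = ("sample", "cluster", "household")

    # Index women by household identifiers plus their own line number.
    women_by_key = {
        tuple(w[k] for k in household_keys) + (w["line"],): w for w in women
    }

    linked = []
    for child in children:
        # A child points to the mother through her line number in the household.
        key = tuple(child[k] for k in household_keys) + (child["mother_line"],)
        mother = women_by_key.get(key)
        merged = dict(child)
        merged["mother_found"] = mother is not None
        if mother is not None:
            for name, val in mother.items():
                if name not in household_keys:
                    merged["mom_" + name] = val
        linked.append(merged)
    return linked


# Toy records: one child whose mother was interviewed, one whose was not.
women = [
    {"sample": 1, "cluster": 10, "household": 3, "line": 2, "educ": "secondary"},
]
children = [
    {"sample": 1, "cluster": 10, "household": 3, "line": 4, "mother_line": 2, "age": 3},
    {"sample": 1, "cluster": 10, "household": 5, "line": 3, "mother_line": 1, "age": 1},
]

for record in link_children_to_mothers(children, women):
    print(record["age"], record["mother_found"], record.get("mom_educ"))
```

The sketch keeps unmatched children and flags them rather than dropping them, so the share of children without a linked mother record can be checked before analysis.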

Continue reading…

New Tool! ATUS-CPS Linking Counts

By Sarah Flood

The team at IPUMS is excited to introduce something brand new! ATUS-CPS Linking Counts is an interactive tool for exploring the number of ATUS respondents who can be linked to specific CPS months. We know that linkages between ATUS and CPS have great potential for enabling exciting new research, but we also know firsthand how hard it can be to wrap your head around the panel component of the CPS, the relationship between ATUS and CPS, and the many possibilities for linking them. Even researchers who have deep knowledge of the ATUS and CPS may still wonder whether there is a sufficient number of cases to conduct an analysis of interest. This new tool helps address all of these challenges. It very quickly allows you to view the number of ATUS respondents who should appear in each CPS month and determine if there is sufficient sample size for a particular application of linked ATUS-CPS data.

Linking ATUS and CPS data enables an incredible wealth of research questions. This tool allows users to specify and view different linking scenarios to assess the feasibility of various ATUS to CPS linkages. For example, you may want to investigate the relationship between food security in the CPS and shopping or eating-related behavior in the ATUS. This interactive tool would allow you to select only years of ATUS data that contain, for example, the Eating and Health module and view the CPS months in which the Food Security supplement was fielded to assess the sample sizes for your desired analysis. Figure 1 shows how you would select ATUS years of interest and find information about which ATUS modules were fielded in each year.

Figure 1. Selecting ATUS Years of Interest

drop-down menu displaying ATUS years with colored bubbles to indicate which ATUS modules are available in each year

Continue reading…

New Variables! IPUMS International Fall 2025 Data Release

By Rodrigo Lovaton Davila

IPUMS International recently added twenty-one new harmonized variables that expand the thematic coverage of the data collection and enable new possibilities for research. Most notably, the data release introduces harmonized variables representing sample-level information, including selected characteristics of the statistical operation and the sampling design (accessible in technical household). This information was previously available in the sample descriptions section, but is now also accessible through variables that can be included in data extracts. Read on for more details on these new sample-level variables and a few new work and household amenity variables!

New variables about the statistical operation describe whether the data correspond to a census or a survey; whether enumeration was de jure or de facto; the type of form received by respondents in the sample; and the month of data collection. The IPUMS International data collection currently includes 395 census samples, 233 labor force surveys, and 27 population surveys.

FORMTYPE allows users to identify whether the data for each sample consist of responses to a single, standard questionnaire applied to the entire population; responses to a short or long form, in a census that gathered more information from a sample of the population; or records derived from administrative registers (with no questionnaire used in data collection). Most datasets in the collection correspond to one standard questionnaire (79% of 395 census samples). For censuses where a short and a long form were applied, the samples in IPUMS typically correspond to the long questionnaire (78% of 78 samples), which includes additional questions and is richer for research purposes.

ENUMTYPE indicates whether the enumeration was de jure or de facto, an important distinction for understanding how the population was counted in the census operation. Some censuses combine de jure enumeration (usual residents) with de facto enumeration (those present on the census reference date, whether resident or visitor), and this new variable reflects that combination. Importantly, users can work with the existing variable RESIDENT to eliminate double-counting of persons who were enumerated both at their permanent residence and at the residence they were visiting on census night. ENUMMO complements the variable YEAR to provide a more accurate indicator of the timing of data collection.

Continue reading…