Does 1 + 2 = 8? Automating QA/QC for Tabular Data

By Tracy Kugler and Tsu Zhu

The problem with OCR and numbers

To extract data tables from census reports only available as print documents, IPUMS IHGIS uses optical character recognition (OCR) software to automate the conversion of scanned images into digital representations of letters and numbers. OCR software has made great strides in accuracy for textual information by using dictionaries of known words to interpret uncertain letters. However, dictionaries do not help in distinguishing uncertain numerical digits. While a dictionary can suggest that the third character in “wh_t” should be an ‘a’ and not an ‘o’, there is no simple way to tell whether the third digit in “45_” should be a 3 or an 8. To ensure that IHGIS data are accurate, we must have confidence that each number has been recognized correctly and matches the number in the source document.

To address this gap, we developed an R package that leverages IHGIS structured metadata to identify logical relationships between cell counts and row/column totals and determine where cells don’t add up as expected. Often, a given cell participates in multiple relationships, which allows the package to use patterns among discrepancies to pinpoint and correct errors. The package can automatically identify and correct up to 95% of error cells, depending on the structure of relationships.
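To make the idea concrete, here is a minimal Python sketch of the cross-checking logic; it is not the IHGIS R package itself. It assumes a simple table whose last column holds row totals and whose last row holds column totals, and it only handles the case where a single interior cell was misread. The function names are hypothetical.

```python
def find_discrepancies(table):
    """Return ({row: diff}, {col: diff}) where diff = reported total - sum of parts.

    Assumes the last column holds row totals and the last row holds column totals.
    """
    body = table[:-1]
    n_cols = len(table[0]) - 1  # skip the column of row totals
    bad_rows = {}
    for i, row in enumerate(body):
        diff = row[-1] - sum(row[:-1])
        if diff != 0:
            bad_rows[i] = diff
    bad_cols = {}
    for j in range(n_cols):
        diff = table[-1][j] - sum(row[j] for row in body)
        if diff != 0:
            bad_cols[j] = diff
    return bad_rows, bad_cols


def correct_single_cell(table):
    """Fix one interior cell when exactly one row and one column fail by the same amount."""
    bad_rows, bad_cols = find_discrepancies(table)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        (i, row_diff), = bad_rows.items()
        (j, col_diff), = bad_cols.items()
        if row_diff == col_diff:
            fixed = [list(r) for r in table]
            fixed[i][j] += row_diff  # the shared discrepancy is the size of the error
            return fixed, (i, j)
    return table, None  # ambiguous pattern: leave for manual review


# Hypothetical OCR output: the true value 3 in row 0, column 1 was read as 8.
ocr_table = [
    [1, 8, 4],
    [4, 5, 9],
    [5, 8, 13],
]
fixed_table, fixed_cell = correct_single_cell(ocr_table)
print(fixed_cell, fixed_table[0][1])
```

Because the bad cell sits in both a failing row and a failing column, and both are off by the same amount, the intersection identifies the cell and the discrepancy gives the correction. Errors in total cells, or several errors in one row, produce other patterns that this sketch deliberately leaves uncorrected.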

Identifying relationships from structured metadata

The R package currently relies on structured metadata generated by earlier stages in the IHGIS data processing pipeline to identify sum and total relationships among rows and columns. After tables are OCR’ed from source documents, we use a customized markup framework to generate metadata. We then convert the marked up files into CSV files with a standard structure, which serve as input to the quality assurance/quality control (QA/QC) process. The CSV files include hierarchical labels for categories on the columns and geographic units on the rows. Within the labels, blanks are used to indicate totals. The package identifies a column/row with a blank header cell as the sum of other columns/rows that share the same non-blank label(s) and have sub-category labels corresponding to the blank.
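The blank-label rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the package's actual code: each column header is assumed to be a tuple of hierarchical labels of equal length, an empty string stands for a blank, and the function name is hypothetical.

```python
def find_sum_relationships(headers):
    """Map each total column to the columns it should equal the sum of.

    A column with blank label(s) is treated as the total of every column that
    matches it on the non-blank labels and has a label filled in wherever the
    total is blank.
    """
    relationships = {}
    for t, total in enumerate(headers):
        blanks = [k for k, label in enumerate(total) if label == ""]
        if not blanks:
            continue  # fully labeled column, not a total
        parts = [
            c for c, other in enumerate(headers)
            if all(
                other[k] != "" if k in blanks else other[k] == total[k]
                for k in range(len(total))
            )
        ]
        if parts:
            relationships[t] = parts
    return relationships


# Hypothetical two-level column headers: (sex, age group).
headers = [
    ("Male", ""), ("Male", "0-14"), ("Male", "15+"),
    ("Female", ""), ("Female", "0-14"), ("Female", "15+"),
    ("", ""),
]
relationships = find_sum_relationships(headers)
print(relationships)
```

In this example the ("Male", "") column is paired with the two male age-group columns, and the fully blank column is paired with all four detailed columns. The same function would apply to row labels for geographic units.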

Continue reading…

IPUMS DHS Goes Global

By Miriam L. King and Sula Sarkar

IPUMS DHS now includes integrated variables for 84 countries (up from 51) and nearly 350 samples (up from 233), including new data from Latin America, Eastern Europe, Oceania, the Caribbean, and Central and East Asia. Providing DHS data in a form that facilitates micro-analyses across countries is one of IPUMS’ greatest strengths, so researchers will be excited to learn that they can now do even more! Our latest data release expands the scope of IPUMS DHS beyond its initial coverage of Africa, the Middle East, and South Asia and adds the latest samples for 12 countries previously in the database. Figure 1 shows the full geographic scope of IPUMS DHS, highlighting newly added countries and previously included countries with new samples.

Figure 1: Countries included in IPUMS DHS

World map with countries that are new to IPUMS DHS, have new samples in IPUMS DHS, or have no new samples in IPUMS DHS filled in

Continue reading…

Linking children and adolescents to their mothers using IPUMS MICS

By Anna Bolgrien

IPUMS MICS offers hundreds of harmonized variables related to children’s health and wellbeing that allow for rich and innovative research. From the IPUMS MICS website, users can browse variables and create custom data extracts within a selected unit of analysis. In order to conduct many analyses, however, users will want to combine and link datasets relating to different units of analysis available in MICS.

IPUMS MICS menu of units of analysis for data browsing

For example, to investigate how child characteristics are related to characteristics of their mother, users will need to download and link data between the Children (either 0-4 or 5-17) unit of analysis and the Women unit of analysis.

IPUMS MICS provides instructions for linking across units of analysis as a user note. This user note lists the variables available as linking keys for each unit of analysis, and is a general guide for linking across the units, such as linking household characteristics with individual person records.

In this blog post, we provide more detailed information on how to link children and adolescents to their mothers. Similar logic can be applied to link children to fathers or other caregivers in the household. As IPUMS MICS requires Stata to conduct harmonization, we provide example code in Stata syntax.
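As a language-neutral illustration of the linking logic (in Python, not the Stata code from the full post), the sketch below attaches mother characteristics to child records. All key names here (`cluster`, `household`, `line`, `mother_line`) are hypothetical placeholders; the IPUMS MICS user note lists the actual linking keys for each unit of analysis.

```python
def link_children_to_mothers(children, women, keys=("cluster", "household")):
    """Attach each mother's variables to her child's record.

    A child matches a woman when they share the household identifiers in
    `keys` and the child's `mother_line` equals the woman's `line`.
    """
    # Index women by household identifiers plus their own line number.
    women_by_key = {
        tuple(w[k] for k in keys) + (w["line"],): w for w in women
    }
    linked = []
    for child in children:
        key = tuple(child[k] for k in keys) + (child["mother_line"],)
        mother = women_by_key.get(key)
        merged = dict(child)
        merged["mother_found"] = mother is not None
        if mother is not None:
            for name, value in mother.items():
                if name not in keys and name != "line":
                    merged["mom_" + name] = value  # prefix to avoid name clashes
        linked.append(merged)
    return linked


# Hypothetical records from the Children and Women units of analysis.
children = [
    {"cluster": 1, "household": 10, "line": 3, "mother_line": 2, "age": 4},
    {"cluster": 1, "household": 11, "line": 4, "mother_line": 2, "age": 7},
]
women = [
    {"cluster": 1, "household": 10, "line": 2, "education": "secondary"},
]
linked = link_children_to_mothers(children, women)
print(linked[0]["mom_education"], linked[1]["mother_found"])
```

Keeping unmatched children with a `mother_found` flag, rather than dropping them, makes it easy to check how many records fail to link before running an analysis.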

Continue reading…

New Tool! ATUS-CPS Linking Counts

By Sarah Flood

The team at IPUMS is excited to introduce something brand new! ATUS-CPS Linking Counts is an interactive tool for exploring the number of ATUS respondents who can be linked to specific CPS months. We know that linkages between ATUS and CPS have great potential for enabling exciting new research, but we also know firsthand how hard it can be to wrap your head around the panel component of the CPS, the relationship between ATUS and CPS, and the many possibilities for linking them. Even researchers who have deep knowledge of the ATUS and CPS may still wonder whether there is a sufficient number of cases to conduct an analysis of interest. This new tool helps address all of these challenges. It very quickly allows you to view the number of ATUS respondents who should appear in each CPS month and determine if there is sufficient sample size for a particular application of linked ATUS-CPS data.

Linking ATUS and CPS data enables an incredible wealth of research questions. This tool allows users to specify and view different linking scenarios to assess the feasibility of various ATUS to CPS linkages. For example, you may want to investigate the relationship between food security in the CPS and shopping or eating-related behavior in the ATUS. This interactive tool would allow you to select only years of ATUS data that contain, for example, the Eating and Health module and view the CPS months in which the Food Security supplement was fielded to assess the sample sizes for your desired analysis. Figure 1 shows how you would select ATUS years of interest and find information about which ATUS modules were fielded in each year.

Figure 1. Selecting ATUS Years of Interest

drop-down menu displaying ATUS years with colored bubbles to indicate which ATUS modules are available in each year

Continue reading…

New Variables! IPUMS International Fall 2025 Data Release

By Rodrigo Lovaton Davila

IPUMS International recently added twenty-one new harmonized variables that expand the thematic coverage of the data collection and enable new possibilities for research. Most notably, the data release introduces harmonized variables representing sample level information, including selected characteristics of the statistical operation and the sampling design (accessible in technical household). This information was previously available in the sample descriptions section, but is now also accessible through variables that can be included in data extracts. Read on for more details on these new sample-level variables and a few new work and household amenity variables!

New variables about the statistical operation describe whether the data correspond to a census or a survey; whether enumeration was de jure or de facto; the type of form received by respondents in the sample; and the month of data collection. The IPUMS International data collection currently includes 395 census samples, 233 labor force surveys, and 27 population surveys.

FORMTYPE allows users to identify whether the data for each sample consist of responses to a single, standard questionnaire applied to the entire population; responses to a short or long form, in a census that gathered more information from a sample of the population; or records derived from administrative registers (with no questionnaire used in data collection). Most datasets in the collection correspond to one standard questionnaire (79% of 395 census samples). For censuses where a short and a long form were applied, the samples in IPUMS typically correspond to the long questionnaire (78% of 78 samples), which includes additional questions and is richer for research purposes.

ENUMTYPE indicates whether the enumeration was de jure or de facto, an important distinction for understanding how the population was counted in the census operation. Some censuses combine both approaches, enumerating de jure (usual residents) and de facto (those present on the census reference date, whether resident or visitor) populations, which is reflected in this new variable. Importantly, users can work with the existing variable RESIDENT to eliminate double-counting of persons who were enumerated both at their permanent residence and at the residence they were visiting on census night. ENUMMO complements the variable YEAR to provide a more accurate indicator of the timing of data collection.

Continue reading…

Historical Supplemental Poverty Measure

By Stephanie Richards, Kari Williams, and Sarah Flood

The Annual Social and Economic Supplement (ASEC) of the Current Population Survey is the official source of information about poverty in the United States. Since 1968, the ASEC has been used to create the Official Poverty Measure (OPM) and has included the variables needed to create that measure. The Supplemental Poverty Measure (SPM) and the variables needed to create it were first released by the Census Bureau in 2010, reporting the SPM for 2009. In contrast to the OPM, the SPM provides a more complete picture of the economic wellbeing of American households.

The value of the SPM is apparent – it is a comprehensive and nuanced measure that accounts for the diversity of living arrangements, variability in cost of living, and a wider array of available financial resources and demands. However, the temporal coverage of the SPM is limited; the Census Bureau only has data back to 2010. Over the last ten years, researchers at Columbia University’s Center on Poverty and Social Policy (CPSP) have eliminated this constraint by compiling the data necessary to construct the SPM back to 1968, and they have shared the data with the research community via the CPSP Historical SPM Data Portal.

CPSP researchers have also partnered with IPUMS to disseminate their historical SPM data via IPUMS CPS. This includes the poverty status variables (i.e., SPMPOV and SPMPOVANC12) as well as the inputs and thresholds for creating them. If you know IPUMS, you know that we loooooove the chance to extend a valuable measure back in time. We are incredibly grateful to CPSP for the important work they have done and are thrilled to make it even easier for IPUMS CPS users to access the historical SPM data.

In this blog post, we briefly describe differences between the components – family, resources, and needs – used to create OPM and (historical) SPM, preview CPSP’s “anchored” poverty variables that facilitate comparisons over time that reference a set cost-of-living standard, and share suggestions for further reading (because we know you are going to want to learn even more about this!).

Continue reading…

Multigenerational Households Across Multiple Data Collections

By Etienne Breton

We recently updated a key IPUMS-constructed variable for understanding multigenerational households: MULTGEN, which identifies the number of generations in a household. This variable is needed to answer important questions in our era of rapid population aging. For example, do multigenerational households become more numerous during economic recessions, and if so for whom exactly? Can they buffer against physical and cognitive decline for older adults? Do young people living with their grandparents have distinct educational, professional or even health trajectories? All of these questions – and many more – can be investigated creatively and rigorously using MULTGEN.

MULTGEN has long been available for most IPUMS USA samples. We recently adapted our methodology to add this variable to IPUMS CPS for all samples from January 1994 to the present. This means that users can now research multigenerational households with another IPUMS data collection, tackling key research questions with added precision and contextual richness, in addition to analysis of topics in the CPS that are not covered in the ACS (e.g., tobacco use, volunteering, voting and registration).

The construction of MULTGEN in IPUMS CPS (as in IPUMS USA) relies on IPUMS family interrelationship variables (see this classic paper, or this more recent paper, or our user guide, for how these variables are constructed) and information from the variable RELATE (insufficient information in the RELATE variable before 1994 explains why MULTGEN is not available for older samples). At present, MULTGEN in IPUMS CPS only provides general codes about the number of generations per household, whereas MULTGEN in IPUMS USA also provides detailed codes identifying subtypes of 2-generation and 3+ generation households.

Continue reading…