IHGIS – Use It for Good

Does 1 + 2 = 8? Automating QA/QC for Tabular Data

May 21, 2026 by mpcblog

By Tracy Kugler and Tsu Zhu

The problem with OCR and numbers

To extract data tables from census reports only available as print documents, IPUMS IHGIS uses optical character recognition (OCR) software to automate the conversion of scanned images into digital representations of letters and numbers. OCR software has made great strides in accuracy for textual information by using dictionaries of known words to interpret uncertain letters. However, dictionaries do not help in distinguishing uncertain numerical digits. While a dictionary can suggest that the third character in “wh_t” should be an ‘a’ and not an ‘o’, there is no simple way to tell whether the third digit in “45_” should be a 3 or an 8. To ensure that IHGIS data are accurate, we must have confidence that each number has been recognized correctly and matches the number in the source document.

To address this gap, we developed an R package that leverages IHGIS structured metadata to identify logical relationships between cell counts and row/column totals and determine where cells don’t add up as expected. Often, a given cell participates in multiple relationships, which allows the package to use patterns among discrepancies to pinpoint and correct errors. The package can automatically identify and correct up to 95% of error cells, depending on the structure of relationships.
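The idea of using patterns among discrepancies can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the IHGIS R package; the function name, the table layout (interior cells plus separately printed row and column totals), and the sample numbers are all assumptions. When exactly one row and one column fail to add up, and both are off by the same amount, the cell at their intersection is the likely culprit and the shared discrepancy gives the correction.

```python
def find_and_fix_single_error(cells, row_totals, col_totals):
    """Locate one misread interior cell from row/column sum discrepancies.

    Hypothetical helper: returns (row, col, corrected_value), or None when
    the discrepancies do not point to exactly one cell.
    """
    # Discrepancy = sum of the recognized cells minus the recognized total.
    row_diff = [sum(row) - total for row, total in zip(cells, row_totals)]
    col_diff = [sum(col) - total for col, total in zip(zip(*cells), col_totals)]

    bad_rows = [i for i, d in enumerate(row_diff) if d != 0]
    bad_cols = [j for j, d in enumerate(col_diff) if d != 0]

    # One bad row and one bad column with the same discrepancy implicate
    # the cell where they cross.
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        if row_diff[i] == col_diff[j]:
            return i, j, cells[i][j] - row_diff[i]
    return None


# Made-up table in which the cell printed as 453 was recognized as 458.
cells = [
    [120, 458, 77],
    [64, 210, 39],
    [15, 88, 301],
]
row_totals = [650, 313, 404]
col_totals = [199, 751, 417]

print(find_and_fix_single_error(cells, row_totals, col_totals))
```

In this sketch a misread total, or several errors in the same row or column, yields no correction at all. Those are the cases where the structure of the relationships limits what can be fixed automatically.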

Identifying relationships from structured metadata

The R package currently relies on structured metadata generated by earlier stages in the IHGIS data processing pipeline to identify sum and total relationships among rows and columns. After tables are OCR’ed from source documents, we use a customized markup framework to generate metadata. We then convert the marked-up files into CSV files with a standard structure, which serve as input to the quality assurance/quality control (QA/QC) process. The CSV files include hierarchical labels for categories on the columns and geographic units on the rows. Within the labels, blanks are used to indicate totals. The package identifies a column/row with a blank header cell as the sum of other columns/rows that share the same non-blank label(s) and have sub-category labels corresponding to the blank.
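The blank-label rule can be sketched as follows. This is a simplified reading of the rule, written in Python for illustration; it is not the package's code. The function name, the representation of each column as a tuple of label strings, and the sex-by-age labels are assumptions. A column counts as a total if any of its labels is blank, and its components are the columns that match its non-blank labels and have a non-blank label wherever the total is blank.

```python
def total_relationships(labels):
    """Map each total column's index to the indices of the columns it sums.

    Hypothetical helper: `labels` holds one tuple of hierarchical labels
    per column, with "" marking a total at that level.
    """
    relationships = {}
    for t, total in enumerate(labels):
        if "" not in total:
            continue  # no blank label, so this column is not a total
        parts = [
            c
            for c, other in enumerate(labels)
            # Non-blank levels must match; blank levels must be filled in.
            if all((tl == ol) if tl else bool(ol) for tl, ol in zip(total, other))
        ]
        if parts:
            relationships[t] = parts
    return relationships


# Made-up column labels: (sex, age group).
labels = [
    ("", ""),            # 0: grand total
    ("Male", ""),        # 1: total for Male
    ("Male", "0-14"),    # 2
    ("Male", "15+"),     # 3
    ("Female", ""),      # 4: total for Female
    ("Female", "0-14"),  # 5
    ("Female", "15+"),   # 6
]

print(total_relationships(labels))
# {0: [2, 3, 5, 6], 1: [2, 3], 4: [5, 6]}
```

The same function applies to rows if the hierarchical geographic labels are passed in place of the column labels.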

Accessing International Census Data Tables in R

September 24, 2025 by mpcblog

by Tsu Zhu and Tracy Kugler

ipumsr now supports IHGIS!

IPUMS spatial data users now have programmatic access to international census data tables in IHGIS. The recent release of ipumsr 0.9.0 enables users to explore metadata; build, submit, and download extracts; and read IHGIS tables directly in R. Many of the ipumsr functions that had been NHGIS-specific have now been generalized to accommodate both IHGIS and NHGIS. More information on the new functions can be found in the ipumsr changelog.

About IPUMS IHGIS

The International Historical Geographic Information System (IHGIS) provides data tables from population, housing, and agricultural censuses from around the world. The data are derived from tables originally published by national statistical offices. The format and structure of the published tables vary widely between countries and across time, even within the same country. IHGIS extracts the tables and standardizes them into a machine-readable structure along with consistently formatted metadata and corresponding GIS boundary files. As of this writing, the IHGIS collection consists of 40 datasets from population and housing censuses, 14 from agricultural censuses, and an additional 305 datasets tabulated from IPUMS International microdata samples.

Working with ipumsr for IHGIS

The following sections walk through how you would use new ipumsr functionality to explore metadata and submit an extract request for data tables from Ireland censuses from 1966 through 1991. This example highlights changes to ipumsr functions that have been generalized to support both IHGIS and NHGIS. For more details see the Aggregate Data API Requests article on the ipumsr webpage.

Digitizing and Exploring Qatar’s Population Censuses

June 24, 2024 (updated June 15, 2025) by mpcblog

By Shine Min Thant

Qatar, a small yet influential state in the Middle East, is a very interesting case study for demographic research because of its rapid development over the past thirty years. Qatar occupies a peninsula slightly smaller than the U.S. state of Connecticut that juts out into the Persian Gulf from its border with Saudi Arabia. The country has experienced relatively rapid economic growth since the late 20th century, mainly due to its vast reserves of natural gas and oil. This newfound wealth allowed Qatar to invest heavily in its healthcare, infrastructure, and education, making the country an ideal case study for social change and development. Additionally, a recent surge in Qatar’s immigrant population (which constitutes over 78 percent of the population) also makes it an ideal country to study social mobility and social change.

As part of the ISRDI Diversity Fellowship Program, I worked with Dr. Tracy Kugler, Professor Steven Manson, Professor Evan Roberts, and undergraduate student Rawan AlGahtani on a project to examine Qatar’s change using census data from 1984, 1997, and 2004. Summary tables from all three censuses were previously only available as printed documents. As a first step, we needed to transform the data from a hard-to-get printed format to the widely accessible IPUMS IHGIS format. This process included multiple steps, from optical character recognition (OCR) to data quality checks using R scripts (Figure 1).

Figure 1: IPUMS IHGIS Workflow

Malaria Transmission in Context: Linking Health, Census, and Ecological Data

January 12, 2023 (updated June 13, 2025) by mpcblog

by Yara Ghazal, Ilyana Hohenkirk, Tracy Kugler, and Kelly Searle

Malaria, like many vector-borne diseases, impacts health, economic growth, and society. The burden of malaria incidence and death is concentrated in Sub-Saharan Africa; in 2020, 95% of all malaria cases and 96% of all deaths occurred in Sub-Saharan Africa (WHO, 2022). Malaria impacts not only population health but also the economic growth of these 32 countries. Malaria is estimated to slow economic growth in this region of Africa by up to 1.3% each year (CCP-JHU, 2015). Understanding malaria transmission is essential to ending its spread and creating a healthier and more prosperous future for developing nations.

The literature on malaria transmission patterns has shown that several environmental factors impact mosquito and parasite vital rates, and thus affect the transmission intensity, seasonality, and geographical distribution of malaria (Castro, 2017). Temperature and precipitation are the primary climate-based factors that influence malaria transmission patterns. Temperature creates geographical constraints for vector and parasite development. Increasing temperatures have been found to shorten mosquito maturation time and increase feeding frequency. However, areas of extremely high temperatures usually yield smaller, less fecund mosquitoes. In parallel, because mosquitoes often breed in pools formed by rainfall and flooding, the frequency, duration, and intensity of precipitation have a significant influence on mosquito populations.

IHGIS Research Example: Fertilizer Use from Agricultural Census Data

October 6, 2021 (updated March 17, 2026) by mpcblog

By Chris M. Boyd

The IPUMS International Historical Geographic Information System (IHGIS) provides subnational data from agricultural and population and housing censuses from around the world. The agricultural census data cover a wide range of information on agricultural inputs, labor, production, and more, which can be used to explore a variety of research questions. IHGIS data can help understand, for instance, which factors contribute to better crop productivity, including the role of fertilizer use. Researchers have used agricultural census data at the subnational level to analyze the negative relationship between farm size and fertilizer overuse in China¹; the relationship between maize yield, farm size and fertilizer and irrigation use in Mexico²; the use of chemical fertilizers in direct market farms in the U.S.³; and the environmental sustainability of using fertilizers, insecticides and pesticides in Pakistan⁴.

To date, IHGIS has released Agricultural Census tables for ten countries, including seven developing countries in Africa and the Pacific Islands. These seven datasets include information about fertilizer use, though each measures it in a different way (see Table 1). Despite the differences, these data can reveal broad patterns in the use of fertilizer by farmers among these countries.

IPUMS Announces 2020 Research Award Recipients

May 17, 2021 (updated June 10, 2025) by mpcblog

IPUMS is excited to announce the winners of its annual IPUMS Research Awards. These awards honor the best published research and nominated graduate student papers from 2020 that used IPUMS data to advance or deepen our understanding of social and demographic processes.

IPUMS, developed by and housed at the University of Minnesota, is the world’s largest individual-level population database, providing harmonized data on people in the U.S. and around the world to researchers at no cost.

There are six award categories, each tied to one of the following IPUMS projects:

IPUMS USA, providing data from the U.S. decennial censuses, the American Community Survey, and IPUMS CPS from 1850 to the present.
IPUMS International, providing harmonized data contributed by more than 100 international statistical office partners; it currently includes information on 500 million people in more than 200 censuses from around the world, from 1960 forward.
IPUMS Health Surveys, which makes available the U.S. National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS).
IPUMS Spatial, covering IPUMS NHGIS and IPUMS Terra. NHGIS includes GIS boundary files from 1790 to the present; Terra provides data on population and the environment from 1960 to the present.
IPUMS Global Health, providing harmonized data from the Demographic and Health Surveys and the Performance Monitoring and Accountability surveys, for low- and middle-income countries from the 1980s to the present.
IPUMS Time Use, providing time diary data from the U.S. and around the world from 1965 to the present.

Over 2,500 publications based on IPUMS data appeared in journals, magazines, and newspapers worldwide last year. From these publications and from nominated graduate student papers, the award committees selected the 2020 honorees.

IPUMS IHGIS: Unlocking International Population and Agricultural Census Data

November 5, 2020 (updated June 20, 2025) by mpcblog

By Tracy Kugler

Nearly all countries throughout the world conduct population and housing censuses at least every ten years, and most also conduct agricultural censuses or surveys regularly. These censuses collect information on demographics, education, employment, housing characteristics, migration, agricultural land ownership, agricultural workforce, livestock, crops, and more. The resulting data can be used to study a wide range of questions, from the character of demographic transitions within and across countries, to utilization of irrigation, to educational trends among women.

Unfortunately, this wealth of data has remained largely inaccessible to researchers. The data are typically published in reports as tables summarizing population characteristics. In recent decades, many of these reports have been published as PDF documents and made available on national statistical office websites. While the reports are available, data from a PDF document cannot be easily imported into a statistical or GIS package. Furthermore, the table structures are highly heterogeneous, both across countries and even within the same report.

The International Historical Geographic Information System (IPUMS IHGIS) is designed to give researchers easy access to these data in a form that is ready for analysis. In the early phases, IHGIS was known internally as “Project Mako,” named after the Mako shark, which has a global range, voracious appetite, and a reputation for a broad-ranging diet. Like the shark, IHGIS (née Project Mako) will encompass the world and ingest all kinds of data tables.