Accessing International Census Data Tables in R

by Tsu Zhu and Tracy Kugler

ipumsr now supports IHGIS!

IPUMS spatial data users now have programmatic access to international census data tables in IHGIS. The recent release of ipumsr 0.9.0 enables users to explore metadata; build, submit, and download extracts; and read IHGIS tables directly in R. Many of the ipumsr functions that had been NHGIS-specific have now been generalized to accommodate both IHGIS and NHGIS. More information on the new functions can be found in the ipumsr changelog.

About IPUMS IHGIS

The International Historical Geographic Information System (IHGIS) provides data tables from population, housing, and agricultural censuses from around the world. The data are derived from tables originally published by national statistical offices. The format and structure of the published tables varies widely between countries and across time, even within the same country. IHGIS extracts the tables and standardizes them into a machine-readable structure along with consistently formatted metadata and corresponding GIS boundary files. As of this writing, the IHGIS collection consists of 40 datasets from population and housing censuses, 14 from agricultural censuses, and an additional 305 datasets tabulated from IPUMS International microdata samples.

Working with ipumsr for IHGIS

The following sections walk through how you would use new ipumsr functionality to explore metadata and submit an extract request for data tables from Ireland censuses from 1966 through 1991. This example highlights changes to ipumsr functions that have been generalized to support both IHGIS and NHGIS. For more details see the Aggregate Data API Requests article on the ipumsr webpage.

Getting Started

To install the latest version of ipumsr from CRAN, run install.packages("ipumsr") in your R console. You can confirm that your installed version is 0.9.0 or later with packageVersion("ipumsr").
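As a minimal sketch, the version check can be wrapped in a small helper so that the install step runs only when it is needed. The helper name needs_update() is hypothetical and is not part of ipumsr.

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `min_version`.
needs_update <- function(pkg, min_version) {
  !requireNamespace(pkg, quietly = TRUE) ||
    utils::packageVersion(pkg) < min_version
}

# Install from CRAN only when required, and only in an interactive session
# (installation needs an internet connection and a configured CRAN mirror).
if (interactive() && needs_update("ipumsr", "0.9.0")) {
  install.packages("ipumsr")
}
```

The interactive() guard is a design choice for this sketch: it keeps a script from installing packages as a side effect when it is run non-interactively.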

Requesting IHGIS data and metadata via ipumsr also requires that you first register to use IHGIS (if you haven’t already) and obtain an IPUMS API key. The ipumsr website includes several articles that demonstrate working with the IPUMS API within R, including instructions on how to get an API key and use it with ipumsr. If you are unfamiliar with creating, submitting, and downloading an extract within ipumsr, we suggest you start with the introduction to the IPUMS API for R users.
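Before making any API requests, it can help to check that a key is actually available in your session. The sketch below assumes the key is stored in the IPUMS_API_KEY environment variable, which is where ipumsr looks by default; has_api_key() is a hypothetical helper, not an ipumsr function.

```r
# Hypothetical helper: TRUE if an API key is set in the environment.
has_api_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_api_key()) {
  # ipumsr::set_ipums_api_key("paste-your-key-here", save = TRUE) would store
  # the key for future sessions; the key string here is a placeholder.
  message("No IPUMS API key found; see set_ipums_api_key() in ipumsr.")
}
```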

Explore IHGIS Metadata

You can view a list of all available datasets, data tables, or tabulation geographies (sets of geographic units) using 'get_metadata_catalog()'. You can then use R functions to filter the returned list.

Datasets

First, let’s get a list of available datasets for Ireland.

get_metadata_catalog(collection = "ihgis", metadata_type = "datasets") |> 
  dplyr::filter(country == "IE") |> 
  dplyr::select("name", "description", "dataset_type")


# A tibble: 15 × 3
   name      description                           dataset_type                                       
   <chr>     <chr>                                 <chr>                                              
 1 IE1966pop Census of Population of Ireland, 1966 Population Census                                  
 2 IE1971pop Census of Population of Ireland, 1971 Population Census                                  
 3 IE1971tab Census of Population of Ireland, 1971 Tabulated from IPUMS International Microdata Sample
 4 IE1979pop Census of Population of Ireland, 1979 Population Census                                  
 5 IE1979tab Census of Population of Ireland, 1979 Tabulated from IPUMS International Microdata Sample
 6 IE1981pop Census of Population of Ireland, 1981 Population Census                                  
 7 IE1981tab Census of Population of Ireland, 1981 Tabulated from IPUMS International Microdata Sample
 8 IE1986pop Census of Population of Ireland, 1986 Population Census                                  
 9 IE1986tab Census of Population of Ireland, 1986 Tabulated from IPUMS International Microdata Sample
10 IE1991pop Census of Population of Ireland, 1991 Population Census                                  
11 IE1991tab Census of Population of Ireland, 1991 Tabulated from IPUMS International Microdata Sample
12 IE1996tab Census of Population of Ireland, 1996 Tabulated from IPUMS International Microdata Sample
13 IE2002tab Census of Population of Ireland, 2002 Tabulated from IPUMS International Microdata Sample
14 IE2006tab Census of Population of Ireland, 2006 Tabulated from IPUMS International Microdata Sample
15 IE2011tab Census of Population of Ireland, 2011 Tabulated from IPUMS International Microdata Sample

We can see that IHGIS includes several Ireland datasets tabulated from microdata (those with dataset names ending in ‘tab’) and several derived from published data sources (those with dataset names ending in ‘pop’). In addition to the name, description, and dataset_type selected above, fields containing information about how the census was conducted and definitions of key terms are also available.

To view detailed metadata for a specific dataset, we can use 'get_metadata()'. In addition to the high-level summary metadata available in the catalog listing, the detailed dataset-level metadata includes information on available data tables and tabulation geographies. Below is a list of available data tables for the IE1991pop dataset.

get_metadata(collection = "ihgis", dataset = "IE1991pop")$data_tables |> 
  dplyr::select("name","label")

# A tibble: 25 × 2
   name          label                                                                                           
   <chr>         <chr>                                                                                           
 1 IE1991pop.AAA Population by Sex and Age Group                                                                 
 2 IE1991pop.AAB Population, Marriages, Births, Deaths, Natural Increase, and Estimated Net Migration [1926-1991]
 3 IE1991pop.AAC Population, Area and Density [1986-1991]                                                        
 4 IE1991pop.AAD Population and Percent Change [1971-1991]                                                       
 5 IE1991pop.AAE Percentage Change in Population [1946-1991]                                                     
 6 IE1991pop.AAF Percentage Change in Population in Each Age Group by Sex [1966-1991]                            
 7 IE1991pop.AAG Average Annual Rate of Change in Population Per 1,000 by Age Group and Sex [1966-1991]          
 8 IE1991pop.AAH Population by Sex, Age Group, and Marital Status [1926-1991]                                    
 9 IE1991pop.AAI Population by Single  Year of Age, Sex, and Marital Status                                      
10 IE1991pop.AAJ Population by Sex, Age Group, and Detailed Marital Status                                       
# ℹ 15 more rows

In addition to the name and label selected above, available table-level metadata also include the universe, table number, tabulation geographies, and footnotes.

Tabulation Geographies

IHGIS tabulation geographies are sets of units over which population data are summarized. Most tabulation geographies are organized into hierarchies of child units nested within parent units (e.g., states, provinces, districts). To view available tabulation geographies for Ireland population census datasets, the following example filters the catalog to match dataset names to those starting with “IE” and ending with “pop”. Metadata on the number of units and mean population and area of units provides an idea of the level of geographic granularity available. In the case of IHGIS Ireland datasets, hierarchical levels go down to 'g5', which represents towns and environs. For this example, we apply a filter for labels containing the word “Counties” to see which tabulation geographies represent counties.

get_metadata_catalog(collection = "ihgis", metadata_type = "tabulation_geographies") |> 
  dplyr::filter(stringr::str_detect(name, "^IE.*pop.*$")) |> 
  dplyr::filter(stringr::str_detect(label,"Counties")) |> 
  dplyr::arrange(unit_count)

# A tibble: 15 × 7
   name         label                                               hierarchical_level mean_population mean_area sequence unit_count
   <chr>        <chr>                                               <chr>                        <int>     <dbl>    <int>      <int>
 1 IE1966pop.ga Counties with County & City (County Borough) groups ga                          110923     2702         7         26
 2 IE1971pop.gc Counties with County & County Borough groups        gc                          110305     2602.        9         27
 3 IE1979pop.gc Counties with County & County Borough groups        gc                          124749     2602.        8         27
 4 IE1981pop.gc Counties with County & County Borough groups        gc                          127534     2602.        8         27
 5 IE1986pop.gc Counties with County & County Borough groups        gc                          131135     2602.        7         27
 6 IE1966pop.g3 Counties/County Boroughs                            g3                           93032     2266.        4         31
 7 IE1971pop.g3 Counties/County Boroughs                            g3                           96073     2266.        4         31
 8 IE1979pop.g3 Counties/County Boroughs                            g3                          108652     2266.        4         31
 9 IE1981pop.g3 Counties/County Boroughs                            g3                          111078     2266.        4         31
10 IE1971pop.gb Counties/County Boroughs [including Dun Laoghaire]  gb                           93070     2195.        8         32
11 IE1979pop.ga Counties/County Boroughs [including Dun Laoghaire]  ga                          105257     2195.        7         32
12 IE1981pop.ga Counties/County Boroughs [including Dun Laoghaire]  ga                          107606     2195.        7         32
13 IE1986pop.g3 Counties/County Boroughs                            g3                          110645     2195.        4         32
14 IE1991pop.g3 Counties                                            g3                          110179     2195.        4         32
15 IE1966pop.gb Counties/County Boroughs [including Dun Laoghaire]  gb                           87394     2129.        8         33

We can see that several tabulation geographies, designated 'ga', 'gb', 'gc', or 'g3', represent counties and related units. Also note that the unit counts are similar across the range of years, varying only from 26 to 33.

Data Tables

Now let’s look for tables we could use to analyze changes in Ireland’s employed population over time at the county/county borough level. First, we get a list of Ireland datasets derived from published population censuses by filtering the results of get_metadata_catalog().

Then, we iterate over each dataset to pull table titles that contain “employ”.

# Pull list of available population censuses for Ireland.
ie_datasets <- get_metadata_catalog(collection = "ihgis", metadata_type = "datasets") |>
  dplyr::filter(stringr::str_detect(name, "^IE.*pop.*$"))

purrr::map(ie_datasets$name, function(dataset) {
  get_metadata(collection = "ihgis", dataset = dataset)$data_tables}) |> 
  dplyr::bind_rows(.id = "dataset") |> 
  dplyr::filter(grepl("employ",label, ignore.case = TRUE)) |> 
  dplyr::pull("label")

[1] "Working Age Population by Employment Status and Sex"                  
[2] "Employed Population by Industrial Group and Sex"                      
[3] "Population by Sex and Employment Status"                              
[4] "Employed Population by Sex and Industrial Group"                      
[5] "Population 15 Years and Over by Sex, Employment Status, and Age Group"
[6] "Economically Active Population by Employment Status"                  
[7] "Employed Population by Industrial Group"                              
[8] "Population by Sex, Employment Status, and Age Group"                  
[9] "Employed Population by Sex and Industrial Group"

From the list above, we decide to select datasets with tables that have “Employed population” in the title. We can then apply a similar iteration method to refine the filter and identify tables that are available for the 'ga', 'gb', and/or 'g3' county-level tabulation geographies.

purrr::map(ie_datasets$name, function(dataset) {
  get_metadata(collection = "ihgis", dataset = dataset)$data_tables}) |> 
  dplyr::bind_rows(.id = "dataset") |>
  dplyr::filter(grepl("Employed population", label, ignore.case = TRUE),
         purrr::map_lgl(tabulation_geographies, ~ any(grepl("ga|gb|g3", .x, ignore.case = TRUE)))) |>
  dplyr::select("dataset", "name", "dataset_name", "label", "universe")


# A tibble: 4 × 5
  dataset name          dataset_name label                                           universe           
  <chr>   <chr>         <chr>        <chr>                                           <chr>              
1 1       IE1966pop.AAI IE1966pop    Employed Population by Industrial Group and Sex Employed population
2 2       IE1971pop.AAT IE1971pop    Employed Population by Sex and Industrial Group Employed population
3 4       IE1981pop.AAX IE1981pop    Employed Population by Industrial Group         Employed Population
4 6       IE1991pop.AAO IE1991pop    Employed Population by Sex and Industrial Group Employed population

The resulting list includes four tables, one each from 1966, 1971, 1981, and 1991.
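The filtering logic above can also be factored into a reusable helper. The following is a base R sketch run on toy data rather than live IHGIS metadata. The function name filter_tables(), the "XX2000pop" names, and the assumption that each tabulation_geographies entry is a character vector of geography names are all illustrative.

```r
# Hypothetical helper: keep rows of a data_tables-style data frame whose label
# matches `label_pattern` and whose geographies match `geog_pattern`.
filter_tables <- function(data_tables, label_pattern, geog_pattern) {
  label_ok <- grepl(label_pattern, data_tables$label, ignore.case = TRUE)
  geog_ok <- vapply(
    data_tables$tabulation_geographies,
    function(geogs) any(grepl(geog_pattern, geogs)),
    logical(1)
  )
  data_tables[label_ok & geog_ok, , drop = FALSE]
}

# Toy stand-in for get_metadata(...)$data_tables; values are made up.
toy_tables <- data.frame(
  name = c("XX2000pop.AAA", "XX2000pop.AAB", "XX2000pop.AAC"),
  label = c(
    "Employed Population by Sex",
    "Population by Age",
    "Employed Population by Industry"
  ),
  stringsAsFactors = FALSE
)
toy_tables$tabulation_geographies <- list(
  c("XX2000pop.g0", "XX2000pop.g3"),
  c("XX2000pop.g3"),
  c("XX2000pop.g0")
)

county_tables <- filter_tables(
  toy_tables,
  label_pattern = "employed population",
  geog_pattern = "\\.(ga|gb|g3)$"
)
print(county_tables$name)
```

Anchoring the geography pattern to the end of the name is slightly stricter than matching "ga|gb|g3" anywhere in the string, which reduces the chance of accidental matches elsewhere in a name.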

Request and Download IHGIS Data

After identifying the datasets, data tables, and tabulation geographies you are interested in, you can define an IHGIS data extract using 'define_extract_agg()' and 'ds_spec()'.[1]

Here we’ll define an extract for the tables representing employed populations from the list above.

extract <- define_extract_agg(
  "ihgis",
  description = "Ireland employed population tables for ga, gb, g3",
  datasets = list(
    ds_spec("IE1966pop", data_tables = "IE1966pop.AAI", tabulation_geographies = "IE1966pop.gb"),
    ds_spec("IE1971pop", data_tables = "IE1971pop.AAT", tabulation_geographies = "IE1971pop.gb"),
    ds_spec("IE1981pop", data_tables = "IE1981pop.AAX", tabulation_geographies = "IE1981pop.ga"),
    ds_spec("IE1991pop", data_tables = "IE1991pop.AAO", tabulation_geographies = "IE1991pop.g3")
  ))

print(extract)

Unsubmitted IPUMS IHGIS extract 
Description: Ireland employed population tables for ga, gb, g3

Dataset: IE1966pop
  Tables: IE1966pop.AAI
  Tabulation Geogs: IE1966pop.gb

Dataset: IE1971pop
  Tables: IE1971pop.AAT
  Tabulation Geogs: IE1971pop.gb

Dataset: IE1981pop
  Tables: IE1981pop.AAX
  Tabulation Geogs: IE1981pop.ga

Dataset: IE1991pop
  Tables: IE1991pop.AAO
  Tabulation Geogs: IE1991pop.g3
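When an extract covers many datasets, writing each ds_spec() call by hand becomes repetitive. As a sketch, the same specifications can be generated from a small lookup table. The dataset, table, and geography codes are copied from the extract above; the `specs` data frame and its column names are just one possible layout.

```r
# Lookup table of dataset, table code, and geography code (from the extract above).
specs <- data.frame(
  dataset = c("IE1966pop", "IE1971pop", "IE1981pop", "IE1991pop"),
  table   = c("AAI", "AAT", "AAX", "AAO"),
  geog    = c("gb", "gb", "ga", "g3"),
  stringsAsFactors = FALSE
)

# IHGIS table and geography names are prefixed with the dataset name.
specs$table_name <- paste(specs$dataset, specs$table, sep = ".")
specs$geog_name  <- paste(specs$dataset, specs$geog, sep = ".")

# Build the extract only if a suitable version of ipumsr is installed.
if (requireNamespace("ipumsr", quietly = TRUE) &&
    utils::packageVersion("ipumsr") >= "0.9.0") {
  datasets <- Map(
    function(ds, tbl, geog) {
      ipumsr::ds_spec(ds, data_tables = tbl, tabulation_geographies = geog)
    },
    specs$dataset, specs$table_name, specs$geog_name
  )
  extract <- ipumsr::define_extract_agg(
    "ihgis",
    description = "Ireland employed population tables for ga, gb, g3",
    datasets = unname(datasets)
  )
  print(extract)
}
```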

Once the extract is defined, you can use ipumsr functions to submit the extract request, wait for it to complete, and download the resulting files.

See details in the Introduction to the IPUMS API for R Users.
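As a rough sketch, the submit, wait, and download steps can be chained in a single function using submit_extract(), wait_for_extract(), and download_extract() from ipumsr. The wrapper name run_extract() is hypothetical, and running it requires a valid API key and network access.

```r
# Hypothetical wrapper around the standard ipumsr extract workflow.
# It is defined as a function so that nothing is sent to IPUMS until called.
run_extract <- function(extract, download_dir = tempdir()) {
  submitted <- ipumsr::submit_extract(extract)      # send the request to IPUMS
  completed <- ipumsr::wait_for_extract(submitted)  # poll until it is ready
  ipumsr::download_extract(completed, download_dir = download_dir)
}

# Usage, assuming `extract` was created with define_extract_agg():
# files <- run_extract(extract)
```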

After downloading an extract, you can load the data into R with detailed data field descriptions by calling the function 'read_ipums_agg()'.[2] IHGIS extracts come with full metadata that is accessible by calling 'read_ihgis_codebook()'. For more detail on extract metadata, see the article Read metadata from an IHGIS extract’s codebook files.
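The following sketch shows how these two functions might be called on a downloaded extract. The file name "ihgis0001_csv.zip" is a hypothetical placeholder for the path returned by download_extract(), and the use of file_select = 1 to pick the first data file in a multi-table extract is an assumption based on how ipumsr's aggregate data readers handle multi-file archives; check the function documentation for your version.

```r
# Hypothetical path; substitute the path returned by download_extract().
zip_path <- "ihgis0001_csv.zip"

if (file.exists(zip_path) && requireNamespace("ipumsr", quietly = TRUE)) {
  # List the files in the archive to see which tables were delivered.
  print(utils::unzip(zip_path, list = TRUE)$Name)

  # Load one data file (assumed: file_select accepts an index position).
  ie_data <- ipumsr::read_ipums_agg(zip_path, file_select = 1)

  # Load the extract's codebook metadata.
  ie_codebook <- ipumsr::read_ihgis_codebook(zip_path)
}
```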

Final Notes

The workflow outlined above is just a sample of the new features in ipumsr that support aggregate data API requests. See the full updated documentation in the Aggregate Data API Requests article.

Python Support

We have also added IHGIS support to ipumspy, a Python library that provides much of the same functionality that is available in ipumsr! This library is maintained by IPUMS.

 

Footnotes

  1. The function 'define_extract_agg()' has replaced 'define_extract_nhgis()'.
  2. The function 'read_ipums_agg()' has replaced 'read_nhgis()'.