Helpful Functions for Constructing Reproducible Workflows in the Absence of an IPUMS Microdata Metadata API

by Renae Rodgers

As you may have heard, the IPUMS Microdata Extract API is now in beta for our IPUMS USA and IPUMS CPS collections! Yay! You may have also heard that an IPUMS Metadata API is not yet available. Bummer. This means that users still need to do data discovery via the IPUMS web sites, which means a lot of clicking. It also makes it difficult to set up a reproducible workflow that can be updated as new data become available (and who doesn’t want that?). This blog post will show you some easy functions you can use to retrieve sample metadata to create a monthly or annual workflow that doesn’t require visiting an IPUMS website, and walk through some ipumspy functionality to access metadata for variables already included in your IPUMS extracts. The workflow in this blog post is using ipumspy v0.2.1.

If Stata is your stats package of choice, you can leverage ipumspy  to make IPUMS extracts from a Stata do file! You can spruce up your workflow with any of the functions in the blog post below by incorporating them into the template .do file offered in our blog post on making IPUMS extracts from Stata.

Sample Metadata

To get started, import the utilities  module from ipumspy. All of the helper functions in this blog post will be using the  CollectionInformation  class from this module. The sample_ids attribute of  CollectionInformation  returns a dictionary with sample descriptions as keys and sample ids as values. This information is pulled from the sample ID page for the specified IPUMS data collection. This page is the source of the sample_ids dictionary for IPUMS CPS, and this page is the source of the sample_ids  dictionary for IPUMS USA.

from ipumspy import utilities

Functions for retrieving IPUMS CPS sample IDs

First, let’s take a look at the sample ID dictionary for IPUMS CPS.

cps_samples = utilities.CollectionInformation("cps").sample_ids
# show the last 10 items in the cps_samples dictionary
dict(list(cps_samples.items())[-10:])

{‘IPUMS-CPS, October 2021’: ‘cps2021_10s’,
‘IPUMS-CPS, November 2021’: ‘cps2021_11s’,
‘IPUMS-CPS, December 2021’: ‘cps2021_12s’,
‘IPUMS-CPS, January 2022’: ‘cps2022_01s’,
‘IPUMS-CPS, February 2022’: ‘cps2022_02b’,
‘IPUMS-CPS, March 2022’: ‘cps2022_03b’,
‘IPUMS-CPS, April 2022’: ‘cps2022_04s’,
‘IPUMS-CPS, May 2022’: ‘cps2022_05b’,
‘IPUMS-CPS, June 2022’: ‘cps2022_06s’,
‘IPUMS-CPS, July 2022’: ‘cps2022_07s’}

Just scrolling through this dictionary would make it so you didn’t need to actually use your browser to visit a website, but there is a lot more that we can do to programatically grab the samples we want! For starters, let’s just grab a list of all availabe IPUMS CPS samples. The code block below get a list of all the values from the IPUMS CPS sample_ids dictionary and displays the last ten items in that list; these values are the sample IDs that are used to specify IPUMS CPS samples in an IPUMS Microdata Extract API extract request.

# show the last 10 items in the sample ids list
list(utilities.CollectionInformation("cps").sample_ids.values())[-10:]

[‘cps2021_10s’,
‘cps2021_11s’,
‘cps2021_12s’,
‘cps2022_01s’,
‘cps2022_02b’,
‘cps2022_03b’,
‘cps2022_04s’,
‘cps2022_05b’,
‘cps2022_06s’,
‘cps2022_07s’]

The CPS ASEC is the most commonly used CPS data set. These data sets always have a “03s” suffix on the sample ID and we can use this fact to loop through the values from the sample_ids dictionary and pick out only the ASEC sample ids. The function below returns a list of all available IPUMS CPS ASEC sample ids.

def get_all_asec_samples(cps_sample_dict):
    # create an empty list to hold the ASEC sample ids we find
    asec_samples = []
    # for each sample id in the values of the sample_ids dictionary...
    for sample_id in cps_sample_dict.values():
        # if the current sample_id contains the string "03s"
        if "03s" in sample_id:
            # add that sample id to the asec_samples list
            asec_samples.append(sample_id)
    # return the asec_samples list so it is avaialble outside of this function
    return asec_samples

An aside: This function could also be written with just two lines using list comprehension.

def get_all_asec_samples(cps_sample_dict):
    return [s for s in cps_sample_dict.values() if "03s" in s]

While list comprehension can be great for keeping code brief and read-able, this post will use for loops through out for clarity.

get_all_asec_samples(cps_samples)

[‘cps1962_03s’,
‘cps1963_03s’,
‘cps1964_03s’,
‘cps1965_03s’,
‘cps1966_03s’,
‘cps1967_03s’,
‘cps1968_03s’,
‘cps1969_03s’,
‘cps1970_03s’,
‘cps1971_03s’,
‘cps1972_03s’,
‘cps1973_03s’,
‘cps1974_03s’,
‘cps1975_03s’,
‘cps1976_03s’,
‘cps1977_03s’,
‘cps1978_03s’,
‘cps1979_03s’,
‘cps1980_03s’,
‘cps1981_03s’,
‘cps1982_03s’,
‘cps1983_03s’,
‘cps1984_03s’,
‘cps1985_03s’,
‘cps1986_03s’,
‘cps1987_03s’,
‘cps1988_03s’,
‘cps1989_03s’,
‘cps1990_03s’,
‘cps1991_03s’,
‘cps1992_03s’,
‘cps1993_03s’,
‘cps1994_03s’,
‘cps1995_03s’,
‘cps1996_03s’,
‘cps1997_03s’,
‘cps1998_03s’,
‘cps1999_03s’,
‘cps2000_03s’,
‘cps2001_03s’,
‘cps2002_03s’,
‘cps2003_03s’,
‘cps2004_03s’,
‘cps2005_03s’,
‘cps2006_03s’,
‘cps2007_03s’,
‘cps2008_03s’,
‘cps2009_03s’,
‘cps2010_03s’,
‘cps2011_03s’,
‘cps2012_03s’,
‘cps2013_03s’,
‘cps2014_03s’,
‘cps2015_03s’,
‘cps2016_03s’,
‘cps2017_03s’,
‘cps2018_03s’,
‘cps2019_03s’,
‘cps2020_03s’,
‘cps2021_03s’]

TaDa! This list will contain all ASEC samples available from IPUMS CPS. When it is run after the 2022 ASEC is released, that sample ID will also be included in the list returned by this function!

We can do something similar for Basic Monthly Survey (BMS) samples only.

def get_all_bms_samples(cps_sample_dict):
    # create an empty list to hold the BMS sample ids we find
    bms_samples = []
    # for each sample id in the values of the sample_ids dictionary...
    for sample_id in cps_sample_dict.values():
        # if the current sample_id does NOT contain the string "03s"
        if "03s" not in sample_id:
            # add that sample id to the bms_samples list
            bms_samples.append(sample_id)
    # return the bms_samples list so it is avaialble outside of this function
    return bms_samples
bms_samples = get_all_bms_samples(cps_samples)
# show last 10 BMS samples
bms_samples[-10:]

[‘cps2021_10s’,
‘cps2021_11s’,
‘cps2021_12s’,
‘cps2022_01s’,
‘cps2022_02b’,
‘cps2022_03b’,
‘cps2022_04s’,
‘cps2022_05b’,
‘cps2022_06s’,
‘cps2022_07s’]

What if we’re only interested in a particular year of BMS data? No problem! The function below will grab only IPUMS CPS BMS samples for the specified year.

def get_year_of_bms(cps_sample_dict, year):
    # create an empty list to hold the BMS sample ids we find
    bms_samples = []
    # for each sample id in the values of the sample_ids dictionary...
    for sample_id in cps_sample_dict.values():
        # if the current sample_id does NOT contain the string "03s"
        # AND the current sample_id contains the specified year
        if "03s" not in sample_id and year in sample_id:
            # add that sample id to the bms_samples list
            bms_samples.append(sample_id)
    # return the bms_samples list so it is avaialble outside of this function
    return bms_samples

At time of writing, we are only about half way through the year 2022; if 2022 is passed as an argument to get_year_of_bms(), only currently available samples will be returned.

get_year_of_bms(cps_samples, '2022')

[‘cps2022_01s’,
‘cps2022_02b’,
‘cps2022_03b’,
‘cps2022_04s’,
‘cps2022_05b’,
‘cps2022_06s’]

But when 1976 is passed as an argument to this function, all twelve months of BMS data are returned.

get_year_of_bms(cps_samples, '1976')

[‘cps1976_01s’,
‘cps1976_02b’,
‘cps1976_03b’,
‘cps1976_04s’,
‘cps1976_05s’,
‘cps1976_06s’,
‘cps1976_07b’,
‘cps1976_08b’,
‘cps1976_09b’,
‘cps1976_10s’,
‘cps1976_11s’,
‘cps1976_12b’]

Perhaps we just want to retrieve the most recent BMS sample ID. We can do that too!

def get_most_recent_bms(cps_samples_dict):
    # in case an ASEC sample is the most recently released sample, 
    # take the second to most recent sample
    if "03s" in sorted(cps_samples_dict.values())[-1]:
        return (cps_samples_dict.values())[-2]
    # otherwise, take the last sample id in the sorted list
    # of sample ids
    else:
        return sorted(cps_samples_dict.values())[-1]

get_most_recent_bms(cps_samples)

‘cps2022_06s’

Functions for retrieving IPUMS USA sample IDs

We can use the same ipumspy functionality we used for IPUMS CPS to retrieve a dictionary of all IPUMS USA sample IDs and descriptions or just a simple list of sample IDs. We simply need to use “usa” instead of “cps” as an argument in CollectionInformation.

usa_samples = utilities.CollectionInformation("usa").sample_ids
# show the last 10 items in the usa_samples dictionary
dict(list(usa_samples.items())[-10:])

{‘2018 PRCS’: ‘us2018b’,
‘2014-2018, ACS 5-year’: ‘us2018c’,
‘2014-2018, PRCS 5-year’: ‘us2018d’,
‘2019 ACS’: ‘us2019a’,
‘2019 PRCS’: ‘us2019b’,
‘2015-2019, ACS 5-year’: ‘us2019c’,
‘2015-2019, PRCS 5-year’: ‘us2019d’,
‘2020 ACS’: ‘us2020a’,
‘2016-2020, ACS 5-year’: ‘us2020c’,
‘2016-2020, PRCS 5-year’: ‘us2020d’}

We can write similar functions to return lists of only specific types of IPUMS USA samples. Unlike IPUMS CPS, IPUMS USA sample IDs don’t have a consistent suffix pattern, and so parsing and filtering on the sample description (keys) in the sample_ids dictionary is preferable. Suppose we’re only interested in single-year American Community Survey (ACS) samples.

def get_all_1yr_acs(usa_sample_dict):
    # create an empty list to store the acs samples
    one_yr_acs = []
    # for each sample description in the sample_ids dictionary
    for sample_desc in usa_sample_dict.keys():
        # if the last three characters in the sample description are "ACS"
        if "ACS" == sample_desc[-3:]:
            # append the sample id value for that sample description key
            # in the sample_ids dictionary to the  one_yr_acs list
            one_yr_acs.append(usa_sample_dict[sample_desc])
    # return the one_yr_acs list so it is avaialble outside of this function
    return one_yr_acs
get_all_1yr_acs(usa_samples)

[‘us2000d’,
‘us2001a’,
‘us2002a’,
‘us2003a’,
‘us2004a’,
‘us2005a’,
‘us2006a’,
‘us2007a’,
‘us2008a’,
‘us2009a’,
‘us2010a’,
‘us2011a’,
‘us2012a’,
‘us2013a’,
‘us2014a’,
‘us2015a’,
‘us2016a’,
‘us2017a’,
‘us2018a’,
‘us2019a’,
‘us2020a’]

What if we want single-year Puerto Rico Community Survey (PRCS) samples instead? Coming right up!

def get_all_1yr_prcs(usa_sample_dict):
    # create an empty list to store the prcs samples
    one_yr_prcs = []
    # for each sample description in the sample_ids dictionary
    for sample_desc in usa_sample_dict.keys():
        # if the last four characters in the sample description are "PRCS"
        if "PRCS" == sample_desc[-4:]:
            # append the sample id value for that sample description key
            # in the sample_ids dictionary to the  one_yr_prcs list
            one_yr_prcs.append(usa_sample_dict[sample_desc])
    # return the one_yr_acs list so it is avaialble outside of this function
    return one_yr_prcs
get_all_1yr_prcs(usa_samples)

[‘us2005b’,
‘us2006b’,
‘us2007b’,
‘us2008b’,
‘us2009b’,
‘us2010b’,
‘us2011b’,
‘us2012b’,
‘us2013b’,
‘us2014b’,
‘us2015b’,
‘us2016b’,
‘us2017b’,
‘us2018b’,
‘us2019b’]

IPUMS USA also has full-count census data available! Let’s just grab those sample IDs.

def get_full_count_samples(usa_sample_dict):
    # create an empty list to store the full count samples
    full_count = []
    # for each sample description in the sample_ids dictionary
    for sample_desc in usa_sample_dict.keys():
        # if the string "100%" is in the sample description
        if "100%" in sample_desc:
            # append the sample id value for that sample description key
            # in the sample_ids dictionary to the full_count list
            full_count.append(usa_sample_dict[sample_desc])
    # return the full_count list so it is avaialble outside of this function
    return full_count
get_full_count_samples(usa_samples)

[‘us1850c’,
‘us1860c’,
‘us1870c’,
‘us1880e’,
‘us1900m’,
‘us1910m’,
‘us1920c’,
‘us1930d’,
‘us1940b’]

Variable Metadata

Unfortunately, without an IPUMS Metadata API there is no easy way to do programatic data discovery at the variable level. However, once you have included a variable in an extract and read that extract’s DDI file using ipumspy, you can retrieve variable metadata from the Codebook object and examine it without needing to head to the IPUMS website.

The code below reads the DDI codebook from an IPUMS CPS extract I made a while ago and returns an ipumspy Codebook instance.

from ipumspy import readers

cps_ddi = readers.read_ipums_ddi("cps_00024.xml")

We can use the Codebook.get_variable_info() method to examine specific variables. Let’s look at the employment status variable, EMPSTAT. First we’ll grab the variable label and variable description.

empstat_vardesc = cps_ddi.get_variable_info("EMPSTAT")
print(empstat_vardesc.label)
print(empstat_vardesc.description)

Employment status
EMPSTAT indicates whether persons were part of the labor force–working or seeking work–and, if so, whether they were currently unemployed. The variable also provides information on the activity (e.g., doing housework, attending school,) or status (e.g., retired, unable to work) of persons not in the labor force, as well as limited additional information on those who are in the labor force (e.g. members of the Armed Forces, those with a job, but not at work last week). See LABFORCE for a dichotomous variable identifying whether a person participated in the labor force.

In the CPS, individuals’ employment status was determined on the basis of answers to a series of questions relating to their activities during the preceding week. Those who reported doing any work at all for pay or profit, or working at least fifteen hours without pay in a family business or farm, were classified as “at work.” Those who did not work during the previous week but who acknowledged having a job or business from which they were temporarily absent (e.g., due to illness, vacation, bad weather, or labor dispute) were also classified as employed, under the heading “has job, not at work last week.”

Because the CPS is designed to measure unemployment in the civilian population, the original employment status variable in the survey classifies members of the armed forces as NIU (Not in universe).

Unemployed persons make up the third element of the labor force. Individuals were coded as unemployed if they did no work for pay or profit, did not have a job from which they were briefly absent, and either reported looking for work as their major activity during the previous week (for 1962 through 1993) or answered yes to a question about whether they had been looking for work in the past four weeks. People who were temporarily laid off from a job were also classified as unemployed. A separate CPS variable specifying whether an unemployed person had worked before or was looking for a first job was used to distinguish between “experienced” and “inexperienced” unemployed persons in IPUMS CPS.

Persons who were neither employed nor unemployed fall into the residual category, “not in labor force.” Such individuals might be retired, disabled due to an illness lasting at least 6 months, occupied with other activities such as attending school or keeping house, or convinced that they are unlikely to find employment (discouraged workers).

Next we’ll take a look at the values and value labels in EMPSTAT. The codes attribute of the VariableDescription returns a dictionary with value labels as keys.

empstat_vardesc.codes

{‘NIU’: 0,
‘Armed Forces’: 1,
‘At work’: 10,
‘Has job, not at work last week’: 12,
‘Unemployed’: 20,
‘Unemployed, experienced worker’: 21,
‘Unemployed, new worker’: 22,
‘Not in labor force’: 30,
‘NILF, housework’: 31,
‘NILF, unable to work’: 32,
‘NILF, school’: 33,
‘NILF, other’: 34,
‘NILF, unpaid, lt 15 hours’: 35,
‘NILF, retired’: 36}

Let’s make a quick bar graph of employment status in June of 2022 – we’ll only look at those respondents that were asked about their employment status. First we’ll need to read in the extract data file as well. As this extract contains multiple samples we will also first filter the data frame to contain only observations from June.

import pandas as pd
import matplotlib.pyplot as plt

# read ipums extract into cps_df pandas data frame
cps_df = readers.read_microdata(cps_ddi, "cps_00024.dat.gz")

# This extract contains several months and years, so we will filter the data frame to retain
# only records from June of 2022 who were asked about their employment status
jun = cps_df[(cps_df["YEAR"] == 2022) & (cps_df["MONTH"] == 6) & (cps_df["EMPSTAT"] > 1)]
# group the filtered data frame by EMPSTAT and sum up WTFINL for each EMPSTAT
# cateogry to arrive at weighted totals for each group
jun_grp = jun.groupby(by=["EMPSTAT"])["WTFINL"].sum()
# create a bar graph of the grouped data frame
jun_grp.plot(kind="bar")
# retrieve the current plot
ax = plt.gca()
# Specify a title for the bar graph
ax.set_title("Employment Status in June 2022")
plt.show()

Bar graph with vertical bars showing employment status in June 2022

Having the EMPSTAT values along the x axis in this bar graph is not very informative. It would be much nicer if we could have the value labels along the x axis instead! Let’s use the codes and labels dictionary we retrieved with empstat_vardesc.codes to fill in the appropriate labels in our plot.

def get_xaxis_labels(values, codes_dict):
    # create an empty list to hold the labels
    labels = []
    # for each  label in the codes_dict
    for label in codes_dict.keys():
        # look through each value in the list of values
        for value in values:
            # if the value in the codes dictionary associated
            # with the label key is equal to the current value
            if codes_dict[label] == value:
                # append that label to the list of labels
                labels.append(label)
    # return the labels list so that it is available outside of this function
    return labels
jun_grp = jun.groupby(by=["EMPSTAT"])["WTFINL"].sum()
# create a bar graph of the grouped data frame
jun_grp.plot(kind="bar")
# retrieve the current plot
ax = plt.gca()
# Specify a title for the bar graph
ax.set_title("Employment Status in June 2022")
# use the get_xaxis_labels() function to get 
# the correct list of labels
labels = get_xaxis_labels(jun["EMPSTAT"].unique(), empstat_vardesc.codes)
# specify x tick labels for the x tick values
ax.set_xticklabels(labels, rotation=45, ha="right")
plt.show()

Bar graph with vertical bars showing employment status in June 2022 and value labels along the x axis

Let’s make the same plot, broken out by sex.

# create a data frame grouped by EMPSTAT and SEX
empstat_by_sex = jun.groupby(by=["EMPSTAT", "SEX"], observed=True)["WTFINL"].sum()
# create a bar graph grouped by SEX
empstat_by_sex.unstack("SEX").plot(kind="bar")
# retrieve the current plot
ax = plt.gca()
# Specify a title
ax.set_title("Employment Status in June 2022")
# use the get_xaxis_labels() function to get 
# the correct list of labels
labels = get_xaxis_labels(jun["EMPSTAT"].unique(), empstat_vardesc.codes)
# specify x tick labels for the x tick values
ax.set_xticklabels(labels, rotation=45, ha="right")
plt.show()

Bar graph with vertical bars that are orange and blue to represent employment status in June 2022 by sex with employment status on x axis

The legend that is automatically generated by matplotlib uses the values it finds in the data for SEX. Once again, this isn’t very informative and we would rather have the value labels included in the plot legend rather than the values. The get_legend_labels() function defined below uses the codes and labels dictionary and the numeric values matplotlib automatically assigns to the legend to generate a list of value labels that correspond to those numeric values. This label list can then be plugged into the plot legend.

def get_legend_labels(numeric_labels, codes_dict):
    # create an empty list to store labels
    labels = []
    # for each value label in the codes dictionary
    for label in codes_dict.keys():
        # look through each numeric label
        for n in numeric_labels:
            # if the value associated with the label in the
            # codes dictionary is the same as the numeric label
            if str(codes_dict[label]) == n:
                # append the label to the labels list
                labels.append(label)
    # return the labels list so that it is available outside of this function
    return labels

The matplotlib function ax.get_legend_handles_labels() returns a tuple that contains the legend labels assigned by matplotlib as the second element. We can use this as the list of numeric labels to pass to our get_legend_labels() function along with the get_variable_info("SEX).codes dictionary. The code block below re-creates the grouped bar chart with proper legend labels.

# create a data frame grouped by EMPSTAT and SEX
empstat_by_sex = jun.groupby(by=["EMPSTAT", "SEX"], observed=True)["WTFINL"].sum()
# create a bar graph grouped by SEX
empstat_by_sex.unstack("SEX").plot(kind="bar")
# retrieve the current plot
ax = plt.gca()
# Specify a title
ax.set_title("Employment Status in June 2022")
# use the get_xaxis_labels() function to get 
# the correct list of labels
labels = get_xaxis_labels(jun["EMPSTAT"].unique(), empstat_vardesc.codes)
# specify x tick labels for the x tick values
ax.set_xticklabels(labels, rotation=45, ha="right")
# use the `get_legend_labels()` to return the list of labels for the plot's legend
legend_labels = get_legend_labels(ax.get_legend_handles_labels()[1], 
                                  cps_ddi.get_variable_info("SEX").codes)
# use the matplotlib `ax.legend()` function to associate the labels
# with the correct legend color swatches
ax.legend(ax.get_legend_handles_labels()[0],
          legend_labels)
plt.show()

Bar graph with vertical bars in orange and blue to separate sexes with male and female labeled and employment status labeled on x axis

Conclusion

We hope that we will be able to offer an IPUMS Microdata Metadata API in the future that will make exploring IPUMS data offerings and setting up these types of workflows even easier! In the meantime, this blog post has shown some helpful functions for retrieving a sample-level metadata for IPUMS CPS and IPUMS USA data collections and variable metadata from extract codebooks to help users set up streamlined, reproducible workflows that can be repeated to take new data into account as they become available.

Remember to Use it for Good!