By Renae Rodgers
What is an Extract?
IPUMS users will already be familiar with the concept of an extract, but for those who may just be joining us, we’ll do a brief recap. Public use data files are often large, unwieldy blocks of data, many variables wide and many, many records long. Most analyses require only a small subset of the variables available in any given dataset, but downloading public data from government agencies is an all-or-nothing endeavor. In addition to offering public use data that is harmonized across time and place, IPUMS allows users to choose only their variables of interest for download. These individualized datasets and their accompanying metadata are IPUMS extracts.
What is an Extract Definition?
In short, an IPUMS extract definition is all the information needed to create a user’s personalized extract data file and accompanying metadata – everything short of those files themselves.
An IPUMS extract is defined by:
- The name of the IPUMS collection (e.g. “usa”, “cps”)
- A list of sample names or IDs (to be) included in the extract file
- A list of variable names (to be) included in the extract file
- An extract description (e.g. “2022 ACS demographic variables”)
IPUMS users build these extract definitions piece by piece when they create an extract through the IPUMS website, selecting samples, variables, and formats.
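As a rough illustration, the four pieces listed above can be pictured as a small Python dictionary. The key names here are assumptions made for this sketch, not the official API schema; the point is only that a definition is a short piece of text, not data.

```python
import json

# Illustrative only: a plain-dict picture of an extract definition.
# The key names are assumptions for this sketch, not the official schema.
extract_definition = {
    "collection": "usa",
    "samples": ["us2020a"],
    "variables": ["AGE", "SEX", "RACE", "STATEFIP"],
    "description": "2020 ACS demographic variables",
}

# A definition serializes to a few lines of text, which is what makes it
# easy to pass around, unlike the data file it describes.
as_text = json.dumps(extract_definition, indent=2)
print(as_text)
```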
What is Extract Sharing?
The IPUMS user agreement(s) prohibit re-distributing extracted data (with some narrow exceptions that I won’t detail here). Extract definition sharing is a way to pass IPUMS extract specifications between IPUMS users, so that collaborators, students, or anyone else can reproduce your IPUMS dataset. An IPUMS user with whom you share an extract definition can recreate your extract and download their own data files and accompanying metadata based on the extract definition you created!
It is also important to note what extract sharing is NOT.
Extract sharing is not the sharing of extract data files; it is the sharing of extract definitions. Only a representation of your extract may be passed between users, and the recipient must submit that extract definition to the IPUMS extract system in order to retrieve the data associated with it.
Extract sharing does not enable extract creation from past versions of IPUMS databases. If you submit an extract definition to the IPUMS extract system that a colleague created 10 years ago, the extract you get back will be created from the most recent version of the IPUMS data and may not match exactly what your colleague originally downloaded. If names of variables in that extract definition have changed or if variables included in that extract definition have been removed from the IPUMS database, your extract will fail.
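One way to catch this kind of drift before submitting an old definition is to compare the variable names it requests against the names the collection offers today. The sketch below assumes you already have such a list from somewhere (for example, from browsing the collection on the IPUMS website); `missing_variables` and both lists are hypothetical and purely illustrative.

```python
def missing_variables(requested, available):
    """Return requested variable names absent from `available`, in order."""
    available_upper = {name.upper() for name in available}
    return [name for name in requested if name.upper() not in available_upper]


# Hypothetical data: variables named in an old shared definition ...
old_definition_variables = ["AGE", "SEX", "RACE", "OLDVAR"]
# ... and a hand-made stand-in for what the collection offers today.
currently_available = ["AGE", "SEX", "RACE", "STATEFIP"]

print(missing_variables(old_definition_variables, currently_available))
```

Any names this returns would need to be renamed in, or dropped from, the definition before it is submitted.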
The remainder of this post will demonstrate how to share extracts using the IPUMS Extract API and ipumspy. Note that this only works for IPUMS microdata collections that are supported by the IPUMS Extract API! Those unfamiliar with the IPUMS Extract API may find it helpful to read our Introduction to the IPUMS Extract API for Microdata blog post before continuing.
An Example of Extract Sharing with ipumspy
Before being able to share IPUMS extracts (or have IPUMS extracts shared with you), you must be a registered IPUMS user! In addition, you must have an IPUMS API key and have ipumspy installed.
Let’s get started by importing all the relevant modules.
import sys
import os
import pandas as pd
import numpy as np
from ipumspy import (
    IpumsApiClient,
    UsaExtract,
    readers,
    save_extract_as_json,
    define_extract_from_json,
    define_extract_from_ddi,
)
Creating an IPUMS USA Extract
First, I pass my API key to the IpumsApiClient class. This key will be used to authenticate all future API calls.
# I have stored my API key in the "IPUMS_API_KEY" environment variable
my_api_key = os.getenv("IPUMS_API_KEY")
ipums = IpumsApiClient(my_api_key)
# Define the extract: a list of samples, a list of variables, and a description
extract = UsaExtract(
    ["us2020a"],
    ["AGE", "SEX", "RACE", "STATEFIP"],
    description="Renae's amazing USA extract.",
)
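One small thing worth guarding against in the API-key step: `os.getenv` returns `None` when the variable is not set, so a typo in the variable name would quietly hand `None` to the client. A minimal sketch of a fail-fast helper follows; `require_env` is a hypothetical name introduced here, not part of ipumspy.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`; raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demonstration with a throwaway variable so the sketch runs anywhere.
os.environ["EXAMPLE_IPUMS_API_KEY"] = "not-a-real-key"
print(require_env("EXAMPLE_IPUMS_API_KEY"))
```

In the workflow above, this would read `my_api_key = require_env("IPUMS_API_KEY")`.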
Now that my extract is defined, I can submit that definition to the IPUMS extract engine.
ipums.submit_extract(extract)
Checking the status of the extract, we see that it has been received and is in line to be processed.
ipums.extract_status(extract)
'queued'
I’ll use the wait_for_extract() wrapper to let me know when the extract is complete and ready for download.
ipums.wait_for_extract(extract)
print(f"{extract.collection} number {extract.extract_id} is complete!")
ipums.download_extract(extract)
usa number 166 is complete!
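Conceptually, waiting for an extract amounts to checking its status repeatedly until it is done. The sketch below illustrates that idea only; it is not how ipumspy implements `wait_for_extract()`. The `get_status` callable is a stand-in, and the status strings other than "queued" ("started", "completed", "failed") are assumptions for this example.

```python
import time


def wait_until_complete(get_status, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll `get_status()` until it returns 'completed'.

    Raises RuntimeError if the status becomes 'failed', and TimeoutError
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("extract failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Stand-in for a real status check: replays a canned sequence of statuses.
fake_statuses = iter(["queued", "started", "completed"])
print(wait_until_complete(lambda: next(fake_statuses), poll_seconds=0))
```

In real use, `get_status` could be something like `lambda: ipums.extract_status(extract)`.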
Now that the extract is complete and downloaded to my current working directory, I can use the ipumspy readers to parse the DDI codebook and read the data file into a Pandas DataFrame.
extract_file = f"{extract.collection}_{str(extract.extract_id).zfill(5)}"
renae_ddi = readers.read_ipums_ddi(f'{extract_file}.xml')
renae_df = readers.read_microdata(renae_ddi, f'{extract_file}.dat.gz')
renae_df.head()
| | YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | RACE | RACED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 202001 | 1 | 2020010000060 | 53.0 | 2020000000011 | 1 | 140001 | 3 | 1 | 53.0 | 1 | 68 | 1 | 100 |
| 1 | 2020 | 202001 | 2 | 2020010000084 | 39.0 | 2020000000021 | 1 | 130101 | 4 | 1 | 39.0 | 2 | 18 | 2 | 200 |
| 2 | 2020 | 202001 | 3 | 2020010000128 | 17.0 | 2020000000031 | 1 | 200001 | 4 | 1 | 17.0 | 2 | 35 | 1 | 100 |
| 3 | 2020 | 202001 | 4 | 2020010000189 | 74.0 | 2020000000041 | 1 | 120001 | 3 | 1 | 74.0 | 2 | 46 | 8 | 818 |
| 4 | 2020 | 202001 | 5 | 2020010000207 | 188.0 | 2020000000051 | 1 | 90001 | 4 | 1 | 188.0 | 2 | 79 | 1 | 100 |
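The file names used above follow a simple pattern: the collection name, an underscore, and the extract number zero-padded to five digits. A small sketch of that pattern as helper functions is shown here. `extract_file_stem` and `extract_paths` are hypothetical names, and the `.xml` and `.dat.gz` suffixes assume the fixed-width download used in this post.

```python
from pathlib import Path


def extract_file_stem(collection, extract_id):
    """Build the file stem: collection, underscore, zero-padded extract ID."""
    return f"{collection}_{str(extract_id).zfill(5)}"


def extract_paths(collection, extract_id, download_dir="."):
    """Return (DDI path, data path) for an extract saved in `download_dir`."""
    stem = extract_file_stem(collection, extract_id)
    base = Path(download_dir)
    return base / f"{stem}.xml", base / f"{stem}.dat.gz"


ddi_path, data_path = extract_paths("usa", 166)
print(ddi_path.name, data_path.name)
```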
Analysis Montage
Preparing Extract Definition for Sharing
Having now done some work, I am ready to pass it off to a collaborator for further work. Or perhaps I want to share some code with a student or a class. Asking them to recreate my extract just so through the IPUMS website is fiddly and error-prone. Instead, I can share the extract definition and any accompanying code so they can recreate my exact starting point with just a few lines of code!
The first step is to save my extract definition to a JSON file.
save_extract_as_json(extract, "renae_ipums_extract.json")
Now I can send “renae_ipums_extract.json” to anyone who wants to re-create my extract for their own purposes. I’d like to share my work with my collaborator Marvin.
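Before sending (or after receiving) such a file, it can be reassuring to peek inside it with nothing but the standard library. The sketch below assumes the file holds `collection`, `samples`, and `variables` keys, with samples and variables keyed by name; that layout is an assumption, so check it against a real file. `summarize_definition` and the example contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def summarize_definition(path):
    """Return (collection, sample names, variable names) from a definition file."""
    definition = json.loads(Path(path).read_text())
    return (
        definition.get("collection"),
        sorted(definition.get("samples", {})),
        sorted(definition.get("variables", {})),
    )


# Hypothetical file contents, written here only so the sketch runs on its own.
example = {
    "description": "Renae's amazing USA extract.",
    "samples": {"us2020a": {}},
    "variables": {"AGE": {}, "SEX": {}, "RACE": {}, "STATEFIP": {}},
    "collection": "usa",
}
with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "example_extract.json"
    example_path.write_text(json.dumps(example))
    summary = summarize_definition(example_path)

print(summary)
```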
Creating an IPUMS Extract from a Shared Extract Definition
Upon receipt of my JSON file, Marvin will essentially run my process in reverse. First, though, he needs to get himself an API key and pass it to the IpumsApiClient.
marvin_api_key = os.getenv("MARVIN_IPUMS_API_KEY")
ipums = IpumsApiClient(marvin_api_key)
Next, Marvin will read my JSON extract definition and transform it into an ipumspy extract object.
renae_extract = define_extract_from_json("renae_ipums_extract.json")
Marvin can update the extract description to distinguish it as his own:
# Renae's description
print(renae_extract.description)
renae_extract.description = f"Marvin submission of: {renae_extract.description}"
# Marvin's description
print(renae_extract.description)
Renae's amazing USA extract.
Marvin submission of: Renae's amazing USA extract.
Or he can replace it altogether:
renae_extract.description = "Marvin's recreation of Renae's IPUMS USA extract"
renae_extract.description
"Marvin's recreation of Renae's IPUMS USA extract"
And now he is ready to submit, wait, and download his very own data file of my extract!
ipums.submit_extract(renae_extract)
ipums.wait_for_extract(renae_extract)
ipums.download_extract(renae_extract)
Now that Marvin has downloaded his extract, he can read it into Pandas.
extract_file = f"{renae_extract.collection}_{str(renae_extract.extract_id).zfill(5)}"
marvin_ddi = readers.read_ipums_ddi(f'{extract_file}.xml')
marvin_df = readers.read_microdata(marvin_ddi, f'{extract_file}.dat.gz')
marvin_df.head()
| | YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | RACE | RACED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 202001 | 1 | 2020010000060 | 53.0 | 2020000000011 | 1 | 140001 | 3 | 1 | 53.0 | 1 | 68 | 1 | 100 |
| 1 | 2020 | 202001 | 2 | 2020010000084 | 39.0 | 2020000000021 | 1 | 130101 | 4 | 1 | 39.0 | 2 | 18 | 2 | 200 |
| 2 | 2020 | 202001 | 3 | 2020010000128 | 17.0 | 2020000000031 | 1 | 200001 | 4 | 1 | 17.0 | 2 | 35 | 1 | 100 |
| 3 | 2020 | 202001 | 4 | 2020010000189 | 74.0 | 2020000000041 | 1 | 120001 | 3 | 1 | 74.0 | 2 | 46 | 8 | 818 |
| 4 | 2020 | 202001 | 5 | 2020010000207 | 188.0 | 2020000000051 | 1 | 90001 | 4 | 1 | 188.0 | 2 | 79 | 1 | 100 |
And there you have it!
To satisfy ourselves that the extract has been properly replicated, let’s compare my data frame with Marvin’s.
marvin_df.equals(renae_df)
True
TA DA!
Extract Sharing using IPUMS DDI Codebooks
Suppose that I am not hip to the API scene and I prefer the classic point-and-click method of creating extracts through the IPUMS website. Not being a Python user, I never created or submitted an extract via ipumspy, so I have no ipumspy extract object to save to a JSON file. How will I ever share my extract with Marvin?!
If the extract I created through the IPUMS web UI contains only microdata collections and features supported by the IPUMS Extract API, I can just download and send him the DDI codebook that accompanies my original IPUMS extract. Marvin can parse that DDI and use that ipumspy Codebook object to create his own extract! Check it out.
extract_file = f"{renae_extract.collection}_{str(renae_extract.extract_id).zfill(5)}"
renae_ddi = readers.read_ipums_ddi(f'{extract_file}.xml')
marvin_extract = define_extract_from_ddi(renae_ddi)
marvin_extract.build()
{'description': 'My IPUMS USA extract',
 'data_format': 'fixed_width',
 'data_structure': {'rectangular': {'on': 'P'}},
 'samples': {'us2020a': {}},
 'variables': {'YEAR': {},
  'SAMPLE': {},
  'SERIAL': {},
  'CBSERIAL': {},
  'HHWT': {},
  'CLUSTER': {},
  'STATEFIP': {},
  'STRATA': {},
  'GQ': {},
  'PERNUM': {},
  'PERWT': {},
  'SEX': {},
  'AGE': {},
  'RACE': {},
  'RACED': {}},
 'collection': 'usa'}
Marvin has now rebuilt my extract from its DDI file and is ready to submit and download it as shown above!
Note that this workflow will ONLY work if the extract made through the IPUMS web UI contains samples from IPUMS data collections, data formats, and features supported by the IPUMS Microdata Extract API. Attempts to create and submit an extract using ipumspy based on a DDI codebook that contains variables created through the “Attach characteristics” feature, that requests a hierarchical data format, or that contains samples from unsupported IPUMS data collections will result in an error.
Final Thoughts and Further Resources
The IPUMS Extract API opens up exciting new possibilities for streamlined workflows and collaboration using IPUMS data! This blog post has shown you how to create, download, and read an IPUMS USA extract into Pandas using ipumspy and how to share that extract definition with another IPUMS user. For more information on the IPUMS Extract API, see our API developer documentation. For more details about ipumspy functionality, see the ipumspy documentation. Have a question about the IPUMS Extract API? Check out the API discussion board on our user forum!
Use it for Good!