A.2 Census Data

The US Census provides invaluable information about American communities. Its datasets encompass many disparate topics, cover the entire US population, and are freely available to download. However, there are some barriers to using them. For small geographies like a neighborhood, the statistical uncertainty may be high because the information comes from a small sample of the population. This complicates common data processing steps such as combining estimates or calculating the proportion of a given indicator. Additionally, because census tract boundaries are subject to change every ten years, datasets must be normalized to account for this change before any comparative analysis can begin. Lastly, census datasets describe a specific geography (e.g., people in Tukwila, or households in King County) and are therefore spatial in nature. Spatial data, from the Census or other sources, present their own challenges that researchers must address in their choice of methods.

This analysis addresses these challenges with R, an open-source statistical programming language. While other software exists for working with Census data, several R-based tools can be combined efficiently and effectively. The steps for downloading, organizing, and processing the data are summarized below:

  1. Define census geographies of interest
  2. Identify relevant tables from the American Community Survey
  3. Download tables using the acs R package
  4. Normalize the pre-2010 dataset using Brown University’s Longitudinal Tract Database
  5. Approximate the COO site communities by combining census tracts

ACS Geographies

There are many ways to collect US Census data, but this method uses the R package called acs to extract data with the official US Census API. This method is efficient, reproducible, and allows users to download census tables for a group of dissimilar geographies. To learn more about this, see the acs package documentation.

This analysis uses the following three types of census geographies:

  • Counties (King)
  • County subdivisions (Seattle CCD)
  • Census tracts (all tracts within King County)

ACS Tables

The following tables from the American Community Survey (ACS) are used to create the indicators in this assessment:

Table Name | Topic | Universe
B03002 | HISPANIC OR LATINO ORIGIN BY RACE | Total population
B15002 | SEX BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER | Population 25 years and over
B19001 | HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2015 INFLATION-ADJUSTED DOLLARS) | Households
B25033 | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY UNITS IN STRUCTURE | Total population in occupied housing units
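The geographies and tables listed above could be requested along the following lines. This is a minimal sketch rather than the project's actual code: it assumes the acs package is installed and that a Census API key has already been registered with `api.key.install()`. The helper name `fetch_acs_table` is hypothetical, and matching the county subdivision by the name "Seattle CCD" is an assumption about how the package resolves names.

```r
# Sketch only: assumes the acs package is installed and a Census API key
# has been registered once with acs::api.key.install(key = "...").
# fetch_acs_table() is a hypothetical helper, not part of the acs package.
fetch_acs_table <- function(table_number, endyear) {
  stopifnot(requireNamespace("acs", quietly = TRUE))

  # The three geography types used in this analysis
  king_county  <- acs::geo.make(state = "WA", county = "King")
  seattle_ccd  <- acs::geo.make(state = "WA", county = "King",
                                county.subdivision = "Seattle CCD")
  king_tracts  <- acs::geo.make(state = "WA", county = "King", tract = "*")

  # Combine the dissimilar geographies into a single geo.set
  geographies <- king_county + seattle_ccd + king_tracts

  # Download one table of 5-year estimates for all geographies at once
  acs::acs.fetch(endyear = endyear, span = 5,
                 geography = geographies,
                 table.number = table_number,
                 col.names = "auto")
}
```

Calling, for example, `fetch_acs_table("B03002", 2015)` would then return an acs object for the 2011-2015 period; the call needs network access, so it is not run here.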

Prior to normalization, the tables are stored in two separate dataframes: one for the 2005-2009 data, and another for the 2011-2015 data:

Census Tables, 2005-2009

Simple feature collection with 375 features and 92 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Census Tables, 2011-2015

Simple feature collection with 400 features and 92 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Data Structure: acs objects distributed in sf objects

In this method, each row represents one census geography and each column holds a single column (variable) of a single census table. For instance, column B03002_003 holds the third column of the ‘Hispanic or Latino Origin by Race’ table, which is the estimate of people who identify as “Not Hispanic or Latino: White alone”:

Simple feature collection with 1 feature and 4 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Each “cell” of the dataframe contains a single acs-class object¹, which itself contains a set of metadata including the estimate value, standard error, geographic identifier, and other useful information:

ACS DATA: 
 2005 -- 2009 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                                        B03002_003  
Census Tract 1, King County, Washington 3596 +/- 358
Formal class 'acs' [package "acs"] with 9 slots
  ..@ endyear       : int 2009
  ..@ span          : int 5
  ..@ geography     :'data.frame':    1 obs. of  5 variables:
  .. ..$ NAME             : chr "Census Tract 1, King County, Washington"
  .. ..$ state            : int 53
  .. ..$ county           : chr "33"
  .. ..$ countysubdivision: chr NA
  .. ..$ tract            : chr "000100"
  ..@ acs.colnames  : chr "B03002_003"
  ..@ modified      : logi TRUE
  ..@ acs.units     : Factor w/ 5 levels "count","dollars",..: NA
  ..@ currency.year : int 2009
  ..@ estimate      : num [1, 1] 3596
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "Census Tract 1, King County, Washington"
  .. .. ..$ : chr "B03002_003"
  ..@ standard.error: num [1, 1] 218
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "Census Tract 1, King County, Washington"
  .. .. ..$ : chr "B03002_003"
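The printed "+/-" value and the stored standard error are two views of the same uncertainty. As a rough illustration in base R, using the two numbers shown in the output above and the 1.645 multiplier that the Census Bureau uses for 90% margins of error, the margin can be recovered from the standard error. The variable names are illustrative only.

```r
# Values copied from the Census Tract 1 output above (B03002_003, 2005-2009)
estimate       <- 3596
standard_error <- 218    # as displayed by str(), i.e. already rounded

# Census Bureau convention: 90% margin of error = 1.645 * standard error
z90   <- 1.645
moe90 <- z90 * standard_error

# Corresponding 90% confidence interval around the estimate
ci90 <- c(lower = estimate - moe90, upper = estimate + moe90)
```

The computed margin differs from the printed 358 by less than one person, which is consistent with the standard error being rounded for display.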

Storing acs objects in a simple feature dataframe² is unconventional, but it follows a general principle of computing: don’t repeat yourself (DRY). The dataframe structure keeps related acs objects and geometries together, which pays off when the time comes to operate on the data.

For example, if census tracts need to be normalized before temporal comparison (as is the case in this project), that process can occur in a single, comprehensive step rather than separately for each census table. This efficiency gain is particularly important if census tables are added or removed, which may occur frequently in the exploratory phase of an analysis.

Normalized pre-2010 Data

Ultimately the ACS data will be combined into a single simple feature object, but before that can happen the pre-2010 data must be normalized. The LTDB 2000-2010 Crosswalk file is a lookup table that identifies which tracts changed from one decade to the next, what type of change occurred (e.g., consolidation, split, many-to-many, none), and what weight should be used to impute the pre-2010 values. This information makes it possible to conduct meaningful temporal analysis on tracts whose boundaries changed between the two decades. More information regarding the normalization method can be found at the Longitudinal Tract Database website.
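The weighting idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the project's code: the tract identifiers and counts are invented, and the column names `trtid00`, `trtid10`, and `weight` are assumed to mirror those in the crosswalk file. Each pre-2010 count is multiplied by its weight and the results are summed within each 2010 tract.

```r
# Hypothetical crosswalk: one row per (old tract, new tract) pair.
# Column names are assumed to follow the LTDB crosswalk file.
crosswalk <- data.frame(
  trtid00 = c("old_A", "old_A", "old_B", "old_C"),
  trtid10 = c("new_X", "new_Y", "new_Z", "new_Z"),
  weight  = c(0.6, 0.4, 1.0, 1.0)   # share of the old tract assigned to the new one
)

# Hypothetical pre-2010 count estimates on the old tract boundaries
counts_pre2010 <- data.frame(
  trtid00  = c("old_A", "old_B", "old_C"),
  estimate = c(1000, 500, 300)
)

# Attach each old-tract estimate to its crosswalk rows, then apply the weight
merged <- merge(crosswalk, counts_pre2010, by = "trtid00")
merged$weighted <- merged$estimate * merged$weight

# Sum the weighted pieces within each 2010 tract
normalized <- aggregate(weighted ~ trtid10, data = merged, FUN = sum)
```

Here `old_A` is split between two new tracts, while `old_B` and `old_C` are consolidated into one, so the total count is preserved across the change of boundaries. The sketch covers estimates only; standard errors need separate treatment.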

Once the pre-2010 data have been normalized, the data for the two observation periods can be combined into a single dataframe:

Simple feature collection with 399 features and 176 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA


COO Communities

The primary geographic unit of this assessment is the census tract. As is the case with many communities, tract boundaries do not coincide exactly with the formal geographic boundaries of the study’s three sites, so the groups of tracts used here should be considered spatial approximations of these communities.

Listed below are the geographic identifiers of the census tracts that approximate each site.

TABLE A.1: Census Tract GEOIDs
Rainier Valley (2009) | Rainier Valley (2015) | White Center (2009) | White Center (2015) | SeaTac/Tukwila (2009) | SeaTac/Tukwila (2015)
53033010000 | 53033010001 | 53033026900 | 53033026600 | 53033026100 | 53033026200
53033010300 | 53033010300 | 53033026500 | 53033026700 | 53033026200 | 53033027300
53033010400 | 53033010401 | 53033026600 | 53033026500 | 53033026300 | 53033028000
53033011000 | 53033011001 | 53033026700 | 53033026801 | 53033026400 | 53033028100
53033011101 | 53033011002 | 53033026801 | 53033026802 | 53033027100 | 53033028300
53033011102 | 53033011101 | 53033026802 | 53033027000 | 53033027200 | 53033028402
53033011700 | 53033011102 | 53033027000 | NA | 53033027300 | 53033028403
53033011800 | 53033011700 | NA | NA | 53033028000 | 53033028500
53033011900 | 53033011800 | NA | NA | 53033028100 | 53033028700
NA | 53033011900 | NA | NA | 53033028200 | 53033028801
NA | NA | NA | NA | 53033028300 | 53033028802
NA | NA | NA | NA | 53033028402 | 53033029101
NA | NA | NA | NA | 53033028403 | 53033026100
NA | NA | NA | NA | 53033028500 | 53033026200
NA | NA | NA | NA | 53033028700 | 53033026300
NA | NA | NA | NA | 53033028801 | 53033026400
NA | NA | NA | NA | 53033028802 | 53033027100
NA | NA | NA | NA | 53033029100 | 53033027200
NA | NA | NA | NA | NA | 53033027300
NA | NA | NA | NA | NA | 53033028100
NA | NA | NA | NA | NA | 53033028200
NA | NA | NA | NA | NA | 53033028300
NA | NA | NA | NA | NA | 53033028802

To create the community approximations, the tract boundaries of each community are merged and the estimates of each census table are aggregated. In addition to combining the estimates, this method also recalculates the standard error for each census table. It should be noted that this method is only valid for census tables representing count data.³

Simple feature collection with 403 features and 180 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA
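The aggregation of count estimates and their standard errors can be illustrated with a small base R sketch. It assumes the commonly used Census Bureau approximation, in which the standard error of a sum is the square root of the sum of the squared standard errors; whether the project applied exactly this formula is an assumption. Only the first tract's values come from the output shown earlier, and the other two tracts are invented for illustration.

```r
# Count estimates and standard errors for three tracts in one community.
# The first pair comes from the Census Tract 1 output; the rest are made up.
tract_estimates <- c(3596, 2100, 1500)
tract_se        <- c(218, 150, 120)

# Counts are additive, so the community estimate is the simple sum
community_estimate <- sum(tract_estimates)

# Approximate standard error of the sum (ignores covariance between tracts)
community_se <- sqrt(sum(tract_se^2))

# 90% margin of error for the combined estimate
community_moe90 <- 1.645 * community_se
```

Because the squared errors are summed before taking the root, the combined standard error is smaller than the plain sum of the tract-level standard errors. The same logic does not carry over to medians, ratios, or other non-count tables, which is why the method is restricted to count data.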

Community Maps


  1. More information on the acs-class can be found in the acs package documentation and the package author’s user guide.

  2. More information on simple features can be found here, while the implementation of this data structure in R is documented here and here.

  3. This limitation is made explicit by the acs package creator, Ezra Haber Glenn, here.