A.2 Census Data

The US Census provides invaluable information about American communities. Census datasets encompass many disparate topics, cover the entire US population, and are freely available to download. However, there are some barriers to using them. For small geographies like a neighborhood, statistical uncertainty may be high because the estimates come from a small sample of the population. This complicates common data processing steps such as combining estimates or calculating the proportion of a given indicator. Additionally, because census tract boundaries are subject to change every ten years, datasets must be normalized to account for these changes before any comparative analysis can begin. Lastly, census datasets describe a specific geography (e.g., people in Tukwila, or households in King County) and are therefore spatial in nature. Spatial data, from the Census or other sources, present their own challenges that researchers must address in their choice of methods.

This analysis addresses these challenges by leveraging the capabilities of R, an open-source statistical programming language. While other software exists for working with Census data, several R-based tools can be combined efficiently and effectively. The method for downloading, organizing, and processing the data is summarized in the following steps:

Define census geographies of interest
Identify relevant tables from the American Community Survey
Download tables using the acs R package
Normalize the pre-2010 dataset using Brown University’s Longitudinal Tract Database
Approximate the COO site communities by combining census tracts

ACS Geographies

There are many ways to collect US Census data, but this method uses the R package called acs to extract data with the official US Census API. This method is efficient, reproducible, and allows users to download census tables for a group of dissimilar geographies. To learn more about this, see the acs package documentation.

This analysis uses the following three types of census geographies:

Counties (King)
County subdivisions (Seattle CCD)
Census tracts (all tracts within King County)

ACS Tables

The following tables from the American Community Survey (ACS) are used to create indicators in this assessment:

Table Name	Topic	Universe
B03002	HISPANIC OR LATINO ORIGIN BY RACE	Total population
B15002	SEX BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER	Population 25 years and over
B19001	HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2015 INFLATION-ADJUSTED DOLLARS)	Households
B25033	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY UNITS IN STRUCTURE	Total population in occupied housing units
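As a rough sketch of how the geographies and tables above might be requested with the acs package, the snippet below builds a single geo.set covering the three geography types and fetches each table for both five-year periods. It assumes the acs package is installed and that a Census API key is stored in an environment variable named CENSUS_API_KEY (a hypothetical name chosen for this example). The object and function names (geographies, table_ids, fetch_tables) are illustrative and are not taken from the project's code.

```r
library(acs)

# The three geography types used in this analysis
king_county <- geo.make(state = "WA", county = "King")
seattle_ccd <- geo.make(state = "WA", county = "King",
                        county.subdivision = "Seattle")  # matching by name is an assumption
king_tracts <- geo.make(state = "WA", county = "King", tract = "*")  # "*" = all tracts

# geo.set objects can be added together into one request
geographies <- king_county + seattle_ccd + king_tracts

# Table numbers listed in the table above
table_ids <- c("B03002", "B15002", "B19001", "B25033")

# Hypothetical environment variable holding the Census API key
api_key <- Sys.getenv("CENSUS_API_KEY")

# Fetch every table for one five-year period; returns a named list of acs objects
fetch_tables <- function(endyear) {
  lapply(setNames(table_ids, table_ids), function(id) {
    acs.fetch(endyear = endyear, span = 5, geography = geographies,
              table.number = id, key = api_key)
  })
}

# Only contact the API when a key is actually available
if (nzchar(api_key)) {
  tables_2009 <- fetch_tables(2009)  # 2005-2009 estimates
  tables_2015 <- fetch_tables(2015)  # 2011-2015 estimates
}
```

Keeping the geography definition separate from the fetch step means a table can be added or dropped by editing table_ids alone.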

Prior to normalization, the tables are stored in two separate dataframes, one for the 2005-2009 data and another for the 2011-2015 data:

Census Tables, 2005-2009

Simple feature collection with 375 features and 92 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Census Tables, 2011-2015

Simple feature collection with 400 features and 92 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Data Structure: `acs` objects distributed in `sf` objects

In this method, each row represents a census geography and each column holds a single variable from a single census table. For instance, column B03002_003 holds the third variable of the ‘Hispanic or Latino Origin by Race’ table (B03002), which is the estimated number of people who identify as “Not Hispanic or Latino: White alone”:

Simple feature collection with 1 feature and 4 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Each “cell” of the dataframe contains a single acs-class object¹, which bundles the estimate value, its standard error, the geographic identifier, and other useful metadata:

ACS DATA: 
 2005 -- 2009 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                                        B03002_003  
Census Tract 1, King County, Washington 3596 +/- 358

Formal class 'acs' [package "acs"] with 9 slots
  ..@ endyear       : int 2009
  ..@ span          : int 5
  ..@ geography     :'data.frame':    1 obs. of  5 variables:
  .. ..$ NAME             : chr "Census Tract 1, King County, Washington"
  .. ..$ state            : int 53
  .. ..$ county           : chr "33"
  .. ..$ countysubdivision: chr NA
  .. ..$ tract            : chr "000100"
  ..@ acs.colnames  : chr "B03002_003"
  ..@ modified      : logi TRUE
  ..@ acs.units     : Factor w/ 5 levels "count","dollars",..: NA
  ..@ currency.year : int 2009
  ..@ estimate      : num [1, 1] 3596
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "Census Tract 1, King County, Washington"
  .. .. ..$ : chr "B03002_003"
  ..@ standard.error: num [1, 1] 218
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "Census Tract 1, King County, Washington"
  .. .. ..$ : chr "B03002_003"
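The printed summary reports a 90% margin of error (+/- 358) while the object stores a standard error (218). The short base R sketch below illustrates how the two relate under the usual normal approximation. The variable names are illustrative, and the inputs are the rounded values displayed above, so the result only approximately reproduces the printed margin.

```r
# Values copied from the output above (the displayed standard error is rounded)
est <- 3596
se  <- 218

# Critical value for a two-sided 90% interval, roughly 1.645
z90 <- qnorm(0.95)

# Margin of error and confidence interval implied by the standard error
moe90 <- z90 * se
ci90  <- c(lower = est - moe90, upper = est + moe90)

# Coefficient of variation, a common gauge of estimate reliability
cv <- se / est

print(round(moe90, 1))
print(round(ci90))
print(round(cv, 3))
```

Storing the standard error rather than the margin of error keeps later arithmetic simple, because standard errors can be combined directly when estimates are added together.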

Storing acs objects in a simple feature dataframe² is unconventional but it follows a general principle of computing: don’t repeat yourself (DRY). The dataframe structure keeps related acs objects and geometries together, yielding benefits when the time comes to operate on the data.

For example, if census tracts need to be normalized before temporal comparison (as is the case in this project), that process can occur in a single, comprehensive step rather than individually for each census table. This efficiency gain is particularly important if census tables are added or removed, which may occur frequently in the exploratory phase of an analysis.

Normalized pre-2010 Data

Ultimately the ACS data will be combined into a single simple feature object, but before that can happen the pre-2010 data must be normalized. The Longitudinal Tract Database (LTDB) 2000-2010 Crosswalk file is a tabular tool that identifies which tracts changed from one decade to the next, what type of change occurred (e.g., consolidation, split, many-to-many, none), and what weight should be used to impute the pre-2010 values. This information makes it possible to conduct meaningful temporal analysis on tracts whose boundaries changed between the two decades. More information regarding the normalization method can be found at the Longitudinal Tract Database website.
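To make the weighting idea concrete, here is a minimal base R sketch of crosswalk-based normalization on made-up data. The tract labels, column names, weights, and estimates are all hypothetical and do not come from the LTDB file or this project. Treating each weight as a fixed constant when propagating standard errors is also an assumption of this sketch, not a description of the method actually used.

```r
# Hypothetical crosswalk: tract A is unchanged, B splits in two, C and D consolidate
crosswalk <- data.frame(
  tract_old = c("A", "B",  "B",  "C",  "D"),
  tract_new = c("A", "B1", "B2", "CD", "CD"),
  weight    = c(1,   0.6,  0.4,  1,    1)
)

# Made-up count estimates and standard errors on the old boundaries
old_data <- data.frame(
  tract_old = c("A", "B", "C", "D"),
  estimate  = c(1000, 2000, 500, 700),
  se        = c(50, 80, 30, 40)
)

# Attach the old values to every crosswalk row
joined <- merge(crosswalk, old_data, by = "tract_old")

# Weight each piece; variances scale by the squared weight
joined$w_est <- joined$weight * joined$estimate
joined$w_var <- (joined$weight * joined$se)^2

# Sum the weighted pieces within each new tract
normalized <- aggregate(cbind(w_est, w_var) ~ tract_new, data = joined, FUN = sum)
normalized$estimate <- normalized$w_est
normalized$se       <- sqrt(normalized$w_var)

print(normalized[, c("tract_new", "estimate", "se")])
```

Because the weights for each old tract sum to one, the total count is preserved after the data are moved onto the new boundaries.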

Once the pre-2010 data has been normalized, the data for the two observation periods can be combined into a single dataframe:

Simple feature collection with 399 features and 176 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

COO Communities

The primary geographic unit of this assessment is the census tract. As is the case with many communities, the census tracts do not coincide exactly with the formal boundaries of the study’s three sites, so the tract groupings should be read as spatial approximations of these communities.

Listed below are the geographic identifiers of the census tracts that approximate each site.

TABLE A.1: Census Tract GEOIDs
Rainier Valley (2009)	Rainier Valley (2015)	White Center (2009)	White Center (2015)	SeaTac/Tukwila (2009)	SeaTac/Tukwila (2015)
53033010000	53033010001	53033026900	53033026600	53033026100	53033026200
53033010300	53033010300	53033026500	53033026700	53033026200	53033027300
53033010400	53033010401	53033026600	53033026500	53033026300	53033028000
53033011000	53033011001	53033026700	53033026801	53033026400	53033028100
53033011101	53033011002	53033026801	53033026802	53033027100	53033028300
53033011102	53033011101	53033026802	53033027000	53033027200	53033028402
53033011700	53033011102	53033027000	NA	53033027300	53033028403
53033011800	53033011700	NA	NA	53033028000	53033028500
53033011900	53033011800	NA	NA	53033028100	53033028700
NA	53033011900	NA	NA	53033028200	53033028801
NA	NA	NA	NA	53033028300	53033028802
NA	NA	NA	NA	53033028402	53033029101
NA	NA	NA	NA	53033028403	53033026100
NA	NA	NA	NA	53033028500	53033026200
NA	NA	NA	NA	53033028700	53033026300
NA	NA	NA	NA	53033028801	53033026400
NA	NA	NA	NA	53033028802	53033027100
NA	NA	NA	NA	53033029100	53033027200
NA	NA	NA	NA	NA	53033027300
NA	NA	NA	NA	NA	53033028100
NA	NA	NA	NA	NA	53033028200
NA	NA	NA	NA	NA	53033028300
NA	NA	NA	NA	NA	53033028802

To create the community approximations, the tract boundaries of each community are merged and the estimates from each census table are aggregated. In addition to combining the estimates, this method recalculates the standard error for each aggregated value. It should be noted that this method is only valid for census tables representing count data.³
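The arithmetic behind this aggregation can be sketched in a few lines of base R. The tract labels and values below are made up for illustration. The sketch uses the standard approximation for a sum of independent estimates (the standard error of the sum is the square root of the summed squared standard errors), which is stated here as an assumption about the calculation rather than a transcript of the project's code.

```r
# Made-up count estimates and standard errors for three tracts in one community
tracts <- data.frame(
  geoid    = c("tract_1", "tract_2", "tract_3"),  # hypothetical labels
  estimate = c(3596, 4120, 2875),
  se       = c(218, 240, 190)
)

# Counts are additive, so the community estimate is a plain sum
community_estimate <- sum(tracts$estimate)

# Standard errors are not additive: add the variances, then take the square root
community_se <- sqrt(sum(tracts$se^2))

# 90% margin of error for the aggregated estimate
community_moe90 <- qnorm(0.95) * community_se

print(community_estimate)
print(round(community_se, 1))
print(round(community_moe90, 1))
```

The same rule does not carry over to medians or other non-count statistics, which is why the method is restricted to count tables.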

Simple feature collection with 403 features and 180 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
epsg (SRID):    NA
proj4string:    NA

Community Maps

¹ More information on the acs class can be found in the acs package documentation and the package author’s user guide.
² More information on simple features can be found here, while the implementation of this data structure in R is documented here and here.
³ This limitation is made explicit by the acs package creator, Ezra Haber Glenn, here.