A.2 Census Data
The US Census provides invaluable information about American communities. The datasets provided by the Census encompass many disparate topics, comprehensively cover the entire US population, and are freely available to download. However, there are some barriers to using these datasets. When dealing with small geographies like a neighborhood, the statistical uncertainty may be high because the information comes from a small sample of the population. This complicates common data processing steps like combining estimates or calculating the proportion of a given indicator. Additionally, because census tract boundaries are subject to change every ten years, datasets must be normalized to account for this change before any comparative analysis can begin. Lastly, the census datasets represent information that is specific to a given geography (e.g., people in Tukwila, or households in King County) and are, therefore, spatial in nature. Spatial data, from the Census or other sources, present their own challenges that researchers must address in their choice of methods.
This analysis addresses these challenges by leveraging the capabilities of R, an open-source statistical programming language. While other software exists for working with Census data, there are several R-based tools that can be combined efficiently and effectively. The method for downloading, organizing, and processing the data is summarized in the following steps:
- Define census geographies of interest
- Identify relevant tables from American Community Survey
- Download tables using the `acs` R package
- Normalize the pre-2010 dataset using Brown University’s Longitudinal Tract Database
- Approximate the COO site communities by combining census tracts
ACS Geographies
There are many ways to collect US Census data, but this method uses the R package called `acs` to extract data with the official US Census API. This method is efficient, reproducible, and allows users to download census tables for a group of dissimilar geographies. To learn more, see the `acs` package documentation.
This analysis uses the following three types of census geographies:
- Counties (King)
- County subdivisions (Seattle CCD)
- Census tracts (all tracts within King County)
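The three geography types listed above could be defined with `geo.make()` from the `acs` package. The following is a minimal sketch rather than the code used in this analysis: the wrapper function `make_study_geographies` is a hypothetical name, and the sketch assumes that the name-based lookups ("King", "Seattle CCD") resolve as expected in the installed version of the package.

```r
# Sketch only: the function is defined here but not called.
# Assumes the `acs` package is installed when the function is eventually run.
make_study_geographies <- function() {
  # King County as a single geography
  king_county <- acs::geo.make(state = "WA", county = "King")

  # The Seattle county subdivision (CCD) within King County
  seattle_ccd <- acs::geo.make(state = "WA", county = "King",
                               county.subdivision = "Seattle CCD")

  # Every census tract in King County ("*" acts as a wildcard)
  king_tracts <- acs::geo.make(state = "WA", county = "King", tract = "*")

  # geo.set objects can be combined into one set of dissimilar geographies
  king_county + seattle_ccd + king_tracts
}

# Usage (requires the acs package):
# study_geos <- make_study_geographies()
```

Combining the geographies into one set means a single request can return counties, county subdivisions, and tracts together.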
ACS Tables
The following tables from the American Community Survey (ACS) are used to create indicators in this assessment:
Table Name | Topic | Universe |
---|---|---|
B03002 | HISPANIC OR LATINO ORIGIN BY RACE | Total population |
B15002 | SEX BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER | Population 25 years and over |
B19001 | HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2015 INFLATION-ADJUSTED DOLLARS) | Households |
B25033 | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY UNITS IN STRUCTURE | Total population in occupied housing units |
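Downloading the four tables above can be sketched as a loop over the table IDs. This is an illustration under stated assumptions, not the exact code behind this analysis: `fetch_acs_tables` is a hypothetical helper, the `geography` argument is assumed to be a geo.set created with `acs::geo.make()`, and a Census API key is assumed to have been installed beforehand with `acs::api.key.install()`.

```r
# Table IDs taken from the table above
table_ids <- c("B03002", "B15002", "B19001", "B25033")

# Hypothetical helper: fetch every table for one ACS period.
# Nothing is downloaded until the function is called.
fetch_acs_tables <- function(geography, endyear, span = 5) {
  tables <- lapply(table_ids, function(id) {
    acs::acs.fetch(endyear = endyear, span = span,
                   geography = geography, table.number = id)
  })
  names(tables) <- table_ids  # label each result with its table ID
  tables
}

# Usage (requires the acs package, an API key, and a geo.set):
# tables_2009 <- fetch_acs_tables(study_geos, endyear = 2009)
# tables_2015 <- fetch_acs_tables(study_geos, endyear = 2015)
```

Keeping the table IDs in one vector means a table can be added or dropped in a single place.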
Prior to normalization, the tables are stored in two separate dataframes: one for the 2005-2009 data, and another for the 2011-2015 data:
Census Tables, 2005-2009
FALSE Simple feature collection with 375 features and 92 fields
FALSE geometry type: MULTIPOLYGON
FALSE dimension: XY
FALSE bbox: xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
FALSE epsg (SRID): NA
FALSE proj4string: NA
FALSE First 10 features:
Census Tables, 2011-2015
FALSE Simple feature collection with 400 features and 92 fields
FALSE geometry type: MULTIPOLYGON
FALSE dimension: XY
FALSE bbox: xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
FALSE epsg (SRID): NA
FALSE proj4string: NA
FALSE First 10 features:
Data Structure: `acs` objects distributed in `sf` objects
In this method, each row contains a different census geography and each column contains a single column of a single census table. For instance, column `B03002_003` contains the third column of the ‘Hispanic or Latino, By Race’ table, which contains the estimate of people who identify as “Not Hispanic or Latino: White alone”:
FALSE Simple feature collection with 1 feature and 4 fields
FALSE geometry type: MULTIPOLYGON
FALSE dimension: XY
FALSE bbox: xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
FALSE epsg (SRID): NA
FALSE proj4string: NA
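The column naming convention described above (table ID, underscore, zero-padded column number) can be illustrated with a small helper. `acs_column_name` is a hypothetical function written for this example and is not part of the `acs` package.

```r
# Build a column name from a table ID and a column position,
# padding the position to three digits.
acs_column_name <- function(table_id, column_index) {
  sprintf("%s_%03d", table_id, column_index)
}

# Third column of the Hispanic or Latino Origin by Race table
acs_column_name("B03002", 3)
```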
Each “cell” of the dataframe contains a single `acs-class` object¹, which itself contains a set of metadata including the estimate value, standard error, geographic identifier, and other useful information:
FALSE ACS DATA:
FALSE 2005 -- 2009 ;
FALSE Estimates w/90% confidence intervals;
FALSE for different intervals, see confint()
FALSE B03002_003
FALSE Census Tract 1, King County, Washington 3596 +/- 358
FALSE Formal class 'acs' [package "acs"] with 9 slots
FALSE ..@ endyear : int 2009
FALSE ..@ span : int 5
FALSE ..@ geography :'data.frame': 1 obs. of 5 variables:
FALSE .. ..$ NAME : chr "Census Tract 1, King County, Washington"
FALSE .. ..$ state : int 53
FALSE .. ..$ county : chr "33"
FALSE .. ..$ countysubdivision: chr NA
FALSE .. ..$ tract : chr "000100"
FALSE ..@ acs.colnames : chr "B03002_003"
FALSE ..@ modified : logi TRUE
FALSE ..@ acs.units : Factor w/ 5 levels "count","dollars",..: NA
FALSE ..@ currency.year : int 2009
FALSE ..@ estimate : num [1, 1] 3596
FALSE .. ..- attr(*, "dimnames")=List of 2
FALSE .. .. ..$ : chr "Census Tract 1, King County, Washington"
FALSE .. .. ..$ : chr "B03002_003"
FALSE ..@ standard.error: num [1, 1] 218
FALSE .. ..- attr(*, "dimnames")=List of 2
FALSE .. .. ..$ : chr "Census Tract 1, King County, Washington"
FALSE .. .. ..$ : chr "B03002_003"
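The printed margin of error (358) and the stored standard error (218) in the output above are two views of the same quantity. A short check in base R, assuming the Census Bureau's usual 90% critical value of 1.645:

```r
# Values copied from the output above
moe_90 <- 358      # margin of error at the 90% confidence level
z_90   <- 1.645    # critical value used for 90% margins of error

# Standard error implied by the margin of error
standard_error <- moe_90 / z_90

# str() shows three significant digits, so this value appears as 218
signif(standard_error, 3)
```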
Storing `acs` objects in a simple feature dataframe² is unconventional, but it follows a general principle of computing: don’t repeat yourself (DRY). The dataframe structure keeps related `acs` objects and geometries together, yielding benefits when the time comes to operate on the data.
For example, if census tracts need to be normalized before temporal comparison (as is the case in this project), that process can occur in a single, comprehensive step rather than individually for each census table. This efficiency gain is particularly important if census tables are added or removed, which may occur frequently in the exploratory phase of an analysis.
Normalized pre-2010 Data
Ultimately the ACS data will be combined into a single simple feature object, but before that can happen the pre-2010 data must be normalized. The LTDB 2000-2010 Crosswalk file is a tabular tool that clarifies which tracts change from decade to decade, what type of change occurred (e.g., consolidation, split, many-to-many, none), and what weighting metric should be used to impute the pre-2010 values. This information makes it possible to conduct meaningful temporal analysis on tracts whose boundaries changed between the two decades. More information regarding the normalization method can be found at the Longitudinal Tract Database website.
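The weighting idea behind the crosswalk can be sketched with a toy example in base R. Everything below is illustrative: the tract IDs and estimates are invented, and the column names `trtid00`, `trtid10`, and `weight` are assumptions about the crosswalk layout that should be checked against the downloaded file. The sketch covers count estimates only and does not address standard errors.

```r
# Toy crosswalk: tract A splits into A1 and A2; tracts B and C merge into BC
crosswalk <- data.frame(
  trtid00 = c("A", "A", "B", "C"),     # pre-2010 tract ID
  trtid10 = c("A1", "A2", "BC", "BC"), # 2010 tract ID
  weight  = c(0.6, 0.4, 1, 1),         # share of the old tract assigned to the new one
  stringsAsFactors = FALSE
)

# Invented count estimates on pre-2010 boundaries
estimates_old <- data.frame(
  trtid00  = c("A", "B", "C"),
  estimate = c(1000, 300, 200),
  stringsAsFactors = FALSE
)

# Attach each old estimate to its crosswalk rows, apply the weight,
# then sum the weighted pieces within each 2010 tract
merged <- merge(crosswalk, estimates_old, by = "trtid00")
merged$weighted <- merged$weight * merged$estimate
normalized <- aggregate(weighted ~ trtid10, data = merged, FUN = sum)
normalized
```

In this toy case the split tract contributes 600 and 400 to its two successors, and the merged tract receives 500.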
Once the pre-2010 data have been normalized, the data for the two observation periods can be combined into a single dataframe:
FALSE Simple feature collection with 399 features and 176 fields
FALSE geometry type: MULTIPOLYGON
FALSE dimension: XY
FALSE bbox: xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
FALSE epsg (SRID): NA
FALSE proj4string: NA
FALSE First 10 features:
COO Communities
The primary geographic unit of this assessment is the census tract. As is the case with many communities, the census geographies do not coincide exactly with the formal geographic boundary of the study’s three sites, and should be considered as spatial approximations of these communities.
Listed below are the geographic identifiers of the census tracts that approximate each site.
Rainier Valley (2009) | Rainier Valley (2015) | White Center (2009) | White Center (2015) | SeaTac/Tukwila (2009) | SeaTac/Tukwila (2015) |
---|---|---|---|---|---|
53033010000 | 53033010001 | 53033026900 | 53033026600 | 53033026100 | 53033026200 |
53033010300 | 53033010300 | 53033026500 | 53033026700 | 53033026200 | 53033027300 |
53033010400 | 53033010401 | 53033026600 | 53033026500 | 53033026300 | 53033028000 |
53033011000 | 53033011001 | 53033026700 | 53033026801 | 53033026400 | 53033028100 |
53033011101 | 53033011002 | 53033026801 | 53033026802 | 53033027100 | 53033028300 |
53033011102 | 53033011101 | 53033026802 | 53033027000 | 53033027200 | 53033028402 |
53033011700 | 53033011102 | 53033027000 | NA | 53033027300 | 53033028403 |
53033011800 | 53033011700 | NA | NA | 53033028000 | 53033028500 |
53033011900 | 53033011800 | NA | NA | 53033028100 | 53033028700 |
NA | 53033011900 | NA | NA | 53033028200 | 53033028801 |
NA | NA | NA | NA | 53033028300 | 53033028802 |
NA | NA | NA | NA | 53033028402 | 53033029101 |
NA | NA | NA | NA | 53033028403 | 53033026100 |
NA | NA | NA | NA | 53033028500 | 53033026200 |
NA | NA | NA | NA | 53033028700 | 53033026300 |
NA | NA | NA | NA | 53033028801 | 53033026400 |
NA | NA | NA | NA | 53033028802 | 53033027100 |
NA | NA | NA | NA | 53033029100 | 53033027200 |
NA | NA | NA | NA | NA | 53033027300 |
NA | NA | NA | NA | NA | 53033028100 |
NA | NA | NA | NA | NA | 53033028200 |
NA | NA | NA | NA | NA | 53033028300 |
NA | NA | NA | NA | NA | 53033028802 |
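Each identifier in the table above is an 11-digit tract GEOID: a 2-digit state code, a 3-digit county code, and a 6-digit tract code. A small base R sketch shows how one could be split into its parts; `split_geoid` is a hypothetical helper introduced for illustration.

```r
# Split 11-digit tract GEOIDs into state, county, and tract codes
split_geoid <- function(geoid) {
  geoid <- as.character(geoid)
  data.frame(
    state  = substr(geoid, 1, 2),   # e.g. "53" for Washington
    county = substr(geoid, 3, 5),   # e.g. "033" for King County
    tract  = substr(geoid, 6, 11),  # six-digit tract code
    stringsAsFactors = FALSE
  )
}

split_geoid(c("53033010000", "53033026900"))
```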
To create the community approximations, the tract boundaries of each community are merged and the estimates of each census table are aggregated. In addition to combining the estimates, this method also recalculates the standard error for each census table. It should be noted that this method is only valid for census tables representing count data.³
FALSE Simple feature collection with 403 features and 180 fields
FALSE geometry type: MULTIPOLYGON
FALSE dimension: XY
FALSE bbox: xmin: -122.5279 ymin: 47.08446 xmax: -121.0657 ymax: 47.78033
FALSE epsg (SRID): NA
FALSE proj4string: NA
FALSE First 10 features:
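The aggregation of count estimates and their standard errors can be illustrated in base R. This is a sketch with invented numbers, and it assumes the common approximation in which the standard error of a sum is the square root of the summed squared standard errors (treating the tract estimates as independent). The exact procedure applied by the `acs` package may differ in detail.

```r
# Invented tract-level count estimates and standard errors for one community
tracts <- data.frame(
  estimate = c(1200, 800, 500),
  se       = c(30, 40, 120)
)

# The community estimate is the sum of the tract counts
community_estimate <- sum(tracts$estimate)

# Approximate standard error of the sum (assumes independent estimates)
community_se <- sqrt(sum(tracts$se^2))

# Corresponding 90% margin of error
community_moe_90 <- 1.645 * community_se

c(estimate = community_estimate, se = community_se, moe_90 = community_moe_90)
```

The same arithmetic does not carry over to medians, ratios, or other non-count tables, which is why the method is limited to count data.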
Community Maps
1. More information on the `acs-class` can be found in the `acs` package documentation and the package author’s user guide.
2. More information on the simple features can be found here, while the implementation of this data structure in R is documented here and here.
3. This limitation is made explicit by the `acs` package creator, Ezra Haber Glenn, here.