Example 3: Downloading State Data¶
For this example, we will be running a simple linear regression model, so we need an additional import:
import pandas as pd
import censusdata
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
import statsmodels.formula.api as sm
We begin by downloading data on some basic socioeconomic characteristics for all U.S. states:
statedata = censusdata.download('acs5', 2015, censusdata.censusgeo([('state', '*')]),
['B01001_001E', 'B19013_001E', 'B19083_001E',
'C17002_001E', 'C17002_002E', 'C17002_003E', 'C17002_004E',
'B03002_001E', 'B03002_003E', 'B03002_004E', 'B03002_012E',])
We then link data on the percent of voters in each state voting Democratic in the 2016 U.S. presidential election:
voting2016 = {
censusdata.censusgeo((('state', '01'),)): 34.6,
censusdata.censusgeo((('state', '02'),)): 37.7,
censusdata.censusgeo((('state', '04'),)): 45.4,
censusdata.censusgeo((('state', '05'),)): 33.8,
censusdata.censusgeo((('state', '06'),)): 61.6,
censusdata.censusgeo((('state', '08'),)): 47.2,
censusdata.censusgeo((('state', '09'),)): 54.5,
censusdata.censusgeo((('state', '10'),)): 53.4,
censusdata.censusgeo((('state', '11'),)): 92.8,
censusdata.censusgeo((('state', '12'),)): 47.8,
censusdata.censusgeo((('state', '13'),)): 45.6,
censusdata.censusgeo((('state', '15'),)): 62.3,
censusdata.censusgeo((('state', '16'),)): 27.6,
censusdata.censusgeo((('state', '17'),)): 55.4,
censusdata.censusgeo((('state', '18'),)): 37.9,
censusdata.censusgeo((('state', '19'),)): 42.2,
censusdata.censusgeo((('state', '20'),)): 36.2,
censusdata.censusgeo((('state', '21'),)): 32.7,
censusdata.censusgeo((('state', '22'),)): 38.4,
censusdata.censusgeo((('state', '23'),)): 47.9,
censusdata.censusgeo((('state', '24'),)): 60.5,
censusdata.censusgeo((('state', '25'),)): 60.8,
censusdata.censusgeo((('state', '26'),)): 47.3,
censusdata.censusgeo((('state', '27'),)): 46.9,
censusdata.censusgeo((('state', '28'),)): 39.7,
censusdata.censusgeo((('state', '29'),)): 38,
censusdata.censusgeo((('state', '30'),)): 36,
censusdata.censusgeo((('state', '31'),)): 34,
censusdata.censusgeo((('state', '32'),)): 47.9,
censusdata.censusgeo((('state', '33'),)): 47.6,
censusdata.censusgeo((('state', '34'),)): 55,
censusdata.censusgeo((('state', '35'),)): 48.3,
censusdata.censusgeo((('state', '36'),)): 58.8,
censusdata.censusgeo((('state', '37'),)): 46.7,
censusdata.censusgeo((('state', '38'),)): 27.8,
censusdata.censusgeo((('state', '39'),)): 43.5,
censusdata.censusgeo((('state', '40'),)): 28.9,
censusdata.censusgeo((('state', '41'),)): 51.7,
censusdata.censusgeo((('state', '42'),)): 47.6,
censusdata.censusgeo((('state', '44'),)): 55.4,
censusdata.censusgeo((('state', '45'),)): 40.8,
censusdata.censusgeo((('state', '46'),)): 31.7,
censusdata.censusgeo((('state', '47'),)): 34.9,
censusdata.censusgeo((('state', '48'),)): 43.4,
censusdata.censusgeo((('state', '49'),)): 27.8,
censusdata.censusgeo((('state', '50'),)): 61.1,
censusdata.censusgeo((('state', '51'),)): 49.9,
censusdata.censusgeo((('state', '53'),)): 54.4,
censusdata.censusgeo((('state', '54'),)): 26.5,
censusdata.censusgeo((('state', '55'),)): 46.9,
censusdata.censusgeo((('state', '56'),)): 22.5,
}
voting2016 = pd.DataFrame.from_dict(voting2016, orient='index')
statedata['percent_democratic_pres_2016'] = voting2016
We then rename columns, compute some additional variables, and rescale some variables to make regression coefficients more easily interpretable:
statedata = statedata.rename(columns={'B01001_001E': 'population_size'})
statedata.population_size = statedata.population_size / 100000
statedata = statedata.rename(columns={'B19013_001E': 'median_HH_income'})
statedata['median_HH_income'] = statedata['median_HH_income'] / 1000
statedata = statedata.rename(columns={'B19083_001E': 'gini_index'})
statedata.gini_index = statedata.gini_index * 100
statedata['percent_below_125_poverty'] = (statedata['C17002_002E'] + statedata['C17002_003E'] + statedata['C17002_004E']) / statedata['C17002_001E'] * 100
statedata['percent_nonhisp_white'] = statedata['B03002_003E'] / statedata['B03002_001E'] * 100
statedata['percent_nonhisp_black'] = statedata['B03002_004E'] / statedata['B03002_001E'] * 100
statedata['percent_hispanic'] = statedata['B03002_012E'] / statedata['B03002_001E'] * 100
We run a quick check on the data and then delete variables we no longer need:
assert (statedata['population_size'] == statedata['B03002_001E'] / 100000).all()
for column in ['C17002_001E', 'C17002_002E', 'C17002_003E', 'C17002_004E',
'B03002_001E', 'B03002_003E', 'B03002_004E', 'B03002_012E',]:
del statedata[column]
We are only interested in the 50 states + DC, so we drop Puerto Rico:
statedata = statedata.drop([censusdata.censusgeo([('state', '72')])])
Finally, we reorder the variables and run simple descriptives:
statedata = statedata.reindex(columns=['percent_democratic_pres_2016', 'population_size', 'median_HH_income', 'percent_below_125_poverty', 'gini_index', 'percent_nonhisp_white', 'percent_nonhisp_black', 'percent_hispanic'])
statedata.describe()
percent_democratic_pres_2016 | population_size | median_HH_income | percent_below_125_poverty | gini_index | percent_nonhisp_white | percent_nonhisp_black | percent_hispanic | |
---|---|---|---|---|---|---|---|---|
count | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 |
mean | 45.05 | 62.06 | 54.64 | 19.44 | 46.22 | 69.53 | 10.91 | 11.20 |
std | 12.41 | 70.53 | 9.16 | 3.94 | 2.14 | 16.12 | 10.77 | 10.06 |
min | 22.50 | 5.80 | 39.66 | 11.84 | 41.81 | 22.89 | 0.44 | 1.37 |
25% | 36.10 | 17.34 | 47.55 | 16.25 | 44.81 | 58.43 | 3.17 | 4.72 |
50% | 46.70 | 43.97 | 53.00 | 20.08 | 46.26 | 73.60 | 7.12 | 8.84 |
75% | 52.55 | 68.46 | 60.68 | 22.45 | 47.59 | 81.23 | 14.92 | 12.88 |
max | 92.80 | 384.21 | 74.55 | 28.96 | 53.17 | 93.88 | 47.98 | 47.36 |
Then we examine bivariate correlations prior to running a linear regression model:
statedata.corr()
percent_democratic_pres_2016 | population_size | median_HH_income | percent_below_125_poverty | gini_index | percent_nonhisp_white | percent_nonhisp_black | percent_hispanic | |
---|---|---|---|---|---|---|---|---|
percent_democratic_pres_2016 | 1.00 | 0.24 | 0.57 | -0.21 | 0.47 | -0.53 | 0.34 | 0.26 |
population_size | 0.24 | 1.00 | 0.03 | 0.18 | 0.43 | -0.40 | 0.11 | 0.53 |
median_HH_income | 0.57 | 0.03 | 1.00 | -0.81 | -0.09 | -0.27 | -0.06 | 0.11 |
percent_below_125_poverty | -0.21 | 0.18 | -0.81 | 1.00 | 0.48 | -0.23 | 0.39 | 0.19 |
gini_index | 0.47 | 0.43 | -0.09 | 0.48 | 1.00 | -0.45 | 0.61 | 0.28 |
percent_nonhisp_white | -0.53 | -0.40 | -0.27 | -0.23 | -0.45 | 1.00 | -0.46 | -0.63 |
percent_nonhisp_black | 0.34 | 0.11 | -0.06 | 0.39 | 0.61 | -0.46 | 1.00 | -0.13 |
percent_hispanic | 0.26 | 0.53 | 0.11 | 0.19 | 0.28 | -0.63 | -0.13 | 1.00 |
result = sm.ols(formula=("percent_democratic_pres_2016 ~ population_size + median_HH_income"
"+ percent_nonhisp_black + percent_hispanic"), data=statedata).fit()
result.summary()
Dep. Variable: | percent_democratic_pres_2016 | R-squared: | 0.532 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.492 |
Method: | Least Squares | F-statistic: | 13.08 |
Date: | Sat, 07 Aug 2021 | Prob (F-statistic): | 3.42e-07 |
Time: | 12:37:24 | Log-Likelihood: | -180.94 |
No. Observations: | 51 | AIC: | 371.9 |
Df Residuals: | 46 | BIC: | 381.5 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -5.7076 | 7.801 | -0.732 | 0.468 | -21.409 | 9.994 |
population_size | 0.0121 | 0.021 | 0.563 | 0.576 | -0.031 | 0.055 |
median_HH_income | 0.7715 | 0.138 | 5.603 | 0.000 | 0.494 | 1.049 |
percent_nonhisp_black | 0.4551 | 0.120 | 3.790 | 0.000 | 0.213 | 0.697 |
percent_hispanic | 0.2578 | 0.151 | 1.704 | 0.095 | -0.047 | 0.562 |
Omnibus: | 2.104 | Durbin-Watson: | 2.514 |
---|---|---|---|
Prob(Omnibus): | 0.349 | Jarque-Bera (JB): | 1.237 |
Skew: | 0.208 | Prob(JB): | 0.539 |
Kurtosis: | 3.640 | Cond. No. | 647. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In this simple model, the percentage voting Democratic is not significantly associated with population size or % Hispanic, at the p<.05 level. It is significantly associated with median household income and the % non-Hispanic black. Every $1,000 increase in median household income is associated with an increase of just under 1 percentage point in the Democratic vote. Every one percentage point increase in the % non-Hispanic black is associated with about a half a percentage point increase in the Democratic vote. Of course,
- The outcome variable is not continuous, due to its bounded range, and this model does not account for this (it is essentially a linear probability model);
- The choice of covariates is simplistic and just designed to demonstrate fitting a model;
- We might consider robust standard errors for this model.