Estimators
Estimators are algorithms that compute statistical estimates (here, fair market values) from data (here, sales transactions).
Our research design compares the predictive accuracy of linear regressions (ordinary least squares, OLS) and tree ensemble methods (extremely randomized trees, gradient boosting, histogram-based gradient boosting).
Linear regressions (OLS)
Linear models offer a simple parametric approach to model land values.
We use the default specification of statsmodels’ OLS (ordinary least squares) regressor.
Strengths
Interpretability: coefficients can be interpreted as marginal effects if the model's assumptions are satisfied. This allows statements such as: “a 10% increase in wetlands is associated with an X% reduction in property values.”
Extrapolation: an OLS model that is correctly specified and fitted can make predictions outside the range of observed sales prices or predictors.
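The extrapolation point can be illustrated with a small synthetic example. The data, the slope of 2, and all variable names below are assumptions made for illustration and are not part of our pipeline. A linear model fitted on a predictor observed on [0, 10] continues the fitted trend at x = 20, whereas a tree ensemble averages training outcomes and therefore cannot predict beyond the largest value it has seen.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic training data (assumption): y = 2x + noise, x observed only on [0, 10]
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 0.5, size=200)

lm = LinearRegression().fit(X_train, y_train)
ert = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Predict far outside the observed range of the predictor
X_new = np.array([[20.0]])
lm_pred = lm.predict(X_new)[0]    # follows the fitted line beyond the data
ert_pred = ert.predict(X_new)[0]  # an average of training outcomes

print(f"OLS: {lm_pred:.1f}  ERT: {ert_pred:.1f}  max training y: {y_train.max():.1f}")
```

Whether the linear extrapolation is trustworthy still depends on the model being correctly specified outside the observed range.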
Challenges
Analysts need to know the functional form: OLS regressions require the definition of a regression equation, which should represent the real-life data-generating process. If the data-generating process is unknown, or varies across space, it is easy to misspecify OLS models, leading to biased results.
Lack of flexibility: unless they are specified in the regression formula, OLS models do not consider the potentially large number of non-linearities and high-dimensional interactions between variables. As a result, OLS models often offer lower predictive accuracy than more flexible modeling strategies (e.g., Extremely Randomized Trees).
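A minimal sketch of the flexibility issue, using synthetic data in which the outcome is driven purely by an interaction between two predictors (the data-generating process and all names are assumptions for illustration). A main-effects OLS model misses the signal, OLS with the interaction written into the design matrix recovers it, and a tree ensemble finds it without being told.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumption): the outcome depends only on the product x1 * x2
X = rng.uniform(-1, 1, size=(2500, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.05, size=2500)
X_train, X_test, y_train, y_test = X[:2000], X[2000:], y[:2000], y[2000:]

def with_interaction(A):
    """Append the x1 * x2 column to the design matrix."""
    return np.column_stack([A, A[:, 0] * A[:, 1]])

# Out-of-sample R^2 for three models
r2_main = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
r2_inter = (
    LinearRegression()
    .fit(with_interaction(X_train), y_train)
    .score(with_interaction(X_test), y_test)
)
r2_ert = (
    ExtraTreesRegressor(n_estimators=100, random_state=0)
    .fit(X_train, y_train)
    .score(X_test, y_test)
)

print(f"OLS main effects: {r2_main:.2f}")
print(f"OLS + interaction: {r2_inter:.2f}")
print(f"ERT: {r2_ert:.2f}")
```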
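The exact regression specification depends on the predictor set, but generally takes the following form:
\[price_{ijt} = \alpha + X_i \beta + \mu_j + \tau_t + \varepsilon_{ijt}\]
Where:
\(price_{ijt}\) is the sales price of property \(i\) in region \(j\) at time \(t\)
\(\alpha\) is the intercept
\(X_i\) is a predictor set
\(\beta\) is the vector of coefficients on the predictors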
\(\mu_j\) are dummies for Regions
\(\tau_t\) are year-quarter dummies
\(\varepsilon_{ijt}\) is a normally distributed error
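As a rough illustration of a specification with region and year-quarter dummies, the sketch below simulates sales and fits the model with scikit-learn's LinearRegression. The parameter values, the single predictor, and the helper `dummies` are hypothetical, and the first region and first period are dropped as baselines so that the intercept remains identified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, n_regions, n_periods = 1000, 4, 8

x = rng.normal(size=n)                       # one predictor standing in for X_i
region = rng.integers(0, n_regions, size=n)  # region index j
period = rng.integers(0, n_periods, size=n)  # year-quarter index t

# Hypothetical true parameters; the first region and period are baselines (effect 0)
alpha, beta = 10.0, 1.5
mu = np.array([0.0, 0.5, -0.3, 1.0])
tau = np.linspace(0.0, 0.7, n_periods)
price = alpha + beta * x + mu[region] + tau[period] + rng.normal(0, 0.1, size=n)

def dummies(codes, n_levels):
    """One-hot encode integer codes, dropping the first level as the baseline."""
    return np.eye(n_levels)[codes][:, 1:]

design = np.column_stack([x, dummies(region, n_regions), dummies(period, n_periods)])
model = LinearRegression().fit(design, price)

beta_hat = model.coef_[0]            # coefficient on the predictor
mu_hat = model.coef_[1:n_regions]    # region effects relative to the baseline region
print(f"alpha: {model.intercept_:.2f}  beta: {beta_hat:.2f}  mu: {np.round(mu_hat, 2)}")
```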
Extremely Randomized Trees (ERT)
The Extremely Randomized Trees (ERT) algorithm (Geurts et al. 2006) is a close cousin of the Random Forest, a popular machine-learning algorithm.
Similar to a Random Forest, ERTs average predictions of randomized decision trees, and decision trees are built on bootstrapped samples of training data (in our specification). While a Random Forest searches for the “best” split across features and thresholds, ERTs draw random thresholds for each feature and pick whichever happens to be the most discriminative.
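The difference in split selection can be sketched for a single tree node. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function names are hypothetical, every feature is considered at the node, and split quality is measured as the reduction in squared error.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the mean (0 for an empty array)."""
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0

def split_gain(x, y, threshold):
    """Reduction in squared error from splitting the node at x <= threshold."""
    left = x <= threshold
    return sse(y) - sse(y[left]) - sse(y[~left])

def best_split(X, y):
    """Random-Forest-style search: try every distinct threshold of every feature."""
    candidates = [
        (split_gain(X[:, j], y, t), j, t)
        for j in range(X.shape[1])
        for t in np.unique(X[:, j])[:-1]
    ]
    return max(candidates)

def extra_trees_split(X, y, rng):
    """ERT-style search: draw one random threshold per feature, keep the best."""
    candidates = []
    for j in range(X.shape[1]):
        t = rng.uniform(X[:, j].min(), X[:, j].max())
        candidates.append((split_gain(X[:, j], y, t), j, t))
    return max(candidates)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
# Synthetic outcome (assumption): only feature 1 matters, with a step at 0.5
y = np.where(X[:, 1] > 0.5, 10.0, 0.0) + rng.normal(0, 1, size=200)

gain_best, feat_best, thr_best = best_split(X, y)
gain_rand, feat_rand, thr_rand = extra_trees_split(X, y, rng)
print(f"exhaustive: feature {feat_best}, threshold {thr_best:.2f}")
print(f"randomized: feature {feat_rand}, threshold {thr_rand:.2f}")
```

The randomized split can never beat the exhaustive one at a single node, but it is much cheaper to compute, and averaging many such trees reduces the variance that the randomness introduces.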
We used ERTs to generate our first published estimates of PLACES-FMV for CONUS (Nolte 2020, PNAS; article, data).
We use scikit-learn’s ExtraTreesRegressor with the following modifications:
n_estimators=500 to build 500 trees (instead of 100). Larger forests tend to increase accuracy.
bootstrap=True to compute out-of-bag (OOB) predictions, i.e., fair market values for parcels that sold, based only on decision trees that have not seen the parcel in question.
min_samples_leaf=3 to average results and to avoid the publication of actual sales data (to comply with the data licensing agreements).
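The settings above could be used roughly as follows to obtain OOB predictions. The synthetic data is an assumption, and so is `oob_score=True`, which is not listed among the modifications above but is the scikit-learn switch that makes the fitted model store its OOB predictions in `oob_prediction_`.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for sales data (assumption)
X = rng.uniform(0, 1, size=(400, 3))
y = 5.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0, 0.2, size=400)

ert = ExtraTreesRegressor(
    n_estimators=500,
    bootstrap=True,       # each tree sees a bootstrap sample of the training data
    min_samples_leaf=3,
    oob_score=True,       # assumption: needed so that OOB predictions are stored
    random_state=0,
)
ert.fit(X, y)

# For each observation: average over the trees that did not train on it
oob_pred = np.ravel(ert.oob_prediction_)
# For comparison: predictions from all trees, including those that saw the observation
in_sample_pred = ert.predict(X)

oob_mse = np.mean((y - oob_pred) ** 2)
in_sample_mse = np.mean((y - in_sample_pred) ** 2)
print(f"OOB MSE: {oob_mse:.3f}  in-sample MSE: {in_sample_mse:.3f}")
```

The OOB error is typically larger than the in-sample error, which is why OOB predictions give a more honest picture of accuracy for parcels that sold.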
We vary parameters such as min_samples_leaf and n_estimators to study trade-offs between resource usage, privacy, and predictive performance.
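A small sketch of such a parameter sweep, on synthetic data and with only 50 trees per forest to keep it quick (both are assumptions). For each value of min_samples_leaf it records the smallest leaf across all trees, the total number of tree nodes as a crude proxy for resource usage, and the OOB R².

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
y = 5.0 * X[:, 0] + rng.normal(0, 0.2, size=1000)

results = {}
for msl in (3, 10, 20, 50):
    ert = ExtraTreesRegressor(
        n_estimators=50,          # assumption: fewer trees than in our specification
        bootstrap=True,
        min_samples_leaf=msl,
        oob_score=True,
        random_state=0,
    ).fit(X, y)

    # Smallest number of distinct training observations in any leaf of any tree
    smallest_leaf = min(
        tree.tree_.n_node_samples[tree.tree_.children_left == -1].min()
        for tree in ert.estimators_
    )
    # Total node count across the forest (model size)
    n_nodes = sum(tree.tree_.node_count for tree in ert.estimators_)
    results[msl] = (int(smallest_leaf), n_nodes, ert.oob_score_)

for msl, (leaf, nodes, r2) in results.items():
    print(f"min_samples_leaf={msl:>2}  smallest leaf={leaf:>3}  nodes={nodes:>6}  OOB R2={r2:.3f}")
```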
Gradient Boosting Regressor (GBR)
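GBR is also a tree ensemble algorithm, but it “boosts” (builds trees serially, each one improving on the predictions of the previous ones) rather than “bags” (builds trees independently and averages their predictions, as in random forests).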
We use scikit-learn’s GradientBoostingRegressor.
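The serial nature of boosting can be sketched by hand for squared-error loss: each new tree is fitted to the residuals of the current ensemble and added with a small learning rate. This is a simplified illustration on synthetic data, not the internals of GradientBoostingRegressor; the learning rate, tree depth, and number of stages are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = np.sin(6.0 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=500)

learning_rate, n_stages = 0.1, 50

# Stage 0: predict the mean for every observation
prediction = np.full_like(y, y.mean())
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    residuals = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # small corrective step
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {train_mse[0]:.3f} -> {train_mse[-1]:.3f}")
```

Because every stage depends on the residuals left by the previous one, the trees cannot be built in parallel the way bagged trees can.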
Histogram-based Gradient Boosting (HGB)
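HGB is a gradient boosting variant that bins continuous predictors into histograms before growing trees, which promises superior efficiency (faster training and predictions), especially on large datasets.
We use scikit-learn’s HistGradientBoostingRegressor.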
Stacked models
Stacked models use predictions of other models as inputs. We let first-level models make predictions during cross-validation, then use predictions from several different models as independent variables in second-level models.
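A minimal sketch of this two-level setup with scikit-learn's cross_val_predict. The choice of first-level models, the five folds, the linear second-level model, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 3))
y = 5.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0, 0.2, size=600)

first_level = {
    "lm": LinearRegression(),
    "ert": ExtraTreesRegressor(n_estimators=100, min_samples_leaf=3, random_state=0),
    "gbr": GradientBoostingRegressor(max_depth=5, min_samples_leaf=3, random_state=0),
}

# Each observation is predicted by models that were fitted without it
cv = KFold(n_splits=5, shuffle=True, random_state=0)
level_one_preds = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in first_level.values()]
)

# Second level: first-level predictions become the independent variables
second_level = LinearRegression().fit(level_one_preds, y)
stacked_pred = second_level.predict(level_one_preds)

for name, weight in zip(first_level, second_level.coef_):
    print(f"{name}: weight {weight:.2f}")
```

Using cross-validated rather than in-sample first-level predictions keeps the second-level model from simply rewarding whichever first-level model overfits the most.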
Estimator specifications
| | label | constructor |
| ert | Extremely randomized trees | ExtraTreesRegressor(n_estimators=500, bootstrap=True, min_samples_leaf=3) |
| lm | Linear model | LinearRegression() |
| gbr | Gradient boosting | GradientBoostingRegressor(max_depth=5, min_samples_leaf=3) |
| gbr-250-8 | Gradient boosting (250 estimators, depth 8) | GradientBoostingRegressor(n_estimators=250, max_depth=8, min_samples_leaf=3) |
| hgb | Histogram-based gradient boosting | HistGradientBoostingRegressor(min_samples_leaf=3) |
| ert-msl10 | Extremely randomized trees (#min: 10) | ExtraTreesRegressor(n_estimators=500, bootstrap=True, min_samples_leaf=10) |
| ert-msl20 | Extremely randomized trees (#min: 20) | ExtraTreesRegressor(n_estimators=500, bootstrap=True, min_samples_leaf=20) |
| ert-msl50 | Extremely randomized trees (#min: 50) | ExtraTreesRegressor(n_estimators=500, bootstrap=True, min_samples_leaf=50) |
| ert-ne100 | Extremely randomized trees (#est: 100) | ExtraTreesRegressor(n_estimators=100, bootstrap=True, min_samples_leaf=3) |
| ert-ne200 | Extremely randomized trees (#est: 200) | ExtraTreesRegressor(n_estimators=200, bootstrap=True, min_samples_leaf=3) |