Sample Selection Bias

Author: Xinya Hao

Definition

Sample selection bias arises when the data used for statistical analysis are not randomly selected. It stems from a flaw in the sample selection process, where a subset of the data is systematically excluded because of a particular attribute. Excluding that subset can distort the statistical significance of a test and bias the parameter estimates of the statistical model. Survivorship bias is a well-known form of sample selection bias.

Types of Sample Selection Bias

Advertising or Pre-Screening Bias

This occurs when the way participants are pre-screened in a study introduces bias. For example, the language researchers use to advertise for participants can itself introduce bias into the study simply by discouraging or encouraging certain groups of people from volunteering to participate.

Self-Selection Bias

Self-selection bias—also known as volunteer response bias—occurs when the study organizers allow participants to self-select or volunteer to participate. The study organizers relinquish control over who participates to those who decide to volunteer. This may lead people with specific characteristics or opinions to volunteer for a study and thus skew the results.

For example, in a study of the returns to education for women, low-income women may not be observed and are therefore missing from the questionnaire data. When the wage a woman could earn falls below a certain level, she may choose not to work for pay, for instance by becoming a housewife and contributing to the family in another way, so her wage cannot be observed. The survey therefore underrepresents women with low income levels, which can lead to overestimating women’s returns to education.
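
The mechanism in this example can be illustrated with a small simulation. The sketch below is only a toy setup: the variable names (wage_all, kids, u, e) and every coefficient are hypothetical choices, not values from any real study. It generates wages for everyone, hides the wages of women who do not work, and then compares a regression on the full sample with one on the selected sample.

```stata
clear
set seed 12345
set obs 5000

* Hypothetical data-generating process; every number below is an illustrative assumption
gen double educ = 8 + floor(9*runiform())             // years of schooling, 8 to 16
gen double kids = floor(3*runiform())                 // number of children, 0 to 2
gen double u    = rnormal()                           // error in the decision to work
gen double e    = 0.7*u + sqrt(1 - 0.7^2)*rnormal()   // wage error, corr(u, e) = 0.7

gen double wage_all = 5 + 1*educ + 4*e                // wage every woman would earn
gen byte   work     = (-2 + 0.25*educ - 0.5*kids + u > 0)
gen double wage     = wage_all if work == 1           // wage is missing for non-workers

regress wage_all educ                  // benchmark: full sample, unobservable in practice
regress wage educ                      // selected sample only
heckman wage educ, select(educ kids)   // selection model; kids is excluded from the wage equation
```

Because the two error terms are correlated, the coefficient on educ in the selected-sample regression will generally differ from the benchmark. The direction and size of the gap depend on the assumed parameters.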

Exclusion and Undercoverage Bias

Exclusion bias occurs when specific members of a population are excluded from participating in a study. Undercoverage bias occurs when study organizers create a study that does not adequately represent some members of the population. For example, the participants of a study are selected from certain areas only while other areas are not represented in the sample.

Heckman Correction

The idea of the Heckman Correction is to use a selection equation to model the probability that the dependent variable of a regression equation is observed, and then to correct the regression for that selection. The correction can be done in a two-stage process.

  • Stage One. Estimate the Selection Equation.
  • Between the stages: calculate the Inverse Mills Ratio (IMR) from the Stage One estimates and include it as a control variable in the Stage Two estimation.
  • Stage Two. Estimate the Regression Equation.
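
The two stages can be written out as a pair of equations. The notation below is an assumption introduced here for illustration (it is not taken from the original text): s is the 0/1 selection indicator, z the selection-equation regressors, x the regression-equation regressors, and the errors (u, ε) are assumed jointly normal with Var(u) = 1, Var(ε) = σ_ε² and correlation ρ.

```latex
% requires amsmath and amssymb
\begin{align*}
s_i^{*} &= z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\,s_i^{*} > 0\,]
  && \text{(selection equation)} \\
y_i &= x_i'\beta + \varepsilon_i, \qquad y_i \text{ observed only if } s_i = 1
  && \text{(regression equation)} \\
\mathbb{E}[\,y_i \mid x_i, z_i, s_i = 1\,] &= x_i'\beta + \rho\,\sigma_{\varepsilon}\,\lambda(z_i'\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
  && \text{(selected sample)}
\end{align*}
```

Here φ and Φ are the standard normal density and distribution function, and λ(·) is the Inverse Mills Ratio. Under these assumptions, OLS on the selected sample omits the term ρσ_ε λ(z'γ), and adding the estimated IMR as a regressor restores it.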

Important Notes:

  1. The dependent variable of the Selection Equation is a 0/1 variable indicating whether the dependent variable of the Regression Equation is observed. So the Heckman Correction can only be done if the observations with an unobserved dependent variable are included in the data.
  2. The Selection Equation should include all control variables in the Regression Equation plus at least one exogenous variable that predicts whether the dependent variable is observed but is not correlated with the dependent variable's magnitude. The exogeneity of this variable should be clearly justified.
  3. The Selection Equation should be estimated with a probit model rather than a logit model, because the Heckman Correction relies on the assumption that the error term in the selection equation follows a normal distribution. Consequently, with panel data a fixed-effects (FE) model cannot be used in the first-stage estimation; only a random-effects (RE) model is viable.
  4. We use all observations to estimate the Selection Equation, including those with a missing dependent variable. We use only the observations with a non-missing dependent variable to estimate the Regression Equation. This feature helps to distinguish the Heckman Correction intuitively from the IV method.
  5. It is okay if the IMR in the second stage is not statistically significant. This suggests that the sample selection bias is minor, and the regression can be presented as a robustness check.
  6. The “Two-stage Approach” is inefficient because the estimation error from the first stage carries over into the second-stage estimation. Maximum Likelihood Estimation (MLE), which estimates both equations jointly, addresses this problem but is more computationally demanding. The “Two-stage Approach” nevertheless remains very popular in empirical studies.
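
The relevance half of Note 2 can be checked directly in the first stage. The sketch below assumes the same womenwk example data and the same choice of married and children as the excluded variables. It only tests whether they predict selection; it says nothing about whether they are truly exogenous, which has to be argued rather than tested.

```stata
webuse womenwk, clear

* Selection indicator: 1 if the wage is observed, 0 otherwise
gen work = !missing(wage)

* First stage: probit with the regression controls plus the excluded variables
probit work married children educ age

* Joint Wald test that the excluded variables have no effect on selection
test married children
```

A rejection of the joint null suggests that the excluded variables do help predict whether the wage is observed.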

Heckman Correction Example in Stata

webuse womenwk.dta, clear

*** OLS estimation
//summary statistics
sum age educ married children wage

// OLS estimation
reg wage educ age

*** heckman two-step, step-by-step (manually)
// estimate selection equation.
// married and children are exogenous variables here.
gen work = (wage != .)
probit work married children educ age
est store First

// calculate imr
predict y_hat, xb
gen pdf = normalden(y_hat)
gen cdf = normal(y_hat)
gen imr = pdf/cdf

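// estimate regression equation with imr (selected sample only)
// note: these OLS standard errors ignore that imr is itself estimated;
// the built-in heckman command below reports corrected standard errors
reg wage educ age imr if work == 1
est store Second
// check collinearity between imr and the other regressors
vif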

*** heckman two-step, all-in-one
heckman wage educ age, select(married children educ age) twostep
est store Heck2s

*** heckman MLE
// heckman runs MLE by default
heckman wage educ age, select(married children educ age)
est store HeckMLE
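
As a sanity check on the manual calculation, the hand-computed IMR can be compared with the one Stata generates itself. The sketch below assumes the mills() option of heckman, which saves the nonselection hazard as a new variable. The names xb_sel, imr_manual, imr_builtin and imr_diff are hypothetical and chosen here for illustration.

```stata
webuse womenwk, clear

* Manual IMR from the first-stage probit
gen work = !missing(wage)
probit work married children educ age
predict double xb_sel, xb
gen double imr_manual = normalden(xb_sel) / normal(xb_sel)

* Built-in two-step estimator, saving its own IMR
heckman wage educ age, select(married children educ age) twostep mills(imr_builtin)

* The absolute difference should be close to zero if the manual steps are right
gen double imr_diff = abs(imr_manual - imr_builtin)
summarize imr_diff
```

If the differences are not negligible, the manual first stage probably uses a different specification or sample than the select() equation.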

References

Stata manual entry for the Heckman selection model (heckman)