A DID-based Machine Learning Method of Estimating CATE


Research Question

Given a delivery capacity constraint, which customers should be prioritized for home delivery?

  • From a revenue-management perspective, one criterion is how responsive a customer's purchasing behavior is to the delivery service.
  • Along this line of thinking, we should estimate the individual-level treatment effect (or conditional average treatment effect, CATE) of last-mile delivery: $\delta(x) = \mathbb{E}\left[ Y(1) - Y(0) \mid X=x \right]$. Since the treatment variable in the data could be continuous, we are also interested in the following object: $\delta(x,d) = \mathbb{E}\left[ Y(d) - Y(0) \mid X=x, D=d \right]$, where $d$ is the continuous treatment intensity (for example, the percentage of orders delivered to the customer's home).

A DID-based Framework

The standard DiD model in panel regression form is $Y_{it} = \alpha_i + \tau_t + \delta D_{it} + \epsilon_{it}$. However, this linear model may not be sufficiently flexible for estimating highly nonlinear CATEs. Instead, we assume $Y_{it} = \alpha_i + \tau_t(\tilde X_i) + \delta(X_i, D_{it}) + \epsilon_{it}$, where

  • we allow the treatment effect to depend on a set of individual-level features $X_i$
  • we allow the features and the treatment to interact nonlinearly
  • we allow the time trend to be individual-specific through $\tilde X_i$ (which can be a subset of $X_i$). This configuration is more robust because the parallel-trends assumption often fails in practice.
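To make the model concrete, here is a minimal simulation sketch that draws a two-period panel from it. The functional forms of `tau` and `delta`, the single feature with $\tilde X_i = X_i$, and the 50/50 treatment assignment are all assumptions made up for illustration; none of them come from the data discussed in this post.

```python
import math
import random


def tau(t, x_tilde):
    """Hypothetical feature-dependent time trend tau_t(x_tilde)."""
    return t * (1.0 + 2.0 * x_tilde)


def delta(x, d):
    """Hypothetical nonlinear treatment effect; delta(x, 0) = 0 by construction."""
    return d * (1.0 + math.sin(3.0 * x))


def simulate_panel(n_units, noise_sd=0.1, seed=0):
    """Draw Y_i0 and Y_i1 from Y_it = alpha_i + tau_t(x_i) + delta(x_i, D_it) + eps_it."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_units):
        x = rng.uniform(0.0, 1.0)      # single feature; here x_tilde = x
        alpha = rng.gauss(0.0, 1.0)    # unit fixed effect
        treated = rng.random() < 0.5
        # Treatment intensity in period 1; nobody is treated in period 0.
        d = rng.uniform(0.1, 1.0) if treated else 0.0
        y0 = alpha + tau(0, x) + delta(x, 0.0) + rng.gauss(0.0, noise_sd)
        y1 = alpha + tau(1, x) + delta(x, d) + rng.gauss(0.0, noise_sd)
        rows.append({"unit": i, "x": x, "d": d, "y0": y0, "y1": y1})
    return rows


for row in simulate_panel(5):
    print(row)
```

Because the trend depends on `x`, treated and control units with different features would drift apart even without any treatment, which is the situation the flexible trend term is meant to handle.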

More on the Model

We assume there are two periods, $t \in \{0, 1\}$.

  • For the treated group, we have $\Delta Y^1_i(d) = Y_{i1}(d) - Y_{i0}(0) = \Delta\tau(\tilde X_i) + \delta(X_i,d) + \Delta\epsilon_i$, where $\Delta\tau(\tilde X_i) = \tau_1(\tilde X_i) - \tau_0(\tilde X_i)$ and $\Delta\epsilon_i = \epsilon_{i1} - \epsilon_{i0}$.
  • For the control group, we have $\Delta Y^0_i = Y_{i1}(0) - Y_{i0}(0) = \Delta\tau(\tilde X_i) + \delta(X_i,0) + \Delta\epsilon_i = \Delta\tau(\tilde X_i) + \Delta\epsilon_i$, where it is natural to assume $\delta(X_i,0) = 0$.

Thus, provided $\mathbb{E}[\Delta\epsilon_i \mid X_i, D_i] = 0$, we have $\mathbb{E}[Y_1(d) - Y_0(0) \mid X=x, D=d] = \Delta\tau(\tilde x) + \delta(x,d)$ and $\mathbb{E}[Y_1(0) - Y_0(0) \mid X=x, D=0] = \Delta\tau(\tilde x)$.

Taking the difference in differences, we can eliminate the individual-specific time trends and get the individual-level treatment effects: $\mathbb{E}[Y_1(d) - Y_0(0) \mid X=x, D=d] - \mathbb{E}[Y_1(0) - Y_0(0) \mid X=x, D=0] = \delta(x,d)$.
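A quick numeric check of this identity may help. The sketch below plugs assumed (purely illustrative) forms for $\tau_t$ and $\delta$ into the two conditional expectations and compares a naive before-after difference with the difference in differences at one hypothetical point $(x, d)$.

```python
import math


def tau(t, x_tilde):
    """Hypothetical time trend tau_t(x_tilde), chosen only for illustration."""
    return t * (1.0 + 2.0 * x_tilde)


def delta(x, d):
    """Hypothetical treatment effect with delta(x, 0) = 0."""
    return d * (1.0 + math.sin(3.0 * x))


def expected_change(x, d):
    """E[Y_1(d) - Y_0(0) | X=x, D=d] under the model, with mean-zero noise."""
    return (tau(1, x) - tau(0, x)) + delta(x, d)


x, d = 0.4, 0.7  # hypothetical customer features and treatment intensity

before_after = expected_change(x, d)                   # still contains the trend
did = expected_change(x, d) - expected_change(x, 0.0)  # trend cancels

print("true effect :", delta(x, d))
print("before-after:", before_after)
print("DiD         :", did)
```

The before-after number is off by exactly $\Delta\tau(\tilde x)$, while the DiD number matches the true effect, because the control expectation is evaluated at the same $\tilde x$.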

Then we can estimate $\delta(x,d)$ with any machine learning method.

Estimation algorithm

  1. Take first differences to eliminate the individual-level fixed effect: $\Delta Y^1_i = Y_{i1}(d) - Y_{i0}(0)$ and $\Delta Y^0_i = Y_{i1}(0) - Y_{i0}(0)$.
  2. Match the treated units with the control units based on the features $\tilde X_i$. For example, one can use KNN matching.
  3. Calculate the difference in differences using the matched sample: $\Delta Y^1_i - \widehat{\Delta Y^0_i}$.
  4. Regress $\Delta Y^1_i - \widehat{\Delta Y^0_i}$ on $(x,d)$ to estimate the conditional average treatment effect $\delta(x,d)$. This step can be performed with any machine learning method that has more predictive power than linear regression.
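The four steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the data are simulated from a made-up trend and a made-up effect `true_delta`, there is a single feature with $\tilde X_i = X_i$, and a simple KNN averager (`knn_mean`, a hypothetical helper) stands in both for the matching step and for the final "any machine learning method" regression. In practice one would likely swap in a library learner for step 4.

```python
import math
import random


def true_delta(x, d):
    # Hypothetical ground truth, used only to generate and check the data.
    return d * (1.0 + math.sin(3.0 * x))


def simulate(n, seed=0, noise_sd=0.05):
    """Simulated two-period panel: tuples (x, d, y0, y1)."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        alpha = rng.gauss(0.0, 1.0)  # unit fixed effect
        d = rng.uniform(0.1, 1.0) if rng.random() < 0.5 else 0.0
        y0 = alpha + rng.gauss(0.0, noise_sd)  # tau_0 = 0, no treatment yet
        y1 = alpha + (1.0 + 2.0 * x) + true_delta(x, d) + rng.gauss(0.0, noise_sd)
        units.append((x, d, y0, y1))
    return units


def knn_mean(query, points, values, k):
    """Average `values` over the k points closest to `query` (Euclidean)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(query, points[j]))
    nearest = order[:k]
    return sum(values[j] for j in nearest) / len(nearest)


def fit_cate(units, k_match=5, k_reg=15):
    # Step 1: first differences remove the unit fixed effect alpha_i.
    treated = [(x, d, y1 - y0) for x, d, y0, y1 in units if d > 0]
    control = [(x, y1 - y0) for x, d, y0, y1 in units if d == 0]
    c_pts = [(x,) for x, _ in control]
    c_dy = [dy for _, dy in control]
    # Steps 2-3: match each treated unit to its k nearest controls on x_tilde
    # (here x_tilde = x) and subtract the matched controls' mean change.
    feats, did = [], []
    for x, d, dy in treated:
        feats.append((x, d))
        did.append(dy - knn_mean((x,), c_pts, c_dy, k_match))
    # Step 4: regress the DiD outcome on (x, d) with a KNN regressor.
    return lambda x, d: knn_mean((x, d), feats, did, k_reg)


delta_hat = fit_cate(simulate(2000))
for x, d in [(0.2, 0.3), (0.5, 0.8), (0.8, 0.5)]:
    print(f"x={x}, d={d}: estimate={delta_hat(x, d):.3f}, truth={true_delta(x, d):.3f}")
```

The neighbor counts `k_match` and `k_reg` are arbitrary choices here; with real data they would need tuning, and matching on a higher-dimensional $\tilde X_i$ would typically call for feature scaling first.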