# Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Trade

Updated: May 4, 2021

Authors: Samarth Chandra, Rupesh

Time series is widely used all around the world in real-time applications. However, unexpected events for example broken sensors or lost signal values may cause missing values in time. Missing time series data is a common practical problem persisting with the use of time series calculations. Many analysis methods require missing values to be replaced with some meaningful values. In Statistics, the process of replacing and arranging missing values is called as Imputation. Unfortunately, this practice may lead to a look-ahead bias in future which can lead to inaccurate results. By connecting layers through time we propose a Bayesian posterior consensus distribution with optimal control over the variance and the loo-ahead bias trade off In the imputation.

Missing data is a common problem when dealing with data collection from large datasets. It commonly prevails in all the areas that are in touch with Data Science. Understanding the effect of imputation of time-series data is hardly studied about. This is exactly why we wrote this paper.

The novel Bayesian approach interpolates between global data and local observed data to optimize the look-ahead bias trade off. This is of particular importance with respect to Finance, this method can be applied in portfolio optimization techniques as a leading example.

Every imputation method involves extracting as much information as possible from the globally observed data to infer the missing data but problem with in time-series settings. Portfolio selection procedures are designed to exploit subtle signals that can lead to favorable future investment outcomes.

**Problem with imputing missing data using all information**

An out-of-sample portfolio selection procedure on the imputed data will inherently suffer from a look-ahead-bias. This can be avoided by only using past data for imputation and hence avoid “looking into the future”. Therefore more data is added to imputation procedure. But this approach ignores valuable information that could be used to improve the error in the imputation procedure.

**Bayesian Approach**

The Bayesian imputation framework can be used in order to precisely define look-ahead-bias as deviation from a specific baseline posterior distribution. The observed entries are assumed as realizations. The model is endowed with a prior distribution for the underlying parameters and it generates a corresponding posterior both for the parameters and the missing observations conditional on the observed data.

The full data imputation (**with look-ahead-bias and small variance**) and conservative imputation (**without look-ahead-bias and with large variance**) are both special cases of the Bayesian approach. Each of them corresponds to a different posterior.

The posterior computed by fitting the Bayesian model only on the training set (which we call baseline) is assumed to be look-ahead-bias-free, and the look-ahead-bias incurred by any other posterior is defined as the bias relative to the baseline posterior.

**Imputation Methods**

**Imputation Methods**

**Single imputation:**

Single imputation refers to the use of a specific number (i.e., a best guess) in place of each missing value, which includes for example, k-nearest neighbor, mean/median imputation, smoothing, interpolation and spline.

**Multiple Imputation :**

Multiple imputation is a statistical procedure for handling missing data in a study with the aim of reducing the bias, and complications, that missing data can cause. Multiple imputation involves creation of multiple datasets where the missing data are imputed with more realistic values as compared to the non-missing data, allowing for the uncertainty around what the real value might be by imputing data randomly from a distribution

The flexibility and predictive performance of multiple imputation method have been successfully demonstrated in a variety of data sets, such as Industry and Occupation Codes , GDP in African nation , and concentrations of pollutants in the Arctic.

**The Bias-Variance Multiple Imputation Framework**

Let us assume a universe of n assets over time period **T**. **Zt ∈ Rn** is the random vector representing the joint return of assets at a particular time **t.** The random vector **Mt ∈ {0, 1} n** indicates the pattern of the missing values at time **t ,∀ i ∈ [n]** :

** (Mt)i = { 1 if the i-th component of Zt is missing,**

** ** **0 if the i-th component of Zt is observed. }**

FIGURE 1

FIGURE 2

**The Bayesian approach:**

Let us assume the vector **θ ∈ R n **is Bayesian parameter with a prior distribution **π0** and **(Yt)t∈[T]** is Bayesian parameters with appropriate priors. **D** are the missing entries on dataset, our strategy is to compute the distribution of **(Yt)t∈[T]** conditional on **D**, and then generating multiple imputations of **(Yt)t∈[T]**. For computing the posterior of **(Yt)t∈[T] |D**, we first need to calculate a posterior distribution of the unknown mean vector θ conditional on **D**, and then the distribution of **(Yt)t∈[T] |D** is obtained by marginalizing out θ from the distribution of **((Yt)t∈[T] |(θ, D)**.

To generate multiply-imputed versions of the data set, following assumptions are needed:

**Assumption 1:**

We observe at least one non-missing value for each row of D in the training section of **D**.

**Assumption 2:**

The missing pattern satisfies the ignorability assumption, namely, the probability of **(Mt)i = 1 ∀i ∈ [n] ∀t ∈ [T] **does not depend on **(Yt)t∈[T] or θ**, conditional on **(Xt)t∈[T].**

**Assumption 3:**

The noise **εt ∀t∈[T]** is independently and identically distributed as **Nn(0, Ω)**, where **Ω ∈ Sn++** is a known covariance matrix. The prior **π0** is non-informative flat prior or Gaussian with mean vector µ0 ∈ Rn and covariance matrix **Σ0 ∈ Sn++**. Applying projection operator on **Ω **we get** ΩYt **.The priors on** Yt** conditional on** θ** are independent across t, and the conditional distribution is Gaussian **Ndim(Yt)(θYt,ΩYt ) ∀t∈[T]**, where **θYt = PMt(θ).**

**BVMI framework**

**BVMI(Bias-Variance Multiple Imputation)** framework is designed to mitigate the look-ahead-bias by implementing these modules:

**(G)**The posterior Generator

**(C) **The Consensus mechanism

**(S) **The multiple imputation Sampler

**Posterior Generator **

Suppose **T1 = Ttrain **and **T2 = T; Tk**, **k = 1**, 2 is denoted as **Dk = {(Xt, Mt), t ∈ [Tk]}**, so that **D1** coincides with the training data set while** D2 **coincides with the whole data set D.The following theorem shows that **πk** are Gaussians under Assumptions 1 to 3 to generate multiply-imputed versions of the data set.

Theorem : Under Assumptions 1 to 3, the posterior distribution of **θ|Dk** is governed by **πk ∼ Nn(µk, Σk), k = 1, 2, **

where if π0 is non-informative, then

**Σk = (Σt∈[Tk] (P⊥Mt)−1(Ω-1Xt) )−1 **

** µk = Σk (Σt∈[Tk] (P⊥Mt)−1(Ω−1Xt)(P⊥Mt)−1(Xt)) **

and if π0 is Nn(µ0, Σ0), then

**Σk = Σ0-1 + Σt∈[Tk] ] (P⊥Mt)−1(Ω−1Xt) )−1 ,**

** µk = Σk (Σ0-1µ0+ Σt∈[Tk] (P⊥Mt)−1(Ω−1Xt)(P⊥Mt)−1(Xt) ), (3)**

where **(P⊥Mt)−1** is the inverse map of **(P⊥Mt)** obtained by filling entries with zeros

**Multiple Imputation Sampler **

Under Assumption 3, the distribution of **Yt** conditional on θ and **(Xt, Mt)t∈[T]** is independent across **t**, and

**Yt|θ,(Xt, Mt)t∈[T] ∼ Ndim(Yt)(θ~t, Ω˜t )**

** where θ~t = θYt + ΩYtXtΩ−1Xt(Xt − θXt) and Ω˜t = ΩYt − ΩYtXtΩ−1XtΩXtYt**

The above stated proposition indicates that the distribution of Yt|D coincide with the distribution of the random vector **ξt ∈ Rdim(Yt)** dictated by

**ξt = Atθ + bt + ηt, θ∼π*, ηt∼Ndim(Yt)(0,St), θ╨ηt**

At,** bt** and **St** that are designed to match the mean vector and the covariance matrix of the Gaussian distribution.The posterior covariance of **θ** is inverse-proportional to the time-dimension.

**Bias-Variance Targeted Consensus Mechanism **

__Algorithm : __

**Conditional Expectation Bayesian Imputation:**

Input: aggregated posterior π* , covariance matrix Ω, data set D = {(Xt, Mt) : t ∈ [T]},

```
Sample θ from π*
for t = 1, . . . , T do
Compute At, bt for model (5) using (Xt, Mt, Ω) Impute Yt by Atθ + bt
end for
Output: An imputed data set
```

The consensus mechanism module receives two posteriors π1, π2 from the posterior generator module (G), then it synthesizes a unique posterior π* and transmits π* to the sampler module.

**Wasserstein distance:**

For p ≥ 1 the p-Wasserstein distance for two probability measures π1, π2 ∈ M is

**Wp(π1, π2) = _x0012_infπ∈Π(π1,π2) p2dπ(x, y) 1/p . **where Π(π1, π2) is the set of all probability measures on R2n that have marginals π1 and π2.

**Wasserstein Consensus mechanism:**

The Wasserstein consensus mechanism Cλ : M2 → M parametrized by λ outputs a unifying posterior π* as

**Cλ({πk}) = arg minπ∈M λ1W22(π,π1) + λ2W22(π,π2)**

**Minimal variance posterior:**

For a fixed input {πk}, the aggregated posterior π* obtained by π* = Cλ* ({πk}) for an Cλ* ∈ C has minimal variance if 1T >Varπ* [θ] ≤ 1T >VarCλ({πk})[θ] ∀Cλ ∈ C, where 1 is the vector of ones

**(µ, δ)-bias tolerable posterior:**

For mean µ ∈ Rn and a tolerance δ ∈ R+. A posterior π* is said to be **(µ, δ)-**bias tolerable if **||Eπ*[θ] − µ||2 ≤ δ, δ**is the tolerance parameter which indicates how much bias is accepted in π*(aggregated posterior).

**Optimal consensus mechanism**

For a given input {πk}, the consensus mechanism Cλ* ∈ C is optimal for {πk} if λ* solves

min **1T >VarCλ({πk})[θ], **

**Such that λ ∈ ∆, **

** ||ECλ({πk})[θ] − Eπ1 [θ]||2 ≤ δ**

**Consensus mechanism in C(Cλ*):**

Let us suppose πk ∼ N (µk, Σk) with Σk 0 for k = 1, 2 and λ* be the optimal solution

min Tr(Σ1)λ21 + Tr(Σ2)λ22 + 2Tr(Σ1Φ)λ1λ2

Such that λ ∈ ∆

λ2||µ1 − µ2||2 ≤ δ,

where Φ = (Σ1/22 _x0010_ Σ1/22 Σ1Σ1/22 )−1/2 Σ1/22

**Coding Implementation**

**Coding Implementation**

```
import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
```

# Load data and introduce missing values

```
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris['target'] = iris['target'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)
Imputing a Single Dataset
If you only want to create a single imputed dataset, you can use KernelDataSet:
# Create kernel.
kds = mf.KernelDataSet(
iris_amp,
save_all_iterations=True,
random_state=1991
)
```

# Run the MICE algorithm for 3 iterations

`kds.mice(3)`

# Return the completed kernel data

`completed_data = kds.complete_data()`

There are also an array of plotting functions available, these are discussed below in the section Diagnostic Plotting. The plotting behavior between single imputed datasets and multi-imputed datasets is slightly different.

**Simple Example of Multiple Imputation**

We can also create a class which contains multiple KernelDataSets, along with easy ways to compare them:

# Create kernel.

```
kernel = mf.MultipleImputedKernel(
iris_amp,
datasets=4,
save_all_iterations=True,
random_state=1991
)
```

# Run the MICE algorithm for 3 iterations on each of the datasets

`kernel.mice(3)`

Printing the MultipleImputedKernel object will tell you some high level information:

```
print(kernel)
## Class: MultipleImputedKernel
## Models Saved: Last Iteration
## Datasets: 4
## Iterations: 3
## Imputed Variables: 5
## save_all_iterations: True
```

**Controlling Tree Growth**

A very nice thing about random forests is that they are trivially parallelizable. We can save a lot of time by setting the n_jobs parameter in both the fit and predict methods for the random forests:

# Run the MICE algorithm for 2 more iterations on the kernel,

`kernel.mice(2,n_jobs=2)`

Any other arguments may be passed to either class **(RandomForestClassifier,RandomForestRegressor)**. In our example, we may not have saved much (if any) time. This is because there is overhead with using multiple cores, and our data is very small.

**Creating a Custom Imputation Schema**

It is possible to customize our imputation procedure by variable. By passing a named list to variable_schema, you can specify the predictors for each variable to impute. You can also select which variables should be imputed using mean matching, as well as the mean matching candidates, by passing a dict tomean_match_candidates:

```
var_sch = {
'sepal width (cm)': ['target','petal width (cm)'],
'petal width (cm)': ['target','sepal length (cm)']
}
var_mmc = {
'sepal width (cm)': 5,
'petal width (cm)': 0
}
cust_kernel = mf.MultipleImputedKernel(
iris_amp,
datasets=3,
variable_schema=var_sch,
mean_match_candidates=var_mmc
)
cust_kernel.mice(2)
```

**Imputing New Data with Existing Models**

Multiple Imputation can take a long time. If you wish to impute a dataset using the MICE algorithm, but don’t have time to train new models, it is possible to impute new datasets using a MultipleImputedKernel object. The impute_new_data() function uses the random forests collected by MultipleImputedKernel to perform multiple imputation without updating the random forest at each iteration:

# Our 'new data' is just the first 15 rows of iris_amp

```
new_data = iris_amp.iloc[range(15)]
new_data_imputed = kernel.impute_new_data(new_data=new_data)
print(new_data_imputed)
## Class: MultipleImputedDataSet
## Datasets: 4
## Iterations: 5
## Imputed Variables: 5
## save_all_iterations: False
```

All of the imputation parameters (variable_schema, mean_match_candidates, etc) will be carried over from the original MultipleImputedKernel object. When mean matching, the candidate values are pulled from the original kernel dataset. To impute new data, the save_models parameter in MultipleImputedKernel must be > 0. If save_models == 1, the model from the latest iteration is saved for each variable. If save_models > 1, the model from each iteration is saved. This allows for new data to be imputed in a more similar fashion to the original mice procedure.

**Diagnostic Plotting**

As of now, miceforest has four diagnostic plots available.

**Distribution of Imputed-Values**

We probably want to know how the imputed values are distributed. We can plot the original distribution beside the imputed distributions in each dataset by using the plot_imputed_distributions method of an MultipleImputedKernel object:

`kernel.plot_imputed_distributions(wspace=0.3,hspace=0.3)`

The red line is the original data, and each black line are the imputed values of each dataset.

**Convergence of Correlation**

We are probably interested in knowing how our values between datasets converged over the iterations. The plot_correlations method shows you a boxplot of the correlations between imputed values in every combination of datasets, at each iteration. This allows you to see how correlated the imputations are between datasets, as well as the convergence over iterations:

`kernel.plot_correlations()`

**Variable Importance**

We also may be interested in which variables were used to impute each variable. We can plot this information by using the plot_feature_importance method.

`kernel.plot_feature_importance(annot=True,cmap="YlGnBu",vmin=0, vmax=1)`

The numbers shown are returned from the sklearn random forest _feature_importance attribute. Each square represents the importance of the column variable in imputing the row variable.

**Mean Convergence**

If our data is not missing completely at random, we may see that it takes a few iterations for our models to get the distribution of imputations right. We can plot the average value of our imputations to see if this is occurring:

`kernel.plot_mean_convergence(wspace=0.3, hspace=0.4)`

Our data was missing completely at random, so we don’t see any convergence occurring here.

**Using the Imputed Data**

To return the imputed data simply use the complete_data method:

`dataset_1 = kernel.complete_data(0)`

This will return a single specified dataset. Multiple datasets are typically created so that some measure of confidence around each prediction can be created. Since we know what the original data looked like, we can cheat and see how well the imputations compare to the original data:

```
acclist = []
for iteration in range(kernel.iteration_count()+1):
target_na_count = kernel.na_counts['target']
compdat = kernel.complete_data(dataset=0,iteration=iteration)
# Record the accuract of the imputations of target.
acclist.append(
round(1-sum(compdat['target'] != iris['target'])/target_na_count,2)
)
# acclist shows the accuracy of the imputations
# over the iterations.
print(acclist)
## [0.32, 0.76, 0.78, 0.81, 0.86, 0.86]
```

In this instance, we went from a **~32% accuracy** (which is expected with random sampling) to an **accuracy of ~86%**. We managed to replace the missing target values with a pretty high degree of accuracy!

**The MICE Algorithm**

Multiple Imputation by Chained Equations ‘fills in’ (imputes) missing data in a dataset through an iterative series of predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables in the dataset. These iterations should be run until it appears that convergence has been met.

This process is continued until all specified variables have been imputed. Additional iterations can be run if it appears that the average imputed values have not converged, although no more than 5 iterations are usually necessary.

Common Use Cases

**Data Leakage:**

MICE is particularly useful if missing values are associated with the target variable in a way that introduces leakage. For instance, let’s say you wanted to model customer retention at the time of sign up. A certain variable is collected at sign up or 1 month after sign up. The absence of that variable is a data leak, since it tells you that the customer did not retain for 1 month.

**Funnel Analysis:**

Information is often collected at different stages of a ‘funnel’. MICE can be used to make educated guesses about the characteristics of entities at different points in a funnel.

**Confidence Intervals:**

MICE can be used to impute missing values, however it is important to keep in mind that these imputed values are a prediction. Creating multiple datasets with different imputed values allows you to do two types of inference:

Imputed Value Distribution: A profile can be built for each imputed value, allowing you to make statements about the likely distribution of that value.

Model Prediction Distribution: With multiple datasets, you can build multiple models and create a distribution of predictions for each sample. Those samples with imputed values which were not able to be imputed with much confidence would have a larger variance in their predictions.

**Predictive Mean Matching**

Miceforest can make use of a procedure called predictive mean matching (PMM) to select which values are imputed. PMM involves selecting a datapoint from the original, nonmissing data which has a predicted value close to the predicted value of the missing sample. The closest N (mean_match_candidates parameter) values are chosen as candidates, from which a value is chosen at random. This can be specified on a column-by-column basis. Going into more detail from our example above, we see how this works in practice:

This method is very useful if you have a variable which needs imputing which has any of the following characteristics:

Multimodal

Integer

Skewed

**Effects of Mean Matching**

As an example, let’s construct a dataset with some of the above characteristics:

```
randst = np.random.RandomState(1991)
# random uniform variable
nrws = 1000
uniform_vec = randst.uniform(size=nrws)
def make_bimodal(mean1,mean2,size):
bimodal_1 = randst.normal(size=nrws, loc=mean1)
bimodal_2 = randst.normal(size=nrws, loc=mean2)
bimdvec = []
for i in range(size):
bimdvec.append(randst.choice([bimodal_1[i], bimodal_2[i]]))
return np.array(bimdvec)
# Make 2 Bimodal Variables
close_bimodal_vec = make_bimodal(2,-2,nrws)
far_bimodal_vec = make_bimodal(3,-3,nrws)
# Highly skewed variable correlated with Uniform_Variable
skewed_vec = np.exp(uniform_vec*randst.uniform(size=nrws)*3) + randst.uniform(size=nrws)*3
# Integer variable correlated with Close_Bimodal_Variable and Uniform_Variable
integer_vec = np.round(uniform_vec + close_bimodal_vec/3 + randst.uniform(size=nrws)*2)
# Make a DataFrame
dat = pd.DataFrame(
{
'uniform_var':uniform_vec,
'close_bimodal_var':close_bimodal_vec,
'far_bimodal_var':far_bimodal_vec,
'skewed_var':skewed_vec,
'integer_var':integer_vec
}
)
# Ampute the data.
ampdat = mf.ampute_data(dat,perc=0.25,random_state=randst)
# Plot the original data
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.PairGrid(dat)
g.map(plt.scatter,s=5)
```

We can see how our variables are distributed and correlated in the graph above. Now let’s run our imputation process twice, once using mean matching, and once using the model prediction.

kernelmeanmatch <- mf.MultipleImputedKernel(ampdat,mean_match_candidates=5)

kernelmodeloutput <- mf.MultipleImputedKernel(ampdat,mean_match_candidates=0)

kernelmeanmatch.mice(5)

kernelmodeloutput.mice(5)

Let’s look at the effect on the different variables.

**With Mean Matching**

`kernelmeanmatch.plot_imputed_distributions(wspace=0.2,hspace=0.4)`

**Without Mean Matching**

`kernelmodeloutput.plot_imputed_distributions(wspace=0.2,hspace=0.4)`

You can see the effects that mean matching has, depending on the distribution of the data. Simply returning the value from the model prediction, while it may provide a better ‘fit’, will not provide imputations with a similair distribution to the original. This may be beneficial, depending on your goal.

**References:**

https://paperswithcode.com/paper/time-series-imputation-with-wasserstein

https://www.quanticate.com/blog/bid/106726/is-multiple-imputation-in-clinical-trials-worth-the-effort

https://github.com/AnotherSamWilson/miceforest