# Fold - Core Walkthrough

**Welcome 👋**

In this notebook we'll demonstrate `fold`'s powerful interface for creating, training, and cross-validating (or backtesting, if you prefer) simple and *composite* models/pipelines.

We will use the dataset from an energy residual load forecasting challenge hosted on Kaggle.

**By the end you will know how to:**
- Create a simple model and an ensemble (composite) model
- Train multiple models / pipelines over time
- Analyze the models' simulated past performance

Let's start by installing:
- `fold`
- `fold-wrappers`: optional, this will be required later for third-party models. Wraps e.g. `XGBoost` or `StatsForecast` models to be used with `fold`.
- `krisi`: optional. Dream Faster's time-series evaluation library to quickly get results.

## Installing libraries

```
%%capture
pip install --quiet fold-core fold-wrappers fold-models krisi matplotlib seaborn xgboost plotly prophet statsforecast statsmodels ray kaleido
```

## Data Loading and Exploration

Let's load in the data and do minimal exploration of the structure of the data.

`fold` has a useful utility function that loads example data from our `datasets` GitHub repo.

- We are forecasting `residual_load`‡.
- We will shorten the dataset to `4000` rows so we have a speedier demonstration.

‡ *The difference between the load in the network and the P that the industrial complex is producing.*

```
from fold.utils.dataset import get_preprocessed_dataset
from statsmodels.graphics.tsaplots import plot_acf
from krisi import score, compare
from krisi.report import plot_y_predictions
import plotly.io as pio

# Render plotly figures as static images
pio.renderers.default = "png"

# Load the hourly dataset, shortened to 4000 rows
X, y = get_preprocessed_dataset(
    "energy/industrial_pv_load",
    target_col="residual_load",
    resample="H",
    deduplication_strategy="first",
    shorten=4000,
)

no_of_observation_per_day = 24
no_of_observation_per_week = no_of_observation_per_day * 7

y.plot(figsize=(20, 5), grid=True);
```

The data format may be very familiar - it looks like the standard scikit-learn data.

`X` represents exogenous variables in `fold`, where a single row corresponds to a single target value. That means we currently only support univariate time-series (with exogenous variables), but soon we're extending that.

It's important that the data is sorted and that its integrity (no missing values, no duplicate indices) is checked before passing it to `fold`.
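A quick pre-flight check along these lines can catch such problems early. This is a minimal sketch in plain Python, not part of `fold`: the helper name `check_integrity` and the toy rows are made up for illustration.

```python
import datetime as dt
import math


def check_integrity(timestamps, values):
    """Return a list of problems found in a (timestamp, value) series."""
    problems = []
    # Strictly increasing timestamps imply both "sorted" and "no duplicates".
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr == prev:
            problems.append(f"duplicate index: {curr}")
        elif curr < prev:
            problems.append(f"not sorted at: {curr}")
    # Missing values show up as None or NaN.
    for ts, value in zip(timestamps, values):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"missing value at: {ts}")
    return problems


clean_index = [dt.datetime(2018, 1, 1, hour) for hour in range(3)]
clean_report = check_integrity(clean_index, [120.0, 115.5, 120.5])

broken_index = [
    dt.datetime(2018, 1, 1, 0),
    dt.datetime(2018, 1, 1, 0),  # duplicated timestamp
    dt.datetime(2018, 1, 1, 2),
]
broken_report = check_integrity(broken_index, [120.0, float("nan"), 120.5])

print(clean_report)
print(broken_report)
```

With a pandas `DataFrame`, the equivalent checks are `X.index.is_monotonic_increasing`, `X.index.has_duplicates` and `X.isna().any().any()`.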

| datetime | P | Gb(i) | Gd(i) | H_sun | T2m | WS10m | load | residual_load |
|---|---|---|---|---|---|---|---|---|
| 2018-01-01 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 8.44 | 5.54 | 120.0 | 120.0 |
| 2018-01-01 01:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 7.56 | 5.43 | 115.5 | 115.5 |
| 2018-01-01 02:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 7.04 | 5.33 | 120.5 | 120.5 |
| 2018-01-01 03:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 6.48 | 5.67 | 123.5 | 123.5 |
| 2018-01-01 04:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 5.95 | 5.79 | 136.5 | 136.5 |

(We'll ignore the exogenous variables until a bit later)

```
datetime
2018-01-01 00:00:00 115.5
2018-01-01 01:00:00 120.5
2018-01-01 02:00:00 123.5
2018-01-01 03:00:00 136.5
2018-01-01 04:00:00 138.0
Freq: H, Name: residual_load, dtype: float64
```

You can see that `y` (our target) contains the next value of `X`'s "residual_load" column.
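The relationship between the two columns can be sketched in a few lines of plain Python. The numbers are the ones printed above; the variable names `x_sketch` and `y_sketch` are illustrative only and not part of `fold`.

```python
# Values of X["residual_load"] from the table above, plus one extra
# observation (138.0) taken from the last row of y shown above.
residual_load = [120.0, 115.5, 120.5, 123.5, 136.5, 138.0]

# The target at time t is the residual load observed at t + 1.
x_sketch = residual_load[:-1]
y_sketch = residual_load[1:]

for current, target in zip(x_sketch, y_sketch):
    print(f"X residual_load = {current:6.1f}  ->  y = {target:6.1f}")
```

With pandas, the same one-step shift would typically be written as `X["residual_load"].shift(-1)`.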

## Time Series Cross Validation with a univariate forecaster

### 1. Model Building

`fold` has three core types of building blocks, from which you can build arbitrarily sophisticated pipelines:
- **Transformations** (classes that change or augment the data, e.g. `AddHolidayFeatures` adds a holiday/weekend feature column to your exogenous variables)
- **Models** (e.g. Sklearn, baseline models, third-party adapters from `fold-wrappers`, like Statsmodels)
- **Composites** (e.g. `Ensemble`, which takes the mean of the output of an arbitrary number of 'parallel' models or pipelines)

Let's use Facebook's popular `Prophet` library, and create an instance.

If `fold-wrappers` is installed, `fold` can take this instance without any additional wrapper class.

### 2. Creating a Splitter

A splitter allows us to do Time Series Cross-Validation with various strategies.

`fold` supports three types of `Splitters`. Here we use the `ExpandingWindowSplitter`:

```
from fold.splitters import ExpandingWindowSplitter

splitter = ExpandingWindowSplitter(
    initial_train_window=no_of_observation_per_week * 6,
    step=no_of_observation_per_week,
)
```

Here, `initial_train_window` defines the first window size, and `step` is the size of the window between folds.

We're gonna be using the first 6 weeks as our initial window, and re-train (or update, in another training mode) the model every week after. We'll have 18 models, each predicting the next week's target variable.

You can also use percentages to define both: for example, `0.1` would be equivalent to `10%` of the available data.
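To see where the 18 models come from, here is a minimal plain-Python sketch of how an expanding window could be laid out over 4000 rows. It is not `fold`'s implementation: `expanding_window_folds` is a hypothetical helper, and it assumes the final, shorter window is kept rather than dropped.

```python
def expanding_window_folds(n_rows, initial_train_window, step):
    """Yield (train_range, test_range) pairs for an expanding window."""
    for test_start in range(initial_train_window, n_rows, step):
        test_end = min(test_start + step, n_rows)
        # The training window always starts at 0 and grows by `step` each fold.
        yield range(0, test_start), range(test_start, test_end)


observations_per_week = 24 * 7
folds = list(
    expanding_window_folds(4000, observations_per_week * 6, observations_per_week)
)

first_train, first_test = folds[0]
last_train, last_test = folds[-1]

print(len(folds))                         # number of folds
print(len(first_train), len(first_test))  # sizes of the first fold
print(len(last_train), len(last_test))    # sizes of the last fold
```

Under these assumptions the first model trains on 1008 rows (6 weeks) and predicts the following 168, while the last one trains on 3864 rows and predicts the remaining 136.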

### 3. Training a (univariate) Model

We can use `ray` to parallelize the training of multiple folds across the CPU cores we have available (or deploy it to a cluster, if needed), which cuts the wall-clock time roughly in proportion to the number of cores.

We pass in `None` as `X`, to indicate that we want to train a univariate model, without any exogenous variables.

```
from fold import train_evaluate, Backend
import ray
ray.init(ignore_reinit_error=True)
scorecard, predictions, trained_pipeline = train_evaluate(prophet, None, y, splitter, backend=Backend.ray, krisi_args={"model_name":"prophet"})
```

### 4. Evaluating the results

| | prophet |
|---|---|
| Mean Absolute Error | 9.507563e+01 |
| Mean Absolute Percentage Error | 7.955506e+13 |
| Symmetric Mean Absolute Percentage Error | 5.092002e-01 |
| Mean Squared Error | 1.477914e+04 |
| Root Mean Squared Error | 1.215695e+02 |
| R-squared | 4.268419e-01 |
| Mean of the Residuals | 6.361641e+00 |
| Standard Deviation of the Residuals | 1.214232e+02 |
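The headline metrics can be reproduced by hand. The sketch below implements MAE, RMSE, MAPE and sMAPE in plain Python on made-up numbers. The function names are illustrative, and the exact formulas `krisi` uses may differ (sMAPE in particular has several common variants). It also shows one plausible reason for the enormous MAPE above: when an actual value is zero, the error is divided by a tiny epsilon (as scikit-learn's implementation does), so a single such point dominates the average.

```python
import math
import sys


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def mape(actual, predicted, eps=sys.float_info.epsilon):
    # Guard against division by zero with a tiny epsilon.
    ratios = [abs(a - p) / max(abs(a), eps) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(actual)


def smape(actual, predicted):
    # One common variant: |error| divided by the sum of magnitudes,
    # which keeps the result between 0 and 1.
    ratios = [abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(actual)


actual = [120.0, 115.5, 0.0, 123.5]     # note the zero
predicted = [118.0, 117.5, 2.0, 121.5]  # every forecast is off by 2.0

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
smape_value = smape(actual, predicted)

print(mae_value)    # small absolute error
print(rmse_value)
print(mape_value)   # explodes because of the zero
print(smape_value)  # stays bounded
```

This is why the scale-dependent metrics (MAE, RMSE) and sMAPE are the more informative columns to read for this dataset.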

Finally, let's save the scorecard into a list, so we can compare the results later.

## Using an Ensemble (Composite) model

Here we will build an `Ensemble` model that leverages the output of multiple models.

### 1. Model Building with `fold-wrappers`

We are going to define three different pipelines, each leveraging a different model and different features.

We can leverage the most popular modelling libraries, like StatsForecast, Sktime, XGBoost, etc. (see the `fold-wrappers` documentation for the full list).

Let's train an MSTL model, implemented in StatsForecast, that can capture multiple seasonalities, using the `WrapStatsForecast` class from `fold-wrappers`. This is not strictly necessary, though, as the automatic wrapping works for StatsForecast instances as well.

```
from statsforecast.models import MSTL
from fold_wrappers import WrapStatsForecast, WrapStatsModels
mstl = WrapStatsForecast.from_model(MSTL([24, 168]))
```

### 2. Ensembling with `fold`

Finally, let's `ensemble` the two pipelines.
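Conceptually, the ensemble's forecast is the element-wise mean of its members' forecasts, as described in the building-blocks list earlier. A minimal plain-Python sketch with made-up numbers follows. This is not `fold`'s implementation, and `ensemble_mean` is a hypothetical helper.

```python
def ensemble_mean(*forecasts):
    """Average several equally long forecasts element by element."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


def mean_absolute_error(truth, forecast):
    return sum(abs(t - f) for t, f in zip(truth, forecast)) / len(truth)


actual = [100.0, 110.0, 120.0]
model_a = [110.0, 118.0, 112.0]  # errors: +10, +8, -8
model_b = [92.0, 104.0, 130.0]   # errors: -8, -6, +10

combined = ensemble_mean(model_a, model_b)

mae_a = mean_absolute_error(actual, model_a)
mae_b = mean_absolute_error(actual, model_b)
mae_combined = mean_absolute_error(actual, combined)

print(combined)
print(mae_a, mae_b, mae_combined)
```

In this toy case the members' errors have opposite signs, so they largely cancel out. When the members make similar mistakes, the gain from averaging is much smaller.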

### 3. Training all pipelines separately and within an `ensemble`

We'll use the same `ExpandingWindowSplitter` we have defined above, to make performance comparable.

```
from fold import train_evaluate

# Train and score each pipeline with the same splitter
for name, pipeline in [
    ("mstl", mstl),
    ("univariate_ensemble", univariate_ensemble),
]:
    scorecard, predictions, pipeline_trained = train_evaluate(
        pipeline, None, y, splitter, krisi_args={"model_name": name}
    )
    results.append((scorecard, predictions))
```

| | rmse | mse |
|---|---|---|
| prophet | 121.569502 | 14779.143772 |
| mstl | 118.703683 | 14090.564243 |
| univariate_ensemble | 105.962271 | 11228.002822 |

We see that our Ensemble model has beaten both individual models on RMSE and MSE, which is very common in the time series context.

## Using a single-step ahead forecaster (a baseline)

So far we've used models that were costly to update (or re-train) every day, therefore we were limited to training once for every week, then predicting the next week's target.

What if we could use a lightweight, "online" model, that can be updated on every timestamp?

And.. what if we just repeat the last value?

That'd be the `Naive` model you can load from `fold_models`.

```
from fold import train_evaluate
from fold_models import Naive
scorecard, predictions, trained_pipeline = train_evaluate(Naive(), None, y, splitter, krisi_args={"model_name":"naive"})
results.append((scorecard, predictions))
scorecard.print("minimal")
```


| | naive |
|---|---|
| Mean Absolute Error | 3.911054e+01 |
| Mean Absolute Percentage Error | 2.122351e+14 |
| Symmetric Mean Absolute Percentage Error | 2.038015e-01 |
| Mean Squared Error | 4.012944e+03 |
| Root Mean Squared Error | 6.334780e+01 |
| R-squared | 8.443718e-01 |
| Mean of the Residuals | -4.387701e-02 |
| Standard Deviation of the Residuals | 6.335837e+01 |

**We call this Adaptive Backtesting.**

It looks like having access to the last value really makes a difference: the baseline model beats all long-term forecasting models by a large margin.
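The idea behind the baseline fits in a couple of lines. Below is a minimal plain-Python sketch of a one-step-ahead "repeat the last value" forecast, using the first five target values printed earlier. `naive_one_step` is a hypothetical helper, not the `Naive` class from `fold_models`.

```python
def naive_one_step(history):
    """Forecast each value as the observation that came right before it."""
    # The first point has no predecessor, so it gets no forecast.
    return history[:-1]


# First five target values shown earlier in this walkthrough.
observed = [115.5, 120.5, 123.5, 136.5, 138.0]

forecasts = naive_one_step(observed)
targets = observed[1:]
errors = [abs(t - f) for t, f in zip(targets, forecasts)]
mean_error = sum(errors) / len(errors)

print(forecasts)
print(errors)
print(mean_error)
```

Because consecutive hourly values are close to each other, the error stays small as long as the forecast is refreshed at every timestamp.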

**It's extremely important to define our forecasting task well**:
1. We need to think about what time horizon we can and should forecast
2. And how frequently we can update our models.

Long-horizon (in this case, a week ahead) forecasts can be very unreliable. On the other hand, frequent, short-term forecasts are where Machine Learning shines (as we'll see in the next section).

## Using exogenous variables with Tabular Models

So far we have been training univariate models, and ignored all the additional, exogenous variables that come with our data.

Let's see whether using this data boosts our model's performance!

### Building Models separately

We'll be using scikit-learn's `HistGradientBoostingRegressor`, their competing implementation of Gradient Boosted Trees. You don't need to wrap `scikit-learn` models or transformations when using them in `fold`, just pass them in directly to any pipeline.

```
from sklearn.ensemble import HistGradientBoostingRegressor
tree_model = HistGradientBoostingRegressor(max_depth=10)
```

Let's add both holiday and date-time features to our previous ensemble pipeline.

The data was gathered in the Region of Hessen, Germany, so we pass in `DE` (we can pass in multiple regions). This transformation adds another column for holidays to our `exogenous` (`X`) features.

We're also adding the current hour, day of week and day of year as integers to our exogenous features. This is one of the ways for our tabular model to capture seasonality.

```
from fold.transformations import AddHolidayFeatures, AddDateTimeFeatures
datetime = AddDateTimeFeatures(['hour', 'day_of_week', 'day_of_year'])
holidays = AddHolidayFeatures(['DE'])
```
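To make the integer date-time features concrete, here is a minimal sketch using only the standard library. It is an illustration rather than what `AddDateTimeFeatures` does internally: the dictionary keys simply mirror the names passed above, and the Monday-as-zero weekday numbering is Python's convention, which `fold` may or may not share.

```python
import datetime as dt


def datetime_features(timestamp):
    """Derive integer calendar features from a single timestamp."""
    return {
        "hour": timestamp.hour,
        "day_of_week": timestamp.weekday(),  # Monday == 0
        "day_of_year": timestamp.timetuple().tm_yday,
    }


start = dt.datetime(2018, 1, 1)  # first timestamp of the dataset
rows = [datetime_features(start + dt.timedelta(hours=h)) for h in (0, 1, 25)]

for row in rows:
    print(row)
```

A tree model can split on these integers directly, for example to learn that the load at hour 3 behaves differently from the load at hour 12.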

Let's add a couple of lagged, exogenous values for our model. `AddLagsX` receives a tuple of column name and integer or list of lags, for each of which it will create a column in `X`.

We can easily create transformations of existing features on a rolling window basis with `AddWindowFeatures` as well, in this case, the last day's average value for all of our exogenous features.

We can "tie in" two separate pipelines with `Concat`, which concatenates all columns from all sources.

```
from fold.transformations import AddWindowFeatures, AddLagsX
from fold.composites import Concat

tree = [
    Concat([
        AddLagsX(("all", range(1, 3))),
        AddWindowFeatures([("all", 24, "mean")]),
    ]),
    datetime,
    holidays,
    tree_model,
]
```
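What lag and rolling-window features look like can be sketched in plain Python on the `T2m` column from the first rows of `X`. The helpers `add_lags` and `rolling_mean` are hypothetical and not `fold`'s implementation. The window is shortened from 24 to 3 to fit the toy data, and whether `fold` includes the current row in the window is an assumption here.

```python
def add_lags(values, lags):
    """Return {lag: shifted copy}, padding the start with None."""
    return {lag: [None] * lag + values[:-lag] for lag in lags}


def rolling_mean(values, window):
    """Mean over the last `window` values, None until enough history exists."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


# T2m column from the first rows of X shown earlier.
t2m = [8.44, 7.56, 7.04, 6.48, 5.95]

lagged = add_lags(t2m, range(1, 3))     # lags 1 and 2, as in the pipeline
smoothed = rolling_mean(t2m, window=3)  # window shortened for the toy data

print(lagged[1])
print(lagged[2])
print(smoothed)
```

Each lag and each window statistic becomes one extra column, so the tabular model sees the recent past of every exogenous variable alongside its current value.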

Let's see how this performs!

We can also use fold's `train` and `backtest` functions separately, to decouple training from prediction.

```
from fold import train, backtest
trained_pipeline = train(tree, X, y, splitter)
predictions = backtest(trained_pipeline, X, y, splitter)
scorecard = score(y[predictions.index], predictions.squeeze(), model_name="tabular_tree")
results.append((scorecard, predictions))
scorecard.print("minimal")
```


| | tabular_tree |
|---|---|
| Mean Absolute Error | 2.965292e+01 |
| Mean Absolute Percentage Error | 2.123209e+14 |
| Symmetric Mean Absolute Percentage Error | 1.591514e-01 |
| Mean Squared Error | 2.637592e+03 |
| Root Mean Squared Error | 5.135749e+01 |
| R-squared | 8.977101e-01 |
| Mean of the Residuals | -1.747140e+00 |
| Standard Deviation of the Residuals | 5.133635e+01 |

### Creating an Ensemble of Tabular models

First let's create two more models:
- an Sklearn `LinearRegression`
- and an `XGBRegressor` instance

We are also going to use the HistGradientBoostingRegressor pipeline that we defined prior.

```
from sklearn.linear_model import LinearRegression

lregression = [
    AddLagsX(("all", range(1, 3))),
    datetime,
    LinearRegression(),
]
```

```
from xgboost import XGBRegressor
from fold_wrappers.xgboost import WrapXGB

xgboost = [
    AddLagsX(("all", range(1, 3))),
    datetime,
    WrapXGB.from_model(XGBRegressor()),
]
```

```
scorecard, predictions, pipeline_trained = train_evaluate(tabular_ensemble, X, y, splitter, krisi_args={"model_name":"tabular_ensemble"})
results.append((scorecard, predictions))
```


## Comparing & Visualising the results

| | rmse | mae | mape | smape | mse | r_two | residuals_mean | residuals_std |
|---|---|---|---|---|---|---|---|---|
| prophet | 121.569502 | 95.075627 | 7.955506e+13 | 0.509200 | 14779.143772 | 0.426842 | 6.361641 | 121.423231 |
| mstl | 118.703683 | 68.464305 | 1.846641e+14 | 0.279174 | 14090.564243 | 0.453546 | 13.258315 | 117.980649 |
| univariate_ensemble | 105.962271 | 73.283533 | 1.321096e+14 | 0.356740 | 11228.002822 | 0.564561 | 9.809978 | 105.524826 |
| naive | 63.347800 | 39.110541 | 2.122351e+14 | 0.203802 | 4012.943794 | 0.844372 | -0.043877 | 63.358374 |
| tabular_tree | 51.357493 | 29.652915 | 2.123209e+14 | 0.159151 | 2637.592134 | 0.897710 | -1.747140 | 51.336346 |
| tabular_ensemble | 48.228591 | 28.165175 | 2.113763e+14 | 0.153255 | 2325.997006 | 0.909794 | -2.690336 | 48.161544 |

In this simplistic, unfair comparison, it looks like the tabular models (and the Naive baseline) that have access to the previous value (and the exogenous variables) outperform the univariate models that are only re-trained every week.
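One way to read the table is to express every pipeline's RMSE relative to the Naive baseline. The sketch below does that in plain Python with the RMSE values copied from the table. The "improvement over naive" measure is an illustrative choice here, not something `krisi` reports.

```python
# RMSE values copied from the comparison table above.
rmse = {
    "prophet": 121.569502,
    "mstl": 118.703683,
    "univariate_ensemble": 105.962271,
    "naive": 63.347800,
    "tabular_tree": 51.357493,
    "tabular_ensemble": 48.228591,
}

baseline = rmse["naive"]
# Positive numbers mean "better than naive", negative mean "worse".
improvement = {name: 1 - value / baseline for name, value in rmse.items()}

ranking = sorted(improvement, key=improvement.get, reverse=True)
for name in ranking:
    print(f"{name:20s} {improvement[name]:+.1%}")
```

Only the two tabular pipelines end up with a positive score, which matches the reading of the table given above.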

We can't really draw general conclusions from this work, though.

Unlike NLP and Computer Vision, Time Series data is very heterogeneous, and a Machine Learning approach that works well for one series may be an inferior choice for your specific use case.

But now we have an easy and fast way to compare the different pipelines through fold's unified interface.

```
all_predictions =[predictions.squeeze().rename(scorecard.metadata.model_name) for scorecard, predictions in results]
plot_y_predictions(y[predictions.index], all_predictions, y_name="residual_load", mode='seperate')
```

Want to know more? Visit fold's Examples page for all the snippets you need to build a Time Series ML pipeline!