Time Series Smoothing for Better Forecasting

Technology, 2023-11-28

In time series forecasting, dirty and messy data can hurt the final predictions. This is especially true in this domain because temporal dependency plays a crucial role when dealing with sequences ordered in time.

Noise and outliers must be handled with care, usually through ad-hoc solutions. In this situation, the tsmoothie package can save us a lot of time when preparing time series for analysis. Tsmoothie is a Python library for time series smoothing and outlier detection that can handle multiple series in a vectorized way. It is useful because it provides the preprocessing steps we need, such as denoising or outlier removal, while preserving the temporal pattern present in the raw data.

In this post, we use these tricks to improve a forecasting task. More precisely, we try to forecast the daily power production of solar panels. In the end, we expect to benefit from the denoising procedure and to produce better predictions than we would without preprocessing.

THE DATA

A real dataset for our purpose is available on Kaggle. It stores the daily power production of solar panels installed on the roof of a private house. The data have been recorded since 2011 and comprise 3 different sources in time series format:

The daily gas consumption of the house.

The daily power consumption of the house, where a negative value indicates that solar power production exceeded the local power consumption.

The daily value of the power meter on the DC to AC converter. This is the cumulative solar power production to date. We don't need the cumulative value; we want the absolute daily value, so we take a simple first difference of the series. This is our target to forecast.
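The differencing step for the cumulative meter reading can be sketched as follows. This is a minimal illustration and not the author's code: the readings are made-up numbers, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical cumulative meter readings (one value per day), made up for illustration.
cumulative_power = np.array([100.0, 104.5, 110.0, 110.0, 117.0])

# First difference: each reading minus the previous one gives the daily production.
# The result is one element shorter, because the first day has no predecessor.
daily_power = np.diff(cumulative_power)

print(daily_power)
```

A day on which the meter does not move (the third difference here) shows up as a daily production of zero.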

As the plot of the raw series shows, a lot of noise is present. This is normal for data recorded by sensors. The situation gets worse if the data sources are affected by external meteorological conditions, or if the sensors are of poor quality or located in suboptimal positions.

Fortunately, we have the knowledge and the instruments to achieve good results on our forecasting task.

TIME SERIES SMOOTHING

The first step in our workflow is time series preprocessing. Our strategy is intuitive and effective: we take the target time series (power production) and smooth it with a fantastic instrument, the Kalman Filter, a must-know for every data scientist.

Generally speaking, the great advantage of using the Kalman Filter in time series tasks is the possibility of using a state-space form to represent an unobserved component model. The point of representing time series models in state-space form is that a set of general algorithms (including the Kalman Filter) becomes available for computing the Gaussian likelihood, which can be numerically maximized to obtain the maximum likelihood estimates of the model's parameters. It is no accident that well-known software packages use this representation to fit models like ARIMA. In our particular case, we use the Kalman Filter and a state-space representation to build an unobserved component model.

All of this can sound tricky, but I want to reassure you: tsmoothie can easily build unobserved component models to run a custom Kalman smoothing in a very simple and efficient way. At this stage, we can free our imagination and decide which components among level, trend, seasonality, and long seasonality contribute to creating the time series we are observing. A level and a long seasonality of 365 days sound good for us. We simply add a 'confidence' for each component assumption and we are done.

Kalman smoothing on ALL the series (only for visual purposes). We use only the smoothed target.

The resulting smoothed time series keeps the same temporal pattern present in the raw data, but with a consistent and sensible reduction in noise.

PRO TIP: If our series contains NaNs, it's not a problem. The procedure still works extremely well, and it is a very powerful instrument for filling gaps in our data. This is the beauty of Kalman smoothing.
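To make the idea concrete, here is a minimal NumPy sketch of Kalman smoothing for the simplest unobserved component model, a level only (a random walk observed with noise). It is not tsmoothie's implementation and it leaves out the 365-day long seasonality used in the post. The function name, the noise values, and the synthetic series are all assumptions chosen for illustration. The `level_noise` argument plays a role similar to the 'confidence' attached to a component: a smaller value yields a smoother level.

```python
import numpy as np

def local_level_smooth(y, level_noise=0.1, obs_noise=1.0):
    """Kalman filter plus backward (RTS) smoother for a local-level model.

    State:        level[t] = level[t-1] + w,  w ~ N(0, level_noise)
    Observation:  y[t]     = level[t]   + v,  v ~ N(0, obs_noise)
    NaN observations skip the update step, so gaps are filled by the model.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    pred_mean, pred_var = np.empty(n), np.empty(n)
    filt_mean, filt_var = np.empty(n), np.empty(n)

    # Very wide prior centred on the first observed value.
    mean, var = y[~np.isnan(y)][0], 1e6
    for t in range(n):
        if t > 0:
            # Predict: the level persists and its uncertainty grows.
            mean, var = filt_mean[t - 1], filt_var[t - 1] + level_noise
        pred_mean[t], pred_var[t] = mean, var
        if np.isnan(y[t]):
            # Missing value: keep the prediction as the filtered estimate.
            filt_mean[t], filt_var[t] = mean, var
        else:
            gain = var / (var + obs_noise)
            filt_mean[t] = mean + gain * (y[t] - mean)
            filt_var[t] = (1.0 - gain) * var

    # Backward pass: refine each estimate using information from the future.
    smooth = filt_mean.copy()
    for t in range(n - 2, -1, -1):
        c = filt_var[t] / pred_var[t + 1]
        smooth[t] = filt_mean[t] + c * (smooth[t + 1] - pred_mean[t + 1])
    return smooth

# Synthetic example: a slow wave plus sensor-like noise, with a gap of NaNs.
rng = np.random.default_rng(0)
signal = 10 + 5 * np.sin(np.linspace(0, 4 * np.pi, 200))
noisy = signal + rng.normal(0, 1.0, size=200)
noisy[50:55] = np.nan

smoothed = local_level_smooth(noisy, level_noise=0.1, obs_noise=1.0)
```

The output has the same length as the input and contains no NaNs: the missing stretch is filled by the level estimate carried across the gap.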

TIME SERIES FORECASTING

The second step involves building a neural network to forecast the power production of the next days. First we fit a model on the raw data; then we fit one on the smoothed series. The smoothed data is used only as the target variable, and all the input series remain in their original format. Using a smoothed label is meant to help the model capture the real patterns and discard the noise.
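One way to set up the two training sets is sketched below: the same raw input windows are paired once with raw targets and once with smoothed targets. This is an illustration under assumptions, not the author's pipeline. The window length of 7, the random data, and the moving average standing in for the Kalman-smoothed target are all hypothetical.

```python
import numpy as np

def make_windows(inputs, target, n_lags, horizon):
    """Slice a multivariate series into supervised samples.

    inputs: array of shape (T, F) with the raw input series.
    target: array of shape (T,) with the series to forecast.
    Returns X with shape (N, n_lags, F) and Y with shape (N, horizon).
    """
    X, Y = [], []
    for end in range(n_lags, len(target) - horizon + 1):
        X.append(inputs[end - n_lags:end])   # the past n_lags days of all inputs
        Y.append(target[end:end + horizon])  # the next `horizon` days of the target
    return np.array(X), np.array(Y)

rng = np.random.default_rng(1)
raw_inputs = rng.normal(size=(30, 3))   # hypothetical: gas, consumption, production
raw_target = raw_inputs[:, 2]

# Stand-in for the Kalman-smoothed target (a 3-day moving average, for illustration only).
smoothed_target = np.convolve(raw_target, np.ones(3) / 3, mode="same")

# Same raw inputs in both cases; only the label changes.
X_raw, Y_raw = make_windows(raw_inputs, raw_target, n_lags=7, horizon=5)
X_smooth, Y_smooth = make_windows(raw_inputs, smoothed_target, n_lags=7, horizon=5)
```

Because the inputs are identical, any difference in forecasting accuracy between the two fits can be attributed to the choice of label.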

We choose an LSTM autoencoder to forecast the next 5 daily values of power production. The training procedure is carried out using keras-hypetune. This framework provides hyperparameter optimization of neural network structures in a very intuitive way. We run a grid search over some parameter combinations. This is done for both training runs.
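What a grid search does can be sketched in plain Python. This is not the keras-hypetune API: the parameter names, the grid values, and the `train_and_score` function are hypothetical stand-ins. In the real workflow, the scoring function would fit the LSTM autoencoder and return its validation loss.

```python
from itertools import product

# Hypothetical search space; the post does not list the actual parameters.
param_grid = {
    "units": [32, 64],
    "dropout": [0.0, 0.2],
    "learning_rate": [1e-3, 1e-2],
}

def train_and_score(params):
    """Stand-in for fitting a network and returning its validation loss.

    A toy deterministic formula is used so the sketch runs without
    a deep-learning stack installed.
    """
    return abs(params["units"] - 64) / 64 + params["dropout"] + params["learning_rate"]

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and keep the lowest score."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best_score, best_params = min(results, key=lambda item: item[0])
    return best_params, best_score, results

best_params, best_score, results = grid_search(param_grid, train_and_score)
print(best_params, best_score)
```

The number of fits grows as the product of the list lengths (2 x 2 x 2 = 8 here), which is why a grid search is usually limited to a few values per parameter.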

The baseline is simply the repetition of the present value.

As we can imagine, the prediction errors are a function of the forecasting horizon: the predictions for the next day are more accurate than the predictions five days ahead. The important point is that the smoothing procedure provides a great benefit in terms of prediction accuracy across all forecasting horizons.
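The baseline and the per-horizon error can be sketched as follows. This is a toy illustration: the series is made up, a horizon of 3 is used instead of the 5 days in the post to keep it short, and mean squared error is an assumed metric since the post does not name one.

```python
import numpy as np

def persistence_forecast(last_values, horizon):
    """Baseline: repeat the last observed value for every step of the horizon."""
    last_values = np.asarray(last_values, dtype=float)
    return np.repeat(last_values[:, None], horizon, axis=1)

def mse_per_horizon(y_true, y_pred):
    """Mean squared error computed separately for each forecasting step."""
    return np.mean((y_true - y_pred) ** 2, axis=0)

# Made-up series whose day-to-day changes keep growing.
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0])
horizon = 3

starts = np.arange(len(series) - horizon)  # forecast origins 0..4
y_true = np.stack([series[t + 1:t + 1 + horizon] for t in starts])
y_pred = persistence_forecast(series[starts], horizon)

errors = mse_per_horizon(y_true, y_pred)
print(errors)  # one error per horizon step
```

On this toy series the error grows with the horizon, and any trained model has to beat these numbers at each step to be worth its cost.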

Forecasts on unseen test data.

SUMMARY

In this post, we took advantage of time series smoothing in a forecasting scenario. We applied the Kalman Filter to smooth our raw data and reduce the presence of noise. This choice proved advantageous in terms of forecasting accuracy. I also want to point out the power of the Kalman Filter in this application and its value as an instrument for building unobserved component models.

CHECK MY GITHUB REPO

Keep in touch: Linkedin

Translated from: https://towardsdatascience.com/time-series-smoothing-for-better-forecasting-7fbf10428b2
