The Discriminator of Generative Adversarial Networks

    Technology, 2022-08-01

    Generative models are used in many applications across different domains, from quantitative finance, where they help model and minimize tail risk, to hydro-climate research, where they are used to study the joint effects of extreme weather. The power of generative models lies in their ability to capture the full dependence structure and the underlying probability distribution of a dataset. This is useful because a generative model lets us not only create future representations of a dataset, but also sample new, realistic data points that preserve the cross-correlation present in the original training data.

    Generative Adversarial Networks (GANs) are a class of generative models first introduced by Goodfellow et al. (2014). Since then, GANs have been widely adopted in the machine learning community for unsupervised learning problems, including image and text generation and translation. In this article, we explore how GANs can be used to model discrete-time stochastic processes and to create plausible samples on demand, in order to construct synthetic time series of weather data.

    Brief Review of GANs:

    Schematic Architecture of Generative Adversarial Networks (Image by Author)

    The GAN architecture comprises two sub-models, called the Generator (G) and the Discriminator (D), that compete with each other with the goal of reaching a Nash equilibrium during training. The Generator learns to map a latent space (e.g. noise ~ N(0,1)) to the data space over which the given data samples are distributed, and the Discriminator evaluates the mapping produced by the Generator. The principal role of the Generator is to generate synthetic data that mimics the training dataset so closely that the Discriminator cannot distinguish the synthetic data from the real data.

    The input to the Generator is a random noise vector x’, usually drawn from a uniform or normal distribution. The Generator maps the noise vector to the data space to obtain a fake sample, G(x’), which is a multi-dimensional vector. The Discriminator is a binary classifier that takes in both the synthetic samples G(x’) and the training samples x and learns to classify them as fake or real. The GAN reaches its optimal state when the Discriminator can no longer determine whether a sample comes from the real dataset or from the Generator.
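
    For reference, the competition described above is usually written as the minimax objective from Goodfellow et al. (2014). The block below restates it in this article's notation, with x’ as the noise vector. The symbols p_data and p_noise for the two sampling distributions are introduced here only for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{x' \sim p_{\text{noise}}}\big[\log\big(1 - D(G(x'))\big)\big]
```

    At the optimum of this objective the Discriminator outputs 1/2 for every input, which is the formal counterpart of "cannot tell real from fake".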

    Several variations of GANs have been developed over time to address the specific needs of the problem at hand, but all of these variants employ adversarial training, which happens in two phases:

    Phase I: Train the Discriminator and freeze the Generator. Samples of both real and fake data are passed to the Discriminator, and its weights are updated according to how well it classifies them. During Phase I the Generator only performs a forward pass to produce the fake samples; no error is back-propagated into it.

    Phase II: Train the Generator and freeze the Discriminator. The Generator is optimized to create realistic samples that fool the Discriminator trained in Phase I.
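
    The two phases can be illustrated with a deliberately tiny, plain-Python toy. This is a sketch under stated assumptions, not the Keras model used in this article: the "real" data are draws from N(4, 1), the Generator is the linear map G(z) = a*z + b, the Discriminator is the logistic classifier D(x) = sigmoid(w*x + c), and all function names, the learning rate and the batch size are hypothetical choices made for the example.

```python
import math
import random


def sigmoid(s):
    """Numerically safe logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminator_step(d, g, real, noise, lr):
    """Phase I: update only the Discriminator (w, c); the Generator is frozen."""
    w, c = d
    a, b = g
    fake = [a * z + b for z in noise]          # forward pass through G only
    batch = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    # Gradient of binary cross-entropy with respect to the logit is D(x) - label
    grad_w = sum((sigmoid(w * x + c) - y) * x for x, y in batch) / len(batch)
    grad_c = sum((sigmoid(w * x + c) - y) for x, y in batch) / len(batch)
    return (w - lr * grad_w, c - lr * grad_c)


def generator_step(d, g, noise, lr):
    """Phase II: update only the Generator (a, b); the Discriminator is frozen."""
    w, c = d
    a, b = g
    grad_a = grad_b = 0.0
    for z in noise:
        p = sigmoid(w * (a * z + b) + c)       # D's opinion of the fake sample
        # The Generator wants D to answer "real": loss = -log D(G(z))
        grad_a += (p - 1.0) * w * z
        grad_b += (p - 1.0) * w
    n = len(noise)
    return (a - lr * grad_a / n, b - lr * grad_b / n)


def train(steps=2000, batch=64, lr=0.05, seed=0):
    """Alternate Phase I and Phase II for a fixed number of steps."""
    rng = random.Random(seed)
    d, g = (0.0, 0.0), (1.0, 0.0)              # D starts undecided, G starts at N(0, 1)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]   # toy "real" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        d = discriminator_step(d, g, real, noise, lr)        # Phase I
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        g = generator_step(d, g, noise, lr)                  # Phase II
    return d, g
```

    Each step function returns new parameters for one player only, which is what "freezing" the other player amounts to. The toy makes no claim about convergence; GAN training can oscillate even in simple settings.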

    Once trained, the GAN model produces synthetic time series that are realistic enough to mimic the real time series data. The trained Generator can then be used to create new plausible samples on demand.

    What is a stochastic weather generator?

    Image by Author

    A stochastic weather generator is a statistical model that aims to produce synthetic time series of weather data of unlimited length for a location or region, based on the statistical characteristics of the weather observed there. Most stochastic weather generators combine a Markov process with the frequency distributions of the different weather variables.
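
    One common arrangement of "Markov process plus frequency distribution" is a two-state (dry/wet) first-order Markov chain for rain occurrence, with wet-day amounts drawn from a gamma distribution. The sketch below assumes that arrangement and a daily time step. The function name and every parameter value are illustrative placeholders and are not fitted to any station in this article.

```python
import random


def simulate_daily_precip(n_days, p_wet_after_dry=0.2, p_wet_after_wet=0.6,
                          shape=0.8, scale=0.4, seed=None):
    """Two-state Markov chain for occurrence, gamma distribution for amounts.

    All default parameter values are made up for illustration.
    """
    rng = random.Random(seed)
    wet = False                      # assume the series starts after a dry day
    series = []
    for _ in range(n_days):
        # Occurrence: the chance of rain depends only on yesterday's state
        p_wet = p_wet_after_wet if wet else p_wet_after_dry
        wet = rng.random() < p_wet
        # Amount: drawn from a frequency distribution on wet days, zero otherwise
        series.append(rng.gammavariate(shape, scale) if wet else 0.0)
    return series
```

    Calling simulate_daily_precip(365, seed=1) returns one synthetic year. Because the chain can be run for any number of days, the series can be made as long as needed.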

    In climate modeling, multivariate distributions have traditionally been modeled using copulas. A copula is a multivariate distribution with standard uniform marginals, and any multivariate distribution can be written as a composition of its marginal distributions and its copula. For example, in the bivariate case, given two continuous random variables x and y with marginal distributions F and G and joint distribution Q, there exists a copula C such that Q(x, y) = C[F(x), G(y)].
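
    To make the decomposition Q(x, y) = C[F(x), G(y)] concrete, the sketch below samples from a joint distribution by first drawing dependent uniforms from a copula C and then pushing them through the inverse marginals. The choice of a Gaussian copula and exponential marginals is an assumption made for this example only, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, seed=None):
    """Draw n pairs (u, v) with uniform marginals and Gaussian-copula dependence."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal whose correlation with z1 is rho
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a uniform value
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


def exponential_inverse_cdf(u, mean):
    """Inverse CDF of an exponential marginal with the given mean."""
    # Clamp u slightly below 1 so the logarithm stays finite
    return -mean * math.log1p(-min(u, 1.0 - 1e-12))


def sample_joint(n, rho, mean_x, mean_y, seed=None):
    """Compose the copula with two marginals, mirroring Q(x, y) = C[F(x), G(y)]."""
    return [(exponential_inverse_cdf(u, mean_x), exponential_inverse_cdf(v, mean_y))
            for u, v in sample_gaussian_copula(n, rho, seed)]
```

    The dependence between the two variables is controlled entirely by rho, while the marginals can be swapped without touching the copula. That separation is the property the decomposition expresses.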

    GANs offer a completely different framework to model multivariate distributions. Using a combination of Generative and Discriminative models, along with adversarial training, we would be able to transform noise into realistic weather data.

    Data:

    For this study, we use the total monthly precipitation from five weather stations located in the Bay Area, CA. To train the GAN model, we use monthly precipitation data from 1893 to 2012, obtained from NOAA's USHCN database. The precipitation values are in hundredths of inches. By modeling the total monthly precipitation, we focus on the GAN's ability to capture the spatial correlation across the five weather stations.
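
    The article does not say whether or how the precipitation values were rescaled before training. Because the Generator in the code listing ends in a tanh layer, whose outputs lie between -1 and 1, some rescaling is presumably needed. The sketch below shows one possible choice, a per-station min-max mapping onto [-1, 1] and its inverse. The function names and the sample values are hypothetical.

```python
def fit_minmax(values):
    """Return (lo, hi) of a series, the two constants the mapping needs."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("series is constant; cannot rescale")
    return lo, hi


def to_unit_range(values, lo, hi):
    """Map [lo, hi] linearly onto [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def from_unit_range(scaled, lo, hi):
    """Invert to_unit_range, returning values in the original units."""
    return [(s + 1.0) / 2.0 * (hi - lo) + lo for s in scaled]


# Made-up monthly totals in hundredths of an inch, for illustration only
monthly = [0, 125, 480, 960]
lo, hi = fit_minmax(monthly)
scaled = to_unit_range(monthly, lo, hi)
restored = from_unit_range(scaled, lo, hi)
```

    The same lo and hi would be reused to convert generated samples back to hundredths of an inch.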

    Weather Stations in Bay Area (Image by Author)

    Model:

    For both the Generator and Discriminator components of the GAN, we employ a three-layer Sequential model. The latent vector comprises 20 values sampled from a normal distribution. We use the popular ReLU activation function in the hidden layers and the Adam variant of stochastic gradient descent as the optimizer. In the code listing, the Discriminator is compiled with a binary cross-entropy loss and the combined GAN with a mean absolute percentage error (MAPE) loss. To prevent the model from overfitting, we adopt Dropout regularization. A schematic of the GAN architecture used in this study is shown below.

    GAN Architecture (Image by Author)

    Our model output has 10,000 synthetic data samples with 5 output features, corresponding to the total monthly precipitation at the Livermore, Berkeley, Napa, Petaluma and Santa Cruz stations. The model was implemented using the Keras library in Python, and code snippets are provided below.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.utils import plot_model  # needs pydot and graphviz installed


    # Function to generate "true" samples from original data
    def generate_real_samples(dataset, n_samples):
        # choose a random index to start the sequence of n_samples
        # (+1 because the upper bound of randint is exclusive)
        ix = np.random.randint(0, dataset.shape[0] - n_samples + 1)
        X = dataset.iloc[ix:ix + n_samples].values
        y = np.ones((n_samples, 1))  # class label is set to 1 for real samples
        return X, y


    # Function to generate "fake" samples from a defined latent space
    def generate_latent_points(latent_dim, n_samples):
        # generate points in latent space;
        # use "randn" for normal distribution and "rand" for uniform distribution
        x_input = np.random.randn(n_samples, latent_dim)
        x_input = x_input.reshape(n_samples, latent_dim)
        return x_input


    def generate_fake_samples(generator, latent_dim, n_samples):
        x_input = generate_latent_points(latent_dim, n_samples)
        X = generator.predict(x_input)
        y = np.zeros((n_samples, 1))  # class label is set to 0 for fake samples
        return X, y


    # Define Generator
    def define_generator(latent_dim, nr_features):
        model = Sequential(name='Generator')
        model.add(Dense(64, activation='relu', kernel_initializer='he_uniform',
                        input_dim=latent_dim, name='G-Input'))
        model.add(Dropout(0.25))
        model.add(Dense(32, activation='relu', name='G-Layer1'))
        model.add(Dropout(0.25))
        model.add(Dense(nr_features, activation='tanh', name='G-Output'))
        plot_model(model, show_shapes=True, to_file='G_Model.png')
        return model


    # Define Discriminator
    # (the "opt" argument is accepted but unused, as in the original listing)
    def define_discriminator(nr_features, opt):
        model = Sequential(name='Discriminator')
        model.add(Dense(64, activation='relu', kernel_initializer='he_uniform',
                        input_dim=nr_features, name='D-Input'))
        model.add(Dropout(0.25))
        model.add(Dense(32, activation='relu', name='D-Layer1'))
        model.add(Dropout(0.25))
        # sigmoid (the original listing used tanh) so that the output is a valid
        # probability for the binary cross-entropy loss
        model.add(Dense(1, activation='sigmoid', name='D-Output'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        plot_model(model, show_shapes=True, to_file='D_Model.png')
        return model


    # Define GAN
    def define_GAN(generator, discriminator, opt):
        discriminator.trainable = False  # make weights in the discriminator not trainable
        model = Sequential(name='GAN')
        model.add(generator)
        model.add(discriminator)
        model.compile(loss='mean_absolute_percentage_error', optimizer='adam')
        plot_model(model, show_shapes=True, show_layer_names=True, expand_nested=True,
                   dpi=200, to_file='GAN_Model.png')
        return model

    Results:

    As mentioned in the introduction, one of the greatest strengths of GANs is their ability to generate data on demand. Using the trained GAN model, 10,000 new synthetic data points were generated for each of the five weather stations. The plots and table below compare the descriptive statistics of the actual data used to train the model with those of the synthetic data generated by the GAN model.

    Box-Plot of Precipitation Data Across Different Weather Stations (Actual)

    Box-Plot of Precipitation Data Across Different Weather Stations (Synthetic)

    Descriptive Statistics and Correlation Comparison Between Actual (Training) and Synthetic Data

    A quick review of the model results shows that the GAN model learned the distribution of the data at each weather station relatively well, while simultaneously preserving the cross-correlation across the stations. For the Santa Cruz station, the modeled estimates are slightly higher than the observed values. In general, however, the sample minimum, lower quartile (25%), median (50%), upper quartile (75%) and sample maximum of the synthetic data match the corresponding statistics of the actual data well. The performance of the GAN model could be improved further by tuning the size of the latent vector, the number of layers, the cost functions, the training algorithm and the activation functions.
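
    The comparison described above rests on two ingredients: a five-number summary per station and the pairwise correlation between stations. The sketch below shows how both could be computed with the Python standard library. It is not the code used to produce the article's table, and the function names and the inclusive quartile convention are assumptions made for this example.

```python
import math
from statistics import quantiles


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum of a sample."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def pearson(xs, ys):
    """Sample Pearson correlation between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping station name to a list of values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}
```

    Running both functions once on the training data and once on the generated samples gives two sets of numbers that can be placed side by side, station by station and station pair by station pair.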

    GAN-Based Data Augmentation:

    For most real-world applications, data scarcity is one of the biggest bottlenecks to address. This is particularly true in the insurance and risk analytics domain, where the problem of data scarcity is further compounded by data quality issues. GANs offer a powerful modeling framework that, once trained, can artificially synthesize new labeled data for developing better models and risk analytics. This study just scratches the surface of the potential applications of GANs in risk modeling. Stay tuned for more.

    Thanks for reading this article! All feedback is appreciated. For any questions, feel free to contact me.

    Translated from: https://towardsdatascience.com/stochastic-weather-generator-using-generative-adversarial-networks-a9856b0f83ef
