DeepWalk: Its Behavior and How to Implement It


    The ability of graph data structures to represent complex interactions has led to new ways to analyze and classify entities defined by their co-interactions. While these analyses are powerful at finding distinct structures within communities, they lack the ability to encode aspects of the graph for input into conventional machine learning algorithms. With the proposal of DeepWalk [1], co-interactions within graphs can be captured and encoded by simple neural networks into embeddings consumable by the aforementioned ML algorithms. Yet while there are articles presenting simple introductions to the DeepWalk algorithm, few that I could find provided code or discussed implementation details for these systems: details including model parametrization, deployment considerations, and handling unseen data.


    In this short article, I will provide a high-level overview of graph networks, Word2Vec / Skip-Gram, and the DeepWalk process. To help with this, I’ll present a multi-class classification example to walk through the algorithm. After that, I will consider different parameter configurations and show their impact on the performance of the algorithm. Lastly, I’ll outline some considerations for deployment and handling unseen data within the system.


    Graph Networks

    Graphs are data structures that efficiently represent the interactions between two or more objects in an ecosystem. The structure is built from two components. The first is the node, or vertex, which defines an entity within the system. For this article I'll use the example of a network of purchases on an e-commerce website, where the nodes in the graph are the products being sold.


    Image by author

    The other aspect of graphs is what links them: an edge, which defines the interaction connecting two nodes. An edge can be directed, indicating a from-to relationship: imagine person A sending an email to person B. An edge may also carry a weight that quantifies the interaction. In our case, we could define an edge weight representing the proportion of consumers who bought both of the products on our e-commerce website.

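    As a concrete illustration, here is a minimal sketch of such a weighted co-purchase graph in networkx, separate from the article's own snippets; the product IDs and weights below are made up for demonstration:

    import networkx as nx

    # Hypothetical co-purchase graph: each weighted edge stores the share of
    # shoppers who bought both products
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("10001", "10205", 0.42),
        ("10001", "11845", 0.17),
        ("10205", "10059", 0.31),
    ])

    print(G["10001"]["10205"]["weight"])  # -> 0.42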

    The DeepWalk Algorithm:

    DeepWalk [1] is a technique for learning node representations with a neural network that operates directly on the target graph structure. It uses a randomized path-traversal technique to provide insights into localized structures within networks. It does so by utilizing these random paths as sequences, which are then used to train a Skip-Gram language model. For simplicity, in this article we will use the Word2Vec implementation from the Gensim package to train our Skip-Gram model.


    Word2Vec

    This simple implementation of the DeepWalk algorithm relies heavily on the Word2Vec language model [2]. Presented by Google in 2013, Word2Vec embeds words into an n-dimensional space, with similar words situated near each other. This means that words that are often used together, or used in similar contexts, have smaller cosine distances between their embeddings.

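    For reference, the cosine distance between two embedding vectors can be computed as follows; this is a small numpy sketch added for clarity, not part of the original article's code:

    import numpy as np

    def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
        # 1 minus cosine similarity; smaller values mean more similar vectors
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))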

    Image by author

    Word2Vec does this by using the Skip-Gram algorithm to compare each target word with its context. At a high level, Skip-Gram operates using a sliding-window technique, trying to predict the surrounding words given the target word in the middle. For our use case of encoding similar nodes within the graph to be close to each other in n-dimensional space, this means we are effectively trying to guess the neighbors of the target node within our network.


    A Skip-Gram example from S. Doshi [3], modifying an image taken from the original authors [2]
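    To make the sliding-window idea concrete, here is a toy sketch of how (target, context) training pairs are generated. Gensim performs this internally, so the function below is purely illustrative:

    # Toy sketch of Skip-Gram (target, context) pair generation
    def skipgram_pairs(tokens, window=2):
        pairs = []
        for i, target in enumerate(tokens):
            # Clamp the window to the ends of the sequence
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, tokens[j]))
        return pairs

    print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]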

    DeepWalk

    DeepWalk utilizes random path-making through graphs to reveal latent patterns in the network; these patterns are then learned and encoded by a neural network to yield our final embeddings. The random paths are generated in an extremely simple manner: starting from the target root node, randomly select a neighbor of that node and add it to the path; then randomly select a neighbor of that node, and continue walking until the desired number of steps has been taken. Using the e-commerce example, this repeated sampling of network paths yields a list of product IDs. These IDs are then treated as if they were tokens in a sentence, and the state space is learned from them using a Word2Vec model. More succinctly, the DeepWalk process operates in a few steps:



    1. For each node, perform N random walks starting from that node.
    2. Treat each walk as a sequence of node-id strings.
    3. Given a list of these sequences, train a Word2Vec model using the Skip-Gram algorithm on these string sequences.

    Image by author

    What does this look like in code?

    We start with the core networkx data structure defining a series of products given by their IDs. Edges between two products indicate that they were co-purchased together within the e-commerce ecosystem. First, we're going to define a function get_random_walk(Graph, Node_Id):


    import random
    from typing import List

    import networkx as nx

    # Instantiate an undirected Networkx graph
    G = nx.Graph()
    G.add_edges_from(list_of_product_copurchase_edges)

    def get_random_walk(graph: nx.Graph, node: int, n_steps: int = 4) -> List[str]:
        """Given a graph and a node, return a random walk starting from the node."""
        local_path = [str(node)]
        target_node = node
        for _ in range(n_steps):
            # Hop to a uniformly random neighbor of the current node
            neighbors = list(nx.all_neighbors(graph, target_node))
            target_node = random.choice(neighbors)
            local_path.append(str(target_node))
        return local_path

    # Sample 10 walks per node
    walk_paths = []
    for node in G.nodes():
        for _ in range(10):
            walk_paths.append(get_random_walk(G, node))

    walk_paths[0]
    >>> ['10001', '10205', '11845', '10205', '10059']

    What these random walks provide us is a series of strings acting as a path from the start node, randomly walking from one node to the next down the list. What we do next is treat each list of strings as a sentence, then utilize this collection of sequences to train a Word2Vec model.


    from gensim.models import Word2Vec

    # Instantiate the Word2Vec model
    embedder = Word2Vec(
        window=4,
        sg=1,             # use Skip-Gram
        hs=0,             # negative sampling rather than hierarchical softmax
        negative=10,
        alpha=0.03,
        min_alpha=0.0001,
        seed=42,
    )

    # Build vocabulary
    embedder.build_vocab(walk_paths, progress_per=2)

    # Train
    embedder.train(
        walk_paths,
        total_examples=embedder.corpus_count,
        epochs=20,
        report_delay=1,
    )
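    Once trained, the node embeddings live in the model's wv attribute and can be queried directly. A quick sanity check might look like this; the product ID is again hypothetical, and the wv accessor assumes Gensim 4.x:

    # Retrieve the learned vector for a product and its nearest neighbors
    vector = embedder.wv["10001"]          # numpy array of shape (vector_size,)
    print(embedder.wv.most_similar("10001", topn=5))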

    Tuning and Parameters

    Now that we have the basic structure of our DeepWalk, there are several aspects that can be parametrized beyond the general model parameters of our Word2Vec model. These may include:


    - Number of random walks performed for the W2V training data
    - Depth of each walk taken from the node

    I'll utilize a general classification approach on a sample dataset to show how these parameters can affect the performance of your model. In the graph described above, a series of products with edges defining co-purchases, we seek to classify the products into their 10 respective categories.

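    A sketch of this evaluation is below. The labels mapping (product ID to category) and the choice of logistic regression are assumptions for illustration, not fixed by the algorithm:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # `labels` is a hypothetical dict mapping product ID -> one of 10 categories
    node_ids = list(embedder.wv.index_to_key)
    X = np.array([embedder.wv[n] for n in node_ids])
    y = np.array([labels[n] for n in node_ids])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out products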

    Image by author

    Above shows the classification accuracy of a classifier trained on the node vectors from our Word2Vec model, with the number of random walks increasing along the y-axis and walk depth increasing along the x-axis. What we see is that accuracy improves as both parameters increase, but with diminishing returns. One thing to note is that training time grows linearly with the walk count, so it can explode as the number of walks increases. For example, the configuration in the top-left corner took only 15 seconds to train, while the bottom-right corner took well over an hour.

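    The sweep itself can be scripted along these lines; train_and_score is a hypothetical helper standing in for the Word2Vec training and classifier evaluation shown earlier:

    # Grid over walk count and walk depth, recording accuracy per pair
    results = {}
    for n_walks in [5, 10, 25, 50]:
        for n_steps in [2, 4, 8, 16]:
            walks = [
                get_random_walk(G, node, n_steps=n_steps)
                for node in G.nodes()
                for _ in range(n_walks)
            ]
            # Hypothetical helper: fits Word2Vec on `walks`, then trains and
            # scores the downstream classifier as above
            results[(n_walks, n_steps)] = train_and_score(walks)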

    Deploying the System:

    Warm-Start Retraining

    So now that we know how DeepWalk behaves, how do we make it practical? In most graph systems, the core issue is updating and maintaining the system without having to retrain the whole model at once. Thankfully, due to DeepWalk's relationship with NLP, we can rely on similar update procedures. When utilizing Gensim, the update procedure is even simpler and follows this process:


    1. Add the target node to the graph.
    2. Perform random walks from that node.
    3. Use those sequences to update the Word2Vec model.

    # Add new nodes to the graph from their edges with current nodes
    G.add_edges_from([new_edges])

    # From a list of the new nodes' product IDs (unknown_nodes), get random walks
    sequences = [get_random_walk(G, node) for node in unknown_nodes]

    # Update the vocabulary with the new sequences, then continue training
    embedder.build_vocab(sequences, update=True)
    embedder.train(
        sequences,
        total_examples=embedder.corpus_count,
        epochs=embedder.epochs,
    )

    No Retraining Necessary

    Alternatively, there is an even easier way to handle new nodes in the system. You can utilize the known embeddings of the model to extrapolate the embedding of the unknown node. Following a similar pattern to the previous implementation:


    1. Add the target node to the graph.
    2. Perform random walks from that node.
    3. Aggregate the embeddings from the random walks, then use that aggregation to stand in as the unknown node's embedding.

    import numpy as np

    # Add new nodes to the graph from their edges with current nodes
    G.add_edges_from([new_edges])

    sequences = [get_random_walk(G, node) for node in unknown_nodes]

    for walk in sequences:
        # Keep only the nodes the model already has embeddings for
        nodes_in_corpus = [node for node in walk if node in embedder.wv]
        # Here we take the average of the known embeddings as a stand-in
        node_embedding = np.mean(
            [embedder.wv[n] for n in nodes_in_corpus], axis=0
        )
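    The aggregated vector can then be handed to the same downstream consumers as any trained embedding; for instance, with the hypothetical classifier from the tuning section:

    # Classify the unseen product using its stand-in embedding
    predicted_category = clf.predict(node_embedding.reshape(1, -1))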

    Here, our embedding is an average of the known nodes encountered on the random walks starting from the unknown node. The benefit of this method is its speed: we don't need to retrain anything, and we perform only a handful of O(1) lookups against a dictionary-like data structure within Gensim. The drawback is its inexactness: without retraining the model, the interactions between the new node and its neighbors are only approximated, and the approximation is only as good as your aggregation function. For more insight into these methods, you can check out papers that discuss the efficacy of such aggregations, like TF-IDF averaging, etc. [4]


    In the past 10 minutes, we've walked through [hah.] the core components of DeepWalk, how to implement it, and some considerations for applying it in your own work. There are many options to consider when evaluating graph networks, and given its simplicity and scalability, DeepWalk should certainly be considered among the available algorithms. Below you can find references for the algorithms outlined above, as well as some further reading regarding word embeddings.


    Sources:

    [1] Perozzi et al. DeepWalk: Online Learning of Social Representations. https://arxiv.org/pdf/1403.6652.pdf


    [2] Mikolov et al. Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/pdf/1301.3781.pdf


    [3] S. Doshi. Skip-Gram: NLP Context Words Prediction Algorithm. https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c?gi=8ff2aeada829


    [4] Levy et al. Improving Distributional Similarity with Lessons Learned from Word Embeddings. https://www.aclweb.org/anthology/Q15-1016/


    Translated from: https://towardsdatascience.com/deepwalk-its-behavior-and-how-to-implement-it-b5aac0290a15
