Network analysis: when things get out of control
In the world of Data Science and Machine Learning, network analysis can easily be treated as a standalone domain. The field is so broad that many companies and industries now use it for a wide range of purposes: from social media apps that exploit connections between users to learn about our likes and dislikes, to fraud prevention companies such as us at Ravelin, where we use network analysis to connect customers according to the payment methods or devices they used while ordering online.
Now, let’s stop there for a second to understand more about how all this gibberish works. In a nutshell, link analysis is a technique used to assess and evaluate connections between data. This is much easier and faster when the data is shown in a graph network, so sometimes link analysis is called network analysis or network visualisation.
There are no limits on what can be represented on a network. Take the following example:
In a similar way, instead of persons, companies and cities, we could have connections between users and movies or between football teams and players. Whatever the case, these entities are usually known as nodes. Nodes can have attributes or properties which store information about the node in key/value pairs. These nodes are connected by edges, i.e. the lines between nodes which represent the relationships. Potentially, the edges could also have properties, such as start date, length of time, distances or costs. However, for the sake of simplicity, in this story we’ll assume that all edges are alike.
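As a rough illustration of nodes, attributes and edges, here is a minimal networkx sketch. The node names, attribute keys and values are made up for the example and are not Ravelin's actual schema.

```python
import networkx as nx

# Hypothetical toy data: names and attribute keys are illustrative only.
toy_net = nx.Graph()

# Nodes store their properties as key/value pairs.
toy_net.add_node("customer_1", kind="customer", country="UK")
toy_net.add_node("card_1", kind="card", issuer="ExampleBank")
toy_net.add_node("device_1", kind="device")

# Edges can carry properties too, e.g. when the link was first seen.
toy_net.add_edge("customer_1", "card_1", first_seen="2020-01-15")
toy_net.add_edge("customer_1", "device_1")

print(toy_net.nodes["customer_1"])            # attributes of one node
print(toy_net.edges["customer_1", "card_1"])  # attributes of one edge
```

In the rest of this story the edge properties are ignored, as stated above, so only the presence or absence of an edge matters.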
At Ravelin we use a variety of edges to connect customers and get a clear picture of the underlying links between them. However, when connections start to build up and you’re dealing with millions of orders per day, things can get out of control quite easily. That’s why, although we initially anticipated using an open source tool or buying something off the shelf, we quickly realised that latency, flexibility and other issues meant it would be more effective for us to build it ourselves. Do not despair, though: tools such as neo4j, or even the Python library networkx (which we use at Ravelin mostly for exploratory data analysis and will come back to later in this story), can be more than enough for lots of business cases.
So far so good, we have already quickly reviewed the main elements of a network and some examples of what these elements could be. Now let’s see a real world example of one of our networks at Ravelin, from a customer that we’ll call David Riley:
As you can see, David’s network looks suspicious. The different node colours in the network represent different entities: pink nodes are devices, green nodes are credit cards, blue nodes are phone numbers and orange nodes are chargebacks, i.e. fraudulent transactions. Connections through all of them allow us to find links between customers. In fact, our network propagates out through connected nodes, and we are able to find connections based on their distance (or “hops”) to confirmed fraudsters, up to a maximum of 20 hops. And not only that: by looking at the number of customers within a network, or the speed at which the network grows, we can extract further signals of fraud.
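To make the idea of “hops” concrete, here is a small sketch using networkx. It is not how Ravelin's own system computes distances; the toy graph, the node names and the `MAX_HOPS` constant are assumptions for illustration, with the cap of 20 taken from the text above.

```python
import networkx as nx

# Hypothetical chain: fraudster - card_1 - customer_a - device_1 - customer_b
hop_net = nx.Graph()
hop_net.add_edges_from([
    ("fraudster", "card_1"),
    ("card_1", "customer_a"),
    ("customer_a", "device_1"),
    ("device_1", "customer_b"),
    ("customer_c", "card_2"),  # a separate, unconnected network
])

MAX_HOPS = 20  # maximum distance explored, as mentioned in the text

# Shortest distance (in hops) from the confirmed fraudster to every reachable node.
hops = nx.single_source_shortest_path_length(hop_net, "fraudster", cutoff=MAX_HOPS)
customer_hops = {n: d for n, d in hops.items() if n.startswith("customer")}
print(customer_hops)

# Number of customers in the fraudster's network, another possible signal.
component = nx.node_connected_component(hop_net, "fraudster")
customers_in_network = sum(1 for n in component if n.startswith("customer"))
print(customers_in_network)
```

Customers that are not reachable from the fraudster, such as `customer_c` here, simply do not appear in the result.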
Ok, cool, as you can see, these networks can be super powerful and insightful for finding underlying connections that otherwise would have been impossible to notice. However, as useful as they can be, a network containing multiple types of nodes can also be quite problematic when it starts to increase in size uncontrollably. Why? Basically because just as a network can unveil real relationships between users, it can also grow from unwanted connections. Imagine an eCommerce company where customers are allowed to buy without creating an account. This is called guest checkout, and big retailers such as eBay allow their customers to place orders that way. The thing with guest checkout is that users do not have to provide their real data when buying, so we start to see customers connected by fake phone numbers such as 123456789 or emails like abcd@abcd.com. Eventually, all of a sudden, we start to see networks like this one:
These untrustworthy connections are no good for us at Ravelin. And worst of all, with millions of nodes and edges, it becomes very difficult to actually clean these networks and remove the nodes or the edges that shouldn’t be there.
So, what options do we have? In what follows we’ll explore two techniques for identifying the most influential nodes in massive networks:
- Eigenvector centrality
- Network projection

Eigenvector centrality measures how well connected an individual (a node) is, taking into account how well connected its neighbours are. An individual with a high eigenvector score is likely to be at the center of a cluster of individuals that are also highly connected. Therefore, the node with the highest eigenvector centrality is likely to have a strong level of influence within the group. Take the following example of a fake phone number with a very high eigenvector centrality within a network:
Fun fact: a variant of eigenvector centrality, PageRank, is also the tool working behind Google’s search engine. If you’re interested in learning more about it, follow this link.
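For the curious, the idea behind eigenvector centrality can be sketched as a power iteration: every node starts with the same score, repeatedly takes on the scores of its neighbours, and the scores are rescaled until they stop changing. The function below is only an illustrative sketch (the name `eigenvector_centrality_sketch` and the toy graph are made up here), not the code networkx runs internally.

```python
import math
import networkx as nx

def eigenvector_centrality_sketch(graph, max_iter=200, tol=1e-10):
    """Power-iteration sketch of eigenvector centrality for a non-empty graph."""
    # Start every node with the same score.
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        # Each node keeps its own score and adds its neighbours' scores.
        # Keeping the old score avoids oscillation on bipartite graphs.
        new_scores = {
            node: scores[node] + sum(scores[nbr] for nbr in graph[node])
            for node in graph
        }
        # Rescale so that the vector of scores has unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in new_scores.values()))
        new_scores = {node: v / norm for node, v in new_scores.items()}
        change = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:
            break
    return scores

# Hypothetical toy network: one fake phone number shared by five customers.
toy = nx.Graph([("phone_123456789", f"customer_{i}") for i in range(5)])
toy.add_edge("customer_0", "card_a")

sketch_scores = eigenvector_centrality_sketch(toy)
top_node = max(sketch_scores, key=sketch_scores.get)
print(top_node)
```

On this toy network the shared phone number comes out on top, which is the pattern described above for fake contact details.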
Luckily for us, networkx has a very convenient implementation of this, so we don’t have to worry about all the things happening behind the scenes. Just follow these steps:
1. Create an empty graph
    import networkx as nx

    full_net = nx.Graph()

2. Populate it with all the nodes and edges
    # list_of_nodes: iterable of node ids
    # edges_from_node_to_node: iterable of (node, node) pairs
    full_net.add_nodes_from(list_of_nodes)
    full_net.add_edges_from(edges_from_node_to_node)

3. Find the eigenvector centrality (it should take just a few minutes)
    eigen_original = nx.eigenvector_centrality(full_net)

4. Get a new dataframe with the nodes and their values
The following function is a helper to transform the output from networkx into a dataframe:
    import pandas as pd

    def get_centrality_df(centrality):
        # centrality is the dict returned by networkx: {node: score}
        cent_df = pd.DataFrame({
            'node': list(centrality.keys()),
            'centrality': list(centrality.values()),
        })
        return cent_df

    full_centrality_df = get_centrality_df(eigen_original)

5. Now just print the dataframe sorted by centrality 🎉.
    full_centrality_df.sort_values(by='centrality', ascending=False)

Another way to find the importance of a node within a network is by looking at its degree, i.e. the number of edges connected to the node. Networkx provides a very easy way of getting the degree of any single node through the .degree attribute of any given network. However, in a network containing several types of nodes, a technique called network projection allows us to find underlying relationships between nodes that have connections to the same entities within the network, e.g. a set of customers connected to the same credit card or using the same device for ordering.
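As a quick illustration of the degree lookup mentioned above, here is a minimal sketch on a made-up network (the node names are hypothetical):

```python
import networkx as nx

# Hypothetical toy network: three customers sharing one card.
deg_net = nx.Graph()
deg_net.add_edges_from([
    ("customer_1", "card_1"),
    ("customer_2", "card_1"),
    ("customer_3", "card_1"),
    ("customer_3", "device_1"),
])

print(deg_net.degree("card_1"))  # degree of a single node
print(dict(deg_net.degree()))    # degree of every node, as a dict
```

Here the shared card has the highest degree, because it is the node with the most edges attached to it.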
To understand how network projection works, we first need to understand what a bipartite network is: a particular class of network whose nodes are divided into two sets, X and Y, and where connections can only happen between nodes in different sets. In real life, lots of networks resemble this structure, and Ravelin is no exception. Our networks follow this logic: for example, if we have customers and credit cards, a customer can’t be directly connected to another customer, only to a credit card.
Now that we understand what a bipartite network is, network projection consists of connecting nodes of the same type (which are not connected to each other in our bipartite network) whenever they share a neighbour, i.e. whenever they are both connected to the same node of the other type. See the following example:
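The projection can also be sketched in code on a toy bipartite network. The customers and cards below are invented for illustration; `projected_graph` is the networkx function used later in this story.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical data: three customers and two cards.
toy_customers = ["customer_1", "customer_2", "customer_3"]
toy_cards = ["card_1", "card_2"]

toy_bip = nx.Graph()
toy_bip.add_nodes_from(toy_customers, bipartite=0)
toy_bip.add_nodes_from(toy_cards, bipartite=1)
toy_bip.add_edges_from([
    ("customer_1", "card_1"),
    ("customer_2", "card_1"),  # customers 1 and 2 share card_1
    ("customer_2", "card_2"),
    ("customer_3", "card_2"),  # customers 2 and 3 share card_2
])

# Edges only run between the two sets, so the network is bipartite.
print(nx.is_bipartite(toy_bip))

# Project onto the customers: two customers are linked if they share a card.
toy_proj = bipartite.projected_graph(toy_bip, toy_customers)
print(list(toy_proj.edges()))
```

Customers 1 and 3 never share a card, so they stay unconnected in the projection even though both are linked to customer 2.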
Now, it could also be possible to have more than two types of nodes. In fact, at Ravelin we have way more than that. In such cases, we have two options:
1. We can create a bipartite network with one type of nodes to the left, e.g. customers, and all the other nodes to the right. Take the following example:
2. Just as we can create bipartite networks, we could also generalize to k-partite networks, in which the nodes are split into k sets (tripartite, quadripartite, pentapartite and so on), and then perform the projection.
In practice, it is easier to use the first option, and when it comes to projecting the networks the final outcome is the same. So we’ll do that now using networkx, following our previous example with the customers on the left side and all the other nodes on the right.
1. Create a new empty graph
    import networkx as nx

    cust_vs_all_network = nx.Graph()

2. Find the list of nodes to be fed into the network.
Remember that we’ll be creating a bipartite graph, so we’ll need two independent lists of nodes:

- one just for the customers
- and another for the rest of the nodes

3. Feed the network with the nodes and edges
    # The bipartite attribute marks which of the two sets each node belongs to
    cust_vs_all_network.add_nodes_from(customers, bipartite=0)
    cust_vs_all_network.add_nodes_from(all_other_nodes, bipartite=1)
    cust_vs_all_network.add_edges_from(edges_from_customers_to_others)

4. Project the networks
    from networkx.algorithms import bipartite

    customers_proj = bipartite.projected_graph(cust_vs_all_network, customers)
    cust_all_others_proj = bipartite.projected_graph(cust_vs_all_network, all_other_nodes)

5. Obtain a ranking of nodes by degree
Remember: the degree of a node in a network is just the number of edges it has.
    print('Number of connections for top 10:')
    sorted(customers_proj.degree(), key=lambda x: x[1], reverse=True)[:10]

And that’s all! We have quickly covered two great options for finding very influential nodes within a large network. Doing this can be very useful for:
- Breaking up a network into parts when its size has become a problem
- Finding relevant nodes within a network, which might not be errors but could be relevant for your business for a variety of reasons
- Finding concrete metrics about the relevance and centrality of nodes within a network, which could also be interesting features to feed a machine learning model trying to catch some other kind of related pattern

Finally, if you’re interested in learning more about how we use network analysis at Ravelin for fraud prevention, I invite you to read this entry from our blog, which goes into detail about the kinds of nodes we use and the types of fraud they allow us to find.
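To illustrate the first use case in the list above, breaking up a network, here is a minimal sketch that removes the most central node and counts how many separate pieces remain. The toy network and its node names are hypothetical, and removing a single node is only one possible clean-up strategy, not necessarily the one used at Ravelin.

```python
import networkx as nx

# Hypothetical network: a fake phone number glues together unrelated customers.
big_net = nx.Graph()
big_net.add_edges_from([("phone_123456789", f"customer_{i}") for i in range(4)])
big_net.add_edge("customer_0", "card_a")
big_net.add_edge("customer_1", "card_a")  # a genuine shared card

before = nx.number_connected_components(big_net)

# Find the most influential node by eigenvector centrality.
centrality = nx.eigenvector_centrality(big_net, max_iter=1000)
most_central = max(centrality, key=centrality.get)

# Remove it from a copy and see how the network falls apart.
pruned = big_net.copy()
pruned.remove_node(most_central)
after = nx.number_connected_components(pruned)

print(most_central, before, after)
```

The two customers that share a real card stay connected after the fake phone number is removed, while the others become isolated.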
Translated from: https://syslog.ravelin.com/network-analysis-when-things-get-out-of-control-f2bca0b93cff