Design Patterns for Cloud Analytics
With the growing number of new technologies entering the market every year, it becomes very difficult for engineers and their leadership to choose the right combination of elements to put the “right solution” in place. In this article, I provide architectural patterns for a cloud-centric analytics platform, their pros and cons, and when each should be used.
Let me start with the definition of the “right solution”. We’ll use a widely accepted set of evaluation criteria called “Architectural Concerns”. There are seven of them, each creating a focus area when building or evaluating a solution. These concerns come into the digital world from a much wider and older engineering field.
Concern 1: Availability — what’s the desired uptime depending on the criticality and the use case.
Concern 2: Performance — how quickly the system responds to user activities or events under different workloads.
Concern 3: Reliability — how reliable the system needs to be for every type of failure, e.g. a broken disk, a node down, a data centre down, etc.
Concern 4: Recoverability — how quickly, and by what means, the system would recover from failures. Some recoveries are automated, as in HDFS or S3; others, like node failure, need to be considered in advance.
Concern 5: Cost — how much money are we willing to spend to bring the solution up (Infra, Development) and maintain it later (Operations).
Concern 6: Scalability — how scalable the solution needs to be, i.e. peak hour traffic, changing trends, growth over the next few years.
Concern 7: Manageability — how to ensure compliance, privacy and security requirements are met.
We’ve defined the evaluation criteria; now let’s look at several common scenarios: Entry Level, Enterprise Non-Real Time and Enterprise Real-Time.
Entry Level — built for a small organization (or organizational unit) with fewer than 1 Bn. records and annual growth of up to 20%.
Enterprise Non-Real Time — used for consolidation of several regionally distributed Entry Level data sets and/or 1–100 Bn. records with annual growth of up to 20%.
Enterprise Real-Time — used for real-time analytics and/or consolidation of several regionally distributed Entry Level data sets with more than 100 Bn. records and high growth factors.

Now let’s define architectural patterns for each of the solutions and provide analysis based on architectural concerns.
For small use cases (< 1 Bn. records) most of the transformations and dimensional storage can be kept within the tool itself. Modern BI solutions (Qlik, Tableau) come with in-memory storage capability directly linked to the self-discovery and dashboarding UI. However, they create a heavy query load on source transactional databases to dynamically refresh the dimensional models, which is why it’s highly recommended to create a CDC (change data capture) copy of the original relational tables rather than link directly into the transactional DBs. That’s also advisable from a security perspective, based on the decoupling principle.
If we look at the conceptual architecture above, we can note the following core capabilities that we need to deliver:
Stage the data for analytics (full or partial relational data);
Store transitional data (consolidation, curation, enrichment);
Process extraction, transformation and loading of data from/to every storage layer;
Provide a self-discovery, dashboards and data wrangling UI;
Efficiently serve the data into the self-discovery, dashboards and data wrangling UI.

I’ve specifically outlined these core capabilities so that we can take them through all solution types and analyse issues based on the architectural concerns listed above.
First, let’s fit the selected technology stack into the conceptual model to get a better feel for the solution. For this scenario, my source system is SAP ERP, and I’m using AWS as the cloud provider and Tableau as the BI tool of choice.
Technology Stack: SAP ERP, AWS DMS, AWS RDS, Tableau
Assuming SAP ERP uses Oracle DB for its on-prem transactional storage, I’m selecting AWS DMS to CDC the data into AWS RDS for Oracle (as a read-only copy). AWS RDS provides the staging area capability that serves Tableau. All the other capabilities (ETL, consolidation, serving) are delivered as part of the Tableau solution. That means we deploy Tableau, then use its UI to hook it up to the RDS Oracle instance and build the transformations within Tableau.
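To make the replication leg concrete, here is a minimal boto3 sketch of the CDC task, assuming the DMS endpoints and the replication instance already exist; the ARNs and the SAPSR3 schema name are illustrative placeholders rather than values from this solution.

```python
import json
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# Replicate the on-prem Oracle schema into RDS: initial full load,
# then ongoing change data capture. All ARNs below are placeholders.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="sap-oracle-to-rds",
    SourceEndpointArn="arn:aws:dms:eu-west-1:111122223333:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:eu-west-1:111122223333:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:111122223333:rep:RI",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "sap-schema",
            # Assumed SAP schema name -- adjust to your installation.
            "object-locator": {"schema-name": "SAPSR3", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```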
Please note that although this design seems straightforward, for cases with > 100 mil. records you might need to create additional indexes and/or summarised data structures to increase query speed on the Tableau side.
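As an illustration of such a summarised structure, the sketch below builds a daily summary table and an index on the RDS staging copy; the table and column names are hypothetical, and python-oracledb is just one possible way to run the DDL.

```python
import oracledb  # assuming the python-oracledb driver for RDS Oracle

# Hypothetical daily summary over a staged sales table, so Tableau
# queries hit a small pre-aggregated structure instead of raw rows.
SUMMARY_DDL = """
CREATE TABLE sales_daily_summary AS
SELECT TRUNC(order_date) AS order_day,
       region_id,
       product_id,
       SUM(net_amount)   AS net_amount,
       COUNT(*)          AS order_count
FROM   sales_orders
GROUP  BY TRUNC(order_date), region_id, product_id
"""

INDEX_DDL = "CREATE INDEX ix_sds_day ON sales_daily_summary (order_day)"

with oracledb.connect(user="etl", password="***",
                      dsn="rds-endpoint:1521/ORCL") as conn:
    with conn.cursor() as cur:
        cur.execute(SUMMARY_DDL)
        cur.execute(INDEX_DDL)
```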
Cloud Analytics — Entry Level Technology Architecture

Let’s evaluate this approach based on the selected set of criteria.
Concern 1: Availability: AWS RDS supports HA through Multi-AZ deployments; AWS DMS needs to be configured for HA at the virtualization level; Tableau HA and load balancing configuration on AWS is well documented, see [1] and [2]. HA of the on-prem to AWS channel is usually provided by the Direct Connect provider or achieved by setting up a separate backup VPN channel over the Internet.
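On the RDS side, enabling Multi-AZ is a single API call. A hedged boto3 sketch, with a placeholder instance identifier:

```python
import boto3

rds = boto3.client("rds")

# Convert the staging instance to a Multi-AZ deployment; RDS provisions
# a synchronous standby in another AZ and fails over automatically.
rds.modify_db_instance(
    DBInstanceIdentifier="tableau-staging-oracle",  # placeholder name
    MultiAZ=True,
    ApplyImmediately=True,
)
```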
Concern 2: Performance: Performance would depend on the selected compute instances. RDS was not designed to scale up or down automatically, so it needs to be sized based on the expected workload. The Tableau installation supports automated scaling out (load balancing), so it is not a concern.
Concern 3: Reliability: Reliability of the AWS infrastructure is provided by the Multi-AZ design. On-prem nodes need to be considered separately depending on the virtualization technology in use.
Concern 4: Recoverability: All components are automatically recoverable by design. Configuration needs to be backed up separately.
Concern 5: Cost: AWS DMS is free to use when replicating to/from RDS. RDS itself is priced based on the node type and the storage. AWS budgets can be utilized to limit the costs of auto-scaling functions. The personnel supporting DMS and AWS RDS are essentially the same people who support the existing Oracle DBs, so the only missing skills are AWS IaaS and Tableau.
Concern 6: Scalability: RDS is easy to scale up with minimal downtime (< 1h). The other components scale automatically, constrained only by budgets.
Concern 7: Manageability: Easy. Use AWS Organizations to set company-wide policies, and use AWS AD Connector to enforce authentication rules and SSO for Tableau. Use MS AD to centrally define authorization rules. CDC ensures near-instant propagation of CRUD operations to the RDS instance.
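As a sketch of the company-wide policy piece, the boto3 calls below create and attach a service control policy; the policy body and the OU id are illustrative assumptions, not part of the original design.

```python
import json
import boto3

org = boto3.client("organizations")

# Example guardrail: deny deleting RDS instances anywhere in the
# analytics OU. Policy content and OU id are hypothetical.
policy = org.create_policy(
    Name="deny-rds-deletion",
    Description="Protect analytics staging databases",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": ["rds:DeleteDBInstance"],
            "Resource": "*",
        }],
    }),
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-root-analytics",  # placeholder OU id
)
```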
In general, this would be the quickest-to-deploy and least costly solution in terms of both infrastructure and people.
For non-real-time use cases (distributed data sets with up to 100 Bn. records and a moderate growth rate) I would use a different approach, replacing RDS with cheaper intermediary storage on S3 and involving a proper data integration toolset. Let’s look at how the conceptual architecture has to change in response to increased data volumes.
With larger data sets we have to introduce two new technical capabilities into our design. The first is an analytic database (e.g. Redshift, Snowflake) that is specifically designed to serve dimensional models and enable data wrangling and self-discovery over large data sets. Data in dimensional models is usually denormalized and, unlike in transactional DBs, not compliant with 3NF or 5NF. That denormalization helps with fast data retrieval by reducing table joins.
Dimensional models are usually built on aggregate data rather than containing every detailed transaction from the source. Think of sales analysis over the following hierarchies: Products, Sales Channels, Customer Types, Locations, Time. These are Dimensions, and sales transactions are Facts. In a dimensional model you analyze Facts over combinations of Dimensions. It is usual to have facts as daily (or hourly) summaries over the lowest dimensional hierarchy levels. Dimensional concepts have been explained in great detail in a well-known book by Bill Inmon, the father of data warehousing [3].
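To make the facts-over-dimensions idea concrete, here is an illustrative star-schema query; all table and column names are hypothetical.

```python
# Monthly sales by region and customer type -- a typical fact/dimension
# query: the fact table holds daily summaries, dimensions hold hierarchies.
STAR_QUERY = """
SELECT d.year_month,
       l.region,
       c.customer_type,
       SUM(f.sales_amount) AS sales_amount
FROM   fact_sales_daily f
JOIN   dim_date     d ON d.date_key     = f.date_key
JOIN   dim_location l ON l.location_key = f.location_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
GROUP  BY d.year_month, l.region, c.customer_type
"""
```

Because the fact rows are already daily aggregates and the joins are on surrogate keys, a query like this stays fast even over billions of records.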
The second capability is a proper data integration toolset that orchestrates all the data transformations. In the long run, centralizing ETL jobs in a specialized tool has many advantages over spreading them across different systems. Technically you could build the transformations even with shell scripts running Python; however, every one of the architectural concerns would be negatively impacted, so I’m ruling that option out.
Cloud Analytics — Enterprise Level Non-Real Time Solution Conceptual Architecture

Attunity (Qlik) has developed a very niche product that has gained popularity over recent years. It’s a set of tools to essentially move (Attunity Replicate) [5] and transform (Attunity Compose) [4] the data. These tools stand out from others thanks to an intuitive interface, simplicity and a wide range of ready connectors.
Technology Stack: SAP ERP, Qlik Attunity Replicate and Compose, AWS S3, AWS Redshift (or Snowflake), Tableau
For the analytic database, I’ve picked the native AWS Redshift. There are no major blockers to picking Snowflake, which has growing popularity, but so far I haven’t seen a proper performance and cost comparison across a range of use-cases, so I’ll keep it as an option here.
Cloud Analytics — Enterprise Level Non-Real Time Solution Technology Architecture

Replicate needs to be located close to the data sources. It collects database changelogs, packs them into small compressed files and sends them to S3 over an encrypted TLS channel. Alternatively, it can publish to a Kafka topic, but that would be for when we need to introduce real-time analysis capabilities.
Compose is essentially a visual pipeline designer. It can integrate with a range of source and target database systems. For our specific use-case, I’m using an EMR cluster for Compose to get data from S3.
Main tasks for Compose are to:
transform the data from a relational into a dimensional model for fast analytic queries (e.g. comparing 5-year monthly sales data across regions and customer types should take seconds, not minutes);
take only what’s needed and leave unnecessary details behind;
consolidate and enrich data sets;
create specialized data presentations (materialized views) for different user roles (the basic design principle of “need to know”).

How it works: a data designer defines transformations using the Compose interface, then Compose creates the target tables in Redshift and starts loading the data. Once the data has landed in Redshift, you can start enjoying Tableau.
An antipattern here would be taking the source data as-is and putting it into Redshift for further processing using Tableau. That is a frequent mistake when building data lakes; it leads to substantially increased Redshift costs and degraded response times in Tableau. Other than for very small use-cases, I would strongly recommend preparing your data before feeding it into Redshift (or any other analytical database).
Tableau setup is straightforward and similar to the previous one.
Concern 1: Availability: AWS S3 is highly redundant, designed to provide 99.999999999% durability and 99.99% availability of objects. EMR is HA by design. Attunity Replicate needs to be configured for HA at the virtualization level on the source system side. Attunity Compose supports an HA configuration with primary and secondary node installations [4]. Tableau and the other components are similar to the Entry Level use case.
Concern 2: Performance: When we are talking about the performance of an analytics system, it depends on two key factors: 1) the right data structure and 2) the right infrastructure. We’ve talked about the right data structure above. Coming to infrastructure: AWS Redshift is designed as a columnar database with an elastic resize feature [7], so that’s what you should be looking at for enterprise-level solutions.
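As an illustration, an elastic resize is a single API call; the cluster identifier and node count below are placeholders.

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize redistributes slices across the new node set in minutes,
# instead of the hours a classic resize would take.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",  # placeholder
    NumberOfNodes=8,
    Classic=False,  # False selects elastic resize
)
```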
Tableau performance notes covered above — no change here.
Concern 3: Reliability: All AWS services used here are managed services, so their reliability is provided out of the box by the Multi-AZ design. On-prem nodes need to be considered separately depending on the virtualization technology in use.
Concern 4: Recoverability: Convenient recoverability is achieved by 1) building the AWS environment as code (CloudFormation) and keeping it in the backup region; 2) enabling S3 cross-region replication to survive an outage of a whole region; 3) regularly taking images of the Replicate and Compose hosts and keeping them in the backup region.
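A hedged sketch of point 2, assuming an IAM role for replication already exists; the bucket and role names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Cross-region replication requires versioning on both buckets.
s3.put_bucket_versioning(
    Bucket="analytics-landing",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="analytics-landing",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-crr",  # placeholder
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Prefix": "",  # replicate the whole bucket
            "Destination": {"Bucket": "arn:aws:s3:::analytics-landing-dr"},
        }],
    },
)
```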
Concern 5: Cost: With unmatched performance for analytics queries, Redshift (as well as Snowflake or any other dedicated analytics database) is more expensive than a simple RDS. I would expect Redshift to cost less if you can commit to a certain level of usage, which is usually easy to do with enterprise-level solutions. Qlik doesn’t publish prices, but you can get a feeling for the price from AWS Marketplace [8]; a typical Replicate price is $1.866/hr.
Concern 6: Scalability: We need to consider user NFRs for Tableau and Redshift sizing. Data velocity and volume will be the input for EMR and Compose node sizing, as well as for the connections to the on-prem infrastructure. S3 doesn’t need any capacity configuration.
Concern 7: Manageability: As we have centralized all our transformations in Compose, managing them becomes very simple. New data structures in Redshift can be made ready in hours.
The enterprise non-real-time pattern should be considered for a wide range of solutions. Provided we have the agility to scale up or down quickly in the cloud, I would use this even for medium-sized analytics. Using dedicated tools for data integration pays back quickly once you start operationalising the solution and look to enable CI/CD for analytics.
This option should be considered when you either have a large number of source systems or are building real-time analytics. This pattern is very much at the centre of every real-time analytics solution, repeating what I’ve described in one of my previous posts, the Real-Time Security Data Lake [9].
Use cases could be log analytics for security/website/mobile app data, IoT event data, machinery logs and many others.
The conceptual architecture for the downstream remains mostly the same; however, the upstream has been changed to add streaming capability for real-time data pipelines.
Real-Time Cloud Analytics — Conceptual Architecture

A real-time data pipeline introduces stringent requirements for response time at any scale. Even where vendors refer to near real-time capability (e.g. in Compose), I would be hesitant to put it into this architecture. The Kinesis family (Firehose and Analytics), as well as Kafka, has been designed from the ground up for streaming use-cases. When we use the right tools in the relevant scenarios, we suddenly gain more hidden capabilities that we can start using with little effort.
Technology Stack: AWS Kinesis Data Firehose, AWS Kinesis Data Analytics (a managed Apache Flink installation), AWS Redshift, Tableau, and AWS Athena with Amazon S3 for ad-hoc queries.
Real-Time Cloud Analytics — Technology Architecture

Streaming data pipelines require a dedicated tool, and Apache Flink is one with growing popularity. AWS Kinesis Data Analytics is a managed Flink installation with tight integration into other AWS services; that’d be my preference for the streaming DI tool. Data gets curated, transformed and aggregated as needed in Kinesis Data Analytics, then pushed into Redshift as part of the streaming data ingestion.
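To sketch the Firehose-to-Redshift leg: Firehose stages batches in S3 and then issues a COPY into Redshift on our behalf. The ARNs, JDBC URL, table name and credentials below are placeholders, not values from this solution.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-to-redshift",
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery",
        "ClusterJDBCURL": "jdbc:redshift://cluster:5439/analytics",
        "CopyCommand": {
            "DataTableName": "event_facts",  # hypothetical target table
            "CopyOptions": "FORMAT AS JSON 'auto' GZIP",
        },
        "Username": "loader",
        "Password": "***",
        "S3Configuration": {  # intermediate staging bucket for COPY
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery",
            "BucketARN": "arn:aws:s3:::events-staging",
        },
    },
)

# Producers then push events with a single call:
firehose.put_record(
    DeliveryStreamName="events-to-redshift",
    Record={"Data": b'{"event": "click", "ts": 1600000000}\n'},
)
```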
Kinesis Data Firehose can be further integrated with the AWS CloudFront CDN network to provide closer data collection end-points and protection from a range of attacks on the internet.
I’ve borrowed the image below from the AWS blog [10] to illustrate how an accelerated and more secure front door can be configured with the AWS CloudFront CDN.
Global Data Ingestion with Amazon CloudFront and Lambda@Edge

Concern 1: Availability: AWS Kinesis and CloudFront are HA by design.
Concern 2: Performance: All components are easy to scale up or down to any size with little or no reconfiguration.
Concern 3: Reliability: By design
Concern 4: Recoverability: Similar to non-real-time.
Concern 5: Cost: Operational costs would be slightly higher than for the non-real-time version, as maintaining or adding new streaming pipelines with Flink requires more advanced skills; Flink pipelines are coded in Java. Infrastructure costs would scale within budget constraints.
Concern 6: Scalability: This is a massively scalable solution. The largest known use of a Flink-based streaming pipeline is Keystone (Netflix’s real-time stream processing platform), handling 3 trillion events per day [11].
Concern 7: Manageability: All streaming components are native to AWS, so I would expect a small operational overhead to keep this running. New pipeline development would require Java skills, with no GUI. Debugging streaming applications is more difficult than debugging static ones.
Streaming use-cases are growing fast with the growing number of connected devices. The approach described in this section enables both real-time analytics and the ability to implement an automated event-based response system that reacts to an event as it happens.
The Entry Level solution is easy to start with but limited in scalability; infrastructure and operational costs will increase linearly with growing data.
The Enterprise Non-Real Time approach can be used for a wide range of cases, provided we have the agility to scale up or down quickly in the cloud. Using dedicated tools for data integration pays back quickly once you start operationalising the solution and look to enable CI/CD for analytics.
Real-Time Analytics is costly to maintain, as it requires skilled Java developers to build and support the data pipelines; however, this approach enables a new generation of event-based intelligent response systems.
If you have any questions or comments, feel free to reach out via https://www.linkedin.com/in/fgurbanov/
[1] Scaling Tableau Server on AWS
[2] Tableau Server on AWS for healthcare
[3] Book: Building the Data Warehouse, Inmon, W. H.
[4] Qlik Attunity Compose v6.6 User Guide (pdf)
[5] Qlik Attunity Replicate v6.4 User Guide (pdf)
[6] SAP Datasheet (SAP ERP data structures and more)
[7] AWS Redshift Elastic resize
[8] Qlik (Attunity) Replicate hourly pricing on AWS Marketplace
[9] Real-Time Security Data Lake
[10] Global Data Ingestion with Amazon CloudFront and Lambda@Edge
[11] Keystone Routing Pipeline at Netflix (presented at Flink Forward San Francisco 2018)
Translated from: https://medium.com/data-analytics-reference-architectures/design-patterns-for-cloud-analytics-b6d29ec4859e