What Is a Data Warehouse? Basic Architecture

    Technology, 2022-08-03

    A Data Warehouse is a component where your data is centralized, organized, and structured according to your organization's needs. It is used for data analysis and BI processes.

    The Data Warehouse is not a new concept. It was developed in the late 1980s and has evolved over time.

    The aim of this post is to explain the main concepts related to Data Warehouses and their use cases. Also, we’ll talk about Data Lakes and how these two components work together.

    TL;DR: This post covers basic information about Data Lakes and Data Warehouses. If you are already familiar with these topics and their basic architecture, this post may not be for you. If you are not, please go ahead and enjoy the read.

    Why do you need a Data Warehouse?

    In the beginning, there was chaos. At least, that was my impression when I arrived at an organization that was doing data analysis using old spreadsheets and a bunch of CSV files. No one knew where the files came from. They were just…there.

    Inconsistent metrics, unreproducible processes, and a lot of manual copy/paste work were common at that time.

    No one even knew the real value of the metrics they were tracking. For example, for a metric like Monthly Active Users (MAU), the answer would always depend on who you asked.

    If you are still with me and this rings a bell, you probably know how important it is to have a single source of truth, mainly because you don’t want many business users making decisions based on inconsistent metrics.

    Also, you don’t want your data engineers and analysts doing a bunch of manual work that can be automated. They can certainly do more interesting things than copy and paste spreadsheets.

    If this is a problem your organization faces every day, you may need a Data Warehouse.

    So, let me now define what a Data Warehouse is.

    A Data Warehouse is a component where your data is centralized, organized, and structured according to your organization’s needs. It is used for data analysis and BI processes.

    Put simply, you may need a Data Warehouse if:

    Several people work with the data and they need it to be consistent, i.e., they need a single source of truth so they can make more informed decisions.

    Your data comes from several sources, and integrating them manually is not easy.

    You want to automate manual processes that require you to repeat yourself.

    You want to do data analysis based on clean, organized, and structured data.

    You have the resources to put in place the processes needed to maintain a Data Warehouse.

    Data Warehouse Basic Concepts

    Now that you know why you need a Data Warehouse, let’s explore some of its basic concepts.

    If you want to integrate multiple data sources and structure the data so that you can analyze it, you have to centralize it. This is where ETL (Extract, Transform, and Load) processes come in.

    Basically, ETL processes extract the data from the sources, transform it into a usable form, and load it into the Data Warehouse, so that you can do some cool analytics and BI processes.

    [Figure: An illustration of a classic ETL process. Illustration made by the author.]
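
    The three ETL steps can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the CSV content, the `users` table, and the function names are made up for the example, and an in-memory SQLite database stands in for a real Data Warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from a production system.
SOURCE_CSV = """user_id,signup_date,country
1,2022-01-03, us
2,2022-01-15,DE
3,,fr
"""


def extract(raw_csv: str) -> list[dict]:
    """Extract: read the rows from the source as they are."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop rows without a signup date and normalize country codes."""
    return [
        (int(r["user_id"]), r["signup_date"], r["country"].strip().upper())
        for r in rows
        if r["signup_date"]
    ]


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load: write only the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id INTEGER, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real Data Warehouse
load(warehouse, transform(extract(SOURCE_CSV)))
etl_users = warehouse.execute("SELECT * FROM users ORDER BY user_id").fetchall()
print(etl_users)
```

    Note that the row without a signup date and the original country strings never reach the warehouse: only the transformed result is stored.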

    However, ETL processes are now considered the legacy approach. Some of the problems with ETL processes are:

    There is no record of the data in its original form, since transformation happens on the way to the Data Warehouse. This can make transformation processes difficult to reproduce, even more so if your data is not immutable, i.e., the data sources are constantly changing.

    It can be difficult to introduce changes into the transformation logic, since doing so may require you to reprocess past data that has already been transformed. Depending on your requirements, this can be hard to achieve if you don’t have the data in its original form.

    The data architecture that supports ETL processes can be more complex to maintain, since you may have to put in place additional resources to run the transformations.

    There is another approach similar to ETL: ELT. ELT (Extract, Load, and Transform) processes are considered the modern approach. Basically, they perform the same steps but in a different order. Some of the key advantages of this approach are:

    Data can be extracted in its original form, which keeps the logic of the extraction processes simple.

    Data in its original form can be stored in a staging area. This gives you immutable data, which makes transformation processes easy to reproduce.

    Transformations can run on the compute power of modern Data Warehouses, so you don’t need additional resources to perform them.

    ELT-based architectures can be simpler to maintain, depending on your setup.
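
    For comparison, here is a minimal ELT sketch under the same kind of assumptions: the CSV content and the `staging_users` and `users` tables are hypothetical, and an in-memory SQLite database plays the role of the Data Warehouse. The raw rows are loaded untouched first, and the transformation then runs as SQL inside the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export, including a messy and an incomplete row.
SOURCE_CSV = """user_id,signup_date,country
1,2022-01-03, us
2,2022-01-15,DE
3,,fr
"""

conn = sqlite3.connect(":memory:")  # stand-in for a real Data Warehouse

# Extract + Load: copy every source row, untouched, into a staging table.
conn.execute(
    "CREATE TABLE staging_users (user_id TEXT, signup_date TEXT, country TEXT)"
)
raw_rows = [
    (r["user_id"], r["signup_date"], r["country"])
    for r in csv.DictReader(io.StringIO(SOURCE_CSV))
]
conn.executemany("INSERT INTO staging_users VALUES (?, ?, ?)", raw_rows)

# Transform: runs inside the warehouse, reading staging and writing a new table.
TRANSFORM_SQL = """
CREATE TABLE users AS
SELECT CAST(user_id AS INTEGER) AS user_id,
       signup_date,
       UPPER(TRIM(country)) AS country
FROM staging_users
WHERE signup_date <> ''
"""
conn.execute("DROP TABLE IF EXISTS users")
conn.execute(TRANSFORM_SQL)

staged_count = conn.execute("SELECT COUNT(*) FROM staging_users").fetchone()[0]
elt_users = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()
print(staged_count, elt_users)
```

    Because the staging table still holds all three original rows, changing the transformation later only means editing the SQL and rebuilding `users`.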

    Staging area

    According to Maxime Beauchemin, the staging area of a Data Warehouse should ideally be immutable, i.e., it should be an area where all your data is kept in its original form, so it can serve as the loading dock of your Data Warehouse.

    The staging area allows you to take the data in its original form and perform transformation processes on top of it without actually changing the data. So, basically, you are taking data in its original form as an input to generate new data as an output.

    This concept is important because, if you need to change some logic in the transformation processes, reprocessing the data is easier when you have it in its original form. Keep in mind that this is an ideal state, so achieving it can sometimes be difficult.

    An immutable staging area should allow you to recompute the state of the warehouse from scratch if you need to. This can be achieved by implementing functional transformation processes and pure tasks; see this post for more info. Also, check this post for an example implementation of the concept of functional data engineering.
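
    One way to read the idea of a pure task is sketched below. It is an interpretation under assumptions, not the implementation from the posts mentioned above: the `staging_logins` and `daily_logins` tables and the `build_daily_logins` function are hypothetical. The task reads only from the append-only staging table and fully rewrites its own output partition, so re-running it gives the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real Data Warehouse
# Staging table: treated as append-only and never updated in place.
conn.execute("CREATE TABLE staging_logins (user_id INTEGER, login_date TEXT)")
conn.execute("CREATE TABLE daily_logins (login_date TEXT PRIMARY KEY, users INTEGER)")
conn.executemany(
    "INSERT INTO staging_logins VALUES (?, ?)",
    [(1, "2022-01-03"), (2, "2022-01-03"), (1, "2022-01-03"), (1, "2022-01-04")],
)


def build_daily_logins(conn: sqlite3.Connection, day: str) -> None:
    """Pure task: the output for `day` depends only on the staging rows for `day`.

    The target partition is deleted and rewritten, so running the task once or
    many times leaves the warehouse in the same state.
    """
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM daily_logins WHERE login_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_logins
            SELECT login_date, COUNT(DISTINCT user_id)
            FROM staging_logins
            WHERE login_date = ?
            GROUP BY login_date
            """,
            (day,),
        )


# Re-running the same task does not duplicate or change its output.
for _ in range(3):
    build_daily_logins(conn, "2022-01-03")
build_daily_logins(conn, "2022-01-04")
daily = conn.execute("SELECT * FROM daily_logins ORDER BY login_date").fetchall()
print(daily)
```

    Recomputing the warehouse from scratch then amounts to running the task for every day present in staging.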

    Data Lakes

    A Data Lake can be defined as a repository of multiple sources where data is stored in its original format.

    A Data Lake is similar to the staging area of a Data Warehouse (see this post for more info), but it also solves some problems that Data Warehouses do not address, for example, dealing with semi-structured and unstructured data such as JSON files, XML files, and so on.
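
    As a rough illustration of storing data in its original format, the sketch below writes a JSON payload to a lake folder exactly as received. Everything here is an assumption made for the example: the `land_raw` helper, the `source/date=.../file` layout, and the use of a temporary local directory in place of real object storage.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, day: str, name: str, payload: bytes) -> Path:
    """Write a payload to the lake exactly as received, refusing to overwrite."""
    target = lake_root / source / f"date={day}" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "xb") as f:  # mode "x" fails if the file already exists
        f.write(payload)
    return target


lake = Path(tempfile.mkdtemp())  # stand-in for object storage
raw = b'{"id": 7, "tags": ["a", "b"], "profile": {"plan": "pro"}}'
path = land_raw(lake, "crm", "2022-08-03", "contacts.json", raw)

# Readers parse the file later; the stored bytes stay identical to the source.
record = json.loads(path.read_bytes())
print(path.relative_to(lake), record["profile"]["plan"])
```

    No schema is imposed at write time, so nested or irregular records can be stored without any upfront modeling.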

    Data Warehouse Architecture

    At this point, you may wonder about how Data Warehouses and Data Lakes work together.

    Put simply, you can build a Data Warehouse on top of a Data Lake by putting ELT processes in place and following some architectural principles.

    Check this post for more information about these principles.

    A basic architecture that implements the approach explained above may look like this:

    [Figure: An abstraction of a Data Pipeline Architecture. Illustration made by the author.]

    These stages can be described as follows:

    Sources: Data coming from business operations. This data might come from the production databases of your organization, or from other sources of interest to your organization, e.g., CRMs, APIs, etc.

    Data Lake: The repository where all your data sources should be centralized. It can act as the staging area of your Data Warehouse. Ideally, it should contain immutable data, so you can easily guarantee process reproducibility.

    Data Warehouse: The place where all your data is structured according to your data analysis needs. It is built on top of the Data Lake. Basically, you take data from the Data Lake as input and generate new views of that data in the Data Warehouse by applying some transformation logic.

    Visualizations: All the tools and processes that let you do cool analytics by plotting charts, e.g., Metabase, Tableau, and so on. These tools should let you visualize the data structured in the Data Warehouse. All the heavy work related to complex calculations should be performed in the Data Warehouse.
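
    The last three stages can be sketched end to end with a tiny example. It assumes a hypothetical `lake_logins` table of raw login events and a `monthly_active_users` view, with in-memory SQLite standing in for both the lake and the warehouse. The point is that the MAU definition lives in one place in the warehouse, and the visualization layer only reads it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data Lake / staging layer: raw login events (hypothetical schema).
conn.execute("CREATE TABLE lake_logins (user_id INTEGER, login_ts TEXT)")
conn.executemany(
    "INSERT INTO lake_logins VALUES (?, ?)",
    [
        (1, "2022-07-01T09:00:00"),
        (1, "2022-07-15T12:00:00"),
        (2, "2022-07-20T18:30:00"),
        (2, "2022-08-01T08:00:00"),
        (3, "2022-08-02T10:15:00"),
        (4, "2022-08-03T07:45:00"),
    ],
)

# Data Warehouse layer: one shared definition of Monthly Active Users.
conn.execute(
    """
    CREATE VIEW monthly_active_users AS
    SELECT substr(login_ts, 1, 7) AS month,
           COUNT(DISTINCT user_id) AS mau
    FROM lake_logins
    GROUP BY month
    """
)

# Visualization layer: a dashboard would only run a simple read like this.
mau = conn.execute(
    "SELECT month, mau FROM monthly_active_users ORDER BY month"
).fetchall()
print(mau)
```

    Every chart that reads from this view reports the same number, which is the single source of truth the earlier sections argue for.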

    Conclusions

    In this post, we addressed some basic concepts related to Data Warehouses and Data Lakes.

    Also, we addressed how these two components can complement each other by assembling the right architecture.

    You should be aware there is more on this topic that you should check out.

    For example, once you have the initial setup for a Data Warehouse, there are several processes you should put in place to improve its operability and performance. See this post for more info.

    I hope you find this information useful.

    Thanks for reading until the end.

    See you in the next post!

    Translated from: https://towardsdatascience.com/what-is-a-data-warehouse-basic-architecture-ea2cd12c9bb0
