Data Engineering at Dwelo

Technology, 2025-03-22

A reflection on my three-month journey.

In large companies, data is typically stored in a distributed manner, so it is important to build reliable data pipelines that handle both batch and incremental loads.

Big Data: Goals & Challenges

How do you develop pipelines for incrementally loading data?

How do you debug transformation logic in a highly distributed environment?

How do you optimize the running of pipelines and ensure reliability and availability?

How do you deal with endless changes in the underlying technology stack?

How does the system handle the propagation of upstream changes?

Dwelo uses bleeding-edge data warehouse technology to architect distributed systems, create reliable pipelines, and collaborate with data science teams to build the right solutions for them.

Here’s a brief description of the technologies we use:

1. Airflow + Docker + Kubernetes (Scalable and painless data pipeline)

Airflow:

Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs). Its Python code base makes it easy to extend.

Web UI: DAGs at Dwelo

The Airflow UI above lets any user visualize a DAG in a graph view, with code serving as configuration. The author of a data pipeline must define the dependency structure among tasks, breaking complex workflows into granular parts that are safer, more modular, and reusable. This specification is usually written in a DAG definition file, which lays out the anatomy of an Airflow job.
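To make the idea of a dependency structure concrete, here is a minimal sketch in plain Python. It is not Airflow's API and not Dwelo's actual DAG: the task names are hypothetical, and the standard-library graphlib module stands in for the scheduler's ordering logic.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract_devices": set(),
    "extract_events": set(),
    "load_to_warehouse": {"extract_devices", "extract_events"},
    "run_transformations": {"load_to_warehouse"},
    "data_quality_checks": {"run_transformations"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, every task appears after all of the tasks it depends on. The two extract tasks have no dependency on each other, so a scheduler is free to run them in parallel.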

Advantages of Airflow:

Handle task failures

Report/Alert on failures

Enforce SLAs

Easily scale for growing load
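The first two advantages, handling failures and alerting on them, can be sketched in plain Python. This is an illustration only, not Airflow code. The parameter names loosely mirror Airflow's task-level retries, retry_delay, and on_failure_callback settings, while the helper function and the flaky task are hypothetical.

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0, on_failure=None):
    """Run task(); retry up to `retries` times, then alert and re-raise."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if attempt >= retries:
                # Retries exhausted: report the failure, then propagate it.
                if on_failure is not None:
                    on_failure(exc)
                raise
            attempt += 1
            time.sleep(retry_delay)


# Hypothetical flaky task: fails twice, then succeeds on the third call.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


alerts = []
result = run_with_retries(flaky_task, retries=2, on_failure=alerts.append)
print(result, calls["count"], alerts)
```

The alert callback fires only once the retries are exhausted, so transient failures do not generate noise.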

Docker:

Developing apps today requires much more than writing code. Multiple languages, frameworks, architectures, and discontinuous interfaces between tools for each lifecycle stage create enormous complexity. Docker simplifies and accelerates your workflow, while giving developers the freedom to innovate with their choice of tools, application stacks, and deployment environments for each project.

Advantages:

Reproducibility

Isolation

Portability

Shareability

Kubernetes:

Kubernetes is a system for running and coordinating containerized applications across a cluster of machines. It is a platform designed to manage the complete lifecycle of containerized applications and services, using methods that provide predictability, scalability, and high availability.

For more information and the latest updates on Airflow, refer to the following link: https://github.com/jghoman/awesome-apache-airflow

2. GCP (Storage and Efficiency)

Google Cloud Platform (GCP) is a suite of cloud computing services that runs on the same infrastructure Google uses internally for its end-user products, such as Google Search and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics, and machine learning.

Google BigQuery was designed as a "cloud-native" data warehouse. It was built to address the needs of data-driven organizations in a cloud-first world. We can ingest data into BigQuery either by uploading it in batches or by streaming it directly, which enables real-time insights.

BigQuery exposes two graphical web UIs that you can use to create and manage BigQuery resources and to run SQL queries: the BigQuery web UI in the Cloud Console and the classic BigQuery web UI.

The pricing model is simple. We pay for data storage, streaming inserts, and querying data. Loading and exporting data are free of charge. Storage costs are based on the amount of data stored. For queries, you can choose to pay per query or pay a flat rate for dedicated resources. At Dwelo, we use the billing dashboard in GCP to identify the queries with the highest usage and spend, and then optimize those queries to reduce costs.
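The per-query arithmetic is easy to sketch. The snippet below is only an illustration: the price constant is a placeholder rather than an actual GCP rate, and billing per tebibyte processed is an assumption to check against the current pricing page.

```python
TIB = 1024 ** 4  # bytes in one tebibyte

# Placeholder value for illustration only, not an actual GCP price.
ASSUMED_PRICE_PER_TIB = 5.0


def estimate_query_cost(bytes_processed, price_per_tib):
    """Estimate on-demand cost: bytes processed, converted to TiB, times the price per TiB."""
    return bytes_processed / TIB * price_per_tib


# Scanning a whole 2 TiB table versus roughly one day out of a year of data.
full_scan = estimate_query_cost(2 * TIB, ASSUMED_PRICE_PER_TIB)
one_partition = estimate_query_cost(2 * TIB / 365, ASSUMED_PRICE_PER_TIB)
print(f"full scan: ${full_scan:.2f}, single daily partition: ${one_partition:.4f}")
```

Under on-demand pricing the cost scales with bytes processed, which is why reducing the amount of data a query scans is the main lever for cutting spend.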

3. dbt Models (Data Transformations)

ETL pipelines are a fundamental component of any data system. They extract data from many disparate sources, transform (a.k.a. wrangle) the data, often to fit the data model defined by your data warehouse, and then load it into the warehouse.
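The three stages can be shown with a toy example. This is a minimal sketch with a hypothetical CSV extract and column names, and with an in-memory list standing in for a warehouse table.

```python
import csv
import io

# Hypothetical raw extract; note the stray whitespace around the first id.
RAW_CSV = "device_id,temp_f\n a1 ,68\nb2,77\n"


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(records):
    """Transform: trim ids and convert Fahrenheit to Celsius to fit the target model."""
    return [
        {
            "device_id": record["device_id"].strip(),
            "temp_c": round((float(record["temp_f"]) - 32) * 5 / 9, 1),
        }
        for record in records
    ]


def load(records, warehouse_table):
    """Load: append transformed records to the in-memory warehouse table."""
    warehouse_table.extend(records)
    return len(warehouse_table)


warehouse_table = []
load(transform(extract(RAW_CSV)), warehouse_table)
print(warehouse_table)
```

Keeping each stage as its own function makes the transformation logic testable in isolation from the source and the destination.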

How do we handle data transformations? dbt is a new way to transform data and build pipelines. It applies the principles of software engineering to analytics code, an approach that dramatically increases your leverage as a data analyst.

The following are four approaches to structuring a dbt project:

Monolithic

Micro-services

Layers

Layers & Verticals

Dwelo uses the monolithic approach.

Advantages:

Easier to debug, test, and deploy

Comprehensive data lineage and dbt documentation

Macros are defined in one place to help standardise data transformations

Easier to enforce standards with everything in a single place

The following is a code snippet from one of the underlying models, used to explain configuration, partitioning, and clustering:

Configuration:

Materializations are strategies for persisting dbt models in a warehouse. There are four types of materializations built into dbt:

table

view

incremental

ephemeral

Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.
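The idea behind an incremental model can be sketched in plain Python as a high-water-mark filter. This is an illustration, not dbt code: the rows and the updated_at column are hypothetical. In dbt the equivalent filter is usually written in SQL inside an is_incremental() block.

```python
def incremental_load(source_rows, target_rows, timestamp_key="updated_at"):
    """Append only source rows newer than the newest row already in the target."""
    high_water_mark = max((row[timestamp_key] for row in target_rows), default=None)
    new_rows = [
        row
        for row in source_rows
        if high_water_mark is None or row[timestamp_key] > high_water_mark
    ]
    target_rows.extend(new_rows)
    return len(new_rows)


# Hypothetical source rows; ISO date strings compare correctly as text.
source = [
    {"id": 1, "updated_at": "2020-06-01"},
    {"id": 2, "updated_at": "2020-06-02"},
    {"id": 3, "updated_at": "2020-06-03"},
]
target = []
first_run = incremental_load(source[:2], target)  # initial load: both rows are new
second_run = incremental_load(source, target)     # only the 2020-06-03 row is new
print(first_run, second_run, len(target))
```

On the second run only one row is transformed and written, even though the source holds three. That difference is the saving the incremental materialization is after.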

Partition:

BigQuery supports a PARTITION BY clause that makes it easy to partition a table by a column or expression. Partitioning can help decrease latency and cost when querying large tables. Previously only date fields could be used for partitioning, but in December 2019 Google released a new capability: integer range partitioning. This feature stores all the values that fall within the same range in the same partition. With integer partitioning, BigQuery effectively lets you partition on any field (Float, String, Date, and so on). To achieve this, you transform the partition field into an integer value both when storing and when querying your data.
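The two ideas in this paragraph, range buckets and turning an arbitrary field into an integer, can be sketched as follows. This is a minimal illustration, not BigQuery's implementation: the function names are hypothetical, and zlib.crc32 is used only as a convenient stable hash from the standard library.

```python
import zlib


def range_partition_index(value, start, end, interval):
    """Return the index of the integer-range partition holding `value`, or None if outside [start, end)."""
    if value < start or value >= end:
        return None
    return (value - start) // interval


def string_to_partition_key(text, num_buckets):
    """Map a string to a stable integer in [0, num_buckets) so it can be range-partitioned."""
    return zlib.crc32(text.encode("utf-8")) % num_buckets


# Values in the same range land in the same partition.
print(range_partition_index(1234, start=0, end=10000, interval=1000))  # 1
print(range_partition_index(1999, start=0, end=10000, interval=1000))  # 1

# The same transformation must be applied when writing and when querying.
key = string_to_partition_key("device-42", num_buckets=4000)
print(key)
```

Since the hash is deterministic, a query that filters on the transformed value of a string touches only the partition that string was written to.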

Clustering:

BigQuery tables can be clustered to colocate related data, which helps narrow the volume of data the database has to scan. The column order is extremely important in clustering. Clustering used to be supported only on partitioned tables, but as of June 2020 it can be applied to any table.
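A toy model helps show why column order matters. The sketch below is not BigQuery's storage engine: it simply sorts rows by the cluster columns, cuts them into fixed-size blocks, and counts how many blocks a filter cannot rule out from each block's min/max values. The table and column names are hypothetical.

```python
def build_blocks(rows, cluster_columns, block_size):
    """Sort rows by the cluster columns (in order) and split them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda row: tuple(row[c] for c in cluster_columns))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def blocks_scanned(blocks, column, value):
    """Count blocks whose [min, max] range for `column` could contain `value`."""
    count = 0
    for block in blocks:
        values = [row[column] for row in block]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Hypothetical table: 4 communities x 8 devices, clustered by (community_id, device_id).
rows = [{"community_id": c, "device_id": d} for c in range(4) for d in range(8)]
blocks = build_blocks(rows, ["community_id", "device_id"], block_size=8)

by_first = blocks_scanned(blocks, "community_id", 2)
by_second = blocks_scanned(blocks, "device_id", 5)
print(by_first, by_second)
```

Filtering on the leading cluster column lets most blocks be skipped. Filtering only on the second column skips none in this toy layout, which is the intuition for putting the most frequently filtered column first.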

Some of the best ETL practices that we follow:

Partition data tables: Partitioning your tables by date and querying only the relevant partition helps reduce the cost of processing queries and improves performance. For example, WHERE _PARTITIONDATE = "2017-01-01" scans only the January 1, 2017 partition.

Load data incrementally: Since we use on-demand pricing, we are charged for the number of bytes processed, whether the data is housed in BigQuery or in external data sources. Incremental loading reduces the amount of data transferred, whereas a full load may take hours or days to complete depending on the volume of data. Even when a full load takes only 2-3 minutes, it is quite expensive.

Modularity: Breaking our model logic into base and staging models that then feed dim models keeps things modular and more manageable. Additionally, the ref function encourages you to write modular transformations, so that you can reuse models and reduce repeated code.

Add data checks early and often: When processing data, it is useful to write the data into a staging table, check its quality, and only then swap the staging table in as the final production table.
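The stage, check, and swap pattern from the last practice can be sketched with the standard-library sqlite3 module standing in for the warehouse. This is an illustration under assumptions: the table names, columns, and the two checks are hypothetical, and a real pipeline would run its checks against BigQuery instead.

```python
import sqlite3


def publish_if_valid(conn, rows):
    """Load rows into a staging table, validate them, then swap staging into production."""
    conn.execute("DROP TABLE IF EXISTS readings_staging")
    conn.execute("CREATE TABLE readings_staging (device_id TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings_staging VALUES (?, ?)", rows)

    # Data checks: the batch must be non-empty and contain no NULL device ids.
    total, null_ids = conn.execute(
        "SELECT COUNT(*), SUM(device_id IS NULL) FROM readings_staging"
    ).fetchall()[0]
    if total == 0 or null_ids:
        conn.rollback()  # discard the bad batch; production stays untouched
        return False

    # Checks passed: replace the production table with the staged data.
    conn.execute("DROP TABLE IF EXISTS readings")
    conn.execute("ALTER TABLE readings_staging RENAME TO readings")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
good = publish_if_valid(conn, [("dev-1", 20.5), ("dev-2", 21.0)])
bad = publish_if_valid(conn, [(None, 19.0)])
count = conn.execute("SELECT COUNT(*) FROM readings").fetchall()[0][0]
print(good, bad, count)
```

The second batch fails its check, so the production table keeps the rows from the first batch and consumers never see the bad data.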

Data engineering is a specialized skill that is often not seen as a necessity. The need for it is realized only when enterprises are stuck on ROI (return on investment), are limited by scale, or lack the analytics velocity of the behemoths worldwide. Data engineers are the heroes working in the shadows to ensure that enterprises get the right data, at the right time, to the right people.

It’s a long journey, and we are all still learning.


Original article: https://medium.com/dwelo-r-d/data-engineering-at-dwelo-1a68a212cf17
