A reflection on my three-month journey.
In large companies, data is typically stored in a distributed manner, so it is important to build a reliable data pipeline that handles both batch and incremental loads of data.
Dwelo uses bleeding-edge data warehouse technology to architect distributed systems, create reliable pipelines, and collaborate with data science teams to build the right solutions for them.
Here is a brief description of the technologies we use:
Airflow:
Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs). Its Python code base makes it easy to extend.
Web UI: DAGs at Dwelo
The Airflow UI above lets any user visualize a DAG in a graph view. Pipelines are configured as code: the author of a data pipeline defines the structure of dependencies among tasks, breaking complex workflows down into granular parts that are safer, more modular, and reusable. This specification is usually written in a file called the DAG definition file, which lays out the anatomy of an Airflow job.
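The idea of declaring dependencies among tasks can be sketched without Airflow itself. The example below is a minimal illustration built on Python's standard graphlib module (Python 3.9 or later), not Airflow's API, and the task names are hypothetical. It shows how a dependency map determines a valid execution order and how a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(dependencies):
    """Return one valid run order; raises CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(dependencies).static_order())


def is_dag(dependencies):
    """True when the dependency map contains no cycles."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(PIPELINE))
```

Both extract tasks may appear in either order because neither depends on the other, which is exactly the freedom a scheduler uses to run independent tasks in parallel.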
Advantages of Airflow:
Handle task failures
Report/alert on failures
Enforce SLAs
Easily scale for growing load
Docker:
Developing apps today requires much more than writing code. Multiple languages, frameworks, architectures, and discontinuous interfaces between the tools for each lifecycle stage create enormous complexity. Docker simplifies and accelerates your workflow, while giving developers the freedom to innovate with their choice of tools, application stacks, and deployment environments for each project.
Advantages:
Reproducibility
Isolation
Portability
Shareability
Kubernetes:
Kubernetes is a system for running and coordinating containerized applications across a cluster of machines. It is a platform designed to completely manage the life cycle of containerized applications and services using methods that provide predictability, scalability, and high availability.
For more information and the latest updates on Airflow, see https://github.com/jghoman/awesome-apache-airflow
Google Cloud Platform (GCP) is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics, and machine learning.
Google BigQuery was designed as a "cloud-native" data warehouse. It was built to address the needs of data-driven organizations in a cloud-first world. We can ingest data into BigQuery either by loading it in batches or by streaming it, which enables real-time insights.
BigQuery exposes two graphical web UIs that you can use to create and manage BigQuery resources and to run SQL queries: the BigQuery web UI in the Cloud Console and the classic BigQuery web UI.
The pricing model is simple. We pay for data storage, streaming inserts, and querying data. Loading and exporting data are free of charge. Storage costs are based on the amount of data stored. For queries, you can choose to pay per query (on demand, by bytes processed) or a flat rate for dedicated resources. At Dwelo, we use the billing dashboard in GCP to identify the queries with the highest usage and spend, then optimize those queries to reduce costs.
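To make the on-demand arithmetic concrete, here is a small sketch that estimates the cost of a query from the number of bytes it processes. The price per TiB is a placeholder assumption, not a quoted rate, and the function name and example volumes are hypothetical. The sketch also ignores any free tier or minimum charge, so treat it as a rough comparison tool rather than a billing calculator.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_query_cost(bytes_processed, price_per_tib):
    """Estimate on-demand cost: bytes scanned, converted to TiB, times the rate."""
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    return bytes_processed / TIB * price_per_tib


if __name__ == "__main__":
    ASSUMED_PRICE_PER_TIB = 5.0  # placeholder rate in USD, an assumption
    full_scan = on_demand_query_cost(2 * TIB, ASSUMED_PRICE_PER_TIB)
    pruned_scan = on_demand_query_cost(TIB // 4, ASSUMED_PRICE_PER_TIB)
    print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

Because cost scales linearly with bytes scanned, any change that lets a query read less data (selecting fewer columns, filtering on a partition) reduces spend in direct proportion.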
ETL pipelines are a fundamental component of any data system. They extract data from many disparate sources, transform (a.k.a. wrangle) the data, often to make it fit the data model defined by your data warehouse, and then load it into the warehouse.
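As a rough illustration of the three ETL stages, the sketch below extracts rows from two hypothetical in-memory sources, transforms them to a single schema, and loads them into a SQLite table that stands in for the warehouse. The source names, field names, and table name are all assumptions made for the example.

```python
import sqlite3

# Hypothetical source data with inconsistent shapes and units.
SOURCE_A = [{"id": 1, "temp_f": 68.0}, {"id": 2, "temp_f": 50.0}]
SOURCE_B = [{"device": 3, "temp_c": 21.5}]


def extract():
    """Pull raw records from every source, tagging where each came from."""
    return [("a", r) for r in SOURCE_A] + [("b", r) for r in SOURCE_B]


def transform(raw):
    """Normalize every record to (device_id, temp_c)."""
    rows = []
    for source, rec in raw:
        if source == "a":
            # Source A reports Fahrenheit, so convert to Celsius.
            rows.append((rec["id"], round((rec["temp_f"] - 32) * 5 / 9, 2)))
        else:
            rows.append((rec["device"], rec["temp_c"]))
    return rows


def load(conn, rows):
    """Write normalized rows into the stand-in warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device_id INTEGER, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run all three stages and return the loaded rows for inspection."""
    load(conn, transform(extract()))
    return conn.execute(
        "SELECT device_id, temp_c FROM readings ORDER BY device_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Keeping the three stages as separate functions makes each one testable on its own, which is the same motivation behind splitting a workflow into granular tasks.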
How do we handle data transformations? dbt is a new way to transform data and build pipelines. It applies the principles of software engineering to analytics code, an approach that dramatically increases your leverage as a data analyst.
The following are four approaches to organizing dbt projects:
Monolithic
Micro-services
Layers
Layers & Verticals
Dwelo uses the monolithic approach.
Advantages:
Easier to debug, test, and deploy
Comprehensive data lineage and dbt documentation
Macros are defined in one place to help standardise data transformations
Easier to enforce standards with everything in a single place
The following is a code snippet from one of the underlying models to explain configuration, partitioning and clustering:
Configuration:
Materializations are strategies for persisting dbt models in a warehouse. There are four types of materializations built into dbt:
table
view
incremental
ephemeral
Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.
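The incremental idea can be sketched in plain Python: on each run, only source rows newer than the latest row already in the target are transformed and appended. This mirrors the spirit of an incremental materialization but is not dbt code; the row shape, the updated_at field, and the function names are hypothetical.

```python
def transform(row):
    """Stand-in transformation: derive a total from quantity and unit price."""
    return {
        "id": row["id"],
        "updated_at": row["updated_at"],
        "total": row["qty"] * row["unit_price"],
    }


def incremental_run(source_rows, target_rows):
    """Append only rows newer than the target's high-water mark.

    Returns the number of rows transformed in this run.
    """
    # ISO-8601 date strings compare correctly as plain strings.
    high_water = max((r["updated_at"] for r in target_rows), default=None)
    new_rows = [
        r for r in source_rows
        if high_water is None or r["updated_at"] > high_water
    ]
    target_rows.extend(transform(r) for r in new_rows)
    return len(new_rows)


if __name__ == "__main__":
    source = [
        {"id": 1, "updated_at": "2020-06-01", "qty": 2, "unit_price": 10},
        {"id": 2, "updated_at": "2020-06-02", "qty": 1, "unit_price": 5},
    ]
    target = []
    print(incremental_run(source, target))  # first run processes everything
    source.append({"id": 3, "updated_at": "2020-06-03", "qty": 4, "unit_price": 3})
    print(incremental_run(source, target))  # later run processes only the new row
```

The strict greater-than comparison would miss a late-arriving row that shares the high-water timestamp, so a real pipeline usually needs a unique key or a small lookback window on top of this.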
Partition:
BigQuery supports a partition by clause that partitions a table by a column or expression. Partitioning can help decrease latency and cost when querying large tables. Earlier, only date fields could be used to partition data, but in December 2019 Google released a new capability: integer range partitioning. This feature stores all the values that fall in the same range in the same partition. With integer range partitioning, you can effectively partition on any field (float, string, date, and so on), provided you transform the partition field into an integer value both when storing and when querying your data.
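One way to picture the "transform the field into an integer" step is a stable hash reduced to a fixed number of buckets. The sketch below uses zlib.crc32 purely as a stand-in hash; it is not the function BigQuery or the referenced article uses, and NUM_BUCKETS and the sample keys are hypothetical. The important property is that the same transformation is applied when writing and when querying, so a lookup only has to read the one bucket its key maps to.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 100  # hypothetical number of integer partitions


def partition_id(key):
    """Map any string key to a stable integer in [0, NUM_BUCKETS)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def store(rows, key_field):
    """Group rows by the integer partition derived from key_field."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row[key_field])].append(row)
    return partitions


def lookup(partitions, key_field, key):
    """Scan only the partition the key hashes to, not the whole table."""
    candidates = partitions.get(partition_id(key), [])
    return [row for row in candidates if row[key_field] == key]


if __name__ == "__main__":
    rows = [{"community": f"community-{i}", "devices": i} for i in range(1000)]
    parts = store(rows, "community")
    print(lookup(parts, "community", "community-42"))
```

Several keys share each bucket, so the lookup still filters on the original field after narrowing to one partition.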
Clustering:
BigQuery tables can be clustered to colocate related data. Clustering helps narrow the volume of data that a query has to scan. The column order is extremely important in clustering. Earlier, clustering was supported only on partitioned tables, but as of June 2020 we can cluster any table.
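Why column order matters can be seen in a toy model of block pruning. The sketch below sorts rows by the clustering columns, cuts them into fixed-size blocks, and counts how many blocks a filter would have to read based on each block's min/max values. This is a simplified illustration, not BigQuery's actual storage format; BLOCK_SIZE, the column names, and the data are hypothetical.

```python
from itertools import product

BLOCK_SIZE = 100  # hypothetical number of rows per storage block


def build_blocks(rows, cluster_by):
    """Sort rows by the clustering columns (in order) and cut them into blocks."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in cluster_by))
    return [ordered[i:i + BLOCK_SIZE] for i in range(0, len(ordered), BLOCK_SIZE)]


def blocks_scanned(blocks, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    count = 0
    for block in blocks:
        values = [r[column] for r in block]
        if min(values) <= value <= max(values):
            count += 1
    return count


if __name__ == "__main__":
    # 50 regions x 200 devices = 10,000 rows, which gives 100 blocks.
    rows = [{"region": r, "device": d} for r, d in product(range(50), range(200))]
    blocks = build_blocks(rows, cluster_by=["region", "device"])
    print(blocks_scanned(blocks, "region", 7))  # filter on the leading column
    print(blocks_scanned(blocks, "device", 7))  # filter on the trailing column
```

In this toy layout, a filter on the leading column touches 2 of the 100 blocks, while the same kind of filter on the trailing column touches 50, so the most frequently filtered column should generally come first.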
Some of the best ETL practices that we follow:
Partition data tables: Partitioning your tables by date and querying only the relevant partition can help reduce the cost of processing queries as well as improve performance. For example, WHERE _PARTITIONDATE = "2017-01-01" scans only the January 1, 2017 partition.
Loading data incrementally: Since we use on-demand pricing, we are charged for the number of bytes processed, regardless of whether the data is housed in BigQuery or in an external data source. Incremental loading reduces the amount of data being transferred, whereas a full load may take hours or days to complete depending on the volume of data. Even when a full load takes only 2-3 minutes, it is quite expensive.
Modularity: Breaking our model logic into base and staging models that then feed dim (dimension) models keeps the project modular and more manageable. Additionally, the ref function encourages you to write modular transformations, so that you can reuse models and reduce repeated code.
Adding data checks early and often: When processing data, it is useful to write data into a staging table, check the data quality, and only then exchange the staging table with the final production table.
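The staging-then-swap practice from the list above can be sketched with SQLite standing in for the warehouse. The table names, columns, and the three checks (non-empty, no null keys, no duplicate keys) are hypothetical examples rather than the checks used at Dwelo; the point is only that the production table is replaced after the staged data passes validation, and left untouched otherwise.

```python
import sqlite3


def quality_checks(conn, table):
    """Return the names of failed checks for `table` (a trusted internal name)."""
    failures = []
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        failures.append("table is empty")
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE device_id IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append("null device_id")
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT device_id FROM {table} "
        "GROUP BY device_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append("duplicate device_id")
    return failures


def publish(conn, rows):
    """Load rows into staging, validate, and swap into production only if clean."""
    conn.execute("DROP TABLE IF EXISTS devices_staging")
    conn.execute("CREATE TABLE devices_staging (device_id INTEGER, status TEXT)")
    conn.executemany("INSERT INTO devices_staging VALUES (?, ?)", rows)
    conn.commit()

    failures = quality_checks(conn, "devices_staging")
    if failures:
        return failures  # production table is left untouched

    conn.execute("DROP TABLE IF EXISTS devices")
    conn.execute("ALTER TABLE devices_staging RENAME TO devices")
    conn.commit()
    return []


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(publish(conn, [(1, "online"), (2, "offline")]))
    print(publish(conn, [(1, "online"), (1, "offline")]))
```

When a load fails validation, the staging table is kept so the bad rows can be inspected before the next attempt.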
Data engineering is a specialized skill that is often not seen as a necessity. The need for it is realized only when enterprises are stuck for ROI (return on investment), are limited by scale, or lack the analytics velocity of the behemoths worldwide. Data engineers are the heroes working in the shadows to ensure that enterprises get the right data to the right people at the right time.
It’s a long journey, and we are all still learning.
https://docs.getdbt.com/docs/introduction
https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71
https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5
https://medium.com/google-cloud/partition-on-any-field-with-bigquery-840f8aa1aaab
https://www.datalife8020.com/post/data-engineers-the-underappreciated-siblings
Source: https://medium.com/dwelo-r-d/data-engineering-at-dwelo-1a68a212cf17