数据库模式分解方法 (Database schema decomposition methods)

    For every data scientist, data availability is a crucial prerequisite for their work. Most commonly, data are acquired from internal and external (paid) sources via APIs. On the one hand, data scientists usually don’t care much about how the data are provided to them; they are satisfied when they can query them from a database or retrieve them from an Excel sheet. On the other hand, quick data availability is very important for building any analytics tool.

    Usually, the data must go through processing done by integration engineers before they are ready to be consumed, and that can take a lot of time. Additionally, data mapping is a very tedious task. It is believed that 90% of developers do not enjoy the decomposition of data and their mapping. Raw data from external sources are usually acquired as XML or JSON files for given parameters. Nowadays, one of the most common concepts for storing such data in the cloud is a data lake. Once the data are stored there, an engineer develops a pipeline to propagate them into tables in a normalized form in a relational database or, more likely, into a data warehouse.

    The concept of the data flow has been heavily automated with cloud tools. Namely, in AWS one can use S3, Glue, Athena, Lake Formation, and Redshift for data management and storage. This time we will challenge the heuristics in AWS that automatically detect the format and properties of the input data, and we will see how the grouping into tables works.

    Car register example

    In our hypothetical example, let’s suppose we are retrieving historical data about cars identified by their VIN. The first API provides us with a history of the mileage measured by car service stations, and the other API returns the history of the registration plates. Below is an example API call for a car with VIN identifier 1HGBH41JXMN203578.

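    The raw responses were shown as screenshots in the original article; a minimal sketch of what such payloads could look like, written as Python literals with hypothetical field names (only the overall shape follows the description in the next paragraph), would be:

    ```python
    # Hypothetical payloads for VIN 1HGBH41JXMN203578; the field names inside the
    # arrays are assumptions, only the overall shape follows the article.
    mileage_response = {
        "vin": "1HGBH41JXMN203578",
        "mileageHistory": [
            {"measuredOn": "2013-05-14", "mileageKm": 35200},
            {"measuredOn": "2015-09-02", "mileageKm": 78450},
        ],
    }

    registration_response = {
        "vin": "1HGBH41JXMN203578",
        "plateHistory": [
            {"registeredOn": "2012-03-01", "plate": "1A2 3456"},
            {"registeredOn": "2014-06-17", "plate": "7B8 9012"},
        ],
    }
    ```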

    The first API returns the VIN of the vehicle together with the attribute mileageHistory, which is an array of elements containing the mileage in km measured on a particular day. The registration API again returns the VIN, together with an array storing the dates when the registration number has changed. Altogether, in these two JSON responses we have three entities:

    Vehicle identification (through VIN),
    Mileage history (for each vehicle),
    Plate history (for each vehicle).

    Let’s have a look at one of the possible ways of mapping and decomposing this data into database tables. Allowing each entity to have its own table and each record its own identifier, the schema could look like this.

    The table vehicle stores the vin, and vehicleID is used as the vehicle identifier for saving the historical data, with every element of the array stored as one row. This decomposition is definitely at least in 3rd normal form, and it describes the most common way the data could be represented. In general, delivering a system that will automatically populate a DB schema like this is not trivial: one has to first do a business analysis of the data and analyze their relationships, decide how to parse the data and how to structure the database, and finally map the data into the appropriate fields.

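    A minimal sketch of such a decomposition, using SQLite DDL for illustration (the history column names are assumptions, since the original schema diagram is not reproduced here):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE vehicle (
        vehicleID INTEGER PRIMARY KEY,
        vin       TEXT UNIQUE NOT NULL
    );

    CREATE TABLE mileagehistory (
        mileageID    INTEGER PRIMARY KEY,
        vehicleID    INTEGER NOT NULL REFERENCES vehicle(vehicleID),
        measuredOn   TEXT,
        mileageKm    INTEGER
    );

    CREATE TABLE platehistory (
        plateID      INTEGER PRIMARY KEY,
        vehicleID    INTEGER NOT NULL REFERENCES vehicle(vehicleID),
        registeredOn TEXT,
        plate        TEXT
    );
    """)
    ```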

    Challenging AWS Glue

    According to the definition in the AWS docs, a crawler can automate the process of data decomposition and mapping from raw data. Here is its description:

    Classifies data to determine the format, schema, and associated properties of the raw data — You can configure the results of classification by creating a custom classifier.

    Groups data into tables or partitions, where the grouping is based on crawler heuristics.

    We will use two responses of our fictitious API as an example to feed the heuristics and grouping in AWS. The following files were uploaded to S3, and the crawler was applied to them.

    The crawler was configured to group data into tables and recognize the data types while reading data from the S3 bucket. It was not forced to create a single schema for the files, since there are multiple entities present. Surprisingly, the crawler ended up creating four database tables, one for each file.

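    For reference, a crawler like the one described here could be created with boto3 roughly as follows; the crawler name, IAM role and bucket path are placeholders, not the setup used in the original experiment:

    ```python
    import boto3

    glue = boto3.client("glue")

    # Hypothetical crawler writing its results into a Glue database "car_register";
    # by default it groups objects into tables using its own heuristics.
    glue.create_crawler(
        Name="car-history-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
        DatabaseName="car_register",
        Targets={"S3Targets": [{"Path": "s3://car-history-raw/responses/"}]},
    )
    glue.start_crawler(Name="car-history-crawler")
    ```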

    Looking at the structure of the tables, two tables had a schema with vin and mileagehistory as an array,

    and two tables had a schema with vin and platehistory.

    Having as many tables as output files is not really the expected behavior. A data scientist trying to retrieve data from all four files would have to first retrieve the list of all tables (which will equal the number of files) and then decide for each of them whether it has a mileage or a plate column. And that is a very complicated query, not to mention the need to unnest the arrays into database rows before processing.

    Let’s change the configuration of the crawler and allow a single schema for all files. The crawler could then at least recognize that the mileagehistory and platehistory for one VIN belong together.

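    The “single schema” setting corresponds to the crawler’s CombineCompatibleSchemas grouping policy; a sketch of how that change could be applied to the hypothetical crawler from the previous snippet:

    ```python
    import boto3
    import json

    glue = boto3.client("glue")

    # Merge compatible schemas into a single table instead of one table per file.
    glue.update_crawler(
        Name="car-history-crawler",  # hypothetical crawler from the earlier sketch
        Configuration=json.dumps({
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }),
    )
    glue.start_crawler(Name="car-history-crawler")
    ```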

    Now the crawler created only a single schema table for all records. The table definition looks promising:

    However, a surprise comes when querying the data from the data lake. After selecting all records from the resulting table, one gets:

    There are four independent records, and again each file is one row. The data scientist trying to obtain these data points would have to unnest mileagehistory or platehistory for the rows that are not empty and combine the data together. Solving the task “find all cars that have at least 10,000 km mileage and whose second registration was in 2014” would take a lot of time, considering the amount of data.

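    To illustrate how involved that becomes, here is a sketch of the example task as an Athena (Presto-style) query over the single crawled table; the table name and the field names inside the arrays are assumptions carried over from the earlier sketches:

    ```python
    # Hypothetical query: the two array columns are unnested separately (they never
    # appear populated in the same row) and the results are joined back on vin.
    QUERY = """
    WITH mileage AS (
        SELECT vin, max(m.mileagekm) AS max_mileage
        FROM car_register.responses
        CROSS JOIN UNNEST(mileagehistory) AS t(m)
        GROUP BY vin
    ),
    plates AS (
        SELECT vin,
               p.registeredon AS reg_date,
               row_number() OVER (PARTITION BY vin ORDER BY p.registeredon) AS reg_order
        FROM car_register.responses
        CROSS JOIN UNNEST(platehistory) AS t(p)
    )
    SELECT mileage.vin
    FROM mileage
    JOIN plates ON plates.vin = mileage.vin
    WHERE mileage.max_mileage >= 10000
      AND plates.reg_order = 2
      AND year(date(plates.reg_date)) = 2014
    """
    ```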

    The crawler failed with the data grouping in both cases; however, it correctly identified the data types. Data format identification is easy, but the grouping can be tricky, especially when dealing with JSON array fields.

    Proposed algorithm for data grouping

    Using AWS Glue for grouping and categorizing the data in our example was not very successful. Therefore, we have come up with a new approach to data normalization. The main idea relies on the following hypothesis: during the design of an external API, business logic has been incorporated into its response(s). To some extent, the different endpoints and the structure of the responses already contain the relations. Assuming the API is well designed, we can capture a lot of logic just from the shape of the response. The proposed algorithm analyzes each response as it comes into the system and assumes the existence of two global variables:

    1. Pattern table, containing a list of column identifiers and their data types for each table discovered so far. During the process, new tables originate; their columns and data types are stored in this list in order to check whether data will be appended to an existing table or a new table will be created.

    2. Table of assumed relations. Discovered relations, such as 1:n, are placed into this structure, e.g. table A and its column 1, which is related to table B and its column 2.

    In the beginning, both of these structures are empty. The algorithm works in a recursive way, decomposing the JSON response from the API. The procedure distinguishes whether the structure is an array […] or an object {…}.

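    The exact procedure was presented as a diagram in the original article; a simplified Python sketch of the idea (the function and variable names are mine, and the hypothetical responses sketched at the beginning are reused) could look like this:

    ```python
    pattern_table = {}        # table name -> {column name: data type}
    assumed_relations = []    # (parent table, parent column, child table, child column)

    def decompose(node, table, parent=None):
        """Recursively walk a JSON structure: scalar keys of an object become
        columns of `table`, while nested arrays/objects spawn child tables that
        are linked back to the parent through an assumed surrogate key."""
        if isinstance(node, list):
            for element in node:
                decompose(element, table, parent)
        elif isinstance(node, dict):
            columns = pattern_table.setdefault(table, {})
            if parent is not None:
                parent_table, parent_column = parent
                columns.setdefault(parent_column, "integer")   # assumed foreign key
                relation = (parent_table, parent_column, table, parent_column)
                if relation not in assumed_relations:
                    assumed_relations.append(relation)
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    columns.setdefault(table + "id", "integer")  # assumed surrogate key
                    decompose(value, key.lower(), parent=(table, table + "id"))
                else:
                    columns.setdefault(key.lower(), type(value).__name__)

    decompose(mileage_response, "vehicle")
    decompose(registration_response, "vehicle")
    # pattern_table now describes vehicle, mileagehistory and platehistory;
    # assumed_relations links both history tables back to vehicle (1:n).
    ```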

    Applying this algorithm to our sample data, this is what the pattern table would look like.

    The table of relations will look like this.

    Using the table of relations and the pattern table, this is what the resulting table schema looks like. It is a bit different from the 3rd normal form mentioned in the introduction, but the main entities and relations remain the same.

    The most powerful aspect of the procedure is its reproducibility in SQL commands. Every step that involves table creation or row insertion can be fully automated. Therefore, already in the staging layer of a database, one can have fairly well decomposed data without human interaction. However, indexes and optimization features must be added manually later. From a data scientist’s point of view, at least for model training, it is very valuable to have data in a normalized form, despite the longer loading.

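    As a sketch of that reproducibility (again with my own naming and a naive type mapping), the pattern table built above could be turned into staging-layer DDL like this:

    ```python
    def ddl_from_pattern(pattern):
        """Generate CREATE TABLE statements from the pattern table; indexes and
        other optimizations are intentionally left out and remain a manual step."""
        type_map = {"str": "varchar", "int": "bigint", "float": "double", "bool": "boolean"}
        statements = []
        for table, columns in pattern.items():
            column_lines = ",\n    ".join(
                f"{name} {type_map.get(dtype, dtype)}" for name, dtype in columns.items()
            )
            statements.append(f"CREATE TABLE {table} (\n    {column_lines}\n);")
        return statements

    for statement in ddl_from_pattern(pattern_table):
        print(statement)
    ```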

    The previous exercise was a model example, showing the perfect, theoretical world where the API structure doesn’t change over time. However, in real business we usually demand that our API vendor make changes and gradually add new data points. Traditionally, every time the vendor adds a new data point, a data engineer must analyze the response and map this new field to a column in a table.

    An empirical approach to the decomposition

    Communication with the vendors is sometimes difficult and their deadlines are hard to meet, so API production is an iterative process. Let’s have a look at a typical conversation with a vendor.

    Unless the API vendor provides their product “as is”, there are always some ongoing changes, sometimes driven by the consumer and sometimes by the vendor. From this point of view, an additional layer needs to be implemented in the algorithm that is robust enough to react to these kinds of changes.

    Therefore, the second step in data consumption automation involves the automatic acquisition of a new data point, so that data scientists can immediately use it for their analyses. Automating the decision about which entity the new fields belong to, and whether they form a new entity themselves, is a very difficult task.

    Example #1:

    Looking at the two responses, it is clear that these data should not be added to the same table. Why is that? The first response describes state attributes that usually don’t change, such as the manufacturing year and color of the vehicle. The other response captures the current status of the car, which can change over time. Therefore, these two responses would most likely be represented as two independent tables, e.g. vehiclebasedata and vehiclestatus.

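    The two responses in example #1 were shown as images in the original article; hypothetical shapes consistent with the description above (and with the 2/3 adjustment ratio quoted later) might be:

    ```python
    # State attributes that rarely change -> would end up in e.g. vehiclebasedata
    base_response = {"vin": "1HGBH41JXMN203578", "manufacturingYear": 2011, "color": "red"}

    # Current, time-varying status -> would end up in e.g. vehiclestatus
    status_response = {"vin": "1HGBH41JXMN203578", "currentMileage": 81230, "lastInspection": "2020-11-05"}
    ```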

    Example #2:

    Here, the parameter carBody could simply be kept with the VIN in the basetable from our example, so only two entities would be preserved in both cases of the response: the mileage history and the vehicle base data.

    Usually, when adding new data that don’t belong to any existing entity, API producers create a new endpoint with a different name (similar to example #1). When the data can logically belong to existing structures, they are appended to the existing response (similar to example #2). Therefore, the reasoning will again be based both on the response format and on the number of new identifiers.

    When comparing JSON response A to JSON response B, there can be a certain number of identifiers with the same name, and then there are some identifiers unique to A that are not present in B, and vice versa. The decision whether they will be part of the same entity is based on the calculation of the adjustment ratio. Denote the variables as follows:

    UIA = number of identifiers present in response A but not in B (including arrays)
    UIB = number of identifiers present in response B but not in A (including arrays)
    SAB = number of identifiers shared between A and B

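    The formula for the adjustment ratio was given as an image in the original article; a form that reproduces all of the worked values below (2/3, 1/5, 1/7 and 2/8) would be:

    $$ r_{AB} = \frac{UI_A + UI_B}{2\,S_{AB} + UI_A + UI_B} $$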

    In example #1 the adjustment ratio equals 2/3, whereas example #2 has an adjustment ratio of 1/5. When there is no difference within the response structure, the coefficient equals zero. The decision whether a new entity will be established is based on a comparison with a threshold value. If the adjustment ratio is smaller than the threshold, the existing entity will be updated; when the adjustment ratio is higher, a new entity will be created.

    Example of the adjustment ratio

    Suppose that we are receiving API responses in the following format. The response gets transferred into a single database table.

    After reconstructing it and loading a few records, the table can look like this.

    Suppose we have configured the adjustment ratio threshold to be 0.30 (empirically). Imagine that the data scientist requests a new field to be propagated along with the data, for example the number of doors. The API designer appends the new identifier to the response. The new response could look like this.

    The adjustment ratio in this case is 1/(1+2*3) = 1/7 = 0.1428. Since 0.1428 < 0.30, we change the table definition, populate the previous rows with NULL, and update the pattern table (the table of assumed relations remains the same). The modified table will look like this.

    Once we have at least one object in the pattern table, the adjustment ratio is no longer calculated with respect to the particular API responses, but with respect to the identifiers in the pattern table itself. Suppose we request one more field, for example engine displacement. After the API is redesigned and we query for VIN = JM1BG2246R0816241, we receive the following API response:

    Obviously, the API vendor forgot to provide the color in this case (this happens quite often if the API developer is not using API REST templates), and this field is not available. This would usually throw a violation error, since the field is missing, but not in our case. We calculate the adjustment ratio with respect to the pattern table as 2/(2*3+1+1) = 2/8 = 0.25 < 0.30. The adjustment ratio is smaller than the empirically determined threshold, therefore we append a row with NULL values for the identifiers not present in the response and create a new column for engine displacement, while keeping its value for all the previous rows as NULL.

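    A small sketch that reproduces both calculations, assuming the reconstructed formula above and column sets implied by the prose (vin, color, year, plus the requested numberOfDoors and engineDisplacement):

    ```python
    def adjustment_ratio(response_identifiers, reference_identifiers):
        """Reference is the previous response or, once it exists, the pattern table."""
        response, reference = set(response_identifiers), set(reference_identifiers)
        shared = len(response & reference)
        unique = len(response - reference) + len(reference - response)
        return unique / (2 * shared + unique)

    pattern = {"vin", "color", "year"}
    print(adjustment_ratio({"vin", "color", "year", "numberofdoors"}, pattern))               # 1/7 = 0.1428...

    pattern.add("numberofdoors")
    print(adjustment_ratio({"vin", "year", "numberofdoors", "enginedisplacement"}, pattern))  # 2/8 = 0.25
    ```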

    The adjustment ratio denotes the pace at which we allow new columns to be appended to existing tables, or simply the ratio that describes the maximum tolerated proportion of missing data. In reality, we need two adjustment coefficients: one for adding new columns and one for tolerating missing data.

    Determination of the threshold cannot be fully automatic; however, its value should be set larger than 0 and smaller than 0.30. For relatively young API vendors and for responses with low data volume, it is recommended to set this coefficient higher; for standardized responses, one can use lower threshold values.

    Usually, new data providers change their API outputs often in the first months of their service, while established businesses that have been running for several years and provide data services don’t change their responses much. One could add other aspects to the calculation of the adjustment, such as the company age of the data provider or a penalty for missing data.

    Conclusion

    This algorithm has been verified on real REST APIs with fair results, gaining significantly better grouping performance than AWS Glue, which has shown its inefficiency especially for responses containing arrays with a variable number of elements. In 80% of the testing cases, the procedure assured decomposition into 3rd normal form or higher. When the response is small, it is still hard for the algorithm to detect whether newly added data form a new entity or stay in the same one.

    The proposed approach is still very experimental compared to what has been done in data integration automation up to now. It is worth mentioning that in the near future the mapping exercise or integration process won’t be fully automatic, but it could at least contribute to reducing overhead and help to pre-process data in a polished way. Data stability is always an issue, and in many cases understanding API responses is difficult even for humans.

    Additionally, it is necessary to mention that this algorithm will not always work perfectly and will not solve every problem with mapping an API output. The performance of the queries is not guaranteed; indexing and optimization always have to be done manually. This approach is meant to deliver the tables into a staging layer of a data warehouse architecture, where they can be consumed by analysts or serve for reporting experiments.

    Translated from: https://medium.com/@martindlask/an-empirical-approach-to-automatic-data-decomposition-6c98f37ed762
