Google BigQuery is undoubtedly one of the most popular cloud data warehousing platforms available today. Since its launch in 2011, the product has evolved to new heights while still maintaining its simplicity and ease of use.
One of the key steps in setting up a data warehouse is to establish the data pipelines for ingesting data from external sources. Many times, data being ingested requires generation of a unique identifier for each record in a table, also known as a Surrogate Key.
In contrast to a Natural Key, which is a column or set of columns that uniquely identifies a record in a table and has a business meaning, a Surrogate Key is a system-generated value (such as a GUID or a sequential integer) with no business meaning.
For example:
Auto-generating unique Event IDs while ingesting streaming events.
Auto-generating unique IDs for data without an existing primary key.
Google BigQuery does not offer field attributes such as IDENTITY (as in MS SQL) or AUTO_INCREMENT (as in MySQL) that can be attached to a field when the table is defined. Nor does it offer an object such as a Sequence (as in T-SQL) to auto-generate surrogate keys.
The following are some ways to auto-generate surrogate keys in Google BigQuery, depending on the key format and the use case:
BigQuery provides an analytic function ROW_NUMBER() that can be used over a window of rows to generate an incremental integer for each row.
    SELECT ROW_NUMBER() OVER() AS ID, *
    FROM `bigquery-public-data.usa_names.usa_1910_current`
Figure: Auto-generated sequential integer values for ID
The above technique is useful when there is a need to populate surrogate keys for data already in a BigQuery table.
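When a new batch has to continue an existing sequence, one possible approach is to offset ROW_NUMBER() by the current maximum ID. The query below is only a sketch under stated assumptions: `existing` and `new_batch` are hypothetical inline tables standing in for a keyed target table and a staging table, and the ORDER BY inside OVER is added so that the numbering is repeatable (with an empty OVER() the order in which rows are numbered is not defined).

```sql
-- Hypothetical inline data; in practice these would be real tables.
WITH existing AS (
  SELECT 1 AS ID, 'CA' AS state, 'F' AS gender, 1910 AS year, 'Mary' AS name
  UNION ALL
  SELECT 2, 'CA', 'F', 1910, 'Helen'
),
new_batch AS (
  SELECT 'NY' AS state, 'M' AS gender, 1911 AS year, 'John' AS name
  UNION ALL
  SELECT 'TX', 'F', 1911, 'Ruth'
)
SELECT
  -- Start numbering after the highest ID already assigned (0 if none).
  (SELECT IFNULL(MAX(ID), 0) FROM existing)
    + ROW_NUMBER() OVER (ORDER BY state, gender, year, name) AS ID,
  new_batch.*
FROM new_batch;
```

With the sample rows above, the two new records receive IDs 3 and 4. This sketch assumes only one load runs at a time; two concurrent loads reading the same maximum could assign overlapping IDs.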
If the records being ingested contain a field or a combination of fields that guarantee uniqueness (Primary Key / Composite Primary Key), this technique can be applied to generate a Hash of those fields to populate the Surrogate key.
In our example, the combination of the state, gender, year and name fields is guaranteed to be unique across all records. So, in order to generate a surrogate key, we concatenate these fields and compute a SHA256 digest (which returns 32 bytes).
    SELECT SHA256(CONCAT(state, gender, year, name)) AS ID, *
    FROM `bigquery-public-data.usa_names.usa_1910_current`
Figure: Auto-generated hash values for ID
The above technique can also be used for streaming ingestion use cases, as it depends only on the values received in the record being ingested.
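Because the hash depends only on the record's own values, the same record always produces the same key. The sketch below illustrates this on hypothetical inline rows (`incoming` is not a real table). The `'|'` separator and the explicit CAST are choices made for this illustration rather than part of the original query: a separator keeps different field combinations from concatenating to the same string when field lengths vary.

```sql
-- Hypothetical inline rows standing in for records being ingested.
WITH incoming AS (
  SELECT 'CA' AS state, 'F' AS gender, 1910 AS year, 'Mary' AS name
  UNION ALL
  SELECT 'NY', 'M', 1911, 'John'
)
SELECT
  -- Separator added so that ('ab', 'c') and ('a', 'bc') hash differently.
  SHA256(CONCAT(state, '|', gender, '|', CAST(year AS STRING), '|', name)) AS ID,
  *
FROM incoming;
```

Note that CONCAT returns NULL if any of its arguments is NULL, so the key fields need to be non-null (or wrapped in a default) for this to produce a usable ID.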
Apart from the SHA256 function used in the example above, Google BigQuery also offers other hash functions that can be used in the same way: MD5 (returns a 16-byte hash), SHA1 (returns a 20-byte hash) and SHA512 (returns a 64-byte hash). Each of these hash functions returns a value of the BYTES datatype.
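The digest sizes listed above can be checked directly, and a BYTES key can be turned into a STRING where that is easier to store or join on. The following is a minimal sketch; the literal `'CAF1910Mary'` is just a sample input, and whether to keep the key as BYTES, hex or Base64 is a design choice the original does not prescribe.

```sql
SELECT
  BYTE_LENGTH(MD5(key_text))    AS md5_bytes,     -- length of the MD5 digest
  BYTE_LENGTH(SHA1(key_text))   AS sha1_bytes,    -- length of the SHA1 digest
  BYTE_LENGTH(SHA256(key_text)) AS sha256_bytes,  -- length of the SHA256 digest
  BYTE_LENGTH(SHA512(key_text)) AS sha512_bytes,  -- length of the SHA512 digest
  TO_HEX(SHA256(key_text))      AS sha256_hex,    -- BYTES to hexadecimal STRING
  TO_BASE64(SHA256(key_text))   AS sha256_base64  -- BYTES to Base64 STRING
FROM (SELECT 'CAF1910Mary' AS key_text);
```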
One of the recent additions to Google BigQuery is the GENERATE_UUID function. This function generates a universally unique identifier that consists of 32 hexadecimal digits in five groups separated by hyphens, in the form 8-4-4-4-12. The return value is of the STRING datatype.
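As a small sketch of the format described above, the query below generates a UUID for each of a few hypothetical inline rows (`events` is not a real table) and checks its length and layout. Unlike a hash of the record's fields, the value is random, so running the query again yields different IDs; the keys need to be written to a table if they are meant to persist.

```sql
-- Hypothetical inline rows standing in for ingested events.
WITH events AS (
  SELECT 'page_view' AS event_type
  UNION ALL
  SELECT 'click'
  UNION ALL
  SELECT 'purchase'
),
keyed AS (
  SELECT GENERATE_UUID() AS ID, event_type
  FROM events
)
SELECT
  ID,
  event_type,
  LENGTH(ID) AS id_length,  -- 32 hex digits plus 4 hyphens
  REGEXP_CONTAINS(
    ID,
    r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
  ) AS has_expected_format
FROM keyed;
```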
    SELECT GENERATE_UUID() AS ID, *
    FROM `bigquery-public-data.usa_names.usa_1910_current`
Figure: Auto-generated UUID values for ID
Because Surrogate keys carry no contextual or business meaning, they allow for unlimited values and stand the test of time, as they are not affected by a changing business environment.
Google BigQuery provides multiple ways to auto-generate surrogate keys, depending on the format and the use case being addressed.
Translated from: https://medium.com/@smathur/surrogate-keys-in-google-bigquery-64677f48e653