Data warehouses built on top of Spark and columnar stores often don't handle ad hoc queries, range scans, or even joins well, due to their OLAP-oriented design. Hyperspace, an indexing subsystem from Microsoft built on top of Apache Spark, lets you create indexes to support ad hoc queries just like a traditional database.
Hyperspace is a simple set of APIs in the Spark programming model that lets you easily create and manage indexes on your existing DataFrames. It injects faster, index-backed scans into Spark's original execution plan to fully utilize the performance boost provided by indexes.
As of this writing, Hyperspace supports Apache Spark 2.4 with Scala 2.11 and 2.12. In this demo, I will use Spark 2.4.6 with Scala 2.11.12 (Java 1.8.0_222).
Hyperspace is not yet production ready, and it currently only supports HDFS-based index creation. Therefore, when running it in a local environment, an HDFS instance listening at localhost:9000 is needed so that Hyperspace can function properly.
1.1 Download and run a local HDFS
Go to the official Hadoop download page and download Hadoop 2.9.2. Extract the tarball to your preferred location. Edit the file $HADOOP_HOME/etc/hadoop/hdfs-site.xml and make it look like this:
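A minimal hdfs-site.xml for a single-node setup, following the standard Hadoop pseudo-distributed configuration, just sets the replication factor to 1:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Note that fs.defaultFS is typically set to hdfs://localhost:9000 in the neighboring core-site.xml, so that HDFS listens on the address used throughout this demo.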
Then, in the $HADOOP_HOME/bin directory, run:
./hdfs namenode -format
to format your HDFS for the first time before use.
Go to the $HADOOP_HOME/sbin directory and execute start-dfs.sh. When prompted with SSH questions, make sure you have all the permissions and SSH keys set up to allow Hadoop to SSH into your local machine, which then acts as a pseudo-distributed HDFS. As a result, there should be three Java processes running, acting as the NameNode, DataNode, and SecondaryNameNode respectively.
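You can confirm this with the jps tool that ships with the JDK; the three process names should appear in its output:

jps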
You’ve now set up your local HDFS!
1.2 Create a CSV sample file
Create a CSV file like this:
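Any small CSV with the id and name columns used later in this demo will do; for example, a hypothetical file.csv:

id,name
1,Alice
2,Bob
3,Charlie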
and put it into your local HDFS by running:
$HADOOP_HOME/bin/hdfs dfs -put /path/to/csv/file.csv /hyperspace_test
where /hyperspace_test is the destination directory in HDFS.
2.1 Start your Spark shell
To include Hyperspace as a dependency, run:
$SPARK_HOME/bin/spark-shell \
  --packages=com.microsoft.hyperspace:hyperspace-core_2.11:0.1.0

Please choose different versions and packages if you are running with a different Scala version (2.11 or 2.12).
2.2 Load data and create an index
To load data, run:
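A minimal sketch, assuming the HDFS path used in the hdfs dfs -put step earlier:

val df = spark.read
  .option("header", "true")
  .csv("hdfs://localhost:9000/hyperspace_test/file.csv")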
Import Hyperspace into the Spark shell:
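Following Hyperspace's quick-start, the imports and entry point look like this:

import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._

// The Hyperspace object is the handle for all index operations
val hs = new Hyperspace(spark)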
Next, we want to create an index on the id column that includes name as a data column, so that the name column can be retrieved quickly using the id:
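A sketch using Hyperspace's IndexConfig and createIndex APIs; the index name testIndex is an arbitrary choice for this demo:

hs.createIndex(df, IndexConfig("testIndex", indexedColumns = Seq("id"), includedColumns = Seq("name")))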
This will create an index on the id column and display information about the new index.
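You can also inspect the metadata of all registered indexes at any time with:

hs.indexes.show()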
Next, let's take advantage of the index we just created and see how it changes the query's execution plan and boosts performance. Write the query as you normally would:
val query = df.filter(df("id") === 1).select("name")
Use Hyperspace to explain how this query will be interpreted:
hs.explain(query, verbose = true)
It will generate output like:
You can clearly see that with Hyperspace, FileScan reads the index Parquet file instead of the original CSV file from HDFS. Although this small example isn't complex enough to show a dramatic advantage, it is clear that building an index from the original CSV file and saving it as a sorted, managed Parquet file keyed on the id column bypasses the shuffle phase, and therefore increases performance dramatically.
Finally, let’s enable Hyperspace and execute the query:
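A sketch, using the enableHyperspace hook that Hyperspace adds to the Spark session:

// Enable Hyperspace so the optimizer can pick up available indexes
spark.enableHyperspace

// Run the query; the rewritten plan scans the index instead of the CSV
query.show()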
Let’s run query.explain() to see what's executed under the hood:
The physical plan is rewired by Hyperspace: during execution, the index we created is scanned instead of the original CSV. Since Hyperspace pre-sorts the index Parquet file by the id column, a lookup query like WHERE id = 123 executes faster by hitting the index directly.
Besides this core API functionality, Hyperspace also includes index management APIs like:
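For example, assuming the testIndex created earlier:

hs.refreshIndex("testIndex")  // rebuild the index after the underlying data changes
hs.deleteIndex("testIndex")   // soft-delete: exclude the index from query optimization
hs.restoreIndex("testIndex")  // bring a soft-deleted index back
hs.vacuumIndex("testIndex")   // physically remove a soft-deleted index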
If you often have queries that:
Look up a specific value (WHERE col = 'abcd')
Narrow the data into a very small range (WHERE num > 4 AND num < 7)
Or join between two tables on a common column (JOIN table2 ON table1.value = table2.value)
then you can definitely benefit from creating indexes on those columns to speed up your queries.
Hyperspace builds indexes on your specified columns to bypass the distributed shuffle-sort phase at runtime, thereby boosting your query performance. It is still under development, so please use it with caution in your production deployment.
Translated from: https://medium.com/expedia-group-tech/indexing-spark-data-with-microsofts-hyperspace-ec4de4b93ba3