Tableau Data Visualization
In this article I will describe the steps to set up a notebook that exports a Databricks dashboard as an HTML file and uploads it to an S3 bucket configured for static website hosting. In Tableau, we will create a dashboard that will embed the URL where the file is located.
Notebooks and data visualization tools are important components of an enterprise data framework. Notebooks are mainly used by data scientists for exploratory data analysis, statistical modeling and machine learning. Specialized data visualization tools such as Tableau focus on providing a platform where users with almost no technical background can quickly build interactive reports and dashboards.
In general, when business users raise new questions that require data exploration and fast feedback, notebooks are very helpful because of the flexibility and speed with which they let you try out different paths and provide insights quickly. Even though notebooks can be exported in a friendly format and shared, many users prefer to use their enterprise-standard visualization tool as the entry point to all reports and dashboards.
There are also cases where specialized visualization tools lack the capability to build advanced customized graphs. In my particular situation, I needed to build an interactive network graph with nodes and edges that were constantly being updated. After some research I found that I could use a JavaScript library called D3.js, which has powerful visualization capabilities. In addition, Databricks allows embedding D3.js visualizations in its notebooks, so they can be integrated with the rest of the data pipeline.
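For context, here is a minimal sketch of how a D3.js visualization can be embedded in a Databricks notebook cell with displayHTML; the CDN URL and the drawing code are illustrative, not the actual network graph from my project:

html = """
<!DOCTYPE html>
<html>
<body>
  <svg id="viz" width="300" height="120"></svg>
  <script src="https://d3js.org/d3.v5.min.js"></script>
  <script>
    // Draw one circle per data point; a real network graph would
    // run d3.forceSimulation over nodes and edges instead
    d3.select("#viz").selectAll("circle")
      .data([20, 45, 70])
      .enter().append("circle")
      .attr("cx", function(d) { return d * 3; })
      .attr("cy", 60)
      .attr("r", 15)
      .attr("fill", "steelblue");
  </script>
</body>
</html>
"""
displayHTML(html)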
There are two steps in the process: first to build the Databricks dashboard that will contain the different graphs, and then to export it so that it can be accessed from Tableau. Even though the first step of generating the network graph with D3.js is really fun, in this article I will focus on the second step.
First we need to run the notebook that has the visualizations for the dashboard we want to use. We will use the run_id of the executed notebook to export the dashboard.
When this notebook runs, it will store the run id in a temporary view. This is done by including the following snippet:
%scala
val runId = dbutils.notebook.getContext.currentRunId.toString
Seq(runId).toDF("run_id").createOrReplaceTempView("run_id")
The run_id is then extracted from the previously created view, along with the name of the notebook. A new global temporary view will be created with the name run_id_notebook-name:
import re

# Get run_id from the temporary view created above
runId = spark.table("run_id").head()["run_id"]
runId = int(re.findall(r'\d+', runId)[0])
data = [[runId]]

# Get the notebook name from its path
notebook_path = spark.table("notebook_path").head()["notebook_path"]
path_split = notebook_path.split("/")
nb_name = path_split[len(path_split) - 1]

# Create a global temporary view named run_id_<notebook-name>
df = spark.createDataFrame(data, ["run_id"])
df.createOrReplaceGlobalTempView("run_id_{}".format(nb_name))
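Note that the code above also reads a notebook_path view, which is assumed to be created in the same notebook analogously to the run_id view. A minimal sketch of how that could look in Python (the dbutils context chain below is an assumption and may vary across Databricks runtime versions):

# Store the current notebook's path in a temporary view so the
# snippet above can read it (sketch; context API may vary)
notebook_path = (dbutils.notebook.entry_point.getDbutils()
                 .notebook().getContext().notebookPath().get())
spark.createDataFrame([[notebook_path]], ["notebook_path"]) \
     .createOrReplaceTempView("notebook_path")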
In a separate notebook (let's call it network_graph_export), we will run the notebook and get the run_id after it is executed.
# Run notebook
notebook_name = 'network_graph'
dbutils.notebook.run(notebook_name, 180)

# Get run_id from the global temporary view the notebook created
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
run_id_table = 'run_id_{}'.format(notebook_name)
run_id = spark.table(global_temp_db + "." + run_id_table).first()[0]
We define a method that will use the previously obtained run_id and the Databricks REST API to export the dashboard in JSON format.
The ACCOUNT in the DOMAIN variable should be replaced by your own Databricks account name. The API requires a token for authentication. This personal token can be generated in the Databricks UI or via the REST API.
As you can see, the token is stored in what is called a Databricks secret. This utility can store any sort of credentials outside notebooks so that they can be retrieved when needed.
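For reference, here is a hedged sketch of how such a secret scope and token could be stored from outside the notebook using the Secrets REST API; the scope and key names mirror the ones used in the next snippet, while the domain and admin token are placeholders you would replace:

import requests

DOMAIN = 'ACCOUNT.cloud.databricks.com'   # placeholder: your workspace domain
ADMIN_TOKEN = '<personal-access-token>'   # placeholder: a token allowed to manage secrets
headers = {'Authorization': 'Bearer %s' % ADMIN_TOKEN}

# Create the secret scope (the API returns an error body if it already exists)
requests.post('https://%s/api/2.0/secrets/scopes/create' % DOMAIN,
              headers=headers, json={'scope': 'databricks'})

# Store the API token under the key read by dbutils.secrets.get below
requests.post('https://%s/api/2.0/secrets/put' % DOMAIN,
              headers=headers,
              json={'scope': 'databricks', 'key': 'token',
                    'string_value': '<token-to-store>'})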
import requests

# Databricks access credentials
DOMAIN = 'ACCOUNT.cloud.databricks.com'
TOKEN = dbutils.secrets.get(scope="databricks", key="token")
BASE_URL = 'https://%s/api/2.0/jobs/runs/export?run_id=' % (DOMAIN)

# Exports the notebook run with the given run id as a JSON object
def export_notebook(run_id):
    views_to_export = '&views_to_export=DASHBOARDS'
    response = requests.get(
        BASE_URL + str(run_id) + views_to_export,
        headers={'Authorization': 'Bearer %s' % TOKEN}
    )
    return response.json()
To be able to upload the files to the S3 bucket that is configured to host static webpages, we first retrieve the access and secret keys using the Databricks secrets utility.
The upload_to_s3 method takes the file name and actual content as parameters and creates a new file in the DBFS file store. Then, this file is uploaded to the previously defined S3 bucket.
import boto3

# AWS credentials and target bucket
ACCESS_KEY = dbutils.secrets.get(scope="aws-s3", key="access_key")
SECRET_KEY = dbutils.secrets.get(scope="aws-s3", key="secret_key")
ENCODED_SECRET_KEY = dbutils.secrets.get(scope="aws-s3", key="encoded_secret_key")
AWS_BUCKET_NAME = "bucket-static-webpages"

def upload_to_s3(file_name, file_content):
    # Check if file_name is a key in the dashboards dictionary
    if file_name not in dashboards:
        print("{} is not a key in the dictionary".format(file_name))
        return
    # Create (or overwrite) the file in the DBFS FileStore
    dbfs_path = "/FileStore/graph_file_static/{}.html".format(dashboards[file_name])
    try:
        dbutils.fs.rm(dbfs_path)
        dbutils.fs.put(dbfs_path, file_content)
    except:
        dbutils.fs.put(dbfs_path, file_content)
    # Upload the file from the FileStore to S3 as a public HTML page
    s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)
    with open("/dbfs" + dbfs_path, "rb") as f:
        s3.upload_fileobj(f, AWS_BUCKET_NAME,
                          "{}.html".format(dashboards[file_name]),
                          ExtraArgs={'ACL': 'public-read', 'ContentType': 'text/html'})
    print("File {} uploaded to S3".format(file_name))
Running the export and upload

The JSON response that we get from the export_notebook method includes all views (dashboards) related to the notebook that we executed. There, we can choose to upload to S3 as many dashboards as we need (stored in the dashboards dictionary), but in this example I'm only choosing to upload one.
# Maps dashboards to HTML files
dashboards = {
    'Network Graph Dashboard': 'network_graph'
}

# Get JSON response from the HTTP export request
response = export_notebook(run_id)

# For each dashboard, get its content and upload it to S3
for view in response.get("views"):
    upload_to_s3(view.get("name"), view.get("content"))
Finally, now that the dashboard is uploaded to S3 as a static HTML file, we will use the corresponding URL to visualize it in a Tableau dashboard. To do this, we just have to create a new dashboard and drag the Web Page object to the canvas. This will open a dialog box where you need to type the URL of the HTML file located in the S3 web hosting.
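Assuming the bucket from the earlier snippet (bucket-static-webpages) were hosted in, say, us-east-1, the URL to type would look something like the following; the exact host format depends on the bucket's region:

http://bucket-static-webpages.s3-website-us-east-1.amazonaws.com/network_graph.html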
Now, the Tableau dashboard will point to the URL where the exported Databricks notebook is located. If this needs to be updated frequently, you can set up a job that recreates the file from the Databricks notebook and replaces the previous file in the S3 bucket with the new one.
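One way to set this up is through the Jobs REST API. Below is a hedged sketch that creates a daily scheduled job running the network_graph_export notebook; the cluster id, notebook path and cron expression are placeholders:

import requests

DOMAIN = 'ACCOUNT.cloud.databricks.com'
TOKEN = dbutils.secrets.get(scope="databricks", key="token")

job_spec = {
    "name": "refresh-network-graph-export",
    "existing_cluster_id": "<cluster-id>",                                 # placeholder
    "notebook_task": {"notebook_path": "/Users/me/network_graph_export"},  # placeholder
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # every day at 06:00
        "timezone_id": "UTC"
    }
}

# Create the scheduled job; the response contains the new job_id
response = requests.post(
    'https://%s/api/2.0/jobs/create' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    json=job_spec
)
print(response.json())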
Translated from: https://medium.com/analytics-vidhya/visualize-databricks-dashboards-in-tableau-eae01b5c7219