Tableau Public for Data Visualization Using Shapefiles

Technology, 2022-07-12

Problem Context

Pedestrian activity in any country is a sign of vibrancy and vitality. As is often said, “Walking is important to a city”: pedestrian activity indicates the economic prosperity, safety, and convenience of the people living in a city. An automated pedestrian counting system developed by the City of Melbourne has produced an information repository that can be used to analyze the activity of residents across different parts of the city. The pedestrian counting sensors installed at various locations across the city transmit data to a central server, where it can support better decision making and planning for the future. This project takes a deep dive into the data gathered by these sensors to address two key questions:

How does pedestrian traffic volume vary across different parts of Melbourne?

How do the day of the week and the time of day affect overall pedestrian traffic volume?

Analytical Approach

A five-step iterative process of data gathering, attribute synthesis, data wrangling, hypothesis generation, and data visualization was used to derive key findings and insights for the two questions above. The process is outlined in Figure 1 below.

Figure 1. The analytical process followed to analyze the problem context. Note that the data used for this problem context was collated and made available beforehand. Image Credits: Developed by the Author using PowerPoint

Data Collation & Attribute Synthesis

Two separate datasets used for this analysis are outlined below:

Pedestrian Counting System - Sensor Locations: contains the spatial coordinates of the pedestrian sensor devices located around the City of Melbourne, extracted from data.melbourne.vic.gov.au

Pedestrian Counting System - 2019: contains the hourly pedestrian counts of each sensor during 2019, extracted from data.melbourne.vic.gov.au

The data dictionary, i.e. information about the different data attributes, is included in the Jupyter Notebook and can also be found here. The Pedestrian Counting System - Sensor Locations dataset has one row per Sensor ID (its primary key), whereas the Pedestrian Counting System - 2019 dataset has one row per Sensor ID and date-time (a composite primary key).
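The stated grain of each dataset can be checked by looking for repeated key values. The sketch below is illustrative only: it uses tiny in-memory stand-ins for the two CSV files, and the column names (`sensor_id`, `Sensor_ID`, `Date_Time`, `Hourly_Counts`) and all values are assumptions rather than a copy of the author's notebook. Plain Python is used so the example runs without extra dependencies.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; column names are assumptions for illustration.
locations_csv = """sensor_id,sensor_description
1,Location A
2,Location B
"""
counts_csv = """Sensor_ID,Date_Time,Hourly_Counts
1,2019-01-01 00:00,300
1,2019-01-01 01:00,240
2,2019-01-01 00:00,180
"""

def duplicate_keys(text, key_columns):
    """Return the key tuples that occur more than once in the CSV text."""
    rows = csv.DictReader(io.StringIO(text))
    seen = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in seen.items() if n > 1]

# An empty list means the chosen columns uniquely identify each row.
location_dupes = duplicate_keys(locations_csv, ["sensor_id"])
count_dupes = duplicate_keys(counts_csv, ["Sensor_ID", "Date_Time"])
print(location_dupes, count_dupes)
```

If either list is non-empty, the dataset is not at the assumed level and would need de-duplication before joining.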

Data Wrangling

Data wrangling is the process of structuring, cleaning, and transforming data into a format that supports better decision making with quicker turnaround times. The data was obtained in tabular format, so restructuring wasn’t required. However, a Jupyter notebook was developed to look for data anomalies, missing values, outliers, and other data issues. The findings from the wrangling exercise are outlined below:

Missing Values & Outliers

The Pedestrian_Counting_System_-_Sensor_Locations.csv file has information for 66 different sensor IDs, of which four sensors are missing direction information. A closer look showed that these sensors had either been removed or are inactive. The information about these sensors can be seen in Figure 2 below. The “note” column has missing values for the majority of the sensors.
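A missing-value check of this kind can be sketched as follows. The columns `note`, `direction_1`, and `direction_2` are named in the article; `sensor_id` and `status`, along with every value in the sample, are assumptions made up for illustration.

```python
import csv
import io

# Made-up sample in the assumed shape of the sensor locations file.
sample_csv = """sensor_id,status,note,direction_1,direction_2
1,A,,North,South
2,R,Removed,,
3,A,,East,West
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

def is_missing(value):
    """Treat absent or blank cells as missing."""
    return value is None or value.strip() == ""

# Number of missing cells in each column.
missing_per_column = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in rows[0]
}

# Sensors lacking either direction value.
sensors_without_direction = [
    row["sensor_id"]
    for row in rows
    if is_missing(row["direction_1"]) or is_missing(row["direction_2"])
]

print(missing_per_column)
print(sensors_without_direction)
```

Listing the affected sensor IDs, rather than only counting blanks, makes it easy to cross-check them against the status column.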

Figure 2. Snapshot of sensors with missing information for columns direction_1 and direction_2. Image Credits: Developed by the Author using Jupyter Notebook

None of the columns in the Pedestrian_Counting_System_2019.csv dataset had missing values. Summary statistics of the numerical columns showed that the figures are within range. The column “Hourly_Counts” showed significant skewness in the pedestrian counts: the 75th percentile and the maximum recorded pedestrian count differ substantially, which can be attributed to events, holidays, seasonal trends, etc. The summary statistics of sensors with at least 7% outliers are shown in Figure 3 below.
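One way to find sensors with at least 7% outliers is to compute, per sensor, the share of hourly counts falling outside the 1.5×IQR fences. This is a sketch under assumptions: the article does not say how the notebook defined outliers at this step, so the IQR rule (the one quoted in the Figure 4 caption) is assumed here, and the sample counts are invented.

```python
import statistics

def outlier_share(values):
    """Fraction of values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high) / len(values)

# Invented hourly counts for two hypothetical sensors.
hourly_counts = {
    "sensor A": [10, 12, 11, 13, 12, 11, 10, 12, 13, 400],
    "sensor B": [10, 12, 11, 13, 12, 11, 10, 12, 13, 14],
}

THRESHOLD = 0.07  # the 7% cut-off mentioned in the article

flagged = [
    sensor
    for sensor, counts in hourly_counts.items()
    if outlier_share(counts) >= THRESHOLD
]
print(flagged)
```

The quartile method affects the fences slightly on small samples; `method="inclusive"` is an arbitrary choice here.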

Box plot. Image Credits: Developed by the Author using Jupyter Notebook

Figure 3. Summary statistics and box plot of sensors with at least 7% outliers. Note that 7% is used as the threshold for the sake of simplicity. Since pedestrian counts can differ based on time, seasonality, events, and other factors, the outliers are not treated in this analysis. Image Credits: Developed by the Author using Jupyter Notebook

Sensor ID Mismatch

A closer look identified 8 sensor IDs that are part of the Pedestrian_Counting_System_-_Sensor_Locations.csv dataset but absent from the Pedestrian_Counting_System_2019.csv dataset. Findings below:

Sensor IDs 16, 38, 32, and 13, which were originally identified as removed or inactive, don’t have pedestrian count information in the Pedestrian_Counting_System_2019.csv dataset

Sensor IDs 63, 64, 65, and 66 were installed in 2020, so they have no pedestrian count information for 2019 and are missing from the Pedestrian_Counting_System_2019.csv dataset

Sensor IDs 15 (State Library) and 33 (Flinders St-Spring St (West)) are marked as R (Removed) and I (Inactive) but have pedestrian volume recorded in 2019. These sensors are retained on the hypothesis that they were removed only recently
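The mismatch above amounts to a set difference between the sensor IDs in the two files. The sketch below reuses the IDs quoted in the findings. Two things are assumptions: that the 66 location IDs run from 1 to 66, and the construction of `count_ids`, which is only a stand-in for the IDs that would be read from the 2019 file.

```python
# Assumption: the 66 sensor IDs in the locations file are 1..66.
location_ids = set(range(1, 67))

# Stand-in for the distinct sensor IDs read from the 2019 counts file.
count_ids = location_ids - {13, 16, 32, 38, 63, 64, 65, 66}

# IDs that have a location record but no 2019 counts.
missing_from_counts = sorted(location_ids - count_ids)

# IDs with counts but no location record (none expected).
missing_from_locations = sorted(count_ids - location_ids)

print(missing_from_counts)
print(len(count_ids))
```

Checking the difference in both directions also catches count rows whose sensor has no coordinates, which would otherwise silently drop off a map.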

Hypothesis Formulation

A top-down approach is used to break the problem into multiple factors (hypotheses). Based on data availability, these factors are then analyzed using visualizations in Tableau to identify key findings or trends. The treemap for hypothesis synthesis can be found here.

Data Exploration in Tableau

Data was explored in Tableau using visualizations to check for anomalies. Figure 4 shows that the pedestrian traffic volume collected by the sensors is concentrated toward the left of the histogram (low hourly counts) with a long tail of high values, and the box plot also shows the presence of outliers. As discussed earlier, outliers are not treated, because multiple factors such as seasonality, events, time, and day of the week affect the recorded pedestrian volume. Figure 5 shows that pedestrian volume is not recorded for a few of the sensors, e.g. Sensors 32 and 63–66, as illustrated in the bar graph below. Since most of the analysis is based on pedestrian volume by region and date-time, these sensors weren’t dropped.

Figure 4. Histogram and box plot of the recorded pedestrian volume. Most of the data sits toward the left of the histogram, with a long tail of high values that indicates the presence of outliers. Anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is treated as an outlier. Q1 and Q3 are the first and third quartiles and IQR is the interquartile range (Q3 − Q1). Image Credits: Developed by the Author using Tableau

Figure 5. #Sensors (# signifies counts) by suburb and pedestrian volume by Sensor ID. Image Credits: Developed by the Author using Tableau

Data Visualization

Impact of Demographics on Pedestrian Traffic Volume

The processed data is uploaded to Tableau along with two shapefiles to understand the impact of demographics, suburbs, and areas of interest on the pedestrian count in 2019. The visualizations are shown in Figures 6, 7, and 8 below.

Figure 6. Pedestrian count recorded by sensors across Melbourne in 2019. The density represents the total pedestrian volume in 2019. Image Credits: Developed by the Author using Tableau

Findings:

1. Pedestrian traffic volume is present in the dataset for only 58 sensors, most of them placed in and around the CBD

2. Land use can broadly be classified into recreational, transport, agriculture, residential, and commercial. Areas that fall under commercial and recreational use are expected to have higher footfall than the rest. Pedestrian density recorded in 2019 is much higher in and around the CBD than in other areas, as visible in Figure 6 above. The area between La Trobe Street and Eureka Tower, which hosts several MNCs and recreational spots, saw more pedestrians in 2019 than the surrounding areas

3. Sensors closer to metro stations like Southern Cross and Melbourne Central have higher pedestrian counts than other sensor spots

4. The Activity Centre Zone (ACZ) is the preferred tool to guide and facilitate land-use planning in activity centres. Figure 7 below shows the presence of activity centres in and around the area under analysis. The Inner Metro and Inner South East regions have a higher number of activity centres and saw higher pedestrian volume in 2019, as captured by the sensors in and around these areas

Figure 7. Activity centres in and around the zones where sensors are placed. Activity centres are plotted using a shapefile (Australian Government, n.d.). Image Credits: Developed by the Author using Tableau

Figure 8. Average pedestrian count recorded across suburbs with sensors. Since different suburbs have different numbers of sensors installed, the average value is used to ensure a like-for-like comparison. The findings are in line with the earlier discussion: the suburbs Melbourne (Melbourne City) and South Wharf have higher average pedestrian volumes than the other suburbs. Docklands experienced slightly higher footfall than Parkville, North Melbourne, Carlton, and East Melbourne (Department of Environment, n.d.). Image Credits: Developed by the Author using Tableau
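The like-for-like comparison in Figure 8 can be sketched as total volume per suburb divided by the number of sensors in that suburb. The article does not spell out the exact averaging used in Tableau, so this definition is an assumption, and the sensor-to-suburb mapping and volumes below are invented.

```python
from collections import defaultdict

# Invented mapping of sensor ID to suburb and annual volume per sensor.
sensor_suburb = {1: "Melbourne", 2: "Melbourne", 3: "Docklands"}
annual_volume = {1: 1000, 2: 3000, 3: 1500}

def average_volume_by_suburb(sensor_suburb, annual_volume):
    """Total volume per suburb divided by its number of sensors."""
    totals = defaultdict(int)
    sensors = defaultdict(int)
    for sensor_id, suburb in sensor_suburb.items():
        totals[suburb] += annual_volume[sensor_id]
        sensors[suburb] += 1
    return {suburb: totals[suburb] / sensors[suburb] for suburb in totals}

averages = average_volume_by_suburb(sensor_suburb, annual_volume)
print(averages)
```

Comparing raw totals would favour suburbs with more sensors: in this toy data the totals differ by 4000 to 1500, while the per-sensor averages differ by only 2000 to 1500.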

Impact of Day of the Week and Time on Pedestrian Traffic Volume

Figures 9 and 10 below show how pedestrian volume varies across the days of the week and the time of day. Time of day is grouped into buckets based on the volume usually expected.

Figure 9. Bar graph and treemap illustrating the variation of pedestrian volume across days of the week and times of the day, respectively, in 2019. Image Credits: Developed by the Author using Tableau

Figure 10. Average pedestrian volume across different times of the day captured by each sensor in 2019. The definition of “Time of the Day” is provided below. Note that the graph compares average pedestrian traffic volume after normalizing by the number of weeks and the number of sensors under consideration. Image Credits: Developed by the Author using Tableau

Time of the Day: 0–6: Post 12AM till 6AM, 6–8: Early Morning, 8–12: Peak Morning, 12–16: Afternoon, 16–18: Early Evening, 18–20: Peak Evening, 20–24: Night.
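The bucket definition above maps directly to a small lookup function. The labels and hour boundaries come from the article. The treatment of boundary hours (start inclusive, end exclusive, so 06:00 counts as Early Morning) is an assumption, since the article does not say which bucket a boundary hour belongs to.

```python
# (start_hour, end_hour, label); start inclusive, end exclusive (assumed).
BUCKETS = [
    (0, 6, "Post 12AM till 6AM"),
    (6, 8, "Early Morning"),
    (8, 12, "Peak Morning"),
    (12, 16, "Afternoon"),
    (16, 18, "Early Evening"),
    (18, 20, "Peak Evening"),
    (20, 24, "Night"),
]

def time_of_day(hour):
    """Return the time-of-day bucket for an hour in the range 0..23."""
    for start, end, label in BUCKETS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour must be between 0 and 23, got {hour}")

for hour in (0, 7, 13, 19, 23):
    print(hour, time_of_day(hour))
```

In Tableau the same mapping would typically live in a calculated field; keeping the boundaries in one table makes them easy to adjust.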

Findings:

Pedestrian traffic volume is higher on weekdays than on the weekend (refer to Figure 10)

Afternoon, Peak Morning, and evenings (both Early and Peak Evening) have higher pedestrian volume than the other time buckets (refer to Figure 10)

Peak Mornings and Evenings would be expected to have the highest footfall. However, because the data is aggregated across all days of the week, and weekends see substantial afternoon footfall owing to recreation and other activities, the total pedestrian volume comes out higher in the Afternoon bucket

Interestingly, pedestrian volume at Night is much higher on Fridays and Saturdays than on other days of the week (refer to Figure 11)

This increase in pedestrian volume can be attributed to recreational activities on weekends, specifically on Friday and Saturday nights

On weekdays, pedestrian volume stayed flat across days within each time bucket, e.g. the pedestrian traffic volume during Peak Morning hours was about the same from Monday to Friday

Pedestrian volume in the bucket “Post 12AM till 6AM” is significantly higher on Saturdays and Sundays, indicating the popularity of nightlife in Melbourne on weekends (refer to Figure 11)
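Day-of-week comparisons like those in the findings rest on grouping hourly records by weekday and averaging. The sketch below shows that grouping on a few invented late-evening records. The timestamp format and the values are assumptions, and the average is taken per record, not normalized by weeks and sensors as in the article's charts.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Invented (timestamp, hourly count) records for one hypothetical sensor.
records = [
    ("2019-03-01 22:00", 500),  # Friday
    ("2019-03-02 22:00", 520),  # Saturday
    ("2019-03-04 22:00", 200),  # Monday
    ("2019-03-08 22:00", 540),  # Friday
]

def average_by_day(records):
    """Mean hourly count for each day of the week present in the data."""
    totals = defaultdict(int)
    n = defaultdict(int)
    for stamp, count in records:
        day = DAY_NAMES[datetime.strptime(stamp, "%Y-%m-%d %H:%M").weekday()]
        totals[day] += count
        n[day] += 1
    return {day: totals[day] / n[day] for day in totals}

by_day = average_by_day(records)
print(by_day)
```

Averaging instead of summing keeps days that occur more often in the sample from looking busier than they are.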

About the Author: Advanced analytics professional and management consultant helping companies find solutions to diverse problems through a mix of business, technology, and math applied to organizational data. A Data Science enthusiast, here to share, learn, and contribute. You can connect with me on LinkedIn and Twitter.

Translated from: https://towardsdatascience.com/tableau-public-for-data-visualization-using-shape-files-1782c9930f9