MTA Operations Analysis

    Tech, 2025-02-16


    For Project 1 of the Metis Data Science Bootcamp (Singapore Batch 5), we were tasked with an exploratory data analysis (EDA) of MTA turnstile data to advise a fictitious non-profit organization, WomenTechWomenYes (WTWY), on the optimal placement of street teams (at entrances to NYC subway stations) for social engagement. WTWY wants to invite interested individuals to its annual gala to raise awareness and increase the participation of women in tech. The street teams' goal is to collect as many emails as possible and give out free tickets to the gala. In my analysis, I have made the following assumptions:


    Assumptions

    WTWY is constrained by time and manpower, so the analysis should identify the top stations by traffic as well as the peak periods at those stations.

    Individuals who are interested in tech are more likely to be encountered in the city center, which has a denser cluster of tech corporate offices.

    The WTWY gala is imminent, so one week of MTA turnstile data is analyzed as a sample for the weeks leading up to the gala.


    Data Exploration and Cleaning

    The MTA turnstile data was scraped for the week of 22 August 2020 to 28 August 2020. The first few rows of the resulting data frame give a sense of its structure.


    The descriptions of the column features are given in the data set's documentation. Further exploration reveals that the 'ENTRIES' and 'EXITS' columns are cumulative counter readings that increase with time. The rows of the 'TIME' column progress in blocks of approximately 4 hours, and each turnstile is identified by a unique combination of 'UNIT' and 'SCP' values. Some rows were found to be duplicates and were removed.
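
The duplicate removal can be sketched roughly as follows. The article does not say which columns define a duplicate, so this sketch assumes one reading per turnstile ('UNIT' plus 'SCP') per timestamp ('DATE' plus 'TIME'); the toy values are made up for illustration.

```python
import pandas as pd

# Toy rows; column names follow the article, values are invented.
df = pd.DataFrame({
    "UNIT": ["R051", "R051", "R051", "R051"],
    "SCP": ["02-00-00", "02-00-00", "02-00-00", "02-00-01"],
    "DATE": ["08/22/2020", "08/22/2020", "08/22/2020", "08/22/2020"],
    "TIME": ["00:00:00", "04:00:00", "04:00:00", "00:00:00"],
    "ENTRIES": [7450000, 7450010, 7450010, 1200000],
    "EXITS": [2530000, 2530004, 2530004, 800000],
})

# Assumption: a turnstile should report at most once per timestamp.
key = ["UNIT", "SCP", "DATE", "TIME"]
n_duplicates = int(df.duplicated(subset=key).sum())

# Keep the first reading of each (turnstile, timestamp) pair.
df_clean = df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
```

Counting the duplicates before dropping them is a cheap sanity check on how much data is being discarded.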


    To find the number of people that enter and exit each turnstile, I first sort the rows of the data frame by the columns ['UNIT', 'SCP', 'DATE', 'TIME'], in that order. Then, using the diff() method for Series, I take the difference between consecutive rows of 'ENTRIES' and 'EXITS' to form two new columns. The new columns (named 'ENTRY_DELTA' and 'EXIT_DELTA') represent the actual number of entries and exits through each turnstile during a particular 4-hour period. Nonetheless, some of the resulting entry and exit values turn out to be either negative or astronomically high.
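
A minimal sketch of the sort-then-diff step on made-up counter readings for two turnstiles. The column names follow the article; the numbers are invented.

```python
import pandas as pd

# Two turnstiles (same UNIT, different SCP), three readings each.
df = pd.DataFrame({
    "UNIT": ["R051"] * 6,
    "SCP": ["02-00-01"] * 3 + ["02-00-00"] * 3,
    "DATE": pd.to_datetime(["2020-08-22"] * 6),
    "TIME": ["00:00:00", "04:00:00", "08:00:00"] * 2,
    "ENTRIES": [500, 520, 560, 100, 130, 190],
    "EXITS": [900, 905, 925, 40, 50, 75],
})

# Sort so that each turnstile's readings are contiguous and in time order.
df = df.sort_values(["UNIT", "SCP", "DATE", "TIME"]).reset_index(drop=True)

# Difference between consecutive cumulative readings.
df["ENTRY_DELTA"] = df["ENTRIES"].diff()
df["EXIT_DELTA"] = df["EXITS"].diff()
```

In this toy data, the fourth row after sorting is the first reading of the second turnstile, so its delta (310 entries) is the difference between two unrelated counters. That is the kind of spurious value described above. One alternative, not used in the article, would be to take the difference within each turnstile via `df.groupby(["UNIT", "SCP"])["ENTRIES"].diff()`, which leaves those boundary rows as NaN instead.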


    Further analysis showed that these anomalies arise where consecutive rows belong to two different turnstiles, or where the counter of a particular turnstile was reset. To correct them, each anomalous value is replaced with the mean of the preceding and succeeding entry or exit values. This intervention is reasonable because traffic in one time period can be approximated by interpolating between the adjacent periods.


    import numpy as np

    # Mark anomalous deltas (counter resets, turnstile boundaries) as NaN.
    anomalous = (df['ENTRY_DELTA'] > 10000) | (df['ENTRY_DELTA'] < 0)
    df.loc[anomalous, 'ENTRY_DELTA'] = np.nan

    # Replace each NaN with the mean of the values before and after it.
    delta_list = list(df['ENTRY_DELTA'])
    for ind, value in enumerate(delta_list):
        if np.isnan(value):
            # Slicing avoids wrapping around at index 0 and running past the end.
            neighbours = delta_list[max(ind - 1, 0):ind] + delta_list[ind + 1:ind + 2]
            delta_list[ind] = np.nanmean(neighbours)
    df['ENTRY_DELTA_1'] = delta_list

    Thereafter, the total traffic for each turnstile in each time period is calculated by summing the entry and exit values. I named this column 'ENTRY_EXIT'.
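
As a rough sketch, the total-traffic column is a row-wise sum of the two cleaned delta columns. 'ENTRY_DELTA_1' appears in the code above; 'EXIT_DELTA_1' is an assumed name for its exit counterpart, and the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "STATION": ["A", "A", "B"],
    "ENTRY_DELTA_1": [30.0, 60.0, 20.0],
    "EXIT_DELTA_1": [10.0, 25.0, 5.0],  # assumed column name
})

# Total traffic per turnstile reading = entries + exits.
df["ENTRY_EXIT"] = df["ENTRY_DELTA_1"] + df["EXIT_DELTA_1"]

# Aggregating by station gives the weekly totals used for ranking.
station_totals = (
    df.groupby("STATION")["ENTRY_EXIT"].sum().sort_values(ascending=False)
)
```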


    Results

    Now we are ready to visualize the data and derive insights on where WTWY could place its street teams. Using pandas' groupby, Matplotlib and Seaborn, I plot a histogram, a bar chart, and a few heat maps.


    import matplotlib.pyplot as plt
    import seaborn as sns

    # Total traffic per station over the week, busiest first.
    group_station = df.groupby('STATION')['ENTRY_EXIT'].sum().sort_values(ascending=False)

    fig1 = plt.figure(figsize=[8, 6])
    # histplot replaces distplot(kde=False), which is deprecated since seaborn 0.11.
    ax1 = sns.histplot(group_station, bins=50)
    plt.xlim([0, 410000])
    plt.ylabel('No. of Stations', fontsize=15, weight='bold')
    plt.xlabel('Total Traffic', fontsize=15, weight='bold')
    # Bracket marking where the top 10 stations fall on the x-axis.
    ax1.annotate('Top 10 Stations', xy=(0.73, 0.08), xytext=(0.73, 0.12),
                 xycoords='axes fraction', fontsize=12, ha='center', va='bottom',
                 arrowprops=dict(arrowstyle='-[, widthB=9.0, lengthB=1', lw=1.0),
                 color='blue')
    plt.title('Distribution of Traffic Across Stations', fontsize=15, weight='bold')
    sns.despine()

    The histogram shows that the distribution of traffic across all MTA stations in New York City is heavily right-skewed, and that the top 10 stations by traffic are outliers in the distribution. This is a clear indication that WTWY could focus its engagement efforts on the top 10 stations.


    plt.figure(figsize=[8, 5])
    ax = sns.barplot(data=group_station.head(10).reset_index(),
                     x='ENTRY_EXIT', y='STATION', palette='rainbow')
    plt.xlabel('Total Traffic', weight='bold', fontsize=15)
    plt.ylabel('Top 10 Stations', weight='bold', fontsize=15)
    # Label the x-axis ticks in thousands (0k, 50k, ..., 400k).
    ticks = range(0, 400001, 50000)
    plt.xticks(ticks, [f'{t // 1000}k' for t in ticks])
    plt.title('Busiest MTA Stations from 22/8 to 28/8', weight='bold', fontsize=15)
    # Print each bar's total (in thousands) at the end of the bar.
    for p in ax.patches:
        ax.annotate(f'{int(p.get_width() / 1000)}k', (p.get_width(), p.get_y() + 0.5))
    sns.despine()

    Zooming in, the bar chart of the top 10 stations reveals that the 34 St-Penn and 34 St-Herald Square stations have notably more traffic than the rest, and should be prioritized.


    # Keep only the 10 busiest stations; .copy() avoids SettingWithCopyWarning.
    top10 = list(group_station.head(10).index)
    df_10 = df[df['STATION'].isin(top10)].copy()
    df_10['WEEKDAY'] = df_10['DATE'].dt.day_name()  # requires DATE to be a datetime column

    # Station x weekday matrix of total traffic, ordered by rank and by day.
    days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    matrix_station_day = (df_10.groupby(['STATION', 'WEEKDAY'])['ENTRY_EXIT'].sum()
                          .unstack()
                          .reindex(index=top10, columns=days))
    # Cell labels in thousands, e.g. '123.4k'.
    labels = (matrix_station_day / 1000).round(1).astype(str) + 'k'

    fig2 = plt.figure(figsize=[10, 10])
    ax2 = sns.heatmap(matrix_station_day, cmap='Blues', linecolor='white',
                      linewidths=1, annot=labels.to_numpy(), fmt='')
    plt.xlabel('Day of the Week', fontsize=15)
    plt.ylabel('Top 10 Stations', fontsize=15)
    plt.title('Station Traffic in the Week', weight='bold', fontsize=15)

    Adding the dimension of day, the heat map of the top 10 stations across the week shows that people use the subway more on weekdays than on weekends. The trend holds for all 10 stations.


    PERIODS = ["12am-4am", "4am-8am", "8am-12pm", "12pm-4pm", "4pm-8pm", "8pm-12am"]
    DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

    def timeperiod(time):
        # Map a datetime.time to its 4-hour block, e.g. 09:30 -> "8am-12pm".
        return PERIODS[time.hour // 4]

    # TIME is expected to hold datetime.time objects.
    df_10['TIME_PERIOD'] = df_10['TIME'].apply(timeperiod)

    # One weekday x time-period traffic matrix per top-10 station.
    stations = list(group_station.head(10).index)
    matrix_list = []
    for station in stations:
        df_station = df_10[df_10['STATION'] == station]
        matrix_day_time = (df_station.groupby(['WEEKDAY', 'TIME_PERIOD'])['ENTRY_EXIT'].sum()
                           .unstack()
                           .reindex(index=DAYS, columns=PERIODS))
        matrix_list.append(matrix_day_time)

    # 2 x 5 grid of heat maps; only the first station's colorbar is drawn,
    # and each heat map is scaled to its own station's range.
    fig, axn = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(15, 6))
    cmap = sns.cubehelix_palette(light=1, as_cmap=True)
    cbar_ax = fig.add_axes([.91, .3, .03, .4])
    for i, ax in enumerate(axn.flat):
        sns.heatmap(matrix_list[i], ax=ax, cmap=cmap, cbar=i == 0,
                    cbar_ax=None if i else cbar_ax, linecolor='white', linewidths=0.5)
        ax.set_title(stations[i])
        ax.set_xlabel('')
        ax.set_ylabel('')

    Breaking the heat map down further into day versus time period for each station reveals another interesting fact: stations are generally busier in the late afternoon and evening, even on weekdays. This comes as no surprise. During the Covid pandemic, many companies in NYC adopted work-from-home arrangements, so the usual morning rush-hour crowd was largely absent. Furthermore, deploying street teams on weekday mornings could be counter-productive, as the remaining essential workers would be busy getting to work and would be less likely to stop for a conversation.


    With another heat map, we analyze the net entries and exits of commuters at each station (red regions represent net entry, and blue regions represent net exit). This lets us identify the stations located in denser residential or hotel areas: Flushing-Main and 42 St-Port Auth. The evidence is their net exits during the evenings, which implies that people are returning home. As these stations are not close to corporate offices, individuals interested in tech are less likely to be found in this pool of commuters.
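
The article does not include the code for this net-flow view. A minimal sketch of how such a matrix might be computed for one station is below; 'EXIT_DELTA_1' and 'NET_ENTRY' are assumed names, and the numbers are invented to mimic a residential-style pattern.

```python
import pandas as pd

# Toy per-period entries and exits for a single station.
df_station = pd.DataFrame({
    "WEEKDAY": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "TIME_PERIOD": ["8am-12pm", "4pm-8pm", "8am-12pm", "4pm-8pm"],
    "ENTRY_DELTA_1": [900.0, 300.0, 850.0, 280.0],
    "EXIT_DELTA_1": [200.0, 1100.0, 250.0, 1000.0],  # assumed column name
})

# Positive = more people entering than exiting; negative = net exit.
df_station["NET_ENTRY"] = df_station["ENTRY_DELTA_1"] - df_station["EXIT_DELTA_1"]

# Weekday x time-period matrix of net flow.
net_matrix = (
    df_station.groupby(["WEEKDAY", "TIME_PERIOD"])["NET_ENTRY"].sum()
    .unstack()
    .reindex(index=["Monday", "Tuesday"], columns=["8am-12pm", "4pm-8pm"])
)
```

Passing `net_matrix` to `sns.heatmap` with a diverging colormap such as `cmap="coolwarm"` and `center=0` would color net entries red and net exits blue, matching the convention described above.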


    Conclusions

    From our analysis, we can conclude that WTWY should focus its street engagement efforts on the top 10 stations, ideally on weekdays from late afternoon to evening. Moreover, if manpower is further constrained, the Flushing-Main and 42 St-Port Auth stations could be skipped, as they are potentially residential and tourist areas without tech corporate offices nearby.


    Finally, I hope my exploratory data analysis on MTA turnstile data has generated interesting insights, and I look forward to showcasing other upcoming data projects from the Metis Data Science Bootcamp. Stay tuned!


    Translated from: https://towardsdatascience.com/mta-turnstile-traffic-analysis-to-optimize-street-engagements-a7391adc4d45
