How I Built Time Series Data out of Cross-Sectional Uber Travel Times Data
I knew I wanted to do two things in the process of writing my bachelor’s thesis: improve my programming skills and work with time-series data prediction.
What I didn’t know, however, was what I wanted to study. It had to be something I truly liked, and not necessarily connected to my major.
Since I like maps and… things that move, I decided to somehow use data from a website I had recently come across and fallen in love with (probably after playing many hours of SimCity as a teenager) — the Uber Movement website.
It allows you to visualize anonymized data for average travel times from a certain point (or zone) to any other point in that same city.
Sample travel times for my home town, São Paulo.

“Great!”, I thought. “Let me just download the travel times and plot some numbers so I can move on to exploratory data analysis (EDA).”
As it turns out, the data I wanted was not so easy to extract.
In order to have a more statistically precise outcome, I needed all the data I could get. The more rows of data I had, the greater the predictive power of my models (potentially). The smaller the time increment, the better. Therefore, I needed to get my hands on daily travel times.
The Uber Movement website allows you to download data from any zone to every other zone in the city. However, there’s a catch: whatever date range you request travel times for, the download does not consist of daily data.
That is, if you select to download values from January 2020 to March 2020, you won’t receive 90 values, which is roughly the number of days in that range. Rather, it spits out a csv file with a single value: the three-month average travel time for each pair of zones.
Different formats of travel times data you can download for every given pair of zones.

This meant that I had to compromise on the number of data points over time in order to get lots of values for a single point in time.
TL;DR: Uber Movement does not provide time series data, but rather monochronic/cross-sectional data.
There were, of course, other limiting factors, such as the fact that I’d only be getting what I call “radial” data: travel times not between every possible pair of zones, but only from the center-most point to the other points. Statistically speaking, I wasn’t sure whether this would give me a reasonably accurate measure of average travel times for a given city.
Here’s a (very ugly) depiction of what the two geospatial types of data you can download look like:
“Radial” data vs. every pair of zones.

There was still hope, though.
The maximum date range that can be selected is three months. The minimum, on the other hand, is one day, which is exactly what I needed.
The Uber Movement calendar where you can choose the desired date range.

So I thought: “why not click on each day and download its respective data set?”
Well… I tried.
I knew automating this process would be much easier, but I didn’t have the coding skills, nor did I know what packages I needed to use for this task.
It took me around 6 minutes and 30 seconds to download data for a single month. Among the available cities, the largest number of days of data is 3.25 years (January 2016 to March 2020), or 39 months. It would take a total of 253 minutes, over 4 hours of continuous clicking, to download all the data sets for a single city. In total, I wanted to extract data from 31 cities, so multiplying 4 hours by 31 cities equals roughly 5 days of manual downloading.
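A quick back-of-envelope check of those numbers, using the figures quoted above:

# Rough estimate of the manual download effort
minutes_per_month = 6.5   # ~6 min 30 s measured per month of data
months_per_city = 39      # longest available history per city
cities = 31

minutes_per_city = minutes_per_month * months_per_city  # 253.5 minutes
hours_per_city = minutes_per_city / 60                  # ~4.2 hours
days_total = hours_per_city * cities / 24               # ~5.5 days of nonstop clicking
print(round(minutes_per_city), round(hours_per_city, 1), round(days_total, 1))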
And that’s with no eating or sleeping.
It was time to address the elephant in the room. Since I didn’t want to spend almost an entire week of my life clicking through download buttons, I had to automate this.
I essentially used two Python packages to automate the downloading of the daily data sets: Selenium, for automating actions (like clicking and waiting for page responses), and datetime, to work with a proper date/time data type instead of plain strings.
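For context, here is a minimal sketch of that kind of setup. The download directory and the Chrome preference are assumptions for illustration, not the exact values from my script:

# Sketch: a Chrome driver that saves csv files to a folder without prompting
from datetime import datetime, timedelta
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'download.default_directory': '/path/to/downloads'})  # hypothetical path
driver = webdriver.Chrome(options=options)

# datetime gives a proper date type, so stepping through days is trivial
date = datetime(2016, 2, 2)
next_day = date + timedelta(days=1)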
A big chunk of the code consists of simple assignments of XPaths and CSS selectors to variables, and clicking actions performed on those elements. This was in and of itself a learning exercise for me, since I’d never used the Google Chrome developer tools.
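As an illustration of that pattern (the selectors below are hypothetical placeholders, not the actual ones from the Uber Movement page):

# Assign web elements to variables via XPath/CSS selectors, then click them
from selenium.webdriver.common.by import By

download_button = driver.find_element(By.XPATH, '//button[@data-test="download"]')  # hypothetical XPath
csv_link = driver.find_element(By.CSS_SELECTOR, 'a.csv-download')                    # hypothetical CSS selector
download_button.click()
csv_link.click()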
Cool, I just learned to automate clicks. Then, I had the idea of clicking through each day on the calendar. Easy-peasy.
However, I was now faced with a big hurdle.
In order for me to iterate through the dates, I had to go back from the page where the download button was located to the window with the calendar (so I could click on the next date). To achieve that, I had to refresh the page. And when you refresh the page, the DOM (Document Object Model) also updates.
Meaning, any code referencing a specific web element won’t work on the refreshed page, because the web elements will have become stale.
Selenium, for example, would give me the following error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

This was a nightmare and a challenge I couldn’t overcome, no matter how many StackOverflow posts I read. Desperation ensues.
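For illustration, here is a minimal sketch of the pattern that triggers it, assuming any clickable element on the page:

# A reference obtained before a refresh goes stale after it
button = driver.find_element(By.XPATH, '//button')  # grab a reference to a DOM node
driver.refresh()                                    # the DOM is rebuilt from scratch
button.click()  # raises StaleElementReferenceException: the old node no longer exists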
So I contacted a past coworker who I knew had experience with CSS and Java to see if he could help me. Although my hopes were low, he came up with a simple yet elegant idea: “why, instead of clicking on the dates and having to refresh the page, don’t you update the URL itself to contain the date you want?”
This was a game changer for my project. (Valeu Eduardo!).
If you look closely at the Uber Movement URL for a given city, you’ll notice that it contains all the information you need to land on the exact page you want.
I just had to create an initial URL with info such as the city name, the desired zone type and the zone origin code, and then figure out an algorithm to iterate through the dates and update the URL accordingly.
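Pieced together, a generated URL for a single day (here February 2, 2016, with the city, origin code, zone type and coordinates left as placeholders) looks like this:

https://movement.uber.com/explore/<city>/travel-times/query?si<origin_code>&ti=&ag=<zone_type>&dt[tpb]=ALL_DAY&dt[wd;]=1,2,3,4,5,6,7&dt[dr][sd]=2016-02-02&dt[dr][ed]=2016-02-02&cd=&sa;=&sdn=<coordinates>&lang=en-US

Note that the start date (dt[dr][sd]) and the end date (dt[dr][ed]) are the same day; that is what yields one data set per day.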
The URL-creating function is as follows:
# Create URLs for the desired date range
def getURL():
    """Function that creates one URL per date between the specified date range"""
    date = datetime(2016, 2, 2)
    while date <= datetime(2020, 3, 31):
        yield ('https://movement.uber.com/explore/' + city + '/travel-times/query?si'
               + origin_code + '&ti=&ag=' + zone_type
               + '&dt[tpb]=ALL_DAY&dt[wd;]=1,2,3,4,5,6,7'
               + '&dt[dr][sd]=' + date.strftime('%Y-%m-%d')
               + '&dt[dr][ed]=' + date.strftime('%Y-%m-%d')
               + '&cd=&sa;=&sdn=' + coordinates + '&lang=en-US')
        date += timedelta(days=1)

Lastly, we just have to create an iterating mechanism that executes the next URL from the getURL function.
# Perform iteration through the URLs, downloading the data set for each one
iterated_URLs = []
i = 0
print('Number of generated URLs: ' + str(len(list(getURL()))))
for url in getURL():
    i += 1
    driver.execute_script("window.open('" + url + "', '_self')")
    iterated_URLs.append(url)

Here’s the bot in action:
The data set download bot in action.

That’s it!
The next steps would be to concatenate the csv files into one and perform cleaning and formatting to start EDA, which can be done easily with Pandas.
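A minimal sketch of that step, assuming all the downloaded csv files sit in a single folder:

# Concatenate all downloaded daily csv files into one DataFrame for cleaning and EDA
import glob
import pandas as pd

files = sorted(glob.glob('downloads/*.csv'))  # hypothetical folder
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv('all_travel_times.csv', index=False)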
With this, anyone can make the most out of the amazing data that Uber has provided to us for free. Whereas time series analysis was previously not possible due to the structure in which the data is provided on their website, we can now expand the scope of research done on Uber travel times and geospatial data.
Of course, these two pieces of code are not the entire story, so you can check out this and other projects of mine on GitHub.
Translated from: https://towardsdatascience.com/how-i-built-time-series-data-out-of-cross-sectional-uber-travel-times-data-e0de5013ace2