Finding out who the biggest Airbnb home owners are in Amsterdam

Technology 2025-02-28

Hi data science enthusiasts! It is my pleasure to write to you for the first time, and I am happy to share my passion for data science and machine learning with the community. Today I will be analyzing the Airbnb market in Amsterdam.

Amsterdam is a very popular destination among tourists as it receives well over 15 million tourists a year. Amsterdam is the first city to implement taxation specific to Airbnb renting (10% tourist tax, which is a lot compared to other jurisdictions), and as such provides an interesting use case to find answers to some of my intriguing questions.

The questions I will answer using Airbnb data are:

How many hosts are running a business with multiple listings and where are they located in Amsterdam?

Who are the top 10 Airbnb hosts by number of listings, and what share of the Amsterdam Airbnb market do they hold?

Can I develop an accurate regression model that would predict the price of an Airbnb listing in Amsterdam using available attributes? Which attributes are good predictors for price?

You can find my Python code on GitHub and follow along.

Data

As Airbnb does not publish its own data on listings, I will be using the Amsterdam Airbnb dataset from Inside Airbnb, which is an independent third party that publishes datasets on Airbnb listings from major cities across the world.

The datasets are very handy for this kind of analysis. They are updated frequently and lend themselves well to regression analysis for predicting nightly listing prices, to time-based analysis at the listing level, and to visualizing these interesting data points.

I am using the Amsterdam Airbnb dataset (as of 18.08.2020) which can be found here

Part I: How many hosts are running a business with multiple listings in Amsterdam?

The straightforward answer is 2161 hosts. That is roughly 13.17% of all hosts, so slightly more than one in ten Airbnb hosts in Amsterdam have multiple listings and are probably running a home rental business.

Hosts with multiple listings and the proportion
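
As a rough illustration of how such a count can be derived (this is a sketch, not the code from the GitHub repository), the snippet below counts listings per host and reports how many hosts have more than one. The toy rows and the names `listings` and `multi_listing_hosts` are hypothetical; with the real dataset the host identifier would come from the listings file.

```python
from collections import Counter

# Hypothetical toy data: one (listing_id, host_id) pair per listing.
listings = [
    (101, "A"), (102, "A"), (103, "B"),
    (104, "C"), (105, "C"), (106, "C"), (107, "D"),
]

def multi_listing_hosts(rows):
    """Return the hosts with more than one listing and their share of all hosts (in %)."""
    per_host = Counter(host_id for _, host_id in rows)  # listings per host
    multi = sorted(h for h, n in per_host.items() if n > 1)
    share = 100 * len(multi) / len(per_host)
    return multi, share

hosts, share_pct = multi_listing_hosts(listings)
print(hosts, f"{share_pct:.2f}%")
```

Note that the percentage here is a share of hosts, not of listings, which matches the "one in ten hosts" reading above.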

Part II: What are the top 10 Airbnb hosts based on the number of listings they have and how much do they constitute as a percent of the Amsterdam Airbnb market?

The top Airbnb hosts based on the number of listings they have are as follows:

Top 10 hosts based on number of listings

Together, these ten hosts own 351 homes and make up around 1.85% of the Airbnb homes in Amsterdam, which is a remarkable finding.
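
The ranking and the market share can be sketched in a few lines. This is only an illustration on made-up data; `listing_hosts` and `top_hosts` are hypothetical names, and the share is computed as listings held by the top hosts divided by all listings.

```python
from collections import Counter

# Hypothetical toy data: the host id of each listing, one entry per listing.
listing_hosts = ["A", "A", "A", "B", "B", "C", "D", "E", "E", "E"]

def top_hosts(host_ids, n=10):
    """Return the n hosts with the most listings and their combined share of all listings (in %)."""
    counts = Counter(host_ids)
    top = counts.most_common(n)  # [(host_id, number_of_listings), ...]
    share = 100 * sum(c for _, c in top) / len(host_ids)
    return top, share

top2, share2 = top_hosts(listing_hosts, n=2)
print(top2, f"{share2:.2f}%")
```

By the same arithmetic, 351 homes at 1.85% would imply a market of roughly 19,000 listings in this snapshot.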

Part III: Can I develop an accurate regression model that would predict the price of an Airbnb listing in Amsterdam using available attributes? Which attributes are good predictors for price?

I ran three different models that differ only in feature selection:

Model 1: Feature selection using correlation threshold of 10%

Model 2: All features

Model 3: Subset of features chosen by Information Gain feature selection

Measuring model performance

I will be using R-squared (R²) as my performance score. It measures the strength of the relationship between the model and the dependent variable and indicates the proportion of variance in the target variable that the model explains. A perfect model scores 1 and a model that always predicts the mean scores 0; on held-out data the score can also fall below 0.
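
To make the score concrete, here is a minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot, written in plain Python. The price lists are invented for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical nightly prices and three sets of predictions.
prices = [100.0, 150.0, 200.0, 250.0]
close_preds = [110.0, 140.0, 210.0, 240.0]   # small errors
mean_preds = [175.0, 175.0, 175.0, 175.0]    # always predict the mean price
wild_preds = [300.0, 50.0, 400.0, 20.0]      # errors larger than the spread of the data

print(r_squared(prices, close_preds))
print(r_squared(prices, mean_preds))
print(r_squared(prices, wild_preds))
```

The last call returns a negative value because the residual errors exceed the variance of the prices themselves.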

Feature selection using correlation threshold of 10%

R-square performance score model 1

Only 17% of the variance in the price can be explained by this model.
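
One plausible reading of a "correlation threshold of 10%" is to keep every feature whose absolute Pearson correlation with the price is at least 0.10. The sketch below follows that reading, which is an assumption on my part rather than a confirmed detail of the original code. The feature values are made up, and `hypothetical_noise` and `constant` are invented columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def select_by_correlation(features, target, threshold=0.10):
    """Keep features whose absolute correlation with the target reaches the threshold."""
    return [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]

# Hypothetical toy data: three candidate features and the nightly price.
features = {
    "accommodates": [2, 4, 2, 6, 3],
    "hypothetical_noise": [4, 1, 3, 4, 3],  # constructed to be uncorrelated with price
    "constant": [1, 1, 1, 1, 1],
}
price = [100, 180, 110, 260, 140]

selected = select_by_correlation(features, price)
print(selected)
```

A filter like this only looks at each feature in isolation, so it can drop features that are useful in combination with others.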

The predictions are as follows:

All features

R-square performance score model 2

Running the model on all features gives a very bad performance score of -8.798, which means the model does worse than predicting the mean price for every listing. This is to be expected, since the model has over 200 features.

The predictions are as follows:

Predictions of the model

Subset of features (k = 4) depending on Information Gain

R-square performance score model 3

Running the model on just 4 features, selected by Information Gain, gives a poor R-squared score of 0.096. That is better than model 2, but only about half the score of model 1.

It seems Model 1 outperforms the other models by a significant amount. From Model 1 we will look at the coefficients to determine feature importance when it comes to predicting the price. See below.

Coefficients for model 1

The features that matter most in determining the price of a stay make sense: property_type_Room in aparthotel and room_type_Hotel room. In other words, a listing that is a room in an aparthotel or a hotel room adds a significant amount to the price.

Whether or not a listing is located in the neighbourhood of ‘Centrum-West’ is also a strong predictor for price.

Furthermore, the feature accommodates, which stands for the number of guests that the listing can accommodate, is a valuable indicator for the price.

Lastly, whether the listing is a private room in an apartment is also a good indicator for the price.

Conclusion

In this post, we took a look at Airbnb data from Amsterdam and were able to determine some interesting insights:

There are 2161 hosts who rent out multiple listings on Airbnb in Amsterdam. They make up around 13.17% of all hosts, so slightly more than one in ten Airbnb hosts in Amsterdam have multiple listings and are probably running a home rental business.

The top 10 hosts by listings together own 351 homes and make up around 1.85% of the Airbnb homes in Amsterdam.

Overall, our price prediction models were not very successful at predicting the price of a stay in Amsterdam. Our best model is model 1, which uses a correlation threshold. The features that matter most for the price of a stay in Amsterdam are whether the listing is a room in an aparthotel or a hotel room. Whether the listing is located in the neighbourhood of ‘Centrum-West’ is also a strong predictor for price. The number of guests that the listing can accommodate is a valuable indicator for the price as well, as is whether the listing is a private room in an apartment.

The low scores of the models could be due to the independent variables not being good predictors, or to the imputation and variable selection/transformation choices I made.

Perhaps the model’s performance could be improved by employing other feature selection algorithms, such as Wrapper methods (RFE, Backward Elimination…).

Another helpful step might be to impute missing values using KNN imputation instead of mode and mean imputation, which might lead to more accurate predictions in this context.
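
To show the idea behind KNN imputation, here is a small plain-Python sketch: a missing value is filled with the average of that column over the k most similar complete rows, instead of the column-wide mean. The rows, the column meanings, and the helper `knn_impute` are hypothetical; in practice a library implementation such as scikit-learn's `KNNImputer` would be the more usual choice.

```python
from math import sqrt

def knn_impute(rows, target_idx, k=2):
    """Fill None in column target_idx with the mean of that column over the
    k nearest complete rows. Distance uses the other columns, which are
    assumed to have no missing values in this simplified sketch."""
    complete = [r for r in rows if r[target_idx] is not None]

    def dist(a, b):
        return sqrt(sum((a[i] - b[i]) ** 2
                        for i in range(len(a)) if i != target_idx))

    filled = []
    for r in rows:
        if r[target_idx] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            value = sum(c[target_idx] for c in nearest) / len(nearest)
            r = r[:target_idx] + [value] + r[target_idx + 1:]  # copy, do not mutate input
        filled.append(r)
    return filled

# Hypothetical rows: [accommodates, bedrooms, bathrooms]; one bathrooms value is missing.
rows = [
    [2, 1, 1.0],
    [2, 1, 1.0],
    [6, 3, 2.0],
    [6, 3, 3.0],
    [2, 1, None],
]

filled = knn_impute(rows, target_idx=2, k=2)
known = [r[2] for r in rows if r[2] is not None]
mean_fill = sum(known) / len(known)  # what plain mean imputation would use
print(filled[4][2], mean_fill)
```

For the small listing in the last row, the two nearest neighbours are the other two-person listings, so the imputed value reflects similar listings rather than the overall average.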

Of course, it always makes sense to have more data: to combine Airbnb data from different cities, run different types of machine learning models on it, and see whether they perform better :)

I hope you enjoyed this read, and I am looking forward to posting some more interesting stories in the future!

Stay tuned for more!

#datascience #machinelearning #tableau #python #udacity #airbnb #amsterdam

Translated from: https://medium.com/@jovangligorevi/finding-out-who-the-biggest-airbnb-home-owners-are-in-amsterdam-47bfdf8294f7
