Short-Term Rental Dataset Analysis: Plotting a Listing Distribution Map with pyecharts and One-Way ANOVA

    Technology  2024-06-22

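    Contents

    Preface
    I. Plotting the listing distribution map
        1. Importing the basic modules
        2. Data cleaning
        3. Plotting the listing distribution map
    II. One-way ANOVA
        1. Effect of district on rental price for Entire home/apt
        2. Effect of district on rental price for Private room
        3. Effect of district on rental price for Shared room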


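    Preface

    Sharing raises the efficiency with which resources are used: it hands over the right to use idle resources while adding only a limited marginal cost. As information becomes more transparent, more and more sharing happens between strangers. Short-term rental is one model of sharing space, and whether or not you have ever stayed in a stranger's home, there is interesting information to dig out of short-term rental data. This article uses data published by the Airbnb platform to plot a map of listing locations and, for each of three room types, to analyze how the district factor level affects rental price. Dataset link: Tianchi competition dataset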

    I. Plotting the listing distribution map

    1. Importing the basic modules

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from pyecharts.charts import Geo
    from pyecharts import options as opts
    from pyecharts.globals import GeoType
    from pyecharts.charts import Map
    from scipy import stats
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    import warnings

    warnings.filterwarnings('ignore')  # suppress warnings raised while generating figures
    plt.rcParams['font.sans-serif'] = 'SimHei'  # use a font that can render Chinese labels
    plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign rendering correctly with that font
    %matplotlib inline

    2. Data cleaning

    data = pd.read_csv(r'D:\天池大赛\短租数据集\listings.csv')
    data.head()

       id      name                                                 host_id  host_name
    0  44054   Modern and Comfortable Living in CBD                 192875   East Apartments
    1  100213  The Great Wall Box Deluxe Suite A团园长城小院东院套房  527062   Joe
    2  128496  Heart of Beijing: House with View 2                  467520   Cindy
    3  161902  cozy studio in center of Beijing                     707535   Robert
    4  162144  nice studio near subway, sleep 4                     707535   Robert

       neighbourhood_group  neighbourhood      latitude  longitude  room_type
    0  NaN                  朝阳区 / Chaoyang  39.89503  116.45163  Entire home/apt
    1  NaN                  密云县 / Miyun     40.68434  117.17231  Private room
    2  NaN                  东城区             39.93213  116.42200  Entire home/apt
    3  NaN                  东城区             39.93357  116.43577  Entire home/apt
    4  NaN                  朝阳区 / Chaoyang  39.93668  116.43798  Entire home/apt

       price  minimum_nights  number_of_reviews  last_review
    0  792    1               89                 2019-03-04
    1  1201   1               2                  2017-10-08
    2  389    3               259                2019-02-05
    3  376    1               26                 2016-12-03
    4  537    1               37                 2018-08-01

       reviews_per_month  calculated_host_listings_count  availability_365
    0  0.85               9                               341
    1  0.10               4                               0
    2  2.70               1                               93
    3  0.28               5                               290
    4  0.40               5                               352

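    The data has 16 columns, covering the listing id, listing name, host id, host name, Beijing administrative district, latitude, longitude, room type, price, minimum number of nights, date of the last review, and so on. Next, take a rough look at the dataset as a whole.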

    data.info()  # start with an overall view of the dataset

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 28452 entries, 0 to 28451
    Data columns (total 16 columns):
    id                                28452 non-null int64
    name                              28451 non-null object
    host_id                           28452 non-null int64
    host_name                         28452 non-null object
    neighbourhood_group               0 non-null float64
    neighbourhood                     28452 non-null object
    latitude                          28452 non-null float64
    longitude                         28452 non-null float64
    room_type                         28452 non-null object
    price                             28452 non-null int64
    minimum_nights                    28452 non-null int64
    number_of_reviews                 28452 non-null int64
    last_review                       17294 non-null object
    reviews_per_month                 17294 non-null float64
    calculated_host_listings_count    28452 non-null int64
    availability_365                  28452 non-null int64
    dtypes: float64(4), int64(7), object(5)
    memory usage: 3.5+ MB

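    The overview shows that the name column has one missing value, the neighbourhood_group column is entirely empty, and the last_review and reviews_per_month columns are partly empty (17294 non-null values out of 28452).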
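The null counts read off `data.info()` can also be tabulated directly. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset), assuming only pandas and numpy:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the kinds of gaps described above (values are made up).
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "neighbourhood_group": [np.nan] * 4,
    "last_review": ["2019-03-04", None, None, "2018-08-01"],
    "price": [792, 1201, 389, 376],
})

# Number of missing values in each column.
missing = toy.isna().sum()

# Columns with no values at all are candidates for dropping outright.
all_null_cols = missing[missing == len(toy)].index.tolist()

print(missing)
print(all_null_cols)
```

On the real data, the same two lines applied to `data` would single out neighbourhood_group as the only fully empty column, in line with the `info()` output.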

    data.describe().T  # transpose so the summary is easier to read

                                    count    mean          std           min           25%           50%           75%           max
    id                              28452.0  2.628583e+07  6.403312e+06  44054.00000   2.245616e+07  2.787765e+07  3.134482e+07  3.395441e+07
    host_id                         28452.0  1.442821e+08  7.057051e+07  192875.00000  8.708958e+07  1.525464e+08  2.061464e+08  2.563498e+08
    neighbourhood_group             0.0      NaN           NaN           NaN           NaN           NaN           NaN           NaN
    latitude                        28452.0  3.998323e+01  1.869841e-01  39.45581      3.989733e+01  3.993090e+01  3.999047e+01  4.094966e+01
    longitude                       28452.0  1.164420e+02  2.047957e-01  115.47339     1.163553e+02  1.164347e+02  1.164911e+02  1.174953e+02
    price                           28452.0  6.112033e+02  1.623535e+03  0.00000       2.350000e+02  3.890000e+02  5.770000e+02  6.898300e+04
    minimum_nights                  28452.0  2.729685e+00  1.792093e+01  1.00000       1.000000e+00  1.000000e+00  1.000000e+00  1.125000e+03
    number_of_reviews               28452.0  7.103156e+00  1.681507e+01  0.00000       0.000000e+00  1.000000e+00  6.000000e+00  3.220000e+02
    reviews_per_month               17294.0  1.319757e+00  1.581243e+00  0.01000       2.900000e-01  8.000000e-01  1.750000e+00  2.000000e+01
    calculated_host_listings_count  28452.0  1.281829e+01  2.926132e+01  1.00000       2.000000e+00  5.000000e+00  1.100000e+01  2.220000e+02
    availability_365                28452.0  2.203421e+02  1.384307e+02  0.00000       8.700000e+01  2.090000e+02  3.610000e+02  3.650000e+02

    Drop the neighbourhood_group column

    data = data.drop('neighbourhood_group',axis = 1)

    Drop the row whose name is missing

    data = data.dropna(axis = 0,subset = ['name'])

    Normalize the neighbourhood column so that it contains only the Chinese name

    def neighbourhood_str(data):
        neighbourhoods = []
        # \w+ matches runs of word characters; the first run is the Chinese district name
        matches = data["neighbourhood"].str.findall(r"\w+").tolist()
        for i in matches:
            neighbourhoods.append(i[0])
        return neighbourhoods

    data["neighbourhood"] = neighbourhood_str(data)
    data.head()

       id      name                                                 host_id  host_name
    0  44054   Modern and Comfortable Living in CBD                 192875   East Apartments
    1  100213  The Great Wall Box Deluxe Suite A团园长城小院东院套房  527062   Joe
    2  128496  Heart of Beijing: House with View 2                  467520   Cindy
    3  161902  cozy studio in center of Beijing                     707535   Robert
    4  162144  nice studio near subway, sleep 4                     707535   Robert

       neighbourhood  latitude  longitude  room_type
    0  朝阳区         39.89503  116.45163  Entire home/apt
    1  密云县         40.68434  117.17231  Private room
    2  东城区         39.93213  116.42200  Entire home/apt
    3  东城区         39.93357  116.43577  Entire home/apt
    4  朝阳区         39.93668  116.43798  Entire home/apt

       price  minimum_nights  number_of_reviews  last_review
    0  792    1               89                 2019-03-04
    1  1201   1               2                  2017-10-08
    2  389    3               259                2019-02-05
    3  376    1               26                 2016-12-03
    4  537    1               37                 2018-08-01

       reviews_per_month  calculated_host_listings_count  availability_365
    0  0.85               9                               341
    1  0.10               4                               0
    2  2.70               1                               93
    3  0.28               5                               290
    4  0.40               5                               352
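To see why keeping the first `\w+` match leaves only the Chinese name, here is a small sketch on three sample strings in the same shape as the raw column. It also shows `str.extract` as a vectorized alternative; that variant is a suggestion, not the code used in this article.

```python
import pandas as pd

s = pd.Series(["朝阳区 / Chaoyang", "密云县 / Miyun", "东城区"])

# In Python 3, \w matches Unicode word characters, so the Chinese name is the
# first match and the pinyin after " / " (when present) is a later match.
first_match = s.str.findall(r"\w+").str[0]

# Alternative: capture the first run of word characters in one vectorized call.
extracted = s.str.extract(r"(\w+)", expand=False)

print(first_match.tolist())
print(extracted.tolist())
```

Both forms give the same result on these inputs; the second avoids the explicit Python loop.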

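    3. Plotting the listing distribution map

    What interests us now is how these more than 20,000 listings are distributed across Beijing's 16 administrative districts.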

    data.neighbourhood.value_counts()

    朝阳区     10810
    东城区      3346
    海淀区      3197
    丰台区      1758
    西城区      1701
    通州区      1290
    昌平区      1034
    密云县       935
    顺义区       920
    怀柔区       833
    大兴区       823
    延庆县       718
    房山区       578
    石景山区      213
    门头沟区      152
    平谷区       143
    Name: neighbourhood, dtype: int64

    data.neighbourhood.hist(bins = 30,figsize = (20,8))

    def test_geo():
        city = '北京'
        g = Geo()
        g.add_schema(maptype=city,
                     itemstyle_opts=opts.ItemStyleOpts(color="#D9D9D9", border_color="#111"))
        # register each listing's coordinates under its id: add_coordinate(name, lng, lat)
        for listing_id, lng, lat in zip(data['id'], data['longitude'], data['latitude']):
            g.add_coordinate(str(listing_id), lng, lat)
        # one (name, value) pair per listing; ids are assumed unique, so every value is 1
        pairs = [(str(listing_id), 1) for listing_id in data['id']]
        g.add('', pairs, type_='scatter', symbol_size=3, color='#68228B')
        # hide the per-point labels
        g.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        # set the title
        g.set_global_opts(title_opts=opts.TitleOpts(title="{}-房源分布".format(city)))
        return g

    g = test_geo()
    g.render_notebook()

    The result is shown in the figure below. The listing distribution map is complete!

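    II. One-way ANOVA

    Next comes the analysis of variance. The original plan was to analyze how the room type factor affects price. After looking it up, though, I learned that Entire home/apt means the whole home is rented, Private room means a private room, and Shared room means a room shared with others, so prices are bound to differ significantly across these types. We therefore analyze the effect of district on price within each of the three types. In particular: does district have a significant effect on the price of shared rooms, a relatively new co-living model?

    1. Effect of district on rental price for Entire home/apt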

    e_data = data[data['room_type'] == 'Entire home/apt']
    e_data.head()

       id      name                                  host_id  host_name
    0  44054   Modern and Comfortable Living in CBD  192875   East Apartments
    2  128496  Heart of Beijing: House with View 2   467520   Cindy
    3  161902  cozy studio in center of Beijing      707535   Robert
    4  162144  nice studio near subway, sleep 4      707535   Robert
    5  279078  Nice Apartment in Beijing             1455726  Fiona

       neighbourhood  latitude  longitude  room_type
    0  朝阳区         39.89503  116.45163  Entire home/apt
    2  东城区         39.93213  116.42200  Entire home/apt
    3  东城区         39.93357  116.43577  Entire home/apt
    4  朝阳区         39.93668  116.43798  Entire home/apt
    5  东城区         39.93958  116.43485  Entire home/apt

       price  minimum_nights  number_of_reviews  last_review
    0  792    1               89                 2019-03-04
    2  389    3               259                2019-02-05
    3  376    1               26                 2016-12-03
    4  537    1               37                 2018-08-01
    5  403    1               29                 2018-11-02

       reviews_per_month  calculated_host_listings_count  availability_365
    0  0.85               9                               341
    2  2.70               1                               93
    3  0.28               5                               290
    4  0.40               5                               352
    5  0.33               7                               353

    e_data.price.describe()  # look at prices under Entire home/apt

    count    16955.000000
    mean       746.479151
    std       1705.645806
    min          0.000000
    25%        356.000000
    50%        470.000000
    75%        658.000000
    max      68983.000000
    Name: price, dtype: float64

    The minimum price is 0 yuan per night and the maximum is 68983 yuan per night. Both are clearly unreasonable, so such listings need to be removed.

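    e_data = e_data[e_data['price']>0]  # drop listings priced at 0
    import seaborn as sns
    # even with whiskers at 2x the interquartile range there are still many outliers; they need to be removed
    sns.boxplot(x=e_data.price,whis=2,orient='h')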

    def box_plot_outliers(data, data_ser, box_scale):
        # fence width: box_scale times the interquartile range
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        a = data[(data_ser > val_low) & (data_ser < val_up)]  # drop the outliers
        # ANOVA only needs the district factor level and the price, so drop the other columns;
        # copy so that later column assignments do not act on a view of the original frame
        b = a[['price', 'neighbourhood']].copy()
        return b

    e_data = box_plot_outliers(e_data,e_data.price,2)
    e_data

           price  neighbourhood
    0      792    朝阳区
    2      389    东城区
    3      376    东城区
    4      537    朝阳区
    5      403    东城区
    ...    ...    ...
    28444  832    延庆县
    28446  228    房山区
    28447  396    朝阳区
    28449  329    朝阳区
    28451  295    丰台区

    sns.boxplot(x=e_data.price,whis = 2)
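The fence rule used by `box_plot_outliers` can be checked by hand on a tiny series. The sketch below uses made-up prices (not from the dataset) and the same scale factor of 2:

```python
import pandas as pd

# Made-up prices with one obvious outlier.
prices = pd.Series([100, 200, 300, 400, 500, 5000])

q1 = prices.quantile(0.25)   # first quartile
q3 = prices.quantile(0.75)   # third quartile
box_scale = 2
margin = box_scale * (q3 - q1)

# Values outside (q1 - margin, q3 + margin) are treated as outliers.
low, high = q1 - margin, q3 + margin
kept = prices[(prices > low) & (prices < high)]

print(q1, q3, low, high)
print(kept.tolist())
```

With pandas' default linear interpolation the quartiles are 225 and 475, so the fences are -275 and 975 and only the 5000 entry is dropped. A larger `box_scale` widens the fences and keeps more of the tail.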

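    e_data.price.describe()

    count    15299.000000
    mean       481.556115
    std        201.162288
    min         54.000000
    25%        342.000000
    50%        443.000000
    75%        584.000000
    max       1248.000000
    Name: price, dtype: float64

    # encode each district as an integer
    neighbourhood_to_list = {'朝阳区':1, '海淀区':2, '东城区':3, '丰台区':4,
                             '西城区':5, '通州区':6, '昌平区':7, '大兴区':8,
                             '顺义区':9, '石景山区':10, '房山区':11, '密云县':12,
                             '门头沟区':13, '平谷区':14, '怀柔区':15, '延庆县':16}
    e_data['neighbourhood'] = e_data['neighbourhood'].map(neighbourhood_to_list)
    e_data

           price  neighbourhood
    0      792    1
    2      389    3
    3      376    3
    4      537    1
    5      403    3
    ...    ...    ...
    28444  832    16
    28446  228    11
    28447  396    1
    28449  329    1
    28451  295    4

    # neighbourhood is an integer code here, so the formula treats it as a numeric
    # regressor (df = 1); writing C(neighbourhood) would treat it as a categorical factor
    model = ols('price ~ neighbourhood',e_data).fit()
    anovat = anova_lm(model)
    print(anovat)

                   df       sum_sq        mean_sq        F         PR(>F)
    neighbourhood  1.0      2.967648e+05  296764.791065  7.336672  0.006764
    Residual       15297.0  6.187562e+08  40449.511263   NaN       NaN

    For entire homes, the relationship between price and district is statistically significant (p ≈ 0.0068, below 0.01). Note that district enters this model as an integer code with a single degree of freedom, so the test concerns a linear trend across the arbitrary codes rather than differences among the 16 district means.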
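The difference between a numeric code and a categorical factor shows up in the degrees of freedom. Below is a minimal sketch on synthetic data (three hypothetical districts coded 1 to 3, with made-up means and noise), using the same `ols` and `anova_lm` calls as above. Wrapping the column in `C(...)` is the standard formula syntax for a categorical term; it is shown as an alternative, not as the code that produced the table above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Synthetic data: three districts coded 1, 2, 3 with different mean prices.
codes = np.repeat([1, 2, 3], 50)
means = {1: 400.0, 2: 500.0, 3: 450.0}
price = np.array([means[c] for c in codes]) + rng.normal(0, 50, size=codes.size)
toy = pd.DataFrame({"price": price, "neighbourhood": codes})

# Numeric coding: a single slope is fitted, so the term has 1 degree of freedom.
numeric_table = anova_lm(ols("price ~ neighbourhood", toy).fit())

# Categorical coding: one mean per district, so df = number of districts - 1.
categorical_table = anova_lm(ols("price ~ C(neighbourhood)", toy).fit())

print(numeric_table)
print(categorical_table)
```

Only the second table is a one-way ANOVA in the usual sense, since its F test compares the group means directly.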

    2. Effect of district on rental price for Private room

    p_data = data[data['room_type'] == 'Private room']
    p_data.price.describe()

    count     9838.000000
    mean       430.681236
    std       1203.643527
    min          0.000000
    25%        181.000000
    50%        248.000000
    75%        389.000000
    max      66667.000000
    Name: price, dtype: float64

    p_data = p_data[p_data['price']>0]
    p_data = box_plot_outliers(p_data,p_data.price,2)
    sns.boxplot(x=p_data.price,whis = 2)

    p_data['neighbourhood'] = p_data['neighbourhood'].map(neighbourhood_to_list)
    # as above, the integer-coded neighbourhood is treated as numeric (df = 1)
    model = ols('price ~ neighbourhood',p_data).fit()
    anovat = anova_lm(model)
    print(anovat)

                   df      sum_sq        mean_sq       F           PR(>F)
    neighbourhood  1.0     1.294708e+07  1.294708e+07  593.986554  4.194081e-127
    Residual       9002.0  1.962160e+08  2.179693e+04  NaN         NaN

    3. Effect of district on rental price for Shared room

    s_data = data[data['room_type'] == 'Shared room']
    s_data.price.describe()

    count     1658.000000
    mean       293.343185
    std       2521.130124
    min         27.000000
    25%         87.000000
    50%        107.000000
    75%        148.000000
    max      67909.000000
    Name: price, dtype: float64

    s_data = box_plot_outliers(s_data,s_data.price,2)
    sns.boxplot(x=s_data.price,whis = 2)

    # s_data['neighbourhood'] still holds the district names (strings), so the formula
    # treats it as a categorical factor: 14 districts appear, giving df = 13
    model = ols('price ~ neighbourhood',s_data).fit()
    anovat = anova_lm(model)
    print(anovat)

                   df      sum_sq        mean_sq      F         PR(>F)
    neighbourhood  13.0    7.287987e+04  5606.144166  2.838746  0.00048
    Residual       1503.0  2.968225e+06  1974.866697  NaN       NaN

    The one-way ANOVA is complete.
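The three subsections repeat the same steps for each room type. As a hedged sketch, the loop below runs a one-way ANOVA of price by district within each room type on synthetic data. The helper name `anova_by_room_type` and the generated values are hypothetical, and `scipy.stats.f_oneway` stands in for the `ols`/`anova_lm` pair used in the article.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 300

# Synthetic listings: random room type, random district, made-up prices.
toy = pd.DataFrame({
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
    "neighbourhood": rng.choice(["朝阳区", "东城区", "海淀区"], size=n),
    "price": rng.normal(400, 80, size=n).round(),
})

def anova_by_room_type(df):
    """Return {room_type: (F, p)} for price grouped by neighbourhood."""
    results = {}
    for room_type, sub in df.groupby("room_type"):
        # One array of prices per district within this room type.
        groups = [g["price"].to_numpy() for _, g in sub.groupby("neighbourhood")]
        f_stat, p_value = stats.f_oneway(*groups)
        results[room_type] = (f_stat, p_value)
    return results

results = anova_by_room_type(toy)
for room_type, (f_stat, p_value) in results.items():
    print(room_type, round(f_stat, 3), round(p_value, 4))
```

Because the prices here are drawn from one distribution regardless of district, large p-values are the expected outcome. On the real data, the outlier filtering from the earlier steps would be applied to each subset before the test.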
