Matplotlib Data

    Technology, 2022-08-02

    Introduction

    Data visualization is one of the fundamental skills in the data scientist's toolkit. Given the right data, the ability to tell compelling stories with it can unlock a goldmine of opportunities for an organisation to deliver value to the people it serves, and to make its own employees more efficient.

    In the past I have written some tips for effective data visualization, but that post did not use a single dataset to explore all of the ideas it shared. In this post we are going to get our hands dirty: we will apply those tips to some visualizations and dive deep into the MovieLens dataset.

    For full access to the code used in this post, visit my GitHub repository.

    The Data

    As mentioned earlier, we are going to use the MovieLens dataset. Specifically, we will use the MovieLens 100K movie ratings dataset, which consists of 100,000 ratings from roughly 1,000 users on roughly 1,700 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 through April 22nd, 1998. It has already been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the dataset.

    To perform our visualizations effectively, we are concerned with three specific files from the collection:

    u.data: The full dataset, 100,000 ratings by 943 users on 1,682 items.

    u.item: Information about the items (movies).

    u.user: Demographic information about the users.

    In this project I use popular data science libraries: pandas for data manipulation, Matplotlib for data visualization and NumPy for working with arrays. Additionally, I use Python's datetime module for general calendar-related functions and IPython for interactive computing.

    We begin by importing the libraries and loading the data with pandas' read_csv (see the pandas documentation).

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from datetime import datetime
    from IPython.display import IFrame
    import warnings

    warnings.filterwarnings("ignore")

    # read data
    rating_df = pd.read_csv("../data/u.data", sep="\t",
                            names=["user_id", "item_id", "rating", "timestamp"])
    item_df = pd.read_csv("../data/u.item", sep="|", encoding="latin-1",
                          names=["movie_id", "movie_title", "release_date", "video_release_date",
                                 "imbd_url", "unknown", "action", "adventure", "animation",
                                 "childrens", "comedy", "crime", "documentary", "drama", "fantasy",
                                 "film_noir", "horror", "musical", "mystery", "romance", "sci-fi",
                                 "thriller", "war", "western"])
    user_df = pd.read_csv("../data/u.user", sep="|", encoding="latin-1",
                          names=["user_id", "age", "gender", "occupation", "zip_code"])

    We have read the three files we were provided (u.data, u.item and u.user) into pandas DataFrames and stored them in the following variables:

    rating_df: The full u.data set, holding all the ratings given by users.

    item_df: Information about the items (movies).

    user_df: Demographic information about the users.
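    To see what the sep and names arguments of read_csv do without having the data files on disk, here is a minimal sketch that parses a few made-up rows laid out like u.user (pipe-delimited, no header row). The rows and the sample_user_df name are illustrative assumptions, not values from the dataset.

```python
import io

import pandas as pd

# Made-up rows in the pipe-delimited, header-less layout described for u.user.
sample = io.StringIO(
    "101|30|F|engineer|00501\n"
    "102|45|M|writer|94043\n"
    "103|19|F|student|10001\n"
)

user_cols = ["user_id", "age", "gender", "occupation", "zip_code"]

# names= supplies the missing header; reading zip_code as str keeps leading zeros.
sample_user_df = pd.read_csv(sample, sep="|", names=user_cols, dtype={"zip_code": str})

print(sample_user_df)
```

    Without names=, pandas would treat the first data row as the header, so passing the column names explicitly matters for header-less files like these.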

    Cross-Checking the Data

    A general rule of thumb I stand by is to always check that what I have been given is exactly what I was told I would be given. pandas makes this easy with the df.info() and df.head() (or df.tail()) methods, which give us more information about the DataFrame and let us preview the data.

    To start, I look at rating_df, in which we expect to find 100,000 ratings by 943 users on 1,682 items.

    # peek at rating_df
    print(rating_df.info())
    rating_df.head()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 4 columns):
     #   Column     Non-Null Count   Dtype
    ---  ------     --------------   -----
     0   user_id    100000 non-null  int64
     1   item_id    100000 non-null  int64
     2   rating     100000 non-null  int64
     3   timestamp  100000 non-null  int64
    dtypes: int64(4)
    memory usage: 3.1 MB
    None

    We can see we have 100000 ratings, but we want to ensure there are 943 users and 1682 items.

    # checking unique users
    print(f"# of Unique Users: {rating_df['user_id'].nunique()}")
    # checking number of items
    print(f"# of items: {rating_df['item_id'].nunique()}")

    # of Unique Users: 943
    # of items: 1682
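    The same cross-check can be wrapped in a small helper so that it can be rerun whenever the data is reloaded. This is only a sketch: check_counts and the toy DataFrame are hypothetical names and values introduced for illustration.

```python
import pandas as pd


def check_counts(df, expected):
    """Return {column: (actual, expected)} for each column whose unique count differs."""
    mismatches = {}
    for column, want in expected.items():
        got = df[column].nunique()
        if got != want:
            mismatches[column] = (got, want)
    return mismatches


# Toy stand-in for rating_df: 3 distinct users and 3 distinct items.
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12], "rating": [5, 3, 4, 2]}
)

problems = check_counts(toy_ratings, {"user_id": 3, "item_id": 3})
print(problems or "all counts match")
```

    On the real data the call would look like check_counts(rating_df, {"user_id": 943, "item_id": 1682}), where an empty dictionary means the documentation and the data agree.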

    Good. We can confirm that rating_df contains exactly what it is supposed to. On further inspection, however, I noticed that the timestamp column is currently stored as an int64 data type. The README states that the timestamps are Unix seconds since 1/1/1970 UTC. Hence, we use datetime (a Python built-in module) to convert the timestamp column to the datetime64 dtype.

    # convert timestamp column to datetime
    # (the README gives the values in Unix seconds, so they are passed through unscaled;
    #  dividing by 1e3 would place every rating in January 1970)
    rating_df["timestamp"] = rating_df.timestamp.apply(lambda x: datetime.fromtimestamp(x))
    # check if change has been applied
    print(rating_df.info())
    rating_df.head()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 4 columns):
     #   Column     Non-Null Count   Dtype
    ---  ------     --------------   -----
     0   user_id    100000 non-null  int64
     1   item_id    100000 non-null  int64
     2   rating     100000 non-null  int64
     3   timestamp  100000 non-null  datetime64[ns]
    dtypes: datetime64[ns](1), int64(3)
    memory usage: 3.1 MB
    None
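    A vectorised alternative to the row-wise apply is pd.to_datetime with unit="s". The sketch below uses two illustrative Unix-second values that fall inside the collection window quoted from the README (they are assumptions for the example, not rows I am quoting from the file), and also shows what happens if seconds are mistakenly scaled as if they were milliseconds.

```python
import pandas as pd

# Two illustrative Unix-second values inside the September 1997 to April 1998 window.
raw = pd.Series([874724710, 893286638], name="timestamp")

# Interpret the integers as seconds since 1970-01-01 UTC.
converted = pd.to_datetime(raw, unit="s")

# Dividing by 1e3 first treats the seconds as milliseconds and collapses the dates.
scaled_wrongly = pd.to_datetime(raw / 1e3, unit="s")

print(converted.dt.year.tolist())
print(scaled_wrongly.dt.year.tolist())
```

    A quick look at the resulting years is a cheap sanity check: ratings from a site that collected data in 1997 and 1998 should not land in 1970.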

    The printout now shows that the timestamp column has the data type datetime64[ns].

    Now that I am much more comfortable with rating_df, I can move on to exploring item_df, which we expect to give us more information about the movies.

    # peek at item_df
    print(item_df.info())
    item_df.head()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1682 entries, 0 to 1681
    Data columns (total 24 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   movie_id            1682 non-null   int64
     1   movie_title         1682 non-null   object
     2   release_date        1681 non-null   object
     3   video_release_date  0 non-null      float64
     4   imbd_url            1679 non-null   object
     5   unknown             1682 non-null   int64
     6   action              1682 non-null   int64
     7   adventure           1682 non-null   int64
     8   animation           1682 non-null   int64
     9   childrens           1682 non-null   int64
     10  comedy              1682 non-null   int64
     11  crime               1682 non-null   int64
     12  documentary         1682 non-null   int64
     13  drama               1682 non-null   int64
     14  fantasy             1682 non-null   int64
     15  film_noir           1682 non-null   int64
     16  horror              1682 non-null   int64
     17  musical             1682 non-null   int64
     18  mystery             1682 non-null   int64
     19  romance             1682 non-null   int64
     20  sci-fi              1682 non-null   int64
     21  thriller            1682 non-null   int64
     22  war                 1682 non-null   int64
     23  western             1682 non-null   int64
    dtypes: float64(1), int64(20), object(3)
    memory usage: 315.5+ KB
    None

    We already know from rating_df that there are 1,682 unique items in our data, so seeing 1,682 non-null entries in the movie_id and movie_title columns put me at ease straight away. However, video_release_date is completely empty, meaning it tells us nothing about the movies, so we can remove this column.

    I noticed that release_date and imbd_url are also missing some values, but not enough to justify deleting the columns. If worst comes to worst, we can fill in these values manually by visiting the IMDb website and using the movie title to find the URL and release date.

    Another of my so-called "rituals" is to think about what data type to expect when I read in data. I expected release_date to be of the datetime64 data type, but on inspection it was of data type object, so I followed the necessary processing steps to convert it to a datetime.

    # drop empty column
    item_df.drop("video_release_date", axis=1, inplace=True)
    # convert non-null values to datetime in release_date
    item_df["release_date"] = item_df[item_df.release_date.notna()]["release_date"].apply(lambda x: datetime.strptime(x, "%d-%b-%Y"))
    # check if change is applied
    print(item_df.info(), item_df.shape)
    item_df.head()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1682 entries, 0 to 1681
    Data columns (total 23 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   movie_id      1682 non-null   int64
     1   movie_title   1682 non-null   object
     2   release_date  1681 non-null   datetime64[ns]
     3   imbd_url      1679 non-null   object
     4   unknown       1682 non-null   int64
     5   action        1682 non-null   int64
     6   adventure     1682 non-null   int64
     7   animation     1682 non-null   int64
     8   childrens     1682 non-null   int64
     9   comedy        1682 non-null   int64
     10  crime         1682 non-null   int64
     11  documentary   1682 non-null   int64
     12  drama         1682 non-null   int64
     13  fantasy       1682 non-null   int64
     14  film_noir     1682 non-null   int64
     15  horror        1682 non-null   int64
     16  musical       1682 non-null   int64
     17  mystery       1682 non-null   int64
     18  romance       1682 non-null   int64
     19  sci-fi        1682 non-null   int64
     20  thriller      1682 non-null   int64
     21  war           1682 non-null   int64
     22  western       1682 non-null   int64
    dtypes: datetime64[ns](1), int64(20), object(2)
    memory usage: 302.4+ KB
    None (1682, 23)
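    As an alternative sketch for the release_date step, pd.to_datetime accepts the same format string and turns missing entries into NaT on its own, so no notna() filter is needed. The three sample values below are made up for illustration.

```python
import pandas as pd

# Made-up values in the "%d-%b-%Y" layout, with one missing entry.
release = pd.Series(["01-Jan-1995", None, "22-Apr-1998"], name="release_date")

# Missing values become NaT; the rest are parsed with the explicit format.
parsed = pd.to_datetime(release, format="%d-%b-%Y")

print(parsed)
```

    Note that %b matches abbreviated month names in the current locale, so this assumes an English locale, as does the strptime call above.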

    After our processing steps we can see that we no longer have the video_release_date column and release_date is now displayed as a datetime64 data type.

    Since we were provided with some URLs, I thought it would be interesting to take advantage of this and view some of the URLs in imbd_url using IFrame from the IPython library.

    Note: The URLs in the imbd_url column may have moved permanently to a new address or may be down at the time of running. Also, I could not connect to the IMDb webpage when I manually entered the URL for a movie (for example, entering the Copycat (1995) URL in IFrame returned a message that the site refused to connect). I have not found a workaround for this yet, but will update the notebook once I have. In the meantime, I have simply used the IMDb homepage URL to give an idea of how it would work: essentially, we have full access to the webpage from our notebook.

    # viewing random imbd_urls
    IFrame("https://www.imdb.com", width=800, height=400)

    Last but not least, we have user_df. As you may remember, this holds the demographic information about the users, so I expect it to have 943 rows (and, in particular, 943 entries in the user_id column), since we have already confirmed that there are 943 unique users in the data.

    # peek at user data
    print(user_df.info())
    user_df.head()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 943 entries, 0 to 942
    Data columns (total 5 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   user_id     943 non-null    int64
     1   age         943 non-null    int64
     2   gender      943 non-null    object
     3   occupation  943 non-null    object
     4   zip_code    943 non-null    object
    dtypes: int64(2), object(3)
    memory usage: 37.0+ KB
    None

    Great, we can confirm that we have 943 users. In summary, we have 100K ratings from 943 users on 1,682 movies. To make data visualization simpler at various points in the notebook, I decided to combine the DataFrames we have; I discuss how to do this in more detail in the PyTrix series post on combining data.

    # store full dataframe
    full_df = pd.merge(user_df, rating_df, how="left", on="user_id")
    full_df = pd.merge(full_df, item_df, how="left", right_on="movie_id", left_on="item_id")
    full_df.head()
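    When combining tables like this, it can be worth guarding against accidental row duplication. The sketch below repeats the two merges on tiny made-up frames and adds the validate argument of DataFrame.merge, which raises an error if the key relationship is not what we expect; the toy frames and titles are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up stand-ins for user_df, rating_df and item_df.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [10, 11, 10], "rating": [4, 5, 3]}
)
items = pd.DataFrame(
    {"movie_id": [10, 11], "movie_title": ["Movie A (1990)", "Movie B (1995)"]}
)

# Each user has many ratings; each rating points at exactly one movie.
merged = users.merge(ratings, how="left", on="user_id", validate="one_to_many")
merged = merged.merge(
    items, how="left", left_on="item_id", right_on="movie_id", validate="many_to_one"
)

# One row per rating should survive both merges.
assert len(merged) == len(ratings)
print(merged.shape)
```

    If a movie id appeared twice in the items table, the second merge would raise a MergeError instead of silently doubling the ratings for that movie.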

    Fabulous! We have successfully confirmed that the expected data is exactly what we have. This is sufficient information for us to dive deeper into our data and get a better understanding.

    Asking Questions and Answering with Data

    Following the protocols of Effective Data Visualization, my next step is to think of some questions that would give me more insight into the data at hand, then identify the best method to visualise the answer to our question — the best method may be defined as the most simple and clear way to express the answer to our question.

    Note: In this section my thoughts usually jump around as I am querying the data. Consequently, I prefer to use Jupyter Notebooks when exploring data.

    What are the top 10 most rated movies?

    # return number of rows associated to each title
    top_ten_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[:10]
    # plot the counts
    plt.figure(figsize=(12, 5))
    plt.barh(y=top_ten_movies.index, width=top_ten_movies.values)
    plt.title("10 Most Rated Movies in the Data", fontsize=16)
    plt.ylabel("Movie", fontsize=14)
    plt.xlabel("Count", fontsize=14)
    plt.show()

    In our dataset, Star Wars (1977) was the most rated film. This information is valuable because we may decide to recommend the most rated movies in our dataset to new users in order to overcome the cold start problem. From this question we can go further into our data and begin to think about which genres are associated with the most rated films; in this case, we only look at the genres associated with Star Wars.

    genres = ["unknown", "action", "adventure", "animation", "childrens", "comedy", "crime",
              "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery",
              "romance", "sci-fi", "thriller", "war", "western"]
    full_df[full_df.movie_title == "Star Wars (1977)"][genres].iloc[0].sort_values(ascending=False)

    action         1
    sci-fi         1
    romance        1
    adventure      1
    war            1
    western        0
    documentary    0
    animation      0
    childrens      0
    comedy         0
    crime          0
    fantasy        0
    drama          0
    film_noir      0
    horror         0
    musical        0
    mystery        0
    thriller       0
    unknown        0
    Name: 204, dtype: int64

    I am not a major Star Wars fan, although I have watched many of the films; I mention this only to confirm that associating genres such as action, sci-fi, adventure, war and romance with this movie sounds about right.
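    Returning to the cold start idea mentioned above, a popularity baseline can be sketched in a few lines: recommend the most rated items that a user has not rated yet. This is an illustrative assumption about how such a baseline might look, not a recommender built in this post; most_rated_unseen and the toy ratings are hypothetical.

```python
import pandas as pd


def most_rated_unseen(ratings, user_id, n=3):
    """Return up to n item ids, most rated first, that user_id has not rated."""
    seen = set(ratings.loc[ratings["user_id"] == user_id, "item_id"])
    counts = ratings["item_id"].value_counts()  # sorted from most to least rated
    return [int(item) for item in counts.index if item not in seen][:n]


# Toy ratings: item 10 has three ratings, item 11 has two, item 12 has one.
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2, 2, 2, 3], "item_id": [10, 11, 10, 11, 12, 10]}
)

print(most_rated_unseen(toy_ratings, user_id=99, n=2))  # brand-new user
print(most_rated_unseen(toy_ratings, user_id=1, n=2))   # user who has rated 10 and 11
```

    A brand-new user simply gets the overall most rated items, which is why this kind of baseline is often used before any personal history exists.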

    This question completely ignored the least rated movies, but if we were building a recommender system we could not ignore them, as there may be many reasons why a movie has not received many ratings. Let's take a look at some of the least rated movies in the dataset.

    # the least rated movies
    least_10_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[-10:]
    least_10_movies

    movie_title
    Coldblooded (1995)                            1
    MURDER and murder (1996)                      1
    Big Bang Theory, The (1994)                   1
    Mad Dog Time (1996)                           1
    Mamma Roma (1962)                             1
    Man from Down Under, The (1943)               1
    Marlene Dietrich: Shadow and Light (1996)     1
    Mat' i syn (1997)                             1
    Mille bolle blu (1993)                        1
    Á köldum klaka (Cold Fever) (1994)            1
    dtype: int64

    Big Bang Theory was a surprise appearance on this list for me. Other than that, I am unfamiliar with these movies, but I am not much of a movie person anyway, so that doesn't mean much.

    What are the maximum and minimum numbers of movies rated by one user?

    From the README provided with the dataset we were told that the minimum number of movies rated by a single user was 20, but we don’t know the maximum number of movies rated by a single user.

    movies_rated = rating_df.groupby("user_id").size().sort_values(ascending=False)
    print(f"Max movies rated by one user: {max(movies_rated)}\nMin movies rated by one user: {min(movies_rated)}")

    Max movies rated by one user: 737
    Min movies rated by one user: 20

    rating_df.user_id.value_counts().plot.box(figsize=(12, 5))
    plt.title("Number of Movies rated by a Single user", fontsize=16)
    plt.show()

    The maximum number of movies rated by a single user in the dataset is 737 (whoever that is, they are a very loyal movie watcher and rater), and the median number of movies rated by a user is 70. There are plenty of outliers who have rated more than about 320 movies, which is my rough reading of the upper whisker in the plot above.
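    Rather than reading the outlier threshold off the plot, it can be computed. The sketch below applies the 1.5 × IQR rule, which is Matplotlib's default for box plot whiskers, to a made-up Series of per-user rating counts; upper_fence and the toy numbers are assumptions for illustration.

```python
import pandas as pd


def upper_fence(values, k=1.5):
    """Upper outlier threshold: third quartile plus k times the interquartile range."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Made-up counts of movies rated per user, with one very active rater.
toy_counts = pd.Series([20, 30, 40, 50, 60, 70, 80, 90, 100, 700])

fence = upper_fence(toy_counts)
outliers = toy_counts[toy_counts > fence]

print(f"upper fence: {fence}")
print(f"outliers: {outliers.tolist()}")
```

    On the real data, upper_fence(rating_df.user_id.value_counts()) would give the threshold above which the box plot draws users as individual points.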

    How many movies were released per year?

    # create the year column from the movie title
    full_df["year"] = full_df["movie_title"].str.extract(r"\((\d{4})\)", expand=False)
    # full_df has one row per rating, so keep one row per title before counting movies per year
    year_counts = full_df[["movie_title", "year"]].drop_duplicates().groupby("year").size()
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(year_counts.index, year_counts.values)
    ax.xaxis.set_major_locator(plt.MaxNLocator(9))  # changes the number of xticks we see
    plt.title("Number of movies per Annum", fontsize=16)
    plt.xlabel("Year", fontsize=14)
    plt.ylabel("# of Movies Released", fontsize=14)
    plt.show()

    It's pretty hard to miss the massive spike and dip between 1988 and 1998. It's worth doing some research and asking a domain expert to determine what could have happened during this period.
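    One thing to keep in mind with a per-year count is that a merged table such as full_df holds one row per rating, so grouping it directly tallies ratings unless the titles are de-duplicated first. Here is a toy sketch of the difference, using made-up titles; the year is pulled out of the title with the same kind of regular expression.

```python
import pandas as pd

# Made-up ratings table: "Movie A (1995)" was rated twice, one title has no year.
toy = pd.DataFrame(
    {
        "movie_title": [
            "Movie A (1995)",
            "Movie A (1995)",
            "Movie B (1995)",
            "Movie C (1996)",
            "unknown",
        ],
        "rating": [4, 5, 3, 2, 1],
    }
)

# Raw string so the backslashes reach the regex engine untouched; titles
# without a four-digit year in parentheses get NaN and drop out of groupby.
toy["year"] = toy["movie_title"].str.extract(r"\((\d{4})\)", expand=False)

ratings_per_year = toy.groupby("year").size()
movies_per_year = toy.drop_duplicates("movie_title").groupby("year").size()

print(ratings_per_year.to_dict())
print(movies_per_year.to_dict())
```

    Both views are useful, but they answer different questions: how much rating activity films from a given year attracted versus how many films from that year are in the catalogue.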

    How many Men/Women rated movies?

    # count the number of male and female raters
    gender_counts = user_df.gender.value_counts()
    # plot the counts
    plt.figure(figsize=(12, 5))
    plt.bar(x=gender_counts.index[0], height=gender_counts.values[0], color="blue")
    plt.bar(x=gender_counts.index[1], height=gender_counts.values[1], color="orange")
    plt.title("Number of Male and Female Participants", fontsize=16)
    plt.xlabel("Gender", fontsize=14)
    plt.ylabel("Counts", fontsize=14)
    plt.show()

    There are clearly a lot more males in this sample than females and this may have a major influence on the genres of movies watched.
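    As a side note on the plotting code, the two bars can also be drawn with a single bar call by passing a list of colours, and the counts can be printed on the bars with bar_label (available in Matplotlib 3.4 and later). The sketch below uses a made-up gender column and a non-interactive backend so that it runs without a display; in a notebook the matplotlib.use line would be left out.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch only; omit in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up gender column: three "M" and two "F".
toy_gender = pd.Series(["M", "M", "F", "M", "F"], name="gender")
gender_counts = toy_gender.value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
# One call draws both bars; the colour list is applied bar by bar.
bars = ax.bar(list(gender_counts.index), gender_counts.values, color=["blue", "orange"])
ax.bar_label(bars)  # write each count above its bar
ax.set_title("Number of Male and Female Participants")
ax.set_xlabel("Gender")
ax.set_ylabel("Counts")
```

    Labelling the bars with their counts makes the size of the imbalance readable without estimating it from the axis.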

    What are the most popular Movie Genres among Males and Females?

    full_df[genres + ["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12, 5), color=["orange", "blue"])
    plt.xlabel("Counts", fontsize=14)
    plt.ylabel("Genre", fontsize=14)
    plt.title("Popular Genres Among Genders", fontsize=16)
    plt.show()

    To my surprise, it turns out that males and females appreciate similar genres. The most popular genre for both genders was drama, followed by comedy. Of course, there are more males than females in this dataset, and we must take this into account when we think about building our recommendation system.
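    One way to account for the gender imbalance is to compare shares instead of raw counts. Because the genre columns are 0/1 flags, the mean of a flag within a group is the fraction of that group's ratings carrying the genre. The sketch below shows this on a made-up table; the column values are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings table: four ratings by males, two by females, 0/1 genre flags.
toy = pd.DataFrame(
    {
        "gender": ["M", "M", "M", "M", "F", "F"],
        "drama": [1, 1, 0, 0, 1, 0],
        "comedy": [0, 1, 1, 0, 1, 1],
    }
)
genre_cols = ["drama", "comedy"]

raw_counts = toy.groupby("gender")[genre_cols].sum()  # biased by group size
share = toy.groupby("gender")[genre_cols].mean()      # fraction of each group's ratings

print(raw_counts)
print(share)
```

    In this toy table the raw counts suggest males and females rate comedy equally often, while the shares show every female rating is a comedy against only half of the male ratings.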

    It would be useful to know whether interests change when we put a constraint on the ages of the raters.

    What are the most popular Movie Genres among Children by gender?

    Note: Using UK standards, an adult can be defined as someone that is >= 18 years old, hence a Child would be < 18.

    full_df[full_df["age"] < 18][genres + ["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12, 5), color=["orange", "blue"])
    plt.xlabel("Counts", fontsize=14)
    plt.ylabel("Genre", fontsize=14)
    plt.title("Popular Genres Among Children by Gender", fontsize=16)
    plt.show()

    Drama is still quite popular among males under 18, but comedy and action films were more popular with this group. For females under 18, on the other hand, little changed: it's still drama and comedy.

    These figures were interesting, but I was wondering whether the popularity of drama and comedy amongst both genders was due to those types of movies being generally regarded as the best type of movies (hence they get the most views and ratings) or whether it is because those tags are associated with the most movies.

    What Genre is associated with the most Movies?

    Note: Multiple genres can be associated with a movie (e.g. a movie can be animation, children's and comedy).

    # get the genre names in the dataframe and their counts
    label = item_df.loc[:, "unknown":].sum().index
    label_counts = item_df.loc[:, "unknown":].sum().values
    # plot a bar chart
    plt.figure(figsize=(12, 5))
    plt.barh(y=label, width=label_counts)
    plt.title("Genre Popularity", fontsize=16)
    plt.ylabel("Genres", fontsize=14)
    plt.xlabel("Counts", fontsize=14)
    plt.show()

    Just as I thought, the drama and comedy tags are associated with the most films in the sample. Maybe filmmakers are aware of our need for a laugh and some drama and play on it; this is something we can research.

    Next we observe the average ratings per genre…

    What is the distribution of ratings per genre?

    Note: Density plots are used to observe the distribution of a variable in a dataset.

    # https://github.com/HarilalOP/movielens-data-exploration/blob/master/src/main/code/exploratory_analysis.ipynb
    df_temp = full_df[['movie_id', 'rating']].groupby('movie_id').mean()
    # Histogram of all ratings
    df_temp.hist(bins=25, grid=False, edgecolor='b', density=True, label='Overall', figsize=(15, 8))
    # KDE plot per genre
    for genre in genres:
        df_temp = full_df[full_df[genre] == True][['movie_id', 'rating']].groupby('movie_id').mean()
        df_temp.rating.plot(grid=True, alpha=0.9, kind='kde', label=genre)
    plt.legend()
    plt.xlim(0, 5)
    plt.xlabel('Rating')
    plt.title('Rating Density plot')
    plt.show()

    The plot is predominantly left-skewed for most genres. This could be down to users being more willing to rate movies they enjoyed, since people tend not to finish a movie they aren't enjoying. We would have to do some research to find out whether this is the case here.
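    The visual impression of left skew can be backed with a number: pandas' Series.skew() returns a negative value when the long tail is on the low side. Below is a small sketch on made-up ratings, mirroring the mean-rating-per-movie step used for the density plot; the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings: most movies average around 4 to 4.5, one is rated poorly.
toy = pd.DataFrame(
    {
        "movie_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "rating": [5, 4, 4, 4, 5, 3, 4, 5, 1, 2],
    }
)

# Mean rating per movie, as in the density plot above.
mean_per_movie = toy.groupby("movie_id")["rating"].mean()

# Negative skewness means a longer tail towards low ratings (left skew).
skewness = mean_per_movie.skew()

print(mean_per_movie.tolist())
print(f"skewness: {skewness:.2f}")
```

    Running the same calculation once per genre on the real data would show which genres are most strongly skewed, instead of judging nineteen overlapping curves by eye.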

    Ok, the last plot was more complicated. We can simplify things again by looking more specifically at the users.

    What’s the Age Distribution by Gender?

    # creating new variable for ages of all males and females
    female_age_dist = user_df[user_df["gender"] == "F"]["age"]
    male_age_dist = user_df[user_df["gender"] == "M"]["age"]
    # plotting boxplots
    plt.figure(figsize=(12, 5))
    plt.boxplot([female_age_dist, male_age_dist])
    plt.xticks([1, 2], ["Female", "Male"], fontsize=14)
    plt.title("Age Distribution by Gender", fontsize=16)
    plt.show()

    The male age distribution has some outliers, and the female median age is slightly higher than the male median. Additionally, the box for the female ages is longer than the box for the male ages, meaning the female ages are more dispersed.

    What’s the most common occupation amongst the users?

    # creating the index and values variables for occupation
    occ_label = user_df.occupation.value_counts().index
    occ_label_counts = user_df.occupation.value_counts().values
    # plot horizontal bar chart
    plt.figure(figsize=(12, 5))
    plt.barh(y=occ_label, width=occ_label_counts)
    plt.title("Most common User Occupations", fontsize=16)
    plt.show()

    To nobody's surprise, students are the most common occupation in the dataset. Let's see the average ratings given by each occupation.

    What is the average rating of a given occupation?

    # creating an empty df to store data
    df_temp = pd.DataFrame(columns=["occupation", "avg_rating"])
    # loop through all the occupations
    for idx, occ in enumerate(occ_label):
        df_temp.loc[idx, "occupation"] = occ
        df_temp.loc[idx, "avg_rating"] = round(full_df[full_df["occupation"] == occ]["rating"].mean(), 2)
    # sort from highest to lowest
    df_temp = df_temp.sort_values("avg_rating", ascending=False).reset_index(drop=True)
    df_temp
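    The loop above can also be expressed as a single groupby chain, which avoids building the table row by row. This is an equivalent sketch on a made-up table (the occupations and ratings are illustrative assumptions), not the code used to produce the table in the original notebook.

```python
import pandas as pd

# Made-up ratings with the rater's occupation attached.
toy = pd.DataFrame(
    {
        "occupation": ["student", "student", "engineer", "engineer", "artist"],
        "rating": [4, 3, 5, 4, 2],
    }
)

# Mean rating per occupation, rounded and sorted from highest to lowest.
avg_by_occ = (
    toy.groupby("occupation")["rating"]
    .mean()
    .round(2)
    .sort_values(ascending=False)
    .rename("avg_rating")
    .reset_index()
)

print(avg_by_occ)
```

    Besides being shorter, the groupby version keeps avg_rating as a numeric column, whereas filling an empty DataFrame cell by cell tends to leave it with the object dtype.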

    Wrap Up

    Stopping at this point was difficult because there are so many more insights we can extract from this data. I personally believe that data visualization does not really have an end, so it is down to the person doing the visualizations to decide when to stop. A good indicator may be when we believe we understand the data well enough to begin building an effective baseline model (if we don't have one already). After building the model, we can always come back and iterate on our visualizations to get more insight from our data based on our model's predictions.

    This article was made using jupyter_to_medium.

    Let’s continue the conversation on LinkedIn…

    Translated from: https://towardsdatascience.com/comprehensive-data-explorations-with-matplotlib-a388be12a355
