3 Advanced Pandas Methods for Data Scientists

Tech · 2022-07-12


Inside AI

In my view, the Pandas and NumPy libraries together have saved hundreds of hours of programming time and are invaluable tools for data scientists and machine learning professionals.

It is nearly impossible to be a good hands-on data scientist or machine learning professional without a good grasp of these two libraries.

Most of the time, the raw data available has a few blank values, is not in the right format, or is spread across several different sources linked by a primary key. In this article, I will discuss three advanced Pandas data preparation and wrangling techniques which will be useful for setting the data in the right format for machine learning algorithm input.

    Let us start with a simple sample DataFrame with four columns and three rows.


import pandas as pd

df1 = pd.DataFrame(
    [["Keras", "2015", "Python", "Yes"],
     ["CNTK", "2016", "C++", "Yes"],
     ["PlaidML", "2017", "Python, C++, OpenCL", "No"]],
    index=[1, 2, 3],
    columns=["Software", "Initial Release", "Written In", "CUDA Support"])
print(df1)

The sample DataFrame holds three different attributes, viz. initial release, written in, and CUDA support, for three machine learning libraries.

DataFrame df1 — Output of the above code

    We can see that the machine learning library name is one of the columns and different attributes of the libraries are in individual columns.


    Let us imagine that the machine learning algorithm expects the same information in a different format. It is expecting all the attributes of the libraries in one column and respective values in another column.


    We can use the “melt” function to transpose the DataFrame from the current format to the expected arrangement.


The columns from the original DataFrame that should stay in the same arrangement, each in its own column, are specified in the id_vars parameter.

df2 = df1.melt(id_vars=["Software"], var_name="Characteristics")
print(df2)

All the remaining attributes of the original DataFrame are populated under the Characteristics column, with their values in the corresponding value column.

DataFrame df2 after the “melt” function on df1 — Output of the above code

Many times the raw data collected is arranged with all attributes in one column and their values in a corresponding column. The machine learning algorithm, however, expects the input data with each attribute's values in individual columns. We can use the “pivot” method in Pandas to transform the data into the required format.

df3 = df2.pivot(index="Software", columns="Characteristics", values="value")
print(df3)

In essence, “melt” and “pivot” are complementary functions. “Melt” stacks all attributes into one column with values in the corresponding column, while “pivot” splits the attributes stacked in one column back into separate columns.

Pivot function on DataFrame df2 — Output of the above code
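Since “melt” and “pivot” are complementary, a quick way to convince yourself is to round-trip df1 through both. The sketch below is self-contained and rebuilds df1 with a default index; the sorting and column reordering at the end are my own additions, needed only because pivot sorts both axes.

```python
import pandas as pd

# Rebuild the article's df1 (default index) so the sketch is self-contained.
df1 = pd.DataFrame(
    [["Keras", "2015", "Python", "Yes"],
     ["CNTK", "2016", "C++", "Yes"],
     ["PlaidML", "2017", "Python, C++, OpenCL", "No"]],
    columns=["Software", "Initial Release", "Written In", "CUDA Support"])

# Wide -> long with melt, then long -> wide again with pivot.
long_form = df1.melt(id_vars=["Software"], var_name="Characteristics")
wide_again = long_form.pivot(index="Software",
                             columns="Characteristics",
                             values="value")

# Restore the original column order and row order before comparing;
# pivot sorts both axes, so a direct equals() would fail on ordering alone.
restored = (wide_again.reset_index()[list(df1.columns)]
            .sort_values("Software")
            .reset_index(drop=True))
print(restored.equals(df1.sort_values("Software").reset_index(drop=True)))
```

Note that pivot raises a ValueError if the index/columns pair contains duplicate entries; in that case pivot_table with an aggregation function is the usual fallback.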

    Sometimes, we have raw data spread across several files, and we need to consolidate it into one dataset. For the sake of simplicity, first, we will consider the case where data points in individual datasets are already in the right row sequence.


In a new DataFrame df4, two more attributes are declared for the machine learning libraries.

df4 = pd.DataFrame(
    [["François Chollet", "MIT license"],
     ["Microsoft Research", "MIT license"],
     ["Vertex.AI,Intel", "AGPL"]],
    index=[1, 2, 3],
    columns=["Creator", "License"])
print(df4)

    We can concatenate this new DataFrame with original DataFrame using the “concat” function.


df5 = pd.concat([df1, df4], axis=1)
print(df5)

The concatenation of DataFrame df1 and df4 holds the original three attributes and the two new attributes in one DataFrame df5.

Concatenation of DataFrame df1 and df4 — Output of the above code
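The example above stitches columns together with axis=1. “concat” can just as easily stack rows with axis=0, which is handy when the same attributes arrive in several batches. The sketch below uses a small hypothetical second batch, invented purely for illustration:

```python
import pandas as pd

top = pd.DataFrame({"Software": ["Keras", "CNTK"],
                    "Initial Release": ["2015", "2016"]})
# A hypothetical second batch with the same columns, for illustration only.
bottom = pd.DataFrame({"Software": ["PlaidML"],
                       "Initial Release": ["2017"]})

# axis=0 stacks rows; ignore_index=True rebuilds a clean 0..n-1 index
# instead of carrying over the two original indexes.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked.shape)  # (3, 2)
```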

In real life, most of the time the data points in the individual datasets are not in the same row sequence, and there is a common key (identifier) which links the data points across the datasets.

    We have a DataFrame df6 storing the “creator” and “license” value of the machine learning library. “Software” is the common attribute between this and original DataFrame df1.


df6 = pd.DataFrame(
    [["CNTK", "Microsoft Research", "MIT license"],
     ["Keras", "François Chollet", "MIT license"],
     ["PlaidML", "Vertex.AI,Intel", "AGPL"]],
    index=[1, 2, 3],
    columns=["Software", "Creator", "License"])
print(df6)

We can see that the row sequence of DataFrame df6 is not the same as that of the original DataFrame df1. In df6 the attributes of the “CNTK” library are stored first, whereas in DataFrame df1 the values for “Keras” are mentioned first.

DataFrame df1 and df6 — Output of the above code

In such a scenario, where we need to consolidate raw data spread across several datasets and the values are not in the same row sequence in the individual datasets, we can consolidate the data into one DataFrame using the “merge” function.

combined = df1.merge(df6, on="Software")
print(combined)

The common key (unique identifier) used to match the right corresponding values across the different datasets is specified with the “on” parameter of the merge function.

Output of the above code — consolidated DataFrame of df1 and df6 based on the unique identifier “Software”
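One caveat worth knowing: by default merge performs an inner join, so keys that appear in only one DataFrame are silently dropped. The sketch below illustrates this with small stand-in frames; the “MXNet” row is invented purely to provide an unmatched key, and how="outer" with indicator=True keeps every key and records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"Software": ["Keras", "CNTK", "PlaidML"],
                     "Initial Release": ["2015", "2016", "2017"]})
# "MXNet" is a hypothetical entry, added only to create an unmatched key.
right = pd.DataFrame({"Software": ["CNTK", "Keras", "MXNet"],
                      "License": ["MIT license", "MIT license", "Apache-2.0"]})

inner = left.merge(right, on="Software")  # default how="inner"
print(inner["Software"].tolist())         # ['Keras', 'CNTK']; PlaidML and MXNet dropped

# how="outer" keeps all keys; indicator=True adds a _merge column
# saying whether each row matched in both frames or only one.
outer = left.merge(right, on="Software", how="outer", indicator=True)
print(outer.shape)  # (4, 4)
```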

    Recap


In this article, we have learnt advanced Pandas functions that are quite handy during data preparation.

Melt: Unpivots the attributes/features into one column, with values in the corresponding column

Pivot: Splits the attributes/features stacked in one column into individual columns

Concat: Concatenates DataFrames along rows or columns

Merge: Merges DataFrames using a unique identifier shared across the DataFrames, even when the values are in different row sequences

    Conclusion


Pandas can help prepare raw data in the right format for machine learning algorithms. Pandas functions like “melt”, “pivot”, “concat” and “merge” discussed in this article can transform datasets with a single line of code which would otherwise take many nested loops and conditional statements. In my view, the difference between a professional and an amateur data scientist is not the knowledge of an exotic algorithm or a super hyper-parameter optimisation technique. Deep knowledge of Pandas, NumPy and Matplotlib puts someone in the professional league and enables them to achieve results quicker with cleaner code and higher performance.

    You can learn data visualisation using pandas in 5 Powerful Visualisation with Pandas for Data Preprocessing.


    Also, learn Advanced Visualisation for Exploratory data analysis (EDA) like a professional data scientist.


Translated from: https://towardsdatascience.com/3-advanced-pandas-methods-for-data-scientist-c7935152b2ca

