Pandas Data Sets

    Technology  2022-08-02

    Data Science, Programming

    There are many tutorials about pandas on the internet and in books. The pandas library is one of the libraries most used by data scientists and data engineers. In this tutorial, I am going to list the pandas functions I use the most.

    There are many functions to choose from; these are the ten I use most. Please let me know yours in the comment section, and I will be happy to add them to my collection. I will focus on the pandas DataFrame rather than the Series.

    Let's start by importing pandas.

    import pandas as pd

    1. File Operations

    Excel

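    import pandas as pd
    # read_excel takes no sep argument; dtypes is the column-type dictionary described in the note below
    data = pd.read_excel('path_to_your_excel file', index_col='name', dtype=dtypes, sheet_name='your_sheet_name')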

    CSV

    import pandas as pd
    # Use read_csv for CSV files; set sep to the delimiter your file uses
    data = pd.read_csv('path_to_your_csv_file', sep=',', index_col='name', dtype=dtypes)

    Read HTML

    import pandas as pd
    # read_html returns a list of DataFrames, one per table found in the page
    tables = pd.read_html('data.html', index_col=0)
    data = tables[0]

    Note: you can specify column data types when you import a file by passing a dictionary such as the one below to dtype. If the file has no header row, set header=None.

    dtypes = {'colname': 'datatype', 'Weight': 'float32'}

    When reading a file, pandas treats the following strings, among others, as missing values by default:

    '', 'nan', '-nan', 'NA', 'N/A', 'NaN', 'null'

    When writing a file, you can represent missing values with a marker of your choice:

    df.to_csv('new-data.csv', na_rep='(missing)')
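As a minimal sketch of the read and write steps above, the snippet below reads a small in-memory CSV, registers an extra missing-value marker with na_values, and writes the frame back out with na_rep. The column names, the values, and the "missing" marker are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: "NA" is a default missing marker, "missing" is a custom one.
csv_text = "name,Weight,City\nAnn,61.5,Oslo\nBob,NA,missing\nCid,70.2,Lima\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="name",
    dtype={"Weight": "float32"},
    na_values=["missing"],  # treated as missing in addition to the defaults
)

# Count the missing cells across the whole frame.
n_missing = int(df.isna().sum().sum())

# With no path given, to_csv returns the CSV text; na_rep marks each missing cell.
out = df.to_csv(na_rep="(missing)")
print(n_missing)
print(out)
```

Reading uses na_values to decide what counts as missing, while writing uses na_rep to decide how a missing cell is spelled out.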

    2. Statistics in a Pandas Data Frame

    Sum

    df['colName'].sum()

    Mean

    df['colName'].mean()

    Cumulative Sum

    df['colName'].cumsum()

    Summary Statistics

    df['colName'].describe()

    Count

    df['colName'].count()

    Min/Max

    df['colName'].min()
    df['colName'].max()

    Median

    df['colName'].median()

    Sample Variance

    df['colName'].var()

    Standard Deviation

    df['colName'].std()

    Skewness

    df['colName'].skew()

    Kurtosis

    df['colName'].kurt()

    Correlation Matrix Of Values

    df.corr()
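To see a few of these statistics on concrete numbers, here is a small sketch on an invented four-row frame. It assumes pandas 1.5 or later, where corr accepts numeric_only. Note that var() computes the sample variance (it divides by n - 1); passing ddof=0 gives the population variance.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({"score": [2, 4, 4, 6], "label": ["a", "b", "b", "c"]})

total = df["score"].sum()
mean = df["score"].mean()
running = df["score"].cumsum()       # running total, row by row
sample_var = df["score"].var()       # divides by n - 1
pop_var = df["score"].var(ddof=0)    # divides by n

# Restrict the correlation matrix to numeric columns so the text column is skipped.
corr = df.corr(numeric_only=True)

print(total, mean, sample_var, pop_var)
print(running.tolist())
print(corr)
```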

    3. Join Operations

    Pandas provides SQL-like join capabilities for large data sets.

    Simple Join

    pd.concat([df_a, df_b], axis=1)

    Merge two data frames on a shared column

    pd.merge(df_new, df_n, on='join_condition_column_name')

    Merge with outer join

    pd.merge(df_a, df_b, on='condition_column', how='outer')

    Merge with inner join

    pd.merge(df_a, df_b, on='condition_column', how='inner')

    Merge with right join

    pd.merge(df_a, df_b, on='common_column_id', how='right')

    Merge with left join

    pd.merge(df_a, df_b, on='common_column_id', how='left')

    Merge while adding a suffix to duplicate column names

    pd.merge(df_a, df_b, on='common_column_id', how='left', suffixes=('_left', '_right'))

    Merge based on indexes

    pd.merge(df_a, df_b, right_index=True, left_index=True)
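The difference between the join types is easiest to see on two tiny frames. The sketch below uses invented frames that share a key column named "id" (a hypothetical name) and an overlapping "value" column, so the suffixes argument also has something to rename.

```python
import pandas as pd

# Hypothetical frames sharing the key column "id".
df_a = pd.DataFrame({"id": [1, 2, 3], "value": ["a1", "a2", "a3"]})
df_b = pd.DataFrame({"id": [2, 3, 4], "value": ["b2", "b3", "b4"]})

# Inner join: only keys present in both frames.
inner = pd.merge(df_a, df_b, on="id", how="inner")

# Outer join: keys from either frame, with gaps filled by NaN.
outer = pd.merge(df_a, df_b, on="id", how="outer")

# Left join: every key from df_a; the clashing "value" columns get the suffixes.
left = pd.merge(df_a, df_b, on="id", how="left", suffixes=("_left", "_right"))

print(inner)
print(outer)
print(left)
```

In the left join, id 1 has no partner in df_b, so its value_right cell is NaN.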

    4. Pandas Data Analysis

    See the first 10 entries

    dataFrame.head(10)

    See the last 10 entries

    dataFrame.tail(10)

    Total number of records in the data set

    dataFrame.shape[0]

    Number of columns in the data set

    len(dataFrame.columns)  # dataFrame.columns on its own returns the column labels

    Index details of the data set

    data.index

    Summarize the Data Frame

    data.describe()

    Summarize all the columns

    data.describe(include="all")

    Mean value of a column

    round(data.columnname.mean())

    Least frequent values in a column

    data.columnname.value_counts().tail()
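The tail() trick above works because value_counts() sorts from most to least frequent. A short sketch on an invented data set (the column names "city" and "temp" are hypothetical) shows this together with the shape-based counts:

```python
import pandas as pd

# Hypothetical data set.
data = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Rome", "Oslo", "Lima"],
    "temp": [3.5, 19.0, 4.5, 15.0, 2.0, 21.0],
})

n_rows, n_cols = data.shape            # number of records, number of columns
counts = data["city"].value_counts()   # sorted from most to least frequent
least_common = counts.idxmin()         # label with the smallest count
mean_temp = round(data["temp"].mean(), 1)

print(n_rows, n_cols)
print(counts)
print(least_common, mean_temp)
```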

    5. Slice and Dice

    loc in Pandas: from https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

    loc is label-based, which means that we select rows and columns by specifying their labels (names).

    iloc in Pandas: from https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

    On the other hand, iloc is integer index-based. So here, we have to specify rows and columns by their integer index.

    Examples

    import numpy as np  # numpy is used to build the array
    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
    print(df)

    Output:

       0  1  2
    0  1  2  3
    1  4  5  6

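    # Using `iloc[]`: row position 0, column position 0
    print(df.iloc[0, 0])

    Output: 1

    # Using `loc[]`: row label 0, column label 2
    print(df.loc[0, 2])

    Output: 3

    # Using `at[]`: a single value by label
    print(df.at[1, 2])

    Output: 6

    # Using `iat[]`: a single value by position
    print(df.iat[0, 2])

    Output: 3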
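With the default integer index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses an invented frame with string row labels so the two can be told apart; it also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from their positions.
df = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=["a", "b", "c"])

by_label = df.loc["b", "y"]      # row labelled "b", column labelled "y"
by_position = df.iloc[1, 1]      # second row, second column

label_slice = df.loc["a":"b"]    # label slice: includes row "b"
position_slice = df.iloc[0:1]    # position slice: excludes position 1

print(by_label, by_position)
print(label_slice)
print(position_slice)
```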

    6. Appending a Column to an Existing Data Frame

    Example

    df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
    print(df)
    # Append a column to `df`
    df.loc[:, 'D'] = pd.Series(['5', '6', '7'], index=df.index)
    print(df)

    Output:

       A  B  C
    0  1  2  3
    1  4  5  6
    2  7  8  9
       A  B  C  D
    0  1  2  3  5
    1  4  5  6  6
    2  7  8  9  7

    7. Dropping Duplicate Values from a Data Frame

    Example

    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
    print(df)
    # Drop every row whose (A, C) pair occurs more than once
    df = df.drop_duplicates(subset=['A', 'C'], keep=False)
    print(df)
    # Drop the row at position 1
    df = df.drop(df.index[[1]])
    print(df)

    Output:

         A  B  C
    0  foo  0  A
    1  foo  1  A
    2  foo  1  B
    3  bar  1  A
         A  B  C
    2  foo  1  B
    3  bar  1  A
         A  B  C
    2  foo  1  B
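The keep argument decides which of the duplicated rows survive. As a small sketch on the same data as the example above, the three settings compare as follows:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# Rows 0 and 1 share the pair (A="foo", C="A"); rows 2 and 3 are unique.
first = df.drop_duplicates(subset=["A", "C"], keep="first")  # keep the first of each duplicated pair
last = df.drop_duplicates(subset=["A", "C"], keep="last")    # keep the last of each duplicated pair
none = df.drop_duplicates(subset=["A", "C"], keep=False)     # drop every duplicated row

print(list(first.index), list(last.index), list(none.index))
```

Use keep=False only when a duplicated record should be discarded entirely rather than collapsed to a single copy.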

    8. Updating Values

    Example

    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
    print(df)
    # Replace strings with numbers
    df = df.replace(['foo', 'A', 'B'], [10, 90, 80])
    print(df)
    # Replace using `regex`: any remaining run of non-digit characters becomes 'apple'
    df = df.replace({r'[^0-9]+': 'apple'}, regex=True)
    print(df)

    9. Apply Function

    We can write a custom function or use a lambda function and apply it to every record in the data frame in a single call.

    Example

    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
    print(df)
    print('\n')
    doubler = lambda x: x * 2
    # Replace every non-numeric string with 10 so that all values can be doubled
    df = df.replace({r'[^0-9]+': 10}, regex=True)
    df = df.apply(doubler)
    print(df)

    Creating a new column from existing columns

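    dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    print(dfa)
    # Later keyword arguments can use columns created earlier in the same assign call
    dfa = dfa.assign(C=lambda x: x['A'] + x['B'], D=lambda x: x['A'] + x['C'])
    print(dfa)

    Output:

       A  B
    0  1  4
    1  2  5
    2  3  6
       A  B  C   D
    0  1  4  5   6
    1  2  5  7   9
    2  3  6  9  12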
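apply can work column by column, row by row, or element by element on a single column, depending on how it is called. The following sketch, on an invented two-column frame, shows the three forms side by side:

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Default axis=0: the function receives each column as a Series.
col_range = dfa.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_totals = dfa.apply(lambda row: row["A"] + row["B"], axis=1)

# On a single column (a Series), apply works element by element.
doubled_a = dfa["A"].apply(lambda v: v * 2)

print(col_range.tolist(), row_totals.tolist(), doubled_a.tolist())
```

For simple arithmetic such as these sums, plain vectorized expressions like dfa["A"] + dfa["B"] are usually faster; apply is most useful when the logic does not vectorize easily.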

    10. Taking a Dictionary or Series as Input to a Data Frame

    Take a dictionary as input to your DataFrame

    my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
    print(pd.DataFrame(my_dict))

    Output:

       1  2  3
    0  1  1  2
    1  3  2  4

    Series as input to your Data Frame

    my_series = pd.Series({"England": "London", "Nepal": "Kathmandu", "China": "Beijing", "Belgium": "Brussels"})
    print(pd.DataFrame(my_series))
    print(len(my_series.shape))

    Output:

                     0
    England     London
    Nepal    Kathmandu
    China      Beijing
    Belgium   Brussels
    1

    There are many functions I haven't included in this list, but these are the ones I use the most. Pandas is a rich library with very good documentation on its official website: https://pandas.pydata.org

    Translated from: https://medium.com/towards-artificial-intelligence/pandas-for-data-scientist-a2d8a8e81d04
