Indexing and Selecting Data in Python

    Technology  2022-08-01

    Introduction

    The Python and NumPy indexing operator [] and the attribute operator '.' (dot) provide quick and easy access to pandas data structures across a wide range of use cases. An index is like an address: it is how any data point in a DataFrame or Series is located. Rows and columns both have indexes.

    The axis labeling information in pandas objects serves many purposes:

    Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.

    Enables automatic and explicit data alignment.

    Allows intuitive getting and setting of subsets of the data set.

    Different Choices for Indexing and Selecting Data

    Object selection has had several user-requested additions to support more explicit location-based indexing. The article was written against three multi-axis indexers: .loc (label-based), .iloc (integer-position-based), and .ix (mixed). Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current versions offer only .loc and .iloc.

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 3),
                      index=['a', 'b', 'c', 'd', 'e', 'f'],
                      columns=['A', 'B', 'C'])

    # label-based slice from row 'a' through row 'f' (both ends included)
    print(df.loc['a':'f'])
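    Because the frame above is filled with random numbers, its output changes on every run. As a minimal sketch of how label-based selection with .loc behaves, the example below uses a small frame with fixed values; the name df_demo and its contents are assumptions made for illustration and are not part of the original article.

```python
import pandas as pd

# Hypothetical fixed values so the result is reproducible
df_demo = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# A label slice includes BOTH endpoints: rows 'b', 'c' and 'd'
rows_b_to_d = df_demo.loc["b":"d"]

# A single row label returns that row as a Series
row_a = df_demo.loc["a"]

# A row label plus a column label returns a single value
value = df_demo.loc["c", "B"]

print(rows_b_to_d)
print(row_a)
print(value)
```

    Unlike ordinary Python slices, a .loc slice keeps its end label, which is why 'a':'f' in the article's code returns all six rows.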

    How can we check whether the values in a particular row are positive or negative? We compare the row's values with zero, and the result is a boolean Series of True and False. True means the value is greater than zero; False means it is zero or below.

    # compare every value in row 'a' with zero to get a boolean Series
    print(df.loc['a'] > 0)

    As the code above shows, comparing a row selected with .loc against zero tells us which values are positive. In the author's run, row 'a' had a negative value in the first column and positive values in the other two, so the result was False, True, True. The data is random and unseeded, so your own values will differ.
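    To make the comparison reproducible, here is a small sketch with fixed values chosen to match the pattern described above (negative, positive, positive in row 'a'). The frame df_demo and its numbers are assumptions for illustration only. The sketch also shows one possible use of the resulting boolean Series: passing it back to .loc to keep only the matching columns.

```python
import pandas as pd

# Hypothetical fixed values (the article's own data is random)
df_demo = pd.DataFrame(
    {"A": [-0.5, 1.2], "B": [0.3, -0.7], "C": [2.1, 0.0]},
    index=["a", "b"],
)

# Compare each value in row 'a' with zero -> boolean Series indexed by column
mask = df_demo.loc["a"] > 0
print(mask.tolist())  # [False, True, True]

# Use the mask on the column axis to keep columns that are positive in row 'a'
positive_cols = df_demo.loc[:, mask]
print(list(positive_cols.columns))  # ['B', 'C']
```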

    To access a single column, we can use a colon in the row position. A bare colon inside the square brackets means "all rows", because no start or stop is given, and the 'B' after the comma selects column B.
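    The rows-then-columns layout of .loc[rows, columns] can be sketched on a small fixed frame. The frame df_demo is hypothetical; the sketch also contrasts a single label, which gives a Series, with a one-element list, which gives a one-column DataFrame.

```python
import pandas as pd

# Hypothetical fixed values for a reproducible illustration
df_demo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["a", "b", "c"],
)

# All rows, column 'B' only -> Series
col_b = df_demo.loc[:, "B"]

# All rows, a list holding one label -> DataFrame with one column
col_b_frame = df_demo.loc[:, ["B"]]

# All rows, a label slice of columns (both ends included)
cols_a_to_b = df_demo.loc[:, "A":"B"]

print(col_b)
print(col_b_frame)
print(cols_a_to_b)
```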

    print(df.loc[:, 'B'])

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])

    # select the first eight rows by position (here, every row)
    print(df1.iloc[:8])

    In the small program above, .iloc selects by integer position, so rows and columns are addressed by where they sit rather than by their labels. Slicing picks out particular rows and columns. In the integer-slicing example that follows, the first statement returns the first four rows, and the second returns the rows at positions 2 and 3 together with columns B and C (positions 1 and 2).

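    # Integer slicing
    print(df1.iloc[:4])
    print(df1.iloc[2:4, 1:3])

    import pandas as pd
    import numpy as np

    df2 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])

    # The original used df2.ix[:4]. The .ix indexer was removed in pandas 1.0;
    # on this integer-labelled index it sliced by label, which .loc reproduces
    # (labels 0 through 4, inclusive).
    print(df2.loc[:4])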
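    On a default integer index, position-based and label-based slices look alike but return different numbers of rows. The sketch below shows the difference on a hypothetical frame, df_demo, filled with consecutive integers so the values are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 8 rows x 3 columns holding the integers 0..23
df_demo = pd.DataFrame(np.arange(24).reshape(8, 3), columns=["A", "B", "C"])

# Position-based: the stop position is excluded -> rows 0, 1, 2, 3
by_position = df_demo.iloc[:4]

# Label-based: the stop label is included -> rows 0, 1, 2, 3, 4
by_label = df_demo.loc[:4]

# Rows at positions 2 and 3, columns at positions 1 and 2 (B and C)
block = df_demo.iloc[2:4, 1:3]

print(len(by_position), len(by_label))  # 4 5
print(block)
```

    This off-by-one difference is worth keeping in mind when replacing old .ix code with .loc or .iloc.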

    The query() Method

    DataFrame objects have a query() method that allows selection using an expression. For instance, you can select the rows where the value in column b lies between the values in columns a and c.

    For example:

    # create a DataFrame of 10 rows and 3 columns
    df4 = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
    df4

    With ordinary boolean indexing, the condition checks that a is smaller than b and that b is smaller than c. A row is kept only if both comparisons are true. In the author's run, only one row passed the condition.
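    The boolean-indexing code that this paragraph describes does not appear in this copy of the article. A minimal sketch of what such a selection usually looks like is given below; the frame df_demo and its fixed values are assumptions chosen so that exactly one row satisfies a < b < c, and they are not the author's data.

```python
import pandas as pd

# Hypothetical fixed values; only the first row satisfies a < b < c
df_demo = pd.DataFrame({
    "a": [0.1, 0.9, 0.4, 0.2],
    "b": [0.5, 0.3, 0.6, 0.1],
    "c": [0.8, 0.7, 0.5, 0.3],
})

# Build one boolean Series per comparison and combine them with &
mask = (df_demo["a"] < df_demo["b"]) & (df_demo["b"] < df_demo["c"])

# Keep only the rows where the combined mask is True
selected = df_demo[mask]
print(selected)
```

    The parentheses around each comparison are required, because & binds more tightly than < in Python.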

    The same condition can be passed to query(). Compared with boolean indexing, the query syntax is simpler.

    # with query(): column names are used directly inside the expression
    df4.query('(a < b) & (b < c)')
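    As a quick check that the two styles select the same rows, the sketch below runs both on a hypothetical frame with fixed values (df_demo, not the author's data) and compares the results.

```python
import pandas as pd

# Hypothetical fixed values; only the first row satisfies a < b < c
df_demo = pd.DataFrame({
    "a": [0.1, 0.9, 0.4, 0.2],
    "b": [0.5, 0.3, 0.6, 0.1],
    "c": [0.8, 0.7, 0.5, 0.3],
})

# Boolean indexing: the frame name is repeated for every column
via_mask = df_demo[(df_demo.a < df_demo.b) & (df_demo.b < df_demo.c)]

# query(): the same condition written as a string expression
via_query = df_demo.query("(a < b) & (b < c)")

# Both approaches return the same rows
print(via_query.equals(via_mask))  # True
```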

    Duplicate Data

    If you want to identify and remove duplicate rows in a DataFrame, two methods will help: duplicated and drop_duplicates.

    duplicated: returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.

    drop_duplicates: removes duplicate rows.

    First, create a DataFrame with a default integer index and label-based column names.

    df5 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two'],
                        'b': ['x', 'y', 'x', 'y', 'x'],
                        'c': np.random.randn(5)})
    df5

    The DataFrame has a default integer index and three columns: a, b, and c. Next we compute a boolean value for each row that says whether the row is a repeat, judged here by column a alone. The first time a value appears the result is False; every later occurrence of the same value is marked True.

    df5.duplicated('a')
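    The sketch below shows what duplicated returns for the same a and b columns as df5 (the random c column is left out so the result is reproducible; the name df_demo is an assumption). It also illustrates the keep parameter and checking more than one column at a time.

```python
import pandas as pd

# Same 'a' and 'b' values as df5, without the random 'c' column
df_demo = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two"],
    "b": ["x", "y", "x", "y", "x"],
})

# Default keep='first': the first occurrence of each value in 'a' is False
first = df_demo.duplicated("a")
print(first.tolist())  # [False, True, False, True, True]

# keep='last': the last occurrence of each value is the one marked False
last = df_demo.duplicated("a", keep="last")
print(last.tolist())  # [True, False, True, True, False]

# Judging by both columns: only the final ('two', 'x') row is a repeat
both = df_demo.duplicated(["a", "b"])
print(both.tolist())  # [False, False, False, False, True]
```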

    The two methods differ in their output: duplicated returns a boolean Series, while drop_duplicates returns the DataFrame with the duplicate rows removed.

    df5.drop_duplicates('a')
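    A short sketch of drop_duplicates on the same a and b values follows (again without the random c column; df_demo is a hypothetical name). It shows which rows survive under the different keep settings.

```python
import pandas as pd

# Same 'a' and 'b' values as df5, without the random 'c' column
df_demo = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two"],
    "b": ["x", "y", "x", "y", "x"],
})

# Keep the first row for each distinct value of 'a' -> index 0 and 2
unique_first = df_demo.drop_duplicates(subset="a")

# Keep the last row for each distinct value of 'a' -> index 1 and 4
unique_last = df_demo.drop_duplicates(subset="a", keep="last")

# keep=False drops every row whose 'a' value occurs more than once
no_repeats = df_demo.drop_duplicates(subset="a", keep=False)

# Judge by both columns, then renumber the surviving rows from 0
unique_pairs = df_demo.drop_duplicates(subset=["a", "b"]).reset_index(drop=True)

print(unique_first)
print(unique_last)
print(len(no_repeats), len(unique_pairs))  # 0 4
```

    The original index labels are preserved by default, which is why reset_index(drop=True) is used when a clean 0..n-1 index is wanted.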

    Conclusion

    There are many ways to pull elements, rows, and columns from a DataFrame. pandas provides several indexing methods for selecting data; .loc and .iloc, which are used with the indexing operator [], are by far the most common.

    My Previous Articles:

    Robotic Vision in Agriculture

    Interesting 10 Machine Learning and Data Science Projects with Datasets

    Basic Understanding of NLP With Python

    Translated from: https://medium.com/analytics-vidhya/indexing-and-selecting-data-in-python-b09515127b40
