The Upgraded Top N Analysis You Haven't Seen Yet with Pandas


    Did you know that you can have a Top N analysis based on more than one column in Pandas?

    A Top N analysis can be useful to select a subset of your data matching a specific condition. For example, what if you owned a restaurant and wanted to look at which customers contributed most to your overall sales? The easiest way to do this would be to look at the total sales for all your customers, then sort that list from highest to lowest.

    Another interesting subset for you to look at might be the customers who had the lowest (or negative) profit contribution. You could then accomplish this in a similar fashion, getting a list of all customers by their profit contribution and then taking only the lowest members.

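    In pandas terms, both of those customer lists are one groupby plus one sort away. Here is a minimal sketch on a made-up sales table (the customer names and the sales_amount and profit columns are invented purely for illustration):

    import pandas as pd

    # Hypothetical restaurant data -- every name and number here is made up for illustration.
    sales = pd.DataFrame({
        "customer": ["Ann", "Bob", "Cho", "Dee", "Ann", "Bob"],
        "sales_amount": [120.0, 45.0, 80.0, 15.0, 60.0, 30.0],
        "profit": [20.0, -5.0, 12.0, -2.0, 8.0, -3.0],
    })

    # Total sales and profit per customer.
    per_customer = sales.groupby("customer")[["sales_amount", "profit"]].sum()

    # Top customers by total sales, highest first.
    top_by_sales = per_customer.sort_values("sales_amount", ascending=False).head(3)

    # Customers with the lowest (possibly negative) profit contribution.
    bottom_by_profit = per_customer.nsmallest(3, "profit")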

    But what if you wanted to find out if there were customers that appeared on both lists?

    That could help you identify areas in which you’re actually losing money, even if it looks like the total sales value is very high. For example, if you had a dish that had a very low-profit margin, repeated orders of this dish alone might not be beneficial for your bottom line.

    Let’s take a look at how you can combine the built-in Pandas functions to do this kind of analysis!

    The data used in this piece is sourced from Yahoo Finance. We’ll be using a subset of Tesla stock price data. Run the code below if you want to follow along. (And if you’re curious as to the function I used to get the data, scroll to the very bottom and click on the first link.)

    import pandas as pd

    df = pd.read_html("https://finance.yahoo.com/quote/TSLA/history?period1=1546300800&period2=1550275200&interval=1d&filter=history&frequency=1d")[0]
    df = df.head(30)
    df = df.astype({"Open":'float', "High":'float', "Low":'float', "Close*":'float', "Adj Close**":'float', "Volume":'float'})

    Sample of Tesla stock prices from Yahoo Finance
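
    The Yahoo Finance page layout changes from time to time, so if read_html no longer finds that table, a small hand-built DataFrame with the same columns works as a stand-in for following along (the numbers below are invented, not real quotes, and with this few rows you would want a smaller n than 10 in the later examples):

    df = pd.DataFrame({
        "Date": ["Feb 14, 2019", "Feb 13, 2019", "Feb 12, 2019", "Feb 11, 2019"],
        "Open": [304.0, 310.5, 312.0, 309.0],
        "High": [305.5, 312.0, 314.0, 314.5],
        "Low": [293.5, 305.0, 307.0, 308.0],
        "Close*": [303.0, 308.0, 311.0, 312.9],
        "Adj Close**": [303.0, 308.0, 311.0, 312.9],
        "Volume": [9700000.0, 6400000.0, 4500000.0, 5400000.0],
    })  # invented sample rows with the same column names as the Yahoo table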

    Combined Top N and Bottom N with Pandas

    To demonstrate how we can combine the Top N and Bottom N analysis, we’re going to answer the following question:

    Which days had the highest increase in stock price while also having the lowest Open price in the data set?

    First, we’ll need to calculate how much the stock price changed during the day. This can be achieved with a simple calculation:

    df['Gain'] = df['Close*'] - df['Open']

    Data with new “Gain” column

    We’ve stored the difference between the “Close*” and “Open” columns into a new column called “Gain”. As you can see in the table above, not all column values are positive, as there were some days where the stock price decreased.

    Next, we’ll be creating two new DataFrames: one with the top 10 highest “Gain” values and one with the top 10 lowest “Open” values.

    Condition 1: Top N of Gain

    Here, we’ll be using the nlargest method in Pandas. This method accepts the number of elements you want to keep, the column you want to order the DataFrame by, and which duplicate values (if any) should appear in the outputted DataFrame. By default, nlargest will only keep the first of any duplicates, and the rest will be excluded from the returned DataFrame.

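    That duplicate-handling behaviour is controlled by the keep argument. Here is a quick toy example of the difference between the default keep='first' and keep='all' (the numbers are made up):

    toy = pd.DataFrame({"Gain": [5.0, 5.0, 3.0, 1.0]})

    toy.nlargest(1, "Gain")               # keep='first' (default): only the first of the tied 5.0 rows
    toy.nlargest(1, "Gain", keep="all")   # returns both 5.0 rows, even though that's more than n=1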

    This method will return the same results as df.sort_values(columns, ascending=False).head(n). This code is very easy to understand and will also work, but according to the documentation, the nlargest method is more performant.

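    If you want to convince yourself of that equivalence on this dataset, comparing the selected row labels is enough (assuming there are no ties in 'Gain'; with ties, the two approaches can pick or order the tied rows differently):

    top_via_sort = df.sort_values('Gain', ascending=False).head(10)
    top_via_nlargest = df.nlargest(10, 'Gain')

    # Same rows selected either way; comparing index sets ignores any difference in ordering.
    set(top_via_sort.index) == set(top_via_nlargest.index)   # True (barring ties in 'Gain')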

    The code to get the 10 rows with the highest “Gain” values is as follows:

    df_top = df.nlargest(10, 'Gain')

    Top N by “Gain”

    The returned DataFrame now gives us only the rows from the original DataFrame with the 10 highest “Gain” values. This new DataFrame is also already sorted in descending order.

    Condition 2: Bottom N of Open

    Next, we’ll use the nsmallest method on the DataFrame to get the rows with the lowest “Open” values. This method works exactly like the previous one, except it sorts and slices the values in ascending order.

    The code to achieve this is as follows:

    df_bottom = df.nsmallest(10, 'Open')

    Bottom N by “Open”

    Creating a “Combined Set” of Top and Bottom N Values

    We’re now ready to combine the two DataFrames to create a combined set. I’m borrowing this term from a built-in Tableau function, but all it refers to is a subset of the data that matches multiple conditions based on two or more columns. In this case, we’re looking for the data that exists only in the top 10 of “Gain” and the bottom 10 of “Open”.

    To get our combined set, there are two main steps:

    1. Concatenate the top N and bottom N DataFrames.
    2. Remove all rows except the duplicates.

    The code to achieve this is as follows:

    df_combined = pd.concat([df_top, df_bottom])
    df_combined['Duplicate'] = df_combined.duplicated(subset=['Date'])
    df_combined = df_combined.loc[df_combined['Duplicate']==True]

    Combined Top N and Bottom N v1

    First, we simply call pd.concat and stick the two DataFrames together. Since the top N and bottom N DataFrames come from the exact same source, we don’t need to worry about renaming any columns or specifying an index.

    Next, to create the “Duplicate” column, we make use of the duplicated method. This function returns a boolean Series, marking a row as “True” if it’s a duplicate and “False” otherwise. You can call it on a DataFrame and specify which column to search for duplicates in by passing the column name as an argument (in this case subset=['Date']). For demonstration, I created a new column “Duplicate” to store the new boolean values in.

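    If you haven’t used duplicated before, a tiny example shows what that boolean Series looks like. Note the default keep='first': only the second and later occurrences are flagged, which is exactly why one copy of each shared Date survives the filter below (toy values, just for illustration):

    import pandas as pd

    dates = pd.DataFrame({"Date": ["Jan 02", "Jan 03", "Jan 02"]})
    dates.duplicated(subset=["Date"])
    # 0    False
    # 1    False
    # 2     True   <- the second occurrence of "Jan 02" is the one marked as a duplicate
    # dtype: bool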

    I won’t go into how the loc[] function works, but if you haven’t used it before, quickly skim through this introduction so you understand how you can use it to filter your DataFrame in various ways. All we’re doing with it here is taking the values in the “Duplicate” column that are True because those are the ones that appear in both DataFrames.

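    The one-line version of what’s happening there: loc accepts a boolean Series and returns only the rows where it is True (toy values, just for illustration):

    import pandas as pd

    toy = pd.DataFrame({"x": [1, 2, 3, 4]})
    toy.loc[toy["x"] > 2]   # only the rows where the mask is True, i.e. x == 3 and x == 4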

    We don’t even really need to create a new column to mark the duplicate values. A slightly condensed (and equivalent) version of the above code would look like this:

    combined = pd.concat([df_top, df_bottom])
    combined = combined.loc[combined.duplicated()==True]

    Combined Top N and Bottom N v2

    Voila! Now we can see the rows in which there was a high “Gain” during one of the days with the lowest “Open” in the whole dataset.

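    If you prefer thinking in joins, an inner merge on the “Date” column gives the same intersection. This is just an alternative sketch, assuming “Date” uniquely identifies each row in the original data:

    # Rows whose Date appears in both the top-10 "Gain" and bottom-10 "Open" subsets.
    df_both = df_top.merge(df_bottom[['Date']], on='Date', how='inner')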

    And that’s all!

    I hope you found this quick look at the Top N (and Bottom N) analysis useful. Combining multiple conditions can allow you to filter and work with your data in new ways, which can help you extract valuable information from your dataset.

    Good luck with your Pandas work!

    More Pandas stuff by me:
    - 2 Easy Ways to Get Tables From a Website with Pandas
    - How to Quickly Create and Unpack Lists with Pandas
    - Top 4 Repositories on GitHub to Learn Pandas
    - A Quick Way to Reformat Columns in a Pandas DataFrame

    Translated from: https://towardsdatascience.com/the-upgraded-top-n-analysis-you-havent-seen-yet-with-pandas-47e6fdc67130
