If you are a data analyst or data scientist, you almost certainly know Pandas, the Python library that has become the standard tool for data wrangling and cleansing. However, Pandas has some small tricks that I bet you don't know all of.
In this article, I'll share some Pandas tricks that I know. I believe they can speed up our work and sometimes make our lives easier. Let's begin!
You probably know that Pandas can easily read from CSV and JSON files, and even directly from a database using SQLAlchemy. But did you know that Pandas can also read from the operating system's clipboard?
Suppose we have an Excel file with multiple sheets, and we want to process part of the data from one sheet in Python. What would we usually do?
1. Copy the data that needs to be processed in Python from the sheet.
2. Paste it into another sheet.
3. Save that sheet as a CSV file.
4. Get the path of the new CSV file.
5. Go to Python and use pd.read_csv('path/to/csv/file') to read the file into a Pandas data frame.
There is definitely an easier way to do this: pd.read_clipboard().
1. Copy the area of the data that you need.
2. Go to Python and use pd.read_clipboard().
It really is that easy. You don't need a separate CSV or Excel file if you just want to load some data into Pandas.
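pd.read_clipboard() needs a real clipboard, so it is awkward to demonstrate in a script. As a rough stand-in, the sketch below uses pd.read_csv() on an in-memory string to parse the kind of tab-separated text that a spreadsheet typically places on the clipboard. The sample values and column names are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample; spreadsheet cells are usually copied as tab-separated text.
copied_text = "name\tscore\nAlice\t90\nBob\t85\n"

# With real clipboard contents this would simply be: df = pd.read_clipboard()
df = pd.read_csv(io.StringIO(copied_text), sep="\t")
print(df)
```

Since pd.read_clipboard() hands the clipboard text to pd.read_csv(), keyword arguments accepted by read_csv can be passed to it as well.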
This function has a few more tricks. For example, when the data contains a date column, it might not be loaded with the correct type.
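The original illustration is not reproduced here. As a minimal sketch of the problem and of the fix described next, the example below again uses pd.read_csv() on an in-memory string as a stand-in for the clipboard. The sample rows are made up; the dob column name follows the article's own code.

```python
import io

import pandas as pd

# Made-up rows standing in for cells copied from a spreadsheet.
copied_text = "name\tdob\nAlice\t1990-01-15\nBob\t1985-07-30\n"

# Default parsing leaves the dates as plain strings.
raw = pd.read_csv(io.StringIO(copied_text), sep="\t")
print(raw["dob"].dtype)

# Naming the column in parse_dates converts it to a datetime column.
parsed = pd.read_csv(io.StringIO(copied_text), sep="\t", parse_dates=["dob"])
print(parsed["dob"].dtype)
```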
The trick is to tell Pandas which column holds dates that need to be parsed.

    df = pd.read_clipboard(parse_dates=['dob'])

Sometimes we may want to generate a sample data frame. The most common method is probably to use NumPy to generate an array of random values, and then build a data frame from the array.
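A minimal sketch of that NumPy approach, assuming normally distributed values; the shape, seed, and column names are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the sample is reproducible

# 30 rows x 4 columns of values drawn from a standard normal distribution
values = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
df = pd.DataFrame(values, columns=["A", "B", "C", "D"])
print(df.head())
```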
If we need the data to follow a certain distribution, such as a normal distribution, we have to do it this way. Most of the time, however, we don't care how the data is distributed; we just want some data to play around with. In that case, there is a much easier way: use the pandas.util.testing package to generate the sample data frame. (Note that pandas.util.testing has been deprecated since pandas 1.0 and is not available in recent releases, so the calls below apply to older versions.)
    pd.util.testing.makeDataFrame()

The index of the data frame is generated from random strings. By default, there are 30 rows and 4 columns.
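On pandas versions where pandas.util.testing is not available, a similar frame can be put together by hand. The sketch below is only an approximation of that behaviour: make_sample_frame, its n_rows, n_cols, and seed parameters, and the 10-letter index labels are all hypothetical choices of this example, not part of pandas.

```python
import string

import numpy as np
import pandas as pd


def make_sample_frame(n_rows=30, n_cols=4, seed=None):
    """Hypothetical helper: random floats with a random 10-letter string index."""
    rng = np.random.default_rng(seed)
    letters = list(string.ascii_letters)
    # One random 10-character label per row
    index = ["".join(rng.choice(letters, size=10)) for _ in range(n_rows)]
    # Column names A, B, C, ... (supports up to 26 columns)
    columns = list(string.ascii_uppercase[:n_cols])
    data = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(data, index=index, columns=columns)


df = make_sample_frame(n_rows=10, n_cols=5, seed=0)
print(df.shape)  # (10, 5)
```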
If we need a specific number of rows and columns, we can set testing.N to the number of rows and testing.K to the number of columns.

    pd.util.testing.N = 10
    pd.util.testing.K = 5
    pd.util.testing.makeDataFrame()

You probably know that we can easily write a data frame to a file with df.to_csv(), df.to_json() and so on. But sometimes we may want to compress the file, to save disk space or for other purposes.
For example, as a data engineer, I once had a requirement to write Pandas data frames to CSV files and transfer them to a remote server. To save both disk space and bandwidth, the files had to be compressed before sending.
The typical solution is to add one more step in whatever scheduling tool is in use, such as Airflow or Oozie. But Pandas can write a compressed file directly, so the solution becomes neater and less complicated, with fewer steps.
Let's generate a random data frame using the second trick (the sample data frame generator) :)

    pd.util.testing.N = 100000
    pd.util.testing.K = 5
    df = pd.util.testing.makeDataFrame()

In this case we just need a data frame; the values in it don't matter at all.
Now, let's save the data frame to a CSV file and check its size.

    import os
    df.to_csv('sample.csv')
    os.path.getsize('sample.csv')

Then we can write the same data frame to a compressed file and check the size of that file.
    df.to_csv('sample.csv.gz', compression='gzip')
    os.path.getsize('sample.csv.gz')

We can see that the compressed file is less than half the size of the plain CSV file.
Please note that this is not the best example, because our random data frame contains no repeated values. In practice, if the data contains categorical values, the compression ratio can be much higher!
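To illustrate that point, here is a small sketch with made-up categorical data that repeats a handful of values many times. The column names, values, and file names are assumptions of this example, and the files are written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Made-up data: only a few distinct values, repeated many times.
df = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"] * 10_000,
    "status": ["active", "inactive"] * 15_000,
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sample.csv")
    gzip_path = os.path.join(tmp, "sample.csv.gz")
    df.to_csv(plain_path, index=False)
    df.to_csv(gzip_path, index=False, compression="gzip")
    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

# Sizes in bytes, followed by the compressed/plain ratio
print(plain_size, gzip_size, round(gzip_size / plain_size, 4))
```

The exact numbers depend on the data, but highly repetitive columns like these should compress far better than random floats.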
By the way, you may have guessed what I'm going to say next. Yes, Pandas can read the compressed file directly back into a data frame. There is no need to decompress it in the file system first.
    df = pd.read_csv('sample.csv.gz', compression='gzip', index_col=0)

I prefer gzip because it is available by default on most Linux systems. Pandas also supports other compression formats, such as "zip" and "bz2".
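A minimal round-trip sketch, with a made-up three-row frame written to a temporary directory. It relies on the default compression='infer' setting of both to_csv and read_csv, which selects gzip from the ".gz" suffix, so the explicit compression argument is optional here.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "city": ["London", "Paris", "Berlin"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    # compression="infer" is the default, so the ".gz" suffix selects gzip.
    df.to_csv(path, index=False)
    # Gzip files start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    restored = pd.read_csv(path)

print(is_gzip, restored.equals(df))
```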
You have probably used pd.to_datetime() to convert strings into datetime values in Pandas. We usually call it with a format string such as %Y%m%d.
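For reference, a short sketch of that usual string-based usage, with made-up sample strings:

```python
import pandas as pd

# Made-up date strings in compact YYYYMMDD form
raw = pd.Series(["20200131", "20200229", "20201231"])

# The format string tells to_datetime how to read each piece of the string.
dates = pd.to_datetime(raw, format="%Y%m%d")
print(dates)
```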
Sometimes, however, our raw data looks like the following data frame.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'year': np.arange(2000, 2012),
        'month': np.arange(1, 13),
        'day': np.arange(1, 13),
        'value': np.random.randn(12)
    })

It is not uncommon to have the year, month and day in separate columns of a data frame. In fact, we can use pd.to_datetime() to convert them into a single datetime column in one step.
    df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

It's that easy!
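A small self-contained sketch of the same idea, with made-up values. The column names year, month and day matter here, because pd.to_datetime() uses them to decide which column supplies which part of the date.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "month": [1, 2, 3],
    "day": [4, 5, 6],
})

# Assemble one datetime column from the three component columns.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df["date"].tolist())
```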
In this article, I've shared some tricks in the Pandas library that I find quite useful. Of course, these little tricks are not essentials that we have to know, but knowing them can sometimes save us time.
I'll keep looking for more interesting things in Python. Please keep an eye on my profile. And finally:
Life is short, I use Python :)
Translated from: https://towardsdatascience.com/4-pandas-tricks-that-most-people-dont-know-86a70a007993