Areas for Improvement in Project Development
Software development is the process developers and programmers follow to design, write, document, and test code. Regardless of which programming language you use or what your target application field is, following the guidelines of good software development is essential to building a high-quality, maintainable project.
Data science projects, perhaps more than other types of software projects, should be built with maintainability in mind. That is because, in most data science projects, the data is not constant and is updated frequently. Moreover, any data science project is expected to be extensible and crash-resistant: it should withstand mistakes in the data.
Because every part of the code in a data science project is built to fit a specific shape or form of data, the code might break if it is given the wrong data. Of course, you never want your code to break, no matter what data it is fed. Hence, when designing and building the code, there are a few things to keep in mind to make it more resilient.
There are many guidelines to follow to design and write good, stable code. However, in this article, we will focus on what I think are the four most important rules, or skills, needed to build a solid data science project.
So, let’s get right to it…
We can’t talk about good software without mentioning documentation. There are two steps to keeping your code clean and well documented. The first step is commenting your code. Comments are critical for walking the people who read your code, and most importantly your future self, through your thought process at the time you wrote it.
Comments need to be simple, no more than two sentences, and straight to the point. Never forget to write a descriptive docstring whenever you define a class or a function or create your own module. When writing comments, always remember:
Comments are not there to explain code to people; code is there to explain comments to the computer.
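As a rough illustration of the advice above, here is a minimal sketch of a function with a descriptive docstring and a single short comment that explains a decision rather than restating the code. The function `normalize` and its behavior are hypothetical and chosen only for this example; the docstring layout is one common convention, not the only one.

```python
def normalize(values):
    """Scale numbers linearly into the range [0, 1].

    Args:
        values: A non-empty list of ints or floats.

    Returns:
        A new list of floats; the input list is left unchanged.

    Raises:
        ValueError: If ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Every value is identical, so there is no spread to scale by;
        # returning zeros avoids a division by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]
```

The docstring tells a reader what the function expects and returns without opening the body, while the comment records why the special case exists.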
Once your code and comments are done (well, for the time being, since code is never really done), you need to build sufficient documentation for your code. Documentation is an external explanation of the code, usually written in plain English. It is often created using documentation processing tools, such as Sphinx and Docutils, and is often part of your project’s website.
When it comes to best practices, it’s a good idea to start writing your documentation before you start coding. It will act as a guide to what needs to be done. Unfortunately, most of us — including myself — don’t follow this rule. However, we all need to start practicing it.
When we write code, we often write it with particular variables and datasets in mind. As a result, it is very common for code to contain bugs that only appear in particular cases or with a specific dataset. Therefore, testing your application before deploying it is crucial.
But testing can get quite complicated, especially when it comes to data science projects. Often, data science projects are tested through reviews by other data scientists, because most of the well-known testing methodologies are difficult to apply to data science projects.
That is because a simple change in the data could lead to significant changes in the performance of the code. Over the years, researchers and developers have looked for the best way to test data science projects, and the approach they keep coming back to is unit testing.
Unit tests exercise small pieces of your code in isolation and are used to detect changes that may break the flow of your program. They help with maintaining and changing the code. There are many Python testing libraries that you can use to perform unit testing.
Unittest is the built-in Python library for unit testing. It is often referred to as PyUnit, and it is an easy way to create unit testing programs.
Pytest, my favorite, is a complete testing tool. It takes a simple, straightforward approach to building and using unit tests.
Hypothesis is a unit test generation tool. Its goal is to assist developers in creating and using unit tests that tackle the edge cases of their code.
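To make the idea of a unit test concrete, here is a small sketch using the built-in unittest library, since it needs no installation. The function under test, `drop_missing`, is a hypothetical data-cleaning helper invented for this example; the tests cover the ordinary case plus a few of the edge cases that tend to surface only with particular datasets.

```python
import unittest


def drop_missing(rows, column):
    """Return only the rows that have a non-None value for ``column``."""
    return [row for row in rows if row.get(column) is not None]


class DropMissingTests(unittest.TestCase):
    def test_keeps_complete_rows(self):
        rows = [{"age": 31}, {"age": 47}]
        self.assertEqual(drop_missing(rows, "age"), rows)

    def test_removes_rows_with_none(self):
        rows = [{"age": 31}, {"age": None}]
        self.assertEqual(drop_missing(rows, "age"), [{"age": 31}])

    def test_removes_rows_without_the_column(self):
        rows = [{"name": "Ada"}]
        self.assertEqual(drop_missing(rows, "age"), [])

    def test_empty_input(self):
        self.assertEqual(drop_missing([], "age"), [])
```

Saved in a file whose name starts with `test_`, these tests can be run with `python -m unittest` from the same directory. Pytest can also collect and run `unittest.TestCase` classes like this one.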
Getting a little more specific to data science projects, one thing we need to be careful with is managing our data. We need to consider many things, such as: How is your data created? How big is it? Will it be loaded every time or stored in memory?
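One way to look at the "loaded every time or stored in memory" question is to process the data as a stream, so that only one row is held in memory at a time. The sketch below assumes a CSV source with a numeric column named `reading`; both the column name and the helper functions `iter_values` and `running_mean` are made up for this illustration, and an in-memory `io.StringIO` stands in for a real file.

```python
import csv
import io


def iter_values(lines, column):
    """Yield one float per CSV row, reading rows lazily."""
    for row in csv.DictReader(lines):
        yield float(row[column])


def running_mean(values):
    """Compute the mean of an iterable without storing its items."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# A small in-memory stand-in for a file opened with open(path, newline="").
sample = io.StringIO("reading\n1.5\n2.5\n5.0\n")
mean = running_mean(iter_values(sample, "reading"))
print(mean)
```

Whether streaming is the right choice depends on the data: it keeps memory use flat, but anything that needs several passes over the data will have to re-read the source each time.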
When working with data, we need to be very careful with memory management and with how the code interacts with the data. One thing to consider is how Python function calls affect the memory usage of your code. Sometimes, function calls take up more memory than you realize.
One way you can overcome that is by using Python’s automatic memory management capabilities. Here’s how Python deals with function calls:
Every time you call a function, the objects it creates each carry a counter of the number of places that reference them.
Whenever a new reference to an object is made, its counter is incremented by 1.
When a reference goes away, the counter is decremented by 1. Once it hits 0, the object’s memory is freed.
If you’re wondering how you can write code that takes advantage of this automatic memory management, wonder no more. Itamar Turner-Trauring proposed three ways to make your functions more memory efficient:
Try to minimize the use of local variables.
If you can’t, re-use variables instead of defining new ones.
Transfer ownership of objects that take up a lot of memory.
Last but not least, to help you build resilient projects, make use of tools built specifically for data science. Of course, there are well-known tools, such as IPython, Pandas, NumPy, and Matplotlib.
But let me shed some light on two lesser-known tools:
GraphLab Create is a Python library used to build large-scale, high-performance data products quickly. You can use GraphLab Create to apply state-of-the-art machine learning algorithms, such as deep learning, boosted trees, and factorization. You can perform data exploration through visualization, and you can quickly deploy your project using Predictive Services.
Fil is a Python memory profiling tool for data science. You can use Fil to measure peak memory usage in your Jupyter notebook, to measure peak memory usage of ordinary (non-Jupyter) Python scripts, and to debug out-of-memory crashes in your code. Moreover, Fil can help you reduce your memory usage significantly.
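The earlier tip about re-using variables can be checked with a peak-memory measurement. The sketch below does not use Fil; as a rough stand-in it uses the standard library's tracemalloc module, which reports the peak size of Python-level allocations. The functions `keep_every_step`, `reuse_one_name`, and `peak_bytes` are hypothetical names made up for this comparison, and the exact byte counts will vary between Python versions.

```python
import tracemalloc


def keep_every_step(n):
    """Give each intermediate result its own local variable."""
    raw = list(range(n))
    doubled = [x * 2 for x in raw]
    # raw and doubled are both still referenced while shifted is built,
    # so three lists are alive at the same time.
    shifted = [x + 1 for x in doubled]
    return sum(shifted)


def reuse_one_name(n):
    """Rebind a single name so earlier results can be freed."""
    data = list(range(n))
    # After this assignment the original list has no references left,
    # its reference count drops to 0, and its memory is released.
    data = [x * 2 for x in data]
    data = [x + 1 for x in data]
    return sum(data)


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


result_a, peak_a = peak_bytes(keep_every_step, 200_000)
result_b, peak_b = peak_bytes(reuse_one_name, 200_000)
print(f"separate names: {peak_a} bytes, one name: {peak_b} bytes")
```

Both functions compute the same result, but the second never holds more than two of the lists at once, so its measured peak is lower. The trade-off is readability: a single re-used name says less about what each step produces.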
Nowadays, building a good data science project is not enough to make you stand out. You need your project to be crash-resistant and memory efficient. That’s why applying some software development skills can take your data science project to the next level and make it stand out.
The software development skills we discussed in this article are:
Efficient documenting and commenting.
Testing, testing, and then some more testing.
Wise data and memory management.
Special tools that can ease your work and increase the efficiency of your project.
What we didn’t talk about, though, is the most crucial skill any developer must acquire: the ability to keep improving your skills and knowledge base and to keep up to date with recent technologies and tools.
Translated from: https://towardsdatascience.com/4-software-development-techniques-to-level-up-your-data-science-project-59a44498ca3f