Meituan's Experience in Standing Out

Technology · 2023-12-18

Data science has undoubtedly been a hot topic in recent years. As machine learning grows more popular, many companies believe it can turn data into invaluable treasure. Some even think that applying machine learning will let them find things humans can't discover. Do you share these thoughts about data science? Are you eager to make machine learning part of your business? Is data scientist one of your dream jobs?

Last year, I had a chance to join a data science project. It was my first time dealing with real-world big data. My responsibility was to work with billions of traffic and car exam records and apply machine learning to help our client, a government department, improve its business performance. Before joining the team, I was confident that machine learning could tell us everything valuable. After the project, however, I held quite different views about putting data science to work in business. If you are considering a data science career or developing data science projects, I believe the facts below will help you prepare.

4 facts hidden behind the know-it-all mask of data science

1. Data and models can't tell you everything.

Photo by Andrea Piacquadio from Pexels

You can't expect data and machine learning to tell you everything automatically. More specifically, they can't develop a new business opportunity for your company by themselves. Moreover, they can't tell you anything without clearly defined problems and variables.

Take my experience as an example. On the first day of the project, my boss asked me to use machine learning to automatically find new, serious social events that our client hadn't discovered. However, I gradually realized that machine learning models are just lines of code, not salespeople or managers who know the client's business. That is to say, they don't know what drivers care about or what the latest road safety problems are. Eventually, I had to tell them, with some embarrassment, that I couldn't complete the assignment.

In the same project, I also worked on finding which factors affect the occurrence of car accidents. I first gave the model two features, the number of vehicles and the average speed, without specifying a group of factors or accident types. I was frustrated to learn that the two features were not directly correlated with car accidents. The second time, I redefined the objective: let the model discover which environmental factors affect the occurrence of car accidents caused by specific reasons. With a clear definition and a defined set of variables, the model could search for associated features systematically, and it gave me significant results.

2. Working with real-world big data requires a lot of human effort and knowledge.

Photo by ThisIsEngineering from Pexels

Knowing that data science can't provide everything, I understood that you must put a lot of effort and knowledge into it to make it succeed. Unlike at school, in the real world we don't have clean, small datasets, so we can't simply apply models to them without considering data formatting, missing values, or computer storage. Furthermore, there are no correct answers for building models.

Data in real-world projects is usually messy and high in volume. For instance, my project's data was real-time and produced by sensors all over Taiwan's roads. Because of storage limitations, I first had to extract or create the required columns, sample the data, load it in sequence, and split it into smaller files for processing. Then I needed to deal with organization issues by imputing missing values and adjusting formats. All this pre-processing usually took about 70% of the total project time.
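The preprocessing steps described above can be sketched with Python's standard library. This is only a minimal illustration, not the project's actual pipeline: the column names (`sensor_id`, `speed`, `vehicle_count`, `firmware`), the sample values, the sampling rate, and the choice of mean imputation are all assumptions made for the example.

```python
import csv
import io
import random
from statistics import mean

# Hypothetical raw sensor feed; columns and values are invented for illustration.
RAW = """sensor_id,timestamp,speed,vehicle_count,firmware
A1,2020-01-01 08:00,62.5,14,v2
A1,2020-01-01 08:05,,17,v2
B7,2020-01-01 08:00,48.0,30,v1
B7,2020-01-01 08:05,51.5,,v1
C3,2020-01-01 08:00,70.0,9,v2
"""

KEEP = ["sensor_id", "speed", "vehicle_count"]  # columns the analysis needs


def stream_rows(handle, keep, sample_rate, seed=0):
    """Read rows one at a time, keep selected columns, and sample a fraction."""
    rng = random.Random(seed)
    for row in csv.DictReader(handle):
        if rng.random() < sample_rate:
            yield {name: row[name] for name in keep}


def impute_mean(rows, column):
    """Convert a column to float, replacing blanks with the column mean."""
    fill = mean(float(r[column]) for r in rows if r[column] != "")
    for r in rows:
        r[column] = float(r[column]) if r[column] != "" else fill


def chunked(rows, size):
    """Split a list of rows into pieces of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# sample_rate=1.0 keeps every row so the tiny example stays deterministic.
rows = list(stream_rows(io.StringIO(RAW), KEEP, sample_rate=1.0))
for column in ("speed", "vehicle_count"):
    impute_mean(rows, column)
chunks = list(chunked(rows, size=2))  # each piece could be written to its own file
```

With data too large to hold in memory, the fill value would have to come from a separate pass or a running average rather than from a list, but the order of operations stays the same: select columns, sample, read sequentially, clean, and split.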

As for feature selection and modeling, I had to try different statistical learning or machine learning models and compare the results. This process required carefully checking each model's fit and examining different variables. I also needed to consider the models' interpretability and usability, since the model would be used by our client, who had little related professional knowledge. This work was usually done in the remaining 30% of the project time, by trial and error.
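Comparing candidate models usually means fitting each one on part of the data and scoring it on data it has not seen. The sketch below shows that loop with two deliberately simple candidates, a mean baseline and a least-squares line. The data is synthetic and the models are stand-ins chosen for brevity; they are not the ones used in the project.

```python
from statistics import mean

# Synthetic data invented for this example: y is roughly 2*x + 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 19.0]

# Hold out every third point for validation; fit on the rest.
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 3 != 2]
valid = [p for i, p in enumerate(pairs) if i % 3 == 2]


def fit_mean(data):
    """Baseline: always predict the average target seen in training."""
    average = mean(y for _, y in data)
    return lambda x: average


def fit_line(data):
    """Ordinary least squares for a single feature."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mean_absolute_error(model, data):
    """Average absolute gap between predictions and true values."""
    return mean(abs(model(x) - y) for x, y in data)


candidates = {"mean baseline": fit_mean, "linear fit": fit_line}
scores = {name: mean_absolute_error(fit(train), valid)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest validation error wins
```

Keeping a trivial baseline in the comparison is a cheap sanity check: a model that cannot beat it has probably not learned anything useful from the features.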

3. Issues in many important fields are usually neglected.

Photo by Anna Shvets from Pexels

The previous section showed that technical knowledge is important, but matters outside it, such as legal factors related to privacy and data protection, are also vital for success in the data science business. For example, the European Union enacted the General Data Protection Regulation (GDPR) to limit the use of personal data and to restrict the transfer of such data across borders. In other words, personal data generally cannot be used, analyzed, or moved unless it is de-identified or its use meets the law's requirements. Failure to comply can be very costly. In 2019, the French data protection authority (CNIL) fined Google 50 million euros for violating the GDPR.

Diving deeper into the data science world, I also realized how often data and machine learning models can produce ethical problems. Machine learning models are trained on past data to make future predictions. So if a company adopts AI recruiting with training data that contains gender inequality or racial discrimination, the models will very likely lead the company to recruit more men or reject minorities unintentionally. This shows how easily data and models themselves overlook such ethical issues.
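One simple way to notice this kind of problem before training is to compare outcome rates across groups in the historical data. The sketch below does that for a made-up hiring history. The records, group labels, and function name are hypothetical, and a check like this only flags a disparity; it does not explain or fix it.

```python
from collections import Counter

# Hypothetical historical hiring records as (group, hired) pairs.
history = (
    [("men", True)] * 6 + [("men", False)] * 4
    + [("women", True)] * 2 + [("women", False)] * 8
)


def selection_rates(records):
    """Return the share of positive outcomes for each group."""
    totals, hired = Counter(), Counter()
    for group, was_hired in records:
        totals[group] += 1
        hired[group] += was_hired  # True counts as 1, False as 0
    return {group: hired[group] / totals[group] for group in totals}


rates = selection_rates(history)
# Ratio of the lowest to the highest selection rate; 1.0 would mean parity.
ratio = min(rates.values()) / max(rates.values())
```

A ratio far below 1 suggests that the labels encode unequal treatment, which a model trained on them would be likely to reproduce.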

Furthermore, the information bubbles created by recommendation system models are causing social problems. For example, social media recommends posts or news to users based on preferences, social background, personality, connections, and so on. If a platform recommends a political candidate and the user likes the post, it will show the user more posts viewed by other people who like that candidate. If the user keeps clicking the like button, more and more positive posts about that candidate appear, meaning the user sees less negative news about the candidate. In the end, this can foster extremists who live inside their information bubbles and refuse to accept others' opinions.

4. The most important skill is interpreting data meaningfully.

Photo by Kaboompics .com from Pexels

Finally, even with all these fields of knowledge taken into account, you should know that a model's final results still contain biases that come from statistical principles, engineers' subjective judgments, and the accuracy of the training data. That is to say, when you interpret results produced by machine learning, you should try to understand the model's assumptions, criteria, and variables, as well as what the original data looks like. Moreover, check the logic of every argument the engineers make. In this way, you can interpret the data, the models, and the final results more meaningfully.

A helpful method for interpreting data is to visualize your final results. Most of the time, your machine learning models will give you numerical outcomes. Those results are very hard to communicate to people without a related background. To convey your work without confusing others, present your data with scatter plots, line charts, histograms, or pie charts so that the point is clear at a glance. Visualization tools such as Tableau or Python's Matplotlib are useful in this interpretation process.

Last but not least, once you understand your data and analysis process, you must know where your models can properly be applied. More specifically, machine learning models are trained on past data, and that data usually has specific characteristics, so a model cannot be applied to groups that lack them. For example, if your model predicts drivers' car accident rates and was trained on a sample of drivers under age 45, it would be meaningless to feed in a 50-year-old man to predict his accident rate on the road. Thus, apply models carefully to appropriate data to make the predictions meaningful.
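The age example can be turned into a small guard that refuses to predict outside the range the model was trained on. This is a sketch under stated assumptions: the training ages, the function names, and the constant `toy_model` are all hypothetical placeholders, and a simple min/max check on one feature is only the crudest form of such a test.

```python
# Hypothetical ages seen during training; all drivers are under 45.
training_ages = [19, 23, 31, 38, 44]


def in_training_range(value, training_values):
    """True if the value lies within the range covered by the training data."""
    return min(training_values) <= value <= max(training_values)


def predict_accident_rate(age, model):
    """Call the model only for ages it has evidence for."""
    if not in_training_range(age, training_ages):
        raise ValueError(
            f"age {age} is outside the training range "
            f"{min(training_ages)}-{max(training_ages)}"
        )
    return model(age)


def toy_model(age):
    """Placeholder standing in for a trained model."""
    return 0.05


rate = predict_accident_rate(30, toy_model)  # inside the range, so it runs
# predict_accident_rate(50, toy_model) would raise ValueError
```

Failing loudly is a deliberate choice here: a number returned for a 50-year-old would look just as plausible as any other prediction, even though nothing in the training data supports it.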

Photo by Alexas Fotos from Pexels

Data science is definitely a field that can generate great value and improve our lives. However, it cannot know everything in this world without a great deal of human effort and knowledge put into it. With the facts above, hidden behind the know-it-all mask of data science, I believe we data enthusiasts can bring better quality and outcomes to society through thorough consideration and interpretation.

Original article: https://medium.com/the-innovation/taking-off-the-know-it-all-mask-of-data-science-403d4c699d98
