Today, I want to wear my software archeology hat, and share with you one story about the AI efforts at Microsoft and how Microsoft built its open-source high-performance AI runtime that is saving the company time and money.
A couple of years ago, I decided to create .NET bindings for TensorFlow, and later PyTorch, to join in all the fun that everyone is having with AI. In the past, creating bindings has been a soothing exercise, one that I have used both to learn new frameworks and to learn how people have solved different problems. In this case, I read my share of articles and tutorials, yet something was missing.
It turns out that when it comes to deep learning, binding APIs is just not enough. My coworker Jim Radigan described it better — I had a “cocktail party level of understanding”, which was enough to follow along with the news, but not enough to actually grasp it or solve interesting problems with it.
So last year, I took Jeremy Howard’s fantastic https://Fast.AI course and things finally clicked for me. The course gave me a much-needed perspective and I finally understood how all the pieces worked (I strongly recommend this course for anyone wanting to get into AI).
In the meantime, teams at Microsoft had embraced AI for all sorts of problems, and the work that Microsoft Research had done over the years is being put to use in production. Not only did Microsoft start using AI for services like Office and Bing, but it also began offering AI on-demand in the form of Azure services you can easily consume from your applications. The “Seeing AI” project reused work from Microsoft’s AI research teams to solve real-world problems for users.
An explosion of usage at Microsoft was taking place, and everyone used the best tools that were available at the time, either building their own engines or using off-the-shelf technologies like Fast.AI, Keras, TensorFlow, and PyTorch, and deploying them into production. We have deployed these on the Azure cloud on more computers than I can count.
The AI world is a bit like the JavaScript world in that there is a tremendous amount of excitement; it feels like a new breakthrough model, clever operator, framework, or hardware accelerator comes to life every week.
Many AI frameworks tend to be tightly coupled with particular technologies. For example, while PyTorch is great for running your AI models, it really is intended to be used on a PC with an Nvidia GPU. We ended up with an archipelago of solutions, and it was difficult to retarget the code to other systems.
Compiler folks have known this as the “many-to-many problem.” In this scenario, we have deep learning frameworks on the left, and targets on the right:
While the industry has standardized to a large extent on TensorFlow and PyTorch as the high-level frameworks, we are in the early days of AI, and there are many emerging frameworks that try to improve upon these models. New frameworks to solve problems are being written in Julia, Python, Rust, Swift, and .NET, to name a few.
The compiler folks figured out the similarities between many of these frameworks and advocated for an intermediate format that was suitable for representing many of these problems. Rather than maintaining this many-to-many world, we could decouple the front-end languages and frameworks from the backend execution. In 2017, Facebook and Microsoft launched the ONNX project, an intermediate representation suitable as an exchange format for many different runtimes.
The world of ONNX looks a little bit like this:
Today, many frameworks and runtimes support ONNX as an exporting format, or as an input for their accelerator. There are nice tools that have emerged in this ecosystem. One of my favorites is Lutz Roeder’s ONNX model visualizer: Netron.
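If you prefer to poke at a model programmatically rather than visually, the onnx Python package can load a model and expose its graph. A minimal sketch, assuming you have some "model.onnx" file on disk (the file name is just a placeholder):

```python
# A minimal sketch of inspecting an ONNX model; "model.onnx" is a placeholder name.
import onnx

model = onnx.load("model.onnx")      # parse the protobuf file
onnx.checker.check_model(model)      # verify it is a well-formed ONNX graph

# An ONNX model is just a graph of typed operator nodes; list them in order.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))
```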
At Microsoft, we developed an accelerator to run these ONNX models, which the marketing team baptized as “Onnx Enterprise Runtime Professional for Workgroups” until cooler heads prevailed and we settled on the convenient GitHub “onnxruntime” name. Had I been in charge of the naming, it would have been called the “Overlord Runtime,” on the thesis that it would give people something to talk about. Anyways, I digress.
This runtime was a gift for our internal teams. They could continue to author their models with their existing frameworks, and a separate team could focus on optimizing the runtime for their hardware and the operations that they used the most. The optimizations made for one team, benefitted all the teams, regardless of the higher-level framework that they were using.
The AI team has made it simple to take your existing code authored in TensorFlow, Keras, or PyTorch and export it to run in the ONNX Runtime. For example, if you are a PyTorch user, you can take your existing PyTorch code, and follow the instructions to run your code with the onnxruntime.
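As a rough illustration of that flow (a sketch, not the team's exact instructions), exporting a small PyTorch model to ONNX and running it with onnxruntime looks something like this; the model and file names are made up for the example:

```python
# A hedged sketch: export a toy PyTorch model to ONNX and run it with onnxruntime.
import numpy as np
import torch
import onnxruntime as ort

# A tiny stand-in model; in practice this would be your existing network.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
model.eval()

dummy = torch.randn(1, 10)
torch.onnx.export(model, dummy, "tiny.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}})  # allow any batch size

# Run the exported graph with the ONNX Runtime instead of PyTorch.
session = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(4, 10).astype(np.float32)})
print(outputs[0].shape)  # (4, 2)
```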
At this point you are wondering, why take this extra step if things are working for you as-is?
In many cases, the answer is that you will save time and money. In a recent post from Tianle Wu and Morgan Funtowicz, we shared some of the results that we see regularly, showing training performance improvements from 20% all the way to 350%. You can imagine what the savings look like in terms of efficiency, time, and energy, or just plain dollars if you are renting the CPU/GPU time.
We certainly hope that the work that we have put into the ONNX Runtime will be generally useful for anyone looking to reduce their costs. Let us turn our attention to how this runtime was tuned for our models, and perhaps these optimizations will be useful for your own day to day work as well.
Everyone knows that AI runs faster on GPUs than on CPUs, but you do not really “know” until you try it out. For me, that moment came at a workshop I attended a couple of years ago. The instructions for attending the workshop included something like “Make sure you get yourself a GPU on the cloud before you come.” I did not do that and instead showed up with my MacBook Pro.
Everyone was happily exploring their Jupyter Notebooks and doing all the exercises with GPUs on the cloud, while I was doing the work on my laptop. While the speaker guided us through the magnificent world of GANs and people were following along, I was stuck on the first step that took about an hour to complete and drained my battery.
I learned my lesson. You really want to use a hardware accelerator to do your work.
What makes the ONNX Runtime (“Overlord Runtime” for those of you in my camp) run our models faster than plain PyTorch or TensorFlow?
Turns out, there is not one single reason for it, but rather a collection of reasons.
Let me share some of these, which fall in the following areas:
- Graph optimizations
- MLAS
- TVM Code Generator (for some models)
- Pluggable Execution Provider architecture

One of the key differences in performance between PyTorch and the ONNX Runtime has to do with how your programs are executed.
Immediate execution, like the one provided by PyTorch (AI folks call this “eager”), keeps the Python runtime in the middle of the equation, where every mathematical operation is dispatched one by one to the GPU to perform its work.
This approach works and has an immediate quality to it, but it is not exactly speedy. The diagram below represents both my limited drawing skills and a number of operations executed sequentially on the CPU, and then executed on the GPU, one by one:
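In code form, the difference is easy to see. In eager PyTorch, every line below is dispatched to the device as its own operation, with Python in the loop between each one; an exported graph hands the runtime the whole computation at once (a rough sketch with made-up shapes):

```python
# A rough sketch of eager, operation-by-operation dispatch.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 512, device=device)
w = torch.randn(512, 512, device=device)
b = torch.randn(512, device=device)

# Each statement is dispatched to the device separately, one kernel at a time.
y = x @ w          # launch 1: matmul
y = y + b          # launch 2: add
y = torch.relu(y)  # launch 3: relu

# A graph runtime sees all three operations together and can, for example,
# fuse the add and relu or plan memory for the whole sequence up front.
```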
The ONNX Runtime has a chance to take the entire graph and apply traditional compiler optimizations.
Some of the optimizations it applies are done at the graph-level, like checking if two operations can be merged into one (this is called “kernel fusion”), or seeing if there are transformations that produce the same result if computed differently:
In the end, this helps to minimize the number of round-trips between the CPU and the GPU to compute the same problem, as well as minimizing the number of copies and data sharing.
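You normally get these graph optimizations for free when a session is created, but the runtime also lets you pick the optimization level and write out the optimized graph so you can inspect what was fused (for example, in Netron). A small sketch; the file names are placeholders:

```python
# A small sketch of controlling and inspecting ONNX Runtime graph optimizations.
import onnxruntime as ort

opts = ort.SessionOptions()
# Levels range from disabling optimizations entirely to enabling all of them.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Ask the runtime to save the optimized graph so it can be inspected later.
opts.optimized_model_filepath = "tiny.optimized.onnx"

session = ort.InferenceSession("tiny.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```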
Training AI networks is typically done on GPUs as it is a lengthy operation that can take from hours to weeks to run, and is a task well suited for GPUs. Meanwhile, using a trained network can be done more efficiently on the CPU.
Given that AI problems are full of math and matrix multiplications done in bulk, the team set out to build optimized versions of the key math components that are used during inferencing on the CPU, and they developed a minimal version of BLAS called MLAS.
MLAS contains a hand-tuned set of linear algebra operations, often implemented in assembly language, that leverage the vector instructions available on different processors.
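MLAS itself is internal C++ and assembly, so there is nothing to call directly from Python; you are exercising it whenever inference runs on the CPU execution provider. A rough timing sketch, with the model file again a placeholder:

```python
# A rough sketch of timing CPU inference, which goes through the runtime's MLAS kernels.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
batch = np.random.randn(32, 10).astype(np.float32)

runs = 1000
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {"input": batch})
elapsed = time.perf_counter() - start
print(f"{elapsed / runs * 1000:.3f} ms per run on the CPU provider")
```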
Recently, Andrey Volodin showed Smelter, a tiny and beautiful runtime for ONNX models that is hardware accelerated using Apple’s Metal APIs. It is so small that it takes about 10 minutes to read the whole source code and realize just how simple ONNX is and how you can hardware-accelerate individual nodes.
The ONNX Runtime takes this a couple of steps further by providing a pluggable system where new optimizations and backends can be added. These are called “Execution Providers”; at startup, they register with the runtime how to move data in and out of their universe, and they respond to the runtime’s question: “How much of this graph would you like to run?”
On the one end of the spectrum, a simple provider could work like the tiny and beautiful runtime above and accelerate just a handful of operations. On the other end of the spectrum, it can take entire programs and send those over to be executed on a hardware accelerator (GPUs, or neural processors).
In this diagram, the graph runner gets to choose which parts of the graph should be executed by which providers:
Today the ONNX Runtime ships with twelve providers, including Intel’s OpenVINO and Nvidia’s TensorRT and CUDA backends, and others can be added in the future. New execution providers can be added for specific hardware accelerators, different sorts of GPUs, or JIT compilers. Personally, I am going to try to add a Metal backend for my personal use.
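From the Python API you can see which providers your particular build of onnxruntime includes and state an order of preference when creating a session; nodes that a provider cannot handle fall back to the next one on the list, with the CPU provider as the catch-all. A sketch (which providers are available depends entirely on how your package was built):

```python
# A sketch of choosing execution providers; availability depends on your onnxruntime build.
import onnxruntime as ort

available = ort.get_available_providers()
print(available)  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

# State an order of preference, keeping only providers this build actually has.
preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(
    "tiny.onnx", providers=[p for p in preferred if p in available])

print(session.get_providers())  # the providers actually assigned to this session
```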
In the pluggable architecture described above, some kinds of graphs can be sped up if they can be JIT compiled. Today for a limited number of cases, the runtime uses the Apache TVM Deep Learning Optimizing Compiler to produce custom code for those operations.
Last year, various teams at Microsoft started exploring the use of MLIR, the Multi-Level IR compiler originally developed by Google and now part of LLVM. It is being used both to optimize individual graph operations (“kernels”) and certain groups of kernels (the “kernel fusion” described above), and to generate code for CPUs, GPUs, and other hardware accelerators.
An interesting detail is that the Microsoft ONNX Runtime originally was designed to be a high-performance runtime for inferencing. This allowed the team to focus on building a compact and nimble runtime purely focused on performance.
This initial focus has paid off because there are more users of inferencing on a day-to-day basis than there are of training. It has helped us reduce the cost of rolling out models in more places, and in more scenarios that users can benefit from.
It was only later that training was added.
This training capability has been under development and in use in production for a while, and just this past week it was announced as a preview at the Build Conference. Up until now, this post has talked about how we made training on a single machine faster.
It turns out that training can also be a team sport. Computers do not need to work in isolation; they can work together to train these vast models. At Microsoft, earlier this year we talked about how we trained one of the largest published models, and last week at Build, Kevin Scott talked about Microsoft’s AI supercomputer, a vast distributed system for training.
The technology that powers both of those efforts has now been integrated into the ONNX Runtime that was just released, and you can now also use these capabilities for your own models.
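The training integration has gone through a few API iterations since that announcement, so treat the following only as a sketch of the idea: with the later torch-ort package, you wrap an existing PyTorch module so that the forward and backward passes execute through the ONNX Runtime while the rest of the training loop stays the same. The package and class names below are the ones documented for torch-ort, not necessarily the exact preview API described in this post:

```python
# A hedged sketch using the torch-ort wrapper (pip install torch-ort); the exact
# preview API announced at Build may differ.
import torch
from torch_ort import ORTModule

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
model = ORTModule(model)  # forward/backward now run through the ONNX Runtime

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):           # an ordinary PyTorch training loop
    x = torch.randn(16, 10)
    target = torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
```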
I am in awe at the success that the AI team at Microsoft has achieved in such a short time. The ONNX Runtime is a pleasure to work with; it is very nicely and cleanly architected, and we are looking to extend its capabilities, driven by our users, add additional execution providers, and bring it to new platforms, in particular mobile platforms, which now ship with assorted neural network accelerators.
Everything that I have discussed in this post is part of the open-source ONNX Runtime on GitHub.
About the Author: Miguel de Icaza
Miguel de Icaza heads the Xamarin Engineering group in charge of Visual Studio for Mac and the Xamarin development tools and IDE components to create .NET applications for Android, iOS, tvOS, watchOS as well as gaming consoles. He was the founder of the open-source Mono project which is the foundation for these efforts.
Original post: https://medium.com/@ODSC/a-look-inside-the-ai-runtime-from-microsoft-66698d187fb4
Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.