Machine learning is not a new technology, so why talk about machine learning platforms now? A combination of factors is driving interest in machine learning engineering as a legitimate engineering discipline. New machine learning libraries and algorithms are widely available, public cloud provides accessible compute resources, and large technology companies are blazing the trail and demonstrating the value of machine learning-powered products and services. As a result, more organizations are forming machine learning engineering teams.
In the context of this post, a machine learning platform is infrastructure automation software designed to help developers work productively and ship machine learning applications at scale. This post is not about libraries like TensorFlow or hardware like TPUs — a machine learning platform is an orchestration layer on top of these tools. If you are interested in building or evaluating a machine learning platform, below is a set of principles that are worth considering.
Although many developers are familiar with machine learning concepts, few have built production machine learning applications. Therefore, it’s important to make it easy to get a simple application running end-to-end: a short, pain-free path to “Hello World”.
Developers should be able to use the platform without spending hours reading documentation. It should be simple to deploy an application without understanding all of the features, and most configuration should be optional with sensible defaults. For example, model training should default to a 75–25 training/evaluation split, which is reasonable for most applications, and can be adjusted once the application is running successfully. Advanced users can refer to the documentation to learn how to take advantage of more sophisticated capabilities, but they will be more likely to do so after understanding the basic functionality.
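The defaults principle can be sketched in code. This is a minimal, hypothetical platform API (none of these names come from a real library) in which every training setting is optional, so a first-time user can deploy without touching configuration:

```python
# Hypothetical platform configuration object; all names are illustrative.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Every field is optional with a sensible default.
    eval_fraction: float = 0.25   # default 75-25 training/evaluation split
    shuffle: bool = True
    seed: int = 42

# The "Hello World" path: no configuration required at all.
default_config = TrainingConfig()
print(default_config.eval_fraction)  # 0.25

# Advanced users override only the settings they care about.
custom_config = TrainingConfig(eval_fraction=0.1)
```

Because overrides are explicit and everything else falls back to defaults, a working application stays short, and tuning becomes an incremental step rather than a prerequisite.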
Furthermore, the platform’s APIs should support the tools that machine learning practitioners are already familiar with and limit custom domain-specific languages to the bare minimum. The primary language of machine learning is Python and the most popular tools are Spark and TensorFlow. Developers should be able to tap into those communities for documentation, tutorials, and any other resources that would empower them to build their applications. Familiar tools lower the learning curve. The goal of the platform should be to simplify the experience of leveraging popular tools effectively without overriding their APIs.
Usability should not come at the expense of flexibility. The platform should be flexible enough to allow experienced users to build arbitrarily sophisticated applications. Built-in PySpark data transformers and TensorFlow estimators may suit many applications, but it should be trivial to incorporate custom code. The platform may be opinionated and encourage best practices, but advanced users should be able to override the default behavior if it becomes restrictive.
It may be tempting to focus on one part of the machine learning infrastructure stack (e.g. infrastructure for serving natural language models which optimizes inference latency, or infrastructure for highly distributed model training). While these capabilities are critical for some applications, most applications don’t require peak computing performance. It’s more important to enable an efficient and reliable end-to-end workflow, because handoffs across siloed systems can have a devastating impact on productivity. Data preparation, model training, and prediction serving should be abstracted in such a way that a developer treats the platform as a single unit of infrastructure.
An end-to-end platform empowers developers to build applications faster. If the platform focused exclusively on prediction serving, developers would still have to cobble together tooling to prepare data and train models. Alternatively, a platform focused on data preparation would speed up model development, but increase the friction of deploying the model as an API. Applications that run end-to-end provide value sooner and can be improved iteratively.
In order to enable agile workflows, it’s important to minimize the time it takes a platform to execute its workloads. Otherwise, developers become more hesitant to experiment. Machine learning is hard, so it should be cheap to make mistakes.
An easy solution is to support sampling from datasets to limit the amount of data being processed by the pipeline — if an application runs successfully on a small number of samples, it is likely to run successfully on the full dataset. It’s a lot less frustrating to notice a data transformation bug after processing 100 samples in seconds rather than millions of samples in hours.
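A sketch of what dataset sampling could look like, assuming an in-memory list of rows and a hypothetical `run_pipeline` helper (neither is from a real platform API):

```python
# Illustrative sketch: run the pipeline on a small sample first to surface
# transformation bugs in seconds instead of hours.
import random

def sample(rows, n, seed=0):
    """Return up to n rows drawn uniformly at random (deterministic per seed)."""
    rng = random.Random(seed)
    if len(rows) <= n:
        return list(rows)
    return rng.sample(rows, n)

def run_pipeline(rows, transform):
    """Stand-in for a data processing pipeline: apply a transform to each row."""
    return [transform(r) for r in rows]

rows = [{"price": p} for p in range(1_000_000)]
to_cents = lambda r: {"price_cents": r["price"] * 100}

# Debug on 100 samples before committing to the full dataset.
preview = run_pipeline(sample(rows, 100), to_cents)
```

The same pipeline code runs on the sample and on the full dataset; only the input size changes, so a successful preview is a cheap smoke test.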
Another optimization involves caching at every step of the pipeline. For example, if only model hyperparameters changed, feature data should not be reprocessed.
Machine learning applications are constructed from a set of building blocks: raw data columns, aggregated data values, transformed data columns, and models. A raw data column represents a single column of unprocessed data from the data warehouse. An aggregated data value (or aggregate) is a singular value generated from processing a full column of data (e.g. mean and standard deviation for numeric columns or a vocabulary for categorical columns). A transformed column is a column of data that has been processed in some way (e.g. normalized or string-indexed). Finally, a model is the output of machine learning algorithms applied to a set of data columns. You can also think of a model as a special case of a transformation.
These building blocks can be composed in different ways to create a wide range of applications. Raw columns can be converted into one or more transformed columns, aggregates, or models. Similarly, transformed columns can be converted into one or more transformed columns (by chaining transformations), aggregates, or models. Models may seem like the end result of a typical machine learning pipeline, but they can also be used to generate additional transformed columns to feed into downstream models.
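The building blocks above form a dependency graph. A minimal sketch, with entirely hypothetical class names, shows how raw columns, aggregates, transformed columns, and models compose, and how the platform can walk the graph to find everything a node depends on:

```python
# Illustrative building blocks; each node records the nodes it is derived from.

class RawColumn:
    def __init__(self, name):
        self.name, self.inputs = name, []

class Aggregate:
    def __init__(self, name, column):
        self.name, self.inputs = name, [column]

class TransformedColumn:
    def __init__(self, name, inputs):
        self.name, self.inputs = name, inputs

class Model:
    # A model is just another node, so its output can feed downstream blocks.
    def __init__(self, name, inputs):
        self.name, self.inputs = name, inputs

price = RawColumn("price")
price_mean = Aggregate("price_mean", price)
price_norm = TransformedColumn("price_normalized", [price, price_mean])
model = Model("price_predictor", [price_norm])

def dependencies(node):
    """Return every upstream building block, dependencies before dependents."""
    order = []
    def visit(n):
        for parent in n.inputs:
            visit(parent)
            if parent.name not in order:
                order.append(parent.name)
    visit(node)
    return order

print(dependencies(model))  # ['price', 'price_mean', 'price_normalized']
```

Representing the pipeline as a graph like this is what lets the platform schedule independent nodes in parallel and cache each node separately, as discussed below.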
Since each building block may be generated by different workloads behind the scenes, the platform can schedule workloads intelligently. For example, a workload that trains a model can be scheduled independently on GPU infrastructure, while a data transformation workload can be scheduled on infrastructure with more memory. Also, independent training jobs can run in parallel, whereas chained transformations must execute sequentially.
Modularity also enables caching. Machine learning pipelines tend to be computationally expensive, so duplicated computation should be avoided when possible. Breaking down pipelines into smaller building blocks unlocks more fine-grained caching.
One confusing aspect of machine learning engineering is the distinction between developing an application and deploying an application to production. It’s possible to consider data preparation and model training as the “development” stage, while “production” means deploying the trained model in a production environment for serving prediction requests. Machine learning platforms should take a different approach because many applications depend on fresh data. For example, last week’s traffic patterns, fraudulent credit card transactions, and e-commerce user sessions are probably more useful than data from a few years ago. The world is not static, so models should not be either.
It’s important for machine learning applications to readily update as new data becomes available. A production deployment should encompass the entire pipeline from data processing to model training to prediction serving, and should be fully automated from end to end. The cost (in terms of developer time, execution time, and compute resources) of refreshing the entire pipeline should be as low as possible, and should result in zero downtime as deployed models are updated.
The cloud is a better place to deploy machine learning applications than most local environments for two primary reasons: environment standardization and the availability of compute resources. Modern machine learning applications depend on a variety of software that is hard to orchestrate on a local environment. Do you really want to manage Python, Virtualenv, Kubernetes, Docker, Redis, Airflow, Spark, TensorFlow, TensorFlow Serving, Nginx, and Fluentd (or alternatives) on your laptop? If you answered “yes”, we’re hiring. Additionally, it won’t take long before your local machine’s resources hit their limits on large datasets. In the cloud, just spin up a larger instance or increase the size of your cluster.
While the platform should run workloads on cloud infrastructure, developers should still be able to edit code and interact with the platform via their local machine. A simple CLI should be sufficient to deliver application code and configuration to the platform, and information about the workloads should be streamed back to the developer in real time.
Machine learning applications should be portable and their code should define the flow of data, without the developer having to worry about configuring the underlying infrastructure. As long as developers define applications in a consistent API, the platform should be able to seamlessly execute the application on any cloud provider.
Cloud providers compete on cost, performance, and availability of compute resources. Engineering teams should decide where to run their applications based on these factors, and should not be locked in to one particular cloud provider.
Machine learning is hard. It’s easy to make innocent mistakes that create nearly unexplainable errors, so a machine learning platform should provide guardrails. Simple checks can go a long way, like ensuring that a transformation which expects integers as input actually receives integers, or that classification models are only used to predict integer target values.
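The simple checks mentioned above could look something like this minimal sketch (the function names are illustrative, not part of any real platform): validate inputs up front so an innocent mistake fails fast with a clear error instead of corrupting an expensive workload.

```python
# Illustrative guardrails: fail fast with a clear error before running
# an expensive workload on bad input.

def check_integer_column(name, values):
    """Ensure a transformation that expects integers actually receives them."""
    for i, v in enumerate(values):
        # Reject bools explicitly, since bool is a subclass of int in Python.
        if not isinstance(v, int) or isinstance(v, bool):
            raise TypeError(
                f"column {name!r}: expected int at row {i}, got {type(v).__name__}"
            )

def check_classification_targets(values):
    """Classification models should only predict integer target values."""
    check_integer_column("target", values)

check_integer_column("user_id", [1, 2, 3])  # passes silently
try:
    check_integer_column("user_id", [1, "2", 3])
except TypeError as e:
    print(e)  # column 'user_id': expected int at row 1, got str
```

The error message names the column, the row, and the offending type, which is the difference between a five-second fix and an hour of spelunking through stack traces.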
The platform can also attempt to prevent more sophisticated pitfalls. Machine learning engineers may have to deal with highly unbalanced datasets or datasets that contain samples with missing values. They may also neglect to normalize columns with high variance or mistakenly normalize categorical columns. A platform should provide simple mechanisms for upsampling or downsampling unbalanced data and dropping samples with missing data. It should also warn about feature data that may result in suboptimal models (e.g. due to multicollinearity), or models that aren’t converging or are overfitting.
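Two of the simple mechanisms mentioned above, sketched for an in-memory list of row dictionaries (a deliberately simplified stand-in for a real data pipeline): dropping samples with missing values, and upsampling minority classes up to the majority count.

```python
# Illustrative sketches of two built-in data-cleaning mechanisms.
import random

def drop_missing(rows):
    """Discard samples that contain any missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def upsample(rows, label_key, seed=0):
    """Resample minority classes (with replacement) up to the majority count."""
    rng = random.Random(seed)
    by_label = {}
    for r in rows:
        by_label.setdefault(r[label_key], []).append(r)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

rows = [
    {"x": 1, "y": 0},
    {"x": 2, "y": 0},
    {"x": None, "y": 0},  # dropped: missing value
    {"x": 3, "y": 1},     # minority class: duplicated by upsampling
]
clean = drop_missing(rows)       # 3 rows remain
balanced = upsample(clean, "y")  # both classes now have 2 samples
```

Exposing these as one-line options keeps the common cases trivial, while the warnings about multicollinearity or non-converging models belong in the training logs rather than the API.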
Since it’s difficult to prevent all mistakes, the platform should also provide ample debugging capabilities. Developers need an easy and intuitive way to access all log streams from their workloads, and these logs should be useful not just when the workloads crash. For example, data processing workloads should show samples of raw and transformed data to make it easy to sanity check the code, and model training workloads should show loss and evaluation metrics.
A machine learning platform should only solve the problems that are unique to machine learning engineering, such as running distributed TensorFlow workloads and handling large prediction request loads. A minimalistic platform integrates seamlessly into existing development workflows. Version control systems and code review tools already provide workflows for tracking, sharing, and collaborating among developers. Similarly, text editors and IDEs provide interfaces for editing code. None of this functionality needs to be reinvented.
The platform should be stateless, and all application configuration should be defined by configuration files. This way an application can be fully defined as code, which allows developers to integrate their machine learning applications into many of their existing software development lifecycle workflows, like version control and continuous integration.
It is hard to venture into the world of machine learning without coming across notebooks. Notebooks are useful for fast prototyping and encourage writing well-documented code with helpful visualizations. That being said, a machine learning platform focused on production application deployments should not support them. They encourage software engineering anti-patterns that inhibit developers’ ability to build testable and maintainable software.
Text editors and IDEs include rich functionality to help developers work more productively and write better code (e.g. code completion and linting are much more readily available in popular text editors than in notebooks). Code in notebooks is harder to abstract and reuse than code organized into Python libraries, and it is harder to reason about because cells can carry hidden state. In addition, notebooks are generally unfamiliar to developers without a data science background. The growth of the machine learning engineering discipline will be fueled by developers applying software engineering best practices to machine learning code without being encumbered by the current data science tooling.