End-to-End ML Model Lifecycle on Apolo Platform
This guide demonstrates how to manage a complete machine learning workflow on the Apolo platform, from environment setup to model deployment. We'll walk through the entire ML lifecycle using Apolo Flow, a declarative workflow system (full documentation here). While this guide uses the command-line interface (CLI), all of these operations can also be performed through the Apolo Console GUI using built-in Apps such as Jupyter, MLflow, and Apolo Jobs.
We'll use a modified version of the "Name Classification using RNN" example from PyTorch's official tutorials to showcase how Apolo Flow simplifies ML lifecycle management.
Prerequisites
Apolo CLI tools
Access to an Apolo platform instance
Basic understanding of ML workflows
Understanding Apolo Flow
Apolo Flow is a declarative workflow system that allows you to define your entire ML infrastructure as code. The .apolo/live.yml file in this example is a Flow configuration that defines:
Container images for both training and serving
Storage volumes for data, code, and models
Jobs for training, serving, and monitoring
Dependencies between components
By using this declarative approach, you can ensure reproducibility and easily share workflows with team members. While we'll use the CLI in this guide, all these operations can also be performed through the Apolo Console GUI.
Step 1: Clone the Example Repository
Start by cloning the example repository that contains all the necessary code and configuration:
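The repository URL below is a placeholder; substitute the actual example repository referenced by this guide:

```bash
# Clone the example project and switch into it
git clone <example-repo-url>
cd <repo-directory>
```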
The repository contains:
.apolo/live.yml - Workflow definition file
scripts/ - Training and serving code
scripts/Dockerfile - Container definition for training
scripts/Dockerfile.server - Container definition for serving
Important Note About Compute Presets
Before proceeding, review the live.yml file and pay attention to the preset fields for both building images and running jobs. These presets define the computational resources allocated (CPU, RAM, GPU) and might have different names in your specific Apolo cluster.
To find the correct preset names for your cluster:
Navigate to your Apolo Console
Go to Cluster > Settings > Resources
Note the available preset names
Modify the relevant fields in your live.yml file accordingly
Using the correct preset names will ensure your jobs have the appropriate resources and can run successfully in your environment.
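For orientation, a job-level preset field appears in live.yml roughly like this (a sketch; gpu-small is a placeholder, substitute a preset name from your own cluster, and the image-build section carries a similar field):

```yaml
jobs:
  train:
    image: ${{ images.train.ref }}
    preset: gpu-small   # placeholder: use a preset name from your cluster
```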
Step 2: Setup Environment
Ensure you have the Apolo CLI tools installed:
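One common installation path (an assumption based on the pipx-based bundle described in the Apolo docs; verify the package name for your installation):

```bash
# Install the CLI bundle (includes apolo and apolo-flow), assuming pipx is available
pipx install apolo-all

# Verify the tools are on your PATH
apolo --version
apolo-flow --version
```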
Step 3: Authentication and Resource Preparation
Log in to your Apolo platform instance:
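The standard login command opens a browser window to complete authentication:

```bash
# Authenticate against your Apolo platform instance
apolo login
```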
Step 4: Launch MLflow Tracking Server
Start the MLflow service to track your experiments, parameters, metrics, and artifacts:
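A typical launch, assuming the Flow file names this job mlflow (check the jobs section of live.yml for the exact id):

```bash
# Start the MLflow tracking server job
apolo-flow run mlflow
```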
The MLflow server provides:
A web UI for experiment comparison
Metadata storage for runs
Artifact storage for models and other outputs
A REST API for logging from your training jobs
Step 5: Prepare Training Data
Download and prepare the training data:
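One way to fetch the dataset locally, assuming the same archive the original PyTorch tutorial uses (the repository may also ship its own download script):

```bash
# Download and unpack the names dataset from the PyTorch tutorial
curl -LO https://download.pytorch.org/tutorial/data.zip
unzip data.zip   # produces data/names/*.txt, one file per language
```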
Important Note About Data and Code Management:
The Apolo platform automatically synchronizes local directories with remote storage when you use the local parameter in the volumes section of the live.yml file. This means you don't need to manually copy code or data files into the container at build time. When you run a job, Apolo ensures all the local files defined in your volumes are available in the container at the specified mount points.
For example, in our live.yml we defined a volume mapping (see the sketch below) that automatically syncs the contents of your local data/names directory to remote storage, which is then mounted at /project/data in the container.
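A representative shape for that definition (a sketch reconstructed from the paths above; the remote storage path and volume name are assumptions, so compare against the actual live.yml):

```yaml
volumes:
  data:
    remote: storage:${{ flow.project_id }}/data   # assumed remote location
    mount: /project/data                          # path inside the container
    local: data/names                             # local directory to sync
```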
Step 6: Build Training Environment
Build the Docker image that contains all dependencies for training:
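A typical build command, assuming the image id in live.yml is train:

```bash
# Build the training image (scripts/Dockerfile) and push it to the platform registry
apolo-flow build train
```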
The training image includes:
Python environment
PyTorch framework
Custom code dependencies for the RNN name classifier
Step 7: Train the Model
Launch the training job (see the command sketch after this list), which will:
Use the data volume mounted at /project/data
Save the model to the models volume
Log metrics and parameters to MLflow
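Assuming the job is named train in live.yml (the logs command below uses the same name):

```bash
# Start the training job defined in .apolo/live.yml
apolo-flow run train
```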
During training, you can:
Monitor progress in the MLflow UI
Access logs via apolo-flow logs train
Step 8: Deploy Model Serving API
Deploy the trained model as a RESTful API service:
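The serve job name matches the command referenced in the next subsection:

```bash
# Deploy the model-serving job defined in .apolo/live.yml
apolo-flow run serve
```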
The serving job:
Loads the model from the shared models volume
Exposes a FastAPI endpoint for predictions
Provides Swagger documentation at the /docs endpoint
Can be scaled or updated independently of training
Accessing Your Deployed Model API
After running apolo-flow run serve, you'll see output in your terminal with details about the deployed service. Look for the "Http Url" in the output; this is the address where your model API is now available.
When you open this URL in your browser, you'll see a simple "service is up" message, confirming that the API is running successfully.
To interact with your model, add /docs to the end of the URL. This will take you to an automatically generated API documentation interface powered by FastAPI and Swagger UI. Here, you can:
See all available endpoints (in this example, the /predict endpoint)
Test the model directly from your browser by clicking on the endpoint
Expand the endpoint details, click "Try it out", and provide a sample input
Execute the request and view the model's predictions
For example, you can submit a name like "Patrick" along with the number of predictions you want, and the model will return the most likely country origins for that name based on its training.
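From the command line, the equivalent request might look like this (a sketch: the JSON field names are assumptions, so check the Swagger UI for the real request schema, and replace the host with your job's "Http Url"):

```bash
# Hypothetical /predict call; field names are illustrative only
curl -X POST "https://<your-serve-job-url>/predict" \
  -H "Content-Type: application/json" \
  -d '{"name": "Patrick", "n_predictions": 3}'
```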
Additional Workflows
The configuration also supports:
Monitoring and Management
This workflow demonstrates the power of declarative ML pipelines on Apolo, enabling reproducible, scalable, and production-ready machine learning workflows. The RNN name classifier example shows how even sophisticated deep learning models can be easily trained and deployed using the platform's orchestration capabilities.