End-to-End ML Model Lifecycle using Apolo CLI
This guide demonstrates how to manage a complete machine learning workflow on the Apolo platform, from environment setup to model deployment. We'll walk through the entire ML lifecycle using Apolo Flow, a powerful declarative workflow system. While this guide uses the command-line interface (CLI), these operations can also be performed through the Apolo Console GUI for those who prefer a graphical experience, using our built-in Apps such as Jupyter, MLflow, and Apolo Jobs.
We'll use a modified version of PyTorch's "Name Classification using RNN" example (originally from the official PyTorch tutorials) to showcase how Apolo Flow simplifies ML lifecycle management.
To follow this guide, you'll need:
The Apolo CLI tools installed
Access to an Apolo platform instance
A basic understanding of ML workflows
Apolo Flow is a declarative workflow system that allows you to define your entire ML infrastructure as code. The .apolo/live.yml file in this example is a Flow configuration that defines:
Container images for both training and serving
Storage volumes for data, code, and models
Jobs for training, serving, and monitoring
Dependencies between components
By using this declarative approach, you can ensure reproducibility and easily share workflows with team members. While we'll use the CLI in this guide, all these operations can also be performed through the Apolo Console GUI.
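To make that structure concrete, here is a hedged sketch of what such a flow file can look like. The image names, tags, port, and preset values below are illustrative assumptions; the repository's own .apolo/live.yml is the source of truth.

```yaml
kind: live

images:
  train:
    ref: image:name-classifier-train:v1
    dockerfile: $[[ flow.workspace ]]/scripts/Dockerfile
    context: $[[ flow.workspace ]]/scripts
  serve:
    ref: image:name-classifier-serve:v1
    dockerfile: $[[ flow.workspace ]]/scripts/Dockerfile.server
    context: $[[ flow.workspace ]]/scripts

jobs:
  train:
    image: $[[ images.train.ref ]]
    preset: gpu-small   # replace with a preset name available in your cluster
  serve:
    image: $[[ images.serve.ref ]]
    preset: cpu-small   # serving typically needs fewer resources than training
    http_port: 8000     # port exposed by the model-serving API
```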
Start by cloning the example repository that contains all the necessary code and configuration:
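A minimal version of this step is shown below; the placeholder stands in for the repository URL provided alongside this guide.

```bash
# Clone the example project and move into it
git clone <repository-url> name-classifier-example
cd name-classifier-example
```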
The repository contains:
.apolo/live.yml - Workflow definition file
scripts/ - Training and serving code
scripts/Dockerfile - Container definition for training
scripts/Dockerfile.server - Container definition for serving
Before proceeding, review the live.yml file and pay attention to the preset fields for both building images and running jobs. These presets define the computational resources allocated (CPU, RAM, GPU) and might have different names in your specific Apolo cluster.
To find the correct preset names for your cluster:
Navigate to your Apolo Console
Go to Cluster > Settings > Resources
Note the available preset names
Modify the relevant fields in your live.yml file accordingly
Using the correct preset names will ensure your jobs have the appropriate resources and can run successfully in your environment.
Ensure you have the Apolo CLI tools installed:
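The CLI and Flow tooling are distributed as Python packages; a typical installation looks like the following (the metapackage name apolo-all is an assumption here, bundling the CLI and apolo-flow):

```bash
# Install the Apolo CLI together with apolo-flow (assumed metapackage name)
pip install -U apolo-all

# Verify the tools are on your PATH
apolo --version
apolo-flow --version
```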
Log in to your Apolo platform instance:
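Authentication is done once per cluster with the login command, which opens a browser window to complete sign-in:

```bash
# Log in to your Apolo platform instance
apolo login
```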
Start the MLflow service to track your experiments, parameters, metrics, and artifacts:
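Assuming the flow defines an MLflow job (named mlflow here, which is an assumption about this particular flow file), you start it like any other job:

```bash
# Start the MLflow tracking server job defined in .apolo/live.yml
apolo-flow run mlflow
```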
The MLflow server provides:
A web UI for experiment comparison
Metadata storage for runs
Artifact storage for models and other outputs
A REST API for logging from your training jobs
Download and prepare the training data:
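The underlying dataset is the one used by the original PyTorch tutorial; one way to fetch it locally is sketched below (the repository may ship its own download script, so treat this as an illustration):

```bash
# Download and unpack the PyTorch names dataset into ./data/names
curl -L -o data.zip https://download.pytorch.org/tutorial/data.zip
unzip -o data.zip
ls data/names   # one .txt file per language
```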
Important Note About Data and Code Management: The Apolo platform automatically synchronizes local directories with remote storage when you use the local parameter in the volumes section of the live.yml file. This means you don't need to manually copy code or data files into the container at build time; when you run a job, Apolo ensures all the local files defined in your volumes are available in the container at the specified mount points. For example, the data volume defined in our live.yml (see the sketch after this paragraph) automatically syncs the contents of your local data/names directory to remote storage, which is then mounted at /project/data in the container.
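A volume definition along these lines produces that behavior; this is a sketch using Apolo Flow's volume syntax rather than the exact contents of the repository's file:

```yaml
volumes:
  data:
    remote: storage:$[[ flow.project_id ]]/data   # remote storage location
    mount: /project/data                          # mount point inside the container
    local: data/names                             # local directory kept in sync
```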
Build the Docker image that contains all dependencies for training:
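Images declared in the flow file are built with apolo-flow; the image name train below assumes the images section uses that identifier:

```bash
# Build the training image defined in .apolo/live.yml
apolo-flow build train
```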
The training image includes:
Python environment
PyTorch framework
Custom code dependencies for the RNN name classifier
Launch the training job (the command is sketched after this list), which will:
Use the data volume mounted at /project/data
Save the model to the models volume
Log metrics and parameters to MLflow
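The job is named train in this flow, so launching it is a single command:

```bash
# Run the training job defined in .apolo/live.yml
apolo-flow run train
```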
During training, you can:
Monitor progress in the MLflow UI
Access logs via apolo-flow logs train
Deploy the trained model as a RESTful API service:
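Deployment is likewise a single Flow command; the job is named serve in this flow:

```bash
# Start the model-serving job defined in .apolo/live.yml
apolo-flow run serve
```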
The serving job:
Loads the model from the shared models volume
Exposes a FastAPI endpoint for predictions
Provides Swagger documentation at the /docs endpoint
Can be scaled or updated independently of training
After running apolo-flow run serve, you'll see output in your terminal with details about the deployed service. Look for the "Http Url" in the output; this is the address where your model API is now available.
When you open this URL in your browser, you'll see a simple "service is up" message, confirming that the API is running successfully.
To interact with your model, add /docs to the end of the URL. This will take you to an automatically generated API documentation interface powered by FastAPI and Swagger UI. Here, you can:
See all available endpoints (in this example, the /predict endpoint)
Test the model directly from your browser by clicking on the endpoint
Expand the endpoint details, click "Try it out", and provide a sample input
Execute the request and view the model's predictions
For example, you can submit a name like "Patrick" along with the number of predictions you want, and the model will return the most likely country origins for that name based on its training.
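You can issue the same request from the command line. The JSON field names below are illustrative assumptions, so match them to the schema shown in the Swagger UI, and replace the placeholder with the job's Http Url:

```bash
# Hypothetical call to the /predict endpoint; adjust fields to the documented schema
curl -X POST "<http-url>/predict" \
  -H "Content-Type: application/json" \
  -d '{"name": "Patrick", "n_predictions": 3}'
```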
The flow configuration also supports additional jobs and options beyond those covered in this walkthrough.
This workflow demonstrates the power of declarative ML pipelines on Apolo, enabling reproducible, scalable, and production-ready machine learning workflows. The RNN name classifier example shows how even sophisticated deep learning models can be easily trained and deployed using the platform's orchestration capabilities.