Apache Spark
Overview
Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs in Java, Scala, Python, and R. It offers built-in modules for streaming, SQL, machine learning, and graph processing, making it ideal for both batch and real-time data analytics workloads. Spark's in-memory computing capabilities and optimized execution engine deliver exceptional performance for iterative algorithms and interactive data analysis.
Apolo provides a managed Spark service that enables you to run scalable Apache Spark applications with configurable drivers, executors, and auto-scaling capabilities. The platform handles cluster management, resource allocation, and job orchestration, allowing you to focus on your data processing logic while ensuring optimal performance and cost efficiency in cloud-native environments.
Spark on Apolo comes with integrated support for dynamic scaling, dependency management, and seamless integration with Apolo Files for data storage and retrieval.
Installing
Install via Apolo CLI
For more details on managing your installable applications with the Apolo CLI, refer to Managing Apps.
Step 1 — use the following CLI command to get an application configuration file template:
apolo app-template get spark -o myspark.yaml
Step 2 — fill in the application parameters. Here is an example configuration file with some of those parameters:
# Example of myspark.yaml
template_name: spark
template_version: v1.0.0
display_name: myspark
input:
  spark_application_config:
    type: Python
    main_application_file:
      path: storage:my-spark-job/main.py
    arguments:
      - "--input-path"
      - "storage:datasets/input.csv"
      - "--output-path"
      - "storage:results/"
    dependencies:
      pypi_packages:
        - "pandas==2.0.3"
        - "numpy>=1.24.0"
      packages:
        - "org.apache.spark:spark-sql_2.12:3.5.3"
    volumes:
      - storage_uri:
          path: storage:datasets/
        mount_path:
          path: /data/input
        mode:
          mode: r
      - storage_uri:
          path: storage:results/
        mount_path:
          path: /data/output
        mode:
          mode: rw
  spark_auto_scaling_config:
    initial_executors: 2
    min_executors: 1
    max_executors: 10
    shuffle_tracking_timeout: 60
  driver_config:
    preset:
      name: cpu-medium
  executor_config:
    instances: 3
    preset:
      name: cpu-large
  image:
    repository: spark
    tag: 3.5.3
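The main_application_file in the example above points to a PySpark script stored in Apolo Files. Below is a minimal sketch of what such a script could look like, assuming it accepts the --input-path and --output-path arguments passed in the config; the file name and the transformation itself are illustrative only.

# Hypothetical sketch of storage:my-spark-job/main.py.
# Reads a CSV from --input-path, applies a trivial transformation, and writes
# the result to --output-path. With the volume mounts from the example config,
# the job could equally be pointed at /data/input/input.csv and /data/output.
import argparse

from pyspark.sql import SparkSession


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("myspark-example").getOrCreate()

    # Read the input CSV, drop incomplete rows, and write the result as Parquet.
    df = spark.read.csv(args.input_path, header=True, inferSchema=True)
    df.dropna().write.mode("overwrite").parquet(args.output_path)

    spark.stop()


if __name__ == "__main__":
    main()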
Step 3 — deploy the application in your Apolo project:
apolo app install -f myspark.yaml
Monitor the application status using:
apolo app list
To uninstall the application, use:
apolo app uninstall <app-id>
To view the application logs, use:
apolo app logs <app-id>