Version: 1.3.0

Connect Spark to Iceberg REST

Introduction

Apache Gravitino exposes an Iceberg REST catalog (IRC) endpoint that any Iceberg-compatible engine can connect to directly, without installing a Gravitino-specific connector plugin. This page describes how to configure Apache Spark to use that endpoint.

note

This integration uses the standard Apache Iceberg REST catalog specification. Gravitino enforces its full access-control model on all IRC requests. Per-user identity propagation from the engine is planned for a future release; current requests are authorized using the credentials supplied in the Spark configuration.

Prerequisites

  • Apache Gravitino running with the Iceberg REST service enabled. See Iceberg REST catalog service for setup instructions.
  • The Gravitino IRC endpoint is accessible from your Spark environment. The default port is 9001.
  • The following JAR files available in your Spark environment:
    • iceberg-spark-runtime-3.5_2.12-1.7.1.jar
    • hadoop-aws-3.3.4.jar
    • aws-bundle-2.29.38.jar

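This page uses Spark 3.5.3 with Iceberg 1.7.1. For other versions, ensure compatibility between the Spark, Scala, and Iceberg runtime versions. The runtime JAR name encodes all three: iceberg-spark-runtime-3.5_2.12-1.7.1.jar targets Spark 3.5, Scala 2.12, and Iceberg 1.7.1.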
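Before editing any Spark configuration, the endpoint prerequisite above can be checked from the machine that runs Spark. The following is a minimal sketch, not part of Gravitino: it assumes the endpoint serves the `GET /v1/config` route defined by the Iceberg REST specification under the /iceberg prefix used on this page, and that no authentication is required. The helper names `irc_config_url` and `fetch_irc_config` are hypothetical.

```python
import json
import urllib.request


def irc_config_url(host: str, port: int = 9001, prefix: str = "iceberg") -> str:
    """Build the URL of the REST catalog config route for a Gravitino IRC endpoint.

    Assumption: the route is <uri>/v1/config, where <uri> is the value used
    for spark.sql.catalog.<name>.uri on this page.
    """
    return f"http://{host}:{port}/{prefix}/v1/config"


def fetch_irc_config(host: str, port: int = 9001, timeout: float = 5.0) -> dict:
    """Fetch and parse the config response.

    Raises urllib.error.URLError if the endpoint cannot be reached.
    """
    with urllib.request.urlopen(irc_config_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_irc_config("<gravitino-host>")` from the Spark host should return a JSON object if the service is reachable; a connection error suggests a network or port problem rather than a Spark misconfiguration.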

Configuration

spark-defaults.conf is Spark's persistent configuration file. Properties set here are applied automatically to every Spark session started from this installation, so no command-line flags are needed. The file lives at:

$SPARK_HOME/conf/spark-defaults.conf

If the file doesn't exist yet, copy the template:

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Simple Authentication

Add the following to $SPARK_HOME/conf/spark-defaults.conf:

# Iceberg extensions
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# Gravitino IRC catalog
spark.sql.catalog.gravitino_irc org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.gravitino_irc.type rest
spark.sql.catalog.gravitino_irc.uri http://<gravitino-host>:9001/iceberg

# S3 FileIO
spark.sql.catalog.gravitino_irc.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.gravitino_irc.s3.region us-east-1
spark.sql.catalog.gravitino_irc.s3.access-key-id <access-key>
spark.sql.catalog.gravitino_irc.s3.secret-access-key <secret-key>

# Hadoop S3A (for s3a:// paths)
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

# Set as default catalog (optional)
spark.sql.defaultCatalog gravitino_irc

note

gravitino_irc is the catalog identifier used within Spark. It maps to the Gravitino IRC endpoint via the uri property. You may use any identifier you prefer. S3 credentials can alternatively be supplied via environment variables (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) or an IAM instance profile, in which case the explicit credential lines can be omitted.
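When editing spark-defaults.conf is not practical (for example in a notebook), the same properties can be supplied on the session builder. The sketch below assumes PySpark is installed and that S3 credentials come from environment variables or an instance profile, so the explicit credential lines are left out. The function names `gravitino_irc_conf` and `build_session` are hypothetical; the property keys are the ones listed above.

```python
def gravitino_irc_conf(host, catalog="gravitino_irc", port=9001, region="us-east-1"):
    """Return the catalog properties from this page as a dict keyed by Spark property name."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"http://{host}:{port}/iceberg",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.region": region,
    }


def build_session(conf, app_name="gravitino-irc-example"):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so the dict builder above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A session created with `build_session(gravitino_irc_conf("<gravitino-host>"))` should see the catalog under the identifier passed as `catalog`. Properties set this way apply only to that session, whereas spark-defaults.conf applies to every session.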

Basic Authentication

If Gravitino uses built-in IDP Basic authentication, add the auth properties to $SPARK_HOME/conf/spark-defaults.conf:

# Iceberg extensions
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# Gravitino IRC catalog
spark.sql.catalog.gravitino_irc org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.gravitino_irc.type rest
spark.sql.catalog.gravitino_irc.uri http://<gravitino-host>:9001/iceberg

# Basic authentication
spark.sql.catalog.gravitino_irc.rest.auth.type basic
spark.sql.catalog.gravitino_irc.rest.auth.basic.username <username>
spark.sql.catalog.gravitino_irc.rest.auth.basic.password <password>

# S3 FileIO
spark.sql.catalog.gravitino_irc.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.gravitino_irc.s3.region us-east-1
spark.sql.catalog.gravitino_irc.s3.access-key-id <access-key>
spark.sql.catalog.gravitino_irc.s3.secret-access-key <secret-key>

# Hadoop S3A (for s3a:// paths)
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

# Set as default catalog (optional)
spark.sql.defaultCatalog gravitino_irc
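To check a username and password outside Spark, a request can be built by hand with the HTTP Basic scheme (RFC 7617), which sends the Base64 encoding of `username:password` in the Authorization header. This is a sketch under assumptions: it targets the `/v1/config` route of the Iceberg REST specification under the /iceberg prefix, and the helper names are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header (RFC 7617)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def authenticated_config_request(host, username, password, port=9001):
    """Build (but do not send) a GET request for the catalog config route.

    Assumption: the route is http://<host>:<port>/iceberg/v1/config.
    """
    url = f"http://{host}:{port}/iceberg/v1/config"
    return urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. An HTTP 401 response would suggest that the server rejected the credentials, which is worth ruling out before debugging the Spark properties.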

OAuth2 Authentication

If Gravitino is configured with OAuth2, add the auth properties to the same $SPARK_HOME/conf/spark-defaults.conf file:

# Iceberg extensions
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# Gravitino IRC catalog
spark.sql.catalog.gravitino_irc org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.gravitino_irc.type rest
spark.sql.catalog.gravitino_irc.uri http://<gravitino-host>:9001/iceberg

# OAuth2 authentication
spark.sql.catalog.gravitino_irc.rest.auth.type oauth2
spark.sql.catalog.gravitino_irc.rest.auth.oauth2.token <your-token>

# S3 FileIO
spark.sql.catalog.gravitino_irc.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.gravitino_irc.s3.region us-east-1
spark.sql.catalog.gravitino_irc.s3.access-key-id <access-key>
spark.sql.catalog.gravitino_irc.s3.secret-access-key <secret-key>

# Hadoop S3A (for s3a:// paths)
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

# Set as default catalog (optional)
spark.sql.defaultCatalog gravitino_irc

See How to authenticate for Gravitino authentication configuration options.

Local development

For local development, MinIO can be used as an S3-compatible storage backend. Replace the S3 FileIO section with:

spark.sql.catalog.gravitino_irc.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.gravitino_irc.s3.endpoint http://<minio-host>:9000
spark.sql.catalog.gravitino_irc.s3.path-style-access true
spark.sql.catalog.gravitino_irc.s3.access-key-id <minio-access-key>
spark.sql.catalog.gravitino_irc.s3.secret-access-key <minio-secret-key>
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint http://<minio-host>:9000
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled false

See gravitino-irc-quickstart for a complete local development environment using MinIO.
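The MinIO settings repeat the endpoint in two places, once for Iceberg's S3FileIO and once for Hadoop S3A, so the two can drift apart. One way to keep them in sync is to generate both groups from a single endpoint value. The sketch below is illustrative only: `minio_conf` and `to_spark_defaults` are hypothetical helpers, and deriving the SSL flag from the URL scheme is an assumption of this example rather than something Gravitino prescribes.

```python
from urllib.parse import urlparse


def minio_conf(endpoint, access_key, secret_key, catalog="gravitino_irc"):
    """Return the MinIO-related properties from this page, built from one endpoint."""
    prefix = f"spark.sql.catalog.{catalog}"
    # Assumption: SSL is enabled exactly when the endpoint URL uses https.
    ssl_enabled = "true" if urlparse(endpoint).scheme == "https" else "false"
    return {
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": endpoint,
        f"{prefix}.s3.path-style-access": "true",
        f"{prefix}.s3.access-key-id": access_key,
        f"{prefix}.s3.secret-access-key": secret_key,
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": ssl_enabled,
    }


def to_spark_defaults(conf):
    """Render a property dict as whitespace-separated spark-defaults.conf lines."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())
```

Printing `to_spark_defaults(minio_conf("http://<minio-host>:9000", "<minio-access-key>", "<minio-secret-key>"))` yields lines in the same shape as the block above, ready to paste into spark-defaults.conf.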

Credential Vending

If Gravitino is configured with credential vending, add the following to enable it on the client side:

spark.sql.catalog.gravitino_irc.header.X-Iceberg-Access-Delegation    vended-credentials

See Credential vending for server-side configuration.
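The property above relies on a naming convention: catalog properties that start with `header.` are sent by the Iceberg REST client as HTTP headers on its requests. The sketch below only illustrates that mapping for a Spark property dict; it is not the client's actual code, and `rest_headers` is a hypothetical name.

```python
def rest_headers(spark_conf, catalog="gravitino_irc"):
    """Extract the HTTP headers implied by `header.`-prefixed catalog properties.

    Illustrates the naming convention only: the part of the key after
    spark.sql.catalog.<catalog>.header. becomes the header name.
    """
    prefix = f"spark.sql.catalog.{catalog}.header."
    return {
        key[len(prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith(prefix)
    }
```

For the configuration on this page, the result contains a single `X-Iceberg-Access-Delegation` header with the value `vended-credentials`, which is how the client asks the server for vended credentials.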

note

For storage that Gravitino does not manage, the server does not automatically pass storage properties to the client. Set any properties that FileIO needs for initialization explicitly:

spark.sql.catalog.gravitino_irc.<configuration-key>    <property-value>

Start Spark

Once spark-defaults.conf is in place, start your Spark session normally. The Gravitino IRC catalog is available immediately without any additional flags.

Spark Shell (Scala)

$SPARK_HOME/bin/spark-shell

Spark SQL

$SPARK_HOME/bin/spark-sql

PySpark

$SPARK_HOME/bin/pyspark

Examples

List Namespaces

SHOW NAMESPACES IN gravitino_irc;

List Tables

SHOW TABLES IN gravitino_irc.<namespace>;

Query a Table

SELECT * FROM gravitino_irc.<namespace>.<table> LIMIT 10;

Create a Table

CREATE TABLE gravitino_irc.<namespace>.new_table (
  id INT,
  name STRING,
  created_at TIMESTAMP
) USING iceberg;

Insert Data

INSERT INTO gravitino_irc.<namespace>.new_table VALUES (1, 'example', current_timestamp());
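The statements above can also be run in sequence from PySpark as a quick smoke test. The following is a sketch, assuming `spark` is a session already configured with the catalog as described on this page and that the namespace exists. The function names are hypothetical, and `IF NOT EXISTS` is added here so the script can be re-run.

```python
def smoke_test_statements(namespace, catalog="gravitino_irc", table="new_table"):
    """Return the example statements from this page for one namespace, in order."""
    fq_table = f"{catalog}.{namespace}.{table}"
    return [
        f"SHOW NAMESPACES IN {catalog}",
        f"SHOW TABLES IN {catalog}.{namespace}",
        # IF NOT EXISTS is an addition of this sketch, to make re-runs safe.
        f"CREATE TABLE IF NOT EXISTS {fq_table} "
        "(id INT, name STRING, created_at TIMESTAMP) USING iceberg",
        f"INSERT INTO {fq_table} VALUES (1, 'example', current_timestamp())",
        f"SELECT * FROM {fq_table} LIMIT 10",
    ]


def run_smoke_test(spark, namespace):
    """Execute each statement with spark.sql and print its result."""
    for statement in smoke_test_statements(namespace):
        spark.sql(statement).show()
```

In the PySpark shell, `run_smoke_test(spark, "<namespace>")` exercises listing, table creation, a write, and a read through the IRC endpoint in one pass.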

Gravitino Connector vs. Iceberg REST

Feature                  | Gravitino Engine Connector | Iceberg REST
Engine plugin required   | Yes                        | No
Gravitino access control | Yes                        | Yes
Supported engines        | Trino, Spark, Flink, Daft  | Any Iceberg-compatible engine
Credential vending       | Varies                     | Yes (S3, GCS, OSS, ADLS)