Manage fileset metadata using Gravitino
This page introduces how to manage fileset metadata in Apache Gravitino. Filesets are a collection of files and directories. Users can leverage filesets to manage non-tabular data like training datasets and other raw data.
Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc. With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with tabular data in Gravitino in a unified way.
After a fileset is created, users can easily access and manage its files/directories through the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with Gravitino's unified access control mechanism, filesets can be governed by the same role-based access control rules as tabular data, without needing to configure access controls separately across different storage systems.
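For example, once a fileset exists, its files can be accessed through a virtual path derived from the fileset's identifier instead of the physical storage path. The following is a minimal sketch using the Gravitino Python client's fsspec-compatible virtual file system (GVFS); the fileset `fileset/catalog/schema/example_fileset` is assumed to exist already, and import paths may vary by client version:

```python
# A minimal sketch, assuming the Gravitino Python client is installed
# and a fileset named "example_fileset" already exists.
from gravitino import gvfs

# GVFS resolves virtual gvfs:// paths to the fileset's actual storage
# location (HDFS, S3, local file system, etc.) behind the scenes.
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="metalake",
)

# List files by fileset identifier; no physical path is needed.
files = fs.ls("gvfs://fileset/catalog/schema/example_fileset")
```

Because GVFS follows the fsspec interface, standard operations like `open` and `cat` work the same way against the virtual path.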
To use filesets, please make sure that:
- The Gravitino server has started and is serving at http://localhost:8090.
- A metalake has been created and enabled.
Catalog operations
Create a catalog
For a fileset catalog, you must specify the catalog `type` as `FILESET` when creating the catalog.
You can create a catalog by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs` endpoint or just use the Gravitino Java client. The following is an example of creating a catalog:
- Shell
- Java
- Python
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "catalog",
"type": "FILESET",
"comment": "comment",
"provider": "hadoop",
"properties": {
"location": "file:/tmp/root"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs
# create a S3 catalog
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "catalog",
"type": "FILESET",
"comment": "comment",
"provider": "hadoop",
"properties": {
"location": "s3a://bucket/root",
"s3-access-key-id": "access_key",
"s3-secret-access-key": "secret_key",
"s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
"filesystem-providers": "s3"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs
# For other HCFS implementations like GCS, OSS, etc., the properties should be
# set accordingly. Please refer to the catalog properties documentation for details.
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
Map<String, String> properties = ImmutableMap.<String, String>builder()
.put("location", "file:/tmp/root")
// Property "location" is optional. If specified, a managed fileset without
// a storage location will be stored under this location.
.build();
Catalog catalog = gravitinoClient.createCatalog("catalog",
Type.FILESET,
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);
// create a S3 catalog
Map<String, String> s3Properties = ImmutableMap.<String, String>builder()
.put("location", "s3a://bucket/root")
.put("s3-access-key-id", "access_key")
.put("s3-secret-access-key", "secret_key")
.put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
.put("filesystem-providers", "s3")
.build();
Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
Type.FILESET,
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a S3 fileset catalog",
s3Properties);
// ...
// For other HCFS implementations like GCS, OSS, etc., the properties should be
// set accordingly. Please refer to the catalog properties documentation for details.
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = gravitino_client.create_catalog(name="catalog",
type=Catalog.Type.FILESET,
provider="hadoop",
comment="This is a Hadoop fileset catalog",
properties={"location": "/tmp/test1"})
# create a S3 catalog
s3_properties = {
"location": "s3a://bucket/root",
"s3-access-key-id": "access_key"
"s3-secret-access-key": "secret_key",
"s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
}
s3_catalog = gravitino_client.create_catalog(name="catalog",
type=Catalog.Type.FILESET,
provider="hadoop",
comment="This is a S3 fileset catalog",
properties=s3_properties)
# For other HCFS implementations like GCS, OSS, etc., the properties should be
# set accordingly. Please refer to the catalog properties documentation for details.
Currently, Gravitino supports the following catalog providers:
| Catalog provider | Catalog property |
|---|---|
| `hadoop` | Hadoop catalog property |
Load a catalog
Refer to Load a catalog in relational catalog for more details. For a fileset catalog, the load operation is the same.
Alter a catalog
Refer to Alter a catalog in relational catalog for more details. For a fileset catalog, the alter operation is the same.
Drop a catalog
Refer to Drop a catalog in relational catalog for more details. For a fileset catalog, the drop operation is the same.
Currently, Gravitino doesn't support dropping a catalog with schemas and filesets under it. You have to drop all the schemas and filesets under the catalog before dropping the catalog.
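A hedged sketch of that ordering with the Python client follows; the names reuse the examples on this page, and exact method signatures and import paths may vary slightly between client versions:

```python
from gravitino import GravitinoClient, Namespace

client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = client.load_catalog(name="catalog")

# Drop every fileset, then every schema, before dropping the catalog.
for schema_name in catalog.as_schemas().list_schemas():
    for ident in catalog.as_fileset_catalog().list_filesets(
            namespace=Namespace.of(schema_name)):
        catalog.as_fileset_catalog().drop_fileset(ident=ident)
    # The schema is empty at this point, so no cascade is needed.
    catalog.as_schemas().drop_schema(schema_name, cascade=False)

# Finally, drop the now-empty catalog.
client.drop_catalog(name="catalog")
```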
List all catalogs in a metalake
Please refer to List all catalogs in a metalake in relational catalog for more details. For a fileset catalog, the list operation is the same.
List all catalogs' information in a metalake
Please refer to List all catalogs' information in a metalake in relational catalog for more details. For a fileset catalog, the list operation is the same.
Schema operations
A schema is a virtual namespace in a fileset catalog, used to organize the filesets it contains. It is similar to the concept of a schema in a relational catalog.
Users should create a metalake and a catalog before creating a schema.
Create a schema
You can create a schema by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas` endpoint or just use the Gravitino Java client. The following is an example of creating a schema:
- Shell
- Java
- Python
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "schema",
"comment": "comment",
"properties": {
"location": "file:/tmp/root/schema"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
// Assuming you have just created a Hadoop catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
SupportsSchemas supportsSchemas = catalog.asSchemas();
Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
// Property "location" is optional, if specified all the managed fileset without
// specifying storage location will be stored under this location.
.put("location", "file:/tmp/root/schema")
.build();
Schema schema = supportsSchemas.createSchema("schema",
"This is a schema",
schemaProperties
);
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_schemas().create_schema(name="schema",
comment="This is a schema",
properties={"location": "/tmp/root/schema"})
Currently, Gravitino supports the following schema property:
| Catalog provider | Schema property |
|---|---|
| `hadoop` | Hadoop schema property |
Load a schema
Please refer to Load a schema in relational catalog for more details. For a fileset catalog, the schema load operation is the same.
Alter a schema
Please refer to Alter a schema in relational catalog for more details. For a fileset catalog, the schema alter operation is the same.
Drop a schema
Please refer to Drop a schema in relational catalog for more details. For a fileset catalog, the schema drop operation is the same.
Note that the drop operation will also remove all of the filesets, as well as the managed files under this schema path, if `cascade` is set to `true`.
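For example, a minimal Python sketch of a cascading schema drop, reusing the `catalog` from the examples above (the exact `drop_schema` signature is assumed from the Python client API):

```python
from gravitino import GravitinoClient

client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = client.load_catalog(name="catalog")

# cascade=True removes the schema together with all filesets and managed
# files under its path; with cascade=False the drop fails if the schema
# still contains filesets.
catalog.as_schemas().drop_schema("schema", cascade=True)
```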
List all schemas under a catalog
Please refer to List all schemas under a catalog in relational catalog for more details. For a fileset catalog, the schema list operation is the same.
Fileset operations
- Users should create a metalake, a catalog, and a schema before creating a fileset.
- Currently, Gravitino only supports managing Hadoop Compatible File System (HCFS) locations.
Create a fileset
You can create a fileset by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets` endpoint or just use the Gravitino Java client. The following is an example of creating a fileset:
- Shell
- Java
- Python
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "example_fileset",
"comment": "This is an example fileset",
"type": "MANAGED",
"storageLocation": "file:/tmp/root/schema/example_fileset",
"properties": {
"k1": "v1"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
.put("k1", "v1")
.build();
filesetCatalog.createFileset(
NameIdentifier.of("schema", "example_fileset"),
"This is an example fileset",
Fileset.Type.MANAGED,
"file:/tmp/root/schema/example_fileset",
propertiesMap);
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
type=Fileset.Type.MANAGED,
comment="This is an example fileset",
storage_location="/tmp/root/schema/example_fileset",
properties={"k1": "v1"})
Currently, Gravitino supports two types of filesets:
- `MANAGED`: The storage location of the fileset is managed by Gravitino. When a fileset is specified as `MANAGED`, its physical location will be deleted when the fileset is dropped.
- `EXTERNAL`: The storage location of the fileset is not managed by Gravitino. When a fileset is specified as `EXTERNAL`, its files will not be deleted when the fileset is dropped.
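To illustrate the difference, here is a hedged Python sketch that registers an `EXTERNAL` fileset over a pre-existing directory; the path `file:/tmp/external_data` is hypothetical, and the `Fileset` import path may vary by client version. Dropping this fileset later removes only its metadata:

```python
from gravitino import GravitinoClient, NameIdentifier
from gravitino import Fileset  # exact import path may differ by client version

client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = client.load_catalog(name="catalog")

# EXTERNAL: Gravitino tracks this location but does not own the files,
# so dropping the fileset leaves file:/tmp/external_data untouched.
catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("schema", "external_fileset"),
    type=Fileset.Type.EXTERNAL,
    comment="An external fileset over pre-existing data",
    storage_location="file:/tmp/external_data",  # hypothetical path
    properties={},
)
```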
storageLocation
The `storageLocation` is the physical location of the fileset. Users can specify this location when creating a fileset, or it follows the rules of the catalog/schema location if not specified.
The value of `storageLocation` depends on the configuration settings of the catalog:
- If this is an S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`.
- If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`.
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:/path/to/fileset`.
- If this is an HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
- If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`.
For a `MANAGED` fileset, the storage location is resolved as follows (a worked sketch follows this list):
- The one specified by the user during the fileset creation.
- When the catalog property `location` is specified but the schema property `location` isn't, the storage location is `catalog location/schema name/fileset name`.
- When the catalog property `location` isn't specified but the schema property `location` is, the storage location is `schema location/fileset name`.
- When both the catalog property `location` and the schema property `location` are specified, the storage location is `schema location/fileset name`.
- When neither the catalog property `location` nor the schema property `location` is specified, the user must specify the `storageLocation` when creating the fileset.
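As a worked sketch of the second rule: assume the catalog was created with `location` set to `file:/tmp/root` (as in the catalog example above) and the schema `schema` has no `location` property; a `MANAGED` fileset created without a `storageLocation` then resolves to `file:/tmp/root/schema/example_fileset`. The snippet below assumes the Python client accepts an omitted location as `None` and exposes a `storage_location()` accessor:

```python
from gravitino import GravitinoClient, NameIdentifier
from gravitino import Fileset  # exact import path may differ by client version

client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = client.load_catalog(name="catalog")

# No storage location is given, so the rules above resolve it to
# "<catalog location>/<schema name>/<fileset name>", i.e.
# file:/tmp/root/schema/example_fileset in this setup.
fileset = catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("schema", "example_fileset"),
    type=Fileset.Type.MANAGED,
    comment="storage location resolved from the catalog property",
    storage_location=None,  # omitted on purpose
    properties={},
)
print(fileset.storage_location())  # expected: file:/tmp/root/schema/example_fileset
```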
For an `EXTERNAL` fileset, users must specify `storageLocation` during the fileset creation; otherwise, Gravitino will throw an exception.
Alter a fileset
You can modify a fileset by sending a `PUT` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets/{fileset_name}` endpoint or just use the Gravitino Java client. The following is an example of modifying a fileset:
- Shell
- Java
- Python
curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"updates": [
{
"@type": "removeProperty",
"property": "key2"
}, {
"@type": "setProperty",
"property": "key3",
"value": "value3"
}
]
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets/fileset
// ...
// Assuming you have just created a Fileset catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
Fileset f = filesetCatalog.alterFileset(NameIdentifier.of("schema", "fileset"),
FilesetChange.rename("fileset_renamed"), FilesetChange.updateComment("xxx"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
changes = (
FilesetChange.remove_property("fileset_properties_key1"),
FilesetChange.set_property("fileset_properties_key2", "fileset_properties_new_value"),
)
fileset_new = catalog.as_fileset_catalog().alter_fileset(NameIdentifier.of("schema", "fileset"),
*changes)
Currently, Gravitino supports the following changes to a fileset:
| Supported modification | JSON | Java |
|---|---|---|
| Rename a fileset | `{"@type":"rename","newName":"fileset_renamed"}` | `FilesetChange.rename("fileset_renamed")` |
| Update a comment | `{"@type":"updateComment","newComment":"new_comment"}` | `FilesetChange.updateComment("new_comment")` |
| Set a fileset property | `{"@type":"setProperty","property":"key1","value":"value1"}` | `FilesetChange.setProperty("key1", "value1")` |
| Remove a fileset property | `{"@type":"removeProperty","property":"key1"}` | `FilesetChange.removeProperty("key1")` |
| Remove comment (deprecated) | `{"@type":"removeComment"}` | `FilesetChange.removeComment()` |
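The Python client provides matching `FilesetChange` factory methods in snake_case; the following is a hedged sketch combining a rename with a comment update, assuming `update_comment` mirrors the Java `updateComment` and that `FilesetChange` is importable from the top-level package:

```python
from gravitino import GravitinoClient, NameIdentifier
from gravitino import FilesetChange  # import path may vary by client version

client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = client.load_catalog(name="catalog")

# Apply both changes in one alter call, mirroring the Java example above.
renamed = catalog.as_fileset_catalog().alter_fileset(
    NameIdentifier.of("schema", "fileset"),
    FilesetChange.rename("fileset_renamed"),
    FilesetChange.update_comment("new_comment"),
)
```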
Drop a fileset
You can remove a fileset by sending a `DELETE` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets/{fileset_name}` endpoint or by using the Gravitino Java client. The following is an example of dropping a fileset:
- Shell
- Java
- Python
curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets/fileset
// ...
// Assuming you have just created a Fileset catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
// Drop a fileset
filesetCatalog.dropFileset(NameIdentifier.of("schema", "fileset"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_fileset_catalog().drop_fileset(ident=NameIdentifier.of("schema", "fileset"))
For a `MANAGED` fileset, the physical location of the fileset will be deleted when the fileset is dropped. For an `EXTERNAL` fileset, only the metadata of the fileset will be removed.
List filesets
You can list all filesets in a schema by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets` endpoint or by using the Gravitino Java client. The following is an example of listing all the filesets in a schema:
- Shell
- Java
- Python
curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
// ...
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
NameIdentifier[] identifiers =
filesetCatalog.listFilesets(Namespace.of("schema"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
fileset_list: List[NameIdentifier] = catalog.as_fileset_catalog().list_filesets(namespace=Namespace.of("schema"))