Apache Gravitino 0.7.0 - strengthen the cloud support for Apache Gravitino™ (incubating)

· 6 min read
Jerry Shao
PPMC Member

Gravitino 0.7.0 is the second major release since the project entered the ASF. In this release, the community focused mainly on strengthening cloud support so that Gravitino works better in cloud environments.

This release blog will briefly introduce the new features related to cloud support, as well as other significant features and improvements. Please keep reading to learn more about what the community has worked on.

Cloud storage support for Gravitino

As more and more users run their data stacks in the cloud and use cloud object storage, cloud storage support has become an essential requirement. In this release, the community focused mainly on adding cloud storage support to Gravitino and making sure that Gravitino itself and its connectors and sources work smoothly with cloud storage.

In this release:

  • The Gravitino Iceberg REST catalog server now supports several cloud storage services, including AWS S3, Google GCS, and Aliyun OSS. Users only need to configure it to make it work.
  • The Gravitino Fileset catalog now supports managing files (objects) stored in S3, GCS, and OSS. Gravitino provides both a server-side pluggable framework and client-side Java and Python GVFS (Gravitino Virtual File System) SDKs. Users can use their existing tools with the bundled packages that Gravitino provides to access data in cloud storage. The pluggable framework also lets users implement their own storage support.
  • Gravitino’s Hive, Paimon, and Iceberg catalogs also add and verify support for different cloud storage services in this release.
  • Gravitino’s Spark and Trino connectors have also been verified to work with cloud storage.

Overall, with the 0.7.0 release, Gravitino generally supports working with different cloud storage services. See issue #4396 to learn more. We will continue to add more cloud storage support in upcoming releases, so please stay tuned.

Credential vending support in Gravitino

Besides cloud storage support, credential vending is also important for Gravitino, especially when working with cloud storage. The traditional approach of distributing static access key and secret key (AK/SK) pairs is neither convenient nor safe. With credential vending, the Gravitino server obtains temporary tokens for authentication on behalf of users, which significantly simplifies client-side configuration and centralizes authentication.

In Gravitino 0.7.0, we introduce a framework to support credential vending and add S3 and GCS token support. We have also integrated this framework into the Gravitino Iceberg REST catalog service, so users can smoothly access Iceberg tables on S3 and GCS with authentication.

This is just the first step for credential vending. In the next release, we will add more integrations with Gravitino, such as fileset support and connector support.

For the details of credential vending, please check the issue #4398 and the design document.
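To make the idea behind credential vending more concrete, here is a minimal, self-contained Python sketch. It does not use Gravitino's API: `CredentialVendor`, `TemporaryCredential`, and `vend` are hypothetical names, and the token is just random bytes rather than a real S3 or GCS token. It only illustrates the pattern in which the server keeps the long-lived secret and hands clients short-lived credentials scoped to one location.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporaryCredential:
    """A short-lived credential scoped to one storage location (hypothetical type)."""
    token: str
    location: str
    expires_at: float

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at

    def allows(self, path: str, now: float) -> bool:
        # Usable only before expiry and only under the location it was issued for.
        return self.is_valid(now) and path.startswith(self.location)


class CredentialVendor:
    """Keeps the long-lived secret on the server and vends temporary credentials."""

    def __init__(self, long_lived_secret: str, ttl_seconds: int = 900):
        self._secret = long_lived_secret  # never sent to clients
        self._ttl = ttl_seconds

    def vend(self, location: str, now: float) -> TemporaryCredential:
        # A real vendor would call the cloud provider's token service here.
        # This sketch only generates a random placeholder token.
        return TemporaryCredential(
            token=secrets.token_hex(16),
            location=location,
            expires_at=now + self._ttl,
        )


vendor = CredentialVendor(long_lived_secret="server-side-only", ttl_seconds=900)
issued_at = time.time()
cred = vendor.vend("s3://warehouse/db/table/", now=issued_at)

# Inside the scope and before expiry.
print(cred.allows("s3://warehouse/db/table/data/file.parquet", now=issued_at + 60))
# Outside the scope.
print(cred.allows("s3://warehouse/other/", now=issued_at + 60))
# After expiry.
print(cred.allows("s3://warehouse/db/table/data/file.parquet", now=issued_at + 1000))
```

The client never sees the long-lived secret, and a leaked temporary credential is limited both in time and in the paths it covers.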

Unified access control improvements

In Gravitino 0.6.0, we introduced the alpha version of unified access control with Apache Ranger support, but the feature still needed a lot of improvement. In version 0.7.0, we added many improvements and fixed a number of bugs to make access control work end to end. With the release of 0.7.0, Gravitino unified access control works with Spark and Ranger to secure tables from end to end. To see what we have fixed, please check out issue #4615. You can also try our playground to experience the unified access control feature.

Centralized audit log support

Thanks to the community, Gravitino now supports a centralized audit log. With this feature enabled, users can get the audit log in one centralized place, regardless of whether they are accessing tables or filesets from various sources.

Gravitino’s audit log framework also supports pluggable formatters and writers, so users can implement their own log formats and output destinations.

Please see issues #4887 and #4021 to learn more about Gravitino’s centralized audit log.
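The split between a formatter and a writer can be sketched in a few lines of Python. This is only an illustration of the pluggable design: `AuditEvent`, `Formatter`, `Writer`, `JsonFormatter`, `StreamWriter`, and `AuditLogger` are hypothetical names chosen for this example and are not Gravitino's actual interfaces.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Protocol, TextIO


@dataclass(frozen=True)
class AuditEvent:
    user: str
    operation: str
    resource: str
    success: bool


class Formatter(Protocol):
    """Turns an event into one log line."""
    def format(self, event: AuditEvent) -> str: ...


class Writer(Protocol):
    """Sends a formatted line to some destination."""
    def write(self, line: str) -> None: ...


class JsonFormatter:
    def format(self, event: AuditEvent) -> str:
        return json.dumps(asdict(event), sort_keys=True)


class StreamWriter:
    def __init__(self, stream: TextIO):
        self._stream = stream

    def write(self, line: str) -> None:
        self._stream.write(line + "\n")


class AuditLogger:
    """Combines any formatter with any writer."""

    def __init__(self, formatter: Formatter, writer: Writer):
        self._formatter = formatter
        self._writer = writer

    def log(self, event: AuditEvent) -> None:
        self._writer.write(self._formatter.format(event))


buffer = io.StringIO()
audit = AuditLogger(JsonFormatter(), StreamWriter(buffer))
audit.log(AuditEvent("alice", "LOAD_TABLE", "catalog.schema.orders", True))
audit.log(AuditEvent("bob", "GET_FILESET", "catalog.schema.images", False))
print(buffer.getvalue(), end="")
```

Because the logger depends only on the two small interfaces, swapping the JSON formatter for another format, or the stream writer for a different destination, does not touch the code that emits events.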

New data sources support

Gravitino is a unified data catalog, and the community is always working to add more data sources. In this version, Gravitino adds two new data sources: Apache Hudi and OceanBase. You can now use Gravitino to manipulate Hudi and OceanBase metadata in a unified manner.

Various core features

Apart from the features listed above, this version also improves a lot in its core. Here are several important features:

  • Add PostgreSQL support for storage backend #4101. Gravitino already supports MySQL and H2 as its backend metadata storage. In 0.7.0, the community adds PostgreSQL support to broaden adoption.
  • Unify the catalog and metalake drop behavior #5031. In previous versions, we didn’t enforce the behavior of catalog and metalake drop operations. In this version, we redefine the behavior and make it much safer to use.
  • Manage the column in Gravitino #4493. In 0.7.0, we introduce the column entity in Gravitino, which Gravitino manages with versioning. With this feature, Gravitino can now support tagging on columns, and in the future it can support column-level operations.
  • Add event listener for Iceberg REST catalog server #5204 and support pre-event for event listener #5112.

Other notable enhancements

Gravitino core

  • Support storing column metadata in Gravitino #4493.
  • Support pre-event for Gravitino #5049.
  • Unify drop metalake and catalog behavior #5031.
  • Add credential vending support in Gravitino #4398.
  • Support audit log in Gravitino #4887.
  • Shrink the package size of Gravitino #4513.

Iceberg REST catalog server

  • Add credential vending for Iceberg REST server #4993.
  • Add event listener for Iceberg REST server #5204.
  • Support pre-event for event listener #5112.
  • Add OSS support for fileset catalog #5173.
  • Add GCS support for fileset catalog #5074.
  • Add S3 support for fileset catalog #3379.
  • Add pluggable storage support for fileset catalog #5019.
  • Add S3 support for Paimon catalog #4938.
  • Add catalog support for Hudi #4306.
  • Add catalog support for OceanBase #4848.

API and client

  • Add S3 fileset support for Python GVFS client #5188.
  • Add GCS fileset support for Python GVFS client #5139.
  • Add OSS fileset support for Python GVFS client #5221.
  • Support unified auditing of Fileset metadata and data operations #4021.
  • Support OAuth2 in Python GVFS #3758.

UI

  • Add UI support for operating fileset #5167.
  • Add UI support for operating schema #5140.

All the resolved issues targeting the 0.7.0 release can be seen at https://github.com/apache/gravitino/issues?q=is%3Aissue+is%3Aclosed+label%3A0.7.0+.

Overall

Apache Gravitino 0.7.0 is the second ASF release, and this version adds a bunch of new features. We would like to show appreciation to the Gravitino community for their continued support and valuable contributions. Thanks to the feedback of our users, we are able to continue to innovate and build, so thanks to all those reading this!

To explore the Gravitino 0.7.0 release, please check the documentation. Your feedback is invaluable to the community and the project.

Credits

This release acknowledges the hard work and dedication of all contributors who have helped make this release possible.

@FANNG1 @LauraXia123 @LindaSummer @LiuQhahah @Naresh-kumar-Thodupunoori @SeanAverS @caican00 @coolderli @diqiu50 @featherchen @hanwxx @jerqi @jerryshao @jingjia88 @justinmclean @koonchen @lsyulong @lw-yang @mchades @noidname01 @puchengy @shaofengshi @theoryxu @xiaozcy @xloya @xunliu @yangyuxia @yaoderek @yuanoOo @yuqi1129

Apache Gravitino 0.6.1 release for Apache Gravitino™ (incubating)

· 3 min read
Minghuang Li
PPMC Member

We are pleased to announce the stable release of Gravitino 0.6.1-incubating, based on branch-0.6. This release brings a suite of new features and enhancements, particularly focusing on the unified access control system. Additionally, it includes various bug fixes and optimizations across other components.

Security

  • Supports listing users #3348
  • Supports listing roles #3346
  • Supports listing roles by object #4886
  • Supports listing groups #4873
  • Supports granting or revoking privileges for a role #4903
  • Improved security with additional checks for privilege APIs #5054 #5070
  • Fixed a Hive metastore authentication failure when creating a role #4960
  • Removed the role local cache #4246
  • Addressed a response error in Ranger when calling the Ranger CREATE_GROUP API #4975

Gravitino Core

  • Fixed an issue with updating comments in metalake or catalog operations #4845
  • Introduced a basic framework to support multiple JDBC backends #4832 #4868
  • Fixed a cleanup bug occurring after failed catalog creation attempts #5082

Tag

  • Transitioned Tag REST APIs to Object path for improved management #5000

Catalogs

Iceberg

  • Use unified logic to transform catalog backend name to handle the renaming of catalog #4718

Doris

  • Fix the missing distribution information when loading Doris tables #4988

Trino Connector

  • Corrected the default precision settings for Time and Timestamp column types in the Iceberg catalog #4743

UI

  • Supports creating Paimon catalog #4742
  • Improved user experience by showing an expand arrow when reloading tree nodes #5042

Build and Others

  • Fix the env of openAPI lint plugin #4876
  • Addressed an Out Of Memory (OOM) issue during Trino connector tests #4871
  • Resolved a test failure in testCheckLinkDocs for the web module #4914
  • Increased the Python timeout to 45 minutes #5038
  • Ensured that TestHiveTableOperations can be run independently #4851
  • Added LICENSE and NOTICE files for the Iceberg REST binary to comply with licensing requirements #5010

Limitations and Known Issues

  • Please be aware that the Ranger authorization plugin within the unified access control system may exhibit some limitations and known issues. For detailed information, refer to issue #5115.

Credits

We would like to thank the following contributors for their valuable contributions to this release:

@diqiu50 @FANNG1 @jerqi @jerryshao @justinmclean @LauraXia123 @LindaSummer @LiuQhahah @lsyulong @lw-yang @mchades @tyoushinya @yangyuxia @yuqi1129

Apache, Apache Iceberg, Apache Hive, Apache Flink, Apache Paimon and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Apache Gravitino 0.6.0 - First ASF release for Apache Gravitino™ (incubating)

· 7 min read
Jerry Shao
PPMC Member

This blog post will briefly introduce the new features and significant improvements. Keep reading to learn what the community has worked on and understand Gravitino’s use cases.

Introducing the unified RBAC model for Gravitino

Access control is a crucial feature for the enterprise use of a data catalog, providing users with unified and centralized authorization and authentication capabilities. This release introduces a role-based access control (RBAC) model in Gravitino to authorize different securable objects in a unified manner.

We use Privilege, SecurableObject, Role, User, and Group to define the permissions.

RBAC model

Privilege

Privilege defines the types of operations on different metadata objects, and is used to allow or deny a specific type of operation on a metadata object.

SecurableObject

SecurableObject binds multiple operation-specific types of privileges to a single metadata object.

Role

A Role is a collection of SecurableObjects, and a role represents multiple operation type permissions on multiple metadata objects.

User

Users are granted one or more roles, and users have different operating privileges depending on their roles.

Group

To make it easier to grant a single permission to multiple users, we can add users to a group, and then grant one or more roles to that user group. This process allows all users belonging to that user group to have the permissions in those roles.
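The relationships between these five concepts can be sketched as a small Python data model. This is an illustrative model only, not Gravitino's API: the class layout, the privilege names such as `SELECT_TABLE`, the `is_allowed` helper, and the rule that an explicit deny wins over an allow are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Privilege:
    operation: str                 # illustrative name, e.g. "SELECT_TABLE"
    effect: Effect = Effect.ALLOW


@dataclass(frozen=True)
class SecurableObject:
    name: str                      # full name of one metadata object
    privileges: frozenset          # privileges bound to that object


@dataclass
class Role:
    name: str
    securable_objects: list = field(default_factory=list)


@dataclass
class Group:
    name: str
    roles: list = field(default_factory=list)


@dataclass
class User:
    name: str
    roles: list = field(default_factory=list)
    groups: list = field(default_factory=list)

    def effective_roles(self):
        # Roles granted directly plus roles granted through groups.
        roles = list(self.roles)
        for group in self.groups:
            roles.extend(group.roles)
        return roles


def is_allowed(user, operation, object_name):
    # Assumption of this sketch: an explicit DENY wins over any ALLOW.
    allowed = False
    for role in user.effective_roles():
        for obj in role.securable_objects:
            if obj.name != object_name:
                continue
            for privilege in obj.privileges:
                if privilege.operation != operation:
                    continue
                if privilege.effect is Effect.DENY:
                    return False
                allowed = True
    return allowed


orders = "catalog.sales.orders"
reader = Role("reader", [SecurableObject(orders, frozenset({Privilege("SELECT_TABLE")}))])
blocked = Role("blocked", [SecurableObject(orders, frozenset({Privilege("SELECT_TABLE", Effect.DENY)}))])

analysts = Group("analysts", [reader])
alice = User("alice", groups=[analysts])          # gets "reader" through the group
bob = User("bob", roles=[reader, blocked])        # allow and deny on the same object

print(is_allowed(alice, "SELECT_TABLE", orders))  # True
print(is_allowed(alice, "MODIFY_TABLE", orders))  # False
print(is_allowed(bob, "SELECT_TABLE", orders))    # False
```

Granting the role to the `analysts` group rather than to each user is what lets a single change to the role reach every member of the group.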

More importantly, the privileges granted in Gravitino are pushed down to the underlying permission system. Currently, we support pushing permissions down to Apache Ranger; others, like IAM, are under development.

Authorization flow

For more information about how our RBAC works, please check out our design document. To enable and use access control in Gravitino, please refer to the user document.

Our implementation of unified access control capability is still in the alpha stage, and we’re striving to add more features and make it stable as soon as possible, so please stay tuned.

Separation of the Iceberg REST catalog service

Apache Iceberg is a first-class citizen, and Gravitino has provided an embedded Iceberg REST catalog service since version 0.3. We have seen increasing demand for and adoption of the Iceberg REST catalog service as a standalone server. So, in version 0.6.0, we refactored the whole architecture and modularized the Iceberg REST catalog service as a standalone service, allowing it to be deployed with or without the Gravitino server. Besides the refactoring, we also bumped the supported version to Iceberg 1.5.2, added support for S3 cloud storage, and now support the registerTable interface.

Iceberg REST catalog support is crucial to Gravitino, and modularization is just the first step. In future releases, we will add more features, such as cloud storage support, integration with Gravitino’s RBAC model, and credential vending.

To use the Gravitino Iceberg REST catalog service, please check our user document. The umbrella issue is #4058.

Tagging support

Tagging on metadata objects is useful for data discovery, classification, and data governance. It can also be leveraged by query engines to provide tag-based access control. In Gravitino 0.6.0, we introduce tag support: users can add tags to metadata objects like CATALOG, SCHEMA, TABLE, FILESET, and TOPIC. To learn how our tag system is designed, please check out the design document and issue #3344. To use tags through the REST API and Java SDK, please see how to manage tags.
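As a rough illustration of how tags support discovery, the following Python sketch keeps a two-way index between tags and metadata objects. It is not Gravitino's tag API: `TagRegistry` and its methods are hypothetical names, and only the object types come from the list above.

```python
from collections import defaultdict
from enum import Enum


class ObjectType(Enum):
    CATALOG = "CATALOG"
    SCHEMA = "SCHEMA"
    TABLE = "TABLE"
    FILESET = "FILESET"
    TOPIC = "TOPIC"


class TagRegistry:
    """In-memory two-way index between tags and metadata objects (hypothetical)."""

    def __init__(self):
        self._objects_by_tag = defaultdict(set)
        self._tags_by_object = defaultdict(set)

    def associate(self, object_type, full_name, tag):
        key = (object_type, full_name)
        self._objects_by_tag[tag].add(key)
        self._tags_by_object[key].add(tag)

    def tags_for(self, object_type, full_name):
        # All tags attached to one object, in alphabetical order.
        return sorted(self._tags_by_object.get((object_type, full_name), set()))

    def objects_with(self, tag):
        # All objects carrying a tag, ordered by type name and then object name.
        found = self._objects_by_tag.get(tag, set())
        return sorted(found, key=lambda key: (key[0].value, key[1]))


registry = TagRegistry()
registry.associate(ObjectType.TABLE, "catalog.sales.customers", "pii")
registry.associate(ObjectType.FILESET, "catalog.raw.uploads", "pii")
registry.associate(ObjectType.TABLE, "catalog.sales.customers", "gold")

print(registry.tags_for(ObjectType.TABLE, "catalog.sales.customers"))
# ['gold', 'pii']
print([name for _, name in registry.objects_with("pii")])
# ['catalog.raw.uploads', 'catalog.sales.customers']
```

Looking up every object that carries a tag such as `pii`, regardless of whether it is a table or a fileset, is the kind of query that classification and governance workflows rely on.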

As an open data catalog, we want to be able to support all query engines. Therefore, alongside Trino and Apache Spark, we have added Apache Flink as our newest supported query engine.

In 0.6.0, we added a new Flink Gravitino connector #1354 and supported querying Hive tables using Flink with Gravitino. Hive support is just our first step, and we will continue to add support for more tables.

To learn how to use the Flink Gravitino connector, please refer to our documentation.

Apache Paimon table management in Gravitino

Apache Paimon has become quite popular this year, and many companies use Paimon to build their streaming warehouse or lakehouse. To manage all the lakehouse tables in a unified manner, Gravitino has added Paimon table management in 0.6.0 #1129. Users can use our unified API to manage Paimon tables as well as other tables. To learn more about how to manage Paimon tables, please refer to the Lakehouse Paimon Catalog document.

Add Python GVFS support for fileset

In Gravitino 0.5, we added GVFS, a Java Hadoop Compatible Filesystem (HCFS) implementation, for fileset read/write in Gravitino. The Java GVFS can be used by query engines like Apache Spark to read and write data in files or folders. Although this works well for big data, AI development is largely dominated by Python, and the lack of Python support can hinder users from using Fileset with AI frameworks.

In 0.6.0, we followed the Python fsspec interface to provide a Python GVFS package that can be used by popular Python frameworks like Apache Arrow, Pandas, Ray, LlamaIndex, and more. You can check out the Python GVFS document for more information.
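The core idea of a virtual file system for filesets is that clients address files by a logical path and the storage location is resolved for them. The sketch below shows only that translation step in plain Python. It is not the Python GVFS package: the `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub path>` layout is assumed here for illustration, and `FILESET_LOCATIONS` and `resolve_virtual_path` are hypothetical names standing in for a lookup against the Gravitino server.

```python
from urllib.parse import urlparse

# Hypothetical mapping from (catalog, schema, fileset) to a storage location.
# In a real deployment this information would come from the Gravitino server.
FILESET_LOCATIONS = {
    ("catalog", "schema", "images"): "s3://my-bucket/datasets/images",
}


def resolve_virtual_path(virtual_path, locations):
    """Translate a gvfs://fileset/... path into a path on the underlying storage."""
    parsed = urlparse(virtual_path)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a fileset virtual path: {virtual_path}")

    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 3:
        raise ValueError("expected catalog/schema/fileset in the path")

    catalog, schema, fileset, *sub_path = parts
    base = locations[(catalog, schema, fileset)].rstrip("/")
    return "/".join([base, *sub_path])


print(resolve_virtual_path(
    "gvfs://fileset/catalog/schema/images/train/cat.png", FILESET_LOCATIONS
))
# s3://my-bucket/datasets/images/train/cat.png
```

With this indirection, the storage location of a fileset can change without any change to the paths that training or analytics code uses.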

Notable enhancements

Gravitino core

  • Support catalog reload after a property is altered #2267.
  • Deprecate KV store and add H2 support as embedded storage backend #3968.

Catalog related

  • Add API to test catalog connection #4107.
  • Improve the type system to support unknown types #3427.
  • Add Kerberos support for fileset Hadoop catalog #3462.
  • Add S3 support for Iceberg #4264.
  • Support cloud and region property when creating catalog #3966.
  • Support multiple Kerberos authentication for Hive catalog #3906.
  • Unify the behavior of purge for all the catalogs #3685.

API and client

  • Refactor Java and Python API for better user experience #3626.
  • Add missing error handlers in Python client #4225.

All the resolved issues targeting the 0.6.0 release can be seen at https://github.com/apache/gravitino/issues?page=12&q=is%3Aissue+is%3Aclosed+label%3A0.6.0.

Overall

Apache Gravitino 0.6.0 is the first ASF release. We would like to show appreciation to the Gravitino community for their continued support and valuable contributions. Thanks to the feedback of our users, we are able to continue to innovate and build, so thanks to all those reading this!

To explore the Gravitino 0.6.0 release, please check the documentation. Your feedback is invaluable to the community and the project.

Credits

This release acknowledges the hard work and dedication of all contributors who have helped make this release possible.

@1996fanrui @BSSsunny @FANNG1 @IamSaker @JinsYin @JosefinaOller @LanceHsun @LauraXia123 @Leonidas963 @LindaSummer @MukarramHaq @Naresh-kumar-Thodupunoori @Nishtha-Jain-1119 @SteNicholas @TEOTEO520 @Vishesh-Paliwal @ashwin1596 @bknbkn @caican00 @ch3yne @charliecheng630 @coolderli @danhuawang @diqiu50 @featherchen @hanwxx @ian910297 @jenish-thapa @jerqi @jerryshao @jingjia88 @jtao1 @justinmclean @kalencaya @khmgobe @kiratkumar47 @kohantikanath @kristopherkane @lsyulong @lw-yang @mchades @mygrsun @noidname01 @pan3793 @pravo23 @qqqttt123 @rich7420 @rohit-satya @shaofengshi @theoryxu @totalo @unknowntpo @xiaozcy @xloya @xunliu @yijhenlin @yuqi1129 @zhoukangcn @zivali

Apache Gravitino is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by ASF Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.


Gravitino is an Open Source Data and AI Multi-Cloud Solution

· 4 min read
Justin Mclean
PPMC Member

In the ever-evolving landscape of data and artificial intelligence, innovation is the key driver of progress. Gravitino is an open source, next-generation data and AI platform. Gravitino aims to unify all aspects of your data, analytics, and AI in one seamless accessible platform.

The power of open source

Open source embodies collaboration, transparency, and community-driven development. Making Gravitino open source and an incubating project of the Apache Software Foundation extends an invitation to developers worldwide to participate in shaping the future of multi-cloud data management and analytics.

Unified data, analytics, and AI fabric

Gravitino isn't just a tool; it's a fabric that weaves together all your data, analytics, and AI into a single, unified platform. Regardless of where your data resides, be it in various public or private cloud environments, different vendors or different regions, Gravitino provides a solution and delivers optimal performance and cost efficiency.

Operational Simplicity

Gravitino offers a unified perspective of all your data and AI models, ensuring seamless access to all your data. Gravitino empowers users with operational simplicity, allowing them to focus on deriving insights rather than managing complex data infrastructure.

Developer experience

For developers, Gravitino enables a unified ANSI standard-compatible SQL interface, making data handling ETL-free and codeless. Its REST interface, coupled with a built-in SQL optimizer and intelligent query execution, ensures an efficient developer experience. Gravitino empowers developers to focus on innovation rather than grappling with the intricacies of data handling.

Performance and cost efficiency

Gravitino aims to take data management to the next level by eliminating unnecessary data transmission, providing the best performance for data queries on multi-cloud environments. With global data acceleration, Gravitino enables faster and more cost-effective data analysis. This performance boost ensures that organizations can derive insights quicker and more efficiently.

Data source connection, data virtualization, federated computing

Gravitino comes equipped with enterprise-ready connectors for seamless, high-performance access to cloud data lakes. It offers a unified experience for data in remote regions through data virtualization, is progressing on intelligent acceleration, and allows effortless data analysis and training across different data sources, breaking down traditional silos.

Why Gravitino?

Breaking down data silos

Gravitino tackles the age-old challenge of data silos by providing a unified metadata management and federated analytics engine. This allows for direct data analysis from various cloud and SaaS services without the need for time-consuming ETL processes.

Query federation and in-situ analysis

Gravitino is creating a world where users can access data from diverse systems within a single query, eliminating the need for complex data replication and transformation processes.

Open source commitment

Gravitino's journey isn't just about software; it's about community-driven development. Gravitino is actively developed in the open under the Apache License, a business-friendly permissive license. Join the developer community to be part of this exciting journey.

The future of multi-cloud data management

In the era of data-driven decision-making, Gravitino emerges as a beacon of innovation and collaboration. By embracing open source, Gravitino shows its belief in the power of community-driven development to shape the future of data and AI. Gravitino isn't just a platform; it represents a movement toward a more connected, efficient, and accessible data landscape. Join the journey to redefine the possibilities of data management and analytics with Gravitino, the next-generation data and AI fabric.

Discover the power of Gravitino, an open source platform reshaping multi-cloud data and AI. Join the community and redefine the possibilities of data management. Get started on GitHub, where you will also find documentation and a Docker playground to help you get started. You can also join the community Slack channel to discuss ideas and seek help.