Govern Data Streams
Stream Governance Packages
Stream Governance is available in two package types: Essentials and Advanced. The package type you choose determines the features, capabilities, limits, and billing for Stream Governance. Use the information in this topic to find the governance package with the features and capabilities that are right for you:
- Use the Essentials package to help you get started with the governance fundamentals.
- Use the Advanced package for enterprise-grade data governance in production workloads.
The table below offers a high-level comparison of features across the governance package types.
| Feature | Essentials | Advanced |
|---|---|---|
| Schema Registry | ✔ (Yes) | ✔ (Yes) |
| Schema Registry SLA | 99.5% | 99.99% |
| Schema Registry calls per second | Read: 75 Write: 25 | Read: 75 Write: 25 |
| Data rules | No | ✔ (Yes) |
| Number of schemas included | 100 | 20,000 |
| Number of exporters included | 10 | 100 |
| Stream catalog tags | ✔ (Yes) | ✔ (Yes) |
| Stream catalog business metadata | No | ✔ (Yes) |
| Stream catalog REST API | ✔ (Yes) | ✔ (Yes) |
| Stream catalog GraphQL API | No | ✔ (Yes) |
| Data portal (powered by the Stream Catalog) | ✔ (Yes) | ✔ (Yes) |
| Stream lineage (point in time: last 10 minutes) | ✔ (Yes) | ✔ (Yes) |
| Stream lineage (point in time: last 7 days) | No | ✔ (Yes) |
| AsyncAPI specification export and import | ✔ (Yes) | ✔ (Yes) |
Data Portal
Data Portal is a self-service interface for discovering, exploring, and accessing Apache Kafka® topics on Confluent Cloud.
Building new streaming applications and pipelines on top of Kafka can be slow and inefficient when there is a lack of visibility into what data exists, where it comes from, and who can grant access. Data Portal leverages Stream Catalog and Stream Lineage to empower data users to interact with their organization’s data streams efficiently and collaboratively.
With Data Portal, data practitioners can:
- Search for and discover existing topics with the help of topic metadata, and get a drill-down view to understand the data they hold (without access to the actual data).
- Request access to topics through an approval workflow that connects the data user with the data owner and the admins who can approve the request.
- View and use data in topics (once access is granted) to build new streaming applications and pipelines.
Prerequisites and notes
- Data Portal is available in Confluent Cloud for users with a Stream Governance package enabled in their environments.
- User-generated metadata should be added to topics to make them discoverable and present them effectively in Data Portal. In particular, add a description, tags, business metadata, and an owner name and email to each topic (a sketch of tagging a topic follows this list).
- The topic access request workflow is not available for topics on Basic clusters.
- The collaboration workflow for topic access requests through email depends on owner names and emails being added to topics.
- Users need Stream Catalog search permissions to use Data Portal, at a minimum the DataDiscovery role. In the “add new user” workflow, the DataDiscovery role is pre-selected by default to give the user permission to use Data Portal.
- To approve access requests to topics, users need a role that can grant topic read and write access, in particular ResourceOwner, CloudClusterAdmin, EnvironmentAdmin, or OrganizationAdmin.
- To query data with Apache Flink®, users need query permissions on one or more compute pools, at a minimum the FlinkDeveloper role.
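As a rough sketch of how topic metadata might be attached programmatically, the Java snippet below tags a hypothetical `orders` topic as `PII` by calling the Stream Catalog REST API over HTTP. The endpoint path, the `kafka_topic` entity type, and the qualified-name format shown here are assumptions for illustration only; the Schema Registry endpoint, API key, and cluster IDs are placeholders you would replace with values from your own environment.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch: tag a topic so it is easier to discover in Data Portal.
// Endpoint path, entity type, and qualified-name format are assumptions;
// consult the Stream Catalog API reference for your environment.
public class TagTopicForDataPortal {
  public static void main(String[] args) throws Exception {
    String catalogEndpoint = "https://psrc-xxxxx.us-east-2.aws.confluent.cloud"; // placeholder
    String auth = Base64.getEncoder().encodeToString("SR_API_KEY:SR_API_SECRET".getBytes());

    // Hypothetical payload: attach the "PII" tag to a kafka_topic entity.
    String body = """
        [{"entityType": "kafka_topic",
          "entityName": "lsrc-xxxxx:lkc-xxxxx:orders",
          "typeName": "PII"}]""";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(catalogEndpoint + "/catalog/v1/entity/tags"))
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```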
The "Access requests" tab in the "Accounts & access" section of Confluent Cloud is where you can view and manage all pending and past access requests for Kafka topics. This tab is part of the Data Portal feature, which enables a self-service workflow for requesting and granting access to topics.
- Data Portal on Confluent Cloud | Confluent Documentation
- Organize your data quickly and easily using Confluent Data Portal - YouTube
- Data Portal for Confluent Stream Governance
Data Contracts for Schema Registry on Confluent Cloud
How To Improve Data Quality With Domain Validation Rules | Data Quality Rules - YouTube
Confluent Schema Registry adds support for tags, metadata, and rules, which together support the concept of a Data Contract. As a part of a Stream Governance solution, data contracts play a key role in ensuring data quality, data consistency, interoperability, and compatibility when sharing information across different systems or organizations.
Limitations
Current limitations are:
- Kafka Connect on Confluent Cloud does not support rules execution.
- Flink SQL and ksqlDB do not support rules execution in either Confluent Platform or Confluent Cloud.
- Confluent Control Center (Legacy) does not show the new properties for Data Contracts on the schema view page, in particular metadata and rules.
- Schema rules are only executed for the root schema, not referenced schemas. For example, given a schema named “Order” that references another schema named “Product” which has some rules attached to it, the serialization/deserialization of the “Order” object will not execute the rules of the “Product” schema.
- The non-Java clients (.NET, Go, Python, JavaScript) do not yet support the DLQ action.
- JavaScript and .NET do not support schema migration rules for Protobuf due to a limitation of the underlying third-party Protobuf libraries.
Understanding the scope of a data contract
A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. A schema is only one element of a data contract. A data contract specifies and supports the following aspects of an agreement (a minimal registration example follows the list):
- Structure. This is the part of the contract that is covered by the schema, which defines the fields and their types.
- Integrity constraints. This includes declarative constraints or data quality rules on the domain values of fields, such as the constraint that an age must be a positive integer.
- Metadata. Metadata is additional information about the schema or its constituent parts, such as whether a field contains sensitive information. Metadata can also include documentation for a data contract, such as who created it.
- Rules or policies. These data rules or policies can enforce that a field that contains sensitive information must be encrypted, or that a message containing an invalid age must be sent to a dead letter queue.
- Change or evolution. This implies that data contracts are versioned, and can support declarative migration rules for how to transform data from one version to another, so that even changes that would normally break downstream components can be easily accommodated.
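As a minimal sketch of what registering a data contract can look like, the snippet below posts an Avro schema together with owner metadata and a CEL domain-validation rule to the Schema Registry endpoint for a subject. The subject name, schema, rule, endpoint, and credentials are illustrative placeholders; the shape of the `metadata` and `ruleSet` fields follows the data contract documentation, but treat the exact payload as an assumption to adapt to your setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch: register a data contract (schema + metadata + rule) for the
// hypothetical subject "orders-value". Endpoint and credentials are placeholders.
public class RegisterDataContract {
  public static void main(String[] args) throws Exception {
    String sr = "https://psrc-xxxxx.us-east-2.aws.confluent.cloud"; // placeholder
    String auth = Base64.getEncoder().encodeToString("SR_KEY:SR_SECRET".getBytes());

    // Structure: an Avro record with an age field.
    String avroSchema = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}";

    // Metadata (owner info) plus a CEL condition rule enforced on write:
    // messages must carry a non-negative age.
    String body = """
        {
          "schemaType": "AVRO",
          "schema": %s,
          "metadata": { "properties": { "owner": "orders-team", "email": "orders@example.com" } },
          "ruleSet": {
            "domainRules": [{
              "name": "checkAge",
              "kind": "CONDITION",
              "type": "CEL",
              "mode": "WRITE",
              "expr": "message.age >= 0"
            }]
          }
        }""".formatted("\"" + avroSchema.replace("\"", "\\\"") + "\"");

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(sr + "/subjects/orders-value/versions"))
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/vnd.schemaregistry.v1+json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```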
Keeping in mind that a data contract is an agreement between an upstream component and a downstream component, note that:
- The upstream component enforces the data contract.
- The downstream component can assume that the data it receives conforms to the contract.
Data contracts are important because they provide transparency over dependencies and data usage in a stream architecture. They also help to ensure the consistency, reliability, and quality of the data in motion.
The upstream component could be an Apache Kafka® producer, with the Kafka consumer as the downstream component. But the upstream component could also be a Kafka consumer, with the downstream component being the application in which the Kafka consumer resides. This distinction matters for schema evolution, where the producer may be using a newer version of the data contract while the downstream application still expects an older version. In this case, the data contract is used by the Kafka consumer to mediate between the Kafka producer and the downstream application, ensuring that the data received by the application matches the older version of the data contract, possibly using declarative transformation rules to massage the data into the desired form.
Data Contracts for Schema Registry on Confluent Cloud | Confluent Documentation
Manage Schemas - Broker-Side Schema ID Validation
Schema ID Validation enables the broker to verify that data produced to a Kafka topic uses a valid schema ID in Schema Registry that is registered according to the subject naming strategy. (See also, Schemas, subjects, and topics.)
Schema Validation does not perform data introspection, but rather checks that the schema ID in the Wire Format is registered in Schema Registry under a valid subject.
You must use a serializer and deserializer (serdes) that respect the Wire format, or use a Confluent supported serde, as described in Formats, Serializers, and Deserializers.
Limitations
- The Schema Validation feature does not reject tombstone records (records with a null value), even if there is no schema ID associated with the record. Messages with a null value or a null key pass validation. This is a design choice that supports effective data management and deletion in compacted topics.
Prerequisites
- Schema ID Validation on Confluent Cloud is only available on Dedicated clusters through the hosted Schema Registry. Confluent Cloud brokers cannot use self-managed instances of Schema Registry, only the Confluent Cloud hosted Schema Registry. (Schema validation is available for on-premises deployments through Confluent Enterprise).
- You must have a Schema Registry enabled for the environment in which you are using Schema ID Validation.
- Schema ID Validation is bounded at the level of an environment. All Dedicated clusters in the same environment share a Schema Registry. Clusters do not have visibility into schemas across different environments.
Schema ID Validation Configuration options on a topic
Schema ID Validation is set at the topic level with the following parameters (a configuration sketch follows the tips below).
| Property | Description |
|---|---|
| confluent.key.schema.validation | When set to true, enables schema ID validation on the message key. The default is false. |
| confluent.value.schema.validation | When set to true, enables schema ID validation on the message value. The default is false. |
| confluent.key.subject.name.strategy | Set the subject name strategy for the message key. The default is io.confluent.kafka.serializers.subject.TopicNameStrategy. |
| confluent.value.subject.name.strategy | Set the subject name strategy for the message value. The default is io.confluent.kafka.serializers.subject.TopicNameStrategy. |
| confluent.schema.validation.context.name | Set the specific schema context that the broker searches when validating the schema ID of the message key and value. The default value of this property is default. |
Tip
- Value schema and key schema validation are independent of each other; you can enable either or both.
- The subject name strategy settings are tied to Schema ID Validation and have no effect when Schema ID Validation is not enabled.
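For example, the topic properties above can be set programmatically with the Kafka AdminClient. The sketch below enables key and value Schema ID Validation on a hypothetical `orders` topic on a Dedicated cluster; the bootstrap server and API credentials are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Sketch: enable broker-side Schema ID Validation on an existing topic named
// "orders" (hypothetical). Connection details are placeholders for a Dedicated
// Confluent Cloud cluster.
public class EnableSchemaIdValidation {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
        "pkc-xxxxx.us-east-2.aws.confluent.cloud:9092"); // placeholder
    props.put("security.protocol", "SASL_SSL");
    props.put("sasl.mechanism", "PLAIN");
    props.put("sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"API_KEY\" password=\"API_SECRET\";"); // placeholder

    try (Admin admin = Admin.create(props)) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
      Collection<AlterConfigOp> ops = List.of(
          // Require that message values carry a schema ID registered under the
          // subject derived from the configured subject name strategy.
          new AlterConfigOp(new ConfigEntry("confluent.value.schema.validation", "true"),
              AlterConfigOp.OpType.SET),
          // Key validation is independent and optional.
          new AlterConfigOp(new ConfigEntry("confluent.key.schema.validation", "true"),
              AlterConfigOp.OpType.SET));
      admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
    }
  }
}
```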
Schemas, subjects, and topics
A Kafka topic contains messages, and each message is a key-value pair. Either the message key or the message value, or both, can be serialized as Avro, JSON, or Protobuf. A schema defines the structure of the data format. The Kafka topic name can be independent of the schema name. Schema Registry defines a scope in which schemas can evolve, and that scope is the subject. The name of the subject depends on the configured subject name strategy, which by default is set to derive subject name from topic name.
Schema Registry Concepts for Confluent Cloud | Confluent Documentation
Subject name strategy
A serializer registers a schema in Schema Registry under a subject name, which defines a namespace in the registry:
- Compatibility checks are per subject
- Versions are tied to subjects
- When schemas evolve, they are still associated with the same subject but get a new schema ID and version
The subject name depends on the subject name strategy. The three supported strategies are:
| Strategy | Description |
|---|---|
| TopicNameStrategy | Derives subject name from topic name. (This is the default.) |
| RecordNameStrategy | Derives subject name from record name, and provides a way to group logically related events that may have different data structures under a subject. |
| TopicRecordNameStrategy | Derives the subject name from topic and record name, as a way to group logically related events that may have different data structures under a subject. |
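As a brief sketch, the subject name strategy is set on the client serializer configuration. The snippet below (assuming the Confluent Avro serializer) switches the value subject strategy from the default TopicNameStrategy to RecordNameStrategy; the bootstrap server, Schema Registry URL, and credentials are placeholders.

```java
import io.confluent.kafka.serializers.AbstractKafkaSchemaSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

// Sketch: a producer whose value schemas are registered under a subject derived
// from the Avro record name rather than the topic name.
public class RecordNameStrategyProducer {
  public static KafkaProducer<String, Object> create() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
        "pkc-xxxxx.us-east-2.aws.confluent.cloud:9092"); // placeholder
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
    props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
        "https://psrc-xxxxx.us-east-2.aws.confluent.cloud"); // placeholder
    props.put(AbstractKafkaSchemaSerDeConfig.BASIC_AUTH_CREDENTIALS_SOURCE, "USER_INFO");
    props.put(AbstractKafkaSchemaSerDeConfig.USER_INFO_CONFIG, "SR_KEY:SR_SECRET"); // placeholder
    // Override the default TopicNameStrategy so that logically related record
    // types can be produced to the same topic under separate subjects.
    props.put(AbstractKafkaSchemaSerDeConfig.VALUE_SUBJECT_NAME_STRATEGY,
        "io.confluent.kafka.serializers.subject.RecordNameStrategy");
    return new KafkaProducer<>(props);
  }
}
```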
Broker-Side Schema ID Validation on Confluent Cloud | Confluent Documentation
Manage Schemas - Schema Linking
Schema Linking keeps schemas in sync across two Schema Registry clusters. Schema Linking can be used in conjunction with Cluster Linking to keep both schemas and topic data in sync across two Schema Registry and Kafka clusters.
Contexts and exporters
Schema Registry introduces two new concepts to support Schema Linking:
- Contexts - A context represents an independent scope in Schema Registry, and can be used to create any number of separate “sub-registries” within one Schema Registry cluster. Each schema context is an independent grouping of schema IDs and subject names, allowing the same schema ID in different contexts to represent completely different schemas. Any schema ID or subject name without an explicit context lives in the default context, denoted by a single dot (.). An explicit context starts with a dot and can contain any parts separated by additional dots, such as .mycontext.subcontext. Context names operate similarly to absolute Unix paths, but with dots instead of forward slashes (the default context is like the root Unix path). However, there is no relationship between two contexts that share a prefix.
- Exporters - A schema exporter is a component that resides in Schema Registry for exporting schemas from one Schema Registry cluster to another. The lifecycle of a schema exporter is managed through APIs, which are used to create, pause, resume, and destroy a schema exporter. A schema exporter is like a “mini-connector” that can perform change data capture for schemas.
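To make the exporter lifecycle a little more concrete, the sketch below creates an exporter on a source Schema Registry that copies the hypothetical subject orders-value into a custom context on a destination registry. The URLs, credentials, names, and request body are illustrative assumptions modeled on the exporter API; pause, resume, and delete operations are managed through analogous calls.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch: create a schema exporter that mirrors "orders-value" from a source
// Schema Registry into the ".orders-mirror" context of a destination registry.
// All endpoints, credentials, and names are placeholders.
public class CreateSchemaExporter {
  public static void main(String[] args) throws Exception {
    String sourceSr = "https://psrc-source.us-east-2.aws.confluent.cloud"; // placeholder
    String auth = Base64.getEncoder()
        .encodeToString("SOURCE_SR_KEY:SOURCE_SR_SECRET".getBytes());

    // Illustrative exporter definition: which subjects to copy, which context
    // to copy them into, and how to reach the destination registry.
    String body = """
        {
          "name": "orders-exporter",
          "subjects": ["orders-value"],
          "contextType": "CUSTOM",
          "context": "orders-mirror",
          "config": {
            "schema.registry.url": "https://psrc-destination.us-west-2.aws.confluent.cloud",
            "basic.auth.credentials.source": "USER_INFO",
            "basic.auth.user.info": "DEST_SR_KEY:DEST_SR_SECRET"
          }
        }""";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(sourceSr + "/exporters"))
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```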
Limitations and considerations
- On Confluent Cloud, you can have a maximum of 10 exporters with the Essentials package and 100 in the Advanced package.
- One exporter can transfer multiple schemas.
- There is no upper limit on the number of schemas that can be transferred using an exporter.
Schema Linking for Confluent Cloud Developers | Confluent Documentation