The Cloud Storage sink connector periodically polls data from Pulsar and moves it to objects in cloud storage (AWS S3, Google GCS, and so on) in Avro, JSON, or Parquet format without duplicates. Depending on your environment, the Cloud Storage sink connector can export data with exactly-once delivery semantics to its consumers.
The Cloud Storage sink connector provides partitioners that support default partitioning based on Pulsar partitions and time-based partitioning in days or hours. A partitioner is used to split the data of every Pulsar partition into chunks. Each chunk of data acts as an object whose virtual path encodes the Pulsar partition and the start offset of this data chunk. The size of each data chunk is determined by the number of records written to objects in cloud storage and by schema compatibility. If no partitioner is specified in the configuration, the default partitioner, which preserves Pulsar partitioning, is used.
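With the time-based partitioner and Parquet output, for example, each object's virtual path contains a date segment, as in the validation example later on this page. The sketch below is purely illustrative (bucket, topic, dates, and sequence IDs are placeholder values):

testBucket/
  public/default/test-parquet-avro/2020-09-14/1234.parquet
  public/default/test-parquet-avro/2020-09-15/5678.parquet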
The Cloud Storage sink connector provides the following features:

- Format: supports writing objects in Avro, JSON, or Parquet format.
- Partitioner: supports default partitioning based on Pulsar partitions, as well as time-based partitioning through the `TimeBasedPartitioner` class, which partitions data based on the `publishTime` timestamp of Pulsar messages. Time-based partitioning options are daily or hourly.

Download the NAR file of the Cloud Storage sink connector and then you can use the connector with Pulsar IO. For details, see Usage.

You can also build the Cloud Storage sink connector from the source code. For details, see here.
The Cloud Storage sink connector supports the following properties.
Name | Type | Required | Default | Description |
---|---|---|---|---|
`provider` | String | True | null | The Cloud Storage type, such as `aws-s3` or `gcs`. |
`accessKeyId` | String | True | null | The Cloud Storage access key ID. |
`secretAccessKey` | String | True | null | The Cloud Storage secret access key. |
`role` | String | False | null | The Cloud Storage role. |
`roleSessionName` | String | False | null | The Cloud Storage role session name. |
`bucket` | String | True | null | The Cloud Storage bucket. |
`endpoint` | String | False | null | The Cloud Storage endpoint. |
`formatType` | String | False | "json" | The data format type. Available options are `json`, `avro`, and `parquet`. By default, it is set to `json`. |
`partitionerType` | String | False | "partition" | The partitioning type. It can be configured by topic partitions or by time. By default, the partitioning type is configured by topic partitions. |
`timePartitionPattern` | String | False | "yyyy-MM-dd" | The format pattern of the time-based partitioning. For details, refer to the Java date and time format. |
`timePartitionDuration` | String | False | "1d" | The time interval for time-based partitioning, such as `1d` or `1h`. |
`batchSize` | int | False | 10 | The number of records submitted in a batch. |
`batchTimeMs` | long | False | 1000 | The interval, in milliseconds, for batch submission. |
Before using the Cloud Storage sink connector, you need to create a configuration file through one of the following methods.
JSON
{
    "tenant": "public",
    "namespace": "default",
    "name": "cloud-storage-sink",
    "inputs": [
        "user-avro-topic"
    ],
    "archive": "connectors/pulsar-io-cloud-storage-0.0.1.nar",
    "parallelism": 1,
    "configs": {
        "provider": "aws-s3",
        "accessKeyId": "accessKeyId",
        "secretAccessKey": "secretAccessKey",
        "role": "none",
        "roleSessionName": "none",
        "bucket": "testBucket",
        "region": "local",
        "endpoint": "us-standard",
        "formatType": "parquet",
        "partitionerType": "time",
        "timePartitionPattern": "yyyy-MM-dd",
        "timePartitionDuration": "1d",
        "batchSize": 10,
        "batchTimeMs": 1000
    }
}
YAML
tenant: "public"
namespace: "default"
name: "cloud-storage-sink"
inputs:
- "user-avro-topic"
archive: "connectors/pulsar-io-cloud-storage-0.0.1.nar"
parallelism: 1
configs:
  provider: "aws-s3"
  accessKeyId: "accessKeyId"
  secretAccessKey: "secretAccessKey"
  role: "none"
  roleSessionName: "none"
  bucket: "testBucket"
  region: "local"
  endpoint: "us-standard"
  formatType: "parquet"
  partitionerType: "time"
  timePartitionPattern: "yyyy-MM-dd"
  timePartitionDuration: "1d"
  batchSize: 10
  batchTimeMs: 1000
Prepare the Cloud Storage service. This example uses s3mock.
docker pull apachepulsar/s3mock:latest
docker run -p 9090:9090 -e initialBuckets=pulsar-integtest apachepulsar/s3mock:latest
Put the pulsar-io-cloud-storage-2.5.1.nar file in the Pulsar connectors directory.
cp pulsar-io-cloud-storage-2.5.1.nar $PULSAR_HOME/connectors/pulsar-io-cloud-storage-2.5.1.nar
Start Pulsar in standalone mode.
$PULSAR_HOME/bin/pulsar standalone
Run the Cloud Storage sink connector locally.
$PULSAR_HOME/bin/pulsar-admin sink localrun --sink-config-file cloud-storage-sink-config.yaml
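The `localrun` command starts the sink in a local process, which is convenient for testing. To submit the sink to a running Pulsar cluster instead, a sketch of the equivalent command (assuming the same configuration file) is:

$PULSAR_HOME/bin/pulsar-admin sink create --sink-config-file cloud-storage-sink-config.yaml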
Send Pulsar messages. Currently, only Avro and JSON schemas are supported.
import java.util.Arrays;
import java.util.List;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

try (
    PulsarClient pulsarClient = PulsarClient.builder()
            .serviceUrl("pulsar://localhost:6650")
            .build();
    // TestRecord is a user-defined POJO; see the sketch below.
    Producer<TestRecord> producer = pulsarClient.newProducer(Schema.AVRO(TestRecord.class))
            .topic("public/default/test-parquet-avro")
            .create()
) {
    List<TestRecord> testRecords = Arrays.asList(
            new TestRecord("key1", 1, null),
            new TestRecord("key2", 1, new TestRecord.TestSubRecord("aaa"))
    );
    for (TestRecord record : testRecords) {
        producer.send(record);
    }
}
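`TestRecord` in the snippet above is not provided by the connector; it is a user-defined POJO that Pulsar turns into an Avro schema via `Schema.AVRO(TestRecord.class)`. A minimal sketch that matches the constructor calls above (field names and types are assumptions) could look like this:

// Hypothetical POJO; adjust field names and types to your own data model.
public class TestRecord {
    public String key;
    public int value;
    public TestSubRecord subRecord;

    public TestRecord() {
        // no-arg constructor, typically needed for Avro reflection-based schemas
    }

    public TestRecord(String key, int value, TestSubRecord subRecord) {
        this.key = key;
        this.value = value;
        this.subRecord = subRecord;
    }

    public static class TestSubRecord {
        public String field;

        public TestSubRecord() {
        }

        public TestSubRecord(String field) {
            this.field = field;
        }
    }
}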
Validate the Cloud Storage data.
You can use jclouds to verify that the file exists at the expected path, as shown below.
import java.io.File;
import java.util.Properties;

import org.apache.pulsar.functions.utils.FunctionCommon;
import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.BlobStoreContext;
import org.junit.Assert;

Properties overrides = new Properties();
overrides.put("jclouds.s3.virtual-host-buckets", "false");

BlobStoreContext blobStoreContext = ContextBuilder.newBuilder("aws-s3")
        .credentials(
            "accessKeyId",
            "secretAccessKey"
        )
        .endpoint("http://localhost:9090") // replace with your s3mock URL
        .overrides(overrides)
        .buildView(BlobStoreContext.class);
BlobStore blobStore = blobStoreContext.getBlobStore();

// "message" is the Pulsar message produced above; its message ID determines the object name.
final long sequenceId = FunctionCommon.getSequenceId(message.getMessageId());
final String path = "public/default/test-parquet-avro" + File.separator + "2020-09-14" + File.separator + sequenceId + ".parquet";
final boolean blobExists = blobStore.blobExists("testBucket", path);
Assert.assertTrue("the sink record does not exist", blobExists);
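If you do not know the exact object path up front, you can also list the bucket with the same jclouds `BlobStore` instance and print every object name. A minimal sketch continuing the snippet above (`PageSet` and `StorageMetadata` are in `org.jclouds.blobstore.domain`, `ListContainerOptions` in `org.jclouds.blobstore.options`):

// List all objects in the bucket recursively and print their paths.
PageSet<? extends StorageMetadata> objects =
        blobStore.list("testBucket", ListContainerOptions.Builder.recursive());
for (StorageMetadata object : objects) {
    System.out.println(object.getName());
}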
You can find the data in your `testBucket` bucket. The path is something like `public/default/test-parquet-avro/2020-09-14/1234.parquet`.
The path consists of three parts: the basic part of the topic, the partition information, and the format suffix.

- `public/default/test-parquet-avro/`: this part consists of the tenant, namespace, and topic name of the input topic.
- `2020-09-14/${messageSequenceId}`: the date is generated based on the `partitionerType` parameter in the configuration, and `${messageSequenceId}` is generated by `FunctionCommon.getSequenceId(message.getMessageId())`.
- `.parquet`: this suffix is generated based on the `formatType` parameter in the configuration.