Aiven source connector for Salesforce on Apache Kafka Architecture Notes
Bulk API
The Bulk API 2.0 is used to move large amounts of data from Salesforce. When we query the data, every matching entry is returned unless the connector is restarted or stopped in the middle of processing. If data changes multiple times between queries, only the data available when the Bulk API is re-queried is returned. This means that intermediate changes will not all be sent to Kafka.
For example, if the query "SELECT Id, Name, BillingAddress, LastModifiedDate FROM Account" is made, and the Name of an entry was updated twice, from "Company Ltd" to "Company Enterprise" to "Company AG", then the Name on the event sent to Kafka will be "Company AG", e.g.:
{"Id" : "abcd", "Name":"Company AG","BillingAddress":"Bill to here","LastModifiedDate":"2025-10-16T11:37:07.000Z"}
Of course, if there are multiple updates for a record and they affect different fields, the latest data for each field is sent to Kafka, as long as that field is included in the 'SELECT' portion of the configured query.
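The "latest state wins" behavior described above can be sketched as follows. This is an illustrative model, not the connector's actual code: a point-in-time Bulk API query only sees the final state of each record, so intermediate changes are collapsed before anything reaches Kafka.

```python
# Hypothetical sketch: between two Bulk API polls, only the latest state
# of each record is visible, so intermediate changes never reach Kafka.
updates = [
    {"Id": "abcd", "Name": "Company Ltd"},
    {"Id": "abcd", "Name": "Company Enterprise"},
    {"Id": "abcd", "Name": "Company AG"},
]

def latest_state(changes):
    """Collapse a change history to the last value per record Id,
    mimicking what a point-in-time Bulk API query returns."""
    state = {}
    for change in changes:
        state[change["Id"]] = change  # later changes overwrite earlier ones
    return state

snapshot = latest_state(updates)
print(snapshot["abcd"]["Name"])  # Company AG
```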
Restart and recovery
Salesforce objects act like database tables: they are continually updated and require special attention, especially on restarts. On a restart, the last recorded LastModifiedDate is retrieved from the offset for each query. Currently a standard shutdown or restart is handled the same way as a crash and restart:
- On restart
- With no data change
    - The offset is retrieved and the LastModifiedDate from it is used to query the data; the query includes (>=) the date that was found in the offset
    - A duplicate event will be sent. This is deliberate, to ensure that all updates made at that timestamp are processed and sent to Kafka, since bulk updates sent to Salesforce (or edits made in the UI) can result in more than one record existing at that LastModifiedDate
- With data changed
    - The offset is retrieved and the LastModifiedDate from it is used to query the data; the query includes (>=) the date that was found in the offset
    - If the original piece of data at that timestamp has since been updated, no duplicate event is sent, just an update with the latest data
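The restart flow above can be sketched as follows. This is a simplified model with made-up record data, not the connector's implementation: the stored LastModifiedDate is re-used inclusively (>=), so a record whose timestamp equals the offset date is returned again, which is the intentional duplicate described above.

```python
from datetime import datetime, timezone

# Hypothetical sketch of the restart behaviour: re-query with
# LastModifiedDate >= the date stored in the offset, ordered by date.
records = [
    {"Id": "abcd", "Name": "Company AG",
     "LastModifiedDate": datetime(2025, 10, 16, 11, 37, 7, tzinfo=timezone.utc)},
    {"Id": "efgh", "Name": "Other Co",
     "LastModifiedDate": datetime(2025, 10, 16, 9, 0, 0, tzinfo=timezone.utc)},
]

def query_since(rows, offset_date):
    """Return rows with LastModifiedDate >= offset_date, ordered by date,
    as the connector's re-seeded query would."""
    matching = [r for r in rows if r["LastModifiedDate"] >= offset_date]
    return sorted(matching, key=lambda r: r["LastModifiedDate"])

# Restart with the offset at abcd's exact timestamp: abcd is re-emitted
# (the deliberate duplicate), while older records are skipped.
offset = datetime(2025, 10, 16, 11, 37, 7, tzinfo=timezone.utc)
print([r["Id"] for r in query_since(records, offset)])  # ['abcd']
```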
Offsets
How the offsets are handled
- We include a hash of the query, the LastExecutionDate, and the API name as part of the offset key
- We include the LastModifiedDate, jobId, nextLocator, and recordCount in the offset data
- Because we order the result set that is returned (by LastModifiedDate), we are able to use the LastModifiedDate in the offset; on a restart we read the offset and seed the correct LastModifiedDate into the next query
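A minimal sketch of the offset layout described above is given below. The field names match the list above, but the exact serialization and the choice of hash function (SHA-256 here) are assumptions, not the connector's actual code.

```python
import hashlib
import json

# Hypothetical sketch of the offset key/value layout. The hash function
# and field spellings are illustrative assumptions.
def offset_key(query, last_execution_date, api_name):
    """Build the offset key from a hash of the query, the
    LastExecutionDate, and the API name."""
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {"queryHash": query_hash,
            "lastExecutionDate": last_execution_date,
            "apiName": api_name}

def offset_value(last_modified_date, job_id, next_locator, record_count):
    """Build the offset data: LastModifiedDate, jobId, nextLocator,
    and recordCount."""
    return {"lastModifiedDate": last_modified_date,
            "jobId": job_id,
            "nextLocator": next_locator,
            "recordCount": record_count}

key = offset_key("SELECT Id, Name FROM Account", "2025-10-16", "Account")
value = offset_value("2025-10-16T11:37:07.000Z", "job-001", "locator-2", 500)
print(json.dumps(key))
print(json.dumps(value))
```

Hashing the query keeps the key compact and ensures that changing the configured query starts a fresh offset stream rather than reusing a stale LastModifiedDate.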