When you consume data from Kafka, it is important to understand the consumer group concept.
Within a consumer group, each partition of a topic is assigned to exactly one consumer, so the group determines which consumer receives the messages from which partition.
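For example, suppose a topic has 10 partitions and you run two consumers. If the two consumers belong to two different groups, each of them receives every message, because each group consumes the whole topic independently.
If both consumers belong to the same group, the 10 partitions are split between them. Assuming messages are spread evenly across partitions, each consumer then receives roughly 5 of every 10 messages produced, because it is only responsible for its own half of the partitions.
Kafka only guarantees ordering within a partition, so a consumer that was expected to see the whole topic can appear to receive “out-of-order” or missing messages when it is in fact sharing a group with another consumer. If you only need to move data between Kafka and another system, Kafka Connect is an alternative to writing your own consumer.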
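The difference between sharing a group and using separate groups can be illustrated with a small simulation. This is only a toy model: the function name, the group names, and the simple round-robin rule are assumptions made for illustration, and real Kafka assignors may distribute partitions differently.

```python
def assign_partitions(num_partitions, groups):
    """Toy assignment: `groups` maps a group id to its list of consumer names.

    Each group independently covers every partition of the topic, and
    within a group each partition is owned by exactly one consumer.
    """
    assignment = {}
    for group_id, consumers in groups.items():
        per_consumer = {name: [] for name in consumers}
        for partition in range(num_partitions):
            # Simple round-robin; not Kafka's actual assignment strategy.
            owner = consumers[partition % len(consumers)]
            per_consumer[owner].append(partition)
        assignment[group_id] = per_consumer
    return assignment


# Two consumers in the same (hypothetical) group split the 10 partitions.
same_group = assign_partitions(10, {"orders-app": ["c1", "c2"]})

# Two consumers in different groups each own all 10 partitions.
separate_groups = assign_partitions(10, {"billing": ["c1"], "audit": ["c2"]})

print(same_group["orders-app"])
print(separate_groups["billing"], separate_groups["audit"])
```

In the first case each consumer owns five partitions and sees only the messages written to them. In the second case both consumers see the whole topic.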
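To avoid data loss on the consumer side, set the following properties on the consumer (they are consumer client settings, not Kafka server settings): “group.id” and “auto.offset.reset”.
With “auto.offset.reset” set to “earliest”, a consumer group that has no committed offset yet starts from the beginning of each partition, so it does not skip messages that were produced before it first started.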
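As a rough sketch, the two consumer settings could be collected like this before being handed to a client. The keys are the property names used by the Java client, and client libraries in other languages may spell keys or values differently. The helper name, the group id “foo”, and the broker address are placeholders chosen for illustration.

```python
# Values accepted by the Java client for auto.offset.reset.
VALID_OFFSET_RESETS = {"earliest", "latest", "none"}


def build_consumer_config(group_id, offset_reset="earliest",
                          bootstrap_servers="localhost:9092"):
    """Return a dict of consumer properties (hypothetical helper)."""
    if not group_id:
        raise ValueError("group.id must be a non-empty string")
    if offset_reset not in VALID_OFFSET_RESETS:
        raise ValueError(f"unsupported auto.offset.reset: {offset_reset!r}")
    return {
        "bootstrap.servers": bootstrap_servers,  # placeholder broker address
        "group.id": group_id,
        "auto.offset.reset": offset_reset,
    }


config = build_consumer_config("foo")
print(config)
```

Validating the value early catches mistakes such as passing an unsupported name for the reset policy before the consumer is created.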
Group Id: The group id (“group.id”) identifies the group of consumers that share the work of consuming messages from a topic.
This setting is required whenever consumers subscribe to a topic through group management or commit offsets to Kafka, which is the usual case when more than one consumer instance reads data from Kafka.
All consumers that should share the work must use the same group identifier (e.g., “foo”), and each independent application reading data from the same Kafka topic must use its own identifier.
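The consumer client does not assign a group id automatically. In the Java client, a consumer without a group id can only read partitions that it assigns to itself manually and cannot commit offsets to Kafka, so it is safer to set a group id even when only one consumer instance reads data from a Kafka topic.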
Auto Offset Reset: The “auto.offset.reset” setting determines where the Kafka consumer starts reading when its group has no committed offset for a partition, or when the committed offset is no longer valid (for example, because the data was deleted by retention). If a valid committed offset exists, the consumer resumes from it and this setting is ignored. The three standard values for this property are:
1) “none”: The consumer does not pick a position on its own. If no committed offset is found for the group, the consumer throws an exception and the application must decide where to start.
2) “earliest”: The consumer starts from the beginning of the partition. No retained messages are skipped, but a new group will process every message still stored in the partition, including old ones.
3) “latest”: This is the default setting. The consumer starts from the end of the partition and only receives messages produced after it starts. Messages written before that point are skipped, which can look like data loss for a new consumer group.
This setting is optional. If this property is not explicitly set, it defaults to “latest”. Note that the value only matters when no valid committed offset exists, typically the first time a consumer group starts. During normal operation and after a restart, a consumer resumes from its last committed offset regardless of which value you specify.
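The way a starting position is chosen can be sketched as a small function. This is a simplified model of the behavior and not the client's actual code. The function, its parameters, and the exception class are hypothetical names; `log_start` stands for the oldest offset still retained and `log_end` for the offset the next produced message will get.

```python
class NoCommittedOffsetError(Exception):
    """Raised in this sketch when the policy is 'none' and no usable offset exists."""


def resolve_start_offset(committed, log_start, log_end, policy="latest"):
    """Return the offset a consumer would start reading from."""
    # A valid committed offset always wins; the policy is not consulted.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    if policy == "earliest":
        return log_start  # beginning of the retained data
    if policy == "latest":
        return log_end    # only messages produced from now on
    if policy == "none":
        raise NoCommittedOffsetError("no valid committed offset for this group")
    raise ValueError(f"unsupported auto.offset.reset: {policy!r}")


# A group that committed offset 42 resumes there, whatever the policy is.
print(resolve_start_offset(42, 0, 100, "earliest"))

# A brand-new group: the policy decides between the two ends of the partition.
print(resolve_start_offset(None, 0, 100, "earliest"))
print(resolve_start_offset(None, 0, 100, "latest"))
```

The out-of-range case follows the same path as a missing offset: if the committed offset points at data that has already been deleted, the policy decides where reading continues.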
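In this article, we learned how to reduce the risk of data loss on the Kafka consumer side by using the following settings: “group.id” and “auto.offset.reset”.
With a stable group id and “auto.offset.reset” set to “earliest”, a consumer resumes from its committed offsets and does not skip existing messages the first time it starts. These are critical settings to be aware of when consuming data from Kafka.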