apache-flink Consume data from Kafka Kafka partitions and Flink parallelism


In kafka, each consumer from the same consumer group gets assigned one or more partitions. Note that it is not possible for two consumers to consume from the same partition. The number of flink consumers depends on the flink parallelism (defaults to 1).

There are three possible cases:

  1. kafka partitions == flink parallelism: this case is ideal, since each consumer takes care of one partition. If your messages are balanced between partitions, the work will be evenly spread across flink operators;

  2. kafka partitions < flink parallelism: some flink instances won't receive any messages. To avoid that, you need to call rebalance on your input stream before any operation, which causes data to be re-partitioned:

    inputStream = env.addSource(new FlinkKafkaConsumer10("topic", new SimpleStringSchema(), properties));
        .map(s -> "message" + s)
  1. kafka partitions > flink parallelism: in this case, some instances will handle multiple partitions. Once again, you can use rebalance to spread messages evenly accross workers.