Below are my recent notes and thoughts collected during the recent work with Kafka, to build data streaming pipelines between data warehouses and data lakes. Maybe someone will benefit.
Some points on picking (or not picking) Kafka as the solution
- Kafka originated at LinkedIn, who remains a major user. Kafka is now an open-source Apache project
- Kafka is good as glue between components. Connecting n data sources to m data sinks needs n*m integration points. Doing the same with Kafka needs only n+m integration points
- That said, anything can act as such glue: a REST microservice framework, a data warehouse, or even raw storage (HDFS/S3). Consider Kafka if your data can be abstracted as time-stamped events
- If your data sources/sinks are on Amazon AWS, consider Kinesis as your data glue. Many people use both Kafka and Kinesis in one architecture
- Kafka is typically used to (1) transfer data, (2) store data, and (3) process data (ETL)
- Kafka is extremely popular and used by: Linkedin, Airbnb, Pinterest, Uber, Netflix, Homeaway, Spotify, Coursera, Expedia among others. I saw a quote that 35% of Fortune 500 use Kafka.
- Kafka has several distributions. Besides Apache, there’s Confluent (founded by original developers of Kafka), Cloudera, Hortonworks…
- Kafka is run as a cluster, for parallel execution. Streams for records are grouped and stored in categories called topics. A record has a key, a value and a timestamp.
- Kafka servers are called brokers. A topic data, divided into partitions, may be distributed between brokers
- Traditional messaging frameworks fall into either a queue model (where each record goes to one consumer) or pub-sub model (where each record is copied to all consumers)
- Kafka merges both approaches, due to powerful concept of consumer groups. Each record is broadcast to all consumer groups. Within a consumer group however, each record will be given to one consumer only.
- Note the concept of a consumer group is intrinsically linked to the concept of topic partitions. It is a good idea to have as many consumers within group as partitions.
- If there’re more consumers in a group than paritions, some consumers will get no data. If you add new consumer, it will take over some partitons from old members. If you remove a consumer from the group, its partition will be released and allocated to other member.
- Programmatically, Kafka has: Producer API, Consumer API, Streams API (for applications that process streams), and Connector API (to connect to third party resources).
- Streams API is built on the primitives that Producer and Consumer API provide. The library is for advanced processing such as: out-of-sync data and non-trivial aggregations
- with many asynchronous consumers and producers, you will indeed run into issues with ordering data that is out of sync and missing data.
Details to consider
- For your serialized data, use strongly typed data formats eventhough Kafka itself will not enforce it. Enforce schemas
- Use library to enforce schema validation at serialization/deserialization time. Many people use Avro
- Consider central schema registry, available for both data sources and data sinks. Data will be lighter if it travels without schema
- Avro allows schema to travel with the data, which is good for schema versioning consideration. Schema is defined in JSON. Avro generally supports schema evolution
- Avro stores schema separately from the data. Data is stored in compressed format. Avro is row-based
- Other big data storage format worth knowing are Parquet and ORC (both columnar, but with files organized into batches of rows), also SequenceFile, especially if destination is HDFS
- Avro has been developed by Hadoop. Parquet by Cloudera & Twitter and used with Apache Impala. ORC by Hortonworks, designed for Hive
- Most people use Kafka for transient data storage only. Kafka retention policy can be set to, for instance, a few days.
- For longer permanent storage, consider transporting data elsewhere such as HDFS.
- Also, when more robust data processing is needed, a data lake might be a better place, since the events abstraction might not be useful at this point (we might want the data in a different form), and more data processing tools are readily available