Let’s clarify: I have been working with AWS for 7 years now, and at not one but two companies I have been forced to use Kinesis to develop components that didn’t need Kinesis. Every time I try to explain that this is a mistake, I have to face the bitter truth: no explanation wins against a nice name, especially if the pattern has been “successfully” used for years. Kinesis is sold as a “streaming service”, a pipeline needs a stream of data, and so 1+1=Kinesis!!!
Is that true? NO! Even AWS teaches you not to use Kinesis when it is not needed, if you take one of their courses. And in their documentation they try to make it clear: “Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. […] With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.”
This introduction targets the following patterns:
- analyze real-time data
- do machine learning and analytics
- respond instantly
- … possibly, if you really need it, ingest data for other applications
My teacher in the AWS course gave the following example: suppose you have to create a mobile app that collects the current location of the participants in a race and shows it on a map. That is a good use case for Kinesis: you have producers, the people running the race, and consumers, the people using the app. When a person starts the app, it receives all the data from the beginning of the Kinesis stream and draws the path dot by dot on the map. If they close the app and start it again later, the app downloads all the data again from the beginning to rebuild the path, and so does the app of any other user.
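To make the replay idea concrete, here is a minimal sketch of what the app’s reader could look like. It assumes `kinesis` is a boto3 Kinesis client (or anything with the same `get_shard_iterator` and `get_records` methods), a single shard, and records whose payload is JSON with `lat` and `lon` fields. The payload format and the function name are my assumptions for illustration, not something from the course.

```python
import json


def replay_path(kinesis, stream_name, shard_id, batch_size=100):
    """Rebuild the whole path by reading one shard from its oldest record."""
    # TRIM_HORIZON means "start at the oldest record still retained in the shard",
    # so every app start replays the same history.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    path = []
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=batch_size)
        for record in response["Records"]:
            point = json.loads(record["Data"])  # hypothetical payload: {"lat": ..., "lon": ...}
            path.append((point["lat"], point["lon"]))
        # Stop once we have caught up with the tip of the stream.
        # A live app would keep polling from here instead of returning.
        if not response["Records"] and response.get("MillisBehindLatest", 0) == 0:
            break
        iterator = response.get("NextShardIterator")
    return path
```

Note that nothing is ever deleted by the reader: two users, or the same user twice, get exactly the same records. That is the property this use case needs, and the one a pipeline does not.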
This is absolutely a good scenario in which the pattern of a Kinesis stream works!!! But suppose you have a stream of data (unfortunately the word “stream” is what made Kinesis so popular, but that is obviously a misunderstanding of the concept) that has to be processed by component A, which sends a result to B, which sends a result to C and D, while C may send its result to E or to F and so on… THAT IS A FUXXING PIPELINE!!! And no, Kinesis is not the right pattern!!!
Why is Kinesis not the right pattern? Well, for several reasons:
- a pipeline usually needs to process each record once, so there is no need to keep the processed records
- there is usually no need for analytics
- when your component reads a Kinesis stream (typically through the Kinesis Client Library), it needs to store its leases, usually in a DynamoDB table, which increases the complexity
- lease coordination generates issues, as consumers may lose the lease and throw exceptions
- there is no easy way to know the number of messages in the stream, or even worse, how many messages are left to process for a single application
- Kinesis can scale, but scaling is not such an obvious matter (you scale by splitting a shard into child shards, but the old data stays in the parent shard, which stops accepting new records)
- scaling Kinesis is complex, and producers can be throttled during a peak before the scaling takes effect
- there are separate read and write throughput limits per shard, and this complicates deciding the number of shards
- consumers share the read throughput of the stream if fan-out is not used. Fan-out is difficult to manage in real time, because it needs time to start and replicate. Both of these things may affect scaling during a peak
- unevenly distributed hash keys may lead to a hot shard and throttled consumers, while CloudWatch may look absolutely fine (because metrics have to be enabled per shard)
- connecting two components with a Kinesis stream definitely opens up ownership problems: who owns the stream? Who can use it for performance/integration tests? …
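The hot shard problem is easier to see with numbers. Kinesis maps each partition key to a 128-bit integer using MD5, and each shard owns a range of that hash space. The sketch below reproduces that mapping locally, under the simplifying assumption that the shards split the hash space into equal ranges (true for a freshly created stream, not necessarily after splits and merges). The function names are mine, for illustration only.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard that would receive this key, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def records_per_shard(partition_keys, shard_count):
    """Count how many records land on each shard for a given list of keys."""
    counts = Counter(shard_for_key(key, shard_count) for key in partition_keys)
    return [counts.get(index, 0) for index in range(shard_count)]


if __name__ == "__main__":
    # One very chatty producer plus 100 quiet ones, on 4 shards.
    keys = ["runner-1"] * 900 + [f"runner-{i}" for i in range(2, 102)]
    print(records_per_shard(keys, 4))
```

With a skewed key like this, one shard takes at least 900 of the 1000 records no matter how many shards you add, while the stream-level totals still look healthy.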
It is obvious that ALL THOSE PROBLEMS ARE NOT BLOCKING, meaning that you can find an acceptable solution for each of them and you can still use Kinesis and be happy, have a great life, lots of children and so on… but WHY???
For pipelines, AWS suggests a different pattern: each component that acts as a consumer has one or more queues (for example an input queue plus a Dead Letter Queue, or DLQ) to persist data in case of an outage, and each component that acts as a producer writes to an SNS topic. Therefore, a component that is both producer and consumer will have both queues and an SNS topic.
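As a rough sketch of the wiring, these are the two queue attributes a consumer needs: an access policy that lets the producer’s topic deliver into the queue, and a redrive policy that points at the DLQ. The function name, the ARNs in the usage example and the default of 5 receive attempts are hypothetical choices of mine.

```python
import json


def consumer_queue_attributes(queue_arn, topic_arn, dlq_arn, max_receive_count=5):
    """Build the SQS attributes for a consumer queue fed by one SNS topic."""
    # Allow only this specific topic to send messages into the queue.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    # After max_receive_count failed receives, SQS moves the message to the DLQ.
    redrive = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects both attributes as JSON strings.
    return {"Policy": json.dumps(policy), "RedrivePolicy": json.dumps(redrive)}


if __name__ == "__main__":
    attributes = consumer_queue_attributes(
        queue_arn="arn:aws:sqs:eu-west-1:123456789012:component-b-input",
        topic_arn="arn:aws:sns:eu-west-1:123456789012:component-a-output",
        dlq_arn="arn:aws:sqs:eu-west-1:123456789012:component-b-input-dlq",
    )
    print(attributes["RedrivePolicy"])
```

The resulting dictionary is what you would pass as the queue’s attributes (for example to `set_queue_attributes` in boto3) before subscribing the queue to the topic with the `sqs` protocol. In practice you would probably declare the same thing in CloudFormation or Terraform.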
What are the advantages of this pattern? The following are only a few of them:
- “Amazon SQS leverages AWS to dynamically scale based on demand. SQS scales elastically with your application so you don’t have to worry about capacity planning and pre-provisioning. There is no limit to the number of messages per queue, and standard queues provide nearly unlimited throughput”
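- you process a message and then delete it from the queue, so the queue depth always tells you how many messages your component still has to process (SQS reports this number as an approximation, which is close enough for monitoring and autoscaling)
- it is really easy to have a retry mechanism and dead letter queues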
- no sharding and hashing needed
- no lease coordination
- no throttling to plan around, at least with standard queues
- the ownership is clear: if producer and consumer have to run tests at the same time, it is enough to remove the subscription of the queue to the SNS topic (an architectural change) and everything keeps working as before
- possibly, if needed for real-time data analytics, a Kinesis stream can be fed from the SNS topic of a producer, enabling the real-time data analysis without the annoyance of sharing throughput, throttles, scaling…
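To show how little code the consumer side needs, here is a minimal sketch of one polling round plus a backlog check. It assumes `sqs` is a boto3 SQS client (or an object with the same `receive_message`, `delete_message` and `get_queue_attributes` methods). `handler` is a hypothetical callback that holds your business logic and raises on failure.

```python
def backlog(sqs, queue_url):
    """Approximate number of messages still waiting in the queue."""
    attributes = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    return int(attributes["ApproximateNumberOfMessages"])  # SQS returns strings


def process_batch(sqs, queue_url, handler):
    """Receive up to 10 messages, delete the ones handled successfully."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # maximum allowed per call
        WaitTimeSeconds=20,      # long polling
    )
    processed = failed = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Not deleted: the message becomes visible again after the
            # visibility timeout, and the redrive policy eventually sends
            # it to the DLQ.
            failed += 1
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed, failed
```

There are no leases, no checkpoints and no shard iterators here: retries come from simply not deleting the message, and the backlog is a single attribute on the queue.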
I am sure that if I think about it I can find other reasons not to use Kinesis in pipelines, and still I keep working in companies that use this pattern. I hope this post will help some young and motivated developer bring a wind of change to their company, with a list of arguments to stop adopting Kinesis and start using SNS+SQS.