on Tuesday I came back from holidays, and yesterday I attended the first conference at Codenode (Skillsmatter) related Data Automation. The talk was made by Thomas Kaliakos (Twitter – LinkedIn), a software developer at OVOEnergy.
The topic of the talks starts has been chosen starting from the importance of the data nowadays. Companies want to make decisions based on data and the bigger the quantity of data a company has, the stronger it is. But from the moment in which a stakeholder decides that he needs some data to the moment in which he gets it, lot of time can pass. This because lot of people are involved: a request goes to a BI analyst, who asks it to the data warehouse that has to setup the data source. This is slow.
Not only: it is unclear the owner of data and business logic, the capacity of data processing is limited and data is not automatically ingested.
After this intro, the talk discussed a bit about microservices (in a simple way). I covered much more complicated discussions in my three posts on microservices (first, second and third). He ended up underlining only the following pros of the architecture: scalable, clear ownership, small components, easy to create, indipendents. He also spoke about the problem of the “web of components” but only for introducing the concept of message broker.
He then introduced three brokers: AWS Kinesis (why not SQS?), RabbitMQ and Kafka. Of these they chose Kafka because it has a better performance than RabbitMQ and it is not binding the company to AWS like Kinesis (that, moreover, keeps messages only for 7 days… even if I am notsure why they need them for longer).
He then explained some details of Kafka, like the division of the topics in partitions, the fact that it uses faster access to low level operating system functions to boost the performance and other basic informations. It is fast, it can scale, it is resilient, so they found it a proper tool for they usage. They use Aiven as host in the cloud for their kafka instances. And for the message format, among the others (json, thrift, protobuf…) they chose Avro. For kafka manifesto Avro shoul be used and its schemas should stay in a central schema registry. As last concept related kafka, Thomas has shown how simple it is with Akka Streams to consume messages from kafka.
The Project Athena is the internal project aimed to create technology which enables easy discovery of (Big) Data. Basically, it is made by a Kafka cluster with the schema registry for Avro. A component, the Topic Controller, retrieves the topic list from Kafka all the schemas from the registry and inject all the ones that are connected to a topic into another component, the Kafka Connect. This component, having then all the topic configurations for extracting data from Kafka, takes this data and send it as events to BigQuery.
The schema description from Kafka is also sent to a component, the Description Updater, that eventually updates the BigQuery Table description.
This design has evolved again by introducing Apache Airflow to run a graph of timed jobs (like cronjobs). Jobs are ordered and can have metadata passed from one to another, and this lets the whole infrastructure extract and prepare the views that BI people and stakeholders can use to query for data. This because there is a separation between tables and views, both under the control of the Data Platform but structured in such way in Big Query that OVO BI can only use the views (both via a webapp or Tableau).
And finally they are thinking about using ElasticSearch for live data, and Dataflow as an unified model for data processing. This means that Kafka Connect is sending data also to ElasticSearch that then provides data to Dataflow that process it and creates the views for data analysis through the website or tools like the mentioned Tableau.
This basically closed the talk. Stay tuned!
10th September 2017 at 23:16
I am really glad that the talk inspired you for a blog blog and I hope you enjoyed it!
Regarding the “why not SQS” part, you are actually right…SQS is a valid option as a message broker . Two “drawbacks” though that it has is that as Kinesis, its an AWS only thing and the other is that the messages get deleted from the queue once consumed. Thus in order to have many consumers for the same message we must either implement an intermediate layer or copy the message to many queues !
In my understanding SQS is closer to RabbitMQ and Kinesis closer to Kafka!
12th September 2017 at 21:32
Yes yes, I wrote that because we made the same mistake in my previous company and we ended up using kinesis while we needed SQS. I may have not understood ( 😛 ) that you needed to process the same message with multiple component, that makes Kinesis (or maybe SNS?) more useful. Kinesis is fine if you may want to spawn another instance that needs to recompute all the messages from scratch.
Sns is a topic with subscription, but if you are not subscribed when the message arrives, you won’t process it.
Anyway, I agree with what you said, and I will read the two links.
🙂 Thank you for the talk!