On Tuesday I came back from holidays, and yesterday I attended my first conference at Codenode (Skillsmatter) related to Data Automation. The talk was given by Thomas Kaliakos (Twitter – LinkedIn), a software developer at OVO Energy.
The topic was chosen because of the importance of data nowadays. Companies want to make decisions based on data, and the more data a company has, the stronger it is. But a lot of time can pass between the moment a stakeholder decides he needs some data and the moment he gets it. This is because many people are involved: a request goes to a BI analyst, who passes it on to the data warehouse team, which has to set up the data source. This is slow.
On top of that, the owner of the data and of the business logic is unclear, the data processing capacity is limited, and data is not ingested automatically.
After this intro, the talk touched on microservices, briefly and in simple terms. I covered much more complicated discussions in my three posts on microservices (first, second and third). He ended up underlining only the following pros of the architecture: scalability, clear ownership, small components, ease of creation and independence. He also spoke about the problem of the “web of components”, but only to introduce the concept of a message broker.
He then introduced three brokers: AWS Kinesis (why not SQS?), RabbitMQ and Kafka. Of these they chose Kafka because it performs better than RabbitMQ and does not tie the company to AWS as Kinesis does (Kinesis, moreover, keeps messages for only 7 days… although I am not sure why they need them for longer).
He then explained some details of Kafka, like the division of topics into partitions, the fact that it relies on fast, low-level operating system functions to boost performance, and other basic information. It is fast, it can scale and it is resilient, so they found it a proper tool for their usage. They use Aiven to host their Kafka instances in the cloud. For the message format, among the alternatives (JSON, Thrift, Protobuf…) they chose Avro. According to the Kafka manifesto, Avro should be used and its schemas should live in a central schema registry. As a last Kafka-related concept, Thomas showed how simple it is to consume messages from Kafka with Akka Streams.
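To make the idea of partitions a bit more concrete, here is a minimal sketch of my own (not code from the talk) of how a message key can be mapped to a partition. Kafka's default partitioner for keyed messages uses a murmur2 hash; CRC32 is only a stand-in here so the example runs with the standard library, and `choose_partition` is a hypothetical name.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (illustrative only).

    Kafka's default partitioner hashes the key with murmur2;
    CRC32 is a stand-in so this sketch needs no extra library.
    """
    return zlib.crc32(key) % num_partitions


# Simulate a topic with 3 partitions as three append-only lists.
NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

messages = [
    (b"customer-1", "reading A"),
    (b"customer-2", "reading B"),
    (b"customer-1", "reading C"),
]

for key, value in messages:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))
```

The point is that messages with the same key always land in the same partition, so their relative order is preserved, while different keys can be spread across partitions and consumed in parallel.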
Project Athena is the internal project that aims to create technology enabling easy discovery of (Big) Data. Basically, it consists of a Kafka cluster with the schema registry for Avro. A component, the Topic Controller, retrieves the topic list from Kafka and all the schemas from the registry, and injects the ones that are connected to a topic into another component, Kafka Connect. This component, which then has all the topic configurations needed to extract data from Kafka, takes this data and sends it as events to BigQuery.
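The matching step performed by the Topic Controller could look roughly like the sketch below. This is my own guess, not OVO's code: I assume schemas are registered under the subject name `<topic>-value` (the default convention in Confluent's Schema Registry), and `build_sink_configs`, `schema_subject` and `destination_table` are hypothetical names. Only `topics` is a real Kafka Connect sink setting.

```python
def build_sink_configs(topics, subjects, suffix="-value"):
    """Return one sink configuration per topic that has a registered schema.

    topics:   iterable of topic names, as listed by the Kafka cluster
    subjects: set of subject names, as listed by the schema registry
    suffix:   assumed naming convention linking a topic to its schema
    """
    configs = {}
    for topic in sorted(topics):
        subject = topic + suffix
        if subject not in subjects:
            # No schema registered for this topic: nothing to inject.
            continue
        configs[topic] = {
            "topics": topic,  # real Kafka Connect sink setting
            "schema_subject": subject,  # hypothetical key
            # Hypothetical key: a table name derived from the topic name.
            "destination_table": topic.replace(".", "_").replace("-", "_"),
        }
    return configs


configs = build_sink_configs(
    topics=["meter-readings", "payments", "internal-debug"],
    subjects={"meter-readings-value", "payments-value"},
)
```

Here `internal-debug` has no schema in the registry, so no configuration is generated for it and it never reaches BigQuery.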
The schema description from Kafka is also sent to a component, the Description Updater, which in turn updates the BigQuery table description.
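As an illustration of what the Description Updater might extract, the sketch below reads the optional `doc` attributes that Avro allows on records and fields and turns them into table and column descriptions. The schema, the `describe` function and the shape of its output are my assumptions; the actual call that writes the description to BigQuery is left out.

```python
import json

# A made-up Avro schema; "doc" is a standard optional Avro attribute.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "MeterReading",
  "doc": "A single meter reading submitted by a customer.",
  "fields": [
    {"name": "meter_id", "type": "string", "doc": "Identifier of the meter."},
    {"name": "reading_kwh", "type": "double", "doc": "Reading in kWh."},
    {"name": "source", "type": "string"}
  ]
}
"""


def describe(schema: dict) -> dict:
    """Collect the table-level and column-level docs from an Avro record schema."""
    return {
        "table": schema.get("doc", ""),
        # Fields without a "doc" attribute get an empty description.
        "columns": {f["name"]: f.get("doc", "") for f in schema["fields"]},
    }


description = describe(json.loads(SCHEMA_JSON))
```

Keeping the documentation in the schema means the people who own the topic also own the description that analysts see next to the table.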
This design evolved further with the introduction of Apache Airflow to run a graph of scheduled jobs (like cron jobs). Jobs are ordered and can pass metadata from one to another, which lets the whole infrastructure extract and prepare the views that BI people and stakeholders can query. This works because tables and views are kept separate: both are under the control of the Data Platform, but they are structured in BigQuery in such a way that OVO BI can only use the views (either via a webapp or Tableau).
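The idea of ordered jobs that pass metadata along can be sketched without Airflow at all. The snippet below is not Airflow's API (there, tasks are declared in a DAG and exchange small values through XCom); it only uses the standard library's `graphlib` (Python 3.9+), and the job names and `run_jobs` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_jobs(jobs, dependencies):
    """Run jobs in dependency order, giving each one the results so far.

    jobs:         job name -> callable taking the shared context dict
    dependencies: job name -> set of job names that must run first
    """
    order = list(TopologicalSorter(dependencies).static_order())
    context = {}
    for name in order:
        # Each job can read the metadata produced by its predecessors.
        context[name] = jobs[name](context)
    return order, context


jobs = {
    "load_tables": lambda ctx: {"rows": 3},
    "build_views": lambda ctx: {"view_rows": ctx["load_tables"]["rows"]},
    "notify": lambda ctx: {
        "message": f"{ctx['build_views']['view_rows']} rows ready"
    },
}
dependencies = {
    "load_tables": set(),
    "build_views": {"load_tables"},
    "notify": {"build_views"},
}

order, context = run_jobs(jobs, dependencies)
```

A scheduler such as Airflow adds the parts this sketch leaves out: timed triggering, retries and monitoring of each job.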
Finally, they are thinking about using ElasticSearch for live data, and Dataflow as a unified model for data processing. This means that Kafka Connect also sends data to ElasticSearch, which then provides data to Dataflow, which processes it and creates the views for data analysis through the website or tools like the aforementioned Tableau.
This basically closed the talk. Stay tuned!