Second day of the course. We were a bit behind schedule, but the teacher did not seem worried about it. He started with a recap of everything said on the first day: Regions, Availability Zones (with different costs, services, regulations…), the separation made through VPCs, and the “exit points” of a VPC (internet gateway, S3 endpoint, peering connection to a VPC in another account, or VPN tunnel to an on-prem network).
He then started speaking about Security Groups: they let you define rules for inbound and outbound traffic. You can share them among VPCs, but doing so does not mean that VPCs using the same group can communicate with each other; it is just a way to reuse the same rules. Security Groups can be associated with different components, but VPCs can also define network ACLs (Access Control Lists), which give access rules at the network level (strictly speaking, they are attached to the subnets of a VPC). This means that for a component A to speak with a component B in another VPC, access must be granted at both the Security Group level and the ACL level. The teacher also mentioned the “Jump Box”, a component with access rights that you can use to jump into a subnet and interact freely (maybe for monitoring… anyway he did not spend too much time on that).
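The two-level check can be sketched as a toy model. This is a plain-Python illustration, not how AWS evaluates rules: the `Rule` type, the function names and the addresses are made up for the example, and the sketch ignores that real security groups are stateful while network ACLs are stateless and ordered.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule: a source CIDR block and a destination port."""
    cidr: str
    port: int


def matches(rules, source_ip, port):
    """True if any rule allows this source address on this port."""
    return any(
        ip_address(source_ip) in ip_network(rule.cidr) and rule.port == port
        for rule in rules
    )


def is_allowed(security_group, network_acl, source_ip, port):
    """Traffic gets through only if BOTH layers allow it."""
    return matches(security_group, source_ip, port) and matches(network_acl, source_ip, port)


# Example rule sets (assumed values, for illustration only).
sg_rules = [Rule("10.0.0.0/16", 443)]
acl_rules = [Rule("10.0.1.0/24", 443)]
```

With these rules, a caller at 10.0.1.15 reaches port 443, while a caller at 10.0.2.15 passes the Security Group but is stopped by the ACL.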
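Amazon VPC Flow Logs is a service that lets you log the IP traffic of a VPC, for up to one month according to the teacher. It does not capture the full traffic, though: only header-level metadata (addresses, ports, protocol, packet and byte counts) is recorded, not the payload.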
We spent a bit of time discussing how to connect an internal VPC to the outside world, especially for external subnets. As already said, VPC peering is used to connect VPCs across accounts. The VGW is the virtual private gateway, used to connect your on-prem network to the one hosted in AWS. When you set up a VGW, Amazon generates for you all the commands to run on your Cisco gateway to tunnel the data between the two points. The drawback of the VPN solution is that the traffic passes through the internet (peering traffic, by contrast, stays on the AWS network). If you want a dedicated connection to AWS that is fast, low latency and reliable, you have to use Direct Connect, a service available only in a few big data centers.
We also spoke about internet gateways and NAT Gateways. I missed part of that discussion, so I would rather point you to the official AWS documentation to avoid errors.
To achieve High Availability you need to avoid single points of failure, for example by running DBs in Master-Slave mode with multi-AZ replication. Replicating across regions is harder because the philosophy of Amazon is “what is in a region stays in a region”, meaning that if you want to replicate data across regions you have to do it with some custom service (for example, you have to write a component that speaks to a component in another region and sends it all the data), and this can be time and resource consuming.
Load Balancers (ELB) distribute load and detect unhealthy instances. They perform connection draining, meaning that when an instance (EC2) is being removed, the load balancer stops sending new requests to it while letting the requests already in flight complete. It is perfectly doable to expose a load balancer to HTTPS connections from the internet and have it talk plain HTTP (not secure) inside our VPC, or to use it for port forwarding.
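As a rough picture of that behaviour, here is a minimal round-robin sketch. It is a toy model and not the ELB algorithm: the `LoadBalancer` class and its method names are invented for the example, and real health checks and draining timeouts are left out.

```python
class LoadBalancer:
    """Toy round-robin balancer that skips unhealthy and draining instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.unhealthy = set()
        self.draining = set()
        self._next = 0

    def mark_unhealthy(self, instance):
        """Record a failed health check for this instance."""
        self.unhealthy.add(instance)

    def start_draining(self, instance):
        """Stop sending NEW requests; in-flight ones are left to finish."""
        self.draining.add(instance)

    def route(self):
        """Pick the next eligible instance for a new request."""
        eligible = [
            i for i in self.instances
            if i not in self.unhealthy and i not in self.draining
        ]
        if not eligible:
            raise RuntimeError("no healthy instances available")
        instance = eligible[self._next % len(eligible)]
        self._next += 1
        return instance
```

Once an instance is marked unhealthy or draining, `route()` simply never returns it again, which is the part of the behaviour the notes describe.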
Amazon Route 53 is the DNS service managed by Amazon and it can do load distribution among regions.
We then had a short discussion about the PassRole permission we encountered yesterday in the Lab exercise. Let’s say that a user has a policy that allows all actions on any EC2 resource (ec2:*). Normally that user could create an EC2 instance with any privileges, for example with a role granting full admin on *.*, then log in to that instance and become a power user capable of everything. This is where iam:PassRole is useful: being an IAM action, it is not part of EC2, and denying it means that the user cannot hand roles to the services he has powers on.
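An IAM policy along these lines could look like the sketch below, built as a Python dictionary and printed as JSON. The statement IDs are made up, and the explicit Deny is my own choice for illustration (not something from the lab): a user with no Allow on iam:PassRole is already denied implicitly, the explicit statement just makes the intent visible.

```python
import json

# Sketch of an identity policy: full EC2 access, but no passing of roles.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAllEc2",          # hypothetical statement ID
            "Effect": "Allow",
            "Action": "ec2:*",
            "Resource": "*",
        },
        {
            "Sid": "DenyPassingRoles",     # hypothetical statement ID
            "Effect": "Deny",
            "Action": "iam:PassRole",
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```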
Then we faced the big chapter of SCALING.
With Amazon CloudWatch we can set up alarms for scaling. Rules in CloudWatch have the form “metric is > limit for x minutes”. Amazon provides metrics on CPU usage, latency and others… but NOT ABOUT MEMORY, because memory can fill up for a lot of reasons (Linux systems, for example, use free memory to cache disk access). You can, however, send any kind of custom metric to CloudWatch and set up alarms on it. You can, for example, extract metrics from logs and set alarms on how many 400 responses you get or on the size of messages (the teacher mentioned Flow Logs here, but since they only record header-level metadata, HTTP details would have to come from application or load balancer logs). Alarms in CloudWatch have three possible states: OK, ALARM and INSUFFICIENT_DATA. This means that you can tell CloudWatch to send you a message when a metric goes below a threshold and another message when, after going below, it goes above again.
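To make the “metric is > limit for x minutes” rule concrete, here is a small simulation. It is a simplified sketch and not the CloudWatch implementation: the function names are invented, it assumes one datapoint per minute, and it treats any missing datapoint as insufficient data, whereas the real service has configurable handling. The state names OK, ALARM and INSUFFICIENT_DATA are the ones CloudWatch uses.

```python
def alarm_state(datapoints, threshold, periods):
    """Evaluate the rule on the most recent `periods` datapoints.

    `datapoints` holds one value per minute, oldest first; None means missing.
    """
    recent = datapoints[-periods:]
    if len(recent) < periods or any(value is None for value in recent):
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


def notifications(history, threshold, periods):
    """Return (minute, new_state) for every state change, as an alarm action would."""
    messages = []
    previous = None
    for minute in range(1, len(history) + 1):
        state = alarm_state(history[:minute], threshold, periods)
        if state != previous:
            messages.append((minute, state))
            previous = state
    return messages


# Assumed CPU percentages, one per minute.
cpu = [50, 85, 90, 95, 60, 55]
print(notifications(cpu, threshold=80, periods=3))
```

A message is produced both when the alarm fires and when it clears, which is the "below, then above again" behaviour described above.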
You can define a Launch Configuration for autoscaling, specifying AMI, VPC, Subnets and Security settings. Then, using Auto Scaling Groups, another Amazon service, you can define the minimum, maximum and desired number of instances. Keep in mind that autoscaling is complicated: how many instances to launch (suggestion: double or nothing), the load on the database, when to scale in or scale out… all of these are problems that can strongly affect the performance of your service.
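The interplay between minimum, maximum and desired capacity fits in a few lines. This is only my reading of the “double or nothing” suggestion (double the fleet when scaling out); halving on scale in is an assumption of mine, and the function name is made up.

```python
def next_desired(current, minimum, maximum, scale_out):
    """Double on scale out, halve on scale in, always within the group limits."""
    target = current * 2 if scale_out else current // 2
    return max(minimum, min(maximum, target))


# A group with min=1 and max=10, currently running 2 instances.
print(next_desired(2, 1, 10, scale_out=True))
```

Whatever the scaling rule decides, the group never goes below the minimum or above the maximum.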
EC2 Auto Recovery is a feature that tries to bring an instance in an unhealthy state back up on different underlying hardware.
Scaling RDS is possible through database sharding, or you can have a Master and N read-only replicas. DynamoDB is easier to scale: it is enough to raise its RCUs (read capacity units) and WCUs (write capacity units), or to use DAX, the DynamoDB Accelerator, a cache for DynamoDB. If you want an in-memory cache you can use Amazon ElastiCache, a managed service that gives your applications a Redis-compatible or Memcached-compatible cache. It runs inside your VPC.
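To get a feeling for how many RCUs and WCUs a table needs, here is a small calculator. It is a sketch for provisioned capacity only, with made-up function names, based on the documented unit sizes: one RCU covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU covers one write per second of an item up to 1 KB.

```python
import math


def read_capacity_units(reads_per_second, item_size_kb, strongly_consistent=True):
    """RCUs needed: item size is rounded up to the next 4 KB block."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * units_per_read
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_second, item_size_kb):
    """WCUs needed: item size is rounded up to the next 1 KB block."""
    return writes_per_second * math.ceil(item_size_kb)


# Assumed workload: 100 reads/s of 6 KB items, 50 writes/s of 2.5 KB items.
print(read_capacity_units(100, 6), write_capacity_units(50, 2.5))
```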
Lambda Functions are a very important service in AWS. From the point of view of the user there is no infrastructure to manage at all. Lambda functions are intended for small computations reacting to a trigger and give scalability for free. DECOUPLING is an important concept for lambdas: you can divide your services into small components (microservices) and have part (or all?) of them implemented as lambdas.
Lambdas can be attached to a lot of triggers: SNS topics (be careful, NOT SQS at the time of the course: SQS is a queue and stores messages until someone consumes them; you could anyway set a time trigger that, every 5 minutes for example, starts a lambda that reads messages from a queue and writes them to DynamoDB; SQS has since become a supported trigger as well), time triggers (like a cron job), CloudWatch, particular events and also DynamoDB streams (events fired by DynamoDB when its data changes; please note that DynamoDB can store data with a TTL that, when it expires, deletes the entry it is associated with)…
Moreover you can set up an API Gateway in front of lambdas: an HTTPS URL to call the lambda function without using an AK/SK (access key and secret key).
Lambda functions can be attached to Subnets and VPCs. You pay only based on execution time and memory consumption, and each month you get FOR FREE something like 400,000 GB-seconds and 1 million requests. If an event triggers several lambdas (the teacher showed for example a web application to which you can send an image and which creates a thumbnail, a blurred and a grayscale version) you can group them into a Step Function (which can also include steps that require human interaction), so that the whole group is considered finished (and returns a result) only when all the steps are over.
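The billing unit is the GB-second: memory allocated, in GB, multiplied by execution time in seconds. The sketch below estimates monthly usage against the free tier of 400,000 GB-seconds and 1 million requests. It is a simplification with made-up names: it uses an average duration, ignores how durations are rounded for billing, and leaves prices out on purpose.

```python
FREE_GB_SECONDS = 400_000
FREE_REQUESTS = 1_000_000


def monthly_usage(invocations, memory_mb, avg_duration_seconds):
    """Return total GB-seconds and what exceeds the monthly free tier."""
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_seconds
    return {
        "gb_seconds": gb_seconds,
        "billable_gb_seconds": max(0.0, gb_seconds - FREE_GB_SECONDS),
        "billable_requests": max(0, invocations - FREE_REQUESTS),
    }


# Assumed workload: 3 million invocations of a 512 MB function running 0.5 s.
print(monthly_usage(3_000_000, 512, 0.5))
```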
At this point the teacher showed us a very important detail about lambda functions: let’s say that a function defines a global variable and changes its value on every invocation (for example a counter that it increments). When the lambda is invoked several times in quick succession it keeps incrementing the same value: this is because the environment in which the lambda runs is not automatically destroyed after each invocation. Maybe it will be stopped after 30 minutes. But if a computation takes a while and a second request arrives in the meantime, Amazon needs to start another instance, and there the counter will start from a brand new value.
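The counter experiment looks more or less like this in Python. It is a reconstruction from my notes, not the teacher's actual code; the `handler(event, context)` signature is the usual one for Python lambdas, and the variable name is mine.

```python
# Module-level state: initialised once when the execution environment starts
# (cold start), then kept for as long as that environment is reused.
invocation_count = 0


def handler(event, context):
    """Lambda-style entry point that increments a global counter."""
    global invocation_count
    invocation_count += 1
    return {"count": invocation_count}
```

Calling `handler` twice in the same process mimics two invocations served by the same warm environment: the second call sees the value left by the first. A request served by a freshly started environment would begin from 1 again, so this kind of state should never be relied on.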
You can achieve decoupling by wisely using SNS topics and SQS. Amazon SNS is an easy to use publish/subscribe messaging system that follows the fire-and-forget pattern, meaning that if a subscriber is not available when a message is fired, it will not receive it once it becomes available again. Amazon SQS is the queuing service: it is scalable and reliable; the cheapest version does not preserve the order of arrival or the uniqueness of messages, meaning that you can receive the same message twice, but lately Amazon has extended it with a FIFO flavour that guarantees ordering and uniqueness and is a bit more expensive.
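The difference between the two delivery models can be shown with two tiny in-memory classes. This is a toy model with invented names and nothing to do with the real SNS and SQS APIs; in particular, this queue is strictly ordered and never duplicates, unlike standard SQS.

```python
from collections import deque


class Topic:
    """Fire and forget: only subscribers present at publish time get the message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class Queue:
    """Messages wait in the queue until a consumer receives them."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


topic, queue = Topic(), Queue()
topic.publish("order-1")      # nobody is subscribed yet: the message is lost
queue.send("order-1")         # stays in the queue until someone reads it

late_subscriber_inbox = []
topic.subscribe(late_subscriber_inbox.append)
topic.publish("order-2")      # delivered, the subscriber is now listening
```

The late subscriber only ever sees `order-2`, while `order-1` is still waiting in the queue for its consumer.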
A note: CloudFront can also cache streaming data.
Last topic of the day: AUTOMATION. This is a feature based on templates: a template describes what your environment should look like in terms of ALL ITS RESOURCES. It has a specific JSON (or alternatively a more compact YAML) syntax to define every resource (really error prone, to be honest), and all the stacks started from a template stay bound to that template, meaning that you can change all of them by changing the template (and asking CloudFormation to reprocess it). You can even change running resources without stopping them, if the change does not need a restart.
Templates are stored in S3, and CloudFormation can show you in advance what applying a change to a stack will do and which steps AWS will perform to apply it (change sets).
The service that handles automation is Amazon CloudFormation. You can split a template into small parts, each handling only a few concepts or areas of the whole stack (like the networking part, the EC2 instances…), deploy the single parts separately, and have a stack reference a part deployed with another template. If, for example, you want to use the same lambda in two different stacks, you can deploy the lambda by itself, export it, and reference it from the two templates.
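Here is a sketch of what two such templates could look like, written as Python dictionaries and dumped to JSON. To keep it short I used a VPC and a subnet instead of a lambda; the logical IDs, the export name and the CIDR blocks are made up. The mechanism is the standard one: an `Export` in the `Outputs` of the first template and `Fn::ImportValue` in the second.

```python
import json

# Networking part: creates the VPC and exports its ID under a chosen name.
network_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "MainVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": "10.0.0.0/16"},
        },
    },
    "Outputs": {
        "VpcId": {
            "Value": {"Ref": "MainVpc"},
            "Export": {"Name": "demo-network-VpcId"},  # hypothetical export name
        },
    },
}

# Application part: deployed as a separate stack, imports the exported VPC ID.
app_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AppSubnet": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Fn::ImportValue": "demo-network-VpcId"},
                "CidrBlock": "10.0.1.0/24",
            },
        },
    },
}

print(json.dumps(network_template, indent=2))
print(json.dumps(app_template, indent=2))
```

The only link between the two stacks is the export name, so the networking stack has to be deployed first.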
To help with writing templates, Terraform can be a nice external tool.