Julian Wraith


Category: Amazon Web Services

Using AWS to scale SDL Web Publishers

In a recent blog post, Brandon Mahoney asked How Big Can CMS Infrastructure Get?, with the short answer being: as big as you are willing to pay for. It is as true today as it has always been that you can build an infrastructure to fulfil any set of requirements, and requirements such as disaster recovery and performance have traditionally driven you to provision infrastructure that, for the most part, sits idle in a data centre. Take a cloud-based approach and you can change your overall infrastructure footprint, saving costs and improving agility.

So how can you avoid overspending on infrastructure while still meeting your requirements?

In a standard SDL Web infrastructure, like the one Brandon shares, there are various strategies you can employ to reduce overall spend when you use the public cloud. If you compare an always-on on-premises environment to an always-on cloud environment, the difference in cost may not be dramatic – although it should be noted that businesses rarely have a good grip on what things actually cost (see AWS’ TCO calculator for some help with that). If you are not using higher-level services, like Kinesis, that are fully managed by AWS, you will still need to do some or all of the hard work of managing the environment yourself. Automation can be applied equally to the cloud and to on-premises environments (though it rarely is), so ultimately you are still managing infrastructure; the way to reduce its cost is to manage less of it, more efficiently, without compromising performance and reliability. Five principles underpin this strategy:

  • Automate everything – if you do something more than three times, write a script or process for it.
  • Build stateless applications – this way you can more easily send servers to their deaths when their existence has ceased to have value
  • Scale servers elastically – easier if they are stateless; add and remove capacity as needed based upon demand. Demand comes in the form of users but also in the form of work if servers are batch or processing based
  • Use the right instance sizes – do not just throw the nicest looking instance type at a workload. X1 instances all round! 🙂
  • Understand and use Reserved Instances, On-Demand Instances and Spot instances – they all have a potential place in your infrastructure

AWS provides you with a significant tool set to implement this strategy, and I will outline, using SDL Web as the example, how you can use these tools to build low-cost and agile infrastructures.

Putting it into practice

I have blogged many times on publishing performance and why, with the right setup, you can achieve a high throughput of items published from SDL Web (formerly SDL Tridion). In a previous life, with a previous approach to infrastructure, I helped a customer reach a peak of 850,000 items published in a day. I suspect we could have gone higher, but this was the natural load and we never got to give the production infrastructure a full stress test. The implemented infrastructure relied very heavily on physical servers, and lots of them, to achieve such a high throughput. But this is not the pattern we need to follow, so how do we do this?

What we want to achieve is the following flow of steps:

  1. We have a minimal infrastructure that serves our basic need
  2. We detect demand and scale up to meet that demand
  3. We scale down when the demand has subsided

With this flow, we only use the resources we actually need. For SDL Web, publishing is a fluctuating task and most organizations follow a pattern like:

  1. They do not typically publish content at night
  2. They publish several hundred (or thousand) items over a day
  3. They publish at peak times of just before lunch and just before the end of the day
  4. They have occasional significantly high peaks in load for site roll outs or large content changes

Publishing in SDL Web is a more or less stateless process, meaning that the queue and the data live outside the process itself; however, during rendering of a publishing job, state is held on disk and in memory. Whilst it is normal to hold some state in memory or on disk, a rendering job could be a significant batch job executed by one server, and this poses a complication to scaling down which we need to address.
Our high-level architecture is shown in figure 1 and consists of a database (AWS RDS), a Content Manager and a Publisher in an auto-scaling group which will have the ability to scale up and down with demand. Because this is a test architecture focusing on publishing, the Content Manager is neither scaled nor redundant. In a production scenario you would probably choose to place the Content Manager behind a load balancer and in its own auto-scaling group. I have also chosen to ignore some additional complications such as Workflow Agents and Search Indexing.

Figure 1 – High-level Architecture

To get the Publishers to scale horizontally, we need to understand what the demand is at a given point in time and scale accordingly. More often than not, CPU utilization is a good metric that shows demand: CPU utilization is high, therefore you add more capacity to reduce overall utilization. However, Publishers do not work like this; they typically run at high utilization regardless of demand, so this is not a good measure. Demand comes in the form of the queue of items that are waiting, so we need to establish whether we have 100 or 100,000 items in the queue. To do this I use a Lambda function to query SDL Web and publish a custom CloudWatch metric showing the queue length. This metric will simply give us a number, and I chose to get it directly from the database. You could query the RESTful API of SDL Web, but this is both a little more complex and, in my opinion, probably a slower approach; a quick database query will provide what we need for the metric.
The Lambda function is written in .NET Core to be able to leverage the native SQL Server database drivers. We first define the method (note: the code shown in this post is not production-worthy and requires more work to make it so):

public async Task<string> FunctionHandler(ILambdaContext context)
{

Then get the count from the database (ideally you would not hardcode the connection string, but I am lazy):

Int32 count = 0;
using (var connection = new SqlConnection("user id=TCMDBUSER;password=12345;server=dbinstance.something.us-east-1.rds.amazonaws.com,1433;database=tridion_cm;connection timeout=45"))
{
    connection.Open();
    SqlCommand comm = new SqlCommand("SELECT COUNT(*) FROM QUEUE_MESSAGES", connection);
    count = (Int32)comm.ExecuteScalar();
}

Now we have the count which we will pass to the metric:

var dimension = new Dimension
{
    Name = "Publishing",
    Value = "Queue"
};

var metric1 = new MetricDatum
{
    Dimensions = new List<Dimension>() { dimension },
    MetricName = "Waiting for Publish",
    Timestamp = DateTime.Now,
    Unit = StandardUnit.Count,
    Value = (double)count
};

var request = new PutMetricDataRequest
{
    MetricData = new List<MetricDatum>() { metric1 },
    Namespace = "SDL Web"
};

IAmazonCloudWatch client = new AmazonCloudWatchClient();
var response = await client.PutMetricDataAsync(request);

And then close off:

return response.HttpStatusCode.ToString();
 }

This will give us a metric which we can then find in CloudWatch Metrics under “SDL Web” and then “Publishing/Queue/Waiting for Publish”. We then set a CloudWatch alarm to fire when the load is over 100 items in the queue for a sustained period of 5 minutes.
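
To make this concrete, here is a minimal sketch of how such an alarm could be created with boto3; the alarm name, the exact comparison and the scaling policy ARN are assumptions for illustration, not values from my actual setup.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm on the custom queue-length metric published by the Lambda function.
# The alarm action is the auto-scaling policy described below (placeholder ARN).
cloudwatch.put_metric_alarm(
    AlarmName='SDLWebPublishQueueHigh',
    Namespace='SDL Web',
    MetricName='Waiting for Publish',
    Dimensions=[{'Name': 'Publishing', 'Value': 'Queue'}],
    Statistic='Average',
    Period=300,                      # evaluate over 5 minutes
    EvaluationPeriods=1,
    Threshold=100,                   # more than 100 items waiting in the queue
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:autoscaling:...:scalingPolicy:...']  # placeholder ARN
)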

For my auto-scaling group, a Launch Configuration specifies the instance that will be launched when the alarm goes off. The Launch Configuration specifies things like the AMI of my publishing server, the security groups that allow it to talk to the database, and its IAM role, which allows it to talk to other AWS services such as SSM, which we will come back to later in this blog post. When the alarm fires, an auto-scaling group policy decides what to do and is defined as follows:

When the alarm is in breach for a sustained period of 5 minutes the scaling policy will:

  • set the auto-scaling group to 2 instances if the queue is between 100 and 1000 content items
  • set the auto-scaling group to 3 instances if the queue is between 1000 and 2500 content items
  • set the auto-scaling group to 5 instances if the queue is greater than 2500 content items

As demand drops, the auto-scaling group will be set to lower capacities and will eventually return to 1 instance running (the minimum in our auto-scaling group).
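
A step scaling policy along these lines could be sketched with boto3 as below; the group and policy names are assumptions, and the step bounds are expressed relative to the alarm threshold of 100 queued items.

import boto3

autoscaling = boto3.client('autoscaling')

# Step scaling with ExactCapacity: the queue-length alarm (threshold 100) triggers
# the policy, and each step sets the group to an absolute number of publishers.
# Interval bounds are relative to the threshold, so 0 means a queue of 100 items.
autoscaling.put_scaling_policy(
    AutoScalingGroupName='SDLWebPublisherASG',   # assumed group name
    PolicyName='SDLWebPublisherQueueSteps',
    PolicyType='StepScaling',
    AdjustmentType='ExactCapacity',
    StepAdjustments=[
        {'MetricIntervalLowerBound': 0,    'MetricIntervalUpperBound': 900,  'ScalingAdjustment': 2},  # 100-1000 items
        {'MetricIntervalLowerBound': 900,  'MetricIntervalUpperBound': 2400, 'ScalingAdjustment': 3},  # 1000-2500 items
        {'MetricIntervalLowerBound': 2400, 'ScalingAdjustment': 5},                                    # > 2500 items
    ],
)

Target tracking would be simpler for a CPU-style metric, but step scaling maps more naturally onto the queue-length bands above.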

So now, we have our Lambda generating our metric and an alarm triggering the auto-scaling group which will then add more instances based upon how high demand is (figure 2).

Figure 2 – Alarm

As content editors publish items there will be a small delay, around 5 minutes, and then new publishers will be added to publish the content items in the queue. This is just an example of how you could do this; the way the alarm reacts, and how quickly you scale up or down, is all configurable. Earlier in this post I also mentioned that you typically see higher loads at certain times of the day. As such, we could simply add a new publisher at 4 PM to handle the increased load proactively rather than reactively. The choice is yours on how you address this.
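
For the proactive variant, a scheduled scaling action is enough. A minimal sketch, assuming the same group name and a 16:00 UTC weekday schedule:

import boto3

autoscaling = boto3.client('autoscaling')

# Proactively add a publisher every weekday just before the afternoon peak.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='SDLWebPublisherASG',   # assumed group name
    ScheduledActionName='AfternoonPublishPeak',
    Recurrence='0 16 * * 1-5',                   # cron expression, UTC
    DesiredCapacity=2,
)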

When the auto-scaling group removes a publisher, we need to be sure that it is done in a graceful way. Earlier I mentioned that the publisher is stateless, but if it is rendering a content item, that item will be in memory and on disk. In a recent release of SDL Web, SDL added a graceful shutdown of the publisher, meaning it will finish what it is doing before it shuts down. When an instance in an auto-scaling group is terminated, processes do not get the chance to shut down cleanly, so we need to use a lifecycle hook to pause termination. Once set, the flow of termination for a publisher is as follows:

  • Lifecycle hook on termination pauses termination and waits
  • A CloudWatch Event is written to say the hook for our auto-scaling group is waiting
  • A CloudWatch rule traps the event and fires a Lambda function in response
  • The Lambda function uses AWS Systems Manager (SSM) to stop the publisher and issue a resume on the termination
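
Setting up the hook itself, and the CloudWatch rule that routes the lifecycle event to the Lambda function, could look roughly like the sketch below; the hook and group names match those used in the function further down, but the rule name and ARNs are placeholders.

import json
import boto3

autoscaling = boto3.client('autoscaling')
events = boto3.client('events')

# Pause instance termination so the publisher can be stopped gracefully first.
autoscaling.put_lifecycle_hook(
    LifecycleHookName='SDLWebPubShutdown',
    AutoScalingGroupName='SDLWebPublisherASG',
    LifecycleTransition='autoscaling:EC2_INSTANCE_TERMINATING',
    HeartbeatTimeout=300,        # seconds to wait before the default action applies
    DefaultResult='ABANDON',
)

# Route the terminate lifecycle event for this group to the Lambda function below.
events.put_rule(
    Name='SDLWebPublisherTerminating',
    EventPattern=json.dumps({
        'source': ['aws.autoscaling'],
        'detail-type': ['EC2 Instance-terminate Lifecycle Action'],
        'detail': {'AutoScalingGroupName': ['SDLWebPublisherASG']},
    }),
)
events.put_targets(
    Rule='SDLWebPublisherTerminating',
    Targets=[{'Id': 'StopPublisherLambda', 'Arn': 'arn:aws:lambda:...'}],  # placeholder ARN
)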

Each of our publishers is a Managed Instance, which means that we can manage its configuration while it is running without needing to log on to the instance. Managing an instance can be a manual process, or you can do it from a Lambda function. In this case, we are going to run a local PowerShell command to stop a service, the publisher. The code, written in Python, is as follows:

Define the handler and the libraries we will use:

import time
import boto3

def lambda_handler(event, context):
    ssmClient = boto3.client('ssm')
    s3Client = boto3.client('s3')
    asgClient = boto3.client('autoscaling')

Set the details of which instance is being terminated:

    message = event['detail']
    instanceId = str(message['EC2InstanceId'])   # the lifecycle event detail carries the instance id
    lifeCycleHook = "SDLWebPubShutdown"
    autoScalingGroup = "SDLWebPublisherASG"

Send the shutdown command to the instance in the form of a PowerShell command and wait until it has completed:

    ssmCommand = ssmClient.send_command(
        InstanceIds=[instanceId],
        DocumentName='AWS-RunPowerShellScript',
        TimeoutSeconds=240,
        Comment='Stop Publisher',
        Parameters={'commands': ["Stop-Service TcmPublisher"]}
    )

    # Poll SSM until the EC2 Run Command completes
    status = 'Pending'
    while status == 'Pending' or status == 'InProgress':
        time.sleep(3)
        status = ssmClient.list_commands(CommandId=ssmCommand['Command']['CommandId'])['Commands'][0]['Status']

    if status != 'Success':
        print("Command failed with status " + status)

End the waiting termination hook and return from the function:

    response = asgClient.complete_lifecycle_action(
        LifecycleHookName=lifeCycleHook,
        AutoScalingGroupName=autoScalingGroup,
        LifecycleActionResult='ABANDON',
        InstanceId=instanceId
    )
    return None

Once we resume the termination, the instance is terminated and the auto-scaling group has been downsized.

Summary

Figure 3 – Complete Solution

As we can see in figure 3, we have a complete auto-scaling process which will:

  1. Measure the length of the publishing queue thus recording the current demand
  2. Use these measurements to scale the publishers horizontally to meet demand
  3. Gracefully scale back the publishers when demand subsides

In addition, we receive the following benefits:

  1. Because we create a metric, we can actively monitor the length of the queue on operational dashboards together with other metrics like CPU Utilization, Network and Memory (operational insight)
  2. We can easily provision new publisher instances and recycle bad instances quickly (improved agility)
  3. Reduction in the need for manual intervention (simplification)
  4. We improve the experience of the content editors in their use of the product

Selecting the right AWS region

Selecting the right Amazon Web Services (AWS) region is an important step when deploying your workloads to AWS. In this blog post, I hope to briefly outline what a region is and share five factors which you should look at when selecting a region.

What is a region?
A region is a geographically separated area of AWS, and AWS has 13 regions (as of October 2016) spread across the world. Each region has multiple Availability Zones (AZs) which allow the placement of resources and data in multiple locations within that region. Each AZ within a region is connected to the other AZs via low-latency connections, but remains an isolated unit capable of operating independently of the others. Because of this arrangement of regions and AZs, you do not need to combine two regions to achieve high availability; rather, you spread your resources across multiple AZs. All regions are accessible from the same AWS account, with the exception of the China and GovCloud regions, which require separate AWS accounts.
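
If you want to see the current list (it grows as AWS expands), the EC2 API can enumerate the regions available to your account and the Availability Zones of the region you are connected to; a small sketch:

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# List all regions available to this account.
for region in ec2.describe_regions()['Regions']:
    print(region['RegionName'], region['Endpoint'])

# List the Availability Zones of the region the client is connected to.
for zone in ec2.describe_availability_zones()['AvailabilityZones']:
    print(zone['ZoneName'], zone['State'])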

Five factors to consider in choosing a region:

Compliance
In AWS, data in a region does not leave that region unless a customer selects a different region for the data. Data such as backups or data replicas can exist in any region, but AWS itself never moves data outside a region; this has to be done explicitly by the customer. If you are required by EU law to keep data in certain locations, either in the EU or within a specific country such as Germany or France, then it is possible to do so by utilizing the most appropriate region for that data. Compliance is a complex, use-case-specific topic, so I would advise reading up on the compliance requirements for your given workload and industry.

Service requirements
Whilst all customers of AWS receive a uniform platform of services, not all regions have received all services from AWS yet. Newer regions, especially, may lack some services, and you should check whether the AWS services your workload needs are available in a given region.

Cost
The cost of AWS services differs per region, because the costs AWS incurs to operate each region differ (e.g. building and power costs). You should review whether this negatively impacts your workload and whether a different region might be more cost effective. You can compare costs using the AWS Simple Monthly Calculator to determine what the difference will be for your workload.

Latency
Latency is the time taken to receive a response from, in this case, resources running in an AWS region. In order to keep latency to a minimum, your workload should be located as near to the end user as possible. If your users are mostly based in North America, then a region in the US would most likely have the lowest latency of all AWS regions and would therefore be the best choice. With a geographically spread user base, you can employ multiple regions or Amazon’s CloudFront CDN to improve latency for more remote users.
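
As a quick, unscientific check you can time a round trip to a regional endpoint from wherever your users sit; the sketch below times an HTTPS request to the DynamoDB endpoint in a few regions (any regional endpoint that answers unauthenticated requests would do), so treat the numbers as indicative only.

import time
import urllib.request

# Rough round-trip timing to a regional endpoint; repeat a few times for a fairer picture.
for region in ['us-east-1', 'eu-west-1', 'ap-southeast-1']:
    url = 'https://dynamodb.' + region + '.amazonaws.com/'
    start = time.time()
    urllib.request.urlopen(url).read()
    print(region, round((time.time() - start) * 1000), 'ms')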

Carbon Neutral
AWS has a long-term commitment to achieve 100% renewable energy usage for its global infrastructure footprint and aims to run on 40% renewable energy by the end of 2016. In line with this goal, AWS currently runs four carbon-neutral regions, which means that workloads placed in those regions can contribute to your own organization’s goals of being carbon neutral. AWS makes it possible for customers to run fewer resources through technologies such as auto-scaling, and AWS regions are more efficient and consume less power than the average corporate data centre, which AWS estimates can reduce the carbon emissions of your workloads by up to 88%.

New AWS features – September 2016

Occasionally I will share some new features from AWS that sparked my interest, so here are some new features that I like:

  • Upload AWS Cost & Usage Reports to Redshift and QuickSight (link)  –  increased flexibility and efficiency in generating reports on billing. You could also load in other datasets to Redshift to create reports combining other data such as hour reporting from engineering teams
  • AWS Config Rules is now available in Singapore (link) – AWS Config Rules is a way of controlling the configuration of large numbers of servers and is now available in more regions
  • Organize Your AWS Resources by Using up to 50 Tags per Resource (link) – it was previously 10 tags, which could be restricting, especially if you use tags to hold metadata that controls resources (e.g. startup time), so this makes tagging more flexible
  • Amazon RDS for Oracle introduces a License Included offering for Oracle Standard Edition Two (SE2) (link) – expanded Oracle database types
  • New AWS Application Load Balancer (link) – routing based on URL now gives you more flexibility in how you architect your applications and spread load over different instances

AWS Elastic File System

Yesterday in a knowledge session between Solution Architects, the topic of AWS Elastic File System was raised and after a short discussion it was decided to take a closer look and set something up. To quote Top Gear, how hard could it be?

What is EFS?

AWS Elastic File System, or EFS, is Amazon Web Services’ latest storage solution: fully managed, simple and scalable file storage to use with EC2 instances. As the name suggests, it grows and shrinks automatically with your storage needs, and EC2 instances can access EFS using NFS (v4.1), across multiple availability zones, at low latency and with high throughput (50 MB/s per TB with 100 MB/s burst). AWS lists the use cases of EFS as big data and analytics, media processing workflows, content management, web serving and home directories. Content management, you say? Hmmm 🙂

In my past, scalable single sources of file-system-based content were expensive and difficult to deploy. So much so that product and implementation strategy meant that putting all content in a database was by far and away the most logical route to take. So could EFS now resolve that headache? I will give it a test to find out.

What do I have to set up?

I will simulate a website setup where I have an application server tier that would host my Tomcat (or similar) application servers and a back-end file system which will be mounted to my application servers so that the files can be used. Onto my file system I will deploy my content. I won’t install or configure Tomcat; this is simple to do but covered very well in other places.

The simple architecture

So, I will need:

  1. An auto-scaling group covering two availability zones (eu-west-1a, eu-west-1b) with two instances of Amazon Linux (no Tomcat, no auto-scaling rules for now)
  2. Security Group to allow my auto-scaling instances to talk NFS to my EFS
  3. An EFS created and mounted to my instances

For my auto-scaling group, I have gone and created a simple one and it is up and running across my two availability zones. I have gone and terminated an instance or two just for fun. That’s not related to this post, it is just fun to terminate something and watch it auto-magically reappear.

My security group allows instances that are members of my auto-scaling security group access to the EFS volumes via the NFS protocol.

My Security Group
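
For reference, the equivalent rule could also be created through the API, roughly as below; the security group IDs are placeholders, and EFS uses NFS over TCP port 2049.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# Allow NFS (TCP 2049) from the auto-scaling group's security group into the EFS security group.
ec2.authorize_security_group_ingress(
    GroupId='sg-efs00000',                                   # placeholder: group attached to the EFS mount targets
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 2049,
        'ToPort': 2049,
        'UserIdGroupPairs': [{'GroupId': 'sg-asg00000'}],    # placeholder: auto-scaling instances' group
    }],
)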

I can now create my EFS for my website content.

I first need to configure the file system access, which consists of my VPC, my mount targets (availability zones) and the security group that defines the source of access requests (the one I created earlier):

Configuration of EFS

Then I configure the optional settings. I have chosen to give it a friendly name and stuck to the default “Performance Mode” of general purpose.

Configure the EFS options

The final review step and then I am done. That was it. No configuring disk sizes, no difficult calculations about how much content I have. It’s done.

Review what I did
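
The same setup can be scripted. A minimal boto3 sketch, assuming placeholder subnet IDs and the security group from above:

import boto3

efs = boto3.client('efs', region_name='eu-west-1')

# Create the file system with the default general purpose performance mode.
fs = efs.create_file_system(CreationToken='website-content', PerformanceMode='generalPurpose')
efs.create_tags(FileSystemId=fs['FileSystemId'], Tags=[{'Key': 'Name', 'Value': 'website-content'}])

# One mount target per availability zone; subnet and security group IDs are placeholders.
# In practice, wait for the file system to become available before creating mount targets.
for subnet in ['subnet-aaaa0000', 'subnet-bbbb0000']:
    efs.create_mount_target(
        FileSystemId=fs['FileSystemId'],
        SubnetId=subnet,
        SecurityGroups=['sg-efs00000'],
    )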

After a short while my volumes are ready, and I can keep track of the status of creation in the main EFS dashboard under “life cycle state”.

After a short while they will be ready

Next we are going to test mounting my volume to my instance. EFS provides instructions for this in the dashboard. Running in an SSH session (from the root directory):

Step 1: If needed, install the NFS client on your EC2 instance

sudo yum install -y nfs-utils

Step 2: Create a new directory on your EC2 instance, such as “efs”

sudo mkdir efs

Step 3: Mount your file system using the DNS name.

sudo mount -t nfs4 -o nfsvers=4.1 $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone).fs-d3658b1a.efs.eu-west-1.amazonaws.com:/ efs

Once that is done, I can switch to the directory and create a simple index.html file for my eventual Tomcat server to see. If I then log on to my other instance, I can see the same file from the other availability zone. This means that if I write my content to disk as I have done here, it is available instantly in the other availability zones and all my sites would be updated.

As I did this manually, if my auto-scaling group scales then I would need to do this each time, which defeats the purpose of auto-scaling. However, if I mount this directory at instance initialization time (e.g. with chef) then it would be mounted when my new instance starts. To test this I made a very simple launch script and updated my launch configuration (made a new one, as edits are not possible) to add the following to the user data portion of the configuration.

#!/bin/bash
cd /
sudo mkdir efs
sudo mount -t nfs4 -o nfsvers=4.1 $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone).fs-d3658b1a.efs.eu-west-1.amazonaws.com:/ efs

Warning: I would not use this code in production. No really, please don’t.

Summary

The most complicated thing about this is mounting the drives, as creation of the fully managed and scalable storage is incredibly easy. For content management systems like SDL Web (Tridion), this is a real help in deploying content in a scalable and reliable way.

© 2018 Julian Wraith. All rights reserved.
