Using AWS to scale SDL Web Publishers

In a recent blog post, Brandon Mahoney asked “How Big Can CMS Infrastructure Get?”, with the short answer being: as much as you want to spend. It is as true today as it always has been that you can build any infrastructure to fulfill any set of requirements, and requirements such as disaster recovery and performance have traditionally driven you to provision infrastructure that will, for the most part, remain idle in a data centre. Take a cloud-based approach and you can change your overall infrastructure footprint, which will save costs and improve agility.

So how can you avoid overspending on infrastructure but still achieve your requirements?

In a standard SDL Web infrastructure, like the one Brandon shares, there are various strategies you can employ to reduce the overall spend when you use public cloud. If you compare an always-on on-premises environment to an always-on cloud environment, the difference in costs may not be dramatic, although it should be noted that businesses rarely have a good grip on what things actually cost (see AWS’ TCO calculator for some help with that). If you are not using higher-level services that are fully managed by AWS, like Kinesis, you will still need to do some or all of the hard work of managing the environment yourself. Automation can often be applied both in the cloud and on-premises (but rarely is), so ultimately you are still managing infrastructure, and the way to reduce its cost is to manage less of it, more efficiently, without compromising performance and reliability. Five principles of this strategy are:

  • Automate everything – if you do something more than three times, write a script or process for it.
  • Build stateless applications – this way you can more easily send servers to their deaths if their existence has ceased to have value
  • Scale servers elastically – easier if they are stateless, but add and remove capacity as needed based upon demand. Demand comes in the form of users but also in the form of work if servers are batch or processing based
  • Use the right instance sizes – do not just throw the nicest looking instance type at a workload. X1 instances all round! 🙂
  • Understand and use Reserved Instances, On-Demand Instances and Spot instances – they all have a potential place in your infrastructure

AWS provides you with a significant tool set to implement this strategy and I will outline, using SDL Web as the example, how you can implement these tools to approach building low cost and agile infrastructures.

Putting it into practice?

I have blogged many times on the performance of publishing and why, with the right setup, you can achieve a high throughput of items published from SDL Web (formerly SDL Tridion). In a previous life with a previous approach to infrastructure, I helped a customer reach a peak of 850 thousand items published in a day. I suspected we could have gone higher, but this was the natural load and we never got to give the production infrastructure a full stress test. The implemented infrastructure relied very heavily on physical servers, and lots of them, to achieve such a high throughput. But this is not the pattern we need to follow, so how do we do this?

What we want to achieve is the following flow of steps:

  1. We have a minimal infrastructure that serves our basic need
  2. We detect demand and scale up to meet that demand
  3. We scale down when the demand has subsided

With this flow, we only use the resources we actually need. For SDL Web, publishing is a fluctuating task and most organizations follow a pattern like:

  1. They do not typically publish content at night
  2. They publish several hundred (or thousand) items over a day
  3. They publish at peak times of just before lunch and just before the end of the day
  4. They have occasional significantly high peaks in load for site roll outs or large content changes

Publishing in SDL Web is a more or less stateless process, meaning that the queue and the data live outside the process itself. However, during rendering of a publishing job, state is held on disk and in memory. Whilst it is normal to hold some state in memory or on disk, a rendering job could be a significant batch job executed by one server, and this poses a complication to scaling down which we need to address.
Our high-level architecture is shown in figure 1, and consists of a database (AWS RDS), a Content Manager and a Publisher in an auto-scaling group which will have the ability to scale up and down with demand. Because this is a test architecture focusing on publishing, the Content Manager is neither scaled nor redundant. In a production scenario you would probably choose to place the Content Manager behind a load balancer and in its own auto-scaling group. I have also chosen to ignore some additional complications of elements such as Workflow Agents and Search Indexing.

Figure 1 – High-level Architecture

To get the Publishers to scale horizontally, we need to understand what the demand is at a given point in time and scale accordingly. More often than not, CPU Utilization is a good metric that shows demand: CPU Utilization is high, therefore you add more capacity to reduce overall utilization. However, Publishers do not work like this. They typically run at a high utilization regardless of demand and therefore this is not a good measure. Demand comes in the form of the queue of items that are waiting, so we need to establish whether we have 100 or 100,000 items in the queue. To do this I use a Lambda function to query SDL Web and provide a custom CloudWatch metric showing the queue length. This metric will simply give us a number and I chose to get this directly from the database. You could query the RESTful API of SDL Web, but this is both a little more complex and, in my opinion, probably a slower approach; a quick database query will provide what we need for the metric.
The Lambda function is written in .NET Core to be able to leverage the native SQL Server database drivers. We first define the method (note: the code shown in this post is not production-worthy and requires more work to make it so):

public async Task<string> FunctionHandler(ILambdaContext context)
{

Then get the count from the database (ideally you do not hardcode the connection string but I am lazy):

    Int32 count = 0;
    using (var connection = new SqlConnection("user id=TCMDBUSER;password=12345;,1433;database=tridion_cm;connection timeout=45"))
    {
        // The connection has to be open before the command can run
        connection.Open();
        SqlCommand comm = new SqlCommand("SELECT COUNT(*) FROM QUEUE_MESSAGES", connection);
        count = (Int32)comm.ExecuteScalar();
    }

Now we have the count which we will pass to the metric:

    var dimension = new Dimension
    {
        Name = "Publishing",
        Value = "Queue"
    };

    var metric1 = new MetricDatum
    {
        Dimensions = new List<Dimension>() { dimension },
        MetricName = "Waiting for Publish",
        Timestamp = DateTime.Now,
        Unit = StandardUnit.Count,
        Value = (double)count
    };

    var request = new PutMetricDataRequest
    {
        MetricData = new List<MetricDatum>() { metric1 },
        Namespace = "SDL Web"
    };

    IAmazonCloudWatch client = new AmazonCloudWatchClient();
    var response = await client.PutMetricDataAsync(request);

And then close off:

    return response.HttpStatusCode.ToString();
}

This will give us a metric which we can then find in CloudWatch metrics under “SDL Web” and then “Publishing/Queue/Waiting for Publish”. We are then going to set a CloudWatch Alarm that fires when there are more than 100 items in the queue for a sustained period of 5 minutes.

For my auto-scaling group, a Launch Configuration specifies the instance that will be launched when the alarm goes off. The Launch Configuration specifies things like the AMI of my publishing server, the security groups that allow it to talk to the database, and its role, which allows it to talk to other AWS services such as SSM (which we will get back to later in this blog post). When the alarm fires, an auto-scaling group policy decides what to do and is defined as follows:

When the alarm is in breach for a sustained period of 5 minutes the scaling policy will:

  • set the auto-scaling group to 2 instances if the queue is between 100 and 1000 content items
  • set the auto-scaling group to 3 instances if the queue is between 1000 and 2500 content items
  • set the auto-scaling group to 5 instances if the queue is greater than 2500 content items

As demand drops, the auto-scaling group will be set to lower amounts and will eventually return to 1 instance running (the minimum in our auto-scaling group).
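The thresholds above boil down to a simple lookup from queue length to group size. In practice this mapping lives in the step scaling policy itself rather than in code, but a minimal Python sketch makes the steps easy to reason about. The function name is made up for illustration, and how the exact boundary values (100, 1000, 2500) are treated is my assumption, since the ranges above overlap at their edges.

```python
def desired_publishers(queue_length: int) -> int:
    """Return the number of publisher instances for a given queue length.

    Boundary handling is an assumption: a queue of exactly 1000 items
    still falls in the "2 instances" step, exactly 2500 in the "3" step.
    """
    if queue_length > 2500:
        return 5
    if queue_length > 1000:
        return 3
    if queue_length > 100:
        return 2
    return 1  # the minimum of the auto-scaling group


for length in (50, 500, 1500, 10000):
    print(length, "items ->", desired_publishers(length), "publisher(s)")
```

Writing it out like this also makes it obvious where you would tune things: the step boundaries and the instance counts are the only knobs.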

So now, we have our Lambda generating our metric and an alarm triggering the auto-scaling group which will then add more instances based upon how high demand is (figure 2).

Figure 2 – Alarm

As content editors publish items there will be a small delay, 5 minutes, and then new publishers will be added to publish the content items in the queue. This is just an example of how you could do this. The way the alarm reacts and how quickly you scale up or down is all configurable. Earlier in this post I also mentioned that you typically see higher loads at certain times of the day. As such, we could just add a new publisher at 4 PM to handle the increased load proactively rather than reactively. The choice is yours on how you address this.
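For the proactive option, a scheduled action on the auto-scaling group is one way to do it. The sketch below is an illustration rather than what I actually ran: the function name and the action name are made up, and the client is passed in (you would normally hand it `boto3.client('autoscaling')`) so that the function can be tried without AWS credentials. Note that the recurrence is a cron expression that is evaluated in UTC unless you specify a time zone, so "4 PM" here means 16:00 UTC.

```python
def schedule_afternoon_scale_up(asg_client, group_name, desired_capacity=2):
    """Register a daily scheduled action that raises the desired capacity.

    asg_client is expected to behave like a boto3 'autoscaling' client.
    """
    return asg_client.put_scheduled_update_group_action(
        AutoScalingGroupName=group_name,
        ScheduledActionName="publisher-afternoon-peak",  # hypothetical name
        Recurrence="0 16 * * *",  # cron format: every day at 16:00 (UTC)
        DesiredCapacity=desired_capacity,
    )
```

You would pair this with a second scheduled action (or simply rely on the queue-based policy) to bring the group back down after the peak.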

When the auto-scaling group removes a publisher we need to be sure that it is done in a nice way. Earlier I mentioned that the publisher is stateless, but while it is rendering a content item, that item is held in memory and on disk. In a recent release of SDL Web, SDL added a graceful shutdown of the publisher, meaning it will finish what it is doing before it shuts down. When an instance in an auto-scaling group is terminated, processes do not get the chance to shut down cleanly, so we need to use a lifecycle hook to pause termination. Once set, the flow of termination for a publisher is as follows:

  • Lifecycle hook on termination pauses termination and waits
  • A CloudWatch Event is written to say the hook for our auto-scaling group is waiting
  • A CloudWatch rule traps the event and fires a Lambda function in response
  • The Lambda function uses AWS Systems Manager (SSM) to stop the publisher and issue a resume on the termination
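To make the third step a little more concrete, the CloudWatch rule needs an event pattern that picks out the termination hook of our auto-scaling group. The sketch below shows such a pattern as a Python dictionary, together with a deliberately simplified matcher so you can see which events it would catch. The matcher is only an illustration and not how CloudWatch evaluates patterns internally, and the sample event (including the instance id) is made up.

```python
import json

# Pattern for the rule: termination lifecycle actions of our publisher group.
EVENT_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    "detail": {"AutoScalingGroupName": ["SDLWebPublisherASG"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matching: every key in the pattern must exist in the event,
    lists hold the allowed values, nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "AutoScalingGroupName": "SDLWebPublisherASG",
        "LifecycleHookName": "SDLWebPubShutdown",
        "EC2InstanceId": "i-0123456789abcdef0",  # made-up instance id
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    },
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

The `detail` part of the event is what the Lambda function receives and reads the instance id from.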

Each of our publishers is a Managed Instance, which means that we can manage its configuration while it is running without needing to log on to the instance. Managing an instance can be a manual process or you can do it from a Lambda function. In this case, we are going to run a local PowerShell command to stop a service, the publisher. The code, written in Python, is as follows:

Define the libraries and the handler we will use:

import time
import boto3

# Key that holds the instance id in the detail of the lifecycle hook event
EC2_KEY = 'EC2InstanceId'

def lambda_handler(event, context):
    ssmClient = boto3.client('ssm')
    s3Client = boto3.client('s3')
    asgClient = boto3.client('autoscaling')

Set the details of the instance that is being terminated:

    message = event['detail']
    instanceId = str(message[EC2_KEY])
    lifeCycleHook = "SDLWebPubShutdown"
    autoScalingGroup = "SDLWebPublisherASG"

Send the shutdown command to the instance in the form of a PowerShell command and wait until it has completed:

    ssmCommand = ssmClient.send_command(
        InstanceIds = [ instanceId ],
        DocumentName = 'AWS-RunPowerShellScript',
        TimeoutSeconds = 240,
        Comment = 'Stop Publisher',
        Parameters = { 'commands': ["Stop-Service TcmPublisher"] }
    )

    # poll SSM until EC2 Run Command completes
    status = 'Pending'
    while status == 'Pending' or status == 'InProgress':
        time.sleep(3)  # short pause between polls so we do not hammer the API
        status = (ssmClient.list_commands(CommandId=ssmCommand['Command']['CommandId']))['Commands'][0]['Status']

    if status != 'Success':
        print("Command failed with status " + status)

End the waiting termination hook and the function:

    # Tell the auto-scaling group that termination can continue
    response = asgClient.complete_lifecycle_action(
        LifecycleHookName = lifeCycleHook,
        AutoScalingGroupName = autoScalingGroup,
        LifecycleActionResult = 'CONTINUE',
        InstanceId = instanceId
    )
    return None

Once we resume the termination, the instance is terminated and the auto-scaling group has been downsized.


Figure 3 – Complete Solution

As we can see in figure 3, we have a complete auto-scaling process which will:

  1. Measure the length of the publishing queue thus recording the current demand
  2. Use these measurements to scale the publishers horizontally to meet demand
  3. Gracefully scale back the publishers when demand subsides

In addition we received the following benefits:

  1. Because we create a metric, we can actively monitor the length of the queue on operational dashboards together with other metrics like CPU Utilization, Network and Memory (operational insights)
  2. We can easily provision new publisher instances and recycle bad instances quickly (improved agility)
  3. Reduction in the need for manual intervention (simplification)
  4. We improve the experience of the content editors in the use of the product

Selecting the right AWS region

Selecting the right Amazon Web Services (or AWS) region is an important step when deploying your workloads to AWS. In this blog post, I hope to briefly outline what a region is and share five factors which you should look at when selecting a region.

What is a region?
A Region is a geographically separated area of AWS, and AWS has 13 regions (as of October 2016) spread across the world. Each region has multiple Availability Zones (AZs) which allow the placement of resources and data in multiple locations within that region. Each AZ within a region is connected to the other AZs via low-latency connectivity, but remains an isolated unit capable of operating independently from the others. Because of this arrangement of regions and AZs, you do not need to combine two regions to achieve high availability; rather, you need to spread your resources across multiple AZs. All regions are accessible from the same AWS account, with the exception of the China and GovCloud regions, which require separate AWS accounts.

Five factors to consider in choosing a region:

In AWS, data in a region does not leave that region unless a customer selects a different region for the data. Data such as backups or data replicas can exist in any region, but AWS itself never moves data outside a region; this has to be done explicitly by the customer. If you are required by EU law to keep data in certain regions, either in the EU or within a specific country such as Germany or France, then it is possible to do so by utilizing the most appropriate region for that data. Compliance is a complex topic and use case specific, so I would advise reading up on the compliance requirements for your given workload and industry.

Service requirements
Whilst all customers of AWS receive a uniform platform of services, not all regions have yet received all services from AWS. Newer regions, especially, have yet to have some services and you should check if the AWS services your workload needs are available in a given region.

The cost of AWS services differs per region because the underlying costs of operating a region differ (e.g. building costs). You should review whether this negatively impacts your workload and whether a different region is perhaps more cost effective. You can compare costs using the AWS Simple Monthly Calculator to determine what the difference will be for your workload.

Latency is the time taken to receive a response from, in this case, resources running in an AWS region. In order to reduce latency to a minimum, your workload should be located as near to the end user as possible. If your users are mostly based in North America, then a region in the US would most likely have the lowest latency of all AWS regions and would therefore make the best choice of region. With a geographically spread user base, you can employ multiple regions or Amazon’s CloudFront CDN to improve latency for more remote users.

Carbon Neutral
AWS has a long-term commitment to achieve 100% renewable energy usage for its global infrastructure footprint and aims to run 40% of its power consumption on renewable energy sources by the end of 2016. In line with this goal, AWS currently runs four carbon neutral regions, which means that workloads placed in those regions can contribute to your own organization’s goals of being carbon neutral. AWS makes it possible for customers to run fewer resources through technologies such as auto-scaling, and AWS regions are also more efficient and consume less power than average corporate data centers, all of which helps reduce the carbon emissions of your workloads by 88%.

New AWS features – September 2016

Occasionally I will share some new features from AWS that sparked my interest, so here are some new features that I like:

  • Upload AWS Cost & Usage Reports to Redshift and QuickSight (link)  –  increased flexibility and efficiency in generating reports on billing. You could also load in other datasets to Redshift to create reports combining other data such as hour reporting from engineering teams
  • AWS Config Rules is now available in Singapore (link) – AWS Config Rules is a way of controlling the configuration of large numbers of servers and is now available in more regions
  • Organize Your AWS Resources by Using up to 50 Tags per Resource (link) – it was 10 tags, which could be restrictive, especially if you use tags to contain metadata to control resources (e.g. startup time), so it makes tagging more flexible
  • Amazon RDS for Oracle introduces a License Included offering for Oracle Standard Edition Two (SE2) (link) – expanded Oracle database types
  • New AWS Application Load Balancer (link) – routing based on URL now gives you more flexibility in how you architect your applications and spread load over different instances

AWS Elastic File System

Yesterday in a knowledge session between Solution Architects, the topic of AWS Elastic File System was raised and after a short discussion it was decided to take a closer look and set something up. To quote Top Gear, how hard could it be?

What is EFS?

AWS Elastic File System, or EFS, is Amazon Web Services’ latest storage solution and is a fully managed, simple and scalable file storage service to use with EC2 instances. As the name suggests, it grows and shrinks automatically with your storage needs, and EC2 instances can access EFS using NFS (v4.1) over multiple availability zones at low latency with high throughput (50 MB/s per TB with 100 MB/s burst). AWS lists the use cases of EFS as: Big Data and analytics, media processing workflows, content management, web serving and home directories. Content Management you say? Hmmm 🙂

In my past experience, scalable single sources of file system based content were expensive and difficult to deploy. So much so that product and implementation strategy meant that putting all content in a database was by far and away the most logical route to take. So could EFS now resolve that headache? I will give it a test to find out.

What do I have to set up?

So I will simulate a website setup where I have an application server tier that would host my Tomcat (or similar) application servers and a back-end file system which will be mounted to my application servers so that the files can be used. Onto my file system I will deploy my content. I won’t install or configure Tomcat; this is simple to do but covered very well in other places.

The simple architecture


So, I will need

  1. An auto-scaling group covering two availability zones (eu-west-1a, eu-west-1b) with two instances of Amazon Linux (no Tomcat, no auto-scaling rules for now)
  2. Security Group to allow my auto-scaling instances to talk NFS to my EFS
  3. An EFS created and mounted to my instances

For my auto-scaling group, I have gone and created a simple one and it is up and running across my two availability zones. I have gone and terminated an instance or two just for fun. That’s not related to this post, it is just fun to terminate something and watch it auto-magically reappear.

My security group allows instances that are members of my auto-scaling security group to access the EFS volumes via the NFS protocol.

My Security Group


I can now create my EFS for my website content.

I first need to configure the file system access, which consists of my VPC, my mount targets (availability zones) and the security group that defines the source of access requests (the one I created earlier):

Configuration of EFS


Then I configure the optional settings. I have chosen to give it a friendly name and stuck to the default “Performance Mode” of general purpose.

Configure the EFS options


The final review step and then I am done. That was it. No configuring disk sizes, difficult calculations on my requirements of how much content I have. It’s done.

Review what I did


After a short while my volumes are ready, and I can keep track of the status of creation in the main EFS dashboard under “life cycle state”.

After a short while they will be ready


Next we are going to test drive mounting my volume on my instance. EFS provides some instructions for doing this from the dashboard. Running in an SSH session (from the root):

Step 1: If needed, install the NFS client on your EC2 instance

sudo yum install -y nfs-utils

Step 2: Create a new directory on your EC2 instance, such as “efs”

sudo mkdir efs

Step 3: Mount your file system using the DNS name.

sudo mount -t nfs4 -o nfsvers=4.1 $(curl -s efs

Once that is done I can switch to the directory and create myself a simple index.html file for my eventual Tomcat server to see. If I then log on to my other instance, I can see that my file has been replicated from the first availability zone to the next. This means that if I write my content to disk as I have done, it is available instantly in the other availability zones and all my sites are updated.

As I did this manually, I would need to do it each time my auto-scaling group scales. This defeats the purpose of auto-scaling. However, if I mount this directory at instance initialization time (e.g. with Chef) then it would be mounted when my new instance starts. To test this I made a very simple launch script and updated my Launch Configuration (made a new one, as edits are not possible) to add the following to the user data portion of the configuration.

cd /
sudo mkdir efs
sudo mount -t nfs4 -o nfsvers=4.1 $(curl -s efs

Warning: I would not use this code in production. No really, please don’t.


The most complicated thing about this is to mount the drives as creation of the fully managed and scalable storage is incredibly easy. For content management systems, like SDL Web (Tridion) this is a real help in deployment of content in a scalable and reliable way.

Amazon Web Services, Simple Storage Service (S3)

Having recently joined Amazon Web Services (AWS), I need to deep dive into all the services to understand the features in as much detail as I can. Simple Storage Service (or S3) has been a recent focus of mine, partly for learning but also because I need to support my customer with questions on S3. So what is S3? We shall start with a quote from the AWS documentation that describes it in one paragraph much better than I can:

Amazon Simple Storage Service (Amazon S3), provides developers and IT teams with secure, durable, highly-scalable cloud storage. Amazon S3 is easy to use object storage, with a simple web service interface to store and retrieve any amount of data from anywhere on the web. With Amazon S3, you pay only for the storage you actually use. There is no minimum fee and no setup cost.

Object Storage

S3 is an object-based store, which means it does not store files like a file system but rather as objects, which consist of both the file itself and its metadata. Metadata contains information about the data (the file) and can be used to support application behaviors and administrative actions. These objects are organized into buckets, and each bucket needs to have a unique name across S3, which means that you need to be a little creative in naming your buckets because chances are someone has already taken the name “test”. Buckets can store objects from 1 byte to 5 TB, and objects can be organized into folders and subfolders.


S3 has access policies set on it that dictate who and what can access a given S3 resource (e.g. object, bucket). Access policies that are attached to resources are called resource-based policies. If you attach an S3-related policy to a user in your AWS account, then this is referred to as a user policy. User policies may say whether the given user has permissions over a bucket, whereas a resource policy may state that “everyone” has access to read a bucket. A combination of policy types can be used to manage access to the objects.

By default, buckets are closed to the outside world, so you need to open up access if you want them to be used by other resources or users, and you can set access to be as fine-grained as you require. It’s generally a good policy to restrict access as much as possible and to implement features like MFA delete to ensure that it’s harder to make mistakes.
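To give an idea of what a resource-based policy looks like, here is a minimal sketch that builds the classic "everyone may read objects in this bucket" bucket policy as JSON. The function name and the bucket name are made up for illustration, and you would only attach something like this to a bucket whose content really is meant to be public.

```python
import json


def public_read_policy(bucket_name: str) -> str:
    """Build a bucket policy document that lets anyone read the objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",  # "everyone"
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


# "example-bucket" is a hypothetical bucket name
print(public_read_policy("example-bucket"))
```

A user policy has the same statement structure but is attached to a user and therefore has no `Principal` element.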

Durable (and available)

One of the major features of S3 is both the availability and durability of the service. In on-premises environments, you needed to go to a lot of expense and effort to ensure that your storage was available to a high standard. S3 rolls off your mouse click with, mostly, four nines of availability. Why only mostly four nines? You do not always need such levels of availability, so AWS has a different S3 storage class called “Infrequently Accessed” or IA storage. This storage class drops the availability down to three nines and should only be used for data you need from time to time and for which it is not really a major issue if it is unavailable (for example, old product documentation).

If your data is not available, you can be very sure that it has not gone anywhere. The difference between availability and durability is the difference between “is your data accessible?” and “is your data still there?”, respectively. You can lose access to the data but be sure that it is still going to be there when access is restored. For most storage classes the standard is a startling eleven nines, which in essence means it is near impossible to lose an object. “Reduced Redundancy Storage” or RRS has a lower durability and should only be used for things you can afford to lose, such as copies of data or temporary data. Still, at four nines, I would class it as highly durable.

Storage Class   Durability      Availability
Standard        99.999999999%   99.99%
Standard IA     99.999999999%   99.9%
GLACIER         99.999999999%   99.99% (after you restore objects)
RRS             99.99%          99.99%
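Percentages with this many nines are hard to picture, so here is a small sketch that turns them into something more tangible: the maximum downtime per year implied by an availability figure, and the number of objects you would expect to lose per year for a given durability. The function names are mine, and I am assuming the durability figure applies over a year, which is how AWS describes it.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be unavailable at this availability."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


def expected_objects_lost_per_year(object_count: int, durability_percent: float) -> float:
    """Average number of objects lost per year at this durability."""
    return object_count * (100.0 - durability_percent) / 100.0


# Four nines versus three nines of availability
print(f"99.99%: {max_downtime_hours_per_year(99.99):.2f} hours per year")
print(f"99.9%:  {max_downtime_hours_per_year(99.9):.2f} hours per year")

# Ten million objects on eleven nines versus RRS at four nines
print(expected_objects_lost_per_year(10_000_000, 99.999999999))
print(expected_objects_lost_per_year(10_000_000, 99.99))
```

Four nines of availability works out to a little under an hour per year, three nines to just under nine hours. On the durability side, ten million objects at eleven nines means losing on average about one object every ten thousand years, whereas the same objects on RRS would lose around a thousand per year.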



S3 is highly scalable and you do not need to do anything to enable that, it’s all part of the service.

Storing and retrieving

Objects can be uploaded, updated, deleted etc. from the AWS management interface. However, for normal use the most likely way to deal with objects is programmatically via the SDKs that talk to S3’s RESTful interface, probably via an integration with a product that uses S3 as a storage tier. S3 has a consistency model of “Read after Write consistency” for PUTs of new objects and “eventual consistency” for overwrite PUTs and DELETEs. This means that when you PUT an object for the first time, it will be readable directly after being written. However, when you overwrite an object with a PUT, the consistency is eventual, meaning the new version will be available on all replicas eventually. There is therefore a chance that applications read the older version of an object if they read it after the object is overwritten but before it is consistent across all replicas.

For large (>100 MB) uploads, you should consider multi-part uploads. In multi-part uploads, the file you are uploading is broken into pieces that are sent separately. S3 assembles the parts back into the complete file when all the parts have arrived. Doing so not only improves throughput (e.g. uploading in parallel, but also uploading whilst creating) but also improves the reliability of your uploads (e.g. recovering from network errors, needing to pause).
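The splitting itself is simple arithmetic. The sketch below only plans the parts for a file of a given size; it does not talk to S3, and the function name and the default part size of 8 MiB are my own choices. It does respect the two S3 limits I am sure of: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def part_ranges(total_size: int, part_size: int = 8 * 1024 * 1024):
    """Return a list of (part_number, start, end) byte ranges, end exclusive."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    ranges = []
    start = 0
    number = 1  # S3 part numbers start at 1
    while start < total_size:
        end = min(start + part_size, total_size)
        ranges.append((number, start, end))
        start = end
        number += 1
    if len(ranges) > MAX_PARTS:
        raise ValueError("too many parts, choose a larger part_size")
    return ranges


# Plan the parts for a 20 MiB file
for number, start, end in part_ranges(20 * 1024 * 1024):
    print(f"part {number}: bytes {start} to {end - 1}")
```

Each range can then be uploaded independently, in parallel or retried on its own, which is where the throughput and reliability gains come from. The SDKs generally handle this for you, so treat this as an explanation rather than something you need to write yourself.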


For S3 you pay only for what you use, and there are no setup costs or up-front fees. The AWS website holds details of the costs, which differ per storage class (e.g. per GB). This means that with good storage planning you can save significant expenditure. Organizations that have existing data on, say, standard S3 could also remodel their storage to improve its cost effectiveness.

Other important S3 features

Lifecycle Management

Lifecycle Management allows you to manage the lifetime of objects in your S3 buckets against rules you have defined. A simple use case of this is managing backup data. For backups, you typically have a policy that dictates how long you store your backups.

For example, you keep daily backups for the last 30 days and then a monthly backup for the last 12 months.

This means that you need to automatically remove monthly backups older than 12 months and daily backups older than 30 days. With lifecycle management you can do this. Moreover, you can add additional S3-based rules. For instance, you could decide to keep the last 7 days of backups on standard storage, then the 7-30 day backups on Infrequently Accessed storage, and then all the monthly backups on Glacier. All of which will lower the cost of storing the backups.
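As a rough sketch of how the backup example could be expressed, the function below builds a lifecycle configuration in the shape that boto3’s `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The function name, rule IDs and key prefixes are hypothetical. I have left the Infrequently Accessed step out of the sketch because, as far as I know, S3 places minimum-age constraints on some transitions, so check the documentation before adding it.

```python
def backup_lifecycle_configuration() -> dict:
    """Lifecycle rules for daily and monthly backups (prefixes are made up)."""
    return {
        "Rules": [
            {
                "ID": "expire-daily-backups",
                "Filter": {"Prefix": "backups/daily/"},
                "Status": "Enabled",
                # Remove daily backups after 30 days
                "Expiration": {"Days": 30},
            },
            {
                "ID": "archive-monthly-backups",
                "Filter": {"Prefix": "backups/monthly/"},
                "Status": "Enabled",
                # Move monthly backups to Glacier a day after creation
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
                # Remove monthly backups after 12 months
                "Expiration": {"Days": 365},
            },
        ]
    }


for rule in backup_lifecycle_configuration()["Rules"]:
    print(rule["ID"], "->", rule["Expiration"]["Days"], "days")
```

The nice part is that once the rules are on the bucket, nobody has to remember to clean up old backups.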


Versioning allows you to keep versions of objects as they are updated with new objects and is used in combination with Lifecycle Management. For each bucket you want to use it on, you need to enable it (as it costs storage), but it then makes it more difficult to permanently lose something.

Cross-Region Replication

Cross-Region Replication allows you to asynchronously copy data from one S3 bucket to another in a different region. S3 is a region based service and data is never moved from a region without a customer enabling this function. So, like versioning, you need to enable this on your bucket and decide which destination bucket in which region that will be the target of replication and what to replicate (all or a subset of the bucket).

To do this you need to have version-enabled buckets (which also need Lifecycle Management enabled), you need to have two buckets in two regions, and S3 needs permission to replicate the data from one bucket to the other. It’s important to note what is and is not replicated, because things like Lifecycle Management need to be dealt with per region and not via replication.

All change @ me

Sonia the Hippo enjoys the mud of the lake at Longleat


In the recent past I have been watching a program on the BBC called “All Change at LongLeat”. Longleat is ingrained in Britishness, being the home of Lord Bath and his safari park. Lord Bath’s son and his wife have moved in to take over the day-to-day running of the house and park, and the change is not without some problems. The estate needs to change if it is to remain viable and there is constant friction between the old and the new. Ceawlin, Lord Bath’s son, seemed to know what was needed, but those who heavily influenced the situation, like Lord Bath, made transforming the estate a challenge. I got an appreciation for the way Ceawlin dealt with things; drawing lines and borders, sometimes harshly, he managed to get things changed. I think he drew most of those borders to keep his sanity whilst keeping his eyes on the end goal.

2015 had to be my “Annus horribilis”. If it could go wrong, it did, and even when you thought there was no surprise left, there was something waiting around the corner. However, it’s now 2016 and I have drawn borders and lines under things and am making this a year of change. Since 2003 I have worked for SDL (I started when it was just the Dutch company Tridion) and I have enjoyed an awesome 11.5 years and have traveled the world. I have had a wealth of good times, worked with truly great people, learnt a lot and have grown hugely as a person. There were good times and bad times but the good outweighed the bad. But sometimes, things have to change.

In early March, I started as a Solution Architect for Amazon Web Services. Since then I have been overwhelmed by the freight train of new things and new people. I am sad to miss the awesome people I have worked with over the past years, some of whom I count as friends, but I am enjoying the change of scene and the new technology.

What is Hybrid Cloud?

The term hybrid cloud is becoming more common among customers as they adopt cloud as part of their overall IT strategy. But what really is a hybrid cloud and why is that different to cloud as a strategy?

© Dilbert

A few years ago the term “cloud strategy” was used to describe the adoption of a cloud vendor. Ultimately, some of these cloud vendors were (and still are) traditional hosting companies who use the term cloud. I have always pushed the fact that cloud really means a true, public, cloud vendor such as Azure or AWS. Hybrid has now taken over as the strategic approach and has been described by Forrester’s Dave Bartoletti as “cloud plus anything.” This can mean any of the following:

  • AWS & Azure
  • Cloud vendor (e.g. AWS) & private data center (private cloud)
  • Cloud vendor (e.g. AWS) & hosting party
  • etc.

In essence, hybrid means two (or more) separate environments, one of which is a public cloud, that need to work together as one cohesive unit. Adopting a hybrid cloud strategy makes strategic sense, as you cannot just lift and shift everything to a cloud in one go, and most likely your strategy will evolve into a multi-cloud strategy once you are free from all (or most of) your private data centers. So hybrid is the logical step between the two.

© RightScale

The RightScale State of the Cloud report indicates that only 10% of enterprise businesses use the public cloud fully and 16% are fully private cloud. That means 74% have a mixture of two or more clouds (public and private). Things holding back the 16% are topics such as security concerns, outdated policies that prevent adoption or simply a lack of understanding of the benefits over systems already adopted (such as virtualization).

As one size does not fit all, organizations should be looking to train their teams in cloud and to get a full understanding of how they can move existing workloads to the cloud and in what architecture. This is especially important where organizations manage what could be classed as commodity applications (e.g. SharePoint, CMS systems) that can be purchased as a PaaS or SaaS offering from a public cloud or product vendor. More specialist, differentiating and legacy applications could continue to be run in a private cloud but leverage technology from the cloud such as off-shore backup or “single pane” management technology.

Building a successful, cohesive hybrid cloud will take a significant amount of time for most enterprise organizations, and we can expect mainstream adoption to be five years away. Starting this strategy now is the best way to stay on track with Gartner’s five-year expectation.

10 things to do when deploying SDL Web to the cloud

Now that cloud is on the uptake with almost every organization, it’s likely that during the next upgrade or new implementation of SDL Web you will be asked to move it to a cloud provider like AWS. What you should not do during this move is fall into the trap of just porting some virtual machines to the cloud and running it like you did before. The cloud is better than that (and it is 2016, not 2005), so you should investigate a little deeper into how you should deploy. So to help, here are 10 things to consider:

1.       Use SDL Web 8

It should go without saying that you should use the latest version of a product, but some organizations hold back from such steps until a product version reaches SP1. With the new product release, however, you get better support for the cloud from SDL. You can read my earlier post on the new infrastructure features that help deployment on the cloud, but the main one you need to shoot for is the support for database-as-a-service from AWS or Azure.

2.       Elastically scale delivery

In this nice article, the AWS Startup folks at Medium explain that the minimum viable product must scale in order to be a success; they are spot on. As your product is the website you are building, not implementing automatic scaling of delivery (or using something like AWS Elastic Beanstalk) should now be counted as a crime against humanity.

3.       Automate the deployment of your environment

Automating the deployment of an environment is more than saving a VM template of your built servers; it means automating the configuration management, topology, software installations etc. This is essential in auto-scaling but deeply important when you are planning things like disaster recovery. The old school methods can go if you can automatically rebuild an environment inside 30 minutes.

4.       Implement Continuous Delivery and then Continuous Deployment

Rome was not built in a day and therefore you need to take this slowly, but most cloud providers have a CD pipeline available that integrates with the deployment options available at that cloud provider (see Visual Studio Online or AWS CodePipeline). Get to the point where you can deploy a new version of the site (or part of it) multiple times per day in a robust way.

5.       Scale Publishers when needed

The number one complaint about publishing is that it is slow when deploying lots of content. Well, following point four, you should now be able to spin up more publishers when needed. This does not per se need to be automatic in response to load (but it could be), but it should at least be an automated process leveraging a graceful shutdown and destroy.

6.       Process log files with a tool like Splunk

Now that you have servers spinning up and down automatically, you probably have little or no access to the running application. It is therefore important to put in automatic log file analysis so that you can ensure the application is running error free, spot failure trends and keep the overall health of the environment high. Applications will fail and that is OK, but you need data to proactively reduce the errors, feed into your continuous delivery pipeline and improve performance.

7.       Write custom monitors for SDL Web functionality

The cloud provides you with metrics for things like CPU and memory, but there are no monitors for specific and relevant SDL Web functionality. Is the publishing queue a little too long? Fire a warning to your integrated monitoring solution that something may need to change like a new publisher spun up to help with the load.

8.       Deploy new CD environments for temporary sites

You can easily spin up a new delivery environment should you need to deploy a new site that will only last for a short period of time. This keeps complexity low on each site, and the new site cannot impact any other site.

9.       Adopt a microservice architecture

Architecting your application in the microservice model means that CDaaS can be utilized and sites can just feed content from that. Further splitting your application into smaller functional components which can all be scaled separately will reduce software costs, reduce deployment complexity, improve resilience and improve scaling.

10.   Test performance and scaling actively

Too often this sits at the bottom of the pile of things to do. Automatic scaling does not remove the need to test performance; in fact, it makes it more important. In years gone by, if the site did not perform it just got slow and then probably crashed. Sadly, website visitors were used to that, but it does little for your business reputation. Now that we can scale automatically, all this essentially means is that we keep adding new instances/servers until we meet demand. And what follows is a small heart attack when you read the monthly invoice.

Instead, it is now more important to keep your application performing well. Performance should be tested in the CD pipeline as well as on a frequent basis in production. Plenty of tools exist to support this and can help with testing from different parts of the world if your site needs to respond to a global customer base.
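
To make points 5, 7 and 10 a little more concrete, here is a minimal sketch of a custom monitor that looks at the length of the publishing queue and works out how many publishers you would want running. Everything in it is an assumption for illustration: the `QueueSnapshot` type, the figure of 200 queued transactions per publisher and the cap of 6 publishers are invented, and how you actually read the queue length from SDL Web is left out. The cap is there precisely to avoid the invoice-induced heart attack.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    waiting_transactions: int  # items currently waiting in the publishing queue
    running_publishers: int    # publisher servers currently active


def desired_publishers(
    snapshot: QueueSnapshot,
    per_publisher: int = 200,  # assumed comfortable backlog per publisher
    min_publishers: int = 1,
    max_publishers: int = 6,   # hard cap so the bill stays predictable
) -> int:
    """Return how many publishers the current backlog calls for, within limits."""
    needed = math.ceil(snapshot.waiting_transactions / per_publisher)
    return max(min_publishers, min(max_publishers, needed))


def evaluate(snapshot: QueueSnapshot) -> str:
    """Turn a snapshot into a message for your monitoring solution."""
    target = desired_publishers(snapshot)
    if target > snapshot.running_publishers:
        return f"WARN: queue={snapshot.waiting_transactions}, scale out to {target} publishers"
    if target < snapshot.running_publishers:
        return f"INFO: queue={snapshot.waiting_transactions}, scale in to {target} publishers"
    return f"OK: queue={snapshot.waiting_transactions}, {target} publishers is enough"


print(evaluate(QueueSnapshot(waiting_transactions=950, running_publishers=2)))
```

Whether the message triggers an automated scale-out or just lands in front of a human is a design choice; the important part is that the metric is specific to SDL Web rather than generic CPU or memory.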

New Infrastructure Features of SDL Web 8

SDL recently released the latest version of their web content management system, and this release has some interesting changes from an infrastructure perspective that I would like to highlight.

Product Name Change

Firstly, whilst not an infrastructure change, the product has changed its name from SDL Tridion to SDL Web. The new name, Web, says a little bit more about what it does (at a very basic level) but does somewhat reject the kudos and history that comes with the name Tridion. The name “Tridion” is distinct and easily recognizable; Web is a little more generic and bland in my opinion.

The version number for the new release is 8, which is a throwback to the “R” releases of Tridion before SDL took over the Tridion company. The last product named in the “R” series was R5.3 and since then there have been the releases 2009, 2011, 2013 and now 8. It is not immediately obvious why Web 8 is not called Web 9, but 2009 and 2011 are really R5.4 and R6 respectively, which would make 2013 the seventh major release.

Improved Cloud Support

The new release has some heavy focus on changes to make it easier to deploy and manage Web. First up is the improved cloud support. SDL always did support “the cloud” through the proxy of supporting specific operating systems and provided they ran as normal it did not really matter where they ran. This meant that an IaaS based deployment of Tridion was always possible.

What SDL now means to say is that SDL now “supports specific features of some cloud providers”. Those are only AWS and Azure and nothing is mentioned about Google or Oracle as a platform. SDL Web 8 adds support for Azure SQL and Amazon RDS. The documentation states “Azure and Amazon RDS” but this is an oversight as it means “Azure SQL” as Azure is the Microsoft cloud platform rather than a specific piece of technology.

All this means is that you can now take advantage of the database-as-a-service offerings from these two providers, provided you are not using the SDL Web legacy pack (e.g. for VBScript templates), transactional core service code (you can write this out) or implementations with certain extensions. This is because Tridion traditionally made use of distributed transactions, which are not supported on AWS or Azure, and legacy style code still needs MSDTC.

In all cases, you can only use the SQL Server engine, which has a cost impact over and above the MySQL engine options on AWS RDS but is cheaper than the Oracle engines.

If you use a version of SDL Web prior to Web 8, you should note that you can use Azure SQL or AWS RDS for the Content Delivery database but it is simply not supported by SDL.

Topology Management

Topology Management is a new product feature that replaces the existing (now deprecated) publishing management (e.g. Publish Targets) with a more advanced approach. It clearly de-couples the configuration of publishing from the management of content and ties the configuration of a delivery environment to the environment (e.g. production) rather than to the content management database. Managed through PowerShell, the Topology Manager manages the relationships between publications and delivery environments. There is also a .NET API, which makes automation possible from other applications that cannot use PowerShell; an example of using that API can be found here.

Some key terminology to grasp with topology management itself:

  • Content Delivery Environment: in essence just the same as before and is communicated with through a Discovery Endpoint
  • Topology Type: defines the purposes like “Staging” and “Live” and can define a series of purposes to help support a publishing workflow (e.g. Staging Editors -> Staging Executive Approval -> Live)
  • Topologies: combines one or more content delivery environments which have a particular Topology Type (including Purposes).

And on the Content Management side:

  • Target Types are the same as before in that they are what the user selects when publishing, and each has “a” Purpose, e.g. “Live”
  • Business Process Type defines how content flows through the organization (published or not). In this context it defines which Topology Type and Target Types apply, and it sets the Minimal Approval Status and the priority of a target type, both of which used to be in the publication target. The Business Process Type is set in a publication and can be inherited by the child publications.
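
To help keep the terminology apart, the relationships can be sketched as a small data model. This is purely an illustration of how the concepts relate, not SDL’s API: the class names, the `resolve_environments` helper and the discovery endpoint URLs are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TopologyType:
    name: str
    purposes: Tuple[str, ...]  # e.g. ("Staging", "Live")


@dataclass(frozen=True)
class CdEnvironment:
    name: str
    discovery_endpoint: str  # how the environment is communicated with
    purpose: str             # which purpose this environment serves


@dataclass(frozen=True)
class Topology:
    name: str
    topology_type: TopologyType
    environments: Tuple[CdEnvironment, ...]

    def __post_init__(self) -> None:
        # Every environment must serve a purpose its topology type defines.
        for env in self.environments:
            if env.purpose not in self.topology_type.purposes:
                raise ValueError(f"{env.name}: unknown purpose {env.purpose!r}")


@dataclass(frozen=True)
class TargetType:
    title: str
    purpose: str  # the single purpose the user publishes to


def resolve_environments(target_type: TargetType, topology: Topology) -> List[CdEnvironment]:
    """Find the delivery environments a target type ends up publishing to."""
    return [env for env in topology.environments if env.purpose == target_type.purpose]


web = TopologyType("Web", ("Staging", "Live"))
production = Topology(
    "Production",
    web,
    (
        CdEnvironment("cd-staging", "http://staging.example.com/discovery", "Staging"),
        CdEnvironment("cd-live-1", "http://live1.example.com/discovery", "Live"),
        CdEnvironment("cd-live-2", "http://live2.example.com/discovery", "Live"),
    ),
)
live_names = [env.name for env in resolve_environments(TargetType("Publish to Live", "Live"), production)]
print(live_names)
```

The point of the sketch is that the user only ever picks a Target Type with a Purpose; which servers that means is decided by the topology of the environment, not by anything stored with the content.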

The features increase the complexity of publishing management and it is a little frustrating there is no user interface as this was one of the nice features of the current publishing approach. Whilst I support the move, it has yet to be seen how much this would be an advantage in an automation / NoOps approach over what you could already do through existing APIs.

For now, there is no need to change to the new approach, so the advice to customers would be to sufficiently test with the new approach before rolling into a production situation.

More Graceful Publishing Management

SDL has improved how you can manage publishing services. Prior to Web 8, when a publishing service was stopped, it simply forgot everything it was doing (much like a “kill -9”). With the advent of technologies like auto-scaling, this makes simply stopping a service a royal pain because a service may be busy with something that you simply just do not want to forget about. These new features are only available in the new publishing approach described above:

  • Pausing the Publisher service: Pausing means that the Publisher does not pick up any new transactions, but does keep processing deployment feedback on items already sent for deployment. Assuming you can test if it has completed all feedback items it had open (?) then a graceful destroy of a server could take place.
  • Graceful shutdown of Publisher service: Shutting down the Publisher service will allow the following to take place before shutdown is completed: all transactions that have not yet been transported are put back in the publish queue (set to Waiting For Publish), and all transactions that have the transaction state “Scheduled for Deployment” have sent their commit packages to transport.
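
The pause-then-destroy idea can be sketched as a small drain loop for an auto-scaling scale-in hook. This is a sketch under assumptions: the `Publisher` interface below (`pause`, `open_feedback_items`, `shutdown`) is hypothetical, and as noted above it is not clear that the product lets you query open feedback items at all. The `FakePublisher` only exists so the example runs on its own.

```python
import time
from typing import Callable, Protocol


class Publisher(Protocol):
    """Hypothetical control interface for a publisher server."""

    def pause(self) -> None: ...
    def open_feedback_items(self) -> int: ...
    def shutdown(self) -> None: ...


def drain_and_stop(
    publisher: Publisher,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Pause, wait for open work to finish, then shut down.

    Returns True when the server is safe to destroy, False when the
    timeout passed and the publisher was left paused for a human to check.
    """
    publisher.pause()  # stop picking up new transactions
    deadline = clock() + timeout_s
    while publisher.open_feedback_items() > 0:
        if clock() >= deadline:
            return False
        sleep(poll_s)
    publisher.shutdown()
    return True


class FakePublisher:
    """Stand-in that finishes one open item each time it is polled."""

    def __init__(self, open_items: int) -> None:
        self.open_items = open_items
        self.state = "running"

    def pause(self) -> None:
        self.state = "paused"

    def open_feedback_items(self) -> int:
        current = self.open_items
        self.open_items = max(0, self.open_items - 1)
        return current

    def shutdown(self) -> None:
        self.state = "stopped"


fake = FakePublisher(open_items=3)
safe_to_destroy = drain_and_stop(fake, poll_s=0.0)
print(safe_to_destroy, fake.state)
```

Returning False on timeout rather than forcing a stop is deliberate: a publisher that will not drain is exactly the case you do not want to “kill -9”.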

With delivery environments being started and stopped a little more dynamically, there are some additional options to help manage them:

  • Graceful deactivation and reactivation of a Content Delivery environment: halting a Content Delivery environment for maintenance is now possible from the Content Manager server. The documentation is unclear about what happens to publishing transactions being pushed to other sites, as well as what happens to records (e.g. audience manager) that are written to redundant databases belonging to other deployment stacks.
  • Decommissioning of a Content Delivery environment: You can decommission an entire Content Delivery environment without having to unpublish content first.

Content Delivery as a Service (CDaaS)

The major change from an architecture perspective for delivery is the introduction of Content Delivery as a Service, or CDaaS for short. This new feature means that web applications can feed content from SDL Web using a microservice approach. This approach allows non-Tridion (Web) skilled teams to talk to Web and minimizes any impact the libraries for Web would have on other applications.

Development and maintenance of the CDaaS and connected web applications can all happen separately on their own development tracks, and upgrades to CDaaS will not affect the applications using the content (assuming the interfaces are backward compatible). You can scale the CDaaS microservice separately from your other application services (a concept drawn from the microservices architectural approach), which means that applications that are not content rich need not have large content delivery farms.
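
From the web application’s side, feeding content from a service could look something like the sketch below. It is only an illustration of the decoupling: the base URL, the `/pages` path and the `publication` and `path` query parameters are invented and are not the actual CDaaS interface. The fetch function is injectable so the example runs with a stub and no network.

```python
import json
from typing import Any, Callable, Dict
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=5.0) as response:
        return response.read().decode("utf-8")


class ContentClient:
    """Tiny client a web application could use instead of CMS libraries."""

    def __init__(self, base_url: str, fetch: Callable[[str], str] = http_get) -> None:
        self._base_url = base_url.rstrip("/")
        self._fetch = fetch

    def page_url(self, publication_id: int, path: str) -> str:
        # Hypothetical endpoint and parameter names, for illustration only.
        query = urlencode({"publication": publication_id, "path": path})
        return f"{self._base_url}/pages?{query}"

    def get_page(self, publication_id: int, path: str) -> Dict[str, Any]:
        return json.loads(self._fetch(self.page_url(publication_id, path)))


def stub_fetch(url: str) -> str:
    """Stand-in for the content service so the example needs no network."""
    return json.dumps({"title": "Home", "requested": url})


client = ContentClient("http://cdaas.example.com/content/", fetch=stub_fetch)
page = client.get_page(5, "/index.html")
print(page["title"])
```

The application only depends on HTTP and JSON, which is what lets a team without Tridion skills consume the content and lets the service be upgraded and scaled on its own track.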

Discovery Service

The Discovery Service is now the know-it-all of the delivery farms, with the centralized web service being the go-to point for understanding which content delivery endpoints are deployed. The topology manager (see above) only needs to talk to the Discovery Service to get everything it needs to know.

[ Update 9th Feb 2016 – Thanks to Nuno Linhares for some corrections on the version numbers and the information regarding the .NET client for the Topology Manager]

Continuous Delivery vs Content Management

Recently I had a discussion where the initial question somewhat baffled me. Having thought about it more, I want to write something about it to see if I can come to a nice conclusion. The question was: is Continuous Delivery a threat to Content Management? The form of the question presupposes that the asker thinks Continuous Delivery actually is a threat to Content Management, but why?

As someone who tries to take the independent stance but heavily leaning on Content Management for the staple of work, my initial reaction is that no, Content Management is no threat to Continuous Delivery, but nor is Continuous Delivery a threat to Content Management. Both have a place in any internet delivery environment and such a question is a little like comparing apples and pears. But for kicks, let’s look at it in a little more detail.

What is Content Management (CMS)?
Specifically, we are talking about Web Content Management (rather than the general definition). Wikipedia describes this as:

A web content management system is a software system that provides website authoring, collaboration, and administration tools designed to allow users with little knowledge of web programming languages or markup languages to create and manage website content with relative ease. A robust Web Content Management System provides the foundation for collaboration, offering users the ability to manage documents and output for multiple author editing and participation. (source:

Systems like SDL Tridion Web make good on the following: allowing non-technical users to edit site content (and even manipulate layout), collaborate on content, and version and reuse content across multiple channels and sites. Some systems allow for additional integrations to support content creation, such as translation systems, DAM etc., and changes can be made to a production website in a matter of minutes. Not all CMS platforms support direct updates; some rely instead on periodic refreshes of the content.

What is Continuous Delivery (CD)?
Continuous Delivery differs in what it, as an approach, is trying to resolve. Wikipedia describes it as:

Continuous Delivery is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time. It aims at building, testing, and releasing software faster and more frequently. The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production. A straightforward and repeatable deployment process is important for continuous delivery. (source:

The focus here is on agility of changes in a development lifecycle with a heavy focus on automation of repetitive tasks to lead to productivity and quality improvements. These automated stages are things like build, test and deployment. They feature integrated products covering things like collaboration, unit tests, versioning and source control and typically this is focused on product (code which could be a website) development which can and often does include editing of assets such as labels, (website) content and binary objects.

Comparing the two
The two overlap in two areas: version control and pipeline management. Both paradigms focus on rapid delivery of assets and the two are only comparable if we are talking about the delivery of a website. A CMS is no good for supporting delivery of a desktop application. Whilst many CMSs support the delivery of code and have a web application, SDL Web does not mandate such a thing and you can develop any application you would like with varying degrees of code in the CMS itself. Currently, the recommended practice of SDL is not to include code in the CMS, but to develop a separate application and have SDL Web deliver content, which it can do in any form you need (e.g. JSON or XHTML). Continuous Delivery specializes in the delivery of code and assets.

In Continuous Delivery, you can enter content assets into a versioning system (e.g. Git) and include them in your build, which is eventually deployed. Content can be edited with a suitable IDE in a semi-technical form. I do not want to say completely non-technical because an IDE is typically still a technical tool. CMS systems tend to focus on empowerment of non-technical users, and organizations that use SDL Web have non-technical marketing users editing content either via forms or using tools like Experience Manager. Content that is entered into the versioning system can then be pushed through the delivery pipeline into the deployment together with all the application code. This has an advantage in agility of deployment because the content will always be delivered by the deployment and the content is as available as the application. Where SDL Web sometimes has challenges is that you need to have a single, scaled and redundant source of content for the web application. This means that source always needs to be there for the web application to work. However, separating the two pipelines of code and content using CD and WCM means that you can make minute-by-minute changes to the website and not require the application to be redeployed. If you want to separate your web application from your CMS, then content can be delivered through content-as-a-service.

Every web application needs content, so if you do not have a CMS then you will need to deliver your content through CD. CD will provide enough features to edit and manage content, provided you have the right people and you allow the right speed of updates in the form of multiple deployments per day. What you lose by not having a CMS are the features that the CMS would bring, such as content inheritance, translation, inline editing and per-minute updates of content. For a simple site (i.e. a micro-site), having a full enterprise CMS is perhaps overkill, especially if you do not already have a CMS. If you do, reusing content and content editing processes from the existing CMS is a considerable plus.

If your website is larger in content terms than a simple site and is really multiple sites in multiple languages with a high amount of content reuse, then using CMS and CD together seems to be the ideal solution. You can manage all your content for all your channels (including campaigning) through one tool and develop awesome apps in record time with CD. One is not a threat to the other.

Going forward I would make a recommendation that your deployments are done in a microservice architecture and in that, your CMS content should be delivered as a service (along with all the other things like targeting). This means all deployed sites take advantage of content that is centrally managed, application deployments are not weighed down by large volumes of content assets and CMS features like content targeting are uniformly deployed on all channels.

Photo credit: Ian Brown (Flickr)