Whether you are building a data warehouse, a client reporting system, a MDM system or a trade booking system you need to run a development team. Development Operations are things that you need to run a software development team. What do you need to run a development team? They are:
1. Release pipeline
2. Change control
3. Sprint planning
4. Automated testing
5. Code repository
6. Server build
In writing this article I will try to combine all my experiences from about 10 companies in the last 12 years. Some of them do best practices, some do bad practices. I will only write about the good ones here.
1. Release Pipeline
The first thing that you need as a development team is a release pipeline. A release pipeline is a process to automatically build, deploy and test the code to a series of environments, i.e. development, test, UAT, production and support. A release is a collection of code such as stored procedures, SSIS packages, .Net code, views, tables, etc. bundled into 1 package to be deployed together as 1 unit.
Can’t you deploy it manually by copying the code? Of course you can. But it will take more effort. And it is prone to error. What if you want to rollback to previous version? Well you have to do it manually right, by copying the previous version of each component. Who is keeping track of the version for each component? What if you missed 1 component? You could be deploying 50 components in test but only 49 in production. It’s a headache right? With a release pipeline, you can deploy all components into many environments consistently and effortlessly. You deploy into Test environment, and test it here. You deploy into UAT environment, and test it here. And you deploy the same thing into Production. And then you deploy the same thing into Support environment. The environments are guaranteed to be consistent.
If you have 100 stored procedures, 100 tables, 100 views, 100 SSIS packages, a view SSAS cubes, 50 SSRS reports and hundreds of .Net codes, then you will appreciate the benefit of having an automated release pipeline. It is a huge time saver. And it improves your quality.
Deployment into an environment requires approvals. A deployment into the test environment for example, may require the test manager approval. Whereas to deploy a release into production requires an approval from the production support manager.
2. Change Control
The second thing that you need to run a development team is change control. Change control is a process of approving every single change going into production environment. It can be a code change (like a release). It can be a data change (like an UPDATE or DELETE statement). It can be a configuration change (like moving the data folder from C drive to E drive). It can be an infrastructure change (like setting up a new SQL server).
Change control is very important because:
– It shows that the change has been tested properly in Test environment, before it is deployed to production.
– It shows that the development manager and the production support manager are aware about the change.
– When the company is audited you have evidence that every single change in production is approved and tested.
– You can choose which change will be deployed at the same time in production, and which ones are not at the same time to minimise the risk
– It shows the sign off from the business, agreeing to the change.
– When the production deployment failed, it is recorded properly including the reason and the solution.
– It shows that every change in production is controlled, so the production systems are stable and reliable.
A change request is an electronic form which specifies:
– The reason for the change (usually a business reason but can be a technical reason)
– The list of components and data to be changed
– The evidence that the change has been deployed to Test/UAT environment the same way it will be deployed into production
– Link to the design specification or requirement which forms the basis of the change
– The evidence that the change satisfies the design specification or requirement (or if it is a bug fix, the evidence that the change fixes the bug)
– The impact of this change to the business users
– The approval from the business, from the development manager, and from the production support
– The updated documentation for the production support
– How the change will be deployed to production and to the DR system
– How the change will be tested that it works in production
– How the change will be rolled back if it doesn’t work in production
– The date the change is intended to be deployed to production
– The developer who did the change
– Who from the development team will be on support on the day the change is deployed to production
– Which teams are required to deploy the change
– The applications which are impacted, including downstream applications
– The severity of impact of the change, i.e. high, medium, low (the risk)
– The scale of the change, i.e. major change, minor change or medium (how many components)
– Whether it is an emergency change or not
The change request form is not just for application changes, but also for infrastructure changes such as servers and networks. It is for any changes impacting any production system, including codes, files, databases, servers, configurations, firewall, connections, permissions, scheduling.
Major change requests are discussed in a weekly meeting between the developers and the production support team to understand:
– What changes are going in this week
– Whether there are any conflicts between the changes which are going in this week
– Where those changes are in terms of their positions in the daily batch
– If things do go wrong, how long it will take to recover (or to fix)
– Who is required to standby to fix if anything goes wrong (DBA, infrastructure, developer)
– When those changes should be done from production point of view (to minimise the impact).
– To scrutinise the test evidence, not whether the change satisfies the requirements, but whether the change has been deployed in the Test system the same way as it would be deployed in production. In particular, whether the condition of the Test environment is the same as production. Also, whether the impacted application has been tested in Test environment.
– To check that the rollback script really covers the whole change, not just part of it. And whether it has been tested.
– Whether you need to communicate the change to the users or not, and whether the help desk should be prepared to field calls from the users on the deployment day.
The changes are then either approved or rejected. If it is approved, it is given a date and time when it will be deployed. Naturally, the business and the development team want the change to be deployed as soon as possible, but it is the responsibility of the production support team to decide the timing, to minimise the risk. Changes with major impact (or high risk) are usually deployed on Saturday when the business users are not working so they are not impacted and you have Sunday to recover or fix any issues. Changes with medium impact (and medium risk) can be deployed on Friday, so you have Saturday for fixing. Changes with low impact and low risk can be deployed earlier in the week (Wed or Thu but not Mon or Tue).
If there are major changes going in this week end, it is wise to defer deploying other major changes to next week end. Similarly if the required support resources is not available this week end (if it goes wrong) then the change should be postpone to next week.
Some changes requires coordination with an external vendor. So the timing of production deployment must be mutually agreed. Some changes are time-bound, i.e. they have to be done before certain date otherwise the company will get a penalty from the regulator.
Some changes need to be done immediately, for example because of security risk, or risk of losing a great deal of money. This is classified as an emergency change and it is usually a bug which needs to be fixed immediately. The procedure for emergency change request is different to a normal change request. It still requires approval, it still needs to be tested, but a lot more streamlined. In most cases the principle of applying emergency changes is “fix after”, meaning if things go wrong you do not roll the change back, but you fix it. This requires more people to standby compared to a normal change. This “fix after” principle allows the testing to be light weight.
3. Sprint Planning
The third thing that you need to run a development team is sprint planning. A sprint is a unit of two weeks. In theory it is possible to define a sprint as 1 or 3 weeks, but I’ve been working at 4 companies which do sprint, and every single one does 2 weeks.
Yes Sprint is a way of running Agile. No one does Waterfall any more these days. Funny how things moved, 5 years ago people still debating Agile vs Waterfall. But not now. No more debate on that. The debate is more around CI/CD atau automated testing.
Azure DevOps is probably the best tool in the market, followed by Jenkins, Travis. Many are not a complete DevOps tool. GitHub, Grade, JIRA, Bamboo, AWS CodePipeline, Trello for example. They just doing build and source code control but without CI/CD, or vice versa. Or doing both but not sprint planning (Trello do sprint planning but no release pipeline, whereas AWS CodePipeline is the opposite). Even Azure DevOps is not complete (it doesn’t do change control), but it does sprint planning, source code control, release pipeline CI/CD and automated testing.
Every two weeks you plan the works that you need to do in the next sprint. These pieces of work are called user stories. First you define the team capacity, i.e the number of working days minus the holidays. You then allocate how many days is required for each story, who will be doing it, and the acceptance criteria (the definition of done), i.e. just doing the development and test, or until production. Then you check if anyone is under or over allocated. If someone is over capacity, distribute or reduce the work.
After planning, during the 2 weeks sprint run you track the progress of each story. For each story you create a few tasks with an estimate and the person doing it. As each task gets done, you mark the actual hours against the estimate. The status of each task changes from new to in-progress to completed. This will create a burndown chart, which tracks the whole team’s actual hours against estimate.
It’s all very well and good doing planning and tracking as above, but how do you determine what needs to be done in the first place? This is where the product owner comes in. The list of work in each sprint should be driven largely from the prioritised list of projects + product increment. If the team is a dedicated project team, then the list of work comes from just 1 project + the BAU of that project. Some teams serve multiple projects (size of 3 months to 1 year) plus hundreds enhancement requests (size of 2 to 10 days) throughout the year.
4. Automated Testing
The fourth thing that you need to run a development team is automated testing. Many people leave this to the back of the queue, but the reality is that this thing saves a lot of time (and money), and increases the quality of the code enormously. Automated testing means two things:
a) Ensuring that deployment to production does not break anything
b) The code satisfies all the intended functionalities, and none of the unintended ones
Point a) is achieved by copying the production system into the test environment. The lot, i.e. thousands of components, code, settings, databases and files from production is restored into the test environment overnight. Then you apply the latest code and data changes from the release pipeline into this test environment. Then you run the whole system, and check the expected results. The key thing is: all of these are conducted programmatically, every day, without any human intervention.
Point b) is achieved by having a test environment which is completely empty, i.e. no data. You have the databases, but there are no rows in any tables. You have the data folders but there are no data files in them. Then you test 1 piece of code, e.g. a .Net web service or a stored procedure. You do this by setting up test data, e.g. you insert a few rows into the input tables. Each row reflect one specific test scenario. Then you run the code and check the result. Then you repeat it for all pieces of code in the system, thousands of them. For 1 piece of code you need to run some positive tests and some negative tests. The key thing is: all of these are conducted programmatically, every day, without any human intervention.
If you have to do all that manually, it would requires thousands of hours. Actually it would be impossible to do because for every release you will have to repeat both tests. If you don’t have automated testing, you can’t deploy to production 3x a week. The best you can do is once a quarter, because of the amount of testing involved. It does not make sense to do 10 hours of development and then spend 2000 hours of testing right? But once your hours of testing is down to zero, then you can do a 10 hours development, test it and deploy to production. And then you can do another 10 hours of development, and so on.
Because for every single change you run “the whole lot of testing” twice (the a and the b above), you can release very often, and you increase the quality of the code. You do positive testing and negative testing on every component, and also integration testing. The a) above is called “system testing”, and the b) above is called “unit testing”. Both are automated.
Does it mean you do not need to do manual testing? No, you still need to. When a major feature is developed, you need to do manually test the functionalities across the whole piece. Suppose you are adding a new feature in an order processing system which enable the system to combine orders from different customers and process them together. There are 9 places in the system which need code changes. Each of these 9 pieces has been unit tested individually using the b) above, which proves that each piece works individually (using minimal amount of data, i.e. a few rows). The whole system has been system tested using the a) above, which proves that there is no negative impact to the system as a whole.
But how do you know that the ability to combine orders really works end-to-end, using a normal production data volume? The automated unit test only uses a few rows, and they are made up data. It does not prove that end to end the new feature is working. To prove that the new feature is working we need to run thousands of real orders from last month across the system, combining hundreds of them, process them and check the output. Are all order processed correctly? Are there any issues? If it passes this test we can say hand on heart that the system now has capability of combining orders.
5. Code Repository
This is a must in any development team. You can use GitHub, TFS, SVN, PVCS, BitBucket, or others, but you need to have the capability of storing code, versioning, master and feature branches, committing changes, packaging and releasing code, merging, collaborating, restoring to previous version, doing code reviews and pull requests.
Whether it is Java code, Python, databases, ETL tool, cubes or reports, they are code. And they ALL need to be in ONE code repository. And the code repository MUST be integrated with the release pipeline. You need to be able to push a release candidate to the release pipeline, test it and deploy it automatically into various environment including test, support and production.
6. Server Build
Like automated testing, this one is frequently put at the back of the queue, but it really is a time saver and money saver. So we need to do it. Server Build is the ability to create a server programmatically. A release pipeline consists of a series of environments such as Dev, Test, UAT, Prod and Support. Each environment consist of several servers, like SQL Server, application server, SSIS server, SSAS server, Python server, etc.
These servers need to be built. If you build them manually, it would take a lot of hours. Not just once, but they will need to be configured, patched and maintained. And when they don’t work, you need to troubleshoot it and fix it.
A better way to create servers in Azure is using PowerShell DSC. For each environment you need to create a DSC configuration which tells PowerShell DSC the server name, resource group, folder paths, Windows features and any other characteristics which are specific to each environment and each server. The PowerShell DSC then reads this DSC configuration and create each server in each environment according to the settings written in this file.
The best practice is to recreate these servers on a weekly basis (except the production environment), to avoid having to upgrade and patch them.
7. BAU/support/bug fixes
8. Business requirement & design
The traditional understanding of “DevOps” as a system normally includes points 1 to 6, but not 7 to 9. But 7 to 9 are certainly required to run a development team for any system