For the past two years, I have been managing a team in charge of the data center infrastructure. It is a medium-sized team in charge of the HPC clusters and bare-metal servers, storage, network, and also virtualization and container platforms, such as VMware, OpenStack, and Kubernetes, plus some central services.
This team has a dual mandate: keep daily operations running smoothly and carry out innovative and challenging engineering projects. As the person responsible for resource management and for keeping the team’s project schedule on track, I find that one important challenge is striking the right balance between these two responsibilities. This requires a strategic approach to resource allocation and time management. Let me give you my take on the balance between operations and projects in IT infrastructure teams.
I have managed various network teams in the past, which were also in charge of both operations and engineering projects. The difference in this organization is that we are an HPC research center, so we must stay at the cutting edge of technology. Therefore, we have many ambitious and very innovative IT projects compared to a standard business company, which in most cases aims more for stability than innovation on the IT side.
Understanding the context: Operations vs. Projects
Operational tasks include all day-to-day activities essential for the continuous and efficient functioning of our data center environment. These include:
- Monitoring and maintaining servers, networks, storage, platforms, and related hardware.
- Performing regular maintenance, such as updates, patches, backups, and security checks.
- Troubleshooting and resolving any ongoing issues with infrastructure or services.
- Supporting end-users (internal and external).
- Resolving urgent incidents, such as hardware failures, network outages, or similar.
- Dealing with system performance issues that need immediate attention.
- Working on root cause analysis for recurring problems and finding long-term solutions.
- Improving and automating our operational processes to simplify and reduce these tasks.
On the other side, project-based engineering tasks are finite initiatives aimed at achieving specific objectives, such as:
- Deployment of new technologies, clusters, platforms, or even sites.
- Implementation or integration of new services.
- Evaluations and tests (PoC) of future new technologies or services.
On top of that, there are also the non-technical tasks that we all do: emails, Slack, meetings, administrative tasks, non-essential interruptions, etc.
Allocating Time and Resources
Determining the optimal division of time between operations and projects is key. While specific percentages can vary based on organizational needs and industry standards, a common challenge arises when operational demands consume a disproportionate share of resources, leaving limited capacity for projects, or vice versa.
This imbalance cuts both ways. When operations take over, innovation and strategic progress slow down for the organization. When projects take over, the team does not have enough time for operational tasks, which creates a technological gap and probably security issues as well.
Another important point to consider is that in many organizations, management tends to add more and more projects to the pile, far beyond the team’s workload capacity. The role of the team leader is precisely to prevent this from happening, and for that, there are some requirements and various solutions, which we will see below. But if, for any reason, this is not done properly, for example under strong pressure from management or the stakeholders, the team’s natural reaction will be to reduce the time spent on operational tasks in order to cope with the project workload and deliver the projects on time. Here, the role of the team leader is very important to prevent this behavior. Otherwise, technical debt will build up quickly.
Requirements for an effective leader to find a good balance
The leader’s first requirement is to know the capacity of his team. This can be done with different methods and data, used in combination to get a better estimate. The first and most important, in my opinion, is experience: over time, a leader gets to know his team well and must know what they are capable of. He must also be able to gauge the mood of each member of the team, which also gives a good indication of their workload and capacity. There are other techniques used in agile methodology, such as assigning story points to each task and then analyzing and monitoring the story points completed over previous periods. This can then give a fairly accurate idea of the team’s capacity.
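As an illustration of the story-point approach, here is a minimal sketch in Python. It assumes the team records the story points completed in each past period; the function names, the three-period window, and the 40% share reserved for operations are hypothetical values chosen for the example, not figures from my team.

```python
from statistics import mean


def estimate_capacity(completed_points, window=3):
    """Average the story points completed over the most recent periods."""
    if not completed_points:
        raise ValueError("at least one past period is required")
    return mean(completed_points[-window:])


def project_budget(capacity, ops_share):
    """Points left for project work once a share is reserved for operations."""
    if not 0 <= ops_share <= 1:
        raise ValueError("ops_share must be between 0 and 1")
    return capacity * (1 - ops_share)


# Hypothetical history: points completed in the last five periods.
history = [34, 29, 38, 31, 33]
capacity = estimate_capacity(history)
budget = project_budget(capacity, ops_share=0.4)  # assumed 40% for operations
print(f"estimated capacity: {capacity}, available for projects: {budget:.1f}")
```

Such a number is only a starting point. It complements experience and the leader's reading of the team; it does not replace them.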
Then, a leader must be able to prioritize and coordinate the work within the team. If the priorities of the projects are not clear or well-defined by the stakeholders or management, the team leader or project manager must clarify this. The leader should also do regular checks to correctly determine the current status, the priorities, and the dependencies of the various projects within the team. Certain priorities can also change quickly, so it is important to be able to adapt and adjust the team’s priorities as necessary.
The leader must also be able to push back on certain tasks and to negotiate to have some projects or tasks carried out later. With more experience and a good internal network, an effective leader should also be able to anticipate some requests. By talking with key people in the different departments of the organization, the leader will often hear about projects in the pipeline before they are formally requested.
Strategies and Techniques for Effective Management
So, as the leader of a team in charge of infrastructure, what is the best way to effectively balance operational and engineering tasks?
Resource Segmentation
One approach is to designate specific members of the team, or sub-groups, exclusively for operational tasks and others for engineering project work. This clear demarcation ensures that project timelines are met without compromising operational stability. But in general, this is only possible with a fairly large team. In addition, with such a strict split, the team’s full capacity is not necessarily used all the time. But it is still a good method.
If we go in that direction, there is an additional option, which I have seen in some large companies and among some Internet service providers: rotating people between the operational and the engineering project teams regularly. Spending time on the operational side is important to stay in touch with customer needs and issues, and rotating the engineers helps avoid silos or different ways of working between the two groups. This also motivates the engineers working on projects to properly document their work, as they may find themselves having to debug it later, once in operation.
Continuous Monitoring and Adjustment
If it is not possible to divide the work into different subgroups, it is essential to regularly assess the distribution of resources and time between operations and projects. To do this, key performance indicators (KPIs) can provide valuable information on efficiency and highlight areas requiring adjustment. In addition, as mentioned above, clear prioritization of tasks is essential.
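One simple KPI of this kind is the share of logged time that goes to each type of work. The sketch below assumes hours are tracked per category; the category names, the logged hours, and the target bands are all hypothetical and would need to be set for each organization.

```python
def time_shares(hours_by_category):
    """Convert logged hours per category into fractions of the total."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return {name: hours / total for name, hours in hours_by_category.items()}


def out_of_band(shares, targets):
    """Return the categories whose share is outside its (low, high) target band."""
    flagged = {}
    for name, (low, high) in targets.items():
        share = shares.get(name, 0.0)
        if not low <= share <= high:
            flagged[name] = share
    return flagged


# Hypothetical hours logged by the team over one month.
logged = {"operations": 130, "projects": 50, "other": 20}
# Hypothetical target bands; the right values depend on the organization.
targets = {"operations": (0.40, 0.60), "projects": (0.30, 0.50)}

shares = time_shares(logged)
flagged = out_of_band(shares, targets)
for name, share in flagged.items():
    print(f"{name}: {share:.0%} of logged time is outside the target band")
```

Reviewed at a regular interval, a drift in these shares is an early sign that operations are crowding out projects, or the reverse.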
Automation of operational tools and tasks
In addition to the solutions presented above, the implementation of automation tools can considerably reduce the manual effort required for regular operational tasks. By automating repetitive processes, IT teams can free up time and resources, allowing them to focus more on strategic projects.
The addition of automation, or AIOps to be trendy, certainly helps to reduce the working time for repetitive tasks. However, the time needed to deploy and maintain the automation tools must be added to the list of operations above, and it should not be neglected when deciding whether to automate a process or not.
In some cases, manual operations are beneficial
Yes, in some cases, manual operations are beneficial. For example, human judgment allows for nuanced decision-making in complex situations. It also allows quick adaptation to unexpected changes or anomalies, and it facilitates personalized responses to unique problems.
Another strong point in favor of manual or semi-manual operational tasks, in my opinion, is learning: we enhance the team’s knowledge and skills by doing some tasks manually. Take network configuration, for example. If a tool generates and applies a configuration automatically, how can we debug it afterwards if we only have a vague idea of what has been deployed? Ideally, the same tool should be able to detect and troubleshoot a problem. But what if it can’t? This is where learning by doing things manually brings a big advantage.
Downsides of Manual Operations
In other cases, it is preferable to automate certain tasks, for many reasons. For example, automation reduces potential human errors. It also saves time on repetitive tasks, which otherwise raise the risk of burnout and operational delays. Manual processes also struggle to scale with growing demand, and it is difficult to maintain consistency in larger-scale operations. Manual tasks can also create bottlenecks during peak workload periods.
In addition, it is more difficult to monitor the performance and reliability of manual processes. They usually lack the consistent logging and reporting needed for in-depth analysis, which reduces the possibilities for proactive improvement and optimization.
What should be automated?
Define which tasks can potentially be automated.
If we intend to automate some tasks to facilitate operational work, we first need to define which tasks can potentially be automated. To do so, we start by monitoring processes and activities to identify repetitive tasks: frequent manual interventions that systematically require the same steps.
Then, we must identify the activities with low added value. The objective is to distinguish the recurring tasks that bring added value from those that prevent a focus on innovative projects.
To help, we can also use metrics, for example by collecting and analyzing data from task-tracking tools, tickets, logs, or any other source related to operational work. With this kind of information, we can identify recurring operational challenges and detect signs of excessive manual work.
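As a rough sketch of this kind of analysis, the Python snippet below counts how often each category of ticket occurs and how many hours it consumed. The ticket structure (a `category` and an `hours` field), the sample data, and the threshold of three occurrences are assumptions for the example; a real ticketing tool will export its own fields.

```python
from collections import Counter


def recurring_tasks(tickets, min_count=3):
    """List (category, occurrences, total hours) for categories seen often.

    Results are sorted by total hours so the costliest work comes first.
    """
    counts = Counter(ticket["category"] for ticket in tickets)
    hours = Counter()
    for ticket in tickets:
        hours[ticket["category"]] += ticket["hours"]
    rows = [
        (category, count, hours[category])
        for category, count in counts.items()
        if count >= min_count
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical export from a ticketing tool.
tickets = [
    {"category": "user account reset", "hours": 0.5},
    {"category": "disk replacement", "hours": 2.0},
    {"category": "user account reset", "hours": 0.5},
    {"category": "firmware update", "hours": 4.0},
    {"category": "user account reset", "hours": 0.5},
    {"category": "disk replacement", "hours": 1.5},
    {"category": "disk replacement", "hours": 2.5},
]

candidates = recurring_tasks(tickets)
for category, count, total_hours in candidates:
    print(f"{category}: {count} tickets, {total_hours} hours")
```

The output is a list of candidates to discuss with the team, not a decision about what to automate.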
Collecting the feedback from team members may also help to discover hidden sources of repetitive or difficult work that we don’t necessarily see.
Then, assess whether a manual task should be automated or not.
If a process is carried out three times a year, we can assume that the efforts made to automate it and maintain the code are probably not cost- or time-effective. But maybe this task, carried out only three times a year, takes several days of work for several team members, and the human errors associated with it have a strong impact on the quality of an infrastructure or service.
It is therefore necessary to make a good cost-benefit analysis to determine whether a task should be automated or not. The evaluation criteria include the frequency of the manual task, the time it needs, and how repetitive it is. The time and resources needed to build and maintain the automated task, script, or code are also important. We can then compare the manual time and effort with the potential gains from automation, including the reduction in human errors and in their impact on the system or infrastructure.
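The comparison can be reduced to a simple break-even estimate. The sketch below is a deliberately simplified model with hypothetical numbers: it counts everything in hours, treats the cost of human errors as a yearly number of rework hours, and ignores the time still needed to run the automated task.

```python
def yearly_manual_cost(runs_per_year, hours_per_run, people, error_hours_per_year=0.0):
    """Hours spent per year doing the task by hand, plus rework caused by errors."""
    return runs_per_year * hours_per_run * people + error_hours_per_year


def break_even_years(manual_per_year, build_hours, maintenance_per_year):
    """Years until the build effort is repaid, or None if it never is."""
    saving_per_year = manual_per_year - maintenance_per_year
    if saving_per_year <= 0:
        return None
    return build_hours / saving_per_year


# Hypothetical heavy task: 3 runs a year, 2 days (16 hours) each, 3 people,
# plus an assumed 24 hours a year of rework caused by human errors.
heavy = yearly_manual_cost(3, 16, 3, error_hours_per_year=24)
heavy_break_even = break_even_years(heavy, build_hours=120, maintenance_per_year=48)

# Hypothetical light task: 3 runs a year, 1 hour each, 1 person.
light = yearly_manual_cost(3, 1, 1)
light_break_even = break_even_years(light, build_hours=40, maintenance_per_year=8)

print(f"heavy task: {heavy} hours a year, break-even after {heavy_break_even} years")
print(f"light task: {light} hours a year, break-even: {light_break_even}")
```

With these assumed figures, both tasks run only three times a year, yet only the heavy one repays the automation effort. That is why frequency alone is not a sufficient criterion.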
Finally, by gradually automating repetitive operational tasks, we can free up the team’s time for projects while keeping a buffer in case of unplanned incidents.
Conclusions
It is essential for the “dual role” teams in charge of IT infrastructure to achieve a harmonious balance between operational tasks and project initiatives to meet the infrastructure’s immediate needs and promote its future evolution.
The role of the leader is essential; by strategically managing resources, adopting automation where necessary, and maintaining flexibility and agility, the team can effectively manage the complexity of its dual role, ensuring both reliability and innovation.
Featured image: Photo by Mango Matter on Unsplash