Scaling Engineering Teams & the Rise of Platform Engineering Squads
Note: If you are already familiar with the Spotify Agile model, you can skip the whole “Introduction” section.
Introduction
The Spotify model is a people-driven, autonomous approach to scaling agile that emphasizes the importance of culture and network, and it also helps you scale up your engineering teams. It has helped Spotify and other organizations increase innovation and productivity by focusing on purpose, autonomy, mastery, communication, accountability, and quality. As Spotify coach Henrik Kniberg noted, the Spotify model isn’t a framework, since it represents Spotify’s view on scaling from both a technical and a cultural perspective. It is one example of how to organize multiple teams in a product development organization, and it stresses the need for culture and networks.
…the Spotify model focuses on how we structure an organization to enable agility.
The Spotify model was first introduced to the world in 2012, when Henrik Kniberg and Anders Ivarsson published the whitepaper Scaling Agile @ Spotify, which described the radically simple way Spotify approached agility. Since then, the Spotify model has generated a lot of buzz and become popular in the agile transformation space. Part of its appeal is that it focuses on organizing around work rather than following a specific set of practices. In traditional scaling frameworks, specific practices (e.g. daily stand-ups) are how the framework is executed, whereas the Spotify model focuses on how businesses can structure an organization to enable agility.
The Spotify model champions team autonomy, so each team (or Squad) selects its own framework (e.g. Scrum, Kanban, Scrumban, etc.). Squads are organized into Tribes and Guilds to help keep people aligned and cross-pollinate knowledge. Spotify has little standardization and no formal standard; they believe that cross-pollination is better than standardization. For example, when enough Squads use a specific tool, that tool becomes the path of least resistance and other Squads tend to choose it too. Once those Squads have used the tool, tested it, and collaborated around it, it becomes a de facto standard.
Consistency x Flexibility
Spotify employs an internal open-source model, and its culture favors sharing over owning. Based on mutual respect and little ego, Spotify uses peer code review: anyone can contribute code at any time, and a peer then reviews the code and makes adjustments. Everybody collaborates and spreads the knowledge! They also have a culture that focuses on motivation, which has helped them build a very good reputation as a workplace.
What is the gap?
Microservices architecture and the tooling that comes with it (container orchestration, API gateways, Infrastructure as Code, shared identity providers, etc.) have drastically increased the complexity of deploying even a small application in many large organizations that are still navigating the switch from on-premise to cloud-native solutions. This has led to the rise of “Platform Engineering” and Platform Engineering squads. You can read about our transformation in detail here (and yes, we are still hiring top talent).
Sample outputs of the Platform Engineering squads include:
- Code / Templates
- Architectural Decisions
- Git flows and GitOps
- Reusable components across all tribes, such as exception handling, logging, iOS/Android/web components, backend utilities & services, etc.
- Accelerators such as Maven archetypes
- Leading practices (unit testing, code coverage, etc.)
- Reference Architecture
- Reference Implementation
- Developer Playbook
Build Squads leverage all of these artifacts to hit the ground running. Given that the primary consumers are developers, it makes sense to create them in a tool/location like a Source Code Repository that developers visit frequently. Kyle Gene Brown covers this in his recent article on GitArchitecture (a better way to capture Architectural decisions).
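To make the idea of a reusable component a little more concrete, here is a minimal sketch of what a shared exception-handling and logging helper could look like. Everything in it is an assumption made for illustration: the names `PlatformError`, `handle_errors`, and the logger name `platform.shared` are hypothetical and do not come from any real squad's codebase.

```python
import functools
import logging

# Hypothetical logger name shared by every squad that uses this helper.
logger = logging.getLogger("platform.shared")


class PlatformError(Exception):
    """Hypothetical common error type, so all services fail in the same shape."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


def handle_errors(func):
    """Log unexpected failures in one format, then re-raise as PlatformError."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except PlatformError:
            raise  # already in the shared format, pass it through untouched
        except Exception as exc:
            logger.error("operation=%s error=%s", func.__name__, exc)
            raise PlatformError("UNEXPECTED", str(exc)) from exc

    return wrapper
```

A Build Squad would then only decorate its own functions with `@handle_errors` instead of writing its own try/except and log formatting, which is the kind of duplication a shared component is meant to remove.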
What is The Platform?
The Platform is an abstraction layer that hides the underlying complexity of operating the software and infrastructure layers. It takes care of the details of infrastructure operations, service orchestration, CI/CD, maintainability, innovation and improvement, and the monitoring of all these components.
It is fair to treat “The Platform” as an internal product whose stakeholders are the technology teams that build on top of it the software and applications that power the business and help it thrive in an ever-changing technology landscape.
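One way to picture this abstraction layer is as a facade: the developer makes a single call, and the platform runs the infrastructure, CI/CD, and monitoring steps behind it. The sketch below is only an illustration under that assumption. The `Platform` class, its method names, and the step labels are hypothetical, and the steps just record what they would do rather than talk to any real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Platform:
    """Hypothetical facade: one call for the developer, several steps underneath."""

    steps_run: List[str] = field(default_factory=list)

    # The private methods stand in for work the platform hides from its users.
    def _provision_infrastructure(self, service: str) -> None:
        self.steps_run.append(f"provision:{service}")

    def _run_pipeline(self, service: str) -> None:
        self.steps_run.append(f"ci-cd:{service}")

    def _register_monitoring(self, service: str) -> None:
        self.steps_run.append(f"monitoring:{service}")

    def deploy(self, service: str) -> List[str]:
        """The only method a Build Squad needs to know about."""
        start = len(self.steps_run)
        self._provision_infrastructure(service)
        self._run_pipeline(service)
        self._register_monitoring(service)
        # Return just the steps performed for this deployment.
        return self.steps_run[start:]


platform = Platform()
print(platform.deploy("payments"))
# ['provision:payments', 'ci-cd:payments', 'monitoring:payments']
```

The point of the design is that the platform team can change what happens inside `deploy` without the consuming teams having to change their own code.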
What are The Platform components?
The Platform can be a multi-layered entity where each layer has its own responsibilities and clearly defined boundaries. I define these components as follows:
How does it fit into the Spotify Agile model?
After reading this article from IBM Garage, I noticed that they have something similar. The Architecture Squad comprises architects who are on the transformation program full time. The squad also has a Chief Architect. You could have two Chief Architects if you are pairing, which I have done a few times; it works well for changing behavior. Every architect is assigned to 1–2 Build Squads once these are stood up, and provides the architecture support those Build Squads require.
As a transformation program starts up, reference architecture and a reference implementation need to be established and built. The architecture squad, just like every other squad, has a backlog with prioritized stories that they work from. The architecture stories are focused on the reference architecture/implementation. The architects work on the design of these stories and pass them to the Platform Engineering Squad.
The Platform Engineering Squad is a Build Squad that focuses on building out the reference architecture and the reference implementation that the Build Squads focused on building products will leverage. The Architecture Squad plays the role of Product Owner for the Platform Engineering Squad.
The initial set of architecture stories is created by the architects. However, once the Build Squads are operational, they can submit architecture stories to the Architecture Squad’s backlog for consideration. If a story is cross-cutting, strategic, or likely to occur frequently, the Architecture Squad will prioritize it. If not, the Architecture Squad turns it back to the Build Squad that submitted it and leaves the implementation to them.
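The triage rule for submitted stories can be written down as a small decision function. This is a sketch under stated assumptions, not a description of any real backlog tool: the `ArchitectureStory` type, its field names, and the example story are hypothetical, and the three flags simply mirror the three criteria in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureStory:
    """Hypothetical story record submitted to the Architecture Squad's backlog."""

    title: str
    submitted_by: str          # name of the Build Squad that raised the story
    cross_cutting: bool = False
    strategic: bool = False
    recurring: bool = False    # "will occur frequently"


def triage(story: ArchitectureStory) -> str:
    """Return the squad that should own the story under the rule above."""
    if story.cross_cutting or story.strategic or story.recurring:
        return "Architecture Squad"
    # None of the criteria apply, so the story goes back to its submitter.
    return story.submitted_by


shared = ArchitectureStory("Shared retry policy", "Payments Squad", cross_cutting=True)
local = ArchitectureStory("Rename internal DTO", "Payments Squad")
print(triage(shared))  # Architecture Squad
print(triage(local))   # Payments Squad
```

Making the criteria explicit fields also gives the joint grooming session something concrete to discuss when a squad disagrees with where its story landed.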
The architecture backlog is jointly groomed and prioritized by the Chief Architects, the Squad Leads of all the Build Squads, and all the Product Owners. I have extended that a bit and created the following visualization:
Platform Engineering Principles
- Automation. People are prone to errors. Automation within the platform lets us be more confident when executing a piece of code, because every run follows the same steps. It also helps us isolate bugs and errors within the code and then deploy continuously.
- Efficiency. Building a highly efficient platform is important because it allows us to move faster. That means fixing bugs quickly and building features only when they are needed. Reusing code and creating reference implementations is key. This helps the wider business achieve a shorter time to market as well as a competitive advantage. Efficiency in the platform also means failing fast and fixing the failure. The platform should be as transparent as possible when showing errors, since visible errors lead to faster debugging and deployments. Efficiency lies in iterating on small features rather than shipping one big deployment.
- Self-Service. Interactive training labs and a developer portal are useful: a place where engineers can discover MVPs and reference implementations. All new engineers should build something using the platform in their first month. This can be part of the initial orientation of new hires, and it also helps us uncover issues with the self-service nature of the platform. Retraining should follow for each new part of the platform. DIY discovery within the platform is encouraged. Reinventing the wheel and running shadow Platform Engineering Squads is actively discouraged, because maintaining many implementations of the same thing is wasteful and unnecessary.
- Authority. In many ways the success or failure of a platform team lies in the decisions it makes. The platform team will have to make decisions that affect other teams, particularly while building the foundations of the platform: for example, the languages, tools, and frameworks we use.
- Advocacy. The platform team constantly advocates for engineering efficiency. The entire purpose of building the platform team is for engineers to build more with less cognitive overhead. The details that can be reused and automated usually fall to the platform team.
- Accountability. Accountability as a team is important to make sure that whenever the team makes a breaking change, the rest of the team is informed. Blameless post-mortems are a requirement so that each member feels safe to make changes. The goal is to build a better system while taking ownership of it.
- Expertise. The experience and expertise needed on a platform team depend on the structure of the company. Vendor management, for instance, is a task that can be delegated to application support teams. Generally, though, here is the expertise you will need within your platform team:
- container orchestration and containerization
- cloud management
- vendor management
- pipeline management
- DNS and CDN configuration
- server configuration
- Git and SCM
- productionization
- observability (logging, monitoring, tracing)
- operationalization (runbooks and support escalation, post-mortems, alerting)
- soft skills and people management
- software-defined infrastructure (infrastructure as code)
- collaboration with other teams and negotiation with the management
- common workflow and architecture management
- security
- developer training and teaching
- documentation development