RESEARCH
What are data centers and how do they function today?
INTRODUCTION
Data Centers at a glance
Meta operates large, state-of-the-art data centers designed to store and process vast amounts of digital information. These data centers are built to scale and can handle the computational demands of millions of users, processing and storing data from social media, cloud storage, and websites. A Meta data center is staffed by employees hired directly by Meta as well as contractors, who together form the different teams. Teams are divided by responsibility, and each receives its tasks through its own independent ticketing system.
Apart from the ticketing system, teams do have other common documentation channels, such as Q&A forums and training modules, but outdated knowledge and incomplete documentation make them unreliable. Teams instead rely on direct, instant methods of communication such as messaging to make the process faster and frictionless. However, this divided system of documentation leads to inconsistencies in the knowledge shared between individuals, teams, and data centers.
SENSE MAKING
So much to ask, in so little time
Through mixed-methods research, we unpacked the different parts of the data centers and identified the design principles important to Meta. With privacy and efficiency at the core of a data center, we narrowed down the most important scalability challenges they face today.
Some of the research methods we used
SYNTHESIS
Connecting the dots
With multiple data center visits under our belt, extensive background research, and discussions with a variety of robotics experts, we needed to begin bringing these isolated learnings together and think deeply about how they connect with one another. This was a challenging process for us because we did not know whether the insights we were generating would be relevant, given the sheer difference in scale between the data centers we were able to experience in person and Meta's hyper-scale data centers.
Up until this point, we had focused very heavily on robotics and human-robot interaction (HRI). It was something that caught our interest very early on, but untangling the mass of information we had accumulated helped us broaden our view of what we could try to do within the space of data centers.
What we learned
INSIGHT 1
Most of the knowledge in data centers today is not in formal systems, but in people’s heads.
Meta uses multiple documentation channels in tandem to keep track of issues raised, issues resolved, and guides for resolving them. Documentation acts as a guide when engineers are new to a task or unable to identify the solution.
However, these documents are often incomplete, outdated, or difficult to use, so engineers rely on informal channels such as internal messaging platforms and personal notes for assistance.
The knowledge cycle
Knowledge sharing within a data center can be broken down into three parts: capture, retrieve, and use. First, all the available knowledge needs to be captured effectively. Second, the captured knowledge needs to be retrievable so that engineers can use it immediately when required. And third, the retrieved knowledge needs to be easy to digest so that it can actually be put to use.
The system today cannot support the flow of knowledge through the three phases. Upon reflection, it becomes evident that it is not the systems that are running these data centers, but rather the humans that are bridging the knowledge gaps and compensating for technology. People are the key to the functioning of data centers today.
INSIGHT 2
The current knowledge-sharing system actually incentivizes engineers to skip documentation.
Engineers’ time is very valuable, and they do not always have the bandwidth or motivation to document efficiently and accurately. This creates a system in which the process of capturing knowledge is full of friction.
Engineers in data centers are under constant time pressure. Each engineer has multiple tickets they must resolve during their shift. These tickets all have Service Level Agreements, or SLAs, which are set amounts of time in which the task must be completed. Because engineers must navigate these competing SLAs, taking time out to formally document their work may interfere with their ability to resolve another task. Completing tasks will always take priority over documenting them.
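To make the idea of competing SLAs concrete, here is a minimal sketch of how time remaining on each ticket might be compared. The `Ticket` class, the ticket IDs, and all durations are hypothetical illustrations, not a description of Meta's ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    ticket_id: str       # hypothetical identifier
    opened_at: datetime  # when the ticket was raised
    sla: timedelta       # time allowed to complete the task

    def time_remaining(self, now: datetime) -> timedelta:
        """Time left before the SLA is breached (negative if already breached)."""
        return (self.opened_at + self.sla) - now


def by_urgency(tickets, now):
    """Order tickets so the one closest to breaching its SLA comes first."""
    return sorted(tickets, key=lambda t: t.time_remaining(now))


# Hypothetical shift with three open tickets.
shift_start = datetime(2024, 1, 1, 8, 0)
tickets = [
    Ticket("T-1", shift_start, timedelta(hours=4)),
    Ticket("T-2", shift_start - timedelta(hours=1), timedelta(hours=2)),
    Ticket("T-3", shift_start, timedelta(hours=8)),
]
now = shift_start + timedelta(minutes=30)

for ticket in by_urgency(tickets, now):
    print(ticket.ticket_id, ticket.time_remaining(now))
```

In this made-up example, T-2 has only 30 minutes left, so any time spent writing up T-1 comes directly out of the margin on T-2. That trade-off is what pushes documentation to the bottom of the queue.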
Although many formal channels for documentation exist, it is much faster to share information through an informal channel. Engineers we spoke to said their primary motivation to document was so they could use it as a reference point later on. However, they also knew that most people consult informal channels before turning to formal ones. Knowing their documentation may never be used decreases the motivation to capture that knowledge. There is also no external motivation, such as a financial incentive, to document.
INSIGHT 3
Engineers also struggle with information retrieval.
Retrieval is divided across many platforms, and searching for the one right piece of information is time-consuming. Engineers are under time pressure to complete tasks, and constantly changing documentation spread across systems means it is easier to go through human-to-human channels.
The path through the different platforms is not linear, and an engineer may spend a lot of time searching various documentation channels before finding the solution to their problem. Given the time pressure they are under, engineers choose to share knowledge through their own human-based systems instead of making the effort to parse these formal knowledge-sharing channels.
Human-to-human knowledge transfer costs Meta hundreds of millions of dollars annually. From a labor perspective, every employee-to-manager conversation has a dollar cost associated with it. Reducing knowledge transfer time would also improve server uptime. Speaking to the Meta team, we learned that one hour of downtime can contribute up to 500 thousand dollars in lost revenue, so every extra minute spent on knowledge transfer is a minute that could have been spent getting a server back online faster.
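As a rough illustration of the downtime figure above, the sketch below converts the quoted upper bound of 500 thousand dollars per hour into a per-minute cost. The function name and the 15-minute delay are hypothetical, and the result is an upper-bound estimate rather than a measured cost.

```python
# Upper-bound figure quoted by the Meta team: up to $500,000 lost per hour of downtime.
DOWNTIME_COST_PER_HOUR_USD = 500_000


def downtime_cost_usd(minutes: float,
                      cost_per_hour: float = DOWNTIME_COST_PER_HOUR_USD) -> float:
    """Upper-bound revenue lost for the given number of minutes of downtime."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return cost_per_hour * minutes / 60


# One extra minute of downtime.
print(round(downtime_cost_usd(1)))   # 8333

# Hypothetical case: a search for the right document delays a repair by 15 minutes.
print(round(downtime_cost_usd(15)))  # 125000
```

At that rate, a single extra minute of downtime can cost up to roughly 8,300 dollars, which is why even small delays in finding the right information add up quickly.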
What are the opportunities?
1
Create a culture that rewards knowledge sharing. This can include extrinsic rewards such as promotions or bonuses. It could also take advantage of the existing community-centric culture at Meta to emphasize intrinsic rewards, such as helping out your peers, self-development, or contributing to a system larger than any individual.
2
Reduce friction to make knowledge capture feel seamless. The reliance on informal channels can be largely attributed to the ease of these methods in comparison to formal channels. If formal channels felt equally effortless, engineers would be much more inclined to use them.
3
Build an integrated and digestible knowledge base. This will enable employees to quickly understand and act on the information available. Information that is easy to access saves users time and helps them find what they need to complete their tasks, whether at their desk or while working on a rack.
Our Solution
From policy changes to AR glasses to LLMs, we explored every viable technology that could be used to document better. While figuring out the technology was one part of the problem, the bigger question was: how do we change engineers' behavior to get them to document in the first place?