Practical machine learning is about much more than statistics and model architectures. Resilient and reliable infrastructure is the foundation for successfully building a performant ML pipeline. In this series of articles, I'll share my experience as part of Hasty.ai's development team, where I have been building the infrastructure for an AI-enabled image annotation tool since day one. I'll give a brief overview of how to approach a startup's application infrastructure and how to build a strategy that keeps it under control, makes reliability and expenses predictable, minimizes technical debt, and improves the development process overall. The first article (this one) addresses the most common infrastructure mistake startups make: in pursuit of growth, they prioritize shipping features over less tangible hygiene topics such as infrastructure and its surrounding domain. The second article will show how we solved this problem at Hasty.ai, and the third will discuss how these changes catalyzed the scaling of our business.
Infrastructure includes all provisioned hardware, services, and facilities needed to operate a product or application. This includes, but is not limited to:
Infrastructure is typically provisioned first and then configured, after which the application is deployed for users to access. These three steps (provisioning, configuration, and deployment) are the key points to address when planning infrastructure work. There is one intuitive yet critical difference between infrastructure and the application it supports: the application cannot exist without being written as code, whereas infrastructure can. This often (if not always) leads to the infrastructure domain being treated as the continuous consumption of an infrastructure administration service alongside the application codebase, much like a company consumes an accounting service for its operations. In early-stage startups and small projects, this pattern is extremely popular, and it is a massive mistake.
Figure 1. The business consumes a service from the infrastructure team, which does its "magic" to maintain the project infrastructure.
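To make the ordering of provisioning, configuration, and deployment concrete, here is a minimal Python sketch. The `Environment` class, its `Stage` values, and its method names are hypothetical illustrations, not part of any real tool and not Hasty.ai's actual setup. The sketch only shows that each step builds on the previous one and leaves a record as a side effect of being performed, so no step can be skipped silently.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages, in the order they have to happen."""
    PLANNED = 0
    PROVISIONED = 1
    CONFIGURED = 2
    DEPLOYED = 3


@dataclass
class Environment:
    """Hypothetical environment that tracks how far along the lifecycle it is."""
    name: str
    stage: Stage = Stage.PLANNED
    history: list[str] = field(default_factory=list)

    def _advance(self, target: Stage, note: str) -> None:
        # Only allow moving exactly one stage forward; skipping a step is an error.
        if target != self.stage + 1:
            raise RuntimeError(
                f"{self.name}: cannot go from {self.stage.name} to {target.name}"
            )
        self.stage = target
        # Every step leaves a record as a side effect of performing it.
        self.history.append(f"{target.name}: {note}")

    def provision(self, hardware: str) -> None:
        self._advance(Stage.PROVISIONED, f"allocated {hardware}")

    def configure(self, settings: dict) -> None:
        self._advance(Stage.CONFIGURED, f"applied {sorted(settings)}")

    def deploy(self, app_version: str) -> None:
        self._advance(Stage.DEPLOYED, f"released app {app_version}")


env = Environment("staging")
env.provision("2 GPU nodes")
env.configure({"replicas": 2, "region": "eu-central-1"})
env.deploy("1.4.0")
for entry in env.history:
    print(entry)
```

Calling `deploy` on an environment that has only been provisioned raises an error instead of quietly producing a half-configured system.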
The key topic here is knowledge. Consuming a service does not guarantee that the knowledge the infrastructure team works with will be stored and preserved. Strict contracts, requirements, and other employee-level enforcement don't guarantee that either: they produce a paper trail, but nothing ensures that it stays strictly consistent and up to date, which is crucial. The only way to guarantee consistency and currency is a strategy that makes it impossible to skip the step of persisting and updating the knowledge base, and that makes going through this step the easiest way to change the infrastructure. Building the development process this way is one of the biggest challenges between the business and infrastructure teams. Important information for infrastructure to gather is:
That knowledge is normally worth more than the infrastructure objects themselves. It is part of the intellectual property a company accumulates as it develops, and it adds to the company's overall value. Obviously, keeping it only in someone's memory is not an option. Relying on written documentation is also limited, because documentation:
So while documentation works well for scenarios and architectural overviews, it is not suitable for a registry of facilities, configurations, and the like. Any attempt to document these would likely waste time and produce useless results, leaving reverse engineering of the running system as the only realistic way to recover the information, which is a big shame and provides no guarantees. Another option is an SLA with an external company that guarantees to run your application and clearly outlines everything your infrastructure needs. In that case, however, product development is completely separated from infrastructure evolution. This usually means either that the engineering team has limited capabilities and is restricted in the solutions available for the challenges it faces, or that the SLA requires constant updates, because infrastructure has to evolve together with the application. Such co-evolution is natural and efficient in every respect, including performance and cost. Finally, the worst consequence of a poor infrastructure strategy is that essential knowledge about the infrastructure becomes black magic performed by a wizard. In a normal workflow, this slows down application development whenever it has to rely on specific infrastructure functionality. Worse, in the event of a disaster, this house of cards starts to collapse.
An approach that handles the growing complexity of infrastructure well (even small projects have several hundred infrastructure objects) has one key component: the knowledge later used to build the infrastructure is created by the team as part of building that infrastructure. That knowledge (essentially, the code) is the source of truth, a form of intellectual property, and an asset that adds value to the whole business. This approach is made possible by infrastructure as code and the widespread use of automation.
Figure 2. The infrastructure team produces knowledge that is turned into the actual infrastructure, gives back business value, and provides control over the infrastructure domain.
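The infrastructure-as-code idea can be sketched in a few lines of Python, under stated assumptions: the desired infrastructure is declared as data (the knowledge), and a `plan` function compares that declaration with what is actually running to compute the required changes. `Resource`, `plan`, and the resource names are hypothetical. Tools such as Terraform follow a similar plan-then-apply workflow with far more machinery (state tracking, dependency graphs, provider plugins); this is not their API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One declared infrastructure object (hypothetical, minimal schema)."""
    kind: str
    name: str
    settings: tuple = ()  # (key, value) pairs; a tuple keeps the object immutable


def plan(desired: list[Resource], actual: list[Resource]) -> list[str]:
    """List the changes that would make `actual` match `desired`.

    `desired` is the source of truth: objects that are running but not
    declared are scheduled for deletion.
    """
    want = {(r.kind, r.name): r for r in desired}
    have = {(r.kind, r.name): r for r in actual}
    changes = []
    for kind, name in sorted(want.keys() - have.keys()):
        changes.append(f"create {kind}/{name}")
    for kind, name in sorted(want.keys() & have.keys()):
        if want[(kind, name)].settings != have[(kind, name)].settings:
            changes.append(f"update {kind}/{name}")
    for kind, name in sorted(have.keys() - want.keys()):
        changes.append(f"delete {kind}/{name}")
    return changes


desired = [
    Resource("bucket", "images", (("versioning", True),)),
    Resource("vm", "annotator-api", (("size", "large"),)),
]
actual = [
    Resource("vm", "annotator-api", (("size", "small"),)),  # resized by hand
    Resource("vm", "debug-box", (("size", "small"),)),      # created by hand, never declared
]
for change in plan(desired, actual):
    print(change)
```

In this deliberately strict sketch, a manual change either gets written into the declaration or gets reverted, which keeps the code the single source of truth. Real tools are usually more conservative; Terraform, for example, only manages resources recorded in its own state.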
The drawback of this approach is that it relies heavily on automation, so any automation you build must account for certain infrastructure aspects:
Given all these challenges, even an early-stage startup has an infrastructure domain that should be owned by a dedicated infrastructure team, which should provide:
I design cloud infrastructure at Hasty.ai, a Berlin-based startup building the next-generation annotation tool for computer vision. Our custom AI assistants observe you while you annotate and then take over the annotation work for you. They let you annotate data 10x faster and give you rapid feedback, so you can validate and adapt your models as you work.