Practical machine learning is about much more than statistics and model architectures. Resilient and reliable infrastructure is the foundation of a performant ML pipeline.
In this series of articles, I'll share my experience as part of Hasty.ai's development team, building the infrastructure for an AI-enabled image annotation tool from day one. I'll give a brief overview of how to approach a startup's application infrastructure and build a strategy that keeps it under control, provides predictable reliability and costs, minimizes technical debt, and improves the development process overall.
The first article (this one) addresses the most common mistake startups make in infrastructure development: in the pursuit of growth, prioritizing shipping features over less tangible hygiene topics such as infrastructure and its surrounding domain. The second article will show how we adopted infrastructure as code as a solution to this problem at Hasty.ai, and the third will discuss how these changes catalyzed the scaling of our business.
The nature of infrastructure and its challenges
Infrastructure includes all provisioned hardware, services, and facilities needed to operate a product or application. This includes, but is not limited to:
- physical or virtual hardware (servers, network equipment), either self-owned and hosted or rented from a provider or cloud,
- anything-as-a-service that is part of serving the application,
- application-specific configurations related to the hardware and services.
Infrastructure is typically first provisioned and then configured; then the application is deployed for users to access. These three steps (provisioning, configuration, and deployment) are the key points to address when planning infrastructure work.
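The three steps can be sketched as explicit, ordered stages of a pipeline. Everything here is illustrative: the function names and the dictionary-based state are my own stand-ins, and in a real project each stage would be backed by a dedicated tool (e.g., Terraform for provisioning, Ansible for configuration, Helm for deployment).

```python
# A minimal sketch of the three infrastructure phases. All names and the
# dict-based "state" are hypothetical, not tied to any specific tool.

def provision(spec):
    """Create the raw facilities (VMs, networks, managed services)."""
    return {name: {"status": "created", **params} for name, params in spec.items()}

def configure(facilities, config):
    """Apply application-specific settings to provisioned facilities."""
    for name, settings in config.items():
        facilities[name]["config"] = settings
    return facilities

def deploy(facilities, app_version):
    """Roll the application out onto the configured facilities."""
    for facility in facilities.values():
        facility["app_version"] = app_version
    return facilities

spec = {"web-server": {"cpu": 2, "ram_gb": 4}}
config = {"web-server": {"port": 443, "tls": True}}
state = deploy(configure(provision(spec), config), app_version="1.4.2")
print(state["web-server"])
```

The point of the sketch is the ordering: configuration presupposes provisioned facilities, and deployment presupposes both, which is why each step must be planned and recorded separately.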
There is one intuitive yet critical difference between infrastructure and the application it supports. The application cannot exist without being written as code; infrastructure can. This often (if not always) leads to a situation where the infrastructure domain is seen as the continuous consumption of an infrastructure administration service alongside the application codebase, analogous to an accounting service for a company's operations. Among early-stage startups and small projects, this pattern is extremely popular, and it is a massive mistake.
Figure 1. Business consumes service from the infrastructure team, which does its "magic" to maintain the project infrastructure.
The key topic here is knowledge. Consuming a service does not guarantee that the knowledge the infrastructure team operates with will be stored and preserved. Strict contracts, requirements, and other employee-level enforcements don't guarantee it either: they produce a paper trail, but nothing enforces that it stays consistent and up to date, which is crucial. The only way to guarantee both is a strategy that leaves no way to skip the step of persisting and updating the knowledge base, so that implementing infrastructure changes through that step is easier than bypassing it. Building the development process this way is one of the biggest challenges between the business and infrastructure teams.
Important information for infrastructure to gather is:
- Facilities registry: what is provisioned, by which provider, and how and for what purposes it is being used.
- Access credentials registry: how a particular object can be accessed, who can access it, and with which privileges.
- Configurations: how hardware and services should be configured for each particular use case, and how they were configured and reconfigured over time.
- Scenarios: facilities provisioning and configuration, application deployment, certain maintenance procedures (e.g., key rotation), and disaster recovery.
- Architecture overview for audits, inspection, and developer onboarding.
- Monitoring and health status to identify and address problems.
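Several of these items, the facilities registry and the access credentials registry in particular, lend themselves to being kept as structured data instead of prose. Below is a minimal sketch of that idea; every field and value is an illustrative assumption, not a prescribed schema.

```python
# Keeping registries as structured data rather than documentation.
# Fields, names, and values are all hypothetical examples.

from dataclasses import dataclass, field

@dataclass
class Facility:
    name: str
    provider: str               # e.g. "gcp", "aws", "on-prem"
    purpose: str                # why this object exists
    config: dict = field(default_factory=dict)

@dataclass
class AccessGrant:
    facility: str               # which object can be accessed
    principal: str              # who can access it
    privileges: list = field(default_factory=list)  # with which privileges

registry = [
    Facility("pg-main", provider="gcp",
             purpose="primary application database",
             config={"tier": "db-custom-4-16384", "backups": True}),
]
grants = [
    AccessGrant("pg-main", principal="api-service",
                privileges=["read", "write"]),
]

# Because the registry is data, its consistency can be checked mechanically,
# something prose documentation can never offer:
known = {f.name for f in registry}
assert all(g.facility in known for g in grants), "grant references unknown facility"
```

This is exactly the property the article argues for: a knowledge base whose consistency is enforced by the tooling rather than by discipline.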
That knowledge is normally more expensive than the actual infrastructure objects themselves. It is part of the intellectual property a company builds as it develops, adding to its overall value. Obviously, keeping it in someone's memory is not an option.
Relying on written documentation is also limited, as documentation:
- requires constant updating, which usually happens well after the fact, if at all,
- is hard to verify regarding its accuracy, completeness, ambiguity, relevance, or general comprehensibility.
So while documentation is a proper solution for scenarios and the architecture overview, it is not acceptable for the facilities registry, configurations, etc. Any attempt to document them by hand would likely be a waste of time that produces useless results. The only remaining way to recover the information would be reverse engineering the running solution, which is both embarrassing and provides no guarantees.
Another option is an SLA with an external company that guarantees the running of your application and clearly outlines everything your infrastructure needs. But this separates product development from infrastructure evolution entirely. In practice, it means that either the engineering team is restricted in the solutions available for the challenges they face, or the SLA requires constant updates, because infrastructure has to evolve together with the application. That co-evolution is natural and efficient in every aspect, from performance to costs.
Finally, the worst consequence of a poor infrastructure strategy is that essential infrastructure knowledge becomes black magic performed by a wizard. In normal operation, this slows down application development whenever there's a need to rely on specific infrastructure functionality. In the event of a disaster, the house of cards collapses.
A more successful approach
The approach that handles the increasing complexity of infrastructure well (even small projects have several hundred infrastructure objects) has one key component: the knowledge later used to prepare the infrastructure is created by the team while building that infrastructure. This knowledge (basically, the code) is the source of truth, a form of intellectual property, and an asset that adds value to the whole business. The approach is made possible by infrastructure as code and the widespread use of automation.
Figure 2. The infrastructure team produces knowledge that is being turned into the actual infrastructure and gives back business value and provides control over the infrastructure domain.
The drawback of this approach is its heavy reliance on automation. You must therefore ensure that any automation accounts for certain characteristics of infrastructure:
- Infrastructure is very stateful when it comes to running databases or object storage, whose very purpose is to preserve state. Any manipulations and modifications need to respect this.
- Infrastructure inherently changes its state over time. For example, a server may run short on disk space, making certain functions unavailable, or updates may change some APIs.
- Infrastructure often depends on human interactions. Someone must enter payment details, agree to terms and conditions, sign a contract, or call a provider to provision or modify something or to get access to some API. These requirements are unavoidable and may be deeply embedded in routines that are hard to automate, so a flow for handling them needs to be mapped out.
- Infrastructure has very soft borders. Sometimes it is hard to say whether a given service or piece of hardware is part of the project infrastructure at all; the question, and who should own the object, is debatable and almost philosophical.
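The second point, state changing over time, is what infrastructure-as-code tools address with drift detection: comparing the declared (desired) state against what the provider actually reports. Here is a minimal sketch of the comparison itself; `fetch_actual_state` is a hypothetical stand-in for a real provider API call, and the resources are invented examples.

```python
# Sketch of drift detection: diff the declared state against the state
# the provider reports. fetch_actual_state() is a stand-in for a real
# cloud-provider API call.

desired = {"web-server": {"cpu": 2, "disk_gb": 100}}

def fetch_actual_state():
    # Pretend the disk was resized by hand, outside the code.
    return {"web-server": {"cpu": 2, "disk_gb": 200}}

def diff(desired, actual):
    """Return {resource: {attribute: (wanted, found)}} for every mismatch."""
    drift = {}
    for name, want in desired.items():
        have = actual.get(name, {})
        changed = {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}
        if changed:
            drift[name] = changed
    return drift

print(diff(desired, fetch_actual_state()))
# {'web-server': {'disk_gb': (100, 200)}}
```

Real tools (Terraform's plan/refresh cycle, for instance) do essentially this at scale, which is why the automation must treat the live state, not the code alone, as an input.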
With all these challenges, even an early-stage startup has an infrastructure domain that should be addressed by a dedicated infrastructure team providing:
- A reliably functioning application that satisfies the SLA and empowers developers to predict and address potential issues.
- Reproducible deployments to deliver an application in the same manner to the same or multiple targets.
- Exhaustive information about provisioned hardware and services, their configurations, and the purpose of their existence.
- Disaster recovery scenarios that can be activated and executed in the event of a malfunction.
- A knowledge base for other team members to learn infrastructure in the context of their work or the execution of disaster recovery scenarios.
I design cloud infrastructure at Hasty.ai, a Berlin-based startup building a next-gen annotation tool for computer vision. We have custom AI assistants that observe you while you annotate and then take over the annotation work for you. They let you annotate data 10x faster and give you rapid feedback so you can validate and adapt your models as you work.
We've just started our blog and want to keep sharing our experience of working with computer vision projects on a daily basis. To make sure we provide content relevant to you, please fill out this survey. It takes less than a minute, we promise!