26.01.2021 — Andriy Borodiychuk

Infrastructure that is never noticed

Practical machine learning is about much more than statistics and model architectures. Resilient, reliable infrastructure is the foundation for successfully building a performant ML pipeline.

In this series of articles, I'll share my experience of being part of the development team and building the infrastructure for an AI-enabled image annotation tool since day 1. I'll give a brief overview of how to approach a startup's application infrastructure and build a strategy that keeps it under control, provides predictable reliability and expenses, minimizes technical debt, and improves the development process overall.
The first article (this one) addresses the most common mistake startups make regarding infrastructure development: in the pursuit of growth, prioritizing shipping features over less tangible hygiene topics such as infrastructure and the surrounding domain. The second article will reveal how we adopted infrastructure as code as a solution to this problem, and the third will discuss how these changes catalyzed the scaling of our business.

The nature of infrastructure and its challenges

Infrastructure includes all provisioned hardware, services, and facilities needed to operate a product or application.

Infrastructure is typically provisioned first and then configured; then the application is deployed for users to access. These three steps (provisioning, configuration, and deployment) are the key points to address when planning work on infrastructure.
There is one very intuitive yet critical difference between infrastructure and the application it supports: the application cannot exist without being written as code, but infrastructure can. This often (if not always) leads to a situation where the infrastructure domain is seen as an administration service that is continuously consumed alongside the application codebase, much like an accounting service that supports a company's operations. In early-stage startups and small projects, this is an extremely popular pattern and a massive mistake.

Figure 1. Business consumes service from the infrastructure team, which does its "magic" to maintain the project infrastructure.

The key topic here is knowledge. Consuming a service does not guarantee that the knowledge the infrastructure team works with will be stored and preserved. Strict contracts, requirements, and other employee-level enforcement don't guarantee that either: they produce a paper trail, but nothing ensures that the paper trail stays consistent and up to date, which is crucial. The only way to guarantee this is to implement a strategy in which the step of persisting and updating the knowledge base cannot be skipped, and in which making infrastructure changes through that step is easier than making them any other way. Building the development process this way is one of the biggest challenges between the business and infrastructure teams.
The important information to gather about infrastructure is:

  1. Facilities registry: what is provisioned by the provider, including how and for what purposes it is being used.
  2. Access credentials registry: how a particular object can be accessed, who can access it, and with which privileges.
  3. Configurations: how hardware and services should be configured for every particular use case, and how they were actually configured and reconfigured.
  4. Scenarios: facilities provisioning and configuration, application deployment, certain maintenance procedures (e.g., key rotation), and disaster recovery.
  5. Architecture overview for audits, inspection, and developer onboarding.
  6. Monitoring and health status to identify and address problems.
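
As a rough illustration of the first three items, the facilities registry, access credentials registry, and configurations can live together as structured data rather than as prose. The sketch below is only an assumption about what such a record might look like; the class names, field names, and sample entries (`db-main`, `example-cloud`, `alice`) are hypothetical and not taken from any real setup.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessGrant:
    principal: str   # who can access the object (hypothetical name)
    privilege: str   # with which privilege, e.g. "read" or "admin"


@dataclass
class Facility:
    name: str
    provider: str
    purpose: str                                  # why the facility exists
    config: dict = field(default_factory=dict)    # how it is configured
    grants: list = field(default_factory=list)    # who can access it


def principals_with(registry, privilege):
    """Map each facility name to the principals holding `privilege` on it."""
    result = {}
    for facility in registry:
        names = [g.principal for g in facility.grants if g.privilege == privilege]
        if names:
            result[facility.name] = names
    return result


# Hypothetical registry entries, for illustration only.
registry = [
    Facility("db-main", "example-cloud", "primary application database",
             config={"engine": "postgres", "storage_gb": 100},
             grants=[AccessGrant("alice", "admin"),
                     AccessGrant("api-service", "read")]),
    Facility("assets", "example-cloud", "uploaded images",
             grants=[AccessGrant("api-service", "read")]),
]

print(principals_with(registry, "admin"))
```

Once the registry is data, questions such as "who has admin access, and to what?" become a function call instead of an interview with whoever set things up.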

That knowledge is normally worth more than the actual infrastructure objects themselves. It is part of the intellectual property that a company gains as it develops, adding to its overall value. Obviously, keeping these things stored in someone's memory is not an option.
Relying on written documentation is also limited, because nothing forces the documentation to stay consistent with the infrastructure it describes.

So while documentation is a proper solution for scenarios and the architecture overview, it is not acceptable for the facilities registry, configurations, and the like. Any attempt to document them by hand would likely be a waste of time and produce useless results. The only remaining way of getting the information would then be to reverse engineer the solution, which is a big shame and does not provide any guarantees.
Another option is to have an SLA with an external company that guarantees running your application and clearly outlines everything needed for your infrastructure. Product development is then completely separated from infrastructure evolution. In practice, this often means that either the engineering team is restricted in the solutions available for the challenges it faces, or the SLA requires constant updates, because infrastructure has to evolve together with the application. Evolving the two together is natural and efficient in terms of both performance and costs.
Finally, the worst consequence of a poor infrastructure strategy is that essential knowledge about infrastructure becomes black magic done by a wizard. In the normal workflow, this slows down application development whenever it needs to rely on specific infrastructure functionality. Moreover, in the event of a disaster, this house of cards starts to collapse.

A more successful approach

The approach that handles the growing complexity of infrastructure well (even small projects have several hundred infrastructure objects) has one key component: the knowledge that is later used to prepare the infrastructure is created by the team while it builds that infrastructure. That knowledge (basically, the code) is the source of truth, a form of intellectual property, and an asset that adds value to the whole business. This approach is made possible by infrastructure as code and the widespread use of automation.
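
To make the idea of code as the source of truth a little more concrete, here is a minimal sketch of the comparison step that declarative tools typically perform: the desired state is written down as data, and the automation works out what has to change in the real environment. This is an illustration under assumptions, not the tooling described in this series; the `plan` function and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Compare declared resources with what exists and list the changes needed."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))      # declared but not provisioned
        elif name not in desired:
            changes.append(("delete", name))      # provisioned but not declared
        elif desired[name] != actual[name]:
            changes.append(("update", name))      # configuration has drifted
    return changes


# Desired state, as it would be kept in version control (hypothetical values).
desired = {
    "web": {"type": "vm", "size": "small", "count": 2},
    "db": {"type": "postgres", "storage_gb": 100},
}

# What a provider might report as actually existing (hypothetical values).
actual = {
    "web": {"type": "vm", "size": "small", "count": 1},
    "legacy-cache": {"type": "redis"},
}

changes = plan(desired, actual)
for action, name in changes:
    print(action, name)
```

Because every change starts as an edit to `desired`, the knowledge base cannot fall behind the infrastructure: the code is updated first, and the infrastructure follows.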

Figure 2. The infrastructure team produces knowledge that is turned into the actual infrastructure, which gives business value back and provides control over the infrastructure domain.

The drawback of this approach is that it relies heavily on automation. Therefore, you must ensure that the automation itself accounts for certain infrastructure aspects.

With all these challenges, even an early-stage startup has an infrastructure domain that should be addressed by a dedicated infrastructure team. That team should provide:

  1. A reliably functioning application that satisfies the SLA and empowers developers to predict and address potential issues.
  2. Reproducible deployments to deliver an application in the same manner to the same or multiple targets.
  3. Exhaustive information about provisioned hardware and services, their configurations, and the purpose of their existence.
  4. Disaster recovery scenarios that can be activated and executed in the event of a malfunction.
  5. A knowledge base for other team members to learn infrastructure in the context of their work or the execution of disaster recovery scenarios.
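
The second item, reproducible deployments, can be sketched as an idempotent `apply` step: the same specification is applied to every target, and applying it again changes nothing. The code below is a toy model under assumptions (targets are plain dictionaries, and the spec contents such as `app:1.4.2` are hypothetical); real deployment tools are far more involved.

```python
import copy


def apply(spec, target_state):
    """Bring target_state in line with spec; return the number of changes made."""
    changed = 0
    for name, config in spec.items():
        if target_state.get(name) != config:
            target_state[name] = copy.deepcopy(config)   # create or update
            changed += 1
    for name in list(target_state):
        if name not in spec:
            del target_state[name]                       # remove undeclared objects
            changed += 1
    return changed


# One specification, delivered in the same manner to multiple targets.
spec = {
    "web": {"image": "app:1.4.2", "replicas": 2},
    "worker": {"image": "app:1.4.2", "replicas": 1},
}
staging = {}
production = {
    "web": {"image": "app:1.3.0", "replicas": 2},
    "old-job": {"image": "job:0.9"},
}

first = [apply(spec, staging), apply(spec, production)]
second = [apply(spec, staging), apply(spec, production)]   # re-applying is a no-op

print(first, second, staging == production)
```

The useful property is that the targets converge to the same state regardless of where they started, which is also what makes a disaster recovery scenario executable rather than improvised.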

