Creating a great API
These are some unstructured musings about what makes an API great.
To benefit the consumers of the API it should be:
— Easy to integrate
— Reliable
— Fast
To benefit those maintaining the API it should be:
— Easy to modify
— Fault tolerant
— Efficient
— Secure
Fast:
An API allows two programs to communicate. Just how quickly can we send data from one program to another? We are going to investigate how much bandwidth is available and how effectively it is used. As a test, I wrote a client and server in Rust. The client streams 10 GB of data to the server as UDP packets, sends a final verification message and then exits. Execution completes in under a second.
Network and disk access is handled by the kernel, so it shows up as system (sys) time, and a large portion of the run was spent waiting on the kernel to handle the network requests.
10 GB in 0.824 seconds works out to a bandwidth of just under 100 Gbps, well above typical internet connection speeds.
For a data-heavy API that does not need to wait on other systems, the network is going to be our bottleneck.
If your own API is not constrained by bandwidth limitations, there might be other things to focus on first.
If network bandwidth has been identified as the limiting factor, take steps to ensure that the available bandwidth is used effectively. Responses can be compressed. Client-side caching can ensure that data is sent only once. Technologies like GraphQL can ensure that only the required data is sent. For certain data sets, such as audio, a binary format like Protobuf can be far more efficient than JSON over HTTP.
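As a rough illustration of the compression point, the sketch below gzips a repetitive JSON payload before it would be sent over the wire. It assumes the flate2 crate is available; the payload itself is made up for the example.

```rust
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // A repetitive JSON payload, standing in for a typical API response.
    let json: String = (0..1000)
        .map(|i| format!(r#"{{"id":{},"status":"ok"}}"#, i))
        .collect::<Vec<_>>()
        .join(",");

    // Gzip-compress it before it goes over the wire.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(json.as_bytes())?;
    let compressed = encoder.finish()?;

    println!("raw: {} bytes, compressed: {} bytes", json.len(), compressed.len());
    Ok(())
}
```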
Reliable against hardware failures:
Anticipating and mitigating various forms of failure is necessary to create a super-reliable API.
Most of the following techniques add complexity, so they must be weighed against the benefit of increased uptime. Also consider the impact that downtime has on consumers of the API. Improving uptime from 99.99%* to 100% might require a huge amount of time and money, yet a client with an unreliable mobile phone connection would probably not notice the difference.
*An uptime of 99.99% means the API would be offline for about 8.6 seconds per day.
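The downtime budget follows directly from the uptime percentage:

```rust
fn main() {
    // Downtime budget per day for a given uptime percentage.
    for uptime in [99.0, 99.9, 99.99, 99.999] {
        let seconds_down = 86_400.0 * (100.0 - uptime) / 100.0;
        println!("{uptime}% uptime = about {seconds_down:.1} s of downtime per day");
    }
}
```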
Servers will need to be restarted on occasion to allow security updates to be installed. They can also fail unexpectedly due to hard disk errors, segfaults or power failures. To achieve uptime greater than the uptime of a single physical server, multiple copies of the API service can be run on different servers.
If there are multiple servers running, clients can no longer simply connect to a single server; there now needs to be a load-balancing system that routes client requests to a currently running service. The load balancer needs to be informed of which servers are currently ready to process client requests, and that is the job of a service-discovery service.
To protect against hardware failures, every component of the API service needs multiple redundant copies running. This includes the database, back-end service and load balancer. Each component must also be able to handle failures of other components. For example, if a load balancer forwards a request to a back-end service that crashes, it needs to retry with another instance.
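Here is a toy sketch of that retry behaviour, assuming a hard-coded list of back-end addresses; a real load balancer would get this list from service discovery rather than a constant.

```rust
use std::net::TcpStream;
use std::time::Duration;

// Hypothetical list of redundant back-end instances.
const BACKENDS: [&str; 3] = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"];

/// Try each back end in turn until one accepts the connection.
fn connect_to_healthy_backend() -> Option<TcpStream> {
    for addr in BACKENDS {
        match TcpStream::connect_timeout(&addr.parse().ok()?, Duration::from_millis(200)) {
            Ok(stream) => return Some(stream),
            Err(_) => continue, // this back end is down, fall through to the next one
        }
    }
    None // every back end failed
}
```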
The screenshot shows that medium.com has multiple IP addresses. If the external-facing server with the first IP address were unavailable, a web browser would attempt to connect to the second.
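The same thing can be seen programmatically; this small sketch uses the standard library resolver to print every address the DNS record returns.

```rust
use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // Resolve medium.com and print each address the DNS record returns.
    // A browser would try these in order if the first one is unreachable.
    for addr in "medium.com:443".to_socket_addrs()? {
        println!("{addr}");
    }
    Ok(())
}
```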
Reliable against human failures:
Code is rarely bug-free and adding features can introduce bugs.
It is hard to write a complex system that is reliable. A common practice is to write several smaller components that can be thoroughly tested and verified in isolation. These proven components can be connected together to provide a system that is complex and reliable.
Unit testing is a common way to prove that small components of the code are working as expected. If a function doubles a number, a unit test could verify that the results are correct for a variety of sample cases, e.g. double(2) = 4.
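In Rust, that test might look like the following (the double function is just the example above; run it with cargo test):

```rust
fn double(x: i32) -> i32 {
    x * 2
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn doubles_a_variety_of_inputs() {
        assert_eq!(double(2), 4);
        assert_eq!(double(0), 0);
        assert_eq!(double(-3), -6);
    }
}
```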
Some classes of bugs can be avoided almost entirely by choosing a programming language that rules them out, or by using tooling that detects, or at least warns about, them.
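Rust itself is an example: it has no null references, so a value that might be absent must be an Option, and the compiler forces the caller to handle the empty case. The find_user function below is made up for illustration.

```rust
// There is no null here: an absent value is represented by Option,
// and the compiler requires the None case to be handled before use.
fn find_user(id: u64) -> Option<String> {
    if id == 1 { Some("alice".to_string()) } else { None }
}

fn main() {
    match find_user(2) {
        Some(name) => println!("found {name}"),
        None => println!("no such user"), // omitting this arm is a compile error
    }
}
```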
Ideally, all bugs would be discovered by developers or testers before the code is pushed into production. Even with great care, some bugs will end up in customer-facing services.
Some bugs in production can be detected by normal monitoring tools. A sudden increase in the percentage of 404 errors returned by a service could be the result of a bug.
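Here is a sketch of the kind of check a monitoring system might run against that metric; the 5% threshold and the request counts are arbitrary examples.

```rust
/// Fire an alert if 404 responses exceed 5% of traffic (an arbitrary example threshold).
fn error_rate_alert(total_requests: u64, not_found_responses: u64) -> bool {
    if total_requests == 0 {
        return false;
    }
    let rate = not_found_responses as f64 / total_requests as f64;
    rate > 0.05
}

fn main() {
    // 600 of the last 10,000 requests returned 404: 6%, above the threshold.
    assert!(error_rate_alert(10_000, 600));
    // 20 of 10,000 is 0.2%: within the normal range.
    assert!(!error_rate_alert(10_000, 20));
}
```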
Other bugs might be harder to detect by looking at metrics. A cache service that is delivering stale data might appear to be behaving correctly.
Once a bug is detected in production, it should be easy and fast to roll back to a good version, or to make a fix and roll forward. Implementing a CI/CD pipeline is crucial: it makes production changes simple and reproducible, and it blocks releases when automated tests find new bugs.

Each new version of the services created by the CI/CD pipeline should have a version number. When a new version is ready, new service instances should be started on the new version while the old instances are shut down. This can be done gradually, diverting just a small percentage of traffic to the new version. If a problem is detected, the new version can be marked as bad and the roll-out halted.
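A toy sketch of the traffic split during such a roll-out; the version labels and the 5% canary share are made-up examples, not a prescription.

```rust
// Weighted traffic split during a roll-out.
// `canary_percent` is the share of requests sent to the new version.
fn route_request(request_id: u64, canary_percent: u64) -> &'static str {
    if request_id % 100 < canary_percent {
        "v2-new"
    } else {
        "v1-stable"
    }
}

fn main() {
    // Start with 5% of traffic on the new version; raise it if no problems appear,
    // or set it back to 0 to halt the roll-out.
    let canary_percent = 5;
    let v2_hits = (0..10_000u64)
        .filter(|id| route_request(*id, canary_percent) == "v2-new")
        .count();
    println!("{} of 10000 requests went to the new version", v2_hits);
}
```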
Further points to discuss:
— JSON over HTTP vs Protobuf over gRPC
— Using a cache to improve uptime
— Pub-sub model to ensure cache is never stale
— What endpoints to offer
— Security
— Versioning