
Writing Applications that Scale

13 min read | 08 Feb 2017


In this article by Anand Balachandran Pillai, the author of the book Software Architecture with Python, you will learn how to architect a software application in Python so that it scales.


Imagine the checkout counter of a supermarket on a Saturday evening, the usual rush hour time. It is common to see long queues of people waiting to check out with their purchases. What would a store manager do to reduce the rush and waiting time?

A typical manager would try a few approaches, including telling those manning the checkout counters to pick up their speed, and redistributing people across queues so that each queue has roughly the same wait time. In other words, the manager would manage the current load with the available resources by optimizing the performance of the existing resources.

However, if the store has existing counters that are not in operation and enough people at hand to staff them, the manager could open those counters and move people to them. In other words, he would add resources to the store to scale the operation.

Software systems, too, scale in a similar way. An existing software application can be scaled by adding compute resources to it.

When a system scales by adding to, or making better use of, resources inside a compute node, such as CPU or RAM, it is said to scale vertically or scale up. On the contrary, when a system scales by adding more compute nodes to it, such as by creating a load-balanced cluster, it is said to scale horizontally or scale out.

The degree to which a software system is able to scale when compute resources are added is called its scalability. Scalability is measured in terms of how much the system's performance characteristics, such as throughput or latency, improve when resources are added. For example, if a system doubles its capacity by doubling the number of servers, it is scaling linearly.

Increasing the concurrency of a system often increases its scalability. In the preceding supermarket example, the manager is able to scale out his operations by opening additional counters. In other words, he increases the amount of concurrent processing done in his store. Concurrency is the amount of work that gets done simultaneously in a system.

We will look at different techniques of scaling a software application with Python. We start with concurrency techniques within a machine, such as multithreading and multiprocessing, and go on to discuss asynchronous execution. We also look at how to scale out an application across multiple servers, as well as some theoretical aspects of scalability and its relation to availability.

Scalability and performance

How do we measure the scalability of a system? Let's take an example and see how this could be done.

Let's say our application is a simple report generation system for employees. It is able to load employee data from a database and generate a variety of reports in bulk, such as payslips, tax deduction reports, employee leave reports, and so on.

The system is able to generate 120 reports per minute. This is the throughput or capacity of the system, expressed as the number of successfully completed operations in a given unit of time. Let's say the time it takes to generate a report at the server side (the latency) is roughly 2 seconds.

Let's say the architect decides to scale up the system by doubling the RAM on its server.

Once this is done, a test shows that the system is able to increase its throughput to 180 reports per minute. The latency remains the same at 2 seconds.

So at this point, the system has scaled close to linearly in terms of the memory added. The scalability of the system, expressed in terms of the increase in throughput, is as follows:

Scalability (throughput) = 180/120 = 1.5X

As the second step, the architect decides to double the number of servers on the backend, all with the same memory. After this step, he finds that the system's throughput has increased to 350 reports per minute. The scalability achieved by this step is as follows:

Scalability (throughput) = 350/180 = 1.9X

The system has now responded much better, with a close to linear increase in scalability.

After further analysis, the architect finds that by rewriting the code that processes reports on the server to run in multiple processes instead of a single process, he is able to reduce the processing time at the server, and hence the latency of each request, by roughly 1 second at peak time. The latency has now gone down from 2 seconds to 1 second.
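As a rough sketch of the kind of change being described, here is how such a rewrite might look using Python's multiprocessing module. Note that generate_report is a hypothetical stand-in for the report-processing code, not the actual code of the system being discussed:

# report_workers.py - an illustrative sketch, not the system's real code
from multiprocessing import Pool

def generate_report(employee_id):
    """ Hypothetical stand-in for the CPU-heavy report processing """
    # ... load employee data, compute, and render the report ...
    return 'report-%d' % employee_id

if __name__ == '__main__':
    employee_ids = range(100)
    # Distribute report generation across a pool of worker
    # processes instead of running it in a single process.
    with Pool(processes=4) as pool:
        reports = pool.map(generate_report, employee_ids)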

The system's performance with respect to latency has become better, as follows:

Performance (latency) = 2/1 = 2X

How does this affect the scalability? Since the latency per request has come down, the system overall would be able to respond to similar loads at a faster rate than it could earlier, as the processing time per request is now lower. In other words, with the exact same resources, the system's throughput, and hence its scalability, would have increased, assuming other factors remain the same.
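For reference, here is the same arithmetic as a few lines of Python; the numbers are the ones from the example above:

baseline = 120    # reports per minute, original system
scaled_up = 180   # after doubling the RAM
scaled_out = 350  # after doubling the number of servers

print('Scale-up scalability : %.1fX' % (scaled_up / baseline))    # 1.5X
print('Scale-out scalability: %.1fX' % (scaled_out / scaled_up))  # 1.9X
print('Latency performance  : %.0fX' % (2 / 1))                   # 2X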

Let's summarize what we have discussed so far in the following list:

  1. First, the architect increased the throughput of a single system by scaling it up with extra memory, which increased the overall scalability of the system. In other words, he boosted the performance of the whole system by scaling up a single node.
  2. Next, he added more nodes to the system, and hence increased its ability to perform work concurrently, and found that the system responded well, rewarding him with a close to linear scalability factor. Simply put, he increased the throughput of the system by scaling its resource capacity. In other words, he increased the scalability of the system by scaling out, that is, by adding more compute nodes.
  3. Finally, he made a critical fix by running the computation in more than one process. In other words, he increased the concurrency of a single system by dividing the computation into more than one part. He found that this improved the performance characteristics of the application by reducing its latency, potentially setting up the application to handle workloads better at high stress.

We find that there is a relation between scalability, performance, concurrency, and latency as follows:

  • When the performance of a single system goes up, the scalability of the total system goes up
  • When an application scales within a single machine by increasing its concurrency, it has the potential to improve the performance, and hence the net scalability, of the system in deployment
  • When a system reduces its processing time at the server, that is, its latency, it contributes positively to scalability

We have captured the relationships between concurrency, latency, performance, and scalability in the following table:

Concurrency | Latency | Performance | Scalability
----------- | ------- | ----------- | -----------
High        | Low     | High        | High
High        | High    | Variable    | Variable
Low         | High    | Poor        | Poor

An ideal system is one that has good concurrency and low latency. Such a system has high performance and would respond better to scaling up and/or scaling out.

A system with high concurrency but also high latency would have variable characteristics: its performance, and hence its scalability, would potentially be very sensitive to other factors such as network load, current system load, the geographical distribution of compute resources and requests, and so on.

A system with low concurrency and high latency is the worst case: it would be difficult to scale such a system, as it has poor performance characteristics. The latency and concurrency issues should be addressed before the architect decides to scale the system either horizontally or vertically.

Scalability is always described in terms of the variation in throughput.

Concurrency

A system's concurrency is the degree to which the system is able to perform work simultaneously instead of sequentially. An application written to be concurrent can, in general, execute more units of work in a given time than one written to be sequential or serial.

When we make a serial application concurrent, we enable it to make better use of the existing compute resources in the system (CPU and/or RAM) at a given time. Concurrency, in other words, is the cheapest way of making an application scale inside a machine in terms of the cost of compute resources.

Concurrency can be achieved using different techniques. The common techniques are as follows:

  • Multithreading: The simplest form of concurrency is to rewrite the application to perform parallel tasks in different threads. A thread is the simplest sequence of programming instructions that can be executed by a CPU. A program can consist of any number of threads. By distributing tasks to multiple threads, a program can execute more work simultaneously. All threads run inside the same process.
  • Multiprocessing: The next step up is to scale the program to run in multiple processes instead of a single process. Multiprocessing involves more overhead than multithreading in terms of message passing and shared memory. However, programs that perform a lot of high-latency operations, such as disk reads, and those that perform a lot of CPU-heavy computation can benefit more from multiple processes than from multiple threads.
  • Asynchronous processing: In this technique, operations are performed asynchronously; in other words, there is no ordering of concurrent tasks with respect to time. Asynchronous processing usually picks tasks from a queue and schedules them to execute at a future time, often receiving the results in callback functions or special future objects. Typically, operations are performed in a single thread. (A minimal sketch follows this list.)
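As a taste of asynchronous processing, here is a minimal sketch using the modern async/await syntax (Python 3.7+); it is not from the book, and asyncio.sleep merely stands in for a real I/O wait:

# async_sketch.py - a minimal illustrative sketch
import asyncio

async def fetch(task_id):
    """ Pretend to do some I/O-bound work without blocking """
    await asyncio.sleep(1)    # stands in for a network or disk wait
    return 'result-%d' % task_id

async def main():
    # All three tasks share one thread; while one task awaits,
    # the event loop runs the others.
    results = await asyncio.gather(*(fetch(i) for i in range(3)))
    print(results)

asyncio.run(main())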

There are other forms of concurrent computing, but in this article, we will focus our attention on these three only.

Python, especially Python 3, has built-in support for all these concurrent computing techniques in its standard library. For example, it supports multithreading via its threading module and multiple processes via its multiprocessing module. Asynchronous execution support is available via the asyncio module. A form of concurrent processing that combines asynchronous execution with threads and processes is available via the concurrent.futures module.
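For instance, here is a minimal sketch of concurrent.futures in action; the square function is just an illustrative placeholder:

# futures_sketch.py - a minimal illustrative sketch
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# The same code works with ProcessPoolExecutor for process-based
# execution; only the executor class changes.
with ThreadPoolExecutor(max_workers=4) as executor:
    print(list(executor.map(square, range(10))))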

Concurrency versus parallelism

We will take a brief look at the concept of concurrency and its close cousin, namely parallelism.

Both concurrency and parallelism are about executing work simultaneously rather than sequentially. However, in concurrency, the two tasks need not be executing at exactly the same time; they just need to be scheduled so that they can execute simultaneously. Parallelism, on the other hand, requires both tasks to execute together at a given point in time.

To take a real-life example, let's say you are painting two exterior walls of your house. You have employed just one painter and you find that he is taking a lot more time than you thought. You can solve the problem in two ways.

Instruct the painter to paint a few coats on one wall before switching to the next wall and doing the same there. Assuming he is efficient, he will work on both walls simultaneously (though not at the same time) and achieve the same degree of completion on both walls at any given time. This is a concurrent solution.

Employ one more painter, and instruct the first painter to paint the first wall and the second painter to paint the second wall. This is a truly parallel solution.

For example, two threads performing byte code computations on a single-core CPU are not performing truly parallel computation, as the CPU can accommodate only one thread at a time. However, from a programmer's perspective, they are concurrent: the CPU scheduler switches in and out of the threads so quickly that, to all appearances, they run in parallel. But they are not truly parallel.

However, on a multi-core CPU, two threads can perform computations at the same time on different cores. This is true parallelism.

Parallel computation requires computation resources to increase at least linearly with respect to its scale. Concurrent computation can be achieved using techniques of multi-tasking where work is scheduled and executed in batches, making better use of existing resources.
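To see the difference in code, the following sketch times the same CPU-bound function run first by two threads and then by two processes, so you can compare the two on your own machine. The busy function and the loop count are illustrative assumptions, not code from the book:

# parallel_vs_concurrent.py - an illustrative sketch
import time
from threading import Thread
from multiprocessing import Process

def busy():
    """ A purely CPU-bound loop """
    total = 0
    for i in range(10**7):
        total += i

def timed_run(worker_class):
    """ Start two workers of the given class and time them """
    workers = [worker_class(target=busy) for _ in range(2)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == '__main__':
    # Compare how long two threads take versus two processes.
    print('Two threads  : %.2fs' % timed_run(Thread))
    print('Two processes: %.2fs' % timed_run(Process))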

In this article, we will use the term concurrent fairly uniformly to indicate both types of execution. In some places, it may indicate concurrent processing in the traditional sense, and in others, true parallel processing. Kindly use the context to disambiguate.

Concurrency in Python – multithreading

We will start our discussion on concurrent techniques in Python with multithreading.

Python supports multithreading via its threading module, which exposes a Thread class that encapsulates a thread of execution. Along with it, the module also exposes the following synchronization primitives:

  • Lock object: This is useful for synchronized, protected access to shared resources; its reentrant cousin is RLock (a small example of Lock in action follows this list)
  • Condition object: This is useful for threads to synchronize while waiting for arbitrary conditions
  • Event object: This provides a basic signaling mechanism between threads
  • Semaphore object: This allows synchronized access to a limited number of resources
  • Barrier object: This allows a fixed number of threads to wait for each other, synchronize to a particular state, and proceed
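Here is a minimal sketch of the Lock primitive protecting a shared counter; the thread and iteration counts are arbitrary:

# lock_sketch.py - a minimal illustrative sketch
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100000):
        # Without the lock, the read-modify-write on `counter`
        # could interleave across threads and lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400000, with no lost updates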

The thread objects in Python can be combined with the synchronized Queue class from the queue module to implement thread-safe producer/consumer workflows.
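A minimal sketch of such a workflow, with one producer thread and one consumer thread (the item values are arbitrary):

# producer_consumer.py - a minimal illustrative sketch
import threading
import queue

tasks = queue.Queue()

def producer():
    for i in range(5):
        tasks.put(i)

def consumer():
    while True:
        item = tasks.get()
        print('processed', item)
        tasks.task_done()

threading.Thread(target=producer).start()
# A daemon thread lets the program exit once the queue drains.
threading.Thread(target=consumer, daemon=True).start()

tasks.join()   # block until every queued item is marked done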

Thumbnail generator

Let's start our discussion of multithreading in Python with the example of a program that generates thumbnails of image URLs. We use the Python Imaging Library (PIL) to perform this operation:

# thumbnail_converter.py
from PIL import Image
import urllib.request

def thumbnail_image(url, size=(64, 64), format='.png'):
    """ Save thumbnail of an image URL """

    im = Image.open(urllib.request.urlopen(url))
    # Filename is built from the last two pieces of the URL,
    # minus the extension, plus a '_thumb' suffix and the format.
    pieces = url.split('/')
    filename = ''.join((pieces[-2], '_', pieces[-1].split('.')[0], '_thumb', format))
    # Note: LANCZOS was named ANTIALIAS in older Pillow versions.
    im.thumbnail(size, Image.LANCZOS)
    im.save(filename)
    print('Saved', filename)

This works very well for single URLs.

Let's say we want to convert five image URLs to their thumbnails as shown in the following code snippet:

img_urls = ['https://dummyimage.com/256x256/000/fff.jpg',
            'https://dummyimage.com/320x240/fff/00.jpg',
            'https://dummyimage.com/640x480/ccc/aaa.jpg',
            'https://dummyimage.com/128x128/ddd/eee.jpg',
            'https://dummyimage.com/720x720/111/222.jpg']

The code for using the preceding function would be as follows:

for url in img_urls:
    thumbnail_image(url)

Let's see how such a function performs with respect to the time taken:

[Screenshot: time taken by the serial thumbnail converter for the five URLs]
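The screenshot is not reproduced here, but the measurement is easy to repeat with a small timing harness such as the following; actual numbers will depend on your machine and network:

import time

start = time.time()
for url in img_urls:
    thumbnail_image(url)
print('Serial run took %.2f seconds' % (time.time() - start))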

Let's now scale the program to multiple threads so that we can perform the conversions concurrently. Here is the rewritten code to run each conversion in its own thread (not showing the function itself as it hasn't changed):

import threading

threads = []
for url in img_urls:
    t = threading.Thread(target=thumbnail_image, args=(url,))
    t.start()
    threads.append(t)
# Wait for all the conversions to complete.
for t in threads:
    t.join()

Take a look at the response time of the threaded thumbnail converter for five URLs, as shown in the following screenshot:

[Screenshot: time taken by the threaded thumbnail converter for the five URLs]

With this change, the program returns in 1.76 seconds, almost equal to the time taken by a single URL in the serial execution we saw earlier. In other words, the program has now linearly scaled with respect to the number of threads. Note that we had to make no change to the function itself to get this scalability boost.

Summary

In this article, you learned the importance of writing scalable applications. We also saw the relationships between concurrency, latency, performance, and scalability, and the techniques we can use to achieve concurrency. You also learned how to generate thumbnails of image URLs using PIL.
