In this article by Prakash and Achyutuni Sri Krishna Rao, authors of the book R Data Structures and Algorithms, we will discuss how an algorithm can be defined as a set of step-by-step instructions which govern the outline of a program that needs to be executed using computational resources. The execution can be in any programming language such as R, Python, and Java. Data is an intricate component of any program, and depending on how data is organized (data structure), your execution time can vary drastically. That’s why data structure is such a critical component of any good algorithm implementation.

(For more resources related to this topic, see here.)

The sorting algorithm, which acts as a connecter between the user-defined input and user-desired output, can be approached in multiple ways:

Bubble sort and Shell sort, which are simple variants of sorting, but are highly inefficient
Insertion sort and Selection sort, primarily used for sorting small datasets
Merge sort, Heap sort, and Quick sort, which are efficient ways of sorting based on the complexities involved in an average system runtime
Distributed sorts such as counting sort, bucket sort, and radix sort, which can handle both runtime and memory usage

Each of these options can, in turn, handle a particular set of instances more effectively. This essentially deduces the concept of a “good algorithm”. An algorithm can be termed as “good” if it possesses attributes such as the following among many others:

Shorter running time
Lesser memory utilization
Simplicity in reading the code

Generality in accepting inputs This book will concentrate primarily on running time or time complexity, partly on memory utilization, and their relationship during program execution.

Introduction

A problem can be approached using multiple algorithms, and each algorithm can be assessed based on certain parameters such as:

System runtime
Memory requirement

However, these parameters are generally affected by external environmental factors such as:

Handling of data structures
System software and hardware configurations
Style of writing and compiling codes
Programming language

As it is highly impossible to control all external parameters, it becomes difficult to estimate the system runtime of multiple algorithms for performance comparison (ideal scenario analysis). Asymptotic analysis is one such technique which can be used to assess an algorithm’s efficiency without actually coding and compiling the entire program. It is a functional form representing a pseudo system runtime based on the size of input data and the number of operations. It is based on the principle that the growth rate of input data is directly proportional to the system runtime. For example, in the case of insertion sorting, the size represents the length of the input vector, and the number of operations represents the complexity of sort operations. This analysis can only be used to gauge the consideration of implementing the algorithm rather than evaluating the merits and demerits of algorithms in comparison. The following table represents the most widely used growth rate functional forms.

algorithm-analysis-img-0

The most widely used functional forms of growth rates are based on the size of input data, which are used to analyze the performance of algorithms. These are also considered as pseudo-functional forms to evaluate an algorithm’s system runtime.

Memory management in R

Memory management primarily deals with the administration of available memory and prediction of additional memory required for smoother and faster execution of functions. The current section will cover the concept of memory allocation, which deals with storage of an object in the R environment.

Memory Allocation: R allocates memory differently to different objects in its environment. Memory allocation can be determined using the object_size function from the pryr package. The pryr package can be installed from the CRAN repository using install.packages(“pryr”). The object_size function in pryr is similar to the object.size function in the base package. However, it is more accurate as it:

Takes into account the environment size associated with the current object
Takes into account the shared elements within a given object under consideration

The following are examples of using the object_size function in R to evaluate memory allocation:

> object_size(1)    ## Memory allocated for a single numeric vector
48 B
> object_size(“R”)  ## Memory allocated for a single character vector
96 B
> object_size(TRUE) ## Memory allocated for a single logical vector
48 B
> object_size(1i)   ## Memory allocated for a single complex vector
56 B

The storage required by an object can be attributed to the following parameters:

Metadata: Metadata of an object is defined by the type of object used such as character, integers, logical, and so on. The type can also usually be helpful during debugging.
Node pointer: The node pointer maintains the link between the different nodes, and depending on the number of node pointers used, memory requirement changes. For example, a doubly linked list requires more memory than a singly linked list, as it uses two node pointers to connect to the previous and next nodes.
Attribute pointer: Pointer to keep reference for attributes; this helps to reduce memory allocation, especially the data stored by a variable.
Memory allocation: Length of the vector representing the currently used space
Size: Size represents the true allocated space length of the vector.
Memory Padding: Padding applied to a component, for example, each element begins after an 8-byte boundary.

The Object_size() command is also used to see the inherent memory allocation as shown in the following table:

algorithm-analysis-img-1

The preceding table shows inherent memory allocated by each data structure/type.

Let’s simulate scenarios with varying lengths of a vector with different data types such as integer, character, Boolean, and complex. The simulation is performed by simulating a vector length from 0 to 60 as follows:

> vec_length <- 0:60
> num_vec_size <- sapply(vec_length, function(x) object_size(seq(x)))
> char_vec_size <- sapply(vec_length, function(x) object_size(rep(“a”,x)))
> log_vec_size <- sapply(vec_length, function(x) object_size(rep(TRUE,x)))
> comp_vec_size <- sapply(vec_length, function(x) object_size(rep(“2i”,x)))

Num_vec_size computes the memory requirement for each numeric vector from zero to 60 number of elements. These elements are integers increasing sequentially, as stated in the function. Similarly, incremental memory requirements are calculated for character (char_vec_size), logical (log_vec_size), and complex (comp_vec_size) vectors. The result obtained from the simulation can be plotted using code.

> par(mfrow=c(2,2))
> plot(num_vec_size ~ vec_length, xlab = “Numeric seq vector”, ylab = “Memory allocated (in bytes)”, 
+      type = “n”)
> abline(h = (c(0,8,16,32,48,64,128)+40), col = “grey”)
> lines(num_vec_size, type = “S”)

The result obtained on running the preceding code is shown in following figure. From the following figure, it can be observed that memory allocated to a vector is a function of its length and the object type used. However, the relationship does not seem to be linear—rather, it seems to increase in step. This is due to the fact that for better and consistent performance, R initially assigns big blocks of memory from RAM and handles them internally. These memory blocks are individually assigned to vectors based on the type and the number of elements within. Initially, memory blocks seem to be irregular towards a particular level (128 bytes for numeric/logical vector, and 176 bytes for character/complex vectors), and later become stable with small increments of 8 bytes as can be seen in the plots:

algorithm-analysis-img-2 Memory allocation based on length of vector

Due to initial memory allocation differences, numeric and logical vectors show similar memory allocation patterns, and complex vectors behave similar to the character vectors. Memory management helps to efficiently run an algorithm. However, before the execution of any program, we should evaluate it based on its runtime. In the next sub-section, we will discuss the basic concepts involved in obtaining the runtime of any function, and its comparison with similar functions.

System runtime in R

System runtime is very essential for benchmarking different algorithms. The process helps us in comparing different options, and pick the best algorithm.

The CRAN package microbenchmark is used to evaluate the runtime of any expression/function/code at an accuracy of a sub-millisecond. It is an accurate replacement to the system.time() function. Also, all the evaluations are performed in C code to minimize any overhead. The following methods are used to measure the time elapsed:

The QueryPerformanceCounter interface on Windows OS
The clock_gettime API on Linux OS
The mach_absolute_time function on MAC OS
The gethrtime function on Solaris OS

In our current example, we shall be using the mtcars data, which is in the package datasets. This data is obtained from 1974 Motor Trend US magazine, which comprises of fuel consumption comparison along with 10 automobile designs and the performance of 32 automobiles (1973-74 models).

Now, we would like to perform an operation in which a specific numeric attribute (mpg means miles per gallon) needs to be averaged to the corresponding unique values in an integer attribute (carb means no of carburetors). This can be performed using multiple ways such as aggregate, group_by, by, split, ddply(plyr), tapply, data.table, dplyr, sqldf, dplyr and so on.

In our current scenario, we have used the following four ways:

aggregate function
```
aggregate(mpg~carb,data=mtcars,mean)
```

ddply from plyr package

ddply( mtcars, .(carb),function(x) mean(x$mpg))

data.table format
```
mtcars_tb[,mean(mpg),by=carb]
```

group_by function

summarize(group_by(mtcars, carb), mean(mpg))

Then, microbenchmark is used to determine the performance of each of the four ways mentioned in the preceding list. Here, we will be evaluating each expression 100 times.

> library(microbenchmark)
> MB_res <- microbenchmark(
+   Aggregate_func=aggregate(mpg~carb,data=mtcars,mean),
+   Ddply_func=ddply( mtcars, .(carb),function(x) mean(x$mpg)),
+   Data_table_func = mtcars_tb[,mean(mpg),by=carb],
+   Group_by_func = summarize(group_by(mtcars, carb), mean(mpg)),
+   times=1000
+ )

The output table is as follows:

> MB_res
Unit: microseconds
            expr      min        lq      mean   median        uq      max neval
  Aggregate_func  851.489  913.8015 1001.9007  944.775 1000.4905 6094.209  1000
      Ddply_func 1370.519 1475.1685 1579.6123 1517.322 1575.7855 6598.578  1000
 Data_table_func  493.739  552.7540  610.7791  577.495  621.6635 3125.179  1000
   Group_by_func  932.129 1008.5540 1095.4193 1033.113 1076.1825 4279.435  1000

The output plot is as follows:

> library(ggplot2)
> autoplot(MB_res)

algorithm-analysis-img-3 Distribution of time (microseconds) for 1000 iterations in each type of aggregate operation

Among these four expressions and for the given dataset, data.table has performed effectively in the least possible time as compared to the others. However, expressions need to be tested under scenarios with a high number of observations, high number of attributes, and both prior to finalizing the best operator.

Best, worst, and average Cases

Based on the performance in terms of system runtime, a code can be classified under best, worst or average category for a particular algorithm. Let’s consider a sorting algorithm to understand in detail. A sorting algorithm is used to arrange a numeric vector in an ascending order, wherein the output vector should have the smallest number as its first element and largest number as its last element with intermediate elements in subsequent increasing order. In insertion sorting algorithm, the elements within a vector are arranged based on moving positions. In our scenario, we will be inserting each element at a time into a previously sorted vector, with a smaller set of elements moving towards the end.

Now, let’s define best, worst and average-case scenarios for an insertion sorting algorithm.

Best Case: A best case is one which requires the least running time. For example: a vector with all elements arranged in increasing order requires least amount of time for sorting.
Worst Case: A worst case is one which requires the maximum possible runtime to complete sorting a vector. For example: a vector with all the elements sorted in decreasing order requires most amount of time for sorting.
Average Case: An average case is one which requires intermediate time to complete sorting a vector. For example: a vector with half elements sorted in increasing order and the remaining in decreasing order. An average case is assessed using multiple vectors of differently arranged elements.

Generally, the best-case scenarios are not considered to benchmark an algorithm, since they evaluate an algorithm most optimistically. However, if the probability of occurrence of best case is high, then algorithms can be compared using the best-case scenarios. Similar to best case, worst-case scenarios evaluate the algorithm most pessimistically. It is only used to benchmark algorithms which are used in real-time applications, such as railway network controls, air traffic controls, and the like. Sometimes, when we are not aware of input data distributions, it is safe to assess the performance of the algorithm based on the worst-case scenario.

Most of the times, average-case scenario is used as a representative measure of an algorithm’s performance; however, this is valid only when we are aware of the input data distribution. Average-case scenarios may not evaluate the algorithm properly if the distribution of input data is skewed. In the case of sorting, if most of the input vectors are arranged in descending order, the average-case scenario may not be the best form of evaluating the algorithm.

In a nutshell, real-time application scenarios, along with input data distribution, are major criterions to analyze the algorithms based on best, worst, and average cases.

Summary

This article summarizes the basic concepts and nuances of evaluating algorithms in R. We covered the conceptual theory of memory management and system runtime in R. We discussed the best, worst, and average-case scenarios to evaluate the performance of algorithms.