In this article by Prakash and Achyutuni Sri Krishna Rao, authors of the book R Data Structures and Algorithms, we will discuss how an algorithm can be defined as a set of step-by-step instructions that govern the outline of a program to be executed using computational resources. The execution can be in any programming language, such as R, Python, or Java. Data is an intricate component of any program, and depending on how it is organized (its data structure), execution time can vary drastically. That is why data structure is such a critical component of any good algorithm implementation.
A sorting algorithm, which acts as a connector between the user-defined input and the user-desired output, can be approached in multiple ways:
Each of these options can, in turn, handle a particular set of instances more effectively. This leads to the concept of a "good algorithm". An algorithm can be termed "good" if it possesses attributes such as the following, among many others:
Generality in accepting inputs

This book will concentrate primarily on running time (time complexity), partly on memory utilization, and on their relationship during program execution.
A problem can be approached using multiple algorithms, and each algorithm can be assessed based on certain parameters such as:
However, these parameters are generally affected by external environmental factors such as:
As it is practically impossible to control all external parameters, it becomes difficult to estimate the system runtime of multiple algorithms for performance comparison (ideal scenario analysis). Asymptotic analysis is one technique that can be used to assess an algorithm's efficiency without actually coding and compiling the entire program. It is a functional form representing a pseudo system runtime based on the size of the input data and the number of operations, and it rests on the principle that the system runtime grows with the size of the input data. For example, in the case of insertion sort, the size represents the length of the input vector, and the number of operations represents the complexity of the sort operations. This analysis can only be used to gauge the feasibility of implementing an algorithm, rather than to weigh the merits and demerits of algorithms in comparison. The following table presents the most widely used growth-rate functional forms.
The most widely used functional forms of growth rates are based on the size of the input data and are used to analyze the performance of algorithms. They can also be considered pseudo-functional forms for evaluating an algorithm's system runtime.
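As a quick illustration (a minimal base-R sketch, not from the book), these growth-rate forms can be tabulated for a few input sizes to see how quickly each one diverges:

```r
## Compare common growth-rate functional forms for a few input sizes
sizes <- c(10, 100, 1000, 10000)

growth <- data.frame(
  n       = sizes,
  log_n   = log2(sizes),          # logarithmic, e.g. binary search
  n_log_n = sizes * log2(sizes),  # linearithmic, e.g. merge sort
  n_sq    = sizes^2               # quadratic, e.g. insertion sort (worst case)
)

print(growth)
```

Even at n = 10,000, the quadratic column dwarfs the others, which is why the growth rate, rather than raw timings, drives asymptotic comparisons.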
Memory management primarily deals with the administration of available memory and prediction of additional memory required for smoother and faster execution of functions. The current section will cover the concept of memory allocation, which deals with storage of an object in the R environment.
Memory Allocation: R allocates memory differently to different objects in its environment. Memory allocation can be determined using the object_size function from the pryr package. The pryr package can be installed from the CRAN repository using install.packages("pryr"). The object_size function in pryr is similar to the object.size function in the base package. However, it is more accurate as it:
The following are examples of using the object_size function in R to evaluate memory allocation:
> object_size(1) ## Memory allocated for a single numeric vector
48 B
> object_size("R") ## Memory allocated for a single character vector
96 B
> object_size(TRUE) ## Memory allocated for a single logical vector
48 B
> object_size(1i) ## Memory allocated for a single complex vector
56 B
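If pryr is not available, the base utils function object.size() gives comparable (though slightly less accurate) figures. A minimal sketch, noting that the exact byte counts vary with the R version and platform, so no specific values are shown:

```r
## Inspect memory allocated to single-element vectors of different types
## using base R's object.size(); exact values differ from pryr::object_size
for (v in list(1, "R", TRUE, 1i)) {
  cat(class(v), ":", format(object.size(v)), "\n")
}
```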
The storage required by an object can be attributed to the following parameters:
The object_size() function is also used to see the inherent memory allocation, as shown in the following table:
The preceding table shows inherent memory allocated by each data structure/type.
Let's simulate scenarios with varying lengths of a vector for different data types, such as integer, character, logical, and complex. The simulation is performed by varying the vector length from 0 to 60, as follows:
> vec_length <- 0:60
> num_vec_size <- sapply(vec_length, function(x) object_size(seq(x)))
> char_vec_size <- sapply(vec_length, function(x) object_size(rep("a",x)))
> log_vec_size <- sapply(vec_length, function(x) object_size(rep(TRUE,x)))
> comp_vec_size <- sapply(vec_length, function(x) object_size(rep(2i,x)))
The num_vec_size step computes the memory requirement for each numeric vector of zero to 60 elements. These elements are integers increasing sequentially, as stated in the function. Similarly, incremental memory requirements are calculated for character (char_vec_size), logical (log_vec_size), and complex (comp_vec_size) vectors. The results obtained from the simulation can be plotted using the following code.
> par(mfrow=c(2,2))
> plot(num_vec_size ~ vec_length, xlab = "Numeric seq vector", ylab = "Memory allocated (in bytes)",
+ type = "n")
> abline(h = (c(0,8,16,32,48,64,128)+40), col = "grey")
> lines(num_vec_size, type = "S")
The result of running the preceding code is shown in the following figure. It can be observed that the memory allocated to a vector is a function of its length and object type. However, the relationship is not linear; rather, it increases in steps. This is because, for better and more consistent performance, R initially requests big blocks of memory from RAM and manages them internally. These memory blocks are individually assigned to vectors based on their type and the number of elements within. Initially, the memory blocks increase irregularly up to a particular level (128 bytes for numeric/logical vectors, and 176 bytes for character/complex vectors), and later stabilize into small increments of 8 bytes, as can be seen in the plots:
Memory allocation based on length of vector
Due to differences in initial memory allocation, numeric and logical vectors show similar memory allocation patterns, and complex vectors behave similarly to character vectors. Memory management helps an algorithm run efficiently. However, before executing any program, we should also evaluate it based on its runtime. In the next sub-section, we will discuss the basic concepts involved in obtaining the runtime of a function and comparing it with similar functions.
System runtime is essential for benchmarking different algorithms. The process helps us compare different options and pick the best algorithm.
The CRAN package microbenchmark is used to evaluate the runtime of any expression/function/code with sub-millisecond accuracy. It is an accurate replacement for the system.time() function. Also, all evaluations are performed in C code to minimize any overhead. The following methods are used to measure the elapsed time:
In our current example, we shall use the mtcars data from the datasets package. This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Now, we would like to perform an operation in which a specific numeric attribute (mpg, miles per gallon) is averaged over the corresponding unique values of an integer attribute (carb, number of carburetors). This can be performed in multiple ways, such as aggregate, group_by, by, split, ddply (plyr), tapply, data.table, dplyr, sqldf, and so on.
In our current scenario, we have used the following four ways:
aggregate(mpg~carb,data=mtcars,mean)
ddply( mtcars, .(carb),function(x) mean(x$mpg))
mtcars_tb[,mean(mpg),by=carb]
summarize(group_by(mtcars, carb), mean(mpg))
Then, microbenchmark is used to determine the performance of each of the four ways mentioned in the preceding list. Here, we will be evaluating each expression 1,000 times.
> library(microbenchmark)
> library(plyr)
> library(dplyr)
> library(data.table)
> mtcars_tb <- as.data.table(mtcars) ## data.table version of mtcars used below
> MB_res <- microbenchmark(
+ Aggregate_func = aggregate(mpg~carb, data=mtcars, mean),
+ Ddply_func = ddply(mtcars, .(carb), function(x) mean(x$mpg)),
+ Data_table_func = mtcars_tb[, mean(mpg), by=carb],
+ Group_by_func = summarize(group_by(mtcars, carb), mean(mpg)),
+ times = 1000
+ )
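If microbenchmark is unavailable, a coarser comparison is possible with base R's system.time(), wrapping each expression in replicate() so that the elapsed time becomes measurable. A rough sketch, not equivalent to microbenchmark's sub-millisecond timing:

```r
## Rough timing with base R: repeat each expression many times and
## read the "elapsed" component of the returned proc_time object
t_aggregate <- system.time(
  replicate(1000, aggregate(mpg ~ carb, data = mtcars, mean))
)["elapsed"]

t_tapply <- system.time(
  replicate(1000, tapply(mtcars$mpg, mtcars$carb, mean))
)["elapsed"]

cat("aggregate:", t_aggregate, "s; tapply:", t_tapply, "s\n")
```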
The output table is as follows:
> MB_res
Unit: microseconds
            expr      min        lq      mean   median        uq      max neval
  Aggregate_func  851.489  913.8015 1001.9007  944.775 1000.4905 6094.209  1000
      Ddply_func 1370.519 1475.1685 1579.6123 1517.322 1575.7855 6598.578  1000
 Data_table_func  493.739  552.7540  610.7791  577.495  621.6635 3125.179  1000
   Group_by_func  932.129 1008.5540 1095.4193 1033.113 1076.1825 4279.435  1000
The output plot is as follows:
> library(ggplot2)
> autoplot(MB_res)
Distribution of time (microseconds) for 1000 iterations in each type of aggregate operation
Among these four expressions, and for the given dataset, data.table performed the task in the least time compared to the others. However, the expressions should also be tested in scenarios with a high number of observations, a high number of attributes, and both combined, before finalizing the best operator.
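Whichever operator wins on speed, it is worth confirming that the alternatives agree on the answer. A base-R sanity check (not from the book) comparing aggregate against tapply, chosen because both run without any of the packages above:

```r
## Group means of mpg by carb, computed two ways in base R
agg_res <- aggregate(mpg ~ carb, data = mtcars, mean)
tap_res <- tapply(mtcars$mpg, mtcars$carb, mean)

## Align and compare: aggregate returns a data frame, tapply a named vector
stopifnot(all.equal(agg_res$mpg,
                    unname(tap_res[as.character(agg_res$carb)])))
```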
Based on its performance in terms of system runtime, code can be classified as best, worst, or average case for a particular algorithm. Let's consider a sorting algorithm to understand this in detail. A sorting algorithm arranges a numeric vector in ascending order, wherein the output vector has the smallest number as its first element and the largest number as its last element, with the intermediate elements in increasing order. In the insertion sort algorithm, the elements within a vector are arranged by moving positions: each element is inserted, one at a time, into an already-sorted portion of the vector, with larger elements shifting towards the end.
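The description above can be turned into a direct R implementation (a minimal sketch; the book's own implementation may differ):

```r
## Insertion sort: grow a sorted prefix, inserting one element at a time
insertion_sort <- function(vec) {
  if (length(vec) < 2) return(vec)
  for (i in 2:length(vec)) {
    key <- vec[i]
    j <- i - 1
    ## Shift elements larger than `key` one position to the right
    while (j >= 1 && vec[j] > key) {
      vec[j + 1] <- vec[j]
      j <- j - 1
    }
    vec[j + 1] <- key
  }
  vec
}

insertion_sort(c(5, 2, 4, 6, 1, 3))  # returns 1 2 3 4 5 6
```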
Now, let’s define best, worst and average-case scenarios for an insertion sorting algorithm.
Generally, best-case scenarios are not considered when benchmarking an algorithm, since they evaluate it most optimistically. However, if the probability of occurrence of the best case is high, then algorithms can be compared using best-case scenarios. Similar to the best case, worst-case scenarios evaluate the algorithm most pessimistically. They are mainly used to benchmark algorithms deployed in real-time applications, such as railway network control, air traffic control, and the like. Sometimes, when we are not aware of the input data distribution, it is safest to assess the performance of the algorithm based on the worst-case scenario.
Most of the time, the average-case scenario is used as a representative measure of an algorithm's performance; however, this is valid only when we are aware of the input data distribution. Average-case scenarios may not evaluate an algorithm properly if the distribution of the input data is skewed. In the case of sorting, if most of the input vectors are arranged in descending order, the average-case scenario may not be the best form of evaluation.
In a nutshell, real-time application scenarios, along with the input data distribution, are the major criteria for analyzing algorithms based on best, worst, and average cases.
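To make these cases concrete, an insertion sort can be instrumented to count element shifts: an already-sorted input (best case) needs none, while a reverse-sorted input of length n (worst case) needs n(n-1)/2. A self-contained sketch, with `count_shifts` as a hypothetical helper name:

```r
## Count how many element shifts insertion sort performs on a given input
count_shifts <- function(vec) {
  shifts <- 0
  if (length(vec) < 2) return(shifts)
  for (i in 2:length(vec)) {
    key <- vec[i]
    j <- i - 1
    while (j >= 1 && vec[j] > key) {
      vec[j + 1] <- vec[j]
      j <- j - 1
      shifts <- shifts + 1
    }
    vec[j + 1] <- key
  }
  shifts
}

count_shifts(1:10)  # best case: already sorted, 0 shifts
count_shifts(10:1)  # worst case: reverse sorted, 10*9/2 = 45 shifts
```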
This article summarizes the basic concepts and nuances of evaluating algorithms in R. We covered the conceptual theory of memory management and system runtime in R. We discussed the best, worst, and average-case scenarios to evaluate the performance of algorithms.