Performance
To further understand how increasing the number of threads and processes affects download time, here is a table of results for crawling 500 web pages:
Script | Number of threads | Number of processes | Time | Speedup vs. sequential | Errors seen? |
Sequential | 1 | 1 | 1349.798s | 1 | N |
Threaded | 5 | 1 | 361.504s | 3.73 | N |
Threaded | 10 | 1 | 275.492s | 4.9 | N |
Threaded | 20 | 1 | 298.168s | 4.53 | Y |
Processes | 2 | 2 | 726.899s | 1.86 | N |
Processes | 2 | 4 | 559.93s | 2.41 | N |
Processes | 2 | 8 | 451.772s | 2.99 | Y |
Processes | 5 | 2 | 383.438s | 3.52 | N |
Processes | 5 | 4 | 156.389s | 8.63 | Y |
Processes | 5 | 8 | 296.610s | 4.55 | Y |
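The fifth column of the table can be recomputed from the fourth: each value is the sequential time divided by the time for that configuration. The following is a minimal sketch of that arithmetic, not part of the crawler itself. The `timings` dictionary and the `speedup` helper are illustrative names introduced here, and the times are copied from the table above.

```python
# Times in seconds copied from the table above,
# keyed by (script, number of threads, number of processes).
timings = {
    ("Sequential", 1, 1): 1349.798,
    ("Threaded", 5, 1): 361.504,
    ("Threaded", 10, 1): 275.492,
    ("Threaded", 20, 1): 298.168,
    ("Processes", 2, 2): 726.899,
    ("Processes", 2, 4): 559.93,
    ("Processes", 2, 8): 451.772,
    ("Processes", 5, 2): 383.438,
    ("Processes", 5, 4): 156.389,
    ("Processes", 5, 8): 296.610,
}


def speedup(baseline_seconds, seconds):
    """Return how many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


# The sequential crawl is the base case for every comparison.
baseline = timings[("Sequential", 1, 1)]
speedups = {
    config: round(speedup(baseline, seconds), 2)
    for config, seconds in timings.items()
}

for (script, threads, processes), factor in speedups.items():
    print(f"{script:<10} threads={threads:<2} processes={processes}  {factor:.2f}x")

# Configuration with the largest speedup in this particular set of runs.
fastest = max(speedups, key=speedups.get)
```

Computing the ratio this way makes it easy to compare configurations directly, for instance to check that doubling the thread count from 5 to 10 does not double the speedup.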
The fifth column shows the speedup relative to the base case of sequential downloading, that is, the sequential time divided by the time for that configuration. We can see that the increase in performance is not linearly proportional to the number of threads and processes but shows diminishing returns, appearing roughly logarithmic until adding more threads actually decreases performance. For example, one process with five threads gives about 3.7X better performance than sequential, but 10 threads gives only about 4.9X, and 20 threads drops back to about 4.5X and also produces errors. Depending on...