So, let's import NumPy and play with it a bit. For that, we need to start the Python interactive shell:
>>> import numpy
>>> numpy.version.full_version
1.13.3
As we do not want to pollute our namespace, we certainly should not use the following code:
>>> from numpy import *
If we do this, then, for instance, numpy.array
will potentially shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
With the previous code snippet, we created an array in the same way that we would create a list in Python. However, the NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. That's no surprise so far.
We can now transform this array into a two-dimensional matrix:
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
[2, 3],
[4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
It is important to realize just how much the NumPy package is optimized. For example, doing the following avoids copies wherever possible:
>>> b[1][0] = 77
>>> b
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> a
array([ 0, 1, 77, 3, 4, 5])
In this case, we have modified the value 2
to 77
in b
, and we immediately see the same change reflected in a
, as well. Keep in mind that whenever you need a true copy, you can always perform the following:
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> c[0][0] = -99
>>> a
array([ 0, 1, 77, 3, 4, 5])
>>> c
array([[-99, 1],
[ 77, 3],
[ 4, 5]])
Note that, here, c
and a
are totally independent copies.
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements. For example, multiplying a NumPy array will result in an array of the same size (including all of its elements) being multiplied:
>>> d = np.array([1,2,3,4,5])
>>> d*2
array([ 2, 4, 6, 8, 10])
This is also true for other operations:
>>> d**2
array([ 1, 4, 9, 16, 25])
Contrast that with ordinary Python lists:
>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays, we sacrifice the agility Python lists offer. Simple operations, such as adding or removing elements, are a bit complex for NumPy arrays. Luckily, we have both at our hands, and we will use the right one for the task at hand.
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows you to use arrays themselves as indices by performing the following:
>>> a[np.array([2,3,4])]
array([77, 3, 4])
Coupled with the fact that conditions are also propagated to individual elements, we gain a very convenient way to access our data, using the following:
>>> a>4
array([False, False, True, False, False, True], dtype=bool)
>>> a[a>4]
array([77, 5])
By performing the following command, we can trim outliers:
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is a special clip
function for it, clipping the values at both ends of an interval with one function call:
>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
Handling nonexistent values
The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. Most likely, this will contain invalid values that we will mark as not being real numbers, using numpy.NAN
, as shown in the following code:
>>> # let's pretend we have read this from a text file:
>>> c = np.array([1, 2, np.NAN, 3, 4])
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5
Let's compare the runtime behavior of NumPy to normal Python lists. In the following code, we will calculate the sum of all squared numbers from 1 to 1,000 and see how much time it will take. We will perform it 10,000 times and report the total time so that our measurement is accurate enough:
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in range(1000))',
number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)
Executing this will output:
Normal Python: 1.571072 sec
Naive NumPy: 1.621358 sec
Good NumPy: 0.035686 sec
We can make two interesting observations from this code. Firstly, just using NumPy as data storage (naive NumPy) takes longer, which is surprising since it seems it should be much faster, as it is written as a C extension. One reason for this increased processing time is that the access of individual elements from Python itself is rather costly. Only when we are able to apply algorithms inside the optimized extension code do we get speed improvements. The other observation is quite a tremendous one: using the dot()
function of NumPy, which does exactly the same, allows us to be more than 44 times faster. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
However, this speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one data type:
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int32')
If we try to use elements of different types, such as the ones shown in the following code, NumPy will do its best to correct them to be the most reasonable common data type:
>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='<U11')
>>> np.array([1, "stringy", {1, 2, 3}])
array([1, 'stringy', {1, 2, 3}], dtype=object)