Analyzing Patient Data - Exercises

Slicing Strings

A section of an array is called a slice. We can take slices of character strings as well:

element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])
first three characters: oxy
last three characters: gen

What is the value of element[:4]? What about element[4:]? Or element[:]?

Solution

oxyg
en
oxygen

What is element[-1]? What is element[-2]?

Solution

n
e

Given those answers, explain what element[1:-1] does.

Solution

Creates a substring from index 1 up to (not including) the final index, effectively removing the first and last letters from ‘oxygen’

How can we rewrite the slice for getting the last three characters of element, so that it works even if we assign a different string to element? Test your solution with the following strings: carpentry, clone, hi.

Solution

element = 'oxygen'
print('last three characters:', element[-3:])
element = 'carpentry'
print('last three characters:', element[-3:])
element = 'clone'
print('last three characters:', element[-3:])
element = 'hi'
print('last three characters:', element[-3:])
last three characters: gen
last three characters: try
last three characters: one
last three characters: hi

Thin Slices

The expression element[3:3] produces an empty string, i.e., a string that contains no characters. If data holds our array of patient data, what does data[3:3, 4:4] produce? What about data[3:3, :]?

Solution

array([], shape=(0, 0), dtype=float64)
array([], shape=(0, 40), dtype=float64)

Stacking Arrays

Arrays can be concatenated and stacked on top of one another, using NumPy’s vstack and hstack functions for vertical and horizontal stacking, respectively.

import numpy

A = numpy.array([[1,2,3], [4,5,6], [7, 8, 9]])
print('A = ')
print(A)

B = numpy.hstack([A, A])
print('B = ')
print(B)

C = numpy.vstack([A, A])
print('C = ')
print(C)
A =
[[1 2 3]
 [4 5 6]
 [7 8 9]]
B =
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [7 8 9 7 8 9]]
C =
[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 2 3]
 [4 5 6]
 [7 8 9]]

Write some additional code that slices the first and last columns of A, and stacks them into a 3x2 array. Make sure to print the results to verify your solution.

Solution

A ‘gotcha’ with array indexing is that singleton dimensions are dropped by default. That means A[:, 0] is a one dimensional array, which won’t stack as desired. To preserve singleton dimensions, the index itself can be a slice or array. For example, A[:, :1] returns a two dimensional array with one singleton dimension (i.e. a column vector).

D = numpy.hstack((A[:, :1], A[:, -1:]))
print('D = ')
print(D)
D =
[[1 3]
 [4 6]
 [7 9]]

Solution

An alternative way to achieve the same result is to use Numpy’s delete function to remove the second column of A.

D = numpy.delete(A, 1, 1)
print('D = ')
print(D)
D =
[[1 3]
 [4 6]
 [7 9]]

Change In Inflammation

The patient data is longitudinal in the sense that each row represents a series of observations relating to one individual. This means that the change in inflammation over time is a meaningful concept. Let’s find out how to calculate changes in the data contained in an array with NumPy.

The numpy.diff() function takes an array and returns the differences between two successive values. Let’s use it to examine the changes each day across the first week of patient 3 from our inflammation dataset.

patient3_week1 = data[3, :7]
print(patient3_week1)
 [0. 0. 2. 0. 4. 2. 2.]

Calling numpy.diff(patient3_week1) would do the following calculations

[ 0 - 0, 2 - 0, 0 - 2, 4 - 0, 2 - 4, 2 - 2 ]

and return the 6 difference values in a new array.

numpy.diff(patient3_week1)
array([ 0.,  2., -2.,  4., -2.,  0.])

Note that the array of differences is shorter by one element (length 6).

When calling numpy.diff with a multi-dimensional array, an axis argument may be passed to the function to specify which axis to process. When applying numpy.diff to our 2D inflammation array data, which axis would we specify?

Solution

Since the row axis (0) is patients, it does not make sense to get the difference between two arbitrary patients. The column axis (1) is in days, so the difference is the change in inflammation – a meaningful concept.

numpy.diff(data, axis=1)

If the shape of an individual data file is (60, 40) (60 rows and 40 columns), what would the shape of the array be after you run the diff() function and why?

Solution

The shape will be (60, 39) because there is one fewer difference between columns than there are columns in the data.

How would you find the largest change in inflammation for each patient? Does it matter if the change in inflammation is an increase or a decrease?

Solution

By using the numpy.max() function after you apply the numpy.diff() function, you will get the largest difference between days.

numpy.max(numpy.diff(data, axis=1), axis=1)
array([  7.,  12.,  11.,  10.,  11.,  13.,  10.,   8.,  10.,  10.,   7.,
         7.,  13.,   7.,  10.,  10.,   8.,  10.,   9.,  10.,  13.,   7.,
        12.,   9.,  12.,  11.,  10.,  10.,   7.,  10.,  11.,  10.,   8.,
        11.,  12.,  10.,   9.,  10.,  13.,  10.,   7.,   7.,  10.,  13.,
        12.,   8.,   8.,  10.,  10.,   9.,   8.,  13.,  10.,   7.,  10.,
         8.,  12.,  10.,   7.,  12.])

If inflammation values decrease along an axis, then the difference from one element to the next will be negative. If you are interested in the magnitude of the change and not the direction, the numpy.absolute() function will provide that.

Notice the difference if you get the largest absolute difference between readings.

numpy.max(numpy.absolute(numpy.diff(data, axis=1)), axis=1)
array([ 12.,  14.,  11.,  13.,  11.,  13.,  10.,  12.,  10.,  10.,  10.,
        12.,  13.,  10.,  11.,  10.,  12.,  13.,   9.,  10.,  13.,   9.,
        12.,   9.,  12.,  11.,  10.,  13.,   9.,  13.,  11.,  11.,   8.,
        11.,  12.,  13.,   9.,  10.,  13.,  11.,  11.,  13.,  11.,  13.,
        13.,  10.,   9.,  10.,  10.,   9.,   9.,  13.,  10.,   9.,  10.,
        11.,  13.,  10.,  10.,  12.])