Numpy exercises

Schedule (tentative):

9:30: people sit down, introductions, make groups
10:00: section 1: numpy intro, 1D arrays
11:15: section 2: the ndarray
12:30: lunch
13:30: section 3: I/O
14:15: section 4: Graphical output
15:00: section 5: interfacing with C
16:00: section 6: threading/multiprocessing
16:30: section 7: long exercices
17:45: end of day

General documentation:

Excellent numpy intro
For Matlab users: Numpy for Matlab users
help(np.toto) to get help on function toto
Google + stackoverflow (or similar)
in ipython, np.toto? to get help, np.toto?? shows the source of the function

Debugging: pdb.set_trace() in a script stops execution and shows debugger.

Command completion and history in standard python: Save pythonstartup.py to somewhere and point the PYTHONSTARTUP environment variable to it.

Presentation slides: numpy_training.pdf

Exercice data: exo_data.zip.

Transcript: transcript.txt.

Python/Numpy wrt. Matlab, C++

Level 1: immediate applications

range vs. arange

Use range and arange to allocate python and numpy arrays of 10 million elements. Compare how long it takes to sum up the values and the memory usage of the python processes.

Level 2: requires some searching

Primes

Write a sieve of Eratosthenes to find the prime numbers between 1 and 1 million.

Vanishing series

Consider the series

u[n+1] = u[n] / 2 if u[n] % 2 == 0 else 3 * u[n] + 1

It seems that this series always reaches 1 after a number of iterations. Find u[0] between 1 and 10000 such that this number is as high as possible.

Random numbers with a given distribution

Draw 1 million numbers in range(10) with the distribution

np.array([0.0052, 0.3324, 0.0002, 0.0494, 0.0728, 0.1860, 0.0004, 0.2601, 0.0832, 0.0103])

verify with np.bincount that the obtained numbers follow the distribution.

Level 3: no solution

Redo the "vanishing series" exercice with u[0] becoming arbitrarily large. Encode large numbers as an array of uint64's.

Matrix operations

Broadcasting rules: Broadcasting

Level 1

Regular polygon

Compute the 2D coordinates of the vertices of a regular polygon of n sides.

Centering vectors

Subtract the mean from the set of line vectors in matrix mat.

n = 50
d = 3
# normal-distributed, size n * d 
mat = np.random.randn(n, d)

Swapping lines and columns

Write a function

def swap2(m, i, j):  
   ...

that swaps line i with line j, and column i with column j of the square matrix m. Test it on

mat = np.arange(100).reshape(10, 10)

Level 2

Transpose

Transpose a huge square matrix in-place. Generate the matrix with

n = 10000
mat = np.arange(n * n).reshape(n, n)

Breaking codes

You are a hacker, and you have stolen a set of encrypted keys, such that key = secret1 * random_data. You have to factor the keys in order to retrieve the secret data. For this you have a collection of prime numbers, and you have to check if each key is a multiple of each prime number.

primes = np.array([1019485547,   64557029,  792566693,   34504159,  978908729,
                   102425021,  691631531, 1023131983,  828393641,  475191313])

keys = np.array([ 292335155802921311,  427471988313857183,  695054650526904211,
                  788771393194700959,  136768386653471549,  892996927643234027,
                  471122708082272941,   49573215150725351,  375523477835944969,
                  717623420602527097,  736489563786313291,  525359469860836031,
                  64209891704585449,  899547115585616039,  379105674211997129,
                  1053060284084815579,  110490479689392181,  411072609706303627,
                  241802712248472367,  350832229336362539,  865041043785275113,
                  50937552645756583,  136500121478323819, 1124773520037276449,
                  437286882517595333,  154520509144700371,  785708189696199079,
                  1006616443473107413,  118607898132922181,  811826995499862379])

Write code to find which key can be factorized by any of the primes.

Distance computations

The code exo_data/matrix/distances.py generates two matrices X and Y of size n*d and m*d, computes the pairwise Euclidean distances between rows of X and Y (size n*m), and finds the nearest neighbor of each row of X in Y.

Rewrite it without any loop, using a matrix multiplication.

Input/output

Documentation: IO routines

Level 1

Load fvecs

The fvecs file format contains a matrix. It is a series of concatenated binary blocks.

field	field type	description
desdim	int	descriptor dimension, a 32-bit int
components	float*desdim	the vector components, in 32-bit floats

Write code to load the file exo_data/io/ff_k16_00.fvecs.

Level 2

Fast load fvecs

Load a fvecs file without any loop.

OFF file format

The textual OFF file format describes polyhedra with triangular facets (well, a restricted version). It contains:

OFF
N M 0
x0 y0 z0 
x1 y1 z1
...
x(N-1) y(N-1) z(N-1)
3 v(0,0) v(0,1) v(0,2)
3 v(1,0) v(1,1) v(1,2)
...
3 v(M-1,0) v(M-1,1) v(M-1,2)

N = nb vertices, M = nb facets, x(i) = x coord of vetex i, v(i, j) = id of vertex j of facet i.

Write code to load the vertices and facets of exo_data/io/dragon_vrip_res3.off, without any loop. Are all vertices used?

.mat file format

The Matlab .mat file exo_data/io/000023.mat contains a set of rectangular matrices. Find them and convert them to a list of numpy float32 arrays.

Load the file with x = scipy.io.loadmat("000023.mat")

Parse output log

This is the output of a verbose computational geometry code: exo_data/io/csg_log.gz. It is line-oriented.

A path with name <1212122112> represents a path from root to leaf in a binary tree: from root, go to child 1, then from root's child 1 go to child 2, etc.

The program explores the tree and when a leaf node is reached, it says (a):

  Exploring node <21112212212112221121222122222121> ...

Followed by lines (b)

    Intersection edge of polygons F34 F2213 ...

Then (c)

     Candidate triple point: intersection of F1048 inter F2210 with F2601, ...

Count the number of (a), (b), (c) for each leaf depth.

Graphical output with Matplotlib

Documentation: Matplotlib gallery.

Level 1

Plot distributions

An algorithm produces a prediction score that attempts to predict "positive" or "negative". In order to show how well the algorithm works, produce a plot that shows the distribution of the score for ground-truth positives and negatives.

The data for a series of experiments is here: exo_data/graph/data_to_plot.dat it can be loaded with np.loadtxt("data_to_plot.dat"). Each matrix line is

[score label]

where label is +1 or -1 for ground-truth positives and negatives.

Level 2

View image matches

An algorithm provides a list of matching points between the two images: exo_data/graph/image_matches.dat. Each line of the matrix

[x1, y1, x2, y2, score]

describes a point match between (x1, y1) on the first image and (x2, y2) on the second one, with the confidence value score.

Draw these two images one above another, and a line between each pair of matches, with a code for the strength of the matches (color or thickness).

Level 3

2D plot

Reproduce the following plot

using the datapoints exo_data/graph/sift_pqsym.py. The code length is m * b.

Neural net layers

Visualize the convolution weights of the neural network exo_data/graph/neural_net_state_1747.6.pickle using subplots.

Interfacing with C

Docs:

Level 1

Sqrt of an array

Here is an example of code that computes the square root of an array exo_data/c/mymath.c. Compilation instructions are in the source code.

Run it and compare the speed with the numpy sqrt (on an array of size 10 million).

Add a pow function that can be called as

(apow, n_negative) = mymath.pow(a, 0.2)

where apow is array a raised at power 0.2 and n_negative is the number of negative values in a (that will cause nans in the output array)

Add a function to the method_list table. On input, add a double to the format string, on output build the result with Py_BuildValue.

solution: exo_data/c/mymath_solution.c exo_data/c/test_mymath.py

Level 2

Mandelbrot (cython)

The Mandelbrot fractal is defined by the complex series

u[n+1] = u[n] ** 2 + u[0]

The outside of the fractal is the set of u[0] such that abs(u) diverges. It can be shown that if abs(u[n]) becomes larger than 2 at some point, then it diverges.

This is code to compute the Mandelbrot fractal and display it: exo_data/c/mandel_slow.py exo_data/c/test_mandel.py

Copy mandel_slow.py to mandel.pyx, generate C with

cython mandel.pyx

Compile it by hand, then using the exo_data/c/setup_mandel.py setup file, using python setup_mandel.py build_ext --inplace.

Image library (SWIG)

The following C library manipulates images: exo_data/c/pnm_image.h exo_data/c/pnm_image.c.

The SWIG file exo_data/c/pnm_image.i gives access to these functions. Generate C with

swig -python pnm_image.i

Compile it by hand, then using the exo_data/c/setup_pnm_image.py setup file, using python setup_pnm_image.py build_ext --inplace.

The code exo_data/c/test_pnm_image.py uses the generates pnm_image to load the pnm image exo_data/c/ar.ppm and extracts the red channel from it.

Add a C function

PyObject *image_to_numpy_array(const image_t *im)

to pnm_image.i that converts the image_t to a PyArrayObject. Use it to display the image.

The function must be declared twice: code goes in the %{ %} block, and a header must be in the main function. Also, the block

%init %{
    import_array();
%}

should appear somewhere in the .i file.

solution: exo_data/c/pnm_image_solution.i

Level 3

Parallel Mandelbrot

Parallelize the Mandelbrot fractal computation in the previous section.

Writing parallel applications

Documentation:

Level 1

The knapsack problem

In the knapsack problem, we are given a bag and a set of n items. Item i has weight weights[i] and value values[i]. We want to find a subset of the items that we can put in the bag so that the total weight is below maxweight and the total value as large as possible. This problem is NP hard.

The code exo_data/parallel/knapsack.py finds an approximate solution by drawing a set of items in a random order and checking how many of them fit in the knapsack.

Parallelize the solution using multiprocessing, then multithreading.

The code exo_data/parallel/c_knapsack.c is a C implementation of the fill_knapsack function. Compare its speed against the python version. Add code to release the global interpreter lock when appropriate and compare again.

Long exercices

OpenGL

This exercise requires to install PyOpenGL. Instructions:

curl -O https://pypi.python.org/packages/source/P/PyOpenGL/PyOpenGL-3.1.0b3.tar.gz
tar xvzf PyOpenGL-3.1.0b3.tar.gz 
cd PyOpenGL-3.1.0b3
mkdir -p /Users/matthijs/Desktop/numpy_training/packs/lib/python2.7/site-packages/
export PYTHONPATH=/Users/matthijs/Desktop/numpy_training/packs/lib/python2.7/site-packages/
python setup.py install --prefix=/Users/matthijs/Desktop/numpy_training/packs

Documentation:

The code exo_data/longer/gl/view_torus.py displays a polyhedron using glBegin(GL_TRIANGLES). Rewrite it using a vertex buffer object.

solution: exo_data/longer/gl/view_torus_vbo.py

Symbolic computations

You need the C code to compute the inverse of a 4x4 matrix. Use symbolic manipulations in sympy to produce this code.

Documentation: SymPy documentation.

solution: exo_data/longer/sym/invert_4x4.py

Sparse matrices

This zip exo_data/longer/sparse/proceedings.zip contains the text of 2800 articles. The code exo_data/longer/sparse/compute_BOW.py computes a bag-of-words on each document, producing a sparse matrix of size ndocument * ndictionary.

Compute significant word groups using a sparse SVD on the matrix.

solution: exo_data/longer/sparse/term_vectors.py

Documentation: Sparse matrices.