# Parallelizing MATLAB on Newton

Running MATLAB in interactive mode on Newton is very similar to running it on any other system. Once logged into Newton, just type "matlab" to start the client. If you have configured your SSH client for graphical applications, you should see the full MATLAB GUI. Otherwise, you will get a text-mode MATLAB client.

By default, MATLAB on Newton will only use 1 CPU on the node it is executing on, regardless of the environment in which it is running. Therefore, in order to parallelize MATLAB jobs, a worker pool must be created within the job and used with the following MATLAB constructs:

- gcp
- parfor
- spmd
- mapreduce
- mapreducer
- parfeval
- parfevalOnAll

*Warning: Do not use maxNumCompThreads(). It has been deprecated and will be ignored on Newton.*

## Creating a Job file

Parallelized MATLAB jobs can be treated the same as any other multithreaded job. Within the job file, use the `#$ -pe threads N` directive to reserve a number of CPUs, where N is the number of CPUs, and hence workers, that your job will use. For instance, the following shows a job file named test that executes the MATLAB script test.m with 8 CPUs available for MATLAB to use. *A word of caution: this job file merely tells the scheduler that 8 CPUs will be used. This information is not passed to MATLAB, so it is the responsibility of test.m to create a worker pool to use the CPUs.*

```
#$ -pe threads 8
#$ -N test

matlab -r test
```

## Creating a worker pool

The current worker pool can be retrieved via the gcp() function in MATLAB. As shown below, the default number of workers for a Newton MATLAB job is 1.

```matlab
>> gcp()
Starting parallel pool (parpool) using the 'local' profile ... connected to 1 workers.

ans =

 Pool with properties:

            Connected: true
           NumWorkers: 1
              Cluster: local
        AttachedFiles: {}
          IdleTimeout: 30 minute(s) (30 minutes remaining)
          SpmdEnabled: true
```

To create a larger worker pool, first delete the current worker pool and then use the parpool(N) function to create a pool of N workers.

```matlab
>> delete(gcp())
>> parpool(8)
```

The new worker pool will have 8 worker threads that can be used by MATLAB's parallel constructs. You must create a worker pool within your script; otherwise the pool size will always be 1, resulting in sequential execution.
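
Putting this together, a minimal sketch of a test.m that creates its own pool might look like the following. This is illustrative only; the pool size of 8 must match the number of CPUs reserved in the job file.

```matlab
% test.m - illustrative sketch: pool size must match the CPUs
% reserved in the job file (#$ -pe threads 8)
delete(gcp('nocreate'));   % remove any existing pool, if one exists
parpool(8);                % create a pool of 8 workers

parfor i = 1:8
    fprintf('iteration %d\n', i);
end

delete(gcp());             % shut the pool down when finished
```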

## Parfor

The simplest parallelization construct is the parallelized for loop called parfor.

```matlab
parfor i = range
    % loop body
end
```

It is similar to the basic for loop except that each iteration of the body is executed by one of the available worker threads, meaning that at any one time up to N iterations of the loop can be running simultaneously.

The following parfor loop prints the string 'hello from #' six times, where # is the value of i within the loop body for that iteration. As you can see, the output is not in the order expected had the loop executed sequentially. Instead, the order is non-deterministic and depends on the speed of each thread.

```matlab
>> parfor i=1:1:6; fprintf('hello from %d\n', i); end
hello from 2
hello from 1
hello from 4
hello from 3
hello from 6
hello from 5
```

## SPMD

spmd allows for the parallel execution of a block of code in the form of:

```matlab
spmd
    % body
end
```

This differs from parfor in that there is no looping involved; the body is simply executed once by each worker thread in the pool. Each thread of execution has a variable called labindex that provides a unique index from 1 to N identifying the thread. In addition to the default form, two more forms of the spmd construct are available: spmd(N) and spmd(M,N), which let you specify an exact number of workers or a range of workers, respectively, to use for the body.

The following example creates a pool of 4 worker threads and executes two spmd bodies that print the identity of the worker thread. The first spmd construct uses all of the workers to execute the body, whereas the second restricts execution to only two workers.

```matlab
>> parpool(4)
>> spmd; fprintf('Hello from thread %d\n', labindex); end
Lab 1: Hello from thread 1
Lab 2: Hello from thread 2
Lab 3: Hello from thread 3
Lab 4: Hello from thread 4
>> spmd(2); fprintf('Hello from thread %d\n', labindex); end
Lab 1: Hello from thread 1
Lab 2: Hello from thread 2
```

## Map-Reduce

mapreduce is a more complex parallelizing construct than parfor and spmd, but it excels when working with datasets that are too large for memory. It works in two stages: map and reduce. In the map stage, each worker applies the map function to a chunk of the dataset and then passes the result of the transform to the reduce stage. In the reduce stage, the results are combined according to the reduce function before being returned. This setup works well for large datasets where the computation does not require shared state between the map invocations and workers.

Assume we have a collection of numbers in a file called numbers.csv and we want to find the maximum value. We can define the following two functions. The first function, mapper, finds the local maximum in the chunk of the overall dataset it is provided via the data variable. It stores the result into an interim key-value store to be passed to reducer. The reducer takes the intermediate results and finds the global maximum, putting the solution into the output key-value store.

```matlab
function mapper(data, info, intermKVStore)
    localMax = max(data.x1);
    add(intermKVStore, 'localMax', localMax);
end

function reducer(intermKey, intermValIter, outKVStore)
    globalMax = -inf;
    while hasnext(intermValIter)
        globalMax = max(getnext(intermValIter), globalMax);
    end
    add(outKVStore, 'MaxValue', globalMax);
end
```

With these two functions, it is now possible to find the maximum value in numbers.csv using mapreduce. We start by creating a datastore from the numbers file, which serves as the input to mapreduce. A mapreduce worker pool is initialized by first creating a parallel pool of 4 workers and then passing the pool to mapreducer. mapreduce is then called with the input datastore, the mapper and reducer functions, and the mapreduce worker pool. Once finished, mapreduce returns the final key-value store from reducer, and the results can then be further analyzed.

```matlab
>> ds = datastore('numbers.csv');
>> p = parpool(4);
>> mr = mapreducer(p);
>> maxValue = mapreduce(ds, @mapper, @reducer, mr);
>> readall(maxValue)

ans =

       Key        Value
    __________    _____
    'MaxValue'    [9]
```

## parfeval

So far, all of the parallel constructs examined are synchronous, pausing the issuing thread until the construct is complete. However, MATLAB also offers asynchronous worker execution that allows tasks to run in the background while work continues in the foreground. parfeval() takes a worker pool, a function to execute, the number of return arguments from the function, and a set of input arguments to the function. parfeval() returns a `parallel.FevalFuture` object that can be used to determine when the function has finished executing and to access its output arguments. At some point in the future, the function will be executed by a worker in the specified pool and the `parallel.FevalFuture` will be updated with the results.

The following function computes the summation of all of the values between 1 and limit and returns the result.

```matlab
function s = adder(limit)
    s = 0;
    for i = 1:limit
        s = s + i;
    end
end
```

parfeval() can be used to execute adder in the background to find the summation from 1 to 100 while adder is used in the foreground to find the summation from 1 to 50. Once adder(50) has returned, fetchOutputs() waits for adder(100) to finish executing and then returns the result of the summation.

```matlab
>> future = parfeval(p, @adder, 1, 100);
>> sumOf50 = adder(50);
>> sumOf100 = fetchOutputs(future);
>> fprintf('Sum of 50 = %d Sum of 100 = %d\n', sumOf50, sumOf100);
Sum of 50 = 1275 Sum of 100 = 5050
```

Unlike the other constructs, parfeval() executes on only one worker instead of all of them. Therefore, to execute multiple instances of a function, you can invoke parfeval() multiple times from a for loop or use parfevalOnAll() to execute the function across all of the worker threads.
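
As a sketch, parfevalOnAll() runs the same function once on every worker in the pool; it takes the pool, the function, the number of output arguments, and the input arguments, and returns a future just like parfeval(). The anonymous function here is illustrative.

```matlab
% run an illustrative function once on each worker in the pool
p = gcp();
future = parfevalOnAll(p, @() fprintf('hello from a worker\n'), 0);
wait(future);   % block until every worker has finished
```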

## GPGPU

MATLAB on the Newton cluster also supports GPGPU computing on GPU-enabled nodes, freeing the CPU for other tasks and potentially speeding up complex calculations. Most of MATLAB's core functions and toolkits support the GPGPU, allowing jobs to be easily ported.

Using a GPGPU from MATLAB requires scheduling the job to run on a GPU-enabled node and reserving the GPGPU for use. This can be achieved by including the following complex within your SGE script.

```
#$ -l gpgpu=1
```

A device handle and information for each available GPGPU are accessed via the gpuDevice() function. Called without an index, gpuDevice() returns the currently selected GPU; called with an index, it selects and returns that particular GPU. For instance:

```matlab
>> gpuDevice()

ans =

  CUDADevice with properties:

                      Name: 'Tesla M2090'
                     Index: 1
         ComputeCapability: '2.0'
            SupportsDouble: 1
             DriverVersion: 7
            ToolkitVersion: 6.5000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [65535 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 5.6366e+09
           AvailableMemory: 5.5395e+09
       MultiprocessorCount: 16
              ClockRateKHz: 1301000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1
```

As stated, most of MATLAB's core functions already support the GPGPU and only require that their input arguments be GPU arrays to activate GPU functionality.

To create GPU handles, you can either transfer an existing array to the GPU or create one directly on the GPU; prefer the latter when possible, since it is significantly faster. gpuArray() transfers an input array from main memory to the GPU and returns a GPU handle to it. Most array-generating functions (eye, ones, zeros, etc.) can create an array on the GPU by specifying 'gpuArray' as an argument. The following example performs a simple GPU matrix multiply. The first line creates a random 3000x3000 array in main memory and then moves it to the GPU, and the second line creates a 3000x3000 identity matrix directly on the GPU. Since A and B are both GPU arrays, the multiply operation is performed on the GPU and the result is stored in C.

```matlab
>> A = gpuArray(rand(3000,3000));
>> B = eye(3000, 'gpuArray');
>> C = A * B;
```

Since the results of a GPU computation are stored on the GPU, they must be moved back to main memory before they can be accessed by non-GPU-enabled code. gather() transfers an input GPU array to main memory and returns the array. So, in the above example, before C can be used outside of the GPU it must be gathered back as shown below. C1 is now a normal MATLAB array with the contents of C.

```matlab
>> C1 = gather(C);
```

Here is another example using the fast Fourier transform function, fft(), on the CPU and the GPU with timing instructions. The first three lines compute the FFT on the CPU and the three subsequent lines perform the same operation on the GPU. Note that the only difference between the two is creating the array on the GPU. As you can see, the GPGPU method is 21.6 times faster.

```matlab
>> A = rand(1000,1000);
>> tic; fft(A); toc
Elapsed time is 3.577400 seconds.
>> A = rand(1000,1000,'gpuArray');
>> tic; fft(A); toc
Elapsed time is 0.165280 seconds.
```

A drawback to the GPU approach thus far is that every operation results in a separate CUDA kernel on the GPU, which can waste time loading kernels for each step. Thankfully, MATLAB provides two functions that can take normal user-defined functions and convert them to a single kernel. The first of these is arrayfun(), which can convert and execute functions that contain only element-wise array operations. In the example below, a function is defined for calculating the Horner series expansion of e^x using 10 terms. The first invocation of the horner function is called, like the preceding examples, with a gpuArray, and the calculation completes in about 1.12 seconds. The second invocation first passes the horner function to arrayfun(), which compiles it into a single CUDA kernel, and invokes it with the gpuArray as the argument. This completes in 0.15 seconds, a further speedup of 7.5 times.

```matlab
function y = horner(x)
    y = 1 + x.*(1 + x.*((1 + x.*((1 + x.*((1 + x.*((1 + x.*((1 + x.*((1 + x.*((1 + x./9)./8))./7))./6))./5))./4))./3))./2));
end

>> A = rand(15000, 'gpuArray');
>> tic; horner(A); toc;
Elapsed time is 1.122302 seconds.
>> tic; arrayfun(@horner, A); toc;
Elapsed time is 0.154633 seconds.
```

For array operations that are not element-wise, like transpose and inverse, you can use pagefun() instead. Additionally, pagefun() is useful for executing array operations, including element-wise ones, on arrays that exceed two dimensions. In this mode, pagefun executes the given operation on every two-dimensional page contained within the larger array.
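
For example, a batch of matrix transposes over a 3-D gpuArray might be sketched as follows; the array sizes here are purely illustrative.

```matlab
% A holds 10 separate 3x4 matrices stacked along the third dimension
A = rand(3, 4, 10, 'gpuArray');

% pagefun applies @transpose to each 3x4 page in a single batched
% operation on the GPU, giving 10 4x3 pages
B = pagefun(@transpose, A);

size(B)   % 4 3 10
```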

## Distributed computing

The distributed toolkit is not supported on Newton; however, it is still possible to run MATLAB across multiple nodes provided that the various MATLAB instances do not need to communicate with one another. One way to achieve this is to partition the data that needs to be processed into smaller units that individual instances of MATLAB can process. Instead of creating a separate job script for each instance, a job array can be used to duplicate the job multiple times, with a unique identifier named $SGE_TASK_ID accessible in the job script. The $SGE_TASK_ID variable can be used as the problem-space index so that each instance knows which partition of the dataset to access.

A job array can be created by specifying the -t option in the job script with either the number of instances to run or a range in the form n-m, which creates m-n+1 instances. In the first form, $SGE_TASK_ID ranges from 1 to the number of instances; in the second, from n to m.

For instance, say there is a 10x10 matrix, named A, saved in a file m1.mat. The following script finds the sum of each column of the matrix. It creates a job array of size 10 (#$ -t 1-10) and, within MATLAB, loads the matrix from the file and uses $SGE_TASK_ID as the column index of A for the sum function to operate on. Note that since #$ -t 1-10 was given, the line matlab -r "load('m1.mat'); sum(A(:,$SGE_TASK_ID))" is transformed into matlab -r "load('m1.mat'); sum(A(:,1))", matlab -r "load('m1.mat'); sum(A(:,2))", ..., matlab -r "load('m1.mat'); sum(A(:,10))" for each of the 10 jobs of the job array.

```
#$ -q short*
#$ -t 1-10
#$ -cwd

module load matlab
matlab -r "load('m1.mat'); sum(A(:,$SGE_TASK_ID))"
```

Building on the last example, it is possible to find the summation of the entire matrix. The script is altered below to save the summation of each column as the variable 'colsum' in MAT-files of the form '$SGE_TASK_ID.mat' (1.mat, 2.mat, ..., 10.mat). The subsequent script then has MATLAB load each colsum variable from the saved files and sum them together to produce the total summation of the original 10x10 matrix. In essence, these two scripts perform an explicit map-reduce operation to obtain the summation.

```
#$ -q short*
#$ -t 1-10
#$ -cwd

module load matlab
matlab -r "load('m1.mat'); colsum = sum(A(:,$SGE_TASK_ID)); save('$SGE_TASK_ID.mat', 'colsum')"
```

```
#$ -q short*
#$ -cwd

module load matlab
matlab -r "matsum=0; for i = 1:10; load(strcat(num2str(i), '.mat')); matsum = matsum + colsum; end; disp(matsum)"
```