Per Flab

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

CENG331, Fall 2011 HW Assignment #4: Code Optimization Assigned: Dec. 20, 2011 - Due: Jan.

6, 2012, 23:55:50

1 Introduction
This assignment deals with optimizing memory intensive code. Image processing offers many examples of functions that can benet from optimization. In this homework, we will consider two image processing operations: rotate, which rotates an image counter-clockwise by 90 , and smooth, which smooths or blurs an image. For this homework, we will consider an image to be represented as a two-dimensional matrix M , where Mi,j denotes the value of (i, j)th pixel of M . Pixel values are triples of red, green, and blue (RGB) values. We will only consider square images. Let N denote the number of rows (or columns) of an image. Rows and columns are numbered, in C-style, from 0 to N 1. Given this representation, the rotate operation can be implemented quite simply as the combination of the following two matrix operations: Transpose: For each (i, j) pair, Mi,j and Mj,i are interchanged. Exchange rows: Row i is exchanged with row N 1 i. This combination is illustrated in Figure 1. The smooth operation is implemented by replacing every pixel value with the average of all the pixels around it (in a maximum of 3 3 window centered at that pixel). Consider Figure 2. The values of pixels M2[1][1] and M2[N-1][N-1] are given below: M2[1][1] = M2[N 1][N 1] =
2 i=0 2 j=0 M1[i][j]

9
N 1 i=N 2 N 1 j=N 2 M1[i][j]

2 Logistics
Any clarications and revisions to the assignment will be posted on the metu.ceng.course.331 newsgroup. 1

j (0,0)

Rotate by 90 i (counterclockwise) j

(0,0) i (0,0) Transpose Exchange Rows

Figure 1: Rotation of an image by 90 counterclockwise

M1[1][1]

M2[1][1]

smooth

M1[N1][N1]

M2[N1][N1]

Figure 2: Smoothing an image

3 Hand Out Instructions


Download the le perflab-handout.tar from COW interface. Start by copying perflab-handout.tar to a protected directory in which you plan to do your work. Then give the command: tar xvf perflab-handout.tar. This will cause a number of les to be unpacked into the directory. The only le you will be modifying and handing in is kernels.c. The driver.c program is a driver program that allows you to evaluate the performance of your solutions. Use the command make driver to generate the driver code and run it with the command ./driver. In this lab, you are required to work on regular 64-bit inek machines, since your implementations will be evaluated on one of them and the performance measures that will be used in evaluation depend on the machine you are working.

4 Implementation Overview
Data Structures
The core data structure deals with image representation. A pixel is a struct as shown below:
typedef struct { unsigned short red; /* R value */ unsigned short green; /* G value */ unsigned short blue; /* B value */ } pixel; As can be seen, RGB values have 16-bit representations (16-bit color). An image I is represented as a onedimensional array of pixels, where the (i, j)th pixel is I[RIDX(i,j,n)]. Here n is the dimension of the image matrix, and RIDX is a macro dened as follows: #define RIDX(i,j,n) ((i)*(n)+(j)) See the le defs.h for this code.

Rotate
The following C function computes the result of rotating the source image src by 90 and stores the result in destination image dst. dim is the dimension of the image. void naive_rotate(int dim, pixel *src, pixel *dst) { int i, j; for(i=0; i < dim; i++) for(j=0; j < dim; j++) dst[RIDX(dim-1-j,i,dim)] = src[RIDX(i,j,dim)]; return; }

The above code scans the rows of the source image matrix, copying to the columns of the destination image matrix. Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking. See the le kernels.c for this code.

Smooth
The smoothing function takes as input a source image src and returns the smoothed result in the destination image dst. Here is part of an implementation: void naive_smooth(int dim, pixel *src, pixel *dst) { int i, j; for(i=0; i < dim; i++) for(j=0; j < dim; j++) dst[RIDX(i,j,dim)] = avg(dim, i, j, src); /* Smooth the (i,j)th pixel */ return; } The function avg returns the average of all the pixels around the (i,j)th pixel. Your task is to optimize smooth (and avg) to run as fast as possible. (Note: The function avg is a local function and you can get rid of it altogether to implement smooth in some other way.) This code (and an implementation of avg) is in the le kernels.c.

Performance measures
Our main performance measure is CPE or Cycles per Element. If a function takes C cycles to run for an image of size N N , the CPE value is C/N 2 . Table 1 summarizes the performance of the naive implementations shown above and compares it against an optimized implementation. Performance is shown for for 5 different values of N . All measurements were made on one of the inek machines. The ratios (speedups) of the optimized implementation over the naive one will constitute a score of your implementation. To summarize the overall effect over different values of N , we will compute the geometric mean of the results for these 5 values. That is, if the measured speedups for N = {32, 64, 128, 256, 512} are R32 , R64 , R128 , R256 , and R512 then we compute the overall performance as R =
5

R32 R64 R128 R256 R512

Assumptions
To make life easier, you can assume that N is a multiple of 32. Your code must run correctly for all such values of N , but we will measure its performance only for the 5 values shown in Table 1.

Method Naive rotate (CPE) Optimized rotate (CPE) Speedup (naive/opt) Method Naive smooth (CPE) Optimized smooth (CPE) Speedup (naive/opt)

Test case N

1 64 5.2 4.5 1.2 32 155.8 39.5 3.9

2 128 6.2 4.3 1.4 64 156.3 43.1 3.6

3 256 10.3 4.3 2.4 128 156.6 43.3 3.6

4 512 18.0 8.3 2.2 256 156.7 43.3 3.6

5 1024 64.7 13.1 4.9 512 157.2 44.1 3.6

Geom. Mean

2.1 Geom. Mean

3.7

Table 1: CPEs and Ratios for Optimized vs. Naive Implementations

5 Infrastructure
We have provided support code to help you test the correctness of your implementations and measure their performance. This section describes how to use this infrastructure. The exact details of each part of the assignment is described in the following section. Note: The only source le you will be modifying is kernels.c.

Versioning
You will be writing many versions of the rotate and smooth routines. To help you compare the performance of all the different versions youve written, we provide a way of registering functions. For example, the le kernels.c that we have provided you contains the following function: void register_rotate_functions() { add_rotate_function(&rotate, rotate_descr); } This function contains one or more calls to add rotate function. In the above example, add rotate function registers the function rotate along with a string rotate descr which is an ASCII description of what the function does. See the le kernels.c to see how to create the string descriptions. This string can be at most 256 characters long. A similar function for your smooth kernels is provided in the le kernels.c.

Driver
The source code you will write will be linked with object code that we supply into a driver binary. To create this binary, you will need to execute the command unix> make driver You will need to re-make driver each time you change the code in kernels.c. To test your implementations, you can then run the command:

unix> ./driver The driver can be run in four different modes: Default mode, in which all versions of your implementation are run. Autograder mode, in which only the rotate() and smooth() functions are run. This is the mode we will run in when we use the driver to grade your handin. File mode, in which only versions that are mentioned in an input le are run. Dump mode, in which a one-line description of each version is dumped to a text le. You can then edit this text le to keep only those versions that youd like to test using the le mode. You can specify whether to quit after dumping the le or if your implementations are to be run. If run without any arguments, driver will run all of your versions (default mode). Other modes and options can be specied by command-line arguments to driver, as listed below: -g : Run only rotate() and smooth() functions (autograder mode). -f <funcfile> : Execute only those versions specied in <funcfile> (le mode). -d <dumpfile> : Dump the names of all versions to a dump le called <dumpfile>, one line to a version (dump mode). -q : Quit after dumping version names to a dump le. To be used in tandem with -d. For example, to quit immediately after printing the dump le, type ./driver -qd dumpfile. -h : Print the command line usage.

6 Assignment Details
Optimizing Rotate (50 points)
In this part, you will optimize rotate to achieve as low a CPE as possible. You should compile driver and then run it with the appropriate arguments to test your implementations.

Optimizing Smooth (50 points)


In this part, you will optimize smooth to achieve as low a CPE as possible. Some advice. Look at the assembly code generated for the rotate and smooth. Focus on optimizing the inner loop (the code that gets repeatedly executed in a loop) using the optimization tricks covered in class. The smooth is more compute-intensive and less memory-sensitive than the rotate function, so the optimizations are of somewhat different avors.

Coding Rules
You may write any code you want, as long as it satises the following: It must be in ANSI C. You may not use any embedded assembly language statements.

It must not interfere with the time measurement mechanism. You will also be penalized if your code prints any extraneous information. You can only modify code in kernels.c. You are allowed to dene macros, additional global variables, and other procedures in these les.

Evaluation
Your solutions for rotate and smooth will each count for 50% of your grade. Your implementations will be evaluated on one of the inek machines preserving that no user is connected to that machine other than the evaluater assistant. The score for each will be based on the following: Correctness: You will get NO CREDIT for buggy code that causes the driver to complain! This includes code that correctly operates on the test sizes, but incorrectly on image matrices of other sizes. As mentioned earlier, you may assume that the image dimension is a multiple of 32. CPE: You will get full credit for your implementations of rotate and smooth if they are correct and achieve mean CPEs above thresholds Sr = 2.1 and Ss = 3.7 respectively. You will get partial credit for a correct implementation which achieves mean CPEs below thresholds, and that does better than the supplied naive one. Implementations that does not do better than the supplied naive one will not get any credit. Please also note that the thresholds given above are subject to change and try to achieve mean speedups as large as you can.

7 Hand In Instructions
You must rename your nal kernels.c le to eXXXXXXX_kernels.c, where XXXXXXX is your seven-digit student id, and submit it to COW. Make sure that the rotate() and smooth() functions correspond to your fastest implementations, as these are the only functions that will be tested when we use the driver to grade your assignement. Remove any extraneous print statements. Good luck!

You might also like