Micro-optimisation patterns

Constants

Variables which are both static and final can be optimised by the javac compiler and by ProGuard at compile time. Math and logic operations performed at compile time, or only once as the application starts, are much faster than repeating the same calculation inside a loop or on every method call.

For example, suppose a class begins with:

      private static int loopCount = 10;
      private static long startTime = System.currentTimeMillis();
      private static boolean enableImages = true;

Declaring these fields final gives the compiler and ProGuard more opportunities to optimise the code at the compile step, and it also lets the ARM processor handle these values with more efficient bytecodes. We also change the capitalisation of the variable names to reflect the fact that they are now constants; following this coding standard keeps the code much easier to read and maintain:

      private static final int LOOP_COUNT = 10;
      private static final long START_TIME = System.currentTimeMillis();
      private static final boolean ENABLE_IMAGES = true;

For performance, if you can make a variable a constant, do so; a constant is faster than a variable. If you cannot, consider whether some slight change to your algorithm would allow it. Often, by decreasing the number of variables and correspondingly increasing the number of constants, you will find your algorithms simplify at the same time as they get faster.

Primitives

Use int instead of short, byte, or long. Only use short or byte if you are going to allocate so many primitives that the smaller type significantly reduces the amount of RAM needed. Use long as needed when values can be really big, but take care to convert back down to int as soon as the calculation allows it, for example with an (int) type cast in the middle of a calculation, to make the rest of the calculation faster.
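
A minimal sketch of casting down mid-calculation (the loop bound and scaling numbers are arbitrary):

long sum = 0;
for (int i = 0; i < 100000; i++) {
    sum += i; // the running total needs long: it grows to about 5 * 10^9
}
// The average fits easily in 32 bits, so cast down immediately;
// the rest of the expression then uses faster int arithmetic.
int scaled = (int) (sum / 100000) * 3 + 7;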

Some performance differences between short, int, and long were measured with loops of the following form (with the three variables declared as short, int, or long in each variant):

for (int i = 0; i < 3000000; i++) {
    short/int/long a = 123;
    short/int/long b = -44;
    short/int/long c = 12;

    a += c;
    b += a;
    c *= b;
}

Average times spent in the loops on a Nokia Asha 305 on the Nokia Asha software platform (times in parentheses are with ProGuard obfuscation):

  • int: 710 ms (580 ms)

  • short: 900 ms (850 ms), roughly 50% slower

  • long: 1450 ms (1150 ms), roughly 100% slower

As expected, long is slower than int. Furthermore, short is slightly slower than int, because every arithmetic result must be narrowed back to short by an implicit type conversion (the i2s instruction in the resulting bytecode below); javac is required to emit this by the Java language specification.

short:

17: bipush 123
19: istore 7
21: bipush -44
23: istore 8
25: bipush 12
27: istore 9
29: iload 7
31: iload 9
33: iadd
34: i2s
35: istore 7
37: iload 8
39: iload 7
41: iadd
42: i2s
43: istore 8
45: iload 9
47: iload 8
49: imul
50: i2s
51: istore 9

int:

113: bipush 123
115: istore 7
117: bipush -44
119: istore 8
121: bipush 12
123: istore 9
125: iload 7
127: iload 9
129: iadd
130: istore 7
132: iload 8
134: iload 7
136: iadd
137: istore 8
139: iload 9
141: iload 8
143: imul
144: istore 9

long:

206: ldc2_w #73
209: lstore 7
211: ldc2_w #75
214: lstore 9
216: ldc2_w #77
219: lstore 11
221: lload 7
223: lload 11
225: ladd
226: lstore 7
228: lload 9
230: lload 7
232: ladd
233: lstore 9
235: lload 11
237: lload 9
239: lmul
240: lstore 11

Object creation

Object creation is a relatively costly operation. The first point to remember is to avoid creating invariant objects in a loop (objects which do not change during loop execution). Move the object creation before the loop, and make the reference final if possible to give further hints for automatic code optimisation.
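
A minimal sketch of hoisting an invariant object (the Point class, items array, and drawAt() method are illustrative):

// Wasteful: an identical object is allocated on every iteration
for (int i = 0; i < items.length; i++) {
    Point origin = new Point(0, 0);
    items[i].drawAt(origin);
}

// Better: the invariant object is created once, before the loop
final Point origin = new Point(0, 0);
for (int i = 0; i < items.length; i++) {
    items[i].drawAt(origin);
}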

Second, when multiple instances of a class need access to the same object, it is better to hold it in a static variable than to keep a separate reference in every instance. If you have a collection object, try to pre-size it as large as it will eventually need to be. It is better for the collection to be slightly bigger than necessary, since resizing involves relatively slow object creation and array copy operations.
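
A short sketch of both points (the class and field names are illustrative):

import java.util.Vector;

public class Sprite {
    // Shared by every Sprite instance: one static reference
    // instead of a separate copy of the reference per instance.
    private static final int[] PALETTE = new int[256];

    // Pre-sized to the expected maximum, so the Vector never
    // has to perform the slow grow-and-copy step.
    private static final Vector ACTIVE = new Vector(64);
}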

Third, do not over-design new objects for every case and level of abstraction you might someday need. Server programmers are famously guilty of this, but it does not make the world a better (or faster) place. For example, if you have a Car object, you do not necessarily need separate Velocity (with vX and vY), CarColor, and Length (with width and height) objects; just add those as primitive fields of Car and thereby gain significantly in performance:

float vX, vY;      // velocity, instead of a Velocity object
int color;         // instead of a CarColor object
int width, height; // instead of a Length object

Final

Using the final keyword is not only a good way to make code more readable, to prevent accidental reassignment of a constant primitive or object reference, and to prevent inadvertent overriding of a method or class. The point here is that it also brings an incremental performance improvement in many cases.

The performance gains from the final keyword were measured with the following loops:

int a;

for (int i = 0; i < 1000000; i++) {
    a = finalMethod(1, 2, 3);
}

for (int i = 0; i < 1000000; i++) {
    a = nonFinalMethod(1, 2, 3);
}

public final int finalMethod(final int a, final int b, final int c) {
    final float x = 1.23f, y = 0.05f;
    final float z = x * y;
    final int d = a + b + c;

    return d;
}

public int nonFinalMethod(int a, int b, int c) {
    float x = 1.23f, y = 0.05f;
    float z = x * y;
    int d = a + b + c;

    return d;
}

Average times on a Nokia Asha 305 on the Nokia Asha software platform:

  • finalMethod: 650 ms

  • nonFinalMethod: 940 ms, roughly 45% slower

In this case, the time difference comes from the final keyword on x and y: because both are compile-time constants, the multiplication x * y can be precalculated into z at build time. Making the parameters a, b, and c final does not let d (or anything else) be precalculated, since their values are only known when the method is called. And because z is never used afterwards, its own final modifier enables no additional optimisation here.

In finalMethod(), the multiplication operation (fmul) is optimised out (precalculated at build time) as shown in the resulting bytecode below:

nonFinalMethod():

0: ldc #62 // float 1.23f
2: fstore 4
4: ldc #79 // float 0.05f
6: fstore 5
8: fload 4
10: fload 5
12: fmul
13: fstore 6
15: iload_1
16: iload_2
17: iadd
18: iload_3
19: iadd
20: istore 7
22: iload 7
24: ireturn

finalMethod():

0: ldc #62 // float 1.23f
2: fstore 4
4: ldc #79 // float 0.05f
6: fstore 5
8: ldc #80 // float 0.0615f
10: fstore 6
12: iload_1
13: iload_2
14: iadd
15: iload_3
16: iadd
17: istore 7
19: iload 7
21: ireturn

Note also that, to give the compiler and the ProGuard obfuscator every possible chance to optimise at compile time, we have marked finalMethod() itself as final. This can be considered good object design and a high-performance coding habit worth using wherever possible, but it is not significant to the performance difference seen in this test. In more general programming we would, when possible, mark the entire class final rather than just one method, since that single word makes every method in the class final.

Static

Generally, static methods and variables should be faster to access. Oddly, with some combinations of ARM processor and JVM, instance accesses are slightly faster. For example, when testing on a Nokia Asha 305 on the Nokia Asha software platform, the loop calling the instance method below was the faster one.

Below are the tests run on Nokia Asha 305 on Nokia Asha software platform:

for (int i = 0; i < 1000000; i++) {
    staticMethod();   
}

for (int i = 0; i < 1000000; i++) {
    nonStaticMethod();
}

private static void staticMethod() {
    b++; // static variable
}

private void nonStaticMethod() {
    a++; // instance variable
}

Average times spent in the loops on a Nokia Asha 305 on the Nokia Asha software platform:

  • nonStaticMethod: 570 ms

  • staticMethod: 680 ms, roughly 20% slower

Bytecodes (no obfuscation applied):

nonStaticMethod():

0: aload_0
1: dup
2: getfield #2 // Field a:I
5: iconst_1
6: iadd
7: putfield #2 // Field a:I
10: return

staticMethod():

0: getstatic #60 // Field b:I
3: iconst_1
4: iadd
5: putstatic #60 // Field b:I
8: return

String concatenation

If you are going to concatenate a large number of small Strings, use StringBuffer.append() instead of the String += operator. String concatenation is much slower because every use of += creates a new StringBuffer (and a new String) under the hood. Depending on the number of concatenations, a single explicit StringBuffer can be many times faster than the many implicit StringBuffers created by String addition.
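
A minimal sketch of the difference (the loop bound and the initial capacity are arbitrary):

// Slow: each += creates a new implicit StringBuffer and a new String
String s = "";
for (int i = 0; i < 1000; i++) {
    s += i;
}

// Fast: one explicit StringBuffer, reused for every append
StringBuffer sb = new StringBuffer(4000);
for (int i = 0; i < 1000; i++) {
    sb.append(i);
}
String result = sb.toString();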

Copying arrays

If you have to copy an array, do not write your own copy method; use System.arraycopy(). It takes advantage of low-level routines to do the work for you and is many times faster than copying the array element by element yourself.
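
For example (the array sizes here are arbitrary):

int[] src = new int[1000];
int[] dst = new int[1000];
// Copies src[0..999] into dst[0..999] using the optimised system routine
System.arraycopy(src, 0, dst, 0, src.length);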

Even better than System.arraycopy() is to avoid copying in the first place. If you can alter your algorithm or method parameters to accept the data in its present form, or generate the data in a form which the algorithm can accept without copying, this is much faster still.

Increment and bitshift

With bitshift operations you can achieve a small performance gain, but in cases where the second operand is a constant, ProGuard will do this for you. This was tested with the following loops on a Nokia Asha 305 on the Nokia Asha software platform:

int a = 1;

for (int i = 0; i < 10000000; i++) {
    if (i % 2 == 0) {
        a <<= 1;
    }
    else {
        a >>= 1;
    }
}

for (int i = 0; i < 10000000; i++) {
    if (i % 2 == 0) {
        a *= 2;
    }
    else {
        a /= 2;
    }
}

The average times spent in the loops were (times in parentheses are with ProGuard obfuscation):

  • Bitshifting: 3650 ms (3500 ms)

  • * and / operators: 3820 ms (3500 ms), roughly 5% slower without obfuscation

So ProGuard automatically optimises the *2 and /2 operations into bit shifts. For the readability of your code, unless you have a great many of these operations it is not worth replacing them with bitshifts yourself (especially when using ProGuard). Remember, however, that ProGuard cannot convert to a bit shift when it does not know the power-of-two factor at build time. You should therefore use bit shifts explicitly when shifting by a variable amount, as sketched below; in other cases you are better served by leaving your math calculations clear and easy to read.
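
A small sketch of the variable-amount case (getZoomLevel() is a hypothetical accessor returning a small non-negative value):

// The shift amount is only known at run time, so ProGuard cannot
// fold a multiplication into a shift here; shift explicitly instead.
int zoomBits = getZoomLevel();  // hypothetical, e.g. 0..4
int scaled = value << zoomBits; // value * 2^zoomBits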

It is interesting that only a 5% speed increase was realised in the test above. This reflects the fact that modern ARM processors, with a six-stage pipeline and a lot of silicon dedicated to multiplication and division, are surprisingly good at these operations. It also makes it tricky to generalise the gain from simple tests like this to more complex real-world code, where other branch and math operations run in close proximity to the shift. For example, several bit shifts in a row may give different results from this test, but in no case will a bit shift be slower than the equivalent multiplication or division.

Loop invariants and method call reduction

Before starting to iterate through a loop, calculate as much as possible in advance to avoid redundant work on every iteration. Also avoid calling slow external accessors repeatedly when their value will not change and a local copy will suffice.

Wrong way:

for (int i = 0; i < monsters.size(); i++) {
    if (((Monster)monsters.elementAt(i)).getTarget().getType() == TROLL) {
        ...
    }
    ((Monster)monsters.elementAt(i)).decreaseHealth(10);
}

Faster and clearer:

final int size = monsters.size();
Monster monster;
for (int i = 0; i < size; i++) {
    monster = (Monster)monsters.elementAt(i);
    if (monster.getTarget().getType() == TROLL) {
        ...
    }
    monster.decreaseHealth(10);
}

Loop exit

Do not keep iterating a loop to the end if you can stop earlier. For example, if you are traversing a list to see whether any element has some attribute, you can terminate as soon as you find an element with that attribute.

boolean anyMonsterAlive = false;
int size = monsters.size();
for (int i = 0; i < size; i++) {
    Monster monster = (Monster)monsters.elementAt(i);
    if (monster.isAlive()) {
        anyMonsterAlive = true;
        break; // terminate here
    }
}

If there are a large number of elements, this can save a lot of time. Carefully examining, and perhaps deliberately arranging, the data you access in the loop (for example by sorting or indexing it) may accelerate loop exits further.

Vector vs. Array

If you have a fixed number of objects to store, use an array; it is always the fastest structure. If the number of objects changes often, either an array or a Vector may be appropriate, depending on the pattern of change. With an array you must allocate the maximum capacity up front and track how many elements are actually in use, which ties up memory that is unneeded most of the time; resizing the array instead is very slow for large element counts. With a Vector, adding and removing elements is fast and simple, but accessing elements is slower.

When creating a Vector, if you know the exact or even a rough number of elements to be added, you can make the creation faster by passing an initial size:

Vector v = new Vector(600);

Below are some performance test results measuring array and Vector performance on a Nokia Asha 305 on the Nokia Asha software platform:

                            Array     Vector
Creating 50,000 elements*   190 ms    370 ms (280 ms if the initial size is given as a parameter)
Accessing all elements      9 ms      90 ms

* Extending an array is very slow. For example, extending an empty array one element at a time (with new Objects) to a length of 1000 took 120 ms.

Math operations

We compared a number of math operations on a Nokia 311 to see the relative performance of various operations, argument types, and variable types.

Integer vs. Long

long m = 0;
int j = 11;

for (int i = 0; i < 1000000; i++) {
    m = m + j;
}
 

                          m is long   m is int
Basic addition            31 ms       21 ms
j is final                31 ms       20 ms
j is static, m = m + j    35 ms       20 ms
j is static, m += j       31 ms       20 ms

In some cases it appears the JVM or the ARM11 processor optimises m = m + j to the same speed as m += j, but this cannot be relied on, as it depends on the variable types and on other activity in the processor pipeline.

If we turn on ProGuard, we get 0 ms for all tests. ProGuard optimises the entire loop and test away at compile time.

By changing the test to use the result of the calculation, we can prevent ProGuard from eliminating the loop. We can also verify by our tests that ProGuard changes m = m + j to the faster m += j form automatically. Thus you can use whichever style of calculation you find most readable and expect it to be very efficient in your obfuscated final release builds.

Now, if we modify the loop to accumulate the changing loop variable i instead of j, it slows down slightly, because we are no longer reading an unchanging value but one which changes on every pass through the loop. This apparently increases load on the ARM11 six-stage pipeline, possibly preventing i from being kept in a register across the loop. ProGuard is again shown to improve the calculation speed by folding the statement into a more efficient math operation.

 

                          m is long   m is int
m = m + i                 37 ms       22 ms
m += i                    33 ms       21 ms
m = m + i, ProGuard on    32 ms       21 ms
m += i, ProGuard on       33 ms       21 ms

Turning off “compile with optimisation” in javac had no effect on these results, with ProGuard on or off. These tests are presumably too simple to be optimised at the javac level. It may improve results in more complex logic, and since it does no harm we may as well leave it turned on, even though these optimisations are likely a redundant subset of what ProGuard does to the resulting bytecode.

Addition vs. Multiplication vs. Division

Division operations are simply slower at the machine level, and for that reason the division loop takes more time. If a loop performs a lot of calculation, try to avoid division operations.

Performance differences between addition, multiplication, and division operations were measured with the following loops (on a Nokia Asha 305 on the Nokia Asha software platform):

float a, b, c;

for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;

    c += a;
    a = b + c;
}

for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;

    c *= a;
    a = b * c;
}

for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;

    c /= a;
    a = b / c;
}

Average times spent in the loops:

  • Multiplication: 330 ms

  • Addition: 360 ms, roughly 9% slower

  • Division: 560 ms, roughly 70% slower

Lookup tables

Lookup tables are a way to decrease calculation time in a loop by pre-calculating values and storing them in an array. For example, we can pre-calculate the y coordinates of the line below to speed up later drawing.

final int[] line = new int[230];
for (int i = 0; i < line.length; i++) {
    line[i] = 20 + i / 3; // fill the lookup table once
}
...
for (int x = 9; x < 230; x += 10) {
    // Now we avoid recalculating y = 20 + x / 3
    final int y = line[x];
    g.drawLine(x, y, x + 2, y + 2);
    g.drawLine(x, y + 2, x + 2, y);
}

For another example of a lookup table in the pureDownscale() routine, see Case Example: Image Resize.

Fixed point math

This can be a bit confusing, and you are advised to seek out a complete treatment of fixed point math. The short summary: if you need fractions, but only within a limited range of accuracy, consider using bit-shifted integers instead. This trick is commonly used in 3D graphics to increase performance.

For example, if a fractional accuracy of 0.125 (1/8, i.e. 3 bits of fraction) is sufficient, you can replace floating point operations with fixed point integer math as follows:

// original calculation
float a = 4.1f;
float b = 3.3f;
float c = b / a; // floating point division is very slow

Can be replaced with:

// fixed point calculation with 3 fraction bits
int a_fp = (int) (4.1f * 8); // 4.1 shifted 3 bits left of the binary point
int b_fp = (int) (3.3f * 8); // same: 3 bits of fraction
int c_fp = (b_fp << 3) / a_fp; // shift the dividend first so the quotient keeps 3 fraction bits
float c = c_fp / 8.0f; // convert back to float by undoing the 3-bit shift

For example, if you are doing physics engine calculations in a game, fixed point calculations are often sufficiently accurate and much faster than using float or double.

Parallel math

This is a rarely used technique, but one worth understanding. For example, you can perform four byte-sized additions for the cost of a single integer addition if you treat each of the four bytes of an int as a separate number.

Parallel math requires very precise ranging of your math operations to prevent one byte overflowing into its neighbour. This is done by masking the values, trading accuracy for the guarantee that no overflow can occur, as sketched below.
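
A minimal sketch, assuming four small counters packed one per byte into an int; masking each byte down to 7 bits guarantees that no per-byte sum can carry into its neighbour:

int packedA = 0x10203040; // four byte-sized values in one int
int packedB = 0x01020304;

// One int addition performs four byte additions in parallel.
// After masking, each byte is at most 0x7F, so each per-byte sum is
// at most 0xFE and cannot overflow into the next byte.
int packedSum = (packedA & 0x7F7F7F7F) + (packedB & 0x7F7F7F7F);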

For a complete, real world example of fixed point math, see Case Example: Image Resize.

Comparisons

Switch vs. If

The switch statement in C is implemented as a direct jump, which is extremely fast. In Java on Nokia Series 40 phones, switches are implemented at the bytecode level as a series of if comparisons. In many cases a switch statement is therefore less efficient than a hand-written series of if..else statements in which the most frequently occurring case is tested first. If you prefer switch statements for code clarity, arrange them so that the most frequent cases appear first.
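
A small sketch of the hand-ordered chain (the event codes and handlers are illustrative), with the most frequent case tested first:

// MOVE occurs far more often than PRESS or RELEASE in this
// hypothetical event loop, so it is compared first.
if (event == EVENT_MOVE) {
    handleMove();
} else if (event == EVENT_PRESS) {
    handlePress();
} else {
    handleRelease();
}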