Variables which are both static and final can be optimised by the javac compiler and ProGuard at compile time. Math and logic operations performed at compile time, or only once as the application starts, are much faster than repeating the same calculation inside a loop or every time a method is called.
For example, suppose a class begins with:
private static int loopCount = 10;
private static long startTime = System.currentTimeMillis();
private static boolean enableImages = true;
We can give the compiler and ProGuard more opportunities to optimise the code at the compile step, and this also gives the ARM processor the chance to handle these variables with more efficient bytecodes. We also change the capitalisation of the variable names to reflect the fact that they are now constants. Following this coding standard keeps our code much easier to read and maintain:
private static final int LOOP_COUNT = 10;
private static final long START_TIME = System.currentTimeMillis();
private static final boolean ENABLE_IMAGES = true;
For performance, if you can make a variable a constant, do so: a constant is faster than a variable. If you cannot, consider whether some slight change to your algorithm would allow you to. Often, by decreasing the number of variables and correspondingly increasing the number of constants, you will find your algorithms simplify at the same time that they run faster.
Use int instead of short, byte, or long. Only use short or byte if you are going to allocate such a large number of primitives that the smaller types would significantly reduce the amount of RAM needed. Use long as needed if you have a really big number, but take care to convert back down to int as soon as the calculations make that possible, for example by applying an (int) type cast in the middle of a calculation to make the rest of it faster.
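For example, a minimal sketch of narrowing from long back to int in mid-calculation (the variable names and values are illustrative):

int sampleCount = 100000, weight = 70000, offset = 5;
long total = (long) sampleCount * weight; // long needed: the product overflows int
int scaled = (int) (total >> 16);         // the value now fits in an int, so narrow here
int result = scaled * 3 + offset;         // the rest of the calculation runs at int speed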
Some performance differences between short, int, and long were measured with the following type of loop:
// short/int/long stands for the primitive type used in each test variant
for (int i = 0; i < 3000000; i++) {
    short/int/long a = 123;
    short/int/long b = -44;
    short/int/long c = 12;
    a += c;
    b += a;
    c *= b;
}
Average times spent in loops on Nokia Asha 305 on Nokia Asha software platform (obfuscated):
int:   710 (580) ms
short: 900 (850) ms, 50% slower
long:  1450 (1150) ms, 100% slower
As expected, long is slower than int. Furthermore, short is slightly slower than int because of the implicit type conversion back to short (i2s) after each operation (see the resulting bytecode below); javac enforces this according to the Java standards.
short:
17: bipush 123
19: istore 7
21: bipush -44
23: istore 8
25: bipush 12
27: istore 9
29: iload 7
31: iload 9
33: iadd
34: i2s
35: istore 7
37: iload 8
39: iload 7
41: iadd
42: i2s
43: istore 8
45: iload 9
47: iload 8
49: imul
50: i2s
51: istore 9

int:
113: bipush 123
115: istore 7
117: bipush -44
119: istore 8
121: bipush 12
123: istore 9
125: iload 7
127: iload 9
129: iadd
130: istore 7
132: iload 8
134: iload 7
136: iadd
137: istore 8
139: iload 9
141: iload 8
143: imul
144: istore 9

long:
206: ldc2_w #73
209: lstore 7
211: ldc2_w #75
214: lstore 9
216: ldc2_w #77
219: lstore 11
221: lload 7
223: lload 11
225: ladd
226: lstore 7
228: lload 9
230: lload 7
232: ladd
233: lstore 9
235: lload 11
237: lload 9
239: lmul
240: lstore 11
Object creation is a relatively costly operation. The first point to remember is to avoid creating invariant objects in a loop (objects which do not change during loop execution). Move the object creation before the loop, and make the object final if possible to give further hints for automatic code optimisation.
Second, when multiple instances of a class need access to the same object, it is better to hold it in a static variable than to have a separate reference in every instance. If you have a collection object, try to pre-size it as large as it will eventually need to be; it is better for the collection to be slightly bigger than necessary, since resizing involves relatively slow new-object creation and array copy operations.
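A minimal sketch of the shared static reference (the class and field names are illustrative):

class Monster {
    // one reference shared by every Monster instance
    private static int[] sharedPathGrid;
    // instead of: private int[] pathGrid; // a separate reference per instance
}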
Third, do not over-design new objects for every case and level of abstraction that you might someday possibly need. Server programmers are famously guilty of this, but it does not make the world a better (or faster) place. For example, if you have a Car object, you do not necessarily need to also design objects for Velocity (with vX and vY), CarColor, and Length (with width and height); you can instead make these primitive attributes of Car and thereby gain significantly in performance:
float vX, vY;
int color;
int width, height;
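Wrapped up, a minimal sketch of the flattened design (the class shape is illustrative):

final class Car {
    float vX, vY;      // velocity components, instead of a Velocity object
    int color;         // instead of a CarColor object
    int width, height; // instead of a Length object
}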
Using the final keyword is not only a good way to make code more readable, to prevent yourself from modifying a constant primitive or object, and to prevent inadvertent overriding of a method or class. The point here is that it also gains you an incremental performance improvement in many cases. The performance gains from the final keyword were measured with the following loops:
for (int i = 0; i < 1000000; i++) {
    a = finalMethod(1, 2, 3);
}

for (int i = 0; i < 1000000; i++) {
    a = nonFinalMethod(1, 2, 3);
}

public final int finalMethod(final int a, final int b, final int c) {
    final float x = 1.23f, y = 0.05f;
    final float z = x * y;
    final int d = a + b + c;
    return d;
}

public int nonFinalMethod(int a, int b, int c) {
    float x = 1.23f, y = 0.05f;
    float z = x * y;
    int d = a + b + c;
    return d;
}
Average times on a Nokia Asha 305 on Nokia Asha software platform:
finalMethod: 650 ms
nonFinalMethod: 940 ms, 45% slower
In this case, the time difference comes from the final keyword on x and y. This is logical, because the value of z can then be precalculated. The final keyword on the parameters a, b, and c does not allow d (or anything else) to be precalculated, since their values are only known at run time. And because z is never used, marking it final does not help either.
In finalMethod(), the multiplication operation (fmul) is optimised out (precalculated at build time), as shown in the resulting bytecode below:
nonFinalMethod():
0: ldc #62 // float 1.23f
2: fstore 4
4: ldc #79 // float 0.05f
6: fstore 5
8: fload 4
10: fload 5
12: fmul
13: fstore 6
15: iload_1
16: iload_2
17: iadd
18: iload_3
19: iadd
20: istore 7
22: iload 7
24: ireturn

finalMethod():
0: ldc #62 // float 1.23f
2: fstore 4
4: ldc #79 // float 0.05f
6: fstore 5
8: ldc #80 // float 0.0615f
10: fstore 6
12: iload_1
13: iload_2
14: iadd
15: iload_3
16: iadd
17: istore 7
19: iload 7
21: ireturn
Note also that, to give the compiler and the ProGuard obfuscator every possible chance to optimise at compile time, we have marked finalMethod() itself as final. This may be considered part of good object design and a high-performance coding habit to use wherever possible, but it is not significant to the performance difference seen in this test. In more general programming, we would likely have marked the entire class as final rather than just this one method where possible, as that one word has the effect of making all methods in the class final.
Generally, static methods and variables should be faster. Oddly, with some combinations of ARM and JVM, instance accesses are slightly faster. For example, when testing with Nokia Asha 305 on Nokia Asha software platform, the loop calling the instance method below was faster.
Below are the tests run on Nokia Asha 305 on Nokia Asha software platform:
for (int i = 0; i < 1000000; i++) {
    staticMethod();
}

for (int i = 0; i < 1000000; i++) {
    nonStaticMethod();
}

private static void staticMethod() {
    b++; // static variable
}

private void nonStaticMethod() {
    a++; // instance variable
}
Average times spent in loops on Nokia Asha 305 on Nokia Asha software platform:
nonStaticMethod: 570 ms
staticMethod: 680 ms, 20% slower
If you are going to concatenate a large number of small Strings, use StringBuffer.append() instead of the String += operator. String concatenation is much slower because every time you concatenate one string to another with the += operator, a new StringBuffer is created under the hood. Depending on the number of concatenations, a single explicit StringBuffer can be many times faster than the multiple implicit StringBuffers created by String addition.
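As a minimal sketch (the names array is illustrative):

String[] names = { "orc", "troll", "goblin" }; // illustrative data
StringBuffer sb = new StringBuffer();          // one explicit buffer
for (int i = 0; i < names.length; i++) {
    sb.append(names[i]); // no hidden StringBuffer created per concatenation
    sb.append(';');
}
String result = sb.toString();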
If you have to copy an array, do not write your own copy method but use System.arraycopy(). It takes advantage of low-level routines to do the work for you and is many times faster than copying the array yourself.
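For example (the array sizes are illustrative):

int[] src = new int[1000];
int[] dst = new int[1000];
// copy all of src into dst, starting at index 0 of both arrays
System.arraycopy(src, 0, dst, 0, src.length);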
Even better than System.arraycopy() is to avoid copying in the first place. If you can alter your algorithm or method parameters to accept the data in its present form, or generate the data in a form which the algorithm can accept without copying, this is much faster still.
With bitshift operations you can achieve a small performance gain, but in cases where the second operand is a constant, ProGuard will do this for you. This was tested with the following loops on Nokia Asha 305 on Nokia Asha software platform:
for (int i = 0; i < 10000000; i++) {
    if (i % 2 == 0) {
        a <<= 1;
    } else {
        a >>= 1;
    }
}

for (int i = 0; i < 10000000; i++) {
    if (i % 2 == 0) {
        a *= 2;
    } else {
        a /= 2;
    }
}
The average times spent in the loops were (obfuscated):
Bitshifting: 3650 (3500) ms
* and / operators: 3820 (3500) ms, 5% slower without obfuscation
So ProGuard automatically optimises the *2 and /2 operations into bit shifts. For readability, if you do not have very many of those operations, it is not worth replacing them with bit shifts yourself (especially when using ProGuard). But remember that ProGuard cannot convert to bit shifts in cases where it does not know the integer power-of-two multiplier at build time. Thus you should use bit shifts explicitly if you are shifting by a variable amount, as in the sketch below, but in other cases you are better served by leaving your math calculations clear and easy to read.
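A minimal sketch of a variable shift amount (the values are illustrative):

int value = 100;
// a shift amount only known at run time; ProGuard cannot fold a multiply here
int shiftBits = (int) (System.currentTimeMillis() % 4L) + 1; // illustrative value
int scaled = value << shiftBits; // same as value * (1 << shiftBits), written explicitly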
It is interesting that only a 5% speed increase was realised in the above test. This reflects the fact that modern ARM processors, with a six-stage pipeline and a lot of silicon dedicated to multiplication and division, are surprisingly good at these operations. But it also makes it fairly tricky to generalise the performance gain from simple tests like this to more complex real-world code with several other branch and math operations going on in close proximity to the bit shift. For example, several bit shifts in a row may give different results from this test, but in no case will a bit shift be slower than multiplication or division.
Before starting to iterate through a loop, try to calculate as many things as possible in advance to avoid redundant calculation. Also avoid calling slow external variable accessors multiple times if their values will not change and a local copy will suffice.
Wrong way:
for (int i = 0; i < monsters.size(); i++) {
    if (((Monster) monsters.elementAt(i)).getTarget().getType() == TROLL) {
        ...
    }
    ((Monster) monsters.elementAt(i)).decreaseHealth(10);
}
Faster and clearer:
final int size = monsters.size();
for (int i = 0; i < size; i++) {
    final Monster monster = (Monster) monsters.elementAt(i);
    if (monster.getTarget().getType() == TROLL) {
        ...
    }
    monster.decreaseHealth(10);
}
Do not continue iterating the loop to end if you can stop earlier. For example, if you want to traverse a list to see if any of the elements has some attribute, you can terminate it right after finding an element with that attribute.
boolean anyMonsterAlive = false;
int size = monsters.size();
for (int i = 0; i < size; i++) {
    Monster monster = (Monster) monsters.elementAt(i);
    if (monster.isAlive()) {
        anyMonsterAlive = true;
        break; // terminate here
    }
}
If there are a large number of elements, you can save a lot of time. Carefully examining and perhaps purposefully arranging patterns in the data you access in the loop (for example sorting and indexing) may further accelerate loop exits.
If you have a fixed number of objects to store, use an array, as it is always the fastest structure. If the number of objects in the list changes often, both an array and a Vector are possible, depending on the pattern of change. If you use an array, you have to create it at the maximum size and keep track of the actual number of used elements (this allocates memory that is unneeded most of the time). Alternatively, you can resize the array, but that is very slow with a large number of elements. If you use a Vector, adding and removing elements is fast and simple, but accessing elements is slower.
When creating a Vector, if you know the exact or even a rough number of elements to be added, you can make the creation faster by giving the initial size as a parameter:
Vector v = new Vector(600);
Below are some test results measuring array and Vector performance on Nokia Asha 305 on Nokia Asha software platform:
                          | Array  | Vector
Creating 50,000 elements* | 190 ms | 370 ms (280 ms if initial size is given as parameter)
Accessing all elements    | 9 ms   | 90 ms
* Extending an array is very slow. For example, extending an empty array one element at a time (with new Objects) to a length of 1000 took 120 milliseconds.
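A sketch of why this is slow: each extension allocates a new array and copies the whole old one (the helper method is illustrative):

static Object[] extendByOne(Object[] old) {
    Object[] bigger = new Object[old.length + 1];    // new allocation every time
    System.arraycopy(old, 0, bigger, 0, old.length); // full copy every time
    return bigger;
}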
We compared a number of math operations on a Nokia 311 to see the relative performance of various operations and argument and variable types.
Integer vs. Long
long m = 0;
int j = 11;
for (int i = 0; i < 1000000; i++) {
    m = m + j;
}
                       | m is long | m is int
Basic addition         | 31 ms     | 21 ms
j is final             | 31 ms     | 20 ms
j is static, m = m + j | 35 ms     | 20 ms
j is static, m += j    | 31 ms     | 20 ms
In some cases it appears the JVM or the ARM11 processor optimises m = m + j to run at the same speed as m += j, but this cannot be relied on, as it depends on the variable type and also on other activity in the processor pipeline.
If we turn on ProGuard, we get 0 ms for all tests. ProGuard optimises the entire loop and test away at compile time.
By changing the test to use the result of the calculation, we can prevent ProGuard from eliminating the loop. We can also verify by our tests that ProGuard changes m = m + j to the faster m += j form automatically. Thus you can use whichever style of calculation you find most readable and expect it to be very efficient in your obfuscated final release builds.
Now if we modify the loop to calculate with the loop counter i instead of j, it slows down slightly, because we are no longer referring to an unchanging variable but to one which changes on every pass through the loop. This apparently increases load on the ARM11 six-stage pipeline, possibly preventing the variable i from being optimised into register space for the loop. ProGuard is again shown to improve the calculation speed by folding the statement into a more efficient math operation.
                       | m is long | m is int
m = m + i              | 37 ms     | 22 ms
m += i                 | 33 ms     | 21 ms
m = m + i, ProGuard on | 32 ms     | 21 ms
m += i, ProGuard on    | 33 ms     | 21 ms
Turning off “compile with optimisation” in javac had no effect on these results, with ProGuard on or off. These tests are presumably too simple to be optimised at the javac level. The flag may still help in more complex logic, and since it does no harm, we may as well leave it turned on, even if these optimisations are likely a redundant subset of what ProGuard does on the resulting bytecode.
Addition vs. Multiplication vs. Division
Division operations are simply slower at the hardware level, and for that reason the division loop takes more time. If there is a lot of calculation in a loop, try to avoid division operations.
Performance differences between addition, multiplication, and division operations were measured with the following loops (on Nokia Asha 305 on Nokia Asha software platform).
for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;
    c += a;
    a = b + c;
}

for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;
    c *= a;
    a = b * c;
}

for (int i = 0; i < 500000; i++) {
    a = 1.23f;
    b = 1.45f;
    c = 0.004523f;
    c /= a;
    a = b / c;
}
Average times spent in the loops:
Multiplication: 330 ms
Addition: 360 ms, 9% slower
Division: 560 ms, 70% slower
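One common way to avoid division inside a loop is to hoist it out and multiply by a precomputed reciprocal; a minimal sketch (the data and scale are illustrative, and note that the float result may differ in the last bits of precision):

float[] data = new float[500000];    // illustrative data
float scale = 3.7f;
final float invScale = 1.0f / scale; // one division, outside the loop
for (int i = 0; i < data.length; i++) {
    data[i] *= invScale;             // multiply instead of divide
}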
Lookup tables are a way to decrease calculation time in a loop by pre-calculating and storing the values in an array. For example, we can pre-calculate the x-y chart below to speed up later drawing.
final int[] line = new int[230];
for (int i = 0; i < line.length; i++) {
    line[i] = 20 + i / 3; // Fill lookup table
}
...
for (int x = 9; x < 230; x += 10) {
    // Now we avoid recalculating y = 20 + x / 3
    final int y = line[x];
    g.drawLine(x, y, x + 2, y + 2);
    g.drawLine(x, y + 2, x + 2, y);
}
For another example of a lookup table in the pureDownscale() routine, see Case Example: Image Resize.
This can be a bit confusing, and you are advised to seek a complete treatment of fixed point math. The short summary is that if you need fractions, but only within a limited range of accuracy, consider using bit-shifted integers instead. This trick is commonly used in 3D graphics to increase performance; you can also use it, for example, in a game to perform physics calculations with acceptable accuracy but much more quickly.
For example, if floating point with accuracy of 0.125 (1/8, 3 bits) is sufficient, you can replace floating point operations with fixed point integer math as follows:
// original calculation
float a = 4.1f;
float b = 3.3f;
float c = b / a; // floating point divide is very slow
Can be replaced with:
// fixed point calculation
int a_fp = (int) (4.1f * 8);   // fixed point, 3 bit fractions
int b_fp = (int) (3.3f * 8);   // same: 3 bits right of decimal
int c_fp = (b_fp << 3) / a_fp; // fast integer divide; the numerator is shifted up
                               // 3 bits first so the result keeps its fraction bits
float c = c_fp / 8.0f;         // convert back to float, unshift 3 bits
For example, if you are doing physics engine calculations in a game, fixed point calculations are often sufficiently accurate and much faster than using float or double.
This is a rarely used technique, but worth understanding. You can, for example, perform four byte-sized additions for the cost of one integer addition if you treat each of the 4 bytes of an int as a separate number. Parallel math requires very precise ranging of your math operations to avoid overflow, where one byte would carry into a neighbouring byte. This is done by masking the values to reduce accuracy such that no overflow can occur.
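A minimal sketch of the idea (this routine illustrates the technique and is not code from the original text): the high bit of each byte is masked off so that no carry can cross a byte boundary, then restored with an XOR.

// adds four packed bytes in parallel with a single integer addition
static int addBytesParallel(int x, int y) {
    // clear the top bit of every byte so carries stay inside their own byte
    int sum = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F);
    // put the top bits back; XOR adds them without producing a carry
    return sum ^ ((x ^ y) & 0x80808080);
}

For example, addBytesParallel(0x01020304, 0x01010101) returns 0x02030405.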
For a complete, real world example of fixed point math, see Case Example: Image Resize.
Switch vs. If
The switch statement in C is implemented as a direct jump, which is extremely fast. In Java on Nokia Series 40 phones, switches are implemented at the bytecode level as a series of if statements. Therefore, in many cases a switch statement is less efficient than a manually created series of if..else statements in which the first case tested is the one which occurs most frequently. If you prefer to use switch statements for code clarity, then arrange them so that the most frequent cases appear first.
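A minimal sketch of ordering tests by frequency (the event constants and handler methods are illustrative):

// EVENT_MOVE occurs far more often than the others, so it is tested first
if (event == EVENT_MOVE) {
    handleMove();
} else if (event == EVENT_FIRE) {
    handleFire();
} else if (event == EVENT_PAUSE) {
    handlePause();
}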