Dmitriy Vovk - Learn iOS Game Optimization. Ultimate Guide

Learn iOS Game Optimization. Ultimate Guide

by Dmitriy Vovk

Want to achieve the same level of
technology speed? Welcome!
Image is used without any permissions 

What you might know
• Batch, Batch, Batch!
http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergrap
papers/BatchBatchBatch.pdf

• Render from one thread only
• Avoid synchronizations:
1. glFlush/glFinish;
2. Querying GL states;
3. Accessing render targets;

What you might know
• Pixel-perfect HSR (Hidden Surface
Removal)…
• …but you still need to sort opaque
geometry!
• Avoid alpha test. Use alpha
blend instead

What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on
CPU!
• Prefer the stencil test to the scissor test.
– The stencil test is performed in hardware on
PowerVR GPUs, resulting in dramatically
better performance.
– A stencil can have any shape, unlike the
rectangular scissor
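
CPU-side culling can be as simple as testing a bounding sphere against the frustum planes before issuing a draw call. The following is a minimal sketch, not the author's code: the `Plane` and `Sphere` types and the `sphereIsVisible` function are hypothetical names, and the planes are assumed to have normals pointing into the visible volume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plane: points with nx*x + ny*y + nz*z + d >= 0 are on the visible side. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Hypothetical bounding sphere of a mesh, in the same space as the planes. */
typedef struct { float x, y, z, radius; } Sphere;

static float signedDistance(const Plane *p, const Sphere *s)
{
    return p->nx * s->x + p->ny * s->y + p->nz * s->z + p->d;
}

/* Returns 0 when the sphere lies completely behind any plane, 1 otherwise. */
int sphereIsVisible(const Plane *planes, size_t planeCount, const Sphere *s)
{
    assert(planes != NULL && s != NULL);
    for (size_t i = 0; i < planeCount; ++i) {
        if (signedDistance(&planes[i], s) < -s->radius) {
            return 0; /* fully outside this plane: skip the draw call */
        }
    }
    return 1; /* inside or intersecting: submit it */
}
```

A mesh whose sphere fails the test never reaches the GPU, so its vertices are never processed.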

• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before
visibility for the current fragment can be determined. This
removes the benefits of HSR.
o Even more! If shader code contains discard, then any
geometry rendered with this shader will suffer from the alpha
test drawbacks. Even if this keyword is under a condition,
the USSE assumes that the condition may be hit.
o Move discard into a separate shader.
o Draw opaque geometry first, then alpha-tested geometry, and
alpha-blended geometry at the end.
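
The draw order above can be enforced by sorting the render queue by pass before submitting it. This is a minimal sketch under stated assumptions: `RenderPass`, `DrawCall` and `sortDrawCalls` are hypothetical names, and a real queue would carry mesh and material handles instead of a bare `meshId`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pass categories, numbered in the order they should be drawn. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TEST = 1,
    PASS_ALPHA_BLEND = 2
} RenderPass;

/* Hypothetical draw-call record. */
typedef struct {
    RenderPass pass;
    int meshId;
} DrawCall;

static int compareByPass(const void *a, const void *b)
{
    const DrawCall *da = (const DrawCall *)a;
    const DrawCall *db = (const DrawCall *)b;
    return (int)da->pass - (int)db->pass;
}

/* Reorders the queue: opaque first, then alpha tested, then alpha blended. */
void sortDrawCalls(DrawCall *calls, size_t count)
{
    if (count == 0) {
        return;
    }
    assert(calls != NULL);
    qsort(calls, count, sizeof(DrawCall), compareByPass);
}
```

Note that `qsort` is not stable, so the relative order of calls within one pass is not preserved by this sketch.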

What you might know
• Bandwidth matters
1. Use constant color per object, instead of
per vertex
2. Simplify your models. Use smaller data
types.
3. Use indexed triangles or non-indexed
triangle strips
4. Use VBO instead of client arrays
5. Use VAO

– The VAO implementation on iOS 4.0,
at least, actually hurt
performance
– VBOs are allocated in multiples of the
4KB page size. Be aware of that.
A large number of small VBOs can
fragment and waste your
memory.
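
To get a feel for the cost of many small VBOs, the rounding to 4KB pages can be modelled on the CPU. This is only an estimate based on the page size quoted above; `vboAllocatedSize`, `vboWastedBytes` and `vboTotalWaste` are hypothetical helper names, not driver APIs.

```c
#include <assert.h>
#include <stddef.h>

#define VBO_PAGE_SIZE ((size_t)4096) /* 4KB granularity quoted on the slide */

/* Size the allocation is rounded up to, assuming whole 4KB pages. */
size_t vboAllocatedSize(size_t requestedBytes)
{
    return (requestedBytes + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes lost to rounding for a single VBO. */
size_t vboWastedBytes(size_t requestedBytes)
{
    return vboAllocatedSize(requestedBytes) - requestedBytes;
}

/* Total bytes lost to rounding for a set of VBO sizes. */
size_t vboTotalWaste(const size_t *sizes, size_t count)
{
    size_t total = 0;
    assert(sizes != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        total += vboWastedBytes(sizes[i]);
    }
    return total;
}
```

Under this model a 256-byte VBO still occupies a full page, so merging small meshes into one larger VBO recovers most of the lost space.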

• Updating your VBO data each frame:
1. glBufferSubData that updates a big part of the
original data harms performance. Try not to
update a buffer that is currently in use.
2. glBufferData that completely overwrites the original
data is OK. The old data will be orphaned by the driver and
storage for the new data will be allocated.
3. glMapBuffer with a triple-buffered VBO is the preferred
way to update your data.
4. EXT_map_buffer_range (iOS 6 only), when you need to
update only a subset of a buffer object.

int bufferID = 0; // initialization
for (int i = 0; i < 3; ++i) // only allocate storage for 3 VBOs, do not upload any data
{
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
// bufferSize is the buffer size in bytes, assumed to be defined elsewhere
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
}
//...
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here
glUnmapBufferOES(GL_ARRAY_BUFFER);
++bufferID;
if (bufferID == 3) // cycling through 3 buffers
{
bufferID = 0;
}

• This scheme gives you the best
performance possible: the CPU is not blocked by
the GPU (or vice versa), there are no redundant memcpy
operations, and CPU load is lower. The cost is extra
memory (note that you need no
extra temporary buffer to store your data
before sending it to the VBO).

[Timeline diagram: update(1), draw(1), GPU working]

• The float type is native to the GPU
• …that means any other type will be
converted to float by the USSE
• …resulting in a few additional cycles
• Thus it's your choice in the tradeoff
between bandwidth/storage and
additional cycles

What you might know
• Use interleaved vertex data
– Align each vertex attribute to a 4-byte
boundary

• Why do you have to do this?!
– You don't. The driver can do it for
you
– …resulting in slower performance.
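
One way to keep an interleaved layout aligned is to add explicit padding and check the offsets at build or start-up time. The layout below is a hypothetical example (position, packed normal, 16-bit texture coordinates), not one taken from the slides, and `vertexLayoutIsAligned` is an assumed helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex with every attribute on a 4-byte boundary. */
typedef struct {
    float   position[3]; /* 12 bytes, offset 0 */
    int8_t  normal[3];   /* 3 bytes, offset 12 */
    uint8_t normalPad;   /* explicit padding so texCoord starts at offset 16 */
    int16_t texCoord[2]; /* 4 bytes, offset 16 */
} Vertex;

static int isAligned4(size_t offset)
{
    return (offset % 4) == 0;
}

/* Returns 1 when every attribute offset and the stride are multiples of 4. */
int vertexLayoutIsAligned(void)
{
    return isAligned4(offsetof(Vertex, position)) &&
           isAligned4(offsetof(Vertex, normal)) &&
           isAligned4(offsetof(Vertex, texCoord)) &&
           isAligned4(sizeof(Vertex));
}

/* Call once at start-up to catch layout regressions early. */
void checkVertexLayout(void)
{
    assert(vertexLayoutIsAligned());
}
```

The `offsetof` values are what you would pass as the pointer argument of `glVertexAttribPointer`, with `sizeof(Vertex)` as the stride.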

What you might know
• Split your vertex data into two parts:
1. Static VBO – the one that will never be
changed
2. Dynamic VBO – the one that needs to
be updated frequently
• Split your vertex data into several VBOs
when several meshes share the same set
of attributes

What you might know
• Bandwidth matters
1. Use lower precision formats, e.g.
RGB565
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture
cache efficiency and quality.

• iOS OpenGL ES drivers from version 4.0
up to (but not including) 6.0 have a bug: they ALWAYS
reserve memory for mipmaps, regardless of
whether you requested them or
not. And you don't need mipmaps for 2D
graphics.
• …but there is one workaround – make
your textures NPOT (non-power of two).

• NPOT textures work only with the
GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable, they give you the best
performance possible
• Use NPOT textures with dimensions that are multiples of
32 pixels for best performance
• The driver will pad the data of your NPOT texture to
match the size of the closest POT values.
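
The padding cost of an NPOT texture can be estimated by rounding each dimension up to the next power of two. This is a rough CPU-side model of the behaviour described above, not a query of what the driver actually allocates; `nextPowerOfTwo`, `paddedTexels` and `isMultipleOf32` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two that is >= v. */
uint32_t nextPowerOfTwo(uint32_t v)
{
    uint32_t p = 1;
    assert(v > 0 && v <= 0x80000000u);
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Number of texels after both dimensions are padded to a power of two. */
uint32_t paddedTexels(uint32_t width, uint32_t height)
{
    return nextPowerOfTwo(width) * nextPowerOfTwo(height);
}

/* Checks the "multiple of 32 pixels" recommendation for one dimension. */
int isMultipleOf32(uint32_t v)
{
    return (v % 32u) == 0;
}
```

For a 480x320 texture this model gives 512x512 = 262144 padded texels against 153600 actual ones.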

• Why do I have to use PVRTC? It looks
ugly!
1. PVRTC provides great compression,
resulting in smaller texture size,
better cache efficiency, saved bandwidth and
decreased power consumption
2. PVRTC stores pixel data in the GPU's native
order, i.e. BGRA instead of RGBA
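
The size difference is easy to quantify. The sketch below uses the data-size formula from the IMG_texture_compression_pvrtc extension (4 bpp blocks cover at least 8x8 texels, 2 bpp blocks at least 16x8); the function names `pvrtcDataSize` and `rgba8DataSize` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t maxSize(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Size in bytes of one PVRTC mip level; bitsPerPixel must be 2 or 4. */
size_t pvrtcDataSize(size_t width, size_t height, int bitsPerPixel)
{
    size_t minWidth;
    assert(bitsPerPixel == 2 || bitsPerPixel == 4);
    minWidth = (bitsPerPixel == 4) ? 8 : 16;
    return (maxSize(width, minWidth) * maxSize(height, 8) * (size_t)bitsPerPixel + 7) / 8;
}

/* Size in bytes of one uncompressed RGBA8 mip level. */
size_t rgba8DataSize(size_t width, size_t height)
{
    return width * height * 4;
}
```

A 1024x1024 texture takes 4 MB as RGBA8, 512 KB as 4 bpp PVRTC and 256 KB as 2 bpp PVRTC.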

• BGRA vs RGBA
1. RGBA:
• Requires pixel data to be shuffled by driver into
BGRA
• Has options for RGB422, RGB565, RGBA4444,
RGBA5551
2. BGRA:
• Stores data in GPU’s native order
• Offers only BGRA8888 for upload and
BGRA8888, BGRA5551, BGRA4444 for ReadPixels

• Prefer OES_texture_half_float to
OES_texture_float
• A texture read fetches only 32 bits per texel, so an
RGBA float texture needs 4 texture reads
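
Taking the 32 bits per read figure above at face value, the number of reads per texel follows directly from the texel size. This is a back-of-the-envelope model only; `textureReadsPerTexel` is a hypothetical helper, and real hardware behaviour may differ.

```c
#include <assert.h>

/* Reads needed per texel if each read fetches 32 bits (assumption from the text above). */
unsigned textureReadsPerTexel(unsigned channels, unsigned bitsPerChannel)
{
    unsigned bitsPerTexel;
    assert(channels >= 1 && channels <= 4);
    bitsPerTexel = channels * bitsPerChannel;
    return (bitsPerTexel + 31u) / 32u; /* round up to whole reads */
}
```

Under this model an RGBA float texel (128 bits) costs 4 reads, an RGBA half-float texel (64 bits) costs 2, and an RGBA8 texel costs 1.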

What you might know
• Prefer multitexturing instead of
multiple passes
• Configure texture parameters before
feeding image data to driver

• Texture uploading to the GPU is a
mess!
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple and fast, right?

• …NO!

void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture

LoadTexture(textureName, buf); // hypothetical loader, assumed to decode the image into buf

glBindTexture(GL_TEXTURE_2D, textureID);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);

// buf is copied into an internal buffer created by the driver (that's obvious)

free(buf); // because the buffer can be freed immediately after glTexImage2D

glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);

// the driver will do some additional work to fully upload the texture the first time it is actually used!

• Textures are finally uploaded only the first time
they are used. So draw them off screen immediately after
glTexImage2D
• A lot of redundant work!

• Jedi way to upload textures:
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping

glBindTexture(GL_TEXTURE_2D, textureID);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);

// the mapped data is copied into an internal buffer created by the driver (that's obvious)

glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);

// the driver will do some additional work to fully upload the texture the first time it is actually used!

munmap(ptr, TEXTURE_SIZE);

• File mapping does not copy your file data into RAM! It
loads file data page by page, when it's accessed.
• Thus we eliminated one redundant copy, dramatically
decreased texture upload time and decreased memory
fragmentation
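
The mapping step on its own, with error handling, might look like the sketch below. It assumes the file holds raw, ready-to-upload pixel data; `MappedFile`, `mapFileReadOnly` and `unmapFile` are hypothetical names, while `open`, `fstat`, `mmap` and `munmap` are the standard POSIX calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only file mapping. */
typedef struct {
    void  *data;
    size_t size;
} MappedFile;

/* Maps the whole file read-only. Returns 0 on success, -1 on failure. */
int mapFileReadOnly(const char *path, MappedFile *out)
{
    struct stat st;
    void *ptr;
    int fd;

    assert(path != NULL && out != NULL);
    out->data = NULL;
    out->size = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (ptr == MAP_FAILED) {
        return -1;
    }
    out->data = ptr;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Releases the mapping; safe to call on an already released handle. */
void unmapFile(MappedFile *file)
{
    assert(file != NULL);
    if (file->data != NULL) {
        munmap(file->data, file->size);
        file->data = NULL;
        file->size = 0;
    }
}
```

The `data` pointer would be passed to glTexImage2D in place of a malloc'ed buffer, and `unmapFile` called once the upload call has returned.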

• Always use glClear at the beginning
of the frame…
• … and EXT_discard_framebuffer at
the end.
• PVR GPUs have a fast on-chip
depth buffer for each tile. If you
forget to clear/discard the depth buffer, it
will be transferred between on-chip memory and system memory
What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Do not use discard. If you must, place the discard
instruction as early as possible to
avoid useless computations

• Code inside a dynamic branch (its
condition is evaluated against a value
calculated in the shader) will be
executed anyway, and then its result will be
discarded if the condition is false

• highp – a 32-bit floating point value
• mediump – a 16-bit floating point
value in the range [-65520, 65520]
• lowp – a 10-bit fixed point value in the range [-2,
2] with a step of 1/256
• Try to give the same precision to all your
operands, because conversion takes some
time
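
To see what lowp does to a value, its range and step can be modelled on the CPU. This is only an approximation built from the figures above (clamp to [-2, 2], round to the nearest 1/256), not a bit-exact emulation of the hardware; `quantizeLowp` is a hypothetical helper.

```c
#include <assert.h>

/* Approximates storing x in a lowp variable: clamp to [-2, 2], snap to steps of 1/256. */
float quantizeLowp(float x)
{
    long steps;
    assert(x == x); /* reject NaN */
    if (x > 2.0f) {
        x = 2.0f;
    }
    if (x < -2.0f) {
        x = -2.0f;
    }
    /* round half away from zero to the nearest multiple of 1/256 */
    steps = (long)(x * 256.0f + (x >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / 256.0f;
}
```

Colour values in [0, 1] survive this well, while texture coordinates or world-space positions quickly lose precision or clamp, which is why lowp suits colours only.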

• highp values are calculated on a scalar
processor on USSE1 only:
highp vec4 v1, v2;

highp float s1, s2;

// Bad

v2 = (v1 * s1) * s2;

// The scalar processor executes v1 * s1 (4 operations), then multiplies the result
// by s2 on the scalar processor again (4 additional operations)

// Good

v2 = v1 * (s1 * s2);

// s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the
// scalar processor

What you might know
• Typical CPU found in iOS devices:
1. ARMv7 architecture
2. Cortex A8 / Cortex A9 / custom Apple
cores
3. 600 – 1300 MHz
4. 1-2 cores
5. Thumb-2 instruction set

• ARMv7 has no hardware support for
integer division
• VFPv3 FPU / VFPv4 on Apple A6 (rumored)
• NEON SIMD engine
• Unaligned access is done in software on
Cortex A8. That means it is a hundred times
slower
• Cortex A8 is an in-order CPU. Cortex A9+ are
out-of-order

• The Cortex A9 core has a full VFPv3 FPU,
while the Cortex A8 has VFPLite. That
means that float operations take 1
cycle on A9 and 10 cycles on A8!

• NEON – 16 registers, each 128 bits wide.
Supports operations on 8-, 16-, 32- and
64-bit integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning on ES 1.1;
– As a general vertex processor;
– Other, typical, applications for SIMD.

• The USSE1 architecture is scalar, NEON is
vector by nature. Move your vertex
processing from the GPU to the CPU to
speed up calculations*
• ???????
• PROFIT!!!111

• *NOTE. That doesn’t apply to USSE2 hardware

• The weakest side of mobile GPUs is fill
rate. Fill rate is quickly killed by
blending. 2D games are heavy on this.
The PowerVR USSE engine doesn't care what
it does – vertex or fragment processing.
Moving your vertex processing to the CPU
(NEON) will leave more room for
fragment processing. It will have more
effect on USSE1, the scalar hardware.

• There are 3 ways to use the NEON engine
in your code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize
-mllvm -bb-vectorize-aligned-only to
Other C Flags in project settings and you
are ready to go.


• Intrinsics:

• Assembly:

• Summary:
                      Running time, ms    CPU usage, %
Intrinsics            2764                19
Assembly              3664                20
FPU                   6209                25-28
FPU autovectorized    5028                22-24
• Intrinsics got me a 25%
speedup over
assembly. Let's see the code!
• Note that the speed of intrinsics code varies from
compiler to compiler.

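#include <arm_neon.h>

__attribute__((always_inline)) void Matrix4ByVec4(const
float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec,
float32x4_t* __restrict__ result)
{
// result = mat * vec: accumulate each matrix column scaled by the matching vector lane
(*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
(*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
(*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
(*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}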

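__attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
{
#ifdef INTRINSICS
// Each result column i is m1 * (column i of m2), built up one lane at a time
(*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
(*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
(*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
(*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

(*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
(*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
(*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
(*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

(*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
(*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
(*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
(*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

(*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
(*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
(*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
(*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
#endif
}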

__asm__ volatile
(
// load modelView into q0-q3 and proj into q8-q11
"vldmia %6, { q0-q3 } \n\t"
"vldmia %0, { q8-q11 }\n\t"
// q12-q15 = proj * modelView, one column per register
"vmul.f32 q12, q8, d0[0]\n\t"
"vmul.f32 q13, q8, d2[0]\n\t"
"vmul.f32 q14, q8, d4[0]\n\t"
"vmul.f32 q15, q8, d6[0]\n\t"
"vmla.f32 q12, q9, d0[1]\n\t"
"vmla.f32 q13, q9, d2[1]\n\t"
"vmla.f32 q14, q9, d4[1]\n\t"
"vmla.f32 q15, q9, d6[1]\n\t"
"vmla.f32 q12, q10, d1[0]\n\t"
"vmla.f32 q13, q10, d3[0]\n\t"
"vmla.f32 q14, q10, d5[0]\n\t"
"vmla.f32 q15, q10, d7[0]\n\t"
"vmla.f32 q12, q11, d1[1]\n\t"
"vmla.f32 q13, q11, d3[1]\n\t"
"vmla.f32 q14, q11, d5[1]\n\t"
"vmla.f32 q15, q11, d7[1]\n\t"
// load the 4 vertices into q0-q3 and transform each by the combined matrix
"vldmia %1, { q0-q3 } \n\t"
"vmul.f32 q8, q12, d0[0]\n\t"
"vmul.f32 q9, q12, d2[0]\n\t"
"vmul.f32 q10, q12, d4[0]\n\t"
"vmul.f32 q11, q12, d6[0]\n\t"
"vmla.f32 q8, q13, d0[1]\n\t"
"vmla.f32 q8, q14, d1[0]\n\t"
"vmla.f32 q8, q15, d1[1]\n\t"
"vmla.f32 q9, q13, d2[1]\n\t"
"vmla.f32 q9, q14, d3[0]\n\t"
"vmla.f32 q9, q15, d3[1]\n\t"
"vmla.f32 q10, q13, d4[1]\n\t"
"vmla.f32 q10, q14, d5[0]\n\t"
"vmla.f32 q10, q15, d5[1]\n\t"
"vmla.f32 q11, q13, d6[1]\n\t"
"vmla.f32 q11, q14, d7[0]\n\t"
"vmla.f32 q11, q15, d7[1]\n\t"
// store the transformed vertices to v1..v4
"vstmia %2, { q8 }\n\t"
"vstmia %3, { q9 }\n\t"
"vstmia %4, { q10 }\n\t"
"vstmia %5, { q11 }"
:
: "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
: "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);

• For a detailed explanation of
intrinsics/assembly see:
http://infocenter.arm.com/help/index.jsp?com.arm.doc.dui0491e/CIHJBEFE.html

Contact me

http://www.linkedin.com/in/dvovk/
http://nukecode.blogspot.com/
