Intrinsics and other micro-optimizations
Egor Bogatov
Engineer at Microsoft
Agenda
Useful micro-optimizations
Pitfalls for external contributors
Intrinsics & SIMD with examples
.NET Core 3.0-x features
Prefer the Span API where possible
// Substring version: every call allocates a new string
var str = "EGOR 3.14 1234 7/3/2018";
string name = str.Substring(0, 4);
float pi = float.Parse(str.Substring(5, 4));
int number = int.Parse(str.Substring(10, 4));
DateTime date = DateTime.Parse(str.Substring(15, 8));
Allocated on heap: 168 bytes
// Span version: slices are views over the original string
var str = "EGOR 3.14 1234 7/3/2018".AsSpan();
var name = str.Slice(0, 4);
float pi = float.Parse(str.Slice(5, 4));
int number = int.Parse(str.Slice(10, 4));
DateTime date = DateTime.Parse(str.Slice(15, 8));
Allocated on heap: 0 bytes
Allocating a temp array
char[] buffer = new char[count];
Allocating a temp array
Span<char> span = new char[count];
Allocating a temp array
Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    new char[count];
Allocating a temp array
Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    ArrayPool<char>.Shared.Rent(count);
Allocating a temp array - final pattern
char[] pool = null;
Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    (pool = ArrayPool<char>.Shared.Rent(count));
// ... use span here (it may be longer than count, so slice it if the exact length matters) ...
if (pool != null)
    ArrayPool<char>.Shared.Return(pool);
Allocating a temp array – without ArrayPool
Span<char> span = count <= 512 ?
    stackalloc char[512] :
    new char[count];
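A minimal sketch of how the pattern might look inside a complete method. The class name TempBufferDemo, the helper ToUpperViaTempBuffer and the upper-casing work are hypothetical, chosen only to give the buffer something to do. The try/finally is an addition, so that a rented array goes back to the pool even if the work throws.

```csharp
using System;
using System.Buffers;

public static class TempBufferDemo
{
    // Hypothetical helper: upper-cases `input` through a temporary char buffer.
    public static string ToUpperViaTempBuffer(string input)
    {
        int count = input.Length;
        char[] pool = null;

        // Small inputs use the stack; larger ones rent from the shared pool.
        Span<char> buffer = count <= 512
            ? stackalloc char[512]
            : (pool = ArrayPool<char>.Shared.Rent(count));
        try
        {
            // The stack buffer is always 512 chars and Rent may return a longer
            // array than requested, so slice to the length actually needed.
            Span<char> span = buffer.Slice(0, count);
            for (int i = 0; i < count; i++)
                span[i] = char.ToUpperInvariant(input[i]);
            return new string(span);
        }
        finally
        {
            // Only the pooled path has something to give back.
            if (pool != null)
                ArrayPool<char>.Shared.Return(pool);
        }
    }
}
```

Calling TempBufferDemo.ToUpperViaTempBuffer("egor") takes the stackalloc path; a string longer than 512 characters takes the ArrayPool path.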
Optimizing .NET Core: pitfalls
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source is ICollection<TSource> collectionoft)
        return collectionoft.Count;
    if (source is IIListProvider<TSource> listProv)
        return listProv.GetCount(onlyIfCheap: false);
    if (source is ICollection collection)
        return collection.Count;
    if (source is IReadOnlyCollection<TSource> rocollectionoft)
        return rocollectionoft.Count;
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator())
        while (e.MoveNext())
            count++;
    return count;
}
Cost annotations from the slide, in order: ~3 ns, ~3 ns, ~3 ns, ~30 ns, ~10-… ns
Casts are not cheap
object value = new List<string> { };
var t0 = (List<string>)value;
var t1 = (ICollection<string>)value;
var t2 = (IList)value;
var t3 = (IEnumerable<string>)value;
// Covariant interfaces:
public interface IEnumerable<out T>
public interface IReadOnlyCollection<out T>
// Covariance allows, for example:
IEnumerable<object> a = new List<string> {..};
Cast to covariant interface – different runtimes
Benchmarked code:
return ((IReadOnlyCollection<string>)_smallArray).Count;
      Method |     Runtime |    Mean | Scaled |
------------:|------------:|--------:|-------:|
CastAndCount |    .NET 4.7 | 78.1 ns |    6.7 |
CastAndCount | .NET Core 3 | 42.9 ns |    3.7 |
CastAndCount |      CoreRT | 11.6 ns |    1.0 |
CastAndCount |        Mono |  6.7 ns |    0.6 |
.NET Core: bounds check
Bounds check
public static double SumSqrt(double[] array)
{
    double result = 0;
    for (int i = 0; i < array.Length; i++)
    {
        result += Math.Sqrt(array[i]);
    }
    return result;
}
Bounds check
public static double SumSqrt(double[] array)
{
    double result = 0;
    for (int i = 0; i < array.Length; i++)
    {
        // Conceptually, every array[i] access carries this implicit check:
        if (i >= array.Length)
            throw new ArgumentOutOfRangeException();
        result += Math.Sqrt(array[i]);
    }
    return result;
}
Bounds check eliminated!
public static double SumSqrt(double[] array)
{
    double result = 0;
    // The loop condition i < array.Length already proves that array[i] is in range
    for (int i = 0; i < array.Length; i++)
    {
        result += Math.Sqrt(array[i]);
    }
    return result;
}
Bounds check: tricks
public static void Test1(char[] array)
{
    array[0] = 'F';
    array[1] = 'a';
    array[2] = 'l';
    array[3] = 's';
    array[4] = 'e';
    array[5] = '.';
}
Bounds check: tricks
public static void Test1(char[] array)
{
    array[5] = '.';
    array[0] = 'F';
    array[1] = 'a';
    array[2] = 'l';
    array[3] = 's';
    array[4] = 'e';
}
Bounds check: tricks
public static void Test1(char[] array)
{
    if (array.Length > 5)
    {
        array[0] = 'F';
        array[1] = 'a';
        array[2] = 'l';
        array[3] = 's';
        array[4] = 'e';
        array[5] = '.';
    }
}
Bounds check: tricks
public static void Test1(char[] array)
{
    if ((uint)array.Length > 5)
    {
        array[0] = 'F';
        array[1] = 'a';
        array[2] = 'l';
        array[3] = 's';
        array[4] = 'e';
        array[5] = '.';
    }
}
Bounds check: tricks – CoreCLR sources:
// Boolean.cs
public bool TryFormat(Span<char> destination, out int charsWritten)
{
    if (m_value)
    {
        if ((uint)destination.Length > 3)
        {
            destination[0] = 'T';
            destination[1] = 'r';
            destination[2] = 'u';
            destination[3] = 'e';
            charsWritten = 4;
            return true;
        }
    }
    // ... (the slide shows only the "True" branch)
}
.NET Core: Intrinsics & SIMD
Intrinsics
• Recognize patterns
• Replace methods (usually marked with [Intrinsic])
• System.Runtime.Intrinsics
// Pattern recognition: the rotate idiom compiles to a single rol
private static uint Rotl(uint value, int shift)
{
    return (value << shift) | (value >> (32 - shift));
}
mov eax,dword ptr [rcx+8]
mov ecx,dword ptr [rcx+0Ch]
rol eax,cl
ret
// Method replacement: the JIT emits vroundsd instead of calling the managed body
[Intrinsic]
public static double Round(double a)
{
    double flrTempVal = Floor(a + 0.5);
    if ((a == (Floor(a) + 0.5)) && (FMod(flrTempVal, 2.0) != 0))
        flrTempVal -= 1.0;
    return copysign(flrTempVal, a);
}
cmp dword ptr [rcx+48h] …
jne M00_L00
vroundsd xmm0,xmm0,mmword ptr …
ret
SIMD
Vector4 result =
    new Vector4(1f, 2f, 3f, 4f) +
    new Vector4(5f, 6f, 7f, 8f);
vmovups xmm0,xmmword ptr [rdx]
vmovups xmm1,xmmword ptr [rdx+16]
vaddps xmm0,xmm0,xmm1
[Diagram: scalar code performs four separate additions (X1 + X2 = X, Y1 + Y2 = Y, Z1 + Z2 = Z, W1 + W2 = W); SIMD adds all four lanes X, Y, Z, W with one instruction]
Meet System.Runtime.Intrinsics
var v1 = new Vector4(1, 2, 3, 4);
var v2 = new Vector4(5, 6, 7, 8);
// Scalar equivalent:
var result = new Vector4(v1.X + v2.X, v1.Y + v2.Y, ...);
// With intrinsics (requires an unsafe context):
var left = Sse.LoadVector128(&v1.X); // Vector128<float>
var right = Sse.LoadVector128(&v2.X);
var sum = Sse.Add(left, right);
Sse.Store(&result.X, sum);
var mulPi = Sse.Multiply(sum, Sse.SetAllVector128(3.14f));
System.Runtime.Intrinsics
• System.Runtime.Intrinsics
    Vector64<T>
    Vector128<T>
    Vector256<T>
• System.Runtime.Intrinsics.X86
    Sse (Sse, Sse2…Sse42)
    Avx, Avx2
    Fma
    …
• System.Runtime.Intrinsics.Arm.Arm64
    Simd
    …
System.Runtime.Intrinsics
public class Sse2 : Sse
{
    public static bool IsSupported => true;
    /// <summary>
    /// __m128i _mm_add_epi8 (__m128i a, __m128i b)
    /// PADDB xmm, xmm/m128
    /// </summary>
    public static Vector128<byte> Add(Vector128<byte> left, Vector128<byte> right);
    /// <summary>
    /// __m128i _mm_add_epi8 (__m128i a, __m128i b)
    /// PADDB xmm, xmm/m128
    /// </summary>
    public static Vector128<sbyte> Add(Vector128<sbyte> left, Vector128<sbyte> right);
    // ...
}
S.R.I.: Documentation
/// <summary>
/// __m128d _mm_add_pd (__m128d a, __m128d b)
/// ADDPD xmm, xmm/m128
/// </summary>
public static Vector128<double> Add(
Vector128<double> left,
Vector128<double> right);
S.R.I.: Usage pattern
if (Arm.Simd.IsSupported)
    DoWorkUsingNeon();
else if (Avx2.IsSupported)
    DoWorkUsingAvx2();
else if (Sse2.IsSupported)
    DoWorkUsingSse2();
else
    DoWorkSlowly();
JIT: each IsSupported check is a constant at JIT time, so only the branch for the current hardware is compiled; the other branches are removed as dead code.
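A small sketch of the IsSupported pattern with a working fallback, assuming .NET Core 3.0 or later. The class SumDemo and the method Sum are hypothetical names. To stay out of unsafe code, this version reinterprets the input with MemoryMarshal.Cast instead of loading through pointers as the slides do; it covers only the SSE2 branch plus the scalar path.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SumDemo
{
    // Hypothetical example: sums ints, four lanes at a time when SSE2 is available.
    // Overflow wraps around, as with ordinary unchecked int arithmetic.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int sum = 0;
        int i = 0;

        if (Sse2.IsSupported)
        {
            // View the data as whole 128-bit vectors; leftover elements are not included.
            ReadOnlySpan<Vector128<int>> vectors =
                MemoryMarshal.Cast<int, Vector128<int>>(values);

            Vector128<int> acc = Vector128<int>.Zero;
            foreach (Vector128<int> v in vectors)
                acc = Sse2.Add(acc, v);

            // Fold the four lanes into the scalar sum.
            for (int lane = 0; lane < 4; lane++)
                sum += acc.GetElement(lane);

            i = vectors.Length * 4;
        }

        // Scalar path: the tail after the vector loop, or everything without SSE2.
        for (; i < values.Length; i++)
            sum += values[i];

        return sum;
    }
}
```

On hardware without SSE2 the first block is skipped and the scalar loop processes the whole input, so the result is the same either way.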
IsSorted(int[]) – simple implementation
bool IsSorted(int[] array)
{
    if (array.Length < 2)
        return true;
    for (int i = 0; i < array.Length - 1; i++)
    {
        if (array[i] > array[i + 1])
            return false;
    }
    return true;
}
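IsSorted(int[]) – optimized with SSE4.1
unsafe bool IsSorted_Sse41(int[] array)
{
    int i = 0;
    // Pinning the array itself (rather than &array[0]) does not throw for an empty array
    fixed (int* ptr = array)
    {
        // Compare four adjacent pairs per iteration
        for (; i < array.Length - 4; i += 4)
        {
            var curr = Sse2.LoadVector128(ptr + i);
            var next = Sse2.LoadVector128(ptr + i + 1);
            var mask = Sse2.CompareGreaterThan(curr, next);
            if (!Sse41.TestAllZeros(mask, mask))
                return false;
        }
    }
    // Scalar tail: the pairs the vector loop did not cover
    for (; i < array.Length - 1; i++)
    {
        if (array[i] > array[i + 1])
            return false;
    }
    return true;
}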
[Diagram: two overlapping 4-element windows over i0 … i5 are compared; the resulting mask, e.g. 0 1 0 0, is tested with _mm_test_all_zeros]
Method | Mean |
---------------- |---------:|
IsSorted | 35.07 us |
IsSorted_unsafe | 21.19 us |
IsSorted_Sse41 | 13.79 us |
Reverse<T>(T[] array), level: student
void Reverse<T>(T[] array)
{
    for (int i = 0; i < array.Length / 2; i++)
    {
        T tmp = array[i];
        array[i] = array[array.Length - i - 1];
        array[array.Length - i - 1] = tmp;
    }
}
“1 2 3 4 5 6” => “6 5 4 3 2 1”
Reverse<T>(T[] array), level: CoreCLR developer
// No bounds/covariance checks
void Reverse<T>(T[] array)
{
    ref T p = ref Unsafe.As<byte, T>(ref array.GetRawSzArrayData());
    int i = 0;
    int j = array.Length - 1;
    while (i < j)
    {
        T temp = Unsafe.Add(ref p, i);
        Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
        Unsafe.Add(ref p, j) = temp;
        i++;
        j--;
    }
}
Reverse<T>(T[] array), level: SSE-maniac
int* leftPtr = ptr + i;
int* rightPtr = ptr + len - vectorSize - i;
var left = Sse2.LoadVector128(leftPtr);
var right = Sse2.LoadVector128(rightPtr);
var reversedLeft = Sse2.Shuffle(left, 0x1b); // 0x1b = _MM_SHUFFLE(0,1,2,3)
var reversedRight = Sse2.Shuffle(right, 0x1b);
Sse2.Store(rightPtr, reversedLeft);
Sse2.Store(leftPtr, reversedRight);
LINQ vs SIMD
int max = arrayOfInts.Max();
bool equal = Enumerable.SequenceEqual(arrayOfFloats1, arrayOfFloats2);
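One possible SIMD counterpart to arrayOfInts.Max(), sketched with the portable System.Numerics.Vector<T> type rather than the hardware-specific intrinsics used elsewhere in this deck. The class SimdMax and its method are hypothetical names, and this is an illustration of the idea, not the implementation LINQ or the talk uses.

```csharp
using System;
using System.Numerics;

public static class SimdMax
{
    // Hypothetical example: maximum of an int array, Vector<int>.Count lanes at a time.
    public static int Max(int[] values)
    {
        if (values.Length == 0)
            throw new InvalidOperationException("Sequence contains no elements");

        int width = Vector<int>.Count;
        int max = values[0];
        int i = 0;

        if (Vector.IsHardwareAccelerated && values.Length >= width)
        {
            // Keep a running per-lane maximum.
            var acc = new Vector<int>(values, 0);
            for (i = width; i <= values.Length - width; i += width)
                acc = Vector.Max(acc, new Vector<int>(values, i));

            // Reduce the lanes to a single value.
            for (int lane = 0; lane < width; lane++)
                max = Math.Max(max, acc[lane]);
        }

        // Scalar tail, or the whole array when vectors are not accelerated.
        for (; i < values.Length; i++)
            max = Math.Max(max, values[i]);

        return max;
    }
}
```

For ints the result is identical to the LINQ version; for floats, NaN handling and ordering questions make a drop-in replacement less straightforward.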
Be careful with floats and intrinsics
Fma.MultiplyAdd(x, y, z); // x*y+z
Sse3.HorizontalAdd(x, x);
a (39.33427f) * b (245.2255f) + c (150.424f) =
fmadd: 9796.190
fmul,fadd: 9796.189
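The difference comes from rounding: a fused multiply-add rounds once, while a separate multiply and add round twice. A small sketch for reproducing the comparison, assuming .NET Core 3.0 or later, where MathF.FusedMultiplyAdd is available. The class FmaDemo and the method Compute are hypothetical names, and the exact digits printed can depend on hardware and runtime.

```csharp
using System;

public static class FmaDemo
{
    // Hypothetical example: returns a*b+c computed both ways.
    public static (float Fused, float Separate) Compute(float a, float b, float c)
    {
        // One rounding step at the end.
        float fused = MathF.FusedMultiplyAdd(a, b, c);

        // The product is rounded to float first, then the sum is rounded again.
        float product = a * b;
        float separate = product + c;

        return (fused, separate);
    }
}
```

Calling FmaDemo.Compute(39.33427f, 245.2255f, 150.424f) uses the inputs from the slide; the two results are expected to agree to within the last float digit or so, which is exactly the kind of mismatch that breaks bit-for-bit comparisons between code paths.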
61453.ToString("X"): "F00D"
public static int CountHexDigits(ulong value)
{
    int digits = 1;
    if (value > 0xFFFFFFFF)
    {
        digits += 8;
        value >>= 0x20;
    }
    if (value > 0xFFFF)
    {
        digits += 4;
        value >>= 0x10;
    }
    if (value > 0xFF)
    {
        digits += 2;
        value >>= 0x8;
    }
    if (value > 0xF)
        digits++;
    return digits;
}
// The same result with the LZCNT intrinsic:
return (67 - (int)Lzcnt.LeadingZeroCount(value | 1)) >> 2;
0xF00D = 0000 0000 … 0000 0000 1111 0000 0000 1101
Lzcnt.LeadingZeroCount(0xF00D): 48 for a 64-bit value, so (67 - 48) >> 2 = 4 digits
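A sketch for checking that the two versions agree. As an assumption, it swaps the x86-only Lzcnt class for System.Numerics.BitOperations.LeadingZeroCount (.NET Core 3.0 or later), which falls back to software when the instruction is unavailable. The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class HexDigits
{
    // Branching version, as on the slide.
    public static int CountBranchy(ulong value)
    {
        int digits = 1;
        if (value > 0xFFFFFFFF) { digits += 8; value >>= 0x20; }
        if (value > 0xFFFF) { digits += 4; value >>= 0x10; }
        if (value > 0xFF) { digits += 2; value >>= 0x8; }
        if (value > 0xF) digits++;
        return digits;
    }

    // Leading-zero-count version.
    // 64 - lzcnt is the number of significant bits; adding 3 and shifting right
    // by 2 rounds up to whole hex digits. "value | 1" makes zero count as one digit.
    public static int CountLzcnt(ulong value)
    {
        return (67 - BitOperations.LeadingZeroCount(value | 1)) >> 2;
    }
}
```

Both methods return 4 for 0xF00D, 1 for 0 and 16 for ulong.MaxValue.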
public static unsafe Matrix4x4 operator *(Matrix4x4 value1, Matrix4x4 value2)
{
    // OLD: scalar code for the first row of the result
    m.M11 = value1.M11 * value2.M11 + value1.M12 * value2.M21 + value1.M13 * value2.M31 + value1.M14 * value2.M41;
    m.M12 = value1.M11 * value2.M12 + value1.M12 * value2.M22 + value1.M13 * value2.M32 + value1.M14 * value2.M42;
    m.M13 = value1.M11 * value2.M13 + value1.M12 * value2.M23 + value1.M13 * value2.M33 + value1.M14 * value2.M43;
    m.M14 = value1.M11 * value2.M14 + value1.M12 * value2.M24 + value1.M13 * value2.M34 + value1.M14 * value2.M44;
    // NEW: the same row with SSE. Shuffle with 0x00, 0x55, 0xAA, 0xFF broadcasts element 0, 1, 2, 3 of row
    var row = Sse.LoadVector128(&value1.M11);
    Sse.Store(&value1.M11,
        Sse.Add(Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0x00), Sse.LoadVector128(&value2.M11)),
                        Sse.Multiply(Sse.Shuffle(row, row, 0x55), Sse.LoadVector128(&value2.M21))),
                Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0xAA), Sse.LoadVector128(&value2.M31)),
                        Sse.Multiply(Sse.Shuffle(row, row, 0xFF), Sse.LoadVector128(&value2.M41)))));
    // ... (the slide shows only the first row)
}
Better Matrix4x4 layout:
// Current layout
public struct Matrix4x4
{
    public float M11;
    public float M12;
    public float M13;
    //... 16 float fields
}
// Proposed layout
public struct Matrix4x4
{
    public Vector128<float> Row1;
    public Vector128<float> Row2;
    public Vector128<float> Row3;
    public Vector128<float> Row4;
}
AVX problems
var v1 = Avx.LoadVector256(&m1.M11);
var v2 = Avx.LoadVector256(&m2.M11);
var v3 = Avx.Add(v1, v2);
SSE <-> AVX
Alignment
// Prologue: iterate until data is aligned
for (…)
// Main loop: 100% optimized SIMD operations
for (…) LoadAlignedVector256(i)
// Epilogue: do regular `for` for the rest
for (…)
.NET Core: future
Objects on stack (escape analysis)
public string DoSomething()
{
    // builder never escapes the method
    var builder = new StringBuilder();
    builder.Append(…);
    builder.Append(…);
    return builder.ToString();
}
For Java folks: we have user-defined value-types ;-)
Objects on stack – merged!
Tiered JIT Compilation – enabled by default
• COMPlus_TieredCompilation=1
• COMPlus_TieredCompilation_Tier1CallCountThreshold=30
• Cold methods with hot loops problem
• [MethodImpl(MethodImplOptions.AggressiveOptimization)]
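A sketch of the last bullet: a method that is called once but spends a long time in its loop never reaches the Tier-1 call-count threshold, so one option is to opt it out of tiering. The class HotLoop and the method SumTo are hypothetical; MethodImplOptions.AggressiveOptimization is assumed to be available (.NET Core 3.0 or later).

```csharp
using System.Runtime.CompilerServices;

public static class HotLoop
{
    // Hypothetical "cold method with a hot loop": called rarely, loops a lot.
    // AggressiveOptimization asks the JIT to compile it fully optimized right
    // away instead of starting with quick, unoptimized Tier-0 code.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static long SumTo(int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += i;
        return total;
    }
}
```

The attribute is best reserved for methods that measurements show to be affected, since it also bypasses the startup benefit tiering is meant to provide.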
Loop unrolling (auto-vectorization)
for (uint i = 0; i < 256; ++i)
{
    total += array[i];
}
// Unrolled by 4: 64 iterations, with i advancing by 4 each time
for (uint i = 0; i < 256; i += 4)
{
    total += array[i + 0];
    total += array[i + 1];
    total += array[i + 2];
    total += array[i + 3];
}
And don’t forget - C# has other backends!
• .NET 4.x CLR
• CoreRT
• Mono
    • JIT
    • AOT
    • LLVM (AOT/JIT)
    • Interpreter
• IL2CPP
• Burst
Micro-optimizations are for
• BCL and Runtime
    • Because you expect it to be fast
• Game Dev – 16ms per frame
    • Don’t be CPU-bound
• High-load related libs and apps
• Image/Video processing, DL/ML frameworks
• Silly benchmarks (Go vs C#, Java vs C#)
Egor Bogatov
EgorBo
Thanks!
