4. Parallel Programming: Why? Moore’s Law. Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs must be parallel to get Moore-level speedups. Applies to programming in general.
6. “Waaaah!” “Parallel programming is hard.” “My code already runs incredibly fast – it doesn’t need to go any faster.” “It’s impossible to parallelise this algorithm.” “Only the rendering pipeline needs to be parallel.” “That’s only for supercomputers.”
22. Locking Used to serialise access to code. Like a key to a coffee shop toilet: one key, one toilet, queue for access. Lock()/Unlock(): …code… Lock(); // protected region Unlock(); …more code…
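A minimal C++ sketch of the Lock()/Unlock() pattern on this slide, using std::mutex; the lock and function names are illustrative, not from the talk:

    #include <mutex>

    std::mutex gLock;                 // illustrative shared lock

    void Update()                     // illustrative function
    {
        // ...code...
        gLock.lock();
        // protected region: only one thread at a time executes this
        gLock.unlock();
        // ...more code...
    }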
31. Deadlock “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” — Kansas Legislature. Deadlock can occur when two or more processes each require a resource held by another.
32. Deadlock Generally can be considered to be a logic error. Can be painfully subtle and rare. Thread 1: Lock A, Lock B, Unlock A. Thread 2: Lock B, Lock A, Unlock B.
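A sketch of the lock-ordering deadlock shown on this slide, plus one common fix; the fix and all names are illustrative additions, not from the talk:

    #include <mutex>

    std::mutex lockA, lockB;                    // illustrative locks

    void Thread1()
    {
        std::lock_guard<std::mutex> a(lockA);   // Lock A
        std::lock_guard<std::mutex> b(lockB);   // Lock B
    }                                           // unlocks happen in reverse order on scope exit

    void Thread2()
    {
        std::lock_guard<std::mutex> b(lockB);   // Lock B
        std::lock_guard<std::mutex> a(lockA);   // Lock A – opposite order: can deadlock with Thread1
    }

    // Not from the talk: acquiring both locks in a single call avoids the ordering problem.
    void Thread2Fixed()
    {
        std::scoped_lock both(lockA, lockB);    // C++17
    }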
34. Read/write tearing More than one thread writing to the same memory at the same time. The more data, the more likely it is. Solve with synchronisation primitives. Example: one thread writes “AAAAAAAA”, another writes “BBBBBBBB”, and the memory ends up as the torn value “AAAABBBB”.
35. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
36. Priority Inversion Consider threads with different priorities. Low priority thread holds a shared resource. High priority thread tries to acquire that resource. High priority thread is blocked by the low. Medium priority threads will execute at the expense of both the low and the high threads.
37. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
38. The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS). Often associated with dynamically allocated memory.
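A minimal sketch (not from the slides) of how ABA bites a CAS-based stack pop; names are illustrative:

    #include <atomic>

    struct Node { int value; Node* next; };     // illustrative stack node

    std::atomic<Node*> top{nullptr};

    // Vulnerable to ABA: between the load and the CAS another thread may pop this
    // node, pop and free its successor, then push the same node back (A -> B -> A).
    // The CAS still succeeds because the pointer value matches, but old_top->next
    // may now refer to freed memory.
    Node* PopUnsafe()
    {
        Node* old_top = top.load();
        while (old_top && !top.compare_exchange_weak(old_top, old_top->next))
        {
            // compare_exchange_weak reloads old_top on failure; just retry
        }
        return old_top;
    }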
49. SpinLock Loop until a value is set. No OS overhead with thread management. Doesn’t sleep the thread. Handy if you will never wait for long. Very bad if you need to wait for a long time. Can embed sleep() or Yield(), but these can be perilous.
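A minimal sketch of the spinlock described above, using std::atomic_flag; the talk does not show an implementation, so this is just one way to write it:

    #include <atomic>

    class SpinLock
    {
    public:
        void Lock()
        {
            // Loop until the flag is acquired; no OS thread management involved.
            while (m_Flag.test_and_set(std::memory_order_acquire))
            {
                // busy-wait; a sleep() or Yield() here trades latency for CPU time
            }
        }
        void Unlock() { m_Flag.clear(std::memory_order_release); }

    private:
        std::atomic_flag m_Flag = ATOMIC_FLAG_INIT;
    };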
50. Mutex Mutual Exclusion. A simple lock/unlock primitive. Otherwise known as a CriticalSection. Used to serialise access to code. Often overused. More than just a spinlock: can release the thread. Be aware of overhead.
51. Barrier Will block until ‘n’ threads signal it. Useful for ensuring that all threads have finished a particular task.
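A sketch of the same behaviour using the C++20 std::barrier; the original talk predates it, so treat this as a modern stand-in rather than what the speaker used:

    #include <barrier>
    #include <thread>
    #include <vector>

    int main()
    {
        constexpr int n = 4;
        std::barrier sync(n);                  // blocks until n threads arrive
        std::vector<std::jthread> workers;
        for (int i = 0; i < n; ++i)
        {
            workers.emplace_back([&sync]
            {
                // ...this thread's share of the task...
                sync.arrive_and_wait();        // wait here until all n threads have signalled
                // ...every thread has finished the task; safe to continue...
            });
        }
    }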
63. RWLock Allows many readers, but exclusive writing. Writing blocks writers and readers. Writing waits until all readers have finished.
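A small sketch of the reader/writer behaviour using std::shared_mutex (C++17); the variable names are illustrative:

    #include <shared_mutex>

    std::shared_mutex gTableLock;
    int gTable[256];

    int Read(int i)
    {
        std::shared_lock lock(gTableLock);     // many readers may hold this at once
        return gTable[i];
    }

    void Write(int i, int value)
    {
        std::unique_lock lock(gTableLock);     // exclusive: blocks readers and other writers
        gTable[i] = value;
    }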
64. Semaphore Generalisation of a mutex. Allows ‘c’ threads access to critical code at once. Basically an atomic integer: Wait() will block if value == 0, then decrement and continue. Signal() increments the value (allows a waiting thread to unblock). Conceptually, mutexes stop other threads from running code; semaphores tell other threads to run code.
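A sketch of the Wait()/Signal() behaviour using the C++20 counting semaphore; here ‘c’ is 4, and the names are illustrative:

    #include <semaphore>

    std::counting_semaphore<> gSlots(4);   // at most four threads inside at once

    void Worker()
    {
        gSlots.acquire();                  // Wait(): blocks while the count is 0, then decrements
        // ...critical code: at most four threads here at a time...
        gSlots.release();                  // Signal(): increments, possibly waking a blocked thread
    }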
65. Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
66. So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
68.–70. Problem Decomposition (from “Patterns for Parallel Programming”): organise the problem by tasks, by data decomposition, or by data flow, each in a linear or a recursive form. This gives six patterns: Task Parallelism (tasks, linear), Divide and Conquer (tasks, recursive), Geometric Decomposition (data, linear), Recursive Data (data, recursive), Pipeline (data flow, linear), Event-Based Coordination (data flow, recursive).
71. Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
72. Divide and Conquer Task Dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it’s easy to take a sequential Divide and Conquer implementation and parallelise it.
73. Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
74. Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of problem ie operate on all tree elements in parallel More work, but distributed across more cores
75. Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
76. Event-Based Coordination Data flow, recursive. Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks, which send events to other tasks… Can be highly complex. Tricky to load balance.
77. Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
78. Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
79. SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
80. Master/Worker Dominant force is the need to dynamically load balance. Tasks are highly variable, i.e. in duration/cost. Program structure doesn’t map onto loops. Cores vary in performance. “Bag of Tasks”: the master sets up tasks and waits for completion; workers grab a task from the queue, execute it and then grab the next one.
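A minimal “Bag of Tasks” sketch, assuming a mutex-protected std::queue of std::function tasks (the talk later argues for a lock-free queue, which this does not show); all names are illustrative:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    std::mutex gQueueLock;
    std::condition_variable gWork;
    std::queue<std::function<void()>> gTasks;
    bool gDone = false;

    void PushTask(std::function<void()> task)      // called by the master
    {
        {
            std::lock_guard<std::mutex> lock(gQueueLock);
            gTasks.push(std::move(task));
        }
        gWork.notify_one();
    }

    void Worker()                                  // one per worker thread
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(gQueueLock);
                gWork.wait(lock, [] { return gDone || !gTasks.empty(); });
                if (gTasks.empty())
                    return;                        // gDone and nothing left to do
                task = std::move(gTasks.front());
                gTasks.pop();
            }
            task();                                // run the task outside the lock
        }
    }

Shutdown (not shown) would set gDone under the lock and call gWork.notify_all().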
81. Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
82. Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks. Either direct task/core mapping or a thread pool.
83. Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
84. Shared Data Required when: at least one data structure is accessed by multiple tasks; at least one task modifies the shared data; the tasks potentially need to use the modified value. Solutions: serialise execution – mutual exclusion; noninterfering sets of operations; RWLocks.
85. Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
86. Shared Queue Extremely valuable construct Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
87. Lock free programming Locks: simple, easy to use and implement, but they serialise code execution. Lock free: tricky to implement and debug.
88. Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
94. Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
95. Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
96. Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
97. Add ‘b’ and ‘c’ concurrently head a d tail b c
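The insert that the last four slides walk through, written out for a single thread; the plain pointers and field names here are illustrative, and the lock-free version later in the talk replaces the next pointer with an AtomicMarkedPtr<>:

    struct Node { int key; Node* next; };

    void Insert(Node* head, Node* newNode)         // head is the sentinel shown in the diagrams
    {
        Node* prev = head;
        while (prev->next != nullptr && prev->next->key < newNode->key)
            prev = prev->next;                     // find where to insert

        newNode->next = prev->next;                // newNode->Next = prev->Next;
        prev->next = newNode;                      // prev->Next = newNode;  (not yet thread safe)
    }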
98. Extending to multiple threads What could go wrong? Add another node between a & c. A or c could be deleted. A concurrent read could reach a dangling pointer. Any combination of the above. If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions.
99. Coarse Grained Locking Lock the list for each add or remove. Also lock for reads (find, iterators). Will effectively serialise the list. Only one thread at a time can access the list. Correctness at the expense of performance.
100. A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
105. Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
106. Fine Grained Locking Add and Remove only affect neighbours. Give each Node a lock (so creating a node creates a mutex). Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
113. Fine Grained Locking Blocking is much longer – due to overhead in creating a mutex Very slow > 1200ms Better solution would have been to have a pool of mutexes that could be used
114. Optimistic Locking Search without locking. Lock nodes once found, then validate them. Valid if you can navigate to them from the head. If invalid, search from the head again.
133. Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use invasive lists (supply own nodes) Find() requires validation (Locks).
140. Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
141. Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
158. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
159. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
160. Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next; | prev->next=b;
161. Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
162. Introducing the AtomicMarkedPtr<> Wrapper on uint32. Encapsulates an atomic pointer and a flag. Allows testing of a flag and updating of a pointer atomically. Use the LSB for the flag. AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue, eFlag, nFlag);
163. AtomicMarkedPtr<> We can now use CAS to set a pointer and check a flag in a single atomic action, i.e. check the deleted status and change the pointer at the same time.
class Node {
public:
    Node();
    AtomicMarkedPtr<Node> m_Next;
    T m_Data;
    int32 m_Key;
};
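A sketch of how AtomicMarkedPtr<> could be implemented; the talk does not show its internals, so this is an assumption about the idea (pack the mark into the pointer’s least significant bit so flag and pointer are tested and swapped in one CAS). The talk wraps a uint32 (PS3 era); uintptr_t is used here for portability:

    #include <atomic>
    #include <cstdint>

    template <typename T>
    class AtomicMarkedPtr
    {
    public:
        AtomicMarkedPtr() : m_Bits(0) {}

        // Compare pointer and mark together and, if both match, set both atomically.
        bool CompareAndSet(T* expectedPtr, T* newPtr, bool expectedMark, bool newMark)
        {
            std::uintptr_t expected = Pack(expectedPtr, expectedMark);
            return m_Bits.compare_exchange_strong(expected, Pack(newPtr, newMark));
        }

        T*   Get() const      { return reinterpret_cast<T*>(m_Bits.load() & ~std::uintptr_t(1)); }
        bool IsMarked() const { return (m_Bits.load() & std::uintptr_t(1)) != 0; }

    private:
        static std::uintptr_t Pack(T* p, bool mark)
        {
            return reinterpret_cast<std::uintptr_t>(p) | std::uintptr_t(mark ? 1 : 0);
        }

        std::atomic<std::uintptr_t> m_Bits;
    };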
165. Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
166. Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
167. Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
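The three removal steps above, combined into one Remove() as a sketch. It leans on the Node class from slide 163 and the AtomicMarkedPtr<> sketch earlier; InternalFind() returning pred and curr through reference parameters is an assumption, since the slides only show it taking the key:

    bool Remove(int key)
    {
        for (;;)
        {
            Node* pred;
            Node* curr;
            if (!InternalFind(key, pred, curr))            // Step 1: find the node
                return false;                              // not in the list

            Node* succ = curr->m_Next.Get();

            // Step 2: logically delete by marking curr's next pointer.
            if (!curr->m_Next.CompareAndSet(succ, succ, false, true))
                continue;                                  // interference; retry from the top

            // Step 3: physically unlink. If this CAS fails, a later
            // InternalFind() will skip the marked node instead.
            pred->m_Next.CompareAndSet(curr, succ, false, false);
            return true;
        }
    }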
168. LockFree: InternalFind() Finds pred and curr. Skips marked nodes. Consider the list at Step 2 in the previous example, and let’s introduce a second thread calling InternalFind().
175. Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
176. Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
177. References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and Nir Shavit http://www.top500.org/ Flow Based Programming – http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores – http://www.greenteapress.com/semaphores/ My Blog: 7DOF – http://seven-degrees-of-freedom.blogspot.com/