SlideShare a Scribd company logo
@prestonism

Ruby Supercomputing                                                                                gmail: conmotto
                                                                                             http://prestonlee.com
                                                                                       http://github.com/preston
Using The GPU For Massive Performance Speedup                                http://www.slideshare.net/preston.lee/
Last Updated: March 17th, 2011.
Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
git@github.com:preston/ruby-gpu-examples.git

   Grab the code now if you want, but to run all the examples
     you’ll need the NVIDIA development driver and toolkit,
     Ruby 1.9, the “barracuda” gem, and JRuby (without the
   barracuda gem) on a multi-core OS X Snow Leopard system.
    This takes time to set up, so maybe just chillax for now?
Let’s find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
Math. Yay!

✤   The inner-most ring is ring #1.

✤   Total area of ring #5 is π times            #1
    the square of the radius. (πrr)       #4

✤   Area of only ring #5 is πrr                #5
    minus area of ring #4.

✤   (Math::PI * (radius ** 2)) -
    (Math::PI * ((radius - 1) ** 2))
✤   Let’s find the area of every ring...
1st working attempt.                                              #1
(Single-threaded Ruby.)




  def ring_area(radius)
  
    (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  (1..RINGS).each do |i|
     puts ring_area(i)
  end
2nd working attempt.                                                    #1
(Multi-threaded Ruby.)



  def ring_area(radius)
  
     (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  # Multi-thread it!
  (0..(NUM_THREADS - 1)).each do |i|
  
     rings_per_thread = RINGS / NUM_THREADS
  
     offset = i * rings_per_thread
  
     threads << Thread.new(rings_per_thread, offset) do |num, offset|
         
 last = offset + num - 1
         
 (offset..(last)).each do |radius|
         
 
      ring_area(radius)
         
 end
  
     end
  end
  threads.each do |t| t.join end
3rd/4th working attempt.                                                                                                                                                                       #1
(Single/Multi-threaded C. Ohhh crap...)


  /*                                                                                                             
                  printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec);
    A baseline CPU-based benchmark program CPU/GPU performance comparision.                                      
    Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel           
                  printf("nDone!nn");
    by taking the total area at a given radius and subtracting the area of the closest inner ring.               
                  return EXIT_SUCCESS;
    Copyright © 2011 Preston Lee. All rights reserved.                                                           }
    http://prestonlee.com
  */                                                                                                             /* Approximate the cross-sectional area between each pair of consecutive tree rings
                                                                                                                   
               in serial */
  #include <stdio.h>                                                                                             void calculate_ring_areas_in_serial(int rings) {
  #include <stdlib.h>                                                                                            
                 
            calculate_ring_areas_in_serial_with_offset(rings, 0);
  #include <math.h>                                                                                              }
  #include <pthread.h>
  #include <sys/time.h>                                                                                          void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
                                                                                                                 
                 int i;
  #include "tree_rings.h"                                                                                        
                 int offset = rings * thread;
                                                                                                                 
                 int max = rings + offset;
  #define DEFAULT_RINGS 1000000                                                                                   
                 float a = 0;
  #define NUM_THREADS 8                                                                                           
                 for(i = offset; i < max; i++) {
  #define DEBUG 0                                                                                                 
                 
              a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
                                                                                                                 
                 }
  int acc = 0;                                                                                                   }

  int main(int argc, const char * argv[]) {                                                                      /* Approximate the cross-sectional area between each pair of consecutive tree rings
  
                  int rings = DEFAULT_RINGS;                                                                    
               in parallel on NUM_THREADS threads */
                                                                                                                 void calculate_ring_areas_in_parallel(int rings) {
  
                 if(argc > 1) {                                                                               
                 pthread_t threads[NUM_THREADS];
  
                 
              rings = atoi(argv[1]);                                                        
                 int rc;
  
                 }                                                                                            
                 int t;
  
                                                                                                              
                 int rings_per_thread = rings / NUM_THREADS;
  
                 printf("nA baseline CPU-based benchmark program for CPU/GPU performance comparision.n");   
                 ring_thread_data data[NUM_THREADS];
  
                 printf("Copyright © 2011 Preston Lee. All rights reserved.nn");                            
  
                 printf("tUsage: %s [NUM TREE RINGS]nn", argv[0]);                                         
                 for(t = 0; t < NUM_THREADS; t++){
  
                                                                                                              
                 
              data[t].rings = rings_per_thread;
  
                 printf("Number of tree rings: %i. Yay!n", rings);                                           
                 
              data[t].number = t;
  
                 
                                                                                            
                    rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
  
                 struct timeval start, stop, diff;
                                                           
                    if (rc){
  
                                                                                                              
                      printf("ERROR; return code from pthread_create() is %dn", rc);
  
                 printf("nRunning serial calculation using CPU...ttt");                                   
                      exit(-1);
  
                 gettimeofday(&start, NULL);                                                                  
                   }
  
                 calculate_ring_areas_in_serial(rings);                                                       
                 }
  
                 gettimeofday(&stop, NULL);                                                                   
  
                 timeval_subtract(&diff, &stop, &start);                                                      
                 for(t = 0; t < NUM_THREADS; t++){
  
                 printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec);                        
                 
              pthread_join(threads[t], NULL);
  
                                                                                                              
                 }
  
                 printf("Running parallel calculation using %i CPU threads...t", NUM_THREADS);               }
  
                 gettimeofday(&start, NULL);
  
                 calculate_ring_areas_in_parallel(rings);                                                     void ring_job(ring_thread_data * data) {
  
                 gettimeofday(&stop, NULL);                                                                   
                 calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
  
                 timeval_subtract(&diff, &stop, &start);                                                      }
Speed: Your Primary Options

1. Pure Ruby in a big loop. (Single
   threaded.)                         #1

2. Pure Ruby, multi-threading
   smaller loops. (Limited to using
   a single core on 1.9 due to the
   GIL, but not on jruby etc.)

3. C extension, single thread.

4. C extension, pthreads.

5. “Divide and conquer.”
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Ruby 1.9
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
CPU Concurrency

✤   Ideally, asynchronous tasks across multiple physical and/or logical
    cores.

✤   POSIX threading is the standard.

✤   Producer/Consumer pattern typically used to account for differences
    in machine execution time.

✤   Concurrency generally implemented with Time-Division
    Multiplexing. An CPU with 4 cores can run 100 threads, but the OS
    will time-share execution time, more-or-less fairly.

✤   MIMD: Multiple Instruction Multiple Data
Common CPU Issues


✤   Sometime we need insane numbers of threads per host.

✤   Lock performance.

✤   Potential for excessive virtual memory swapping.

✤   Many tests are non-deterministic.

✤   Concurrency modeling is difficult to get correct.
Can’t we just execute every
                       instruction
concurrently, but with different data?
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))




         Multiple Instruction, Multiple Data. (MIMD)
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Yes!
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))




          Single Instruction, Multiple Data. (SIMD)
GPU Brief History

✤   Graphics Processing Units were initially developed for 3D scene rendering, such as computing
    luminosity values of each pixel on the screen concurrently.

✤   Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixles
    @ 60Hz => 47,185,920 => potential pixel pushes per second!

✤   Running 768,432 threads makes OS unhappy. :(

✤   Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every
    pixel in a display, each can be computed concurrently. Exactly concurrently.

✤   Generally fast floating point. (Critical for scientific computing.)

✤   SIMD: Single Instruction Multiple Data
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
Hardware Examples




         Images from NVIDIA and ATI.
GPU Pros


✤   SIMD architecture can run thousands of threads concurrently.

✤   Many synchronization issues are designed out.

✤   More FLOPS (floating point operations per second) than host CPU.

✤   Synchronously execute the same instruction for different data points.
NVidia Tesla C1060
CL_DEVICE_NAME: 

    
     Tesla C1060
CL_DEVICE_VENDOR: 
   
     
      NVIDIA Corporation
CL_DRIVER_VERSION: 
  
     
      260.81
CL_DEVICE_TYPE:
 
    
     
      CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:
 
           30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:
 3
CL_DEVICE_MAX_WORK_ITEM_SIZES:
 
         512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:
           
      512
CL_DEVICE_MAX_CLOCK_FREQUENCY:
           
      1296 MHz
CL_DEVICE_ADDRESS_BITS:
    
      
      32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
 
          1014 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:
 
      4058 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:
 no
CL_DEVICE_LOCAL_MEM_TYPE:
 
       local
CL_DEVICE_LOCAL_MEM_SIZE:
 
       16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:

      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:

      CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:
 
        1
CL_DEVICE_MAX_READ_IMAGE_ARGS:
 
         128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:

         8
CL_DEVICE_SINGLE_FP_CONFIG:

      CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_FMA
CL_DEVICE_2D_MAX_WIDTH
     
      
      4096
CL_DEVICE_2D_MAX_HEIGHT
 
         
      32768
CL_DEVICE_3D_MAX_WIDTH
     
      
      2048
CL_DEVICE_3D_MAX_HEIGHT
 
         
      2048
CL_DEVICE_3D_MAX_DEPTH
     
      
      2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>
 CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
OpenCL
(Image from Khronos Group.)

Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
OpenCL Terms

✤   Kernel: Your custom function(s) that will run on the Device.
    (Unrelated to the OS kernel.)

✤   Device: something that computes, such as a GPU chip. (You probably
    have 1, but then you might have 0... or 2+.) Could also be your CPU!

✤   Platform: Container for all your devices. When running code on your
    local machine you’ll only have one “platform” instance. (A cluster of
    GPU-heavy systems connected over a network would yield multiple
    available platforms, but OpenCL work in this area is not yet
    complete.)

✤   Device-specific terms: thread group, streaming multi-processor, etc.
Code!

✤   Ruby 1.9: single threaded.
✤   Ruby 1.9: multi-threaded on 1.9.
✤   Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).
✤   JRuby 1.6: single threaded.
✤   JRuby 1.6: multi-threaded.
✤   C: single threaded.
✤   C: native threads.
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
GPU Not-So-Pros

✤   Copying data in system RAM to/from GPU shared memory is not free. (In my
    own testing, often the slowest link in the chain.)

✤   SIMD can feel limiting. Benefits start to break down when you can’t design
    conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will
    cause some threads to idle.)

✤   64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)

✤   Allocation limitations. Having 4GB of GPU shared memory does necessarily
    mean your can allocate one giant block.

✤   Kernels are essentially written in C. May be difficult if you’re new to pointers or
    memory management. (You can bind to higher-level languages, though.)
Q&A and bonus content.
Names/Keywords To Know


✤   NVidia (CUDA)

✤   ATI (Owned by AMD) (Stream SDK)

✤   Kronus (Drives OpenCL specification.)

✤   Apple (ships OpenCL-capable drivers with all newer Macs running
    Snow Leopard.)
NVidia GeForce GT 330M
CL_DEVICE_NAME: 

    
     GeForce GT 330M
CL_DEVICE_VENDOR: 
   
     
     NVIDIA
CL_DRIVER_VERSION: 
  
     
     CLH 1.0
CL_DEVICE_TYPE:
 
    
     
     CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:
 
          6
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:
 3
CL_DEVICE_MAX_WORK_ITEM_SIZES:
 
        0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:
          
     512
CL_DEVICE_MAX_CLOCK_FREQUENCY:
          
     1100 MHz
CL_DEVICE_ADDRESS_BITS:
    
     
      32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
 
         128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:
 
     512 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:
 no
CL_DEVICE_LOCAL_MEM_TYPE:
 
      local
CL_DEVICE_LOCAL_MEM_SIZE:
 
      16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:

     CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:
 
       1
CL_DEVICE_MAX_READ_IMAGE_ARGS:
 
        128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:

        8
CL_DEVICE_SINGLE_FP_CONFIG:

     CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH
     
     
      0
CL_DEVICE_2D_MAX_HEIGHT
 
        
      0
CL_DEVICE_3D_MAX_WIDTH
     
     
      0
CL_DEVICE_3D_MAX_HEIGHT
 
        
      0
CL_DEVICE_3D_MAX_DEPTH
     
     
      0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>
 CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
Intel i7 CPU M 640, 2.80GHz
CL_DEVICE_NAME: 

    
     Intel(R) Core(TM) i7 CPU    M 640 @ 2.80GHz
CL_DEVICE_VENDOR: 
   
     
       Intel
CL_DRIVER_VERSION: 
  
     
       1.0
CL_DEVICE_TYPE:
 
    
     
       CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:
 
            4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:
 3
CL_DEVICE_MAX_WORK_ITEM_SIZES:
 
          0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:
            
      1
CL_DEVICE_MAX_CLOCK_FREQUENCY:
            
      2800 MHz
CL_DEVICE_ADDRESS_BITS:
    
       
      64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
 
           1536 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:
 
       6144 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:
 no
CL_DEVICE_LOCAL_MEM_TYPE:
 
        global
CL_DEVICE_LOCAL_MEM_SIZE:
 
        16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:

       CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:
 
         1
CL_DEVICE_MAX_READ_IMAGE_ARGS:
 
          128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:

          8
CL_DEVICE_SINGLE_FP_CONFIG:

       CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH
     
       
      0
CL_DEVICE_2D_MAX_HEIGHT
 
          
      0
CL_DEVICE_3D_MAX_WIDTH
     
       
      0
CL_DEVICE_3D_MAX_HEIGHT
 
          
      0
CL_DEVICE_3D_MAX_DEPTH
     
       
      0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>
 CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
Great GPU Use
Cases
✤   3D rendering. (Obviously!)

✤   Simulation. (Folding@home actual has a
    GPU client.)

✤   Physics.

✤   Probability.

✤   Network reliability.

✤   General floating-point speed-ups.

✤   Augmentation of existing apps by offloading
    work to GPU. Combining CPU/GPU
    paradigms is a perfectly valid approach.
A Few Simple Kernels
Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios

Date
Multi-
Dimensional
Threading
Threads can have an X, Y, and Z
coordinate within their thread
group, which you use to determine
who you are and can thus derive
your specific local parameters and
minimize control flow changes.
Links


✤   OpenCL Programming Guide for OSX. (Really good!)
    http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html



✤   NVidia CUDA:
    http://www.nvidia.com/object/cuda_home_new.html


✤   ATI Stream:
    http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx


✤   OpenCL:
    http://www.khronos.org/opencl/
Ad

More Related Content

What's hot (19)

Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
Lucidworks (Archived)
 
DDS-20m
DDS-20mDDS-20m
DDS-20m
PDX Web & Design
 
December 7, Projects
December 7, ProjectsDecember 7, Projects
December 7, Projects
University of Colorado at Boulder
 
Python basic
Python basic Python basic
Python basic
sewoo lee
 
NAS EP Algorithm
NAS EP Algorithm NAS EP Algorithm
NAS EP Algorithm
Jongsu "Liam" Kim
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
Jun Young Park
 
User defined functions
User defined functionsUser defined functions
User defined functions
shubham_jangid
 
Network vs. Code Metrics to Predict Defects: A Replication Study
Network vs. Code Metrics  to Predict Defects: A Replication StudyNetwork vs. Code Metrics  to Predict Defects: A Replication Study
Network vs. Code Metrics to Predict Defects: A Replication Study
Kim Herzig
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
ankita44
 
Lecture1 classes3
Lecture1 classes3Lecture1 classes3
Lecture1 classes3
Noor Faezah Mohd Yatim
 
Refactoring in AS3
Refactoring in AS3Refactoring in AS3
Refactoring in AS3
Eddie Kao
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
Edge AI and Vision Alliance
 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick Reference
LiquidHub
 
Clean code ch15
Clean code ch15Clean code ch15
Clean code ch15
HyeonSeok Choi
 
[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design
종빈 오
 
cluster(python)
cluster(python)cluster(python)
cluster(python)
Noriyuki Kojima
 
3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery
Vu Tran Lam
 
Invited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshopInvited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshop
Paolo Missier
 
Oech03
Oech03Oech03
Oech03
fangjiafu
 
Python basic
Python basic Python basic
Python basic
sewoo lee
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
Jun Young Park
 
User defined functions
User defined functionsUser defined functions
User defined functions
shubham_jangid
 
Network vs. Code Metrics to Predict Defects: A Replication Study
Network vs. Code Metrics  to Predict Defects: A Replication StudyNetwork vs. Code Metrics  to Predict Defects: A Replication Study
Network vs. Code Metrics to Predict Defects: A Replication Study
Kim Herzig
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
ankita44
 
Refactoring in AS3
Refactoring in AS3Refactoring in AS3
Refactoring in AS3
Eddie Kao
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
Edge AI and Vision Alliance
 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick Reference
LiquidHub
 
[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design
종빈 오
 
3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery
Vu Tran Lam
 
Invited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshopInvited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshop
Paolo Missier
 

Similar to Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1 (20)

Computer networkppt4577
Computer networkppt4577Computer networkppt4577
Computer networkppt4577
Thiyagarajan_Billa
 
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyRubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Prasun Anand
 
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
SagarYadav642223
 
Permute
PermutePermute
Permute
Russell Childs
 
Unit 3
Unit 3 Unit 3
Unit 3
GOWSIKRAJAP
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
oneous
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
Yung-Yu Chen
 
Using an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdfUsing an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdf
giriraj65
 
Task based Programming with OmpSs and its Application
Task based Programming with OmpSs and its ApplicationTask based Programming with OmpSs and its Application
Task based Programming with OmpSs and its Application
Facultad de Informática UCM
 
Permute
PermutePermute
Permute
Russell Childs
 
Data structures notes for college students btech.pptx
Data structures notes for college students btech.pptxData structures notes for college students btech.pptx
Data structures notes for college students btech.pptx
KarthikVijay59
 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
shreeaadithyaacellso
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
ambikavenkatesh2
 
C - aptitude3
C - aptitude3C - aptitude3
C - aptitude3
Srikanth
 
C aptitude questions
C aptitude questionsC aptitude questions
C aptitude questions
Srikanth
 
Intro to super. advance algorithm..pptx
Intro to super.   advance algorithm..pptxIntro to super.   advance algorithm..pptx
Intro to super. advance algorithm..pptx
ManishBaranwal10
 
Chp4(ref dynamic)
Chp4(ref dynamic)Chp4(ref dynamic)
Chp4(ref dynamic)
Mohd Effandi
 
UNIT I LINEAR DATA STRUCTURES – LIST .pptx
UNIT I LINEAR DATA STRUCTURES – LIST .pptxUNIT I LINEAR DATA STRUCTURES – LIST .pptx
UNIT I LINEAR DATA STRUCTURES – LIST .pptx
VISWANATHAN R V
 
Recursion to iteration automation.
Recursion to iteration automation.Recursion to iteration automation.
Recursion to iteration automation.
Russell Childs
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
K Hari Shankar
 
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyRubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Prasun Anand
 
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
40d5984d819aaa72e55aa10376b73bde_MIT6_087IAP10_lec12.pdf
SagarYadav642223
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
oneous
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
Yung-Yu Chen
 
Using an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdfUsing an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdf
giriraj65
 
Task based Programming with OmpSs and its Application
Task based Programming with OmpSs and its ApplicationTask based Programming with OmpSs and its Application
Task based Programming with OmpSs and its Application
Facultad de Informática UCM
 
Data structures notes for college students btech.pptx
Data structures notes for college students btech.pptxData structures notes for college students btech.pptx
Data structures notes for college students btech.pptx
KarthikVijay59
 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
shreeaadithyaacellso
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
ambikavenkatesh2
 
C - aptitude3
C - aptitude3C - aptitude3
C - aptitude3
Srikanth
 
C aptitude questions
C aptitude questionsC aptitude questions
C aptitude questions
Srikanth
 
Intro to super. advance algorithm..pptx
Intro to super.   advance algorithm..pptxIntro to super.   advance algorithm..pptx
Intro to super. advance algorithm..pptx
ManishBaranwal10
 
UNIT I LINEAR DATA STRUCTURES – LIST .pptx
UNIT I LINEAR DATA STRUCTURES – LIST .pptxUNIT I LINEAR DATA STRUCTURES – LIST .pptx
UNIT I LINEAR DATA STRUCTURES – LIST .pptx
VISWANATHAN R V
 
Recursion to iteration automation.
Recursion to iteration automation.Recursion to iteration automation.
Recursion to iteration automation.
Russell Childs
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
K Hari Shankar
 
Ad

Recently uploaded (20)

Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
TrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Connect and Protect: Networks and Network Security
Connect and Protect: Networks and Network SecurityConnect and Protect: Networks and Network Security
Connect and Protect: Networks and Network Security
VICTOR MAESTRE RAMIREZ
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
MINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PRMINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PR
MIND CTI
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
TrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Connect and Protect: Networks and Network Security
Connect and Protect: Networks and Network SecurityConnect and Protect: Networks and Network Security
Connect and Protect: Networks and Network Security
VICTOR MAESTRE RAMIREZ
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
MINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PRMINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PR
MIND CTI
 
Ad

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

  • 1. @prestonism Ruby Supercomputing gmail: conmotto http://prestonlee.com http://github.com/preston Using The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/ Last Updated: March 17th, 2011. Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
  • 2. [email protected]:preston/ruby-gpu-examples.git Grab the code now if you want, but to run all the examples you’ll need the NVIDIA development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
  • 3. Let’s find the area of each ring. http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
  • 4. Math. Yay! ✤ The inner-most ring is ring #1. ✤ Total area of ring #5 is π times #1 the square of the radius. (πrr) #4 ✤ Area of only ring #5 is πrr #5 minus area of ring #4. ✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) ✤ Let’s find the area of every ring...
  • 5. 1st working attempt. #1 (Single-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end (1..RINGS).each do |i| puts ring_area(i) end
  • 6. 2nd working attempt. #1 (Multi-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end # Multi-thread it! (0..(NUM_THREADS - 1)).each do |i| rings_per_thread = RINGS / NUM_THREADS offset = i * rings_per_thread threads << Thread.new(rings_per_thread, offset) do |num, offset| last = offset + num - 1 (offset..(last)).each do |radius| ring_area(radius) end end end threads.each do |t| t.join end
  • 7. 3rd/4th working attempt. #1 (Single/Multi-threaded C. Ohhh crap...) /* printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec); A baseline CPU-based benchmark program CPU/GPU performance comparision. Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel printf("nDone!nn"); by taking the total area at a given radius and subtracting the area of the closest inner ring. return EXIT_SUCCESS; Copyright © 2011 Preston Lee. All rights reserved. } http://prestonlee.com */ /* Approximate the cross-sectional area between each pair of consecutive tree rings in serial */ #include <stdio.h> void calculate_ring_areas_in_serial(int rings) { #include <stdlib.h> calculate_ring_areas_in_serial_with_offset(rings, 0); #include <math.h> } #include <pthread.h> #include <sys/time.h> void calculate_ring_areas_in_serial_with_offset(int rings, int thread) { int i; #include "tree_rings.h" int offset = rings * thread; int max = rings + offset; #define DEFAULT_RINGS 1000000 float a = 0; #define NUM_THREADS 8 for(i = offset; i < max; i++) { #define DEBUG 0 a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2)); } int acc = 0; } int main(int argc, const char * argv[]) { /* Approximate the cross-sectional area between each pair of consecutive tree rings int rings = DEFAULT_RINGS; in parallel on NUM_THREADS threads */ void calculate_ring_areas_in_parallel(int rings) { if(argc > 1) { pthread_t threads[NUM_THREADS]; rings = atoi(argv[1]); int rc; } int t; int rings_per_thread = rings / NUM_THREADS; printf("nA baseline CPU-based benchmark program for CPU/GPU performance comparision.n"); ring_thread_data data[NUM_THREADS]; printf("Copyright © 2011 Preston Lee. All rights reserved.nn"); printf("tUsage: %s [NUM TREE RINGS]nn", argv[0]); for(t = 0; t < NUM_THREADS; t++){ data[t].rings = rings_per_thread; printf("Number of tree rings: %i. Yay!n", rings); data[t].number = t; rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]); struct timeval start, stop, diff; if (rc){ printf("ERROR; return code from pthread_create() is %dn", rc); printf("nRunning serial calculation using CPU...ttt"); exit(-1); gettimeofday(&start, NULL); } calculate_ring_areas_in_serial(rings); } gettimeofday(&stop, NULL); timeval_subtract(&diff, &stop, &start); for(t = 0; t < NUM_THREADS; t++){ printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec); pthread_join(threads[t], NULL); } printf("Running parallel calculation using %i CPU threads...t", NUM_THREADS); } gettimeofday(&start, NULL); calculate_ring_areas_in_parallel(rings); void ring_job(ring_thread_data * data) { gettimeofday(&stop, NULL); calculate_ring_areas_in_serial_with_offset(data->rings, data->number); timeval_subtract(&diff, &stop, &start); }
  • 8. Speed: Your Primary Options 1. Pure Ruby in a big loop. (Single threaded.) #1 2. Pure Ruby, multi-threading smaller loops. (Limited to using a single core on 1.9 due to the GIL, but not on jruby etc.) 3. C extension, single thread. 4. C extension, pthreads. 5. “Divide and conquer.”
  • 12. CPU Concurrency ✤ Ideally, asynchronous tasks across multiple physical and/or logical cores. ✤ POSIX threading is the standard. ✤ Producer/Consumer pattern typically used to account for differences in machine execution time. ✤ Concurrency generally implemented with Time-Division Multiplexing. An CPU with 4 cores can run 100 threads, but the OS will time-share execution time, more-or-less fairly. ✤ MIMD: Multiple Instruction Multiple Data
  • 13. Common CPU Issues ✤ Sometime we need insane numbers of threads per host. ✤ Lock performance. ✤ Potential for excessive virtual memory swapping. ✤ Many tests are non-deterministic. ✤ Concurrency modeling is difficult to get correct.
  • 14. Can’t we just execute every instruction concurrently, but with different data? (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 15. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 16. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Multiple Instruction, Multiple Data. (MIMD)
  • 18. Yes! (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Single Instruction, Multiple Data. (SIMD)
  • 19. GPU Brief History ✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently. ✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixles @ 60Hz => 47,185,920 => potential pixel pushes per second! ✤ Running 768,432 threads makes OS unhappy. :( ✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every pixel in a display, each can be computed concurrently. Exactly concurrently. ✤ Generally fast floating point. (Critical for scientific computing.) ✤ SIMD: Single Instruction Multiple Data
  • 20. Hardware Examples Images from NVIDIA and ATI.
  • 21. Hardware Examples Images from NVIDIA and ATI.
  • 22. Hardware Examples Images from NVIDIA and ATI.
  • 23. Hardware Examples Images from NVIDIA and ATI.
  • 24. Hardware Examples Images from NVIDIA and ATI.
  • 25. Hardware Examples Images from NVIDIA and ATI.
  • 26. Hardware Examples Images from NVIDIA and ATI.
  • 27. Hardware Examples Images from NVIDIA and ATI.
  • 28. Hardware Examples Images from NVIDIA and ATI.
  • 29. GPU Pros ✤ SIMD architecture can run thousands of threads concurrently. ✤ Many synchronization issues are designed out. ✤ More FLOPS (floating point operations per second) than host CPU. ✤ Synchronously execute the same instruction for different data points.
  • 30. NVidia Tesla C1060 CL_DEVICE_NAME: Tesla C1060 CL_DEVICE_VENDOR: NVIDIA Corporation CL_DRIVER_VERSION: 260.81 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 30 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64 CL_DEVICE_MAX_WORK_GROUP_SIZE: 512 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1014 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 4058 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_DEVICE_2D_MAX_WIDTH 4096 CL_DEVICE_2D_MAX_HEIGHT 32768 CL_DEVICE_3D_MAX_WIDTH 2048 CL_DEVICE_3D_MAX_HEIGHT 2048 CL_DEVICE_3D_MAX_DEPTH 2048 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  • 31. OpenCL (Image from Khronos Group.) Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
  • 32. OpenCL Terms ✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.) ✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU! ✤ Platform: Container for all your devices. When running code on your local machine you’ll only have one “platform” instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.) ✤ Device-specific terms: thread group, streaming multi-processor, etc.
  • 33. Code! ✤ Ruby 1.9: single threaded. ✤ Ruby 1.9: multi-threaded on 1.9. ✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings). ✤ JRuby 1.6: single threaded. ✤ JRuby 1.6: multi-threaded. ✤ C: single threaded. ✤ C: native threads.
  • 35. GPU Not-So-Pros ✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.) ✤ SIMD can feel limiting. Benefits start to break down when you can’t design conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will cause some threads to idle.) ✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.) ✤ Allocation limitations. Having 4GB of GPU shared memory does necessarily mean your can allocate one giant block. ✤ Kernels are essentially written in C. May be difficult if you’re new to pointers or memory management. (You can bind to higher-level languages, though.)
  • 36. Q&A and bonus content.
  • 37. Names/Keywords To Know ✤ NVidia (CUDA) ✤ ATI (Owned by AMD) (Stream SDK) ✤ Kronus (Drives OpenCL specification.) ✤ Apple (ships OpenCL-capable drivers with all newer Macs running Snow Leopard.)
  • 38. NVidia GeForce GT 330M CL_DEVICE_NAME: GeForce GT 330M CL_DEVICE_VENDOR: NVIDIA CL_DRIVER_VERSION: CLH 1.0 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 6 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0 CL_DEVICE_MAX_WORK_GROUP_SIZE: 512 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_DEVICE_2D_MAX_WIDTH 0 CL_DEVICE_2D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_WIDTH 0 CL_DEVICE_3D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_DEPTH 0 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
  • 39. Intel i7 CPU M 640, 2.80GHz CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz CL_DEVICE_VENDOR: Intel CL_DRIVER_VERSION: 1.0 CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU CL_DEVICE_MAX_COMPUTE_UNITS: 4 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1 CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHz CL_DEVICE_ADDRESS_BITS: 64 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: global CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_DEVICE_2D_MAX_WIDTH 0 CL_DEVICE_2D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_WIDTH 0 CL_DEVICE_3D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_DEPTH 0 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  • 40. Great GPU Use Cases ✤ 3D rendering. (Obviously!) ✤ Simulation. (Folding@home actual has a GPU client.) ✤ Physics. ✤ Probability. ✤ Network reliability. ✤ General floating-point speed-ups. ✤ Augmentation of existing apps by offloading work to GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
  • 41. A Few Simple Kernels Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios Date
  • 42. Multi- Dimensional Threading Threads can have an X, Y, and Z coordinate within their thread group, which you use to determine who you are and can thus derive your specific local parameters and minimize control flow changes.
  • 43. Links ✤ OpenCL Programming Guide for OSX. (Really good!) http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html ✤ NVidia CUDA: http://www.nvidia.com/object/cuda_home_new.html ✤ ATI Stream: http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx ✤ OpenCL: http://www.khronos.org/opencl/

Editor's Notes