- Apache Spark is an open-source cluster computing framework for large-scale data processing. Originally developed at UC Berkeley's AMPLab in 2009, it is used for distributed workloads such as data mining, stream processing, and machine learning.
- Spark uses in-memory computing to improve performance: intermediate data is kept in memory across operations rather than written to disk between stages, which makes iterative and interactive analytics much faster than purely disk-based engines. Applications can also explicitly cache a dataset in memory so that repeated computations over it avoid re-reading and recomputing the data (see the caching sketch after this list).
- Proper configuration of Spark's memory settings is important to avoid out-of-memory errors. Options such as `spark.memory.fraction` (the share of heap used for execution and storage), `spark.memory.storageFraction` (the portion of that region reserved for cached data), `spark.executor.memory` (on-heap size), and `spark.memory.offHeap.size` (off-heap size) control how Spark allocates memory within each executor (a configuration sketch follows below).
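The following is a minimal sketch of explicit caching, assuming a local `SparkSession`; the `events.json` path and the `status` column are hypothetical placeholders for your own data:

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheExample")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()

    // Hypothetical input; substitute your own dataset.
    val events = spark.read.json("events.json")

    // cache() marks the DataFrame for in-memory storage; it is
    // materialized lazily on the first action that touches it.
    events.cache()

    // Both actions below reuse the cached blocks instead of
    // re-reading and re-parsing the JSON from disk.
    println(events.count())
    println(events.filter(events("status") === "error").count())

    events.unpersist() // release the cached blocks
    spark.stop()
  }
}
```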
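And a sketch of setting the memory options named above when building a session. The values here are illustrative, not recommendations; note also that settings like `spark.executor.memory` must generally be in place before the executor JVMs launch, so in real deployments these are usually passed via `spark-submit --conf` or `spark-defaults.conf` rather than in code:

```scala
import org.apache.spark.sql.SparkSession

object MemoryTuningExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MemoryTuningExample")
      .master("local[*]")
      // Total on-heap memory per executor JVM.
      .config("spark.executor.memory", "4g")
      // Fraction of (heap - 300 MB reserved) shared by execution and storage.
      .config("spark.memory.fraction", "0.6")
      // Portion of that shared region protected for cached (storage) blocks;
      // execution can borrow the remainder and evict unprotected blocks.
      .config("spark.memory.storageFraction", "0.5")
      // Opt in to off-heap allocation and give it an explicit budget.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()

    spark.stop()
  }
}
```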