Data Warehouse Development Interview Questions: Spark (why the small-table broadcast limit is set to 10 MB)

1. How many Join implementations are there? Have you studied the source code? How are they implemented under the hood?

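There are seven join types: ① inner join, ② Cartesian product (cross) join, ③ left join, ④ right join, ⑤ full join, ⑥ left semi join (IN), ⑦ left anti join (NOT IN).

The ANSI SQL standard defines five join types: inner (Inner), full outer (FullOuter), left outer (LeftOuter), right outer (RightOuter), and cross (Cross).

There are five join implementations: ① broadcast hash, ② shuffle hash, ③ shuffle sort merge, ④ Cartesian product, ⑤ broadcast nested loop.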

In the physical plan stage, Spark's JoinSelection class chooses the final join strategy based on the join hints, the sizes of the joined tables, whether the join is an equi-join or a non-equi-join, and whether the join keys are sortable. Spark then runs the join with the chosen strategy. As of Apache Spark 3.0, five join strategies are supported:

Join strategies

Broadcast hash join (BHJ)
Shuffle hash join (SHJ)
Shuffle sort merge join (SMJ)
Shuffle-and-replicate nested loop join, also known as Cartesian product join
Broadcast nested loop join (BNLJ)

1.Broadcast Hash Join (BHJ)

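BHJ is also called a map-side-only join, i.e. a map-side join. It suits joining a small table with a large one: the small table is broadcast to every Executor and joined locally there, so the large table does not have to be shuffled.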

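This join strategy can be used only when the following conditions hold:

The small table must be small enough. The limit is set by spark.sql.autoBroadcastJoinThreshold, which defaults to 10MB. If you have plenty of memory you can raise the threshold moderately; setting spark.sql.autoBroadcastJoinThreshold to -1 turns automatic BHJ selection off;
Only equi-joins are supported; the join keys do not need to be sortable;
All join types are supported except full outer joins.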
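The build-then-probe idea behind BHJ can be sketched in plain Scala. This is a toy, single-process model and not Spark's implementation: `BroadcastHashJoinSketch`, `buildHashTable` and `join` are hypothetical names, ordinary collections stand in for partitions, and only the inner-join case is covered.

```scala
// Toy model of a broadcast hash join (inner join only).
// Assumption: plain Scala collections stand in for Spark partitions.
object BroadcastHashJoinSketch {
  // The small side is turned into a hash map once; in Spark this structure
  // would be broadcast to every executor.
  def buildHashTable[K, V](small: Seq[(K, V)]): Map[K, Seq[V]] =
    small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

  // Every partition of the large side probes the same map locally,
  // so the large side never has to be repartitioned.
  def join[K, L, S](
      largePartitions: Seq[Seq[(K, L)]],
      small: Seq[(K, S)]): Seq[Seq[(K, L, S)]] = {
    val table = buildHashTable(small)
    largePartitions.map { partition =>
      for {
        (k, l) <- partition
        s <- table.getOrElse(k, Seq.empty[S])
      } yield (k, l, s)
    }
  }

  def main(args: Array[String]): Unit = {
    val orders = Seq(Seq(1 -> "o1", 2 -> "o2"), Seq(1 -> "o3", 3 -> "o4"))
    val users = Seq(1 -> "alice", 2 -> "bob")
    join(orders, users).foreach(println)
  }
}
```

Note that the keys are only hashed and compared for equality, never ordered, which is why BHJ needs an equi-join but not sortable keys.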

2.Shuffle hash join (SHJ)

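When a table is larger than spark.sql.autoBroadcastJoinThreshold and therefore unsuitable for broadcasting, shuffle hash join becomes a candidate.

Shuffle hash join is also a strategy for joining a large table with a small one. The idea: both tables are partitioned by the join keys, using the same partitioning function and the same number of partitions, so rows with the same hash value are sent to the same partition. Partitions with the same index from the two tables can then be hash-joined locally on one Executor. Before the join, a hash map is built from each partition of the small table. Shuffle hash join is divide and conquer: it breaks one large problem into many small ones.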
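The partition-then-hash idea can be sketched as follows. This is a simplified single-process model under stated assumptions: `ShuffleHashJoinSketch`, `partitionByKey` and `join` are hypothetical names, the "shuffle" is just a `groupBy` on `hashCode mod numPartitions`, and only the inner-join case is shown.

```scala
// Toy model of a shuffle hash join (inner join only).
// Assumption: groupBy on a partition index stands in for the real shuffle.
object ShuffleHashJoinSketch {
  // Same partitioning function for both sides: rows with equal keys
  // always end up under the same partition index.
  def partitionByKey[K, V](rows: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    rows.groupBy { case (k, _) => java.lang.Math.floorMod(k.hashCode, numPartitions) }

  def join[K, L, S](large: Seq[(K, L)], small: Seq[(K, S)], numPartitions: Int): Seq[(K, L, S)] = {
    val largeParts = partitionByKey(large, numPartitions)
    val smallParts = partitionByKey(small, numPartitions)
    (0 until numPartitions).flatMap { p =>
      // Build a hash map from this partition of the small side only.
      val table: Map[K, Seq[S]] = smallParts
        .getOrElse(p, Seq.empty[(K, S)])
        .groupBy(_._1)
        .map { case (k, rows) => k -> rows.map(_._2) }
      // Probe it with the matching partition of the large side.
      for {
        (k, l) <- largeParts.getOrElse(p, Seq.empty[(K, L)])
        s <- table.getOrElse(k, Seq.empty[S])
      } yield (k, l, s)
    }
  }

  def main(args: Array[String]): Unit = {
    val orders = Seq(1 -> "o1", 2 -> "o2", 1 -> "o3", 3 -> "o4")
    val users = Seq(1 -> "alice", 2 -> "bob")
    join(orders, users, numPartitions = 2).foreach(println)
  }
}
```

Each hash map only has to hold one partition of the small table, which is what lets this strategy handle a build side that is too large to broadcast as a whole.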

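Without a join hint, shuffle hash join is chosen only when all of the following conditions hold:

Only equi-joins are supported; the join keys do not need to be sortable;
All join types are supported except full outer joins;
spark.sql.join.preferSortMergeJoin must be set to false. The parameter was introduced in Spark 2.0.0 and defaults to true, so sort merge join is preferred by default;
The small table's size (plan.stats.sizeInBytes) must be less than spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions, and three times the small table's size (stats.sizeInBytes) must be less than or equal to the large table's size (stats.sizeInBytes), i.e. a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes.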
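The two size checks in the last condition are easy to work through with numbers. The sketch below assumes the default settings (a 10 MB broadcast threshold and 200 shuffle partitions), which gives a build-side limit of 2000 MB. The function names mirror the helpers called in the Spark source, but the bodies are reconstructed from the description above, and `ShuffleHashJoinPreconditions` and `sizeAllowsShuffleHashJoin` are hypothetical names.

```scala
// Worked example of the size conditions for shuffle hash join.
// Assumption: default values for both settings; sizes are plain byte counts.
object ShuffleHashJoinPreconditions {
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024 // 10 MB
  val shufflePartitions: Int = 200

  // 10 MB * 200 = 2000 MB: on average each partition of the build side
  // is then below the broadcast threshold.
  val buildSideLimitBytes: Long = autoBroadcastJoinThreshold * shufflePartitions

  def canBuildLocalHashMap(sizeInBytes: Long): Boolean =
    sizeInBytes < buildSideLimitBytes

  // "Much smaller" means at least three times smaller.
  def muchSmaller(a: Long, b: Long): Boolean = a * 3 <= b

  def sizeAllowsShuffleHashJoin(small: Long, large: Long): Boolean =
    canBuildLocalHashMap(small) && muchSmaller(small, large)

  def main(args: Array[String]): Unit = {
    val gb = 1024L * 1024 * 1024
    println(sizeAllowsShuffleHashJoin(1 * gb, 5 * gb))  // 1 GB vs 5 GB
    println(sizeAllowsShuffleHashJoin(1 * gb, 2 * gb))  // not much smaller
    println(sizeAllowsShuffleHashJoin(3 * gb, 20 * gb)) // build side too large
  }
}
```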

3.Shuffle sort merge join (SMJ)

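Both of the previous strategies put conditions on table size. If both tables in the join are large, shuffle sort merge join is the one to consider.

The idea behind shuffle sort merge join: again, both tables are partitioned by the join keys with the same partitioning function and the same number of partitions, so that equal keys land in the same partition. Each partition is then sorted by the join keys, and finally the reduce side takes the matching partitions of the two tables and merge-joins them; rows whose keys are equal are joined.

Shuffle sort merge join is not always applicable either; the following conditions must hold:

Only equi-joins are supported, and the join keys must be sortable;
All join types are supported.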
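The merge step for a single partition can be sketched like this. It is a minimal inner-join version under stated assumptions: `SortMergeJoinSketch` and `mergeJoin` are hypothetical names, the shuffle is omitted, and both inputs are sorted inside the function rather than by a separate sort stage.

```scala
// Toy model of the merge phase of a sort merge join (inner join, one partition).
object SortMergeJoinSketch {
  def mergeJoin[K, L, R](left: Seq[(K, L)], right: Seq[(K, R)])(
      implicit ord: Ordering[K]): Seq[(K, L, R)] = {
    // The keys must be orderable: this is the "sortable keys" requirement.
    val l = left.sortBy(_._1).toVector
    val r = right.sortBy(_._1).toVector
    val out = Vector.newBuilder[(K, L, R)]
    var i = 0
    var j = 0
    while (i < l.length && j < r.length) {
      val cmp = ord.compare(l(i)._1, r(j)._1)
      if (cmp < 0) i += 1      // left key is behind: advance left
      else if (cmp > 0) j += 1 // right key is behind: advance right
      else {
        val key = l(i)._1
        // Find the run of rows with this key on the right side.
        var jEnd = j
        while (jEnd < r.length && ord.equiv(r(jEnd)._1, key)) jEnd += 1
        // Pair every left row that has this key with the whole right run.
        while (i < l.length && ord.equiv(l(i)._1, key)) {
          var k = j
          while (k < jEnd) {
            out += ((key, l(i)._2, r(k)._2))
            k += 1
          }
          i += 1
        }
        j = jEnd
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val left = Seq(3 -> "c", 1 -> "a", 2 -> "b", 2 -> "bb")
    val right = Seq(2 -> "x", 2 -> "y", 4 -> "z", 1 -> "w")
    mergeJoin(left, right).foreach(println)
  }
}
```

Unlike the hash-based strategies, no hash map is held in memory; both cursors only move forward, which is why this strategy copes with two large inputs.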

4.Cartesian product join

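If the two tables are joined without a join condition (no ON / WHERE condition), a Cartesian product join is produced. The number of rows in the result is the product of the two tables' row counts.

Applicable conditions:

The join must be an inner (inner-like) join;
Both equi-joins and non-equi-joins are supported.

5.Broadcast nested loop join (BNLJ)

This is the plainest join implementation, with no optimization at all: iterate over one side and, for every row, scan the other side to test the join condition. It is very slow.

Broadcast nested loop join broadcasts one input dataset to every executor; on each executor, the partitions of the other dataset are then joined against the broadcast dataset with a nested loop.

Because it needs both a broadcast and a nested loop, it is very inefficient and very memory-hungry: one dataset is broadcast to all executors regardless of its size.

Applicable conditions:

Both equi-joins and non-equi-joins are supported;
All join types are supported. The implementation is optimized for: ① broadcasting the left table in a right outer join; ② broadcasting the right table in a left outer join, left semi join, left anti join, or existence join; ③ broadcasting either table in an inner join. In other cases the data has to be scanned multiple times, which is slow.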
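A nested loop join accepts any predicate, which is why it also covers non-equi-joins. The sketch below is a toy model with hypothetical names (`NestedLoopJoinSketch`, `innerJoin`, `leftOuterJoin`); the "broadcast" side is simply a collection that every streamed row can see in full.

```scala
// Toy model of a broadcast nested loop join with an arbitrary join condition.
object NestedLoopJoinSketch {
  // Inner join: for every streamed row, scan the whole broadcast side.
  def innerJoin[A, B](streamed: Seq[A], broadcast: Seq[B])(
      cond: (A, B) => Boolean): Seq[(A, B)] =
    for {
      a <- streamed
      b <- broadcast
      if cond(a, b)
    } yield (a, b)

  // Left outer join with the right side broadcast: each left row sees the
  // complete right side, so it can tell locally that it has no match.
  def leftOuterJoin[A, B](left: Seq[A], broadcastRight: Seq[B])(
      cond: (A, B) => Boolean): Seq[(A, Option[B])] =
    left.flatMap { a =>
      val matches = broadcastRight.filter(b => cond(a, b))
      if (matches.isEmpty) Seq((a, Option.empty[B]))
      else matches.map(b => (a, Option(b)))
    }

  def main(args: Array[String]): Unit = {
    // Non-equi condition: left value greater than right value.
    innerJoin(Seq(1, 5, 10), Seq(3, 7))(_ > _).foreach(println)
    leftOuterJoin(Seq(1, 5), Seq(3))(_ > _).foreach(println)
  }
}
```

The left outer case hints at why the optimized combinations broadcast the side opposite to the preserved one: the preserved side can then be streamed once, without a second pass to find unmatched rows.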

Selection logic

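For an equi-join, the join hints are checked first, in this order:

broadcast hint: chooses broadcast hash join if the join type is supported; if both sides carry the hint, the smaller side is broadcast
sort merge hint: requires the join keys to be sortable
shuffle hash hint: chooses shuffle hash join if the join type is supported; if both sides carry the hint, the smaller side is used to build the hash map
shuffle replicate NL hint: i.e. Cartesian product join, effective only for inner joins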

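If there is no hint, or the hint is not applicable, the following rules are tried in order:

Broadcast hash join: chosen if one side is small enough to broadcast and the join type is supported; if both sides are small enough, the smaller side is broadcast;
Shuffle hash join: chosen if one side is small enough to build a local hash map, is much smaller than the other side, and spark.sql.join.preferSortMergeJoin is false;
Shuffle sort merge join: chosen if the join keys are sortable;
Cartesian product join: chosen if the join type is inner join;
Broadcast nested loop join: the final fallback; it may OOM, but there is no other choice.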
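The no-hint fallback order for equi-joins can be condensed into a chain of `Option.orElse` calls. This is a heavily simplified sketch, not Spark's code: `JoinSelectionSketch`, `JoinInfo` and `choose` are hypothetical, hints are ignored, the per-join-type build side restrictions are reduced to a single full-outer check, and the default parameter values are the Spark defaults for the corresponding settings.

```scala
// Simplified model of the no-hint selection order for an equi-join.
// Assumption: JoinInfo is an illustrative stand-in for Spark's logical plan.
object JoinSelectionSketch {
  final case class JoinInfo(
      leftSize: Long,
      rightSize: Long,
      isFullOuter: Boolean,
      isInnerLike: Boolean,
      keysSortable: Boolean)

  def choose(
      j: JoinInfo,
      broadcastThreshold: Long = 10L * 1024 * 1024,
      shufflePartitions: Int = 200,
      preferSortMergeJoin: Boolean = true): String = {
    val small = math.min(j.leftSize, j.rightSize)
    val large = math.max(j.leftSize, j.rightSize)

    // Rule 1: one side fits under the broadcast threshold.
    def broadcastHash: Option[String] =
      if (!j.isFullOuter && small <= broadcastThreshold) Some("BroadcastHashJoin") else None

    // Rule 2: small enough for a local hash map and much smaller than the other side.
    def shuffleHash: Option[String] =
      if (!preferSortMergeJoin && !j.isFullOuter &&
          small < broadcastThreshold * shufflePartitions && small * 3 <= large)
        Some("ShuffledHashJoin")
      else None

    // Rule 3: sortable join keys.
    def sortMerge: Option[String] =
      if (j.keysSortable) Some("SortMergeJoin") else None

    // Rule 4: inner-like joins only.
    def cartesian: Option[String] =
      if (j.isInnerLike) Some("CartesianProduct") else None

    // Rule 5: the final fallback.
    broadcastHash
      .orElse(shuffleHash)
      .orElse(sortMerge)
      .orElse(cartesian)
      .getOrElse("BroadcastNestedLoopJoin")
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val gb = 1024L * mb
    println(choose(JoinInfo(5 * mb, 10 * gb, isFullOuter = false, isInnerLike = true, keysSortable = true)))
    println(choose(JoinInfo(1 * gb, 10 * gb, isFullOuter = false, isInnerLike = true, keysSortable = true)))
  }
}
```

Because `orElse` takes its argument by name, a later rule is only evaluated when every earlier rule returned `None`, which matches the "try the rules one by one" wording.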

Spark 3.0.0 source code

  
  /**
   * Select the proper physical plan for join based on join strategy hints, the availability of
   * equi-join keys and the sizes of joining relations. Below are the existing join strategies,
   * their characteristics and their limitations.
   *
   * - Broadcast hash join (BHJ):
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *     BHJ usually performs faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation and it could cause
   *     OOM or perform badly in some cases, especially when the build/broadcast side is big.
   *
   * - Shuffle hash join:
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *
   * - Shuffle sort merge join (SMJ):
   *     Only supported for equi-joins and the join keys have to be sortable.
   *     Supported for all join types.
   *
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow.
   *
   * - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports only inner like joins.
   */
  object JoinSelection extends Strategy with PredicateHelper {

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      // If it is an equi-join, we first look at the join hints w.r.t. the following order:
      //   1. broadcast hint: pick broadcast hash join if the join type is supported. If both sides
      //      have the broadcast hints, choose the smaller side (based on stats) to broadcast.
      //   2. sort merge hint: pick sort merge join if join keys are sortable.
      //   3. shuffle hash hint: We pick shuffle hash join if the join type is supported. If both
      //      sides have the shuffle hash hints, choose the smaller side (based on stats) as the
      //      build side.
      //   4. shuffle replicate NL hint: pick cartesian product if join type is inner like.
      //
      // If there is no hint or the hints are not applicable, we follow these rules one by one:
      //   1. Pick broadcast hash join if one side is small enough to broadcast, and the join type
      //      is supported. If both sides are small, choose the smaller side (based on stats)
      //      to broadcast.
      //   2. Pick shuffle hash join if one side is small enough to build local hash map, and is
      //      much smaller than the other side, and `spark.sql.join.preferSortMergeJoin` is false.
      //   3. Pick sort merge join if the join keys are sortable.
      //   4. Pick cartesian product if join type is inner like.
      //   5. Pick broadcast nested loop join as the final solution. It may OOM but we don't have
      //      other choice.
      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
        def createBroadcastHashJoin(buildLeft: Boolean, buildRight: Boolean) = {
          val wantToBuildLeft = canBuildLeft(joinType) && buildLeft
          val wantToBuildRight = canBuildRight(joinType) && buildRight
          getBuildSide(wantToBuildLeft, wantToBuildRight, left, right).map { buildSide =>
            Seq(joins.BroadcastHashJoinExec(
              leftKeys,
              rightKeys,
              joinType,
              buildSide,
              condition,
              planLater(left),
              planLater(right)))
          }
        }

        def createShuffleHashJoin(buildLeft: Boolean, buildRight: Boolean) = {
          val wantToBuildLeft = canBuildLeft(joinType) && buildLeft
          val wantToBuildRight = canBuildRight(joinType) && buildRight
          getBuildSide(wantToBuildLeft, wantToBuildRight, left, right).map { buildSide =>
            Seq(joins.ShuffledHashJoinExec(
              leftKeys,
              rightKeys,
              joinType,
              buildSide,
              condition,
              planLater(left),
              planLater(right)))
          }
        }

        def createSortMergeJoin() = {
          if (RowOrdering.isOrderable(leftKeys)) {
            Some(Seq(joins.SortMergeJoinExec(
              leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right))))
          } else {
            None
          }
        }

        def createCartesianProduct() = {
          if (joinType.isInstanceOf[InnerLike]) {
            Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
          } else {
            None
          }
        }

        def createJoinWithoutHint() = {
          createBroadcastHashJoin(
            canBroadcast(left) && !hint.leftHint.exists(_.strategy.contains(NO_BROADCAST_HASH)),
            canBroadcast(right) && !hint.rightHint.exists(_.strategy.contains(NO_BROADCAST_HASH)))
            .orElse {
              if (!conf.preferSortMergeJoin) {
                createShuffleHashJoin(
                  canBuildLocalHashMap(left) && muchSmaller(left, right),
                  canBuildLocalHashMap(right) && muchSmaller(right, left))
              } else {
                None
              }
            }
            .orElse(createSortMergeJoin())
            .orElse(createCartesianProduct())
            .getOrElse {
              // This join could be very slow or OOM
              val buildSide = getSmallerSide(left, right)
              Seq(joins.BroadcastNestedLoopJoinExec(
                planLater(left), planLater(right), buildSide, joinType, condition))
            }
        }

        createBroadcastHashJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
          .orElse { if (hintToSortMergeJoin(hint)) createSortMergeJoin() else None }
          .orElse(createShuffleHashJoin(hintToShuffleHashLeft(hint), hintToShuffleHashRight(hint)))
          .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
          .getOrElse(createJoinWithoutHint())

      // If it is not an equi-join, we first look at the join hints w.r.t. the following order:
      //   1. broadcast hint: pick broadcast nested loop join. If both sides have the broadcast
      //      hints, choose the smaller side (based on stats) to broadcast for inner and full joins,
      //      choose the left side for right join, and choose right side for left join.
      //   2. shuffle replicate NL hint: pick cartesian product if join type is inner like.
      //
      // If there is no hint or the hints are not applicable, we follow these rules one by one:
      //   1. Pick broadcast nested loop join if one side is small enough to broadcast. If only left
      //      side is broadcast-able and it's left join, or only right side is broadcast-able and
      //      it's right join, we skip this rule. If both sides are small, broadcasts the smaller
      //      side for inner and full joins, broadcasts the left side for right join, and broadcasts
      //      right side for left join.
      //   2. Pick cartesian product if join type is inner like.
      //   3. Pick broadcast nested loop join as the final solution. It may OOM but we don't have
      //      other choice. It broadcasts the smaller side for inner and full joins, broadcasts the
      //      left side for right join, and broadcasts right side for left join.
      case logical.Join(left, right, joinType, condition, hint) =>
        val desiredBuildSide = if (joinType.isInstanceOf[InnerLike] || joinType == FullOuter) {
          getSmallerSide(left, right)
        } else {
          // For perf reasons, `BroadcastNestedLoopJoinExec` prefers to broadcast left side if
          // it's a right join, and broadcast right side if it's a left join.
          // TODO: revisit it. If left side is much smaller than the right side, it may be better
          // to broadcast the left side even if it's a left join.
          if (canBuildLeft(joinType)) BuildLeft else BuildRight
        }

        def createBroadcastNLJoin(buildLeft: Boolean, buildRight: Boolean) =