This document proposes T3S, a two-stage sampling selection strategy for large-scale data deduplication. T3S reduces the manual labeling effort needed to build training data: the first stage selects balanced subsets of candidate pairs, and the second stage removes redundant pairs from those subsets, producing a smaller and more informative training set. The training set is then used to detect the boundaries of the fuzzy region, and those boundaries are used to classify the candidate pairs. The approach is implemented in a distributed manner on Apache Spark and compares favorably with an existing method, chiefly because it requires a smaller training set.
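To make the pipeline concrete, the following single-machine Python sketch illustrates the general shape of the idea. It is not the T3S implementation, and it does not use Spark. Every function name and parameter (`n_levels`, `per_level`, `min_gap`) is hypothetical, and three choices are assumptions made only for illustration: "balanced" is taken to mean an equal cap per similarity level, "redundant" is taken to mean nearly identical similarity values, and the fuzzy region is taken to span from the lowest-scoring labeled match to the highest-scoring labeled non-match.

```python
import random
from collections import defaultdict


def stage_one_balanced_sample(pairs, n_levels=10, per_level=3, seed=0):
    """Sample at most `per_level` pairs from each similarity level.

    `pairs` is a list of (pair_id, similarity) with similarity in [0, 1].
    Equal-width levels are an assumption of this sketch.
    """
    rng = random.Random(seed)
    levels = defaultdict(list)
    for pair_id, sim in pairs:
        level = min(int(sim * n_levels), n_levels - 1)
        levels[level].append((pair_id, sim))
    sample = []
    for level in sorted(levels):
        bucket = levels[level]
        sample.extend(rng.sample(bucket, min(per_level, len(bucket))))
    return sample


def stage_two_drop_redundant(sample, min_gap=0.02):
    """Drop pairs whose similarity is within `min_gap` of the last kept pair.

    Treating near-identical similarity as redundancy is an assumption.
    """
    kept = []
    for pair_id, sim in sorted(sample, key=lambda p: p[1]):
        if not kept or sim - kept[-1][1] >= min_gap:
            kept.append((pair_id, sim))
    return kept


def fuzzy_region(labeled):
    """Return (low, high) from labeled (similarity, is_match) examples.

    low  = smallest similarity among labeled matches
    high = largest similarity among labeled non-matches
    Requires at least one match and one non-match.
    """
    low = min(s for s, is_match in labeled if is_match)
    high = max(s for s, is_match in labeled if not is_match)
    return low, high


def classify(sim, low, high):
    """Classify one candidate pair against the detected boundaries."""
    if sim > high:
        return "match"
    if sim < low:
        return "non-match"
    return "fuzzy"


if __name__ == "__main__":
    candidates = [(i, i / 100) for i in range(100)]
    sample = stage_two_drop_redundant(stage_one_balanced_sample(candidates))
    # In practice a human would label `sample`; these labels are made up.
    labeled = [(0.2, False), (0.4, True), (0.6, False), (0.9, True)]
    low, high = fuzzy_region(labeled)
    print(len(sample), low, high, classify(0.5, low, high))
```

In a distributed setting, the per-level grouping in the first function is the step that would most naturally map onto a grouped or keyed operation, but how T3S actually partitions the work on Spark is not described here.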