/**
 * Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
 *
 * @param withReplacement Sample with replacement or not.
 *                        If true, rows are sampled with replacement, implemented with Poisson sampling.
 *                        If false, rows are sampled without replacement, implemented with Bernoulli sampling.
 * @param fraction Fraction of rows to generate.
 *                 Without replacement, this is the probability that each row is selected, so the
 *                 sample size follows a binomial distribution; fraction should lie in [0, 1].
 *                 With replacement, this is the expected number of times each row is selected;
 *                 fraction must be >= 0.
 * @param seed Seed for sampling. It initializes the random number generator.
 * @note This is NOT guaranteed to provide exactly the fraction of the count
 * of the given [[Dataset]].
 * The result size is usually close to fraction * total row count, but not exactly equal to it.
 * @group typedrel
 * @since 1.6.0
 */
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = {
  require(fraction >= 0, s"Fraction must be nonnegative, but got ${fraction}")
/**
 * Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
 *
 * @param withReplacement Sample with replacement or not.
 * @param fraction Fraction of rows to generate.
 *                 Governs how likely each row is to be sampled.
 * @note This is NOT guaranteed to provide exactly the fraction of the total count
 * of the given [[Dataset]].
 * The result size is usually close to fraction * total row count, but not exactly equal to it.
 * @group typedrel
 * @since 1.6.0
 */
def sample(withReplacement: Boolean, fraction: Double): Dataset[T] = {
  sample(withReplacement, fraction, Utils.random.nextLong)
}
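The doc comments above name two sampling schemes: Bernoulli sampling when rows are drawn without replacement, and Poisson sampling when they are drawn with replacement. The following is a minimal, self-contained sketch of both ideas on a plain in-memory list. It is not Spark's implementation; the class `SamplingSketch` and the methods `bernoulliSample`, `poissonSample`, and `nextPoisson` are hypothetical names introduced only for illustration, and the Poisson draw uses Knuth's multiplication method as a stand-in for whatever generator Spark uses internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of Spark.
class SamplingSketch {

    // Without replacement: keep each row independently with probability `fraction`.
    // Each row appears at most once in the result.
    static <T> List<T> bernoulliSample(List<T> rows, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T row : rows) {
            if (rng.nextDouble() < fraction) {
                out.add(row);
            }
        }
        return out;
    }

    // Draws one value from a Poisson distribution with the given mean,
    // using Knuth's multiplication method (adequate for small means).
    static int nextPoisson(Random rng, double mean) {
        double limit = Math.exp(-mean);
        int k = 0;
        double p = 1.0;
        do {
            k++;
            p *= rng.nextDouble();
        } while (p > limit);
        return k - 1;
    }

    // With replacement: emit each row a Poisson(fraction)-distributed number of times,
    // so a row may appear zero, one, or several times.
    static <T> List<T> poissonSample(List<T> rows, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T row : rows) {
            int copies = nextPoisson(rng, fraction);
            for (int i = 0; i < copies; i++) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Sizes vary with the seed; neither call is guaranteed to return exactly 5 rows.
        System.out.println(bernoulliSample(rows, 0.5, 42L));
        System.out.println(poissonSample(rows, 0.5, 42L));
    }
}
```

In both variants the size of the result is itself random, which is why the `@note` warns that the sample is not guaranteed to contain exactly `fraction` of the rows. Fixing the seed makes a run repeatable, which matches the purpose of the three-argument overload.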
df.createOrReplaceTempView("test_sample"); // Register a temporary view
df.sqlContext()
    // Add a random-number column and sort by it
    .sql("select *, rand() as random from test_sample order by random")
    .limit(2)       // Number of rows to keep, computed from the desired fraction
    .drop("random") // Drop the helper random column
    .show();