apache spark - how to filter() a PairRDD according to two conditions -
How can I filter a pair RDD when I have two filter conditions, one testing the key and the other testing the value? I'd like to see a portion of code, because the portion I used sadly didn't work:
JavaPairRDD filtering = pairRDD1.filter((x, y) -> x._1.equals(y._1) && x._2.equals(y._2));
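One thing worth noting about the attempt above: in Spark's Java API, filter takes a predicate over a single element, so for a pair RDD the lambda receives one Tuple2, not two separate arguments. A minimal plain-Java sketch of the intended shape, using java.util.Map.Entry pairs as a hypothetical stand-in for a pair RDD's contents (PairFilterSketch and filterPairs are illustrative names, not Spark API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PairFilterSketch {
    // Keep the pairs whose key AND value both match the wanted ones.
    // The predicate sees ONE pair at a time, so both conditions test the
    // same element -- this mirrors filter on a JavaPairRDD, where the
    // lambda receives a single Tuple2, not two arguments.
    static List<Map.Entry<String, Integer>> filterPairs(
            List<Map.Entry<String, Integer>> pairs, String key, int value) {
        return pairs.stream()
                .filter(p -> p.getKey().equals(key) && p.getValue() == value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for the contents of a pair RDD.
        List<Map.Entry<String, Integer>> pairs = List.of(
                new SimpleEntry<>("a", 1),
                new SimpleEntry<>("b", 2),
                new SimpleEntry<>("a", 2));
        System.out.println(filterPairs(pairs, "a", 2));
    }
}
```

In actual Spark Java code the same shape would be a single-argument lambda over the tuple, e.g. a predicate of the form `t -> cond(t._1) && cond(t._2)`.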
You can't use a regular filter for this, because filter checks one item at a time. You have to compare multiple items to each other and check which ones to keep. Here's an example that keeps the items that are repeated:
val items = List(1, 2, 5, 6, 6, 7, 8, 10, 12, 13, 15, 16, 16, 19, 20)
val rdd = sc.parallelize(items)
// Map each item to an (item, 1) pair, then sum the counts per key
val mapped = rdd.map { case x => (x, 1) }
val reduced = mapped.reduceByKey { case (x, y) => x + y }
// Keep only the items that occurred more than once
val filtered = reduced.filter { case (item, count) => count > 1 }
// Print out the results:
filtered.collect().foreach { case (item, count) =>
  println(s"keeping $item because it occurred $count times.")
}
It's not performant to do it this way, but it should give you an idea of the approach.
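Since the question was asked against the Java API, here is a hedged sketch of the same count-then-filter approach in plain Java streams, with no Spark dependency (RepeatedItems and repeated are illustrative names; groupingBy + counting stands in for the map/reduceByKey step of the Scala version):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RepeatedItems {
    // Count occurrences of each item, then keep those seen more than once.
    static Map<Integer, Long> repeated(List<Integer> items) {
        // groupingBy + counting plays the role of map(x => (x, 1))
        // followed by reduceByKey(_ + _) in the Spark version.
        Map<Integer, Long> counts = items.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Equivalent of filter { case (_, count) => count > 1 }.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 5, 6, 6, 7, 8, 10, 12, 13, 15, 16, 16, 19, 20);
        repeated(items).forEach((item, count) ->
                System.out.println("keeping " + item + " because it occurred " + count + " times."));
    }
}
```

On the sample data this keeps 6 and 16, each with a count of 2, matching the Scala example's output.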