scala flatMap or filter which one is cheaper

  • A+

I want to use keyed stream, I was wondering which approach is better in high throughput lets's say I have the following

trait A{   val id: Int    def isFoo: Boolean } case class Foo(id: Int) extends A{   override def isFoo = true } case class Bar(id: Int) extends A{   override def isFoo = false } val as = List[A](Foo, Bar) val fs: List[Foo] = as.flatMap{   case Foo => Some(Foo)   case _ => None } 

I can have the following stream

val src: DataStream[A] = env.fromElements(as:_*) 

I have these options :

  1. src.filter(_.isFoo).keyBy( processing...)
  2. src.filter(_.isInstanceOf[Foo]).keyBy( processing...)//here we can remove the isFoo method
  3. src.flatMap{ case f:Foo => Some(f) case _ => None }.keyBy( processing...) The only reason that I have option 3 will allow me to remove the id field from Bar and from the trait itself because it will create a stream of Foo which is more straight forward. However filtering by field seems to be much lighter option than mapping specially in high throughput. What do you think ?


Short answer: it almost certainly doesn't matter, and it's a waste of your time to worry about which of these specific ways of writing a filtering operation is fastest. The ...more processing... part will almost certainly be the bottleneck in your application, and you should write the version of the filter that you find clearest and easiest to maintain.

Slightly longer answer: even if this were somehow the strange case where one of these options was markedly better than the others in a way that actually affected the performance of your application, you're far better off writing your own benchmarking for your own particular case than asking strangers on Stack Overflow to speculate about your use case.

If you really, really want a stranger on Stack Overflow to speculate about your use case, the one thing I would say is that flatMap-ing into Option is probably not the best choice here, since it results in a lot of unnecessary allocations. Here's a super-quick benchmark for some alternative implementations (using Vector instead of Flink, so take the results with a grain of salt):

import org.openjdk.jmh.annotations._  @State(Scope.Thread) class FilterBenchmark {   val values: Vector[A] = (0 to 100).map {     case i if i % 2 == 0 => Foo(i)     case i => Bar(i)   }.toVector    @Benchmark def withIsFoo: Vector[A] = values.filter(_.isFoo)    @Benchmark def withIsInstanceOf: Vector[A] = values.filter(_.isInstanceOf[Foo])    @Benchmark def withFlatMap: Vector[A] = values.flatMap {     case f @ Foo(_) => Some(f)     case _ => None   }    @Benchmark def withFlatMapTypeMatch: Vector[A] = values.flatMap {     case f: Foo => Some(f)     case _ => None   }    @Benchmark def withCollect: Vector[A] = values.collect {     case f @ Foo(_) => f   }    @Benchmark def withCollectTypeMatch: Vector[A] = values.collect {     case f: Foo => f   } } 

And some results (on 2.12.8):

Benchmark                              Mode  Cnt        Score      Error  Units FilterBenchmark.withCollect           thrpt   10  1359035.689 ± 2749.815  ops/s FilterBenchmark.withCollectTypeMatch  thrpt   10  1361227.743 ± 2337.850  ops/s FilterBenchmark.withFlatMap           thrpt   10   113074.826 ±  288.107  ops/s FilterBenchmark.withFlatMapTypeMatch  thrpt   10   113188.419 ±  262.826  ops/s FilterBenchmark.withIsFoo             thrpt   10  1254404.326 ± 3997.759  ops/s FilterBenchmark.withIsInstanceOf      thrpt   10  1257725.670 ± 6115.773  ops/s 

The moral of the story (to my eye) is that flatMap is off by an order of magnitude and clearly bad, but there's not enough of a difference between the other choices to make performance relevant to the decision.


:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: