MongoDB Java API slow reading performance


We are reading all documents from a collection in a local MongoDB, and performance is not very brilliant.

We need to dump all the data; don't be concerned about why, just trust that it's really needed and that there is no workaround possible.

We have 4 million documents that look like this:

    {
        "_id": "4d094f58c96767d7a0099d49",
        "exchange": "NASDAQ",
        "stock_symbol": "AACC",
        "date": "2008-03-07",
        "open": 8.4,
        "high": 8.75,
        "low": 8.08,
        "close": 8.55,
        "volume": 275800,
        "adj close": 8.55
    }

And, for now, we're using this trivial code to read it:

    import com.mongodb.Block;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.apache.commons.lang3.mutable.MutableInt;
    import org.bson.Document;

    MongoClient mongoClient = MongoClients.create();
    MongoDatabase database = mongoClient.getDatabase("localhost");
    MongoCollection<Document> collection = database.getCollection("test");
    MutableInt count = new MutableInt();
    long start = System.currentTimeMillis();
    // iterate the whole collection, counting documents as we go
    collection.find().forEach((Block<Document>) document -> count.increment() /* actually something more complicated */);
    long end = System.currentTimeMillis();

We're reading the whole collection in 16 seconds (250k rows/sec), which is really not impressive at all for such small documents. Bear in mind we want to load 800 million rows. No aggregation, map-reduce or similar is possible.

Is this as fast as MongoDB gets, or are there other ways to load documents faster (other techniques, moving to Linux, more RAM, settings...)?

 


You didn't specify your use case, so it's very hard to tell you how to tune your query. (E.g.: who would want to load 800 million rows at a time just for a count?)

Given your schema, I think your data is mostly read-only and your task is related to data aggregation.

Your current work is just: read the data (most likely your driver reads it in batches), then stop, then perform some calculation (and yes, the int wrapper only adds to the processing time), then repeat. That's not a good approach. The DB does not magically become fast if you do not access it the correct way.

If the computation is not too complex, I suggest you use the aggregation framework instead of loading everything into your RAM.

Some things you should consider to improve your aggregation (a sketch combining these points follows the list):

  1. Divide your dataset into smaller sets (e.g. partition by date, partition by exchange...). Add an index to support that partitioning, run the aggregation on each partition, then combine the results (a typical divide-and-conquer approach).
  2. Project only the fields you need.
  3. Filter out unnecessary documents (if possible).
  4. Allow disk use if you can't perform your aggregation in memory (i.e. if you hit the 100MB per-pipeline limit).
  5. Use built-in pipeline stages to speed up your computation (e.g. $count for your example).
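
For illustration, here is a minimal sketch of those points combined, assuming the fields from the sample document above; the `NASDAQ` filter, the projected fields and the database/collection names are placeholders for whatever your real partition and query look like:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;

    import java.util.Arrays;

    public class AggregationSketch {
        public static void main(String[] args) {
            try (MongoClient mongoClient = MongoClients.create()) {
                MongoCollection<Document> collection =
                        mongoClient.getDatabase("localhost").getCollection("test");

                Document result = collection.aggregate(Arrays.asList(
                                // point 1/3: operate on one partition only (the filter needs an index on "exchange")
                                Aggregates.match(Filters.eq("exchange", "NASDAQ")),
                                // point 2: carry only the fields you actually need between stages
                                Aggregates.project(Projections.include("stock_symbol", "close")),
                                // point 5: let the server do the counting
                                Aggregates.count("docs")))
                        // point 4: allow spilling to disk past the 100MB per-stage limit
                        .allowDiskUse(true)
                        .first();

                System.out.println(result);
            }
        }
    }

Run one such pipeline per partition (per exchange, per date range...) and combine the partial results in your application.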

If your computation is so complex that you cannot express it with the aggregation framework, then use mapReduce. It runs on the mongod process, so the data does not need to be transferred over the network into your application's memory.
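
Again only a hedged sketch, not your actual computation: the JavaScript map/reduce pair below merely counts documents per exchange, and the database/collection names are taken from the question's code:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MapReduceSketch {
        public static void main(String[] args) {
            try (MongoClient mongoClient = MongoClients.create()) {
                MongoCollection<Document> collection =
                        mongoClient.getDatabase("localhost").getCollection("test");

                // Both functions are JavaScript strings executed inside mongod,
                // so only the (small) reduced result crosses the network.
                String map = "function() { emit(this.exchange, 1); }";
                String reduce = "function(key, values) { return Array.sum(values); }";

                for (Document doc : collection.mapReduce(map, reduce)) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }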

Updated

So it looks like you want to do OLAP processing, and you're stuck at the ETL step.

You do not need to (and should avoid) loading the whole OLTP dataset into the OLAP side every time; you only need to load the new changes into your data warehouse. It is normal and acceptable that the first data load/dump takes more time.
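
As a hedged sketch of that incremental load, assuming the `date` field from the sample document can serve as the change marker (in practice you would track a proper watermark such as the last loaded `_id` or a modification timestamp):

    import com.mongodb.Block;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class IncrementalLoadSketch {
        public static void main(String[] args) {
            try (MongoClient mongoClient = MongoClients.create()) {
                MongoCollection<Document> collection =
                        mongoClient.getDatabase("localhost").getCollection("test");

                // Watermark remembered from the previous ETL run (placeholder value).
                // The sample schema stores dates as strings, so a lexicographic
                // comparison on ISO dates works; an index on "date" keeps it cheap.
                String lastLoadedDate = "2008-03-07";

                collection.find(Filters.gt("date", lastLoadedDate))
                        .batchSize(10_000) // fewer client/server round trips
                        .forEach((Block<Document>) doc -> {
                            // push the new/changed document into the data warehouse here
                        });
            }
        }
    }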

For the first-time load, you should consider the following points:

  1. Divide and conquer, again: break your data into smaller datasets (with a predicate such as date / exchange / stock label...); see the sketch after this list.
  2. Do parallel computation, then combine your results (you have to partition your dataset properly).
  3. Do the computation on batches instead of processing inside forEach: load a data partition, then compute on it, instead of computing document by document.
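
Here is a hedged sketch of those three points together, partitioning by `exchange` and reading the partitions in parallel; the partition field, the batch size of 10,000 and the 8 threads are assumptions you would have to tune for your own data and hardware:

    import com.mongodb.Block;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ParallelDumpSketch {
        public static void main(String[] args) throws InterruptedException {
            try (MongoClient mongoClient = MongoClients.create()) {
                MongoCollection<Document> collection =
                        mongoClient.getDatabase("localhost").getCollection("test");

                AtomicLong total = new AtomicLong();
                ExecutorService pool = Executors.newFixedThreadPool(8); // tune to cores / IO

                // Point 1: one partition per distinct exchange (needs an index on "exchange").
                for (String exchange : collection.distinct("exchange", String.class)) {
                    // Point 2: each partition is read by its own thread; MongoCollection is thread-safe.
                    pool.submit(() -> {
                        List<Document> batch = new ArrayList<>(10_000);
                        collection.find(Filters.eq("exchange", exchange))
                                .batchSize(10_000)
                                .forEach((Block<Document>) doc -> {
                                    batch.add(doc);
                                    if (batch.size() == 10_000) {
                                        // Point 3: process the whole batch here, not one document at a time
                                        total.addAndGet(batch.size());
                                        batch.clear();
                                    }
                                });
                        total.addAndGet(batch.size()); // leftover documents of the partition
                    });
                }

                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
                System.out.println("documents processed: " + total.get());
            }
        }
    }

Partitioning by date ranges works the same way; the important part is that each partition can be served by an index, so the parallel cursors don't all end up scanning the whole collection.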
