I have two processes in Python that need to communicate data. These two processes are launched by
`multiprocessing.Process`. That data usually consists of dictionaries, lists, and so on (just the kind of data that JSON allows). For that reason, I have thought about either using JSON strings and loading and dumping the data on each side, or doing the same with pickle.
Processes are able to send each other strings or bytes.
~~Which would be more recommendable?~~ I have different concerns when it comes to choosing either one. I would like to know what main advantages each option has over the other (in terms of speed, flexibility, security, etc.).
Above all, I'm concerned about the cumbersome security and data-structure drawbacks that pickle supposedly has. Is this true?
If anyone has any alternative that is better, I would also like to know about it.
To me, it seems that in practice you could use either one. (~~Since processes are not able to share memory in Python~~, as you mentioned, you have to communicate with plain data. Therefore, it has to be interpreted and loaded into memory on each communication.) Edit: though that is true as far as the OS is concerned, Python provides `multiprocessing.Queue`, which lets processes communicate through queues if you launch them via `multiprocessing.Process`. There are other ways to pass data as well: you may use a `Pipe` rather than a socket. But in the end the data has to have a format, and I think that is closer to what you are asking here. (Edit 2: see the last point in the Summing up section.)
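As a minimal sketch of that built-in option (the `worker` function and the payload here are made up for illustration), a plain dict can travel through a `multiprocessing.Pipe` without you serializing anything by hand:

```python
import multiprocessing

def worker(conn):
    # Receive a plain Python dict from the parent process;
    # multiprocessing pickles/unpickles it for us.
    data = conn.recv()
    data["processed"] = True
    conn.send(data)              # send the modified dict back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send({"values": [1, 2, 3]})
    result = parent_conn.recv()
    p.join()
    print(result)                # {'values': [1, 2, 3], 'processed': True}
```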
1. Speed
You can run a speed test, if that matters to you, with the following code:
```python
import json, pickle
import timeit

d = ...  # define here the data you want to test

def json_op():
    json.loads(json.dumps(d))

def pickle_op():
    pickle.loads(pickle.dumps(d))

print(timeit.timeit(json_op, number=100000),
      timeit.timeit(pickle_op, number=100000))
```
You can change the number of repetitions. I don't know how each approach scales with the size of the data, but it looks to me like pickle is faster (in an example on my computer it was about twice as fast).
2. Flexibility (object types)
Also, pickle lets you pass some types of data that JSON won't: you can define your own classes, functions, etc., and they can be stored in a pickle. Essentially, any object in Python (and that is almost like saying anything in Python) can be stored in a pickle byte string.
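A quick sketch of that difference (the `Point` class is just an example): pickle round-trips an instance of a user-defined class, while `json.dumps` rejects it outright.

```python
import json
import pickle

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)

# Pickle round-trips the instance (the class must be importable
# in the process that loads it).
restored = pickle.loads(pickle.dumps(p))
print(restored.x, restored.y)   # 1 2

# JSON has no representation for arbitrary objects.
try:
    json.dumps(p)
except TypeError as e:
    print("json.dumps failed:", e)
```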
3. Flexibility (across applications)
If you want to later use that data from another application written in a different programming language, or pass it across the internet, you had better use JSON, since pickle is, logically, only available to Python scripts.
4. Security
As @Lost very well mentioned, JSON is more secure than pickle in a Python environment, since it can only contain certain types of data:
In JSON, values must be one of the following data types:
- a string
- a number
- an object (JSON object)
- an array
- a boolean
- null
Since unpickling can run arbitrary code (Python's duck-typing philosophy places no restrictions here), your data is subject to code injection if it comes from foreign or untrusted sources. However, this is not a factor to consider if you will simply be passing data between two processes you control; it only matters if you rely on user input or data from other origins.
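A minimal illustration of why untrusted pickles are dangerous (the `Malicious` class is a deliberately harmless toy): unpickling can trigger arbitrary callables, while `json.loads` only ever builds plain data.

```python
import json
import pickle

class Malicious:
    # __reduce__ tells pickle how to reconstruct the object; here it
    # instructs the unpickler to call print() -- a real attacker could
    # just as easily have it call os.system or anything else.
    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling!",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)   # executes print(...) as a side effect

# json.loads, by contrast, can only produce dicts, lists, strings,
# numbers, booleans and None -- never function calls.
data = json.loads('{"a": [1, 2, 3]}')
print(data)
```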
5. Size (when saving into a file)
AFAIK, JSON data is usually smaller, since its simple syntax and reduced set of data types do not require verbose serialization; pickle dumps, on the other hand, take up more space. This matters when reading from or writing to a file (if you need to at all). The relevance of this point also depends on how you communicate your data: if you read and write from/to a file on each communication, it may be worth considering.
You can check data length easily by:
```python
import json, pickle

d = ...  # define here the data you want to test

json_str = json.dumps(d)
pickle_str = pickle.dumps(d)
print(len(pickle_str), len(json_str))
```
Summing up
Each of them has advantages and disadvantages; you have to evaluate which is better in your circumstances and stick to it. Knowing each of their disadvantages beforehand can help you mitigate them somehow. In general, I would not encourage pickle in a production environment, but it depends on the circumstances, and it can be useful and secure if well implemented.
There are many other data serialization alternatives. I have not tried them, so I can't really recommend or discourage them; however, you may want to check them out. Here are some:
- Google Protocol Buffers (cross-platform support: Java, Python, Objective-C, C++, Go, Ruby, C#)
- YAML, Marshal, BSON, XML, etc.

You may really want to check this out: it's an in-depth, speed-wise comparison of data serialization approaches for Python, to solve bottlenecks when it comes to IO.
(Pass raw Python objects if you are using queues: with `multiprocessing.Queue`, you can put elements on the queue that are Python lists or dictionaries, even NumPy arrays or Pandas DataFrames, with no need to serialize and de-serialize the data yourself.) Edit (thanks to Martijn Pieters): this functionality uses pickle underneath, so it does not make a serious difference, apart from the usability that the `Queue` object may provide.
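You can see that pickle is the limiting factor there: what you can put on a `multiprocessing.Queue` is exactly what pickle accepts. A small sketch (the example objects are arbitrary):

```python
import pickle

# Plain data pickles fine, so it can go through a Queue:
blob = pickle.dumps({"a": [1, 2, 3]})
print(type(blob))                     # <class 'bytes'>

# Lambdas are not picklable, so they cannot go through a Queue:
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError, TypeError) as e:
    print("cannot go through a Queue:", e)
```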
If you pass your data across different processes, don't load and dump it at every hop; keep it in bytes/string format and load/dump only at the very endpoints. Each time you load a serialized object you copy it into memory, unless you apply a cache decorator to the function (Is there a decorator to simply cache function return values?), but this is basically the same as keeping the string, because it will still be copied process-wise.