 A+
I have written a function in python to calculate Delta function in Gauss broadening, which involves 4level loops. However, the efficiency is very low, about 10 times slower than using Fortran in a similar way.
def Delta_Gaussf(Nw, N_bd, N_kp, hw, eigv): Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float) for w1 in range(Nw): for k1 in range(N_kp): for i1 in range(N_bd): for j1 in range(N_bd): if ( j1 >= i1 ): Delta_Gauss[w1][k1][i1][j1] = np.exp(pow((eigv[k1][j1]eigv[k1][i1]hw[w1])/width,2)) return Delta_Gauss
I have removed some constants to make it looks simpler.
Could any one help me to optimize this script to increase efficiency?
Simply compile it
To get the best performance I recommend Numba (easy usage, good performance). Alternatively Cython may be a good idea, but with a bit more changes to your code.
You actually got everything right and implemented a easy to understand (for a human and most important for a compiler) solution.
There are basically to ways to gain performance

Vectorize the code as @scnerd showed. This is usually a bit slower and more complex than simply compile a quite simple code, that only uses some for loops. Don't vectorize your code and than use a compiler. From a simple looping aproach this is usually some work to do and leads to a slower and more complex result. The advantage of this process is that you only need numpy, which is a standard dependency in nearly every Python project that deals with some numerical calculations.

Compile the code. If you have already a solution with a few loops and no other, or only a few non numpy functions involved this is often the simplest and fastest solution.
A solution using Numba
You do not have to change much, I changed the pow function to np.power and some slight changes to the way arrays accessed in numpy (this isn't really necessary).
import numba as nb import numpy as np #performancedebug info import llvmlite.binding as llvm llvm.set_option('', 'debugonly=loopvectorize') @nb.njit(fastmath=True) def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv): Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float) for w1 in range(Nw): for k1 in range(N_kp): for i1 in range(N_bd): for j1 in range(N_bd): if ( j1 >= i1 ): Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]eigv[k1,i1]hw[w1])/width,2)) return Delta_Gauss
Due to the 'if' the SIMDvectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss)
will be necessary). I also parallelized the function.
@nb.njit(fastmath=True,parallel=True) def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv): Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64) for w1 in nb.prange(Nw): for k1 in range(N_kp): for i1 in range(N_bd): for j1 in range(N_bd): Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]eigv[k1,i1]hw[w1])/width,2)) return Delta_Gauss
Performance
Nw = 20 N_bd = 20 N_kp = 20 width=20 hw = np.linspace(0., 1.0, Nw) eigv = np.zeros((N_kp, N_bd),dtype=np.float) Your version: 0.5s first_compiled version: 1.37ms parallel version: 0.55ms
These easy optimizations lead to about 1000x speedup.