I am currently using the threading module in Python and got the following:
In [1]:
import threading
threading.activeCount()
Out[1]:
4
Now on my terminal, I use lscpu and learned there are 2 threads per core and I have access to 4 cores:
kitty@FelineFortress:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Stepping: 3
CPU MHz: 800.000
BogoMIPS: 5786.45
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Hence, I should have a lot more than 4 threads to access. Is there a python function I can use to increase the number of cores I am using (with example) to get more than 4 threads? Or even something to type on the terminal when launching ipython notebook like below:
ipython notebook n_cores=3
You can use multiprocessing to allow Python to use multiple cores. There is one big caveat: all the data you pass between Python processes must be picklable or passed via inheritance. On Windows a new Python interpreter is spawned for each process, while on Unix systems processes can be forked; this has notable performance implications on Windows.
A basic example using multiprocessing, taken from "Python Module of the Week", is as follows:
import multiprocessing

def worker():
    """worker function"""
    print('Worker')

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
When executed, it outputs (the five lines may appear in any order, since the processes run independently):
Worker
Worker
Worker
Worker
Worker
Multiprocessing allows you to do independent calculations on different cores, so CPU-bound tasks that need little inter-process communication can execute much more rapidly than in a single process.
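To actually collect results from the worker processes, multiprocessing.Pool is usually more convenient than managing Process objects by hand. A minimal sketch, using a hypothetical square function as the CPU-bound work:

import multiprocessing

def square(x):
    """Toy CPU-bound function (hypothetical example)."""
    return x * x

if __name__ == '__main__':
    # Pool defaults to os.cpu_count() workers; 4 is used here
    # to match the 4 physical cores from the lscpu output above.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

pool.map splits the input across the workers and reassembles the results in order, so both the arguments and return values must be picklable.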
You should also realize that threading in Python does not improve the performance of CPU-bound code. It exists for convenience, such as maintaining the responsiveness of a GUI during long calculations, and for I/O-bound work. The reason is Python's Global Interpreter Lock, or GIL: although CPython threads are real native OS threads, only one of them can execute Python bytecode at any given moment.
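A quick illustration of the GIL: two threads running a pure-Python CPU-bound loop both compute the correct result, but the total wall time is roughly the sequential time, not half of it. A sketch (the function and constant are made up for the demonstration; exact timings vary by machine):

import threading

def count_down(n, results, idx):
    # Pure-Python CPU-bound loop; the GIL serializes its bytecode
    total = 0
    while n > 0:
        total += n
        n -= 1
    results[idx] = total

N = 2_000_000
results = [0, 0]
t1 = threading.Thread(target=count_down, args=(N, results, 0))
t2 = threading.Thread(target=count_down, args=(N, results, 1))
t1.start(); t2.start()
t1.join(); t2.join()
# Both threads finish with the correct sum N + (N-1) + ... + 1,
# but only one of them runs Python bytecode at any instant.
print(results[0] == results[1] == N * (N + 1) // 2)  # True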
This is still very much applicable, and will be for the foreseeable future. The CPython implementation uses the following definition for reference counting:
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
Notably, updating ob_refcnt is not thread-safe, so a global interpreter lock must be implemented to allow only one thread at a time to execute with Python objects, avoiding data races that would corrupt memory.
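You can observe ob_refcnt from Python via sys.getrefcount. Note that the call itself holds a temporary reference to its argument, so the reported count is one higher than you might expect:

import sys

x = []
before = sys.getrefcount(x)  # typically 2: the name x, plus the call's own argument
y = x                        # bind a second name to the same list
after = sys.getrefcount(x)
print(after - before)  # 1: one additional reference from y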
There are numerous tools that try to side-step the global interpreter lock, in addition to multiprocessing (which, on Windows, requires spawning a complete new interpreter rather than forking, making it slow to start and less amenable to improving performance).
Your simplest solution is Cython. Simply cdef a function without any internal Python objects, and release the GIL with the with nogil statement.
A simple example taken from the documentation, which shows how to release, and temporarily re-acquire, the GIL:
from cython.parallel import prange

cdef int func(Py_ssize_t n):
    cdef Py_ssize_t i

    for i in prange(n, nogil=True):
        if i == 8:
            with gil:
                raise Exception()
        elif i == 4:
            break
        elif i == 2:
            return i
CPython has a GIL, while Jython and IronPython do not. Be careful, though: numerous C libraries for high-performance computing may not work with IronPython or Jython (SciPy flirted with IronPython support, but dropped it long ago, and it will not run on a modern Python version).
MPI, or Message Passing Interface, is a high-performance message-passing standard for languages like C and C++ that enables efficient parallel computation, and MPI4Py provides Python bindings for it. For efficiency, you should only use MPI4Py with NumPy arrays.
An example from their documentation is:
from mpi4py import MPI
import numpy
def matvec(comm, A, x):
    m = A.shape[0]  # local rows
    p = comm.Get_size()
    xg = numpy.zeros(m*p, dtype='d')
    comm.Allgather([x, MPI.DOUBLE],
                   [xg, MPI.DOUBLE])
    y = numpy.dot(A, xg)
    return y