I know perspective division is done by dividing x,y, and z by w, to get normalized device coordinates. But I am not able to understand the purpose of doing that. Also, does it have anything to do with clipping?
Some details that complement the general answers:
The idea is to project a point (x,y,z) on screen to have (xs,ys,d). The next figure shows this for the y coordinate.
We know from school that
tan(alpha) = ys / d = y / z
This means that the projection is computed as
ys = d*y/z = y /w
w = z / d
This is enough to apply a projection. However in OpenGL, you want (xs,ys,zs) to be normalized device coordinates in [-1,1] and yes this has something to do with clipping.
The extrema values for (xs,ys,zs) represent the unit cube and everything outside it will be clipped. So a projection matrix usually takes into consideration the clipping limits (Frustum) to make a single transformation that, with the perspective division, simultaneously apply a projection and transform the projected coordinates along with the z to normalized device coordinates.