What flow would the program go through?
Very roughly, the processing stages would be:
Step 1 is usually done using the classic Viola&Jones face detection algorithm. It's quite fast and reliable.
The faces found in step 1 may have different brightness, contrast and different sizes. To simplify processing, they are all scaled to the same size and exposure differences are compensated (e.g. using histogram equalization) in step 2.
There are many approaches to step 3. Early face detectors tried to find specific positions (center of the eyes, end of the nose, end of the lips, etc.) and use geometric distances and angles between those as features for recognition. From what I've read, these approaches were very fast, but not that reliable.
A more recent approach, "Eigenfaces", is based on the fact that pictures of faces can be approximated as a linear combination of base images (found through PCA from a large set of training images). The linear factors in this approximation can be used as features. This approach can also be applied to parts of the face (eyes, nose, mouth) individually. It works best if there the pose between all images is the same. If some faces look to the left, others look upwards, it won't work as well. Active appearance models try to counter that effect by training a full 3d model instead of flat 2d pictures.
Step 4 is relatively straightforward: You have a set of numbers for each face, and for the face images acquired during training, and you want to find the training face that's "most similar" to the current test face. That's what machine learning algorithms do. I think the most common algorithm is the support vector machine (SVM). Other choices are e.g. artificial neural networks or k-nearest neighbors. If the features are good, the choice of the ML algorithm won't matter that much.
Literature on the subject: