Eyewitnesses’ Visual Recollection in Suspect Identification Using a Facial Appearance Model

Facial recognition has long been an active field of imaging science. With recent progress in computer vision, it is now extensively applied in various areas, especially law enforcement and security. The human face is a viable biometric that can be used effectively in both identification and verification. Thus far, regardless of the facial model and metrics employed, the main shortcoming of facial recognition is that it requires a facial image against which a comparison is made. Closed-circuit television (CCTV) cameras and a facial database are therefore always needed in an operational system. Unfortunately, the last few decades have seen the emergence of asymmetric warfare, in which acts of terrorism are often committed in secluded areas with no camera installed, possibly by persons whose photos have never been kept in any official database prior to the event. During subsequent investigations, the authorities thus have to rely on traumatized and frustrated witnesses, whose testimonial accounts of the suspect’s appearance are dubious and often misleading. To address this issue, this paper presents an application of a statistical appearance model of the human face to assist suspect identification based on a witness’s visual recollection. An online prototype system was implemented to demonstrate its core functionalities. Both visual and numerical assessments reported herein indicate the potential benefits of the system for the intended purpose.

1 School of Computer Engineering, Suranaree University of Technology, Nakhon Ratchasima, Thailand, 30000
2 Faculty of Science and Industrial Technology, Prince of Songkla University, Surat Thani campus, Surat Thani, Thailand, 84000
* Corresponding author: phorkaew@sut.ac.th
* ORCID ID: 0000-0003-0879-7125

With the emergence of fourth-generation, or asymmetric, warfare, threats from terrorism and self-radicalization have escalated during the past decade (1,2). These have led to single-handed terrorist acts in disguise, which typically occur in secluded parts of urban areas where there is no closed-circuit television (CCTV) or other advanced visual technology. Investigation, as a result, inevitably has to rely on witness accounts, if any. Unfortunately, right after such a traumatic event, witnesses are severely disturbed both physically and emotionally, and their accounts of the events or suspects are therefore often vague and confusing (3,4). A skilled sketch artist, who is typically recruited to draft the suspect's face, thus has to work from witnesses' distorted verbal descriptions of the suspect's characteristics. Extensive psychological research over the past 30 years agrees that, on many occasions, such testimonies contradict each other, rendering the sketches even more unreliable (5,6). A crime investigation based only on this information could very well lead to false arrest or, even worse, unjust accusation. To address parts of this matter, several studies have suggested adopting facial recognition to accelerate the automatic search for a required face (7). Recent advances in the field have alleviated these problems and, to an extent, resolved the Single Sample per Person (SSPP) issue, where only one photo per person can be collected (8,9,10). It should be noted, however, that those algorithms cannot operate in the absence of a digital face image to compare with.
Moreover, in cases where the unsubs (unknown subjects) are newly surfaced and their pictures have never been recorded by any law enforcement entity, there will not be any facial image in the database. Typical facial recognition schemes normally fail when queried with an unknown image and instead return the most similar one (7,11). This could lead to wrongly informed intelligence, unlawful arrests or civil lawsuits. Unlike existing works, which attempt to automatically search a database for a face from a given sample, this study proposes an intuitive means of converting and combining eyewitnesses' visual recollections of the suspect's appearance into model parameters, which in turn are used to synthesize a 3D face from a statistical facial model built from a training set. Model generalization is the primary characteristic of the scheme, enabling the instantiation of such an unseen sample (12,13). With the proposed scheme, the testimonies of more than one witness can also be fused by a simple linear combination. The reconstructed model can subsequently be compared with those in the database, from which the N most similar samples are retrieved for confirmation purposes. Military and law enforcement personnel can hence use this visual information as one piece of intelligence for subsequent investigation and serial crime prevention. The remainder of this paper is organized as follows. The Material and Method section discusses related techniques found in the literature and explains the methodology in more detail. Then, to elucidate the value of the proposed scheme, the Results and Discussion section demonstrates and discusses both visual and numerical assessments of the resultant facial model. Finally, concluding statements and prospects for future work and applications are provided.

Material and Method:
Conventional computerized facial recognition dates back to the early 1990s (7). It relied on a statistical method called Principal Component Analysis (PCA). More specifically, PCA analyzed the covariance of pixels bounded within a common rectangular frame. This covariance matrix was then decomposed into orthogonal bases, called eigenvectors, and their associated eigenvalues. The resultant bases were used to describe a facial instance by a few dominant parameters. This technique, hence called Eigenface, and its variants have been applied in many economical biometric applications, owing to their effectiveness and reasonable consumption of computing resources. However, their main drawback is the use of a static rectangular domain of coinciding pixels.
This representation disregards the correspondence of anatomical landmarks throughout the dataset. The Eigenface assumption that pixels at the same locations belong to an identical facial structure is not necessarily true, i.e., the outline and appearance of a face and its relative peripheral distances vary across ethnic groups and individuals. It is thus only suitable for low-resolution applications, where this adverse effect is not pronounced (14). Instead of comparing pixels within a rectangular grid as in the Eigenface approach, this paper adopted a deformable frame technique called the Active Appearance Model (AAM) (15). It expresses the variation found in a training set based on a deformable facial shape and the texture enclosed by its convex hull. PCA is similarly applied to span such variation, but an instance is reconstructed by a two-stage projection, i.e., onto statistical shape and texture frames. Thanks to the compactness, generalization ability and specificity of AAM (13), this paper applied the technique and proposed a computer-assisted suspect identification system based on eyewitnesses' visual recollection. A detailed account of its design and implementation is given in the subsequent sections. The proposed scheme consists of three essential components, i.e., building a 3D facial instance, AAM reconstruction and online application development, as outlined in Fig. 1. (19). The application was then deployed to an on-premises virtual machine (VM) server with a 64-bit CPU, running Microsoft® Windows Server 2016. Each component is described in more detail in the following subsections.

Building a 3D Facial Instance
Without loss of generality, participants of the Thai ethnic group were recruited in this study. Each subject's face was photographed with parallel stereo cameras of known baseline distance ($B$) and focal length ($f$). The configuration of these cameras is illustrated in Fig. 2. Each 3D point $P$ located at distance $Z$ from the baseline is projected onto the left ($O_L$) and right ($O_R$) cameras at coordinates $p_L$ and $p_R$, respectively. If this correspondence can be established for all pixels in both images, a dense map of their depth from the imaging plane can be trivially computed from the respective disparity (20), as expressed in Eq. 1.
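Eq. 1 is not reproduced in this text. For rectified, parallel stereo cameras the usual relation is Z = fB / (x_L - x_R), where x_L and x_R are the horizontal image coordinates of p_L and p_R. The following minimal Python sketch assumes that this standard form is the one intended; the function name and the numerical values are hypothetical.

```python
def depth_from_disparity(x_left, x_right, focal_length_px, baseline):
    """Depth Z of a point seen at horizontal pixel coordinates x_left and
    x_right in a rectified, parallel stereo pair: Z = f * B / (x_L - x_R)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline / disparity


# Hypothetical values: f = 800 px, B = 0.12 m, one landmark seen at
# x = 420 px in the left image and x = 380 px in the right image.
z = depth_from_disparity(420.0, 380.0, focal_length_px=800.0, baseline=0.12)
print(f"depth = {z:.2f} m")
```

The depth is returned in the same unit as the baseline, and nearer points produce larger disparities.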
Determining an accurate set of correspondences in an actual scene is, however, a challenging task and remains an active field of investigation (21,22,23). Instead, this study used an efficient face detection module in Dlib (18,24) to determine important landmarks. With this module, the landmarks consist of 68 control points outlining the eyes, eyebrows, nose, mouth and jawline of a given face, as depicted in Fig. 3a and 3b. Since all points are strictly ordered, the correspondence between a given pair of face images can be accurately identified. The respective dense depth map (Fig. 3c) was evaluated by first triangulating these landmarks using the Delaunay algorithm (25,26), which preserves an undistorted and stable mesh configuration. Each interior depth was then linearly interpolated from the three nodal values using the Barycentric coordinates method. Sixty-eight control points, each expressed in two Cartesian axes, i.e., $(x, y) \in \mathbb{R}^2$, resulted in a model of 68 × 2 = 136 degrees of freedom (DoF). To capture all the plausible variations described by this model (7,12), a total of at least 136 participants (i.e., facial instances) were thus required. This process was carried out for each subject involved in this study, producing a total of 136 three-dimensional faces.

AAM Reconstruction
The first step involved bringing all faces into a common reference frame. To this end, a similarity transformation matrix (rotation, scaling and translation) aligned a given face to this frame by means of Procrustes analysis (27). First, the centroid of each shape was moved to the origin. Rotation and scaling were then computed using an iterative least-squares method. AAM has been well accepted, mostly for its compact representation with high generalization ability and specificity. In this study, an appearance model of these normalized faces was derived by first computing the averaged control points and the respective covariance matrix.
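The alignment step can be illustrated with a minimal 2D Procrustes sketch in pure Python. It is an illustration under stated assumptions rather than the implementation used in this study: shapes are lists of (x, y) landmarks in identical order, scale is normalized to unit norm, the optimal rotation is obtained in closed form, and all function names are hypothetical.

```python
import math


def center(shape):
    """Translate a list of (x, y) landmarks so that its centroid is at the origin."""
    n = len(shape)
    cx = sum(x for x, _ in shape) / n
    cy = sum(y for _, y in shape) / n
    return [(x - cx, y - cy) for x, y in shape]


def normalize_scale(shape):
    """Scale a centered shape to unit norm."""
    norm = math.sqrt(sum(x * x + y * y for x, y in shape))
    return [(x / norm, y / norm) for x, y in shape]


def align(shape, reference):
    """Align `shape` to `reference` by translation, scaling and the
    least-squares optimal rotation (closed form in 2D)."""
    s = normalize_scale(center(shape))
    r = normalize_scale(center(reference))
    num = sum(sx * ry - sy * rx for (sx, sy), (rx, ry) in zip(s, r))
    den = sum(sx * rx + sy * ry for (sx, sy), (rx, ry) in zip(s, r))
    theta = math.atan2(num, den)
    c, sn = math.cos(theta), math.sin(theta)
    return [(c * x - sn * y, sn * x + c * y) for x, y in s]


def generalized_procrustes(shapes, iterations=5):
    """Iteratively align all shapes to their evolving mean shape."""
    reference = normalize_scale(center(shapes[0]))
    aligned = []
    for _ in range(iterations):
        aligned = [align(s, reference) for s in shapes]
        n = len(aligned)
        mean = [(sum(a[i][0] for a in aligned) / n,
                 sum(a[i][1] for a in aligned) / n)
                for i in range(len(reference))]
        reference = normalize_scale(mean)
    return aligned, reference


# Toy usage: a square and a rotated, scaled and shifted copy of it.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
angle = math.radians(30)
moved = [(2.5 * (math.cos(angle) * x - math.sin(angle) * y) + 3.0,
          2.5 * (math.sin(angle) * x + math.cos(angle) * y) - 1.0)
         for x, y in square]
aligned_shapes, mean_shape = generalized_procrustes([square, moved])
```

After alignment, the two toy shapes coincide, so the mean shape equals the centered, unit-norm square.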
A Karhunen-Loeve (KL) expansion was then applied to express their point distribution model (PDM), as given in Eq. (2).

$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}_s \mathbf{b}_s \tag{2}$$

where $\bar{\mathbf{x}}$ is the averaged control points (mean face), $\mathbf{P}_s$ is the matrix of orthogonal modes of variation, whose columns are eigenvectors sorted in descending order of their eigenvalues, and $\mathbf{b}_s$ is the vector of shape parameters describing a face in normalized coordinates. Generally, each element of $\mathbf{b}_s$ was limited to within 2-3 standard deviations ($\sigma$), so that it could capture 97-99% of facial variations. The transformation matrix was subsequently applied to the normalized face to produce one of actual size, orientation and position. The second step involved warping each face and its texture image onto the mean one, again using Barycentric coordinates over the mean face triangulation (mesh). Interior pixels with identical Barycentric coordinates were then indexed and reordered into a vector, and a KL expansion was similarly applied to express the gray-level distribution model (GDM), as given in Eq. (3).

$$\mathbf{g} = \bar{\mathbf{g}} + \mathbf{P}_g \mathbf{b}_g \tag{3}$$

where $\bar{\mathbf{g}}$ is the averaged texture patch (mean texture), $\mathbf{P}_g$ is the matrix of orthogonal modes of variation, and $\mathbf{b}_g$ is the vector of texture parameters describing the underlying texture within the mean face mesh. To consolidate these models into a unified one, the resultant shape and texture parameters were first concatenated, following Eq. (4).

$$\mathbf{b} = \begin{bmatrix} \mathbf{W}_s \mathbf{b}_s \\ \mathbf{b}_g \end{bmatrix} \tag{4}$$

where $\mathbf{W}_s$ is a diagonal matrix weighting the shape parameters so that their units match those of the texture parameters. It was computed by measuring the root mean square (RMS) of changes in $\mathbf{b}_s$ per unit change in $\mathbf{b}_g$. A KL expansion was then applied to the appearance parameter $\mathbf{b}$ in Eq. (4) to produce the final appearance model, i.e., $\mathbf{b} = \mathbf{Q}\mathbf{c}$, where $\mathbf{Q}$ and $\mathbf{c}$ are the eigenvector matrix and the parameters of appearance, respectively. Given this model, a facial instance can be reconstructed (15), as per Eqs. (5), (6) and (7).

$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}_s \mathbf{W}_s^{-1} \mathbf{Q}_s \mathbf{c} \tag{5}$$

$$\mathbf{g} = \bar{\mathbf{g}} + \mathbf{P}_g \mathbf{Q}_g \mathbf{c} \tag{6}$$

$$\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_s \\ \mathbf{Q}_g \end{bmatrix} \tag{7}$$

It is evident from Eqs. (5) and (6) that the appearance parameters $\mathbf{c}$ can be used to synthesize a unified instance of both shape and texture, and hence a plausible facial instance. Depending on the predefined total coverage, the first few parameters can capture the principal modes of facial variation found in the training set. The next subsection discusses an intuitive way of utilizing these parameters in the proposed scheme.
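As an illustration of how a linear statistical model of this kind turns a few parameters into an instance, the sketch below evaluates "mean plus weighted modes" for a toy shape model and clamps each parameter to ±3 standard deviations. It is a minimal pure-Python sketch; the toy mean, modes, eigenvalues and function names are all hypothetical and do not come from the trained model.

```python
import math


def clamp_parameters(b, eigenvalues, limit=3.0):
    """Clamp each parameter b_i to +/- limit * sqrt(lambda_i)."""
    clamped = []
    for b_i, lam in zip(b, eigenvalues):
        bound = limit * math.sqrt(lam)
        clamped.append(max(-bound, min(bound, b_i)))
    return clamped


def synthesize(mean, modes, b):
    """Return mean + P b, where modes[k] is the k-th eigenvector (column of P)."""
    instance = list(mean)
    for b_k, mode in zip(b, modes):
        for i, p in enumerate(mode):
            instance[i] += b_k * p
    return instance


# Toy model (hypothetical numbers): 2 landmarks -> 4 coordinates, 2 modes.
mean_shape = [0.0, 0.0, 1.0, 0.0]
modes = [[0.5, 0.5, 0.5, 0.5],       # orthonormal eigenvectors
         [0.5, -0.5, 0.5, -0.5]]
eigenvalues = [4.0, 1.0]             # variance captured by each mode

# The first requested value exceeds 3 * sqrt(4) = 6 and is clamped to 6.
b = clamp_parameters([10.0, -1.0], eigenvalues)
shape = synthesize(mean_shape, modes, b)
print(b, shape)
```

The same pattern applies to the texture and combined appearance models, with the corresponding mean vector and eigenvector matrix substituted.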

Online Web Application Development
Interaction between the facial appearance model and a witness's visual recollection was mediated by an online application interface. To this end, a 3-tier web application was developed with the ASP.NET framework, following the workflow illustrated in Fig. 4 and explained as follows.
The data tier consisted of two databases, i.e., the 3D facial image and model databases, stored in YAML and structured-binary formats, respectively. Owing to the simplicity of this design, neither an SQL data source nor a database engine was required. The application tier, consisting of the AAM class, its interfaces and auxiliaries, was written on top of EmguCV and related API libraries. Its primary function was to synthesize a face instance from given model parameters via a web service, with communications exchanged as JSON. Other tasks included facial matching and fusion based on model parameters. Finally, the client tier was written mainly in ASP.NET and JavaScript. It employed JavaScript 2D imaging and Three.js for visualization. Its core functions were (i) creating a synthetic facial instance from a given set of model parameters, (ii) matching a synthetic face with those in the face database that have the closest parameter values, and (iii) fusing sets of testimonial parameters from different witnesses to produce a face with general agreement (a.k.a. consensus exemplars). With this framework, an eyewitness can try adjusting the parameter vector (c) and simultaneously learn the relationship between their adjustment and the morphological changes captured by the model. Meanwhile, visual recollection serves as a comparison metric, measuring the difference between what the witness can recall and what is computationally synthesized by AAM. The process is repeated until sufficiently strong confidence has subconsciously formed and the synthetic face resembles what the witness actually saw.
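The fusion of testimonies by linear combination can be sketched as a weighted average of the witnesses' parameter vectors. The prototype itself was built on ASP.NET, so the Python below is purely illustrative; the function name, the optional per-witness weights and the JSON field names (`caseId`, `c`) are assumptions, not the actual service schema.

```python
import json


def fuse_testimonies(parameter_sets, weights=None):
    """Weighted linear combination of appearance-parameter vectors c."""
    if not parameter_sets:
        raise ValueError("at least one parameter vector is required")
    n = len(parameter_sets[0])
    if any(len(c) != n for c in parameter_sets):
        raise ValueError("all parameter vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(parameter_sets)   # equal trust in every witness
    total = sum(weights)
    return [sum(w * c[i] for w, c in zip(weights, parameter_sets)) / total
            for i in range(n)]


# Hypothetical testimonies, shortened to three parameters for readability.
witness_a = [1.0, -2.0, 0.0]
witness_b = [3.0, 0.0, 2.0]

consensus = fuse_testimonies([witness_a, witness_b])
weighted = fuse_testimonies([witness_a, witness_b], weights=[3.0, 1.0])

# Hypothetical JSON body that a client could send to the synthesis service.
request_body = json.dumps({"caseId": "CASE-001", "c": consensus})
print(request_body)
```

With equal weights the result is the plain mean of the testimonies; unequal weights could reflect, for example, differing confidence reported by each witness.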

Results and Discussion:
This section reports preliminary experiments and evaluation, i.e., numerical and visual assessments of each component of the proposed system, as well as a demonstration of the prototype application. Relevant discussions of the findings are also provided. Figure 5 illustrates detected facial control points overlaid on two examples. Since the degrees of freedom were 68 (points) × 2 (x and y coordinates) = 136, the 300 faces on which the Dlib model was originally trained were sufficient to capture the plausible morphology also found in our dataset. Reconstructions of three 3D facial models are also shown. It should be noted that, though possible, visualizing a 3D face at this stage did not require dense interpolation of depth values, because the process was automatically executed within the WebGL pipeline through textured 3D polygon rendering (19).
After control points had been detected in all training face images, they were aligned by means of Procrustes analysis. The mean face was then calculated by averaging these control points over all training samples. A constrained Delaunay algorithm (25,26) was then employed to triangulate the mean control points, as shown in Fig. 6. This triangular mesh formed a common frame of reference for computing the Barycentric coordinates of a given interior pixel when building the AAM. An appearance model was built following Eq. (5), and Fig. 7 illustrates the first three principal modes of appearance variation within ±2σ found in the dataset. It is notable that gender transition was intrinsically established in the most dominant mode, while the appearances of the eyes, jawline and cheeks were captured by the 2nd and 3rd modes. A weighted combination of these modes can reconstruct a plausible, though possibly untrained, face without the eyewitness having to provide any verbal description of individual facial components.

Figure 7. Intrinsic appearance variations captured by the first three principal model parameters (modes)
To avoid imposing much burden on the eyewitness, the total variation should be describable by as few model parameters as possible. In addition to the above visual assessment, the compactness of the facial shape and appearance models is plotted in Fig. 8 and 9, respectively. It is evident that 67% of the total variation can be described by only 3 shape and 6 appearance parameters, respectively. To cover 97% of the total variation, however, 18 shape and 126 appearance parameters would be required. In practice, it was found that 16 appearance parameters, spanning about 75% of the total variation, offered a reasonable balance between dataset coverage and user experience, and this number was hence used in the design of the prototype system. It is worth noting that higher-order variations may further improve reconstruction accuracy in typical automated facial recognition, but they hardly contribute to incrementally stimulating eyewitness recollection as intended.

The next experiment was carried out to emulate an actual setting, whereby an eyewitness interactively adjusts the model parameters to match the appearance with the face they had witnessed. Sixteen parameters were randomly generated within ±2σ, and a 3D face was reconstructed accordingly.
The two best-matched samples (with respect to the Mahalanobis distance) were also queried from the database. This experiment was repeated three times, and the results are shown in Fig. 10.
Corresponding parametric values in each sample are listed in Table 1.

Figure 10. Randomly synthesized models (left), best-match (middle) and 2nd-best-match (right) faces

Figure 11 shows a screenshot of the developed software. It illustrates the four typical steps of the suspect identification process. Suppose, for instance, that a criminal act happened in an area with no CCTV installed and the unsub had fled the scene. The crime was subsequently reported to the authorities, and an investigation was conducted around the area and the escape route. An eyewitness present at the scene was shown the application displaying a neutral face (mean appearance) and asked to adjust 16 parameters using the slide bars (1). Since the frontend ran on AJAX, whenever the parametric controls were updated, the values were submitted to the server API, and the new facial model was created and immediately rendered on the WebGL display (2). It is important that, during this process, the witness was not guided in any way regarding the implication of each parameter and was expected to learn subconsciously from their interactive experience. Once they were confident that the face appeared similar to what they recalled, they could press the button (3). The parameters were then submitted to the server and compared with those in the facial database. The two closest matching faces were immediately returned for confirmation. Once completed, the resultant appearance parameters were stored and indexed with a unique case ID so that they could later be fused with those from other witnesses' accounts (4).
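The database query by Mahalanobis distance can be sketched as follows. The sketch assumes that the appearance parameters are uncorrelated (as PCA coefficients are), so that the covariance matrix is diagonal and holds the model eigenvalues. The database contents, face identifiers and function names are hypothetical.

```python
import math


def mahalanobis(c_query, c_stored, eigenvalues):
    """Mahalanobis distance between two parameter vectors, assuming a
    diagonal covariance whose entries are the model eigenvalues."""
    return math.sqrt(sum((q - s) ** 2 / lam
                         for q, s, lam in zip(c_query, c_stored, eigenvalues)))


def best_matches(c_query, database, eigenvalues, n=2):
    """Return the IDs of the n stored faces closest to the query parameters."""
    ranked = sorted(database,
                    key=lambda face_id: mahalanobis(c_query, database[face_id], eigenvalues))
    return ranked[:n]


# Hypothetical two-parameter model and face database.
eigenvalues = [4.0, 1.0]
database = {
    "face_A": [2.0, 0.0],
    "face_B": [0.0, 1.5],
    "face_C": [-4.0, -2.0],
}
query = [1.0, 0.5]   # parameters chosen by the witness

matches = best_matches(query, database, eigenvalues)
print(matches)
```

Dividing each squared difference by its eigenvalue prevents the most dominant modes from overwhelming the ranking merely because their parameters span a wider numerical range.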

Conclusion:
Despite the rapid development of facial recognition research, especially that based on SSPP, existing methods are not suitable in scenarios where there is no image against which the model can be compared. This paper therefore employed a compact, specific yet generalizable model, namely AAM, and developed a computer-assisted suspect identification scheme that presents an eyewitness with a synthetic face while they interactively adjust model parameters to match its appearance with their visual recollection. The prototype system represents a first step toward a collaborative scheme for querying, profiling and distributing images of criminal offenders.