Tamper Detection in Text Document

Authenticating text document images is difficult because of their binary nature and the clear separation between background and foreground, yet demand for it is growing in many applications. Most previous research in this field relies on inserting a watermark into the document; the drawback of these techniques is that changing pixel values in a binary document can introduce visually noticeable irregularities. In this paper, a new method is proposed for object-based text document authentication: a text document is signed by shifting individual words slightly left or right from their original positions so that the center of gravity of each line coincides with the midpoint of that line. Any modification, addition, or deletion of a letter, word, or line in the document will be detected.


Introduction
Binary images are commonly used for archiving text documents and logo images. It is often necessary to develop appropriate methods to verify their fidelity and integrity for security protection in various applications. So far, there have been only a few studies on data hiding in binary images and very few on the authentication of binary images. Wu, Tang, and Liu [1] embedded bits in image blocks by pattern matching; the method can be used both for data hiding and for image authentication. The method proposed by Pan, Chen, and Tseng [2] changed pixel values in image blocks to hide secret data by mapping block contents into the secret data. Tseng and Pan [3] modified the method of Pan, Chen, and Tseng [2] to control the image quality. In Koch and Zhao [4], a bit 1 or 0 was embedded in an image block by enforcing the ratio of the number of black pixels in the block to that of white ones to be larger or smaller than 1, respectively.

A text document contains objects such as characters, words, lines, and paragraphs. Embedding and extracting a watermark for authentication purposes by shifting these objects is known as object-based text document authentication [5]. An authentication method for text document images was proposed in which the maximum data hiding capacity of the host document is used for watermark embedding. The maximum data hiding capacity of the document is calculated before the watermark is embedded; the watermark is then generated from the owner's secret key according to the calculated capacity and embedded within the whole document to ensure its authenticity and integrity [6].

For the sake of convenience, we propose a new content-based scheme. The authentication process does not need the original document. We study the characteristics of the center of gravity (centroid) of binary images. Experimental results show that the centroid can be used to represent the binary image and to decide the authenticity of the image effectively. The rest of the paper describes the definition of the centroid, the steps for calculating it, cropping the body of binary text from the document image by applying a trimming process, image authentication, and alteration localization through the Line Authentication Algorithm.

* Department of Computer Science, College of Science for Women, University of Baghdad

The proposed method

Center of Gravity
The center of gravity (CG) is the center of an object's weight distribution, the point at which the force of gravity can be considered to act. It is the point about which an object is in perfect balance no matter how it is turned or rotated around that point. For a finite set of point masses, the CG may be defined as the average of the positions weighted by mass, that is, (sum of weight × position) / (sum of weights). To find the CG of a two-dimensional object consisting of basic elements:
1. Calculate the weights of the basic elements.
2. Choose a starting point, called the datum. This point is placed arbitrarily at one end of the object.
3. Measure the distance from the datum to the center of each element.
4. Multiply each distance by the respective weight. This gives the moment of each element.
5. Add the weights of all the elements.
6. Divide the total moment (the sum of the moments) by the total weight. This is the distance from the datum to the center of gravity.
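The six steps above can be sketched in a few lines of Python. This is only an illustration for a one-dimensional arrangement of elements; the function name `center_of_gravity` and the sample weights and distances are assumptions made for the example, not values from the paper.

```python
def center_of_gravity(elements):
    """Return the distance from the datum to the center of gravity.

    `elements` is a list of (weight, distance_from_datum) pairs
    (a hypothetical representation chosen for this sketch).
    """
    total_weight = sum(weight for weight, _ in elements)          # step 5
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    # Steps 3 and 4: moment of each element = weight * distance from the datum.
    total_moment = sum(weight * distance for weight, distance in elements)
    return total_moment / total_weight                             # step 6


# Three elements measured from a datum at the left end of the object.
elements = [(2.0, 1.0), (1.0, 4.0), (3.0, 6.0)]
print(center_of_gravity(elements))  # 4.0
```

Here the total moment is 2·1 + 1·4 + 3·6 = 24 and the total weight is 6, so the CG lies 4 units from the datum.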
We can apply the characteristics of the CG to a binary text image. The digital text image is a two-dimensional binary image g(x, y) whose pixel values (weights) are ones or zeros, corresponding to the light and dark points of the original image. Each element is assigned the value 1 if the corresponding cell contains a portion of a character and the value 0 otherwise. A text document image is represented by the following function:

$$g(x, y) = \begin{cases} 1 & \text{if cell } (x, y) \text{ contains a portion of a character} \\ 0 & \text{otherwise} \end{cases} \qquad x = 0, 1, 2, \ldots, D, \quad y = 0, 1, 2, \ldots, L$$

where D and L represent the width and length of the document in pixels. The CG along each axis is found with the formulas

$$X_{cg} = \frac{\sum x\,w}{\sum w} \tag{1}$$

$$Y_{cg} = \frac{\sum y\,w}{\sum w} \tag{2}$$

where w is the weight of the pixel at position (x, y) and the sums run over all pixels. The point at which the lines x = X_cg and y = Y_cg intersect is the center of gravity (centroid).
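As a minimal sketch of equations (1) and (2), the following Python function computes the centroid of a small binary image stored as a list of rows. The function name `image_centroid`, the list-of-rows representation, and the choice to measure y upward from the bottom row are assumptions made for this illustration.

```python
def image_centroid(image):
    """Return (x_cg, y_cg) for a binary image given as a list of rows of 0/1.

    Assumptions of this sketch: row 0 is the top row, x is the column index
    counted from the left edge, and y is counted upward from the bottom row.
    """
    height = len(image)
    total_weight = 0
    moment_x = 0
    moment_y = 0
    for row_index, row in enumerate(image):
        y = height - 1 - row_index          # distance from the bottom row
        for x, weight in enumerate(row):
            total_weight += weight
            moment_x += x * weight          # numerator of equation (1)
            moment_y += y * weight          # numerator of equation (2)
    if total_weight == 0:
        raise ValueError("image contains no on-pixels")
    return moment_x / total_weight, moment_y / total_weight


image = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
print(image_centroid(image))  # (1.75, 1.0)
```

The four on-pixels have x-positions 1, 1, 2, 3 and y-positions 2, 1, 1, 0, which gives 7/4 and 4/4 respectively.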

Image trimming
A text document image is an area that contains a binary image. This area may include additional rows and columns that carry no data (blank space), so these empty rows and columns are eliminated by scanning inward from each of the four outside margins and stopping at the first occurrence of an on-pixel on each side; see Fig. (1).
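The trimming step can be sketched as follows. The function name `trim` and the list-of-rows image representation are assumptions made for this example; the sketch also assumes a rectangular image in which on-pixels have the value 1.

```python
def trim(image):
    """Remove empty rows and columns around the text body of a binary image.

    `image` is a rectangular list of rows of 0/1 (an assumed representation).
    Returns a new image bounded by the first on-pixel found from each side.
    """
    # Rows and columns that contain at least one on-pixel.
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return []                      # blank page: nothing to keep
    width = len(image[0])
    cols = [c for c in range(width) if any(row[c] for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


page = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(trim(page))  # [[0, 1, 1], [1, 0, 0]]
```

Only the outer margins are removed; empty rows or columns inside the text body are kept, because they affect pixel positions relative to the datum.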

Image authentication
After the digital image is produced, the trimming procedure is applied to remove the empty areas surrounding the text image. Equations (1) and (2) are then applied, using the left margin and the bottom margin as the datum respectively, to obtain the center of gravity of the whole image in both directions; the result is two numbers that represent the coordinates of the centroid. These two numbers are inserted in the header of the file.
To check the authenticity of the image, the recipient of the file repeats the previous steps and compares the calculated centroid with the centroid embedded in the header of the file. If they are not identical, the document is forged, since each deletion or modification of a word or letter in the document changes the centroid coordinates. The final centroid is computed after applying the Line Authentication Algorithm, which is discussed in section 2.4.
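The sign-and-verify cycle described above might look like the following sketch. The paper does not specify the file format, so the header is modeled here as a plain dictionary; the names `centroid`, `sign`, and `is_authentic` are hypothetical, and the image is assumed to be already trimmed.

```python
from fractions import Fraction


def centroid(image):
    """Centroid of a trimmed binary image (list of rows of 0/1).

    x is measured from the left margin and y from the bottom margin.
    Exact fractions are used so that equality comparison is meaningful.
    """
    height = len(image)
    total = sum(sum(row) for row in image)
    if total == 0:
        raise ValueError("image contains no on-pixels")
    moment_x = sum(x * w for row in image for x, w in enumerate(row))
    moment_y = sum((height - 1 - r) * w
                   for r, row in enumerate(image) for w in row)
    return Fraction(moment_x, total), Fraction(moment_y, total)


def sign(image):
    """Build a header (assumed here to be a dict) holding the centroid."""
    return {"centroid": centroid(image)}


def is_authentic(image, header):
    """Recompute the centroid and compare it with the stored one."""
    return centroid(image) == header["centroid"]


original = [[1, 1, 0, 1],
            [1, 0, 0, 1]]
header = sign(original)

tampered = [[1, 1, 0, 1],
            [1, 0, 0, 0]]   # one on-pixel removed

print(is_authentic(original, header))  # True
print(is_authentic(tampered, header))  # False
```

Exact rational arithmetic is used so that "identical" can be tested with plain equality; a real file format would need an agreed numeric encoding for the stored pair.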

Localization of modification
To localize an alteration that may have occurred in the document, we apply equation (1) to each line of the document after finding the boundaries of its individual words; this can be accomplished with an edge detection technique. The proposed method requires leaving a double space between the words of the document for two reasons: first, to use these spaces in defining the threshold for word segmentation and boundary detection; second, to provide tolerance for decreasing these spaces (to no less than a predefined threshold) or increasing them, as explained later. A threshold value has to be established to perform the separation. The segmentation of a line's words is shown in Fig. (2). We find the center of gravity of the line along the x-axis by calculating the weight and center of gravity of each word and then finding their distances from the datum (the left margin). Then we compute the mid length of the intended line. If the distance from the reference line to the mid length is not identical to the center of gravity of the line, we shift the words forward or backward by changing the length of the spaces between the words of the intended line so that the center of gravity along the x-coordinate coincides with the midpoint of the line length. The procedure is given in the Line Authentication Algorithm.
To check the authenticity of the document's lines, the recipient of the file repeats the previous steps and compares the calculated center of gravity along the x-axis with the mid length of each line. If they are not identical, the line has been altered. If one or more lines have been deleted or added, the centroid of the whole document changes. Program body:
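The per-line computation can be sketched as below. A text line is reduced to a list of column weights (the number of on-pixels in each pixel column), which is an assumption of this example, as are the function names `segment_words` and `line_difference`, the gap-based segmentation rule, and the convention that the mid length of a line with n columns is (n - 1)/2.

```python
from fractions import Fraction


def segment_words(cols, thr):
    """Split a line into words separated by at least `thr` empty columns.

    Returns a list of (start, end) column indices, both inclusive.
    """
    words = []
    start = None      # first column of the current word
    last_on = None    # most recent non-empty column
    for x, weight in enumerate(cols):
        if weight == 0:
            continue
        if start is None:
            start = x
        elif x - last_on - 1 >= thr:      # gap wide enough to split words
            words.append((start, last_on))
            start = x
        last_on = x
    if start is not None:
        words.append((start, last_on))
    return words


def line_difference(cols, thr):
    """Return CGL - MD for one line, measured from the left margin."""
    total_weight = 0
    total_moment = Fraction(0)
    for start, end in segment_words(cols, thr):
        w = sum(cols[start:end + 1])                          # word weight
        cgw = Fraction(sum((x - start) * cols[x]
                           for x in range(start, end + 1)), w)  # word CG
        x_i = start + cgw                 # word CG distance from the datum
        total_weight += w
        total_moment += w * x_i
    if total_weight == 0:
        raise ValueError("line contains no on-pixels")
    cgl = total_moment / total_weight     # center of gravity of the line
    md = Fraction(len(cols) - 1, 2)       # mid length of the line
    return cgl - md


unbalanced = [2, 3, 0, 0, 1, 4, 0, 2]       # two words, gap of 2 columns
balanced = [2, 3, 0, 0, 0, 1, 4, 0, 2]      # same words, gap widened by 1
print(line_difference(unbalanced, thr=2))   # -1/12
print(line_difference(balanced, thr=2))     # 0
```

In the first line the CG lies slightly left of the midpoint; widening the space between the two words by one column moves the second word right and brings the difference to zero, which is the kind of adjustment the signing step performs.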

1  SET THR
2  SET i = 0, j = 0, NS = 0
3  EXECUTE segmentation of lines for binary text image
4  WHILE not end of lines DO
5    EXECUTE segmentation of words for intended line
6    WHILE not end of spaces DO
     ...
11   WHILE not end of words DO
12     compute W(i)
13     compute CGW(i)
14     compute X(i)
15     i = i + 1
16   ENDWHILE
17   compute MD
18   compute CGL
19   compute ∑S(j) for j = 0 to NS-1
20   Difference = CGL - MD
21   IF Difference > 0 AND Difference > (∑S(j) - NS * THR) THEN
22     shift last word to next line : GOTO step 2
23   ELSE IF Difference < 0 AND ABS(Difference) > (D - 2 * MD) THEN
24     shift last word to next line : GOTO step

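On the recipient side, the line check described before the program body can be sketched independently of the signing procedure. Each line is again represented by its column weights; this representation, the function names `line_cg` and `altered_lines`, and the optional `tolerance` parameter are assumptions of this illustration and are not part of the algorithm listed above.

```python
from fractions import Fraction


def line_cg(cols):
    """Center of gravity of one line along x, measured from the left margin."""
    total = sum(cols)
    if total == 0:
        raise ValueError("line contains no on-pixels")
    return Fraction(sum(x * w for x, w in enumerate(cols)), total)


def altered_lines(lines, tolerance=0):
    """Return the indices of lines whose CG does not match the mid length.

    `tolerance` (an assumption of this sketch) allows a small deviation;
    the default of 0 requires the two values to be identical.
    """
    flagged = []
    for index, cols in enumerate(lines):
        md = Fraction(len(cols) - 1, 2)          # mid length of the line
        if abs(line_cg(cols) - md) > tolerance:
            flagged.append(index)
    return flagged


signed = [
    [2, 3, 0, 0, 0, 1, 4, 0, 2],
    [1, 0, 0, 1],
    [3, 0, 0, 0, 3],
]
tampered = [
    [2, 3, 0, 0, 0, 1, 4, 0, 2],
    [1, 0, 0, 2],                # extra on-pixel in the second line
    [3, 0, 0, 0, 3],
]
print(altered_lines(signed))    # []
print(altered_lines(tampered))  # [1]
```

Because the CG of a line computed from its words equals the CG computed directly from its pixel columns, the check does not need to repeat the word segmentation.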
Digital text documents are easily reproduced and modified without leaving any trace of manipulation. The great importance of text documents in our lives led me to seek a novel method for verifying the genuineness of text documents. A physical property of materials, the center of gravity, was exploited and applied to text documents. Very encouraging results were obtained when the suggested algorithm was applied.