After doing text extraction by using an OCR library, we try to find and highlight useful information. In the following diagram, the useful information is being highlighted with a green border. As it can be seen in the diagram the date highlighted exists more than once, but it is not aligned with the other information that were are interested into and hence producing a false positive match.
The above problem has been simplified as shown in the next diagram. In this problem, we need to find the row which contains the most different number of colors. Should a cell color exist in the document but not on the final row to be identified, the first occurrence of that cell color would be considered as the correct value. Should the final row have two or more cells of the same color, only the first occurrence of that cell color on this row would be considered as the correct value. All other cells would be ignored.
It's important to note that these diagrams simplify the original problem shown initially. In reality each cell would define a rectangle with its corner coordinates (top, bottom, left, right) having different values. Each rectangle would also have different dimensions, though in general each rectangle on the same row would almost have the same height. To determine whether two cells are on the same row or not, we make use of the top-right corner of the two cells being compared. To get the corner coordinate to fit its target, a margin of error of 1% would be added to suffice the height differentiation deficiency.
Problem Resolution
The approach taken to address this problem is via brute force.
The first step taken is that of grouping by a color all the cell coordinates. For the same example shown above, the coordinates are shown in the format (x, y).
The next step taken was that of comparing the y-coordinate of a cell color with all other y-coordinate of all other cells colors. When they have the same y-coordinate, we keep track that they are on the same row and move to the next color.
Iteration 1: Yellow cell (2, 1) compare with Brown cell (5, 1) - same y-coordinate and hence matched
Iteration 2: Yellow cell (2, 1) compare with Violet cells - no matches
Iteration 3: Yellow cell (2, 1) compare with Grey cells - no matches
Iteration 4: Yellow cell (2, 1) compare with Green cells - no matches
Iteration 5: Yellow cell (2, 1) compare with Orange cells - no matches
Iteration 6: Yellow cell (2, 1) compare with Blue cells - no matches
Iteration 7: Yellow cell (2, 4) compare with Yellow cells - no matches
Iteration etc
foreach (sourceCellColor in cellColors)
{
foreach (sourceCellCoordinate in sourceCellColor.Coordinates)
{
foreach (targetCellColor in cellColors)
{
foreach (targetCellCoordinate in targetCellColor.Coordinates)
{
if (sourceCellCoordinate = targetCellCoordinate)
{
sourceCellColor has a match on targetCellColor with targetCellCoordinate;
break loop for targetCellColor.Coordinates;
}
}
}
}
}
Optimization 1
Reducing the number of iterations.
- On the first loop there is no need to make use of the last color as it would have already been processed and compared via one of the inner loops.
- On the inner loop there is no need to compare to currently color being processed by outside loop. Also, already processed colors from the outside loop would have already been processed. If I already compared Yellow to Brown, there is no need to compare again Brown to Yellow.
foreach (sourceCellColor in cellColors - lastColor)
{
foreach (sourceCellCoordinate in sourceCellColor.Coordinates)
{
foreach (targetCellColor in (cellColors - already or currently being processed sourceCellColor))
{
foreach (targetCellCoordinate in targetCellColor.Coordinates)
{
etc
}
}
}
}
Optimization 2
Ranking colors by minimum number of occurrences.
Optimization 3
During the processing of a colors, we might be in a position that we have already found the best fit and hence there is no need for additional brute force processing. To determine this, we need to know the number of remaining colors to be matched. Should within the current iteration we have a total match as the renaming number of colors to be matched, then we can terminate processing.
Examples:
Iteration 1: Current color = Violet; Number of colors to be matched = 7
Iteration 2: Current color = Grey; Number of colors to be matched = 6
Iteration 3: Current color = Orange; Number of colors to be matched = 5; Using the sample data above, at this stage we would have found the best fit and processing can stop
Should the colors Grey and Violet have been inverted, we would have been able to find the best fit in the 2nd iteration.
Other notes
The logic for comparing cells isn't included here. This involves more processing than just comparing two values. The additional processing required is to adjust the cell coordinates due to image rotation because of imperfect document scanning. This rotation processing consists of additional mathematical transformation that incur a performance charge.