Thursday, August 3, 2023

SQL Server Temporal Tables and Usage in .NET Core

Recently I was working on a solution where there was a requirement to implement the audit trail feature. An audit trail is a security-relevant chronological record, set of records, and/or source/destination of records that provide documentary evidence of the sequence of activities that have affected at any time a specific operation, procedure, event, or device.

After going through the available options and comparing each one’s pros and cons, we have settled upon a SQL Server feature called Temporal Tables.

What are Temporal Tables?

Temporal tables (also known as system-versioned temporal tables) are a database feature that brings built-in support for providing information about data stored in the table at any point in time, rather than only the data that is correct at the current moment in time.  This database feature was made available from SQL Server 2016 onwards.

A system-versioned temporal table is a type of user table designed to keep a full history of data changes, allowing easy point-in-time analysis. This type of temporal table is referred to as a system-versioned temporal table because the period of validity for each row is managed by the system (that is, the database engine).

Every temporal table has two explicitly defined columns, each with a datetime2 data type. These columns are referred to as period columns. These period columns are used exclusively by the system to record the period of validity for each row, whenever a row is modified. The main table that stores current data is referred to as the current table, or simply as the temporal table.

In addition to these period columns, a temporal table also contains a reference to another table with a mirrored schema, called the history table. The system uses the history table to automatically store the previous version of the row each time a row in the temporal table gets updated or deleted. During temporal table creation, users can specify an existing history table (which must be schema compliant) or let the system create a default history table.

If you are planning to make use of Temporal Tables in .NET Core, this feature has been accessible since EF Core 6.0 (a short sketch follows the list below). The feature supports:

  • The creation of temporal tables using EF Core migrations
  • Transformation of existing tables into temporal tables, again using migrations
  • Querying historical data
  • Restoring data from some point in the past
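
As a minimal sketch of how this looks in EF Core 6.0 (the Employee entity, table name and connection string placeholder below are illustrative, not taken from any particular sample):

using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Salary { get; set; }
}

public class AppDbContext : DbContext
{
    public DbSet<Employee> Employees => Set<Employee>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Map the entity as a system-versioned temporal table; the migration generated from
        // this model creates the period columns and the history table.
        modelBuilder.Entity<Employee>()
            .ToTable("Employees", tableBuilder => tableBuilder.IsTemporal());
    }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlServer("<connection string>");
}

public static class TemporalQueries
{
    // Querying historical data: the state of the Employees table as of a given point in time.
    public static Employee FindAsOf(AppDbContext context, int id, DateTime utcPointInTime)
        => context.Employees
            .TemporalAsOf(utcPointInTime)
            .SingleOrDefault(e => e.Id == id);
}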

For more details on how to make use of Temporal Tables via a sample application in .NET Core, you can start with Microsoft’s announcement of this feature.

Friday, September 3, 2021

GZipStream issue

Recently I was required to include some functionality which consists of compression and decompression of data.  To accomplish this task I made use of the GZipStream object.

Initially, I was using the following code for compression:

using var gs = new GZipStream(mso, CompressionMode.Compress);
msi.CopyTo(gs);

After running this code, no compressed data was being produced.  After changing the code to the version below, compressed data was produced correctly.


using (var gs = new GZipStream(mso, CompressionMode.Compress))
{
    msi.CopyTo(gs);
}

The two snippets look functionally identical, yet they behave differently.  The likely reason is the point at which the GZipStream is disposed: with the using declaration in the first snippet, disposal (and therefore the final flush of the compressed data into mso) only happens at the end of the enclosing scope, while the using block in the second snippet disposes the stream as soon as the block ends.  If the compressed data is read from mso before the enclosing scope ends, the first form has not yet flushed its buffers, which would explain the empty output.
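
For completeness, here is a self-contained sketch of the working version; msi and mso are assumed to be the input and output MemoryStreams used in the snippets above:

using System.IO;
using System.IO.Compression;

public static byte[] Compress(byte[] data)
{
    using var msi = new MemoryStream(data);
    using var mso = new MemoryStream();

    using (var gs = new GZipStream(mso, CompressionMode.Compress))
    {
        msi.CopyTo(gs);
    }   // disposing the GZipStream here flushes the remaining compressed bytes into mso

    return mso.ToArray();
}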

Following the above, I ran some additional tests using CompressionMode.Decompress.  Using both code samples shown above, the issue wasn't reproduced during decompression.

Tuesday, May 18, 2021

Cosmos DB Bulk Delete

Recently we were required to do a bulk delete of data that exists in Cosmos DB. This was required as a one-time process to clean up orphaned data caused by legacy functionality in another system.

To provide an overview, the current structure for recording data in Cosmos DB is set up as shown below. In our instance the partition_key is the car_color.

Unfortunately, Cosmos DB doesn't provide out-of-the-box functionality to accommodate this requirement. To accomplish this task, one can make use of the following stored procedure, which needs to be added to the container from which the data needs to be deleted.

function bulkDelete(query) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        deleted: 0,
        continuation: true
    };
    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");
    tryQueryAndDelete();
    // Recursively runs the query w/ support for continuation tokens.
    // Calls tryDelete(documents) as soon as the query returns documents.
    function tryQueryAndDelete(continuation) {
        var requestOptions = {continuation: continuation};
        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
            if (err) throw err;
            if (retrievedDocs.length > 0) {
                // Begin deleting documents as soon as documents are returned from the query results.
                // tryDelete() resumes querying after deleting; no need to page through continuation tokens.
                //  - this is to prioritize writes over reads given timeout constraints.
                tryDelete(retrievedDocs);
            } else if (responseOptions.continuation) {
                // Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
                tryQueryAndDelete(responseOptions.continuation);
            } else {
                // Else if there are no more documents and no continuation token - we are finished deleting documents.
                responseBody.continuation = false;
                response.setBody(responseBody);
            }
        });
        // If we hit execution bounds - return continuation: true.
        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }
    // Recursively deletes documents passed in as an array argument.
    // Attempts to query for more on empty array.
    function tryDelete(documents) {
        if (documents.length > 0) {
            // Delete the first document in the array.
            var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
                if (err) throw err;
                responseBody.deleted++;
                documents.shift();
                // Delete the next document in the array.
                tryDelete(documents);
            });
            // If we hit execution bounds - return continuation: true.
            if (!isAccepted) {
                response.setBody(responseBody);
            }
        } else {
            // If the document array is empty, query for more documents.
            tryQueryAndDelete();
        }
    }
}

When executing this stored procedure, apart from the query parameter, one also needs to supply the partition_key.  Hence, the data deleted would only be within a specific partition and not across the whole container.  Should the user need to delete data from different partitions, the stored procedure needs to be executed multiple times, supplying the related partition_key each time.
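
As an illustration, the following is a hedged sketch of how the stored procedure could be invoked from .NET with the Microsoft.Azure.Cosmos SDK; the response class mirrors the responseBody shape of the stored procedure, while the partition key value and query passed in are assumptions for the example.

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

public class BulkDeleteResponse
{
    [JsonProperty("deleted")]
    public int Deleted { get; set; }

    [JsonProperty("continuation")]
    public bool Continuation { get; set; }
}

public static class BulkDeleteRunner
{
    public static async Task<int> BulkDeleteAsync(Container container, string partitionKeyValue, string query)
    {
        var totalDeleted = 0;
        var continuation = true;

        // The stored procedure stops when it hits execution bounds, so it is re-invoked until it
        // reports that nothing is left to delete for the given partition.
        while (continuation)
        {
            var response = await container.Scripts.ExecuteStoredProcedureAsync<BulkDeleteResponse>(
                "bulkDelete",
                new PartitionKey(partitionKeyValue),
                new dynamic[] { query });

            totalDeleted += response.Resource.Deleted;
            continuation = response.Resource.Continuation;
        }

        return totalDeleted;
    }
}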

Thursday, May 13, 2021

Cognitive Services: OCR vs Analyze Layout vs Analyze Invoice vs Analyze Forms

Recently I was required to do some analysis on some of the APIs offered within the Azure Cognitive Services suite.  Initially, the APIs to be analyzed were the ones listed below; however, one of them was later dropped for the reasons provided later.

  • OCR
  • Analyze Layout (v2.0)
  • Analyze Layout (v2.1 preview)
  • Analyze Invoice (v2.1 preview)
  • Analyze Forms (v2.0)
  • Analyze Forms (v2.1 preview)

When performing this analysis, a set of 8 different documents was used (a rough sketch of how a document can be submitted to the Analyze Layout API follows).
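
Below is a rough sketch of submitting a document to the Analyze Layout v2.1 REST endpoint and polling for its result; the endpoint path, Operation-Location polling and pageResults/readResults sections reflect the public v2.1 API, while the helper class, content type and status check are illustrative assumptions.

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class LayoutClientSketch
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task<string> AnalyzeLayoutAsync(string endpoint, string key, byte[] document)
    {
        using var content = new ByteArrayContent(document);
        content.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/pdf");

        using var submitRequest = new HttpRequestMessage(HttpMethod.Post, $"{endpoint}/formrecognizer/v2.1/layout/analyze")
        {
            Content = content
        };
        submitRequest.Headers.Add("Ocp-Apim-Subscription-Key", key);

        var submitResponse = await http.SendAsync(submitRequest);
        submitResponse.EnsureSuccessStatusCode();

        // The service responds with an Operation-Location header which is polled until the analysis completes.
        var operationLocation = submitResponse.Headers.GetValues("Operation-Location").First();

        while (true)
        {
            await Task.Delay(TimeSpan.FromSeconds(1));

            using var pollRequest = new HttpRequestMessage(HttpMethod.Get, operationLocation);
            pollRequest.Headers.Add("Ocp-Apim-Subscription-Key", key);

            var pollResponse = await http.SendAsync(pollRequest);
            var json = await pollResponse.Content.ReadAsStringAsync();

            // The result JSON contains analyzeResult.readResults (lines) and analyzeResult.pageResults (tables).
            if (json.Contains("\"succeeded\"") || json.Contains("\"failed\""))
            {
                return json;
            }
        }
    }
}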

OCR vs Analyze Layout (v2.0)
  • The OCR engine used within Analyze Layout is different from the one offered by the OCR API.  This conclusion was reached because, in one particular instance, some specific text was OCRed differently
  • When using scanned documents, Analyze Layout seemed to pick up noise which adds no value
  • On a couple of documents, the coordinates for data extracted by the two APIs were completely different
  • In both APIs, text which spans multiple lines gets extracted as separate lines
  • OCR doesn't produce any structure to define table data
  • The JSON result produced by Analyze Layout includes a new section, pageResults.  This is used to define table-structured data
  • On some occasions, Analyze Layout extracted table-structured data which did not map to any table structure
  • On some occasions, Analyze Layout extracted only parts of a table
  • When using a document with 2 pages, a table structure was identified on page 2 but not on page 1

OCR vs Analyze Layout (v2.1)
  • The OCR engine used within Analyze Layout is different from the one offered by the OCR API.  This conclusion was reached because, in one particular instance, some specific text was OCRed differently.  In fact, the value produced was also different from the value produced via v2.0
  • In both APIs, text which spans multiple lines gets extracted as separate lines
  • OCR doesn't produce any structure to define table data
  • The JSON result produced by Analyze Layout includes a new section, pageResults.  This is used to define table-structured data
  • On a number of occasions, Analyze Layout was capable of extracting table-structure-related data
  • When using a document with multiple pages, table structure data was extracted on both pages
  • On one specific document, a particular symbol was identified as a selection mark (this looks to be the identification of a checkbox)
  • In some instances within the table structure data, related text spanning multiple lines was amalgamated
  • There seems to be the introduction of appearance metadata related to the data being extracted.  However, this looked to be static, as the same style was observed across all the data

Analyze Invoice (v2.1)
  • The JSON result produced excludes the lines section that is included by the OCR and Analyze Layout APIs
  • The JSON result produced includes a new section, pageResults.  This is used to define table-structured data.  Compared to the Analyze Layout APIs, it excludes references to line information
  • The JSON result produced includes a new section, documentResults, to help classify content within the related document
  • The classification data consists of key-value pairs for single matches, but it is also capable of classifying table-structured data

Analyze Forms (v2.1)
To analyze this API, the FOTT tool was used.  This simplifies the generation of the JSON messages to be used for the respective APIs.  To understand how this tool works, the steps within this video were followed: Steps to use FOTT tool.

It is assumed / concluded that Analyze Invoice makes use of Analyze Forms, but with already pre-trained data.  Similarly, there are other existing APIs to cater for business cards, etc.

Analyze Forms (v2.0)
This API wasn't analyzed because the FOTT tool doesn't support this version; hence, generating the JSON messages manually would have been a complex task.  Also, there was a chance that the results would be similar to, or worse than, the results produced by v2.1.
 

Thursday, April 22, 2021

Dependency injection for background task

Recently I faced a requirement where, from a method, I needed to call an independent background task.  Below is a snippet.

public Task Handler()
{
    ... do something
    ... BackgroundTask()
    ... do something
}

The challenge encountered was that the background task needed to make use of objects which were instantiated within the constructor of the class via dependency injection.

public class MyService
{
    private readonly IMediator mediator;
    public MyService(IMediator mediator)
    {
        this.mediator = mediator;
    }
}

The background task was required to make use of the mediator object.  However, upon running this code, I started getting "Cannot access a disposed object" exceptions.

To overcome this problem, what was required is to create a scoped object within the background task.  Hence, we made use of IServiceScopeFactory, which helps in instantiating an object for the required service within the scope of the background task.

public class MyService
{
    private readonly IMediator mediator;
    private readonly IServiceScopeFactory scopeFactory;
    public MyService(IMediator mediator, IServiceScopeFactory scopeFactory)
    {
        this.mediator = mediator;
        this.scopeFactory = scopeFactory;
    }
    public void BackgroundTask()
    {
        using var scope = scopeFactory.CreateScope();
        var scopedMediator = scope.ServiceProvider.GetRequiredService<IMediator>();
        // ... use scopedMediator as needed within this scope
    }
}
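
For context, one possible way of starting the background work from Handler (this invocation style is an assumption; it is not shown in the original snippet) is the following:

public Task Handler()
{
    // ... do something

    // Fire-and-forget: the background work outlives the request, so BackgroundTask resolves its
    // own scoped services instead of relying on the ones injected into this request-scoped class.
    _ = Task.Run(() => BackgroundTask());

    // ... do something
    return Task.CompletedTask;
}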

Friday, April 9, 2021

Query retrieval optimization tips for ElasticSearch

When retrieving data from ElasticSearch, one approach to do this is via the following code.

var request = new SearchRequest<Document>(indexName);
var response = await elasticClient.SearchAsync<Document>(request);

Should the document structure be similar to the following, running the above search request would return all persisted data and map it to all the provided properties.

public class Document 
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<Tag> Tags { get; set; }
}

Optimizing retrieval by excluding unwanted properties

There might be instances where we need only specific data within this structure.  One could just remove the undesired properties from the Document class.  Though this would present the user with only the set of wanted properties, performance-wise it wouldn't bring any benefit.  This is because ElasticSearch would still return all the persisted data; the exclusion would just happen during the process of mapping the result to our structure.

To improve performance by avoiding bulky and unnecessary network traffic from the ElasticSearch server, we can optimize the search request to exclude unwanted properties.  This can be achieved via the following code.

var request = new SearchRequest<Document>(documentsIndexName)
{
      Source = new SourceFilter { Excludes = "tags" }
};

Using the above search request, we instruct ElasticSearch to exclude the tags property from the response.  Hence, the result sent by ElasticSearch would consist of the properties Id and Name.
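
As a side note, assuming the NEST client is being used, the same exclusion can also be expressed in a strongly typed form so that property renames are caught at compile time:

var request = new SearchRequest<Document>(documentsIndexName)
{
      // Infer.Field resolves the serialized field name for Tags instead of hard-coding "tags".
      Source = new SourceFilter { Excludes = Nest.Infer.Field<Document>(d => d.Tags) }
};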


Optimizing retrieval by including wanted properties

There might be instances where, instead of excluding data, it would be easier to define the set of wanted properties.  This can be achieved via the following code.

var request = new SearchRequest<Document>(documentsIndexName)
{
      Source = new SourceFilter { Includes = "Name" }
};

Using the above search request, we would be instructing ElasticSearch to include only the Name property.  All other properties would be set as null.

Thursday, April 8, 2021

Improve lists debugging in Visual Studio

How many times have we needed to debug code related to lists?  Visual Studio watch and inspection windows end up showing us this kind of information (i.e. the data types), and we then end up having to expand each element to analyze its contents.


It would help if, instead of showing the data type, the user were presented with some meaningful information.  To override this information, a developer can make use of the DebuggerDisplay attribute and include any custom information that he would like to see.
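
As an illustrative example (the class and properties below are hypothetical), the attribute can be applied as follows:

using System.Diagnostics;

[DebuggerDisplay("Id = {Id}, Name = {Name}")]
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}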


During debugging, the developer will be shown this custom information.  Should the user want to view all the properties within an element, he can still expand it as per the normal procedure.


We can further improve the above by removing the quotation marks.  This can be achieved by including the following.
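
A small sketch of one way to do this (using the same hypothetical class as above) is to append the nq (no quotes) format specifier to the expression inside the attribute:

using System.Diagnostics;

// The ",nq" (no quotes) specifier tells the debugger to render the string without quotation marks.
[DebuggerDisplay("Id = {Id}, Name = {Name,nq}")]
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}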


Monday, March 29, 2021

Brute force for best fit cell alignment

Problem Definition
After extracting text using an OCR library, we try to find and highlight useful information.  In the following diagram, the useful information is highlighted with a green border.  As can be seen in the diagram, the highlighted date exists more than once, but it is not aligned with the other information that we are interested in, hence producing a false positive match.

The above problem has been simplified as shown in the next diagram.  In this problem, we need to find the row which contains the largest number of distinct colors.  Should a cell color exist in the document but not in the final row to be identified, the first occurrence of that cell color is considered the correct value.  Should the final row have two or more cells of the same color, only the first occurrence of that cell color on the row is considered the correct value; all other cells are ignored.

The expected result would be the following:

It's important to note that these diagrams simplify the original problem shown initially.  In reality, each cell defines a rectangle whose corner coordinates (top, bottom, left, right) have different values.  Each rectangle also has different dimensions, though in general rectangles on the same row have almost the same height.  To determine whether two cells are on the same row or not, we make use of the top-right corner of the two cells being compared.  To allow the corner coordinate to fit its target, a margin of error of 1% is added to compensate for the differences in height.
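
Below is a hypothetical sketch of this check; the Cell type is illustrative, and the margin is assumed here to be 1% of the taller cell's height, since the exact reference value isn't spelled out above.

using System;

// Hypothetical cell rectangle; the names are illustrative and not taken from the original code.
public record Cell(double Top, double Bottom, double Left, double Right);

public static class RowMatcher
{
    // Two cells are treated as being on the same row when the y-coordinates of their top-right
    // corners differ by no more than the 1% margin.
    public static bool AreOnSameRow(Cell a, Cell b)
    {
        var height = Math.Max(a.Bottom - a.Top, b.Bottom - b.Top);
        var tolerance = height * 0.01;
        return Math.Abs(a.Top - b.Top) <= tolerance;
    }
}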

Problem Resolution
The approach taken to address this problem is via brute force.

The first step taken is that of grouping all the cell coordinates by color.  For the same example shown above, the coordinates are shown in the format (x, y).

The next step is to compare the y-coordinate of a cell of one color with the y-coordinates of the cells of all other colors.  When they have the same y-coordinate, we record that they are on the same row and move on to the next color.

Iteration 1: Yellow cell (2, 1) compared with Brown cell (5, 1) - same y-coordinate, hence matched
Iteration 2: Yellow cell (2, 1) compared with Violet cells - no matches
Iteration 3: Yellow cell (2, 1) compared with Grey cells - no matches
Iteration 4: Yellow cell (2, 1) compared with Green cells - no matches
Iteration 5: Yellow cell (2, 1) compared with Orange cells - no matches
Iteration 6: Yellow cell (2, 1) compared with Blue cells - no matches
Iteration 7: Yellow cell (2, 4) compared with Yellow cells - no matches
Iteration etc.

foreach (var sourceCellColor in cellColors)
{
    foreach (var sourceCellCoordinate in sourceCellColor.Coordinates)
    {
        foreach (var targetCellColor in cellColors)
        {
            foreach (var targetCellCoordinate in targetCellColor.Coordinates)
            {
                if (sourceCellCoordinate == targetCellCoordinate)
                {
                    // record that sourceCellColor has a match on targetCellColor at targetCellCoordinate
                    break;  // break out of the loop over targetCellColor.Coordinates
                }
            }
        }
    }
}

Optimization 1
Reducing the number of iterations.
  • In the outer loop there is no need to process the last color, as it would have already been compared via one of the inner loops.
  • In the inner loop there is no need to compare against the color currently being processed by the outer loop.  Colors already processed by the outer loop can also be skipped: if Yellow has already been compared to Brown, there is no need to compare Brown to Yellow again.
foreach (sourceCellColor in cellColors - lastColor)
{
    foreach (sourceCellCoordinate in sourceCellColor.Coordinates)
    {
        foreach (targetCellColor in (cellColors - already or currently being processed sourceCellColor))
        {
            foreach (targetCellCoordinate in targetCellColor.Coordinates)
            {
                etc
            }
        }
    }
}

Optimization 2
Ranking colors by minimum number of occurrences.

This would be beneficial especially if one of the first cells compared is the best fit, as we reduce the number of iterations (a small sketch of the ranking is shown below).  Also, eliminating processed colors continues to reduce the number of iterations.
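
A small sketch of this ranking, assuming each color exposes a Coordinates collection as in the pseudocode above:

// Process the colors with the fewest cell occurrences first (requires System.Linq).
var rankedColors = cellColors.OrderBy(color => color.Coordinates.Count).ToList();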

Optimization 3
While processing a color, we might be in a position where we have already found the best fit, and hence there is no need for additional brute-force processing.  To determine this, we need to know the number of remaining colors to be matched.  Should the number of matches within the current iteration equal the remaining number of colors to be matched, we can terminate processing.

Examples:
Iteration 1: Current color = Violet; Number of colors to be matched = 7
Iteration 2: Current color = Grey; Number of colors to be matched = 6
Iteration 3: Current color = Orange; Number of colors to be matched = 5; Using the sample data above, at this stage we would have found the best fit and processing can stop

Should the colors Grey and Violet have been inverted, we would have been able to find the best fit in the 2nd iteration.

Other notes
The logic for comparing cells isn't included here.  It involves more processing than just comparing two values.  The additional processing is required to adjust the cell coordinates due to image rotation caused by imperfect document scanning.  This rotation processing consists of additional mathematical transformations that incur a performance cost.

Sunday, April 18, 2010

Parallelism on a thread-safe data structure

Introduction
In this article we have put together some small tests on parallelism using the .NET Framework 4.0, with some further extensions from the previous article.  In this article we will be adding items to a list.  The list has an unknown initial size and no indexing.  In the previous article, data was added to a two-dimensional array with a known initial size that allows for indexing.

The construct used for parallelism in the tests below is the same as that used in the previous article.  The new construct will allow different threads to add data to a shared list.  Thus, this construct must guard against race conditions, in order to avoid having two items written simultaneously to the same location, with the second write overwriting the first.


Construct Details
ConcurrentBag Class


Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.


Methodology 
The test performed computes the factorial of the current iteration number and adds it to the list, taking timing benchmarks along the way (a sketch of the timing harness follows the factorial method below).  The tests were performed using:




  • standard sequential method, and
  • parallelism.

The method used to compute the factorial in both scenarios is:

public int Factorial(int number)
{
    if (number == 0)
    {
        return 1;
    }
    else
    {
        return number * Factorial(number - 1);
    }
}
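
For context, a minimal sketch of how the timings could have been collected (the Stopwatch harness below is an assumption; it is not part of the original post):

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var sequentialNumbers = GenerateNumbersListSequentially(10000);
stopwatch.Stop();
Console.WriteLine("Sequential: {0:F2}s", stopwatch.Elapsed.TotalSeconds);

stopwatch.Restart();
var parallelNumbers = GenerateNumbersListParallel(10000);
stopwatch.Stop();
Console.WriteLine("Parallel: {0:F2}s", stopwatch.Elapsed.TotalSeconds);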



The machine used has the following specifications: 

Intel(R) Core(TM)2 CPU 6400 @ 2.16, 2.14 GHz, 1.00GB of RAM




Results
These are the results achieved (time shown is in seconds):



Iterations      10      100     1000    10000   12000
Sequential      0       0       0.01    1.38    1.99
Parallel        0.03    0.02    0.03    0.79    1.31



Conclusions
As per the results produced, it can be concluded that the best performance was achieved using parallelism.  Unfortunately, I could not perform more tests with larger iteration counts, since a StackOverflowException was being thrown.

Code
Below is the code for all of the above-mentioned methods, using C#:

public List<int> GenerateNumbersListSequentially(int size)
{
    List<int> numbers = new List<int>();

    for (int x = 0; x < size; x++)
    {
        numbers.Add(Factorial(x));
    }

    return numbers;
}


public List<int> GenerateNumbersListParallel(int size)
{
    ConcurrentBag<int> numbers = new ConcurrentBag<int>();

    Parallel.For(0, size, x =>
    {
        numbers.Add(Factorial(x));
    });

    // ConcurrentBag does not preserve insertion order; materialize it into a list
    // (requires System.Linq and System.Collections.Concurrent).
    return numbers.ToList();
}