Tuesday, May 18, 2021

Cosmos DB Bulk Delete

Recently we were required to bulk delete data stored in Cosmos DB. This was a one-time process to clean up orphaned data caused by legacy functionality in another system.

To provide an overview of the current structure: data recorded in Cosmos DB is stored in a partitioned container. In our instance the partition_key is the car_color.

Unfortunately, Cosmos DB doesn't provide out-of-the-box functionality to accommodate this requirement. To accomplish this task, one can make use of the following stored procedure, which needs to be added to the container from which the data is to be deleted.

function bulkDelete(query) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        deleted: 0,
        continuation: true
    };
    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");
    tryQueryAndDelete();
    // Recursively runs the query w/ support for continuation tokens.
    // Calls tryDelete(documents) as soon as the query returns documents.
    function tryQueryAndDelete(continuation) {
        var requestOptions = {continuation: continuation};
        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
            if (err) throw err;
            if (retrievedDocs.length > 0) {
                // Begin deleting documents as soon as documents are returned from the query results.
                // tryDelete() resumes querying after deleting; no need to page through continuation tokens.
                //  - this is to prioritize writes over reads given timeout constraints.
                tryDelete(retrievedDocs);
            } else if (responseOptions.continuation) {
                // Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
                tryQueryAndDelete(responseOptions.continuation);
            } else {
                // Else if there are no more documents and no continuation token - we are finished deleting documents.
                responseBody.continuation = false;
                response.setBody(responseBody);
            }
        });
        // If we hit execution bounds - return continuation: true.
        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }
    // Recursively deletes documents passed in as an array argument.
    // Attempts to query for more on empty array.
    function tryDelete(documents) {
        if (documents.length > 0) {
            // Delete the first document in the array.
            var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
                if (err) throw err;
                responseBody.deleted++;
                documents.shift();
                // Delete the next document in the array.
                tryDelete(documents);
            });
            // If we hit execution bounds - return continuation: true.
            if (!isAccepted) {
                response.setBody(responseBody);
            }
        } else {
            // If the document array is empty, query for more documents.
            tryQueryAndDelete();
        }
    }
}

When executing this stored procedure, apart from the query parameter, one also needs to supply the partition_key. Hence, only data within that specific partition would be deleted, not data across the whole container. Should data need to be deleted from different partitions, the stored procedure must be executed multiple times, supplying the related partition_key each time. Note also that the stored procedure may hit its execution bounds before finishing; in that case it returns continuation: true, so the caller should re-execute it until the response reports continuation: false.
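The client-side loop that drives the stored procedure to completion can be sketched as below. This is a minimal, illustrative sketch: the function name bulkDeleteAll is made up, and the actual execute step (with the @azure/cosmos SDK, something along the lines of container.scripts.storedProcedure("bulkDelete").execute(partitionKey, [query])) is injected as a callback so the continuation logic itself is clear.

```javascript
// Re-executes the bulkDelete stored procedure until it reports that no
// more documents match the query. `executeOnce` is expected to return a
// promise resolving to the sproc's response body:
//   { deleted: <number>, continuation: <boolean> }
async function bulkDeleteAll(executeOnce) {
    let totalDeleted = 0;
    let continuation = true;
    // Each execution may be cut short by the sproc's execution bounds;
    // keep going until the response says continuation is false.
    while (continuation) {
        const response = await executeOnce();
        totalDeleted += response.deleted;
        continuation = response.continuation;
    }
    return totalDeleted;
}
```

To delete across multiple partitions, the same loop would simply be run once per partition_key value, since each execution is scoped to a single partition.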

Thursday, May 13, 2021

Cognitive Services: OCR vs Analyze Layout vs Analyze Invoice vs Analyze Forms

Recently I was required to do some analysis on some of the APIs offered within the Azure Cognitive Services suite. The APIs initially in scope are listed below; however, one of them was later dropped for the reasons provided further on.

  • OCR
  • Analyze Layout (v2.0)
  • Analyze Layout (v2.1 preview)
  • Analyze Invoice (v2.1 preview)
  • Analyze Forms (v2.0)
  • Analyze Forms (v2.1 preview)

When performing this analysis, a set of 8 different documents was used.

OCR vs Analyze Layout (v2.0)
  • The OCR engine used within Analyze Layout is different from the one offered by the OCR API.  This conclusion was drawn because, in one particular instance, some specific text was OCRed differently
  • When using scanned documents, Analyze Layout seemed to pick up noise which adds no value
  • On a couple of documents, the coordinates for the extracted data differed completely between the two APIs
  • On both APIs, text which spans multiple lines gets extracted as separate lines
  • OCR doesn't produce any structure to define table data
  • The JSON result produced by Analyze Layout includes a new section, pageResults.  This is used to define table-structured data
  • On some occasions, Analyze Layout extracted table-structured data which did not map to any actual table structure
  • On some occasions, Analyze Layout extracted only parts of a table
  • When a two-page document was used, a table structure was identified on page 2 but not on page 1

OCR vs Analyze Layout (v2.1)
  • The OCR engine used within Analyze Layout is different from the one offered by the OCR API.  This conclusion was drawn because, in one particular instance, some specific text was OCRed differently.  In fact, the value produced was also different from the value produced via v2.0
  • On both APIs, text which spans multiple lines gets extracted as separate lines
  • OCR doesn't produce any structure to define table data
  • The JSON result produced by Analyze Layout includes a new section, pageResults.  This is used to define table-structured data
  • On several occasions, Analyze Layout was capable of extracting table-structure data
  • When a multi-page document was used, table-structure data was extracted on both pages
  • On one specific document, a particular symbol was identified as a selection mark (this appears to be checkbox detection)
  • In some instances within the table-structure data, related text spanning multiple lines was amalgamated
  • Appearance metadata related to the extracted data seems to have been introduced.  However, this looked static, as the same style was observed across all the data
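The pageResults table data mentioned above can be consumed roughly as follows. This sketch assumes the v2.x Analyze Layout response shape (analyzeResult.pageResults[].tables[].cells[], where each cell carries rowIndex, columnIndex and text); the helper name tablesFromLayout and the sample object in the test are illustrative, not part of any SDK.

```javascript
// Convert the pageResults table data of an Analyze Layout response into
// plain 2D arrays of cell text, one grid per detected table.
function tablesFromLayout(result) {
    const tables = [];
    for (const page of result.analyzeResult.pageResults || []) {
        for (const table of page.tables || []) {
            // Allocate a rows x columns grid of empty strings.
            const grid = Array.from({ length: table.rows }, () =>
                Array(table.columns).fill(""));
            // Place each cell's text at its reported position.
            for (const cell of table.cells) {
                grid[cell.rowIndex][cell.columnIndex] = cell.text;
            }
            tables.push(grid);
        }
    }
    return tables;
}
```

This also makes the earlier observations easy to check programmatically, e.g. counting how many tables were detected per page across the 8 test documents.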

Analyze Invoice (v2.1)
  • The JSON result produced excludes the lines section that is included in the OCR and Analyze Layout APIs
  • The JSON result produced includes a new section, pageResults.  This is used to define table-structured data.  Compared to the Analyze Layout APIs, it excludes references to line information
  • The JSON result produced includes a new section, documentResults, to help classify content within the related document
  • The classification data consists of key-value pairs for single matches, but is also capable of classifying table-structured data
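As a rough sketch of how the documentResults key-value classification data could be consumed: the shape assumed below (analyzeResult.documentResults[].fields, a map from field name to an object carrying the extracted text) is based on the v2.1 preview schema, and the field names in the test are made up for illustration.

```javascript
// Flatten the documentResults classification data into a simple
// name -> extracted-text map.
// NOTE: the response shape here is an assumption based on the v2.1
// preview schema, not verified against every API version.
function invoiceFields(result) {
    const out = {};
    for (const doc of result.analyzeResult.documentResults || []) {
        for (const [name, field] of Object.entries(doc.fields || {})) {
            // Each field object carries the raw text alongside typed values.
            out[name] = field.text;
        }
    }
    return out;
}
```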

Analyze Forms (v2.1)
To analyze this API, the FOTT (Form OCR Testing Tool) was used.  This simplifies the generation of the JSON messages to be used with the respective APIs.  To understand how this tool works, the steps within this video were followed: Steps to use FOTT tool.

It is assumed / concluded that Analyze Invoice makes use of Analyze Forms, but with already pre-trained data.  Similarly, other pre-built APIs already exist to cater for business cards, etc.

Analyze Forms (v2.0)
This API wasn't analyzed because the FOTT tool doesn't support this version; hence, generating the JSON messages manually would have been a complex task.  Also, there was a chance that the results would be similar to, or worse than, the results produced with v2.1.