OCR Machine Learning Model Implementation

Sayo P.J. Andersson
13 min read · Apr 3, 2022

NLP data scientists constantly turn to cloud providers to obtain embedded machine learning functionality through SaaS services: these typically begin with a trial version but ultimately lead to high costs and products that are not fully customizable. In this article I present an NLP project structure for machine learning, based on the implementation of an OCR ML model that can be integrated into translation systems with parallel processing, BERT Transformer models, IaaS via Docker, Python functions or, where appropriate, computer-vision CNNs for the identification and segmentation of text in images and PDFs.

The design accounts for prediction being in real time; for the model artifacts implementing different versions, instantiated through a REST API endpoint on the inference model; for files processed by the model exceeding 50 PDF pages with processing times greater than 10 seconds; and for peaks of service requests. With these constraints we will design an architecture that fulfills the predictive function of an effective OCR neural network.

ML OCR Model Inference

The approach taken for the development of the model will be real-time (or interactive) inference: making predictions on request, at any time, triggering an immediate response to each request. Given that predictions will be generated per request, with the heaviest demand at the beginning and the end of the day, the following processes will be applied to handle the load:

Infrastructure: The prediction activation process will use an auto-scaling cluster per time segment, which addresses the GPU and memory consumption challenges of the OCR process; this will allow us to deliver the results on time.
Segmentation: Predictions not supported in real time will be generated in batches of 1,024 MB, read through a 4,096-byte linear buffer, where the files are preprocessed in stages with their respective predictions and then integrated in the context of the entire file.
Model latency:
• Is model latency expected?
• How much compute capacity is needed to run the model?
• Are there operational implications and costs to maintain the model?
For our process we will use the real-time inference architecture, where the model can be triggered at any time and an immediate response is expected. This pattern lets you analyze streaming data and interactive application data, serve BERT Transformers for the automatic translation system, and identify characters in real time; it also lets the machine learning model mitigate the cold-start problem. The decision tree below shows the selection steps:

Decision tree: selecting real-time inference

Real-time inference challenges: Latency and throughput requirements make the real-time inference architecture more complex for the model. A system may need to respond in 100 milliseconds or less, in which time it must retrieve the data, perform the inference, validate and store the model results, execute any necessary business logic, and return the results to the system or application.

Compute options for real-time inference: The best way to implement real-time inference is to deploy containers to a Kubernetes or Docker cluster and expose the model as a REST API web service; this way, the model runs in its own isolated environment and can be managed like any other web service. Docker's capabilities can then be used for management, monitoring, and auto scaling, and the model can be deployed locally, in the cloud, or at the on-premises edge.
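
As a minimal sketch of this pattern (assuming FastAPI and Pillow, with pytesseract standing in for the real model call), the containerized inference service could look like this:

import io

import pytesseract  # stand-in OCR engine for this sketch
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded image into memory and run OCR inference on it.
    image = Image.open(io.BytesIO(await file.read()))
    text = pytesseract.image_to_string(image)
    return {"text": text}

Run behind uvicorn inside the container, this exposes the model as a stateless REST endpoint that a Kubernetes cluster can scale horizontally.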

Multi-region deployment and high availability: Regional deployment and high-availability architectures should be considered in real-time inference scenarios, as the model's latency and throughput will be difficult issues to resolve. To reduce latency in multi-region deployments, it is recommended to locate the model as close as possible to the point of consumption. The model and its supporting infrastructure must adhere to the company's high-availability and disaster-recovery principles and strategy. Having several models in these scenarios to capture regional data or store-level relationships could lead to greater accuracy than a single model; this approach assumes that enough data is available at that level of granularity.

Multi-model inference by segment

Note (high-latency exception): In cases where the containers cannot resolve the requests, we will use batch inference. For OCR of complex images, text that is hard to identify, or very large uploaded files to be processed by the neural network, we will use deep learning with batch processes, running multiple models simultaneously to achieve a highly scalable inference solution that can handle large volumes of data. To achieve hierarchical model inference, the models can be divided into categories, and each category can have its own inference storage, such as a data lake, preferably in Hadoop. When implementing this pattern, it is necessary to balance horizontal and vertical scaling of the models, as this affects both cost and performance: running too many model instances might increase throughput, but cost might suffer, while having very few instances on high-spec nodes might be more cost-effective but run into scaling issues.
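
A minimal sketch of this batch pattern, where ocr_model_predict is a hypothetical stand-in for the real per-page model call:

from concurrent.futures import ProcessPoolExecutor

def ocr_model_predict(page):
    # Hypothetical stand-in for the real OCR model's per-page prediction.
    raise NotImplementedError("plug in the real model here")

def score_batch(pages):
    # Run the OCR model over one batch of pages.
    return [ocr_model_predict(p) for p in pages]

def batch_inference(pages, batch_size=16, workers=4):
    # Split the full document into fixed-size batches of pages.
    batches = [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
    # Score the batches in parallel worker processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_batch, batches)
    # Re-assemble per-batch predictions in the original page order.
    return [text for batch in results for text in batch]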

Classification: offline batch inference

(Optimal low latency) Real-time inference with containers: We will implement real-time inference as the best option for serving several models at low latency with on-demand requests. It will be exposed through a REST endpoint, since external services require a standard interface to interact with the model, normally a REST interface with a JSON payload. With this pattern, the detection service identifies a list of services and their metadata; each service is a stateless microservice that can handle multiple requests simultaneously and is limited by the physical virtual machine's resources. The service can deploy multiple models if multiple groups are selected. For this, homogeneous groupings are recommended: translation services, image segmentation and identification, image filters, autonomous-driving action identification, categorization, and so on. The mapping between the service request and the model selected for a given service must be built into the inference logic, typically through the scoring script. If the models are relatively small (a few megabytes), it is recommended to load them into memory for performance reasons; otherwise each model can be loaded dynamically per request.
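
A sketch of the scoring-script mapping described above; the service names and artifact paths are hypothetical, and joblib stands in for whatever serialization the models actually use:

import joblib

# Hypothetical mapping from service name to a model artifact on disk.
MODEL_PATHS = {
    "translation": "models/translation.joblib",
    "segmentation": "models/segmentation.joblib",
}

# Small models (a few MB) are preloaded into memory for low latency.
_loaded = {name: joblib.load(path) for name, path in MODEL_PATHS.items()}

def run(service_name, payload):
    # Route each request to the model selected for the requested service.
    model = _loaded.get(service_name)
    if model is None:
        raise ValueError(f"unknown service: {service_name}")
    return model.predict(payload)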

REST API endpoint: on-demand service

OCR MODEL DEVELOPMENT PROCEDURE

STACK OF TECHNOLOGY TO USE

Over this decade I have worked with and tested IaaS, PaaS, and SaaS services from different cloud providers, and in my experience AWS is definitely the best option. For this specific text-extraction process, Amazon Textract is the best solution, as it offers the largest number of supported languages. Keep in mind that, as neural network engineers, we can develop the same service in Python on the same AWS infrastructure and obtain very similar results, whether through transfer learning or native development.
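
For reference, a minimal synchronous Textract call with boto3 might look like this (the file name and region are placeholders):

import boto3

# Create a Textract client; credentials and region come from your AWS config.
client = boto3.client("textract", region_name="us-east-1")

with open("invoice.png", "rb") as f:
    document_bytes = f.read()

# Synchronous text detection. For multi-page PDFs, use the asynchronous
# start_document_text_detection / get_document_text_detection pair instead.
response = client.detect_document_text(Document={"Bytes": document_bytes})

# Collect the detected LINE blocks into plain text.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines))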

ARCHITECTURE DIAGRAM

Process lifecycle diagram
AWS tools deployment diagram

The OCR ML problem: how to approach it?

The goal of this machine learning OCR model is to translate and process image-to-text in large PDF or image files with maximum performance; for that, it is necessary to define metrics for the model's success. The end user expects 100% accuracy, but in reality accuracy may vary between 90% and 95%, or reach 99%. A prediction accuracy of 70% in the OCR model is not acceptable for a typical user, so measures must be taken by defining a clear goal: establish achievable targets and identify when a point of precision more or less is acceptable, since OCR identification is not equally difficult across scripts (complex Arabic text, for example, is harder).
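
As one way to make that accuracy goal measurable, a character-level accuracy (1 minus the character error rate) can be computed from the edit distance between the reference text and the OCR output; a minimal sketch:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_accuracy(reference: str, hypothesis: str) -> float:
    # 1 - CER: the share of reference characters the OCR got right.
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)

print(character_accuracy("OCR model", "0CR model"))  # ~0.89: one bad character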

Solution Phases

  • Input image preprocessing
  • Detection of the main structure of the text
  • Character segmentation (paragraph and word localization, character extraction and segmentation)
  • Neural network design and training
  • Text classification and recognition (Binarization and CNN data scaling)
  • Post text processing
  • Output to Microsoft Word
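
A minimal end-to-end sketch of these phases, assuming OpenCV, pytesseract, and python-docx are available; in this sketch Tesseract collapses the detection, segmentation, and recognition phases into one call, and the file names are hypothetical:

import cv2
import pytesseract
from docx import Document  # python-docx, for the Word output phase

def preprocess(path):
    # Phase 1: load the input image, convert to grayscale, and binarize.
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def recognize(binary):
    # Phases 2-6 collapsed: Tesseract performs structure detection,
    # character segmentation, and recognition internally in this sketch.
    return pytesseract.image_to_string(binary)

def export_to_word(text, out_path="output.docx"):
    # Final phase: write the recognized text to a Word document.
    doc = Document()
    doc.add_paragraph(text)
    doc.save(out_path)

export_to_word(recognize(preprocess("page.png")))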

OCR data

The OCR model works in real time and therefore requires low-latency data storage systems where the data is kept in cache memory; this makes the model's GPU processing more effective. Keeping the data in cache, for example in AWS ElastiCache for online storage, also supports real-time training processes, since internet and market trends change quickly and the model therefore needs to be retrained in (near) real time.
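
A sketch of that caching layer, assuming ElastiCache for Redis (which speaks the standard Redis protocol); the cluster endpoint and run_ocr_model are hypothetical:

import hashlib

import redis  # ElastiCache for Redis uses the standard Redis client

# Hypothetical endpoint of the ElastiCache cluster.
cache = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)

def run_ocr_model(file_bytes):
    # Hypothetical stand-in for the GPU inference call.
    return "recognized text"

def cached_ocr(file_bytes, ttl_seconds=3600):
    # Key the cache on a hash of the file so repeated uploads skip the GPU.
    key = "ocr:" + hashlib.sha256(file_bytes).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")
    text = run_ocr_model(file_bytes)
    cache.set(key, text, ex=ttl_seconds)
    return text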

PLS: Partial Least Squares

Optimal Mathematical and Statistical Model.

Are there any inconsistencies in the predictions between the test data and the training data? To improve this, filters must be applied to the images with OpenCV: use contour recognition, adjusting the parameters while profiling the results, and implement a mathematical and statistical PLS model on the extracted features. We will use Redshift for the DWH and AWS S3 for the data lake.
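
A sketch of the OpenCV filtering and contour step; the threshold settings and min_area are exactly the kind of parameters to tune while profiling results:

import cv2

def find_text_contours(path, min_area=50):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Denoise, then binarize with an adaptive threshold to handle
    # uneven lighting before looking for character-like shapes.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    binary = cv2.adaptiveThreshold(blurred, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Keep only contours large enough to plausibly be text.
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]
    return boxes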

Types of text in the controlled OCR environment: the controlled cases are clear texts such as web pages and documents, with dense text, a known structure, and predictable location.

Text in uncontrolled environments involves complex, unregulated typography, varied locations, variable sizes, text in graffiti or on traffic signs, and images with heavy noise, gradients, or strong light. For PDF processing we will use Tesseract as the OCR engine in the PDF pipeline. For PNG processing we will use the SVHN (Street View House Numbers) dataset, which exhibits exactly these variances, for correct training of the model; the OpenALPR dataset for license-plate recognition; and CAPTCHA sets for combinations of letters that are hard to read. For complex PNG image processing we will use deep learning with computer-vision datasets such as MNIST and/or COCO-Text for recognizing text in noisy or poorly lit images.

Consider, as a cautionary example, real-time training for a click-prediction problem: you show the user the ad and they don't click. Is that an example of a failure? Maybe the user typically clicks after 10 minutes, but you have already created the data point and trained your model on it. There are many factors to consider when preparing data for your models; you need to ask questions and think through the process from start to finish to succeed at this stage.
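
For the PDF path, a minimal sketch combining pdf2image (which requires Poppler to be installed) with Tesseract; the file name is a placeholder:

import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(path, dpi=300, lang="eng"):
    # Render each PDF page to an image, then run Tesseract on it.
    pages = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(p, lang=lang) for p in pages]

texts = ocr_pdf("report.pdf")
print(texts[0][:200])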

Data lake and step-process pipelines

Evaluation

The important thing here is the separation of the data into training and test sets for OCR recognition. When we build training and test sets by random sampling, we often forget an implicit assumption: the data is rarely independently and identically distributed. In layman's terms, the assumption that each data point is independent of the others and comes from the same distribution is flawed at best, if not outright wrong. Consider splitting the data using the time variable instead of randomly sampling it. A technique for automated OCR based on merging and selecting multiple feature properties is proposed: the features are merged using a serial formulation and the output is passed to a partial least squares (PLS) based selection method; the selection is made using an entropy fitness function, and the final features are classified with an ensemble classifier. For a time-series model, a baseline is to use the previous day's value, that is, to predict the previous day's number of translations or histories. For natural-language-processing classification models, evaluation metrics such as precision can be used to find the best hyperparameter settings, reducing unnecessary bias.
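
A minimal sketch of such a time-based split, instead of random sampling; the column and file names are hypothetical:

import pandas as pd

def time_split(df, time_col="processed_at", test_frac=0.2):
    # Sort by time and hold out the most recent fraction as the test set,
    # instead of sampling rows at random; this respects the fact that the
    # data is not i.i.d. over time.
    df = df.sort_values(time_col)
    cutoff = int(len(df) * (1 - test_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

train, test = time_split(pd.read_csv("ocr_samples.csv"))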

Features: The correct understanding and application of several feature-selection methods will allow us to obtain maximum performance from the OCR model. It is therefore worth implementing them and evaluating the results obtained, in order to finally select the most appropriate features for building the model.

Modeling: With the acquisition and cleaning of data from the datasets named above, feature engineering, and so on, we will be able to build an interpretable model.

Experimentation: Go into an endless loop to further improve our model.

· Test our model in production environments, learn more about what could go wrong, and continue to improve our model with continuous integration.

· For real-time OCR we must understand the shortcomings of the model with real-time feedback trying to minimize the time for the model’s first online experiment.

Detection of problems that may arise when implementing the Text-OCR model

Programming language management

Porting a model from Python or R to a production language like C++ or Java is tricky, and often results in reduced performance relative to the speed and accuracy of the original model. R can present problems when new versions of the software appear; it is also slow and does not move through large data efficiently. R is a great language for prototyping, as it allows easy interaction and troubleshooting, but it needs to be translated into Python, C++, or Java for production. In this OCR case we develop in Python, embedded in a Docker container.

Container technologies like Docker can solve the incompatibility and portability challenges introduced by the multitude of tools. However, automatic dependency checking, error checking, testing, and build tools cannot resolve issues across the language barrier. Reproducibility is another challenge: many versions of a model can be built using different programming languages, libraries, or different versions of the same library, and manually tracking these dependencies is difficult. To solve these challenges, a machine learning lifecycle tool is needed that can automatically track and record these dependencies during the training phase as configuration in code form, and later bundle them together with the trained model to be deployed. It is recommended to use a tool or platform that can translate code from one language to another, or that allows models to be exposed behind an API so they can be integrated anywhere.

Computing power

When neural networks are used, they are often very deep, which means that training them and using them for inference requires a lot of computational power. Normally we want our algorithms to run fast, for many users, and that can be a hindrance. Additionally, many of today's machine learning production processes rely on graphics processing units (GPUs), and this equipment is expensive, which adds another layer of complexity to the task of scaling machine learning systems. To support the peaks of service requests by time slot, the cloud design below will be the optimal one.

IaaS for OCR Cluster

Portability

Another interesting challenge of implementing the model is the lack of portability. Lacking the ability to easily migrate a software component to another host environment and run it there, organizations can become locked into a particular platform. This can create barriers to creating models and deploying them.

Scalability

Scalability is a real problem for many Artificial Intelligence projects. In fact, you need to make sure your models are able to scale and meet performance increases and application demand in production. At the beginning of a project, we typically rely on relatively static data at a manageable scale. As the model moves into production, it is typically exposed to increased data volumes and data transport modes. The team will need various tools to monitor and resolve performance and scalability challenges that will arise over time. Scalability issues can be resolved by taking a consistent, microservices-based approach to production analytics. Similarly, teams should have options to scale compute and memory footprints to support more complex workloads.

Testing and validation issues

Models continually evolve due to changes in data, new models, among other causes. As a consequence, every time such a change occurs, we must revalidate the performance of the model. Apart from validating models in offline tests, it is very important to evaluate the performance of models in production. We usually plan this in the implementation strategy and monitoring sections. Machine learning models need to be updated more frequently than regular software applications.

EDA / CRISP-DM Method

MLOps: Machine Learning NLP Automation

This could be a good way to produce models faster. In addition, the platform can support the development and comparison of multiple models, so that the business can choose the model that best suits its needs for predictive accuracy, latency, and compute resources. It is estimated that up to 90% of all enterprise machine learning models can be developed automatically; machine learning experts can be hired to work with business people on the small percentage of models that are currently beyond the reach of automation.

Many models experience performance degradation over time, so deployed models must be monitored. Each deployed model must log all inputs, outputs, and exceptions, and a model deployment platform should provide log storage and dashboards of model performance. Keeping a close eye on model performance is key to effectively managing the lifecycle of a machine learning model.

Machine learning deployment is still in its early stages: both software and hardware components are constantly evolving to meet today's machine learning demands. Deploying machine learning is and will continue to be difficult, and that is a reality organizations will have to face. Fortunately, new architectures and products are helping to change this landscape, and as more companies scale their operations, they are also adopting tools that make model deployment easier.
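
As a sketch of that logging requirement, a simple decorator can capture inputs, latency, and exceptions around any prediction function:

import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model-monitor")

def monitored(predict_fn):
    # Wrap a prediction function so every call's input size, latency,
    # and any exception are logged for lifecycle monitoring.
    @functools.wraps(predict_fn)
    def wrapper(payload):
        start = time.time()
        try:
            result = predict_fn(payload)
            log.info(json.dumps({"input_size": len(str(payload)),
                                 "latency_s": round(time.time() - start, 3),
                                 "status": "ok"}))
            return result
        except Exception as exc:
            log.error(json.dumps({"status": "error", "error": str(exc)}))
            raise
    return wrapper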

AWS Auto ML

The model artifact

The model artifact has several versions, and all of them can be called through the inference endpoint, making it easy to improve in real time by varying the selection of algorithms for text understanding and comprehension (with Amazon Comprehend) and text extraction (with Textract). Within the pipeline, for algorithm selection we can use AutoKeras, which uses NAS (neural architecture search) to find the best extraction model. This allows us to develop and improve models in production and at run time, adapting to the required prediction accuracy, compute, and latency, and avoiding degradation of model performance with usage over time, with pull processes and exceptions logged to S3 storage over the lifecycle of the ML model.
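
A minimal AutoKeras sketch of that NAS-driven search, using MNIST as a stand-in character-image dataset; the trial and epoch counts are purely illustrative:

import autokeras as ak
from tensorflow.keras.datasets import mnist

# MNIST stands in here for a character-image dataset.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# AutoKeras runs a neural architecture search over candidate models;
# max_trials bounds how many architectures it evaluates.
clf = ak.ImageClassifier(max_trials=3, overwrite=True)
clf.fit(x_train, y_train, epochs=5)
print(clf.evaluate(x_test, y_test))

# Export the best model found as a regular Keras model artifact.
best_model = clf.export_model()
best_model.save("ocr_char_model")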

Artifact

Author: Sayo P.J. Andersson

More information: https://www.youtube.com/channel/UC0_ZzPxF8JuaqYNuqXcFgOw https://www.instagram.com/iauniversidad/ https://www.facebook.com/profile.php?id=100090599594909 https://twitter.com/iauniv

email: iaun@iauniversidad.com www.iauniversidad.com


Sayo P.J. Andersson

MSc. Machine Learning, Deep Learning and Artificial Intelligence. Data Scientist, DBA. MSc. CyberSecurity. MSc. Software Developer (Full Stack). MSc. Corporate Finance.