TCS OCR
Contents:
1. The Origins and Prevalence of Texture Bias in Convolutional Neural Networks
2. Scene Text Detection and Recognition (EAST + SynthText, synthetic-to-real domain adaptation, MVLT)
3. Segment Anything
- Internal Components
- Working
- Results on Flyers along with OCR
4. Results on SynthText Images Using Different Models and Natural Augmentation Techniques
- SAM + OCR
- TextSpotter
- OCR
- Proof of Texture Bias over Shape Bias
- Natural Augmentations (as suggested in 1 to overcome the texture bias problem)
- Cutout
- Gaussian Noise and Blur
- Sobel Filtering
- Color Discoloration
The Origins and Prevalence of Texture Bias in Convolutional Neural Networks
● ImageNet-trained CNNs tend to classify images by texture rather than by shape, in contrast to human perception: the models rely on superficial textural features instead of shape information.
● Texture bias refers to an OCR model's tendency to rely on the local visual patterns (textures) of characters; such a model generalizes poorly to new domains where those patterns differ from the training data.
● Shape bias refers to a model's tendency to rely on the shapes of the characters, which transfers better to new domains where the texture of the text changes but the character shapes are preserved.
● Texture Bias Limitations:
- Struggles with handwritten text or low-resolution scans whose visual patterns and textures differ from the training data.
- Performance may degrade on text that deviates significantly from the patterns seen during training.
- Less effective when the appearance of the text is altered, distorted, or obscured (illustrated in the sketch below).
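As a concrete illustration of the last point, the sketch below (not taken from the report; EasyOCR and the rendering parameters are assumptions) perturbs only the texture of a rendered text image with Gaussian noise and compares the recognizer's output on the clean and perturbed versions. A texture-biased recognizer typically degrades on the noisy image even though the character shapes are unchanged.

```python
# Illustrative probe (not from the report, EasyOCR assumed): perturb only the
# texture of a rendered text image and compare the recognizer's output.
import numpy as np
from PIL import Image, ImageDraw
import easyocr

# Render simple white-on-black text, then upscale so the glyphs are legible.
img = Image.new("RGB", (400, 80), "black")
ImageDraw.Draw(img).text((10, 25), "TEXTURE BIAS TEST", fill="white")
clean = np.array(img.resize((800, 160)))

# Add Gaussian noise: the local texture changes, the character shapes do not.
noisy = np.clip(clean.astype(np.float32) +
                np.random.normal(0, 40, clean.shape), 0, 255).astype(np.uint8)

reader = easyocr.Reader(["en"], gpu=False)
print("clean:", [text for _, text, _ in reader.readtext(clean)])
print("noisy:", [text for _, text, _ in reader.readtext(noisy)])
```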
EAST and the SynthText Dataset
The EAST model is a deep learning-based text detector for natural scene images. SynthText is a synthetically generated dataset containing word instances placed in natural scene images while taking the scene layout into account; it is used to train EAST models for text detection.
The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images with a single neural network, eliminating unnecessary intermediate steps. The simplicity of the pipeline allows effort to be concentrated on designing the loss functions and the network architecture.
Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm
significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency.
On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2 FPS at 720p resolution.
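A minimal sketch of how such a single-network pipeline is typically run in practice is shown below, using OpenCV's dnn module and the commonly distributed frozen_east_text_detection.pb checkpoint (both assumptions on our side, not part of the paper). It decodes the score and geometry maps into boxes and applies a simplified axis-aligned NMS; the input image path is a placeholder.

```python
# Minimal sketch of running a pretrained EAST detector with OpenCV's dnn module.
# This illustrates the single-network pipeline; it is not the authors' code.
import cv2
import numpy as np

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
image = cv2.imread("scene.jpg")
H, W = 320, 320                                    # must be multiples of 32
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

boxes, confidences = [], []
rows, cols = scores.shape[2:4]
for y in range(rows):
    for x in range(cols):
        score = scores[0, 0, y, x]
        if score < 0.5:
            continue
        # Geometry channels 0-3 are distances to the top/right/bottom/left
        # edges of the rotated box; channel 4 is the rotation angle.
        top, right, bottom, left, angle = geometry[0, :, y, x]
        cos, sin = np.cos(angle), np.sin(angle)
        h, w = top + bottom, right + left
        off_x, off_y = x * 4.0, y * 4.0            # feature-map stride is 4
        end_x = int(off_x + cos * right + sin * bottom)
        end_y = int(off_y - sin * right + cos * bottom)
        boxes.append([int(end_x - w), int(end_y - h), int(w), int(h)])
        confidences.append(float(score))

# Axis-aligned NMS is a simplification; EAST itself predicts rotated quads.
keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
print(f"{len(keep)} text regions detected at {W}x{H}")
```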
Synthetic-to-Real Domain Adaptation for Scene Text Detection
To address the severe domain distribution mismatch, the authors propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (the source domain) to real data (the target domain).
The paper introduces a text self-training (TST) method and adversarial text instance alignment (ATA) for domain-adaptive scene text detection. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner (sketched below). TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) caused by inaccurate pseudo-labels. Both components improve the performance of scene text detectors when adapting from synthetic to real scenes.
The method is evaluated by transferring from SynthText and VISD to ICDAR 2015 and ICDAR 2013. The results demonstrate its effectiveness, with up to a 10% improvement, an encouraging result for domain-adaptive scene text detection.
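The adversarial alignment idea behind ATA can be pictured with a generic DANN-style gradient reversal layer. The sketch below is not the paper's implementation: the toy backbone, feature dimensions, batch sizes, and the absence of the detection loss are all assumptions made purely for illustration.

```python
# Generic DANN-style sketch of adversarial feature alignment (the idea behind ATA).
# A gradient reversal layer trains the domain classifier normally while pushing
# the feature extractor to produce domain-invariant features that fool it.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip the gradient sign for the features

features = nn.Sequential(nn.Linear(64, 64), nn.ReLU())   # stand-in detector backbone
domain_clf = nn.Sequential(nn.Linear(64, 1))              # predicts synthetic (0) vs real (1)
opt = torch.optim.SGD(list(features.parameters()) + list(domain_clf.parameters()), lr=0.01)
bce = nn.BCEWithLogitsLoss()

synth = torch.randn(8, 64)    # source-domain inputs (SynthText), toy features here
real = torch.randn(8, 64)     # target-domain inputs (ICDAR), toy features here

for step in range(100):
    f = features(torch.cat([synth, real]))
    logits = domain_clf(GradReverse.apply(f, 1.0))
    labels = torch.cat([torch.zeros(8, 1), torch.ones(8, 1)])
    loss = bce(logits, labels)   # classifier learns domains; backbone unlearns them
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final domain loss:", loss.item())
```

In the full method this domain loss would be optimized jointly with the usual text detection loss; only the alignment part is shown here.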
Masked Vision-Language Transformers (MVLT) for Scene Text Recognition
Recent studies on scene text recognition (STR) have explored incorporating textual semantics, either through explicit language models or through implicit extraction from visual cues. The MVLT authors propose a method that combines explicit and implicit textual semantics, leveraging the strengths of both approaches to enhance STR performance.
The MVLT model was pre-trained in a first stage using masked autoencoders and a multi-modal Transformer decoder; the decoder combined visual cues with language semantics to incorporate linguistic information. The model was fine-tuned in a second stage on unmasked scene text images with an iterative correction method, to make the most of the pretrained knowledge in both the encoder and the decoder.
GitHub: https://github.com/onealwj/MVLT (PyTorch implementation of the BMVC 2022 paper "Masked Vision-Language Transformers for Scene Text Recognition")
First Stage: Pretraining
In the first stage of the training strategy, the researchers pretrained the MVLT model by adopting the concept of masked autoencoders (MAE): the model learns to recognize scene text while simultaneously reconstructing the masked patches. To incorporate linguistic information, a multi-modal Transformer decoder was introduced, which combines visual cues with language semantics. The input to the decoder consists of the encoded patches and mask tokens for the visual information, together with character embeddings derived from the ground-truth text label of the corresponding image.
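The MAE-style random masking this stage builds on can be sketched as follows; the patch size, masking ratio, and tensor shapes are illustrative assumptions, not MVLT's actual configuration.

```python
# Sketch of MAE-style random patch masking, the mechanism the first stage builds on.
# Shapes and the 75% ratio are illustrative; this is not the MVLT implementation.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                    # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # random permutation per image
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                     # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask

# A 32x128 scene-text crop split into 4x4 patches gives 8 * 32 = 256 patches.
patches = torch.randn(2, 256, 768)
visible, mask = random_masking(patches)
print(visible.shape, mask.sum(dim=1))           # only 25% of the patches reach the encoder
```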
Second Stage: Fine-tuning
In the second stage, the pretrained model from the first stage was further refined through a process called fine-tuning. This stage
aimed to optimize the model's performance specifically for the task of scene text recognition.
During the fine-tuning stage, the unmasked scene text images were provided as input to the encoder, which extracted relevant features
from the images. The decoder, in turn, generated the predicted text based on the extracted features. Unlike previous methods that only
fine-tuned the encoder, in this approach, both the encoder and the decoder were fine-tuned to leverage the pretrained knowledge
effectively. Additionally, the researchers introduced an iterative correction method.
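The iterative correction idea can be pictured as repeatedly feeding the decoder's current text prediction back in as input, so that later passes can fix earlier mistakes. The sketch below uses toy stand-in modules and is only a schematic of that loop; the shapes, vocabulary size, and round count are assumptions, not MVLT's values.

```python
# Schematic of an iterative-correction decoding loop with toy stand-in modules
# (not the MVLT implementation).
import torch
import torch.nn as nn

VOCAB, MAX_LEN, DIM = 40, 25, 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 128, DIM))   # toy image encoder
char_embed = nn.Embedding(VOCAB, DIM)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(DIM, VOCAB)

@torch.no_grad()
def iterative_correction(image, rounds=3):
    memory = encoder(image).unsqueeze(1)                    # (B, 1, DIM) visual features
    tokens = torch.zeros(image.size(0), MAX_LEN, dtype=torch.long)  # start from blank text
    for _ in range(rounds):
        logits = head(decoder(char_embed(tokens), memory))  # re-decode given last prediction
        tokens = logits.argmax(-1)                          # feed the corrected text back in
    return tokens

print(iterative_correction(torch.randn(2, 3, 32, 128)).shape)   # torch.Size([2, 25])
```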
Segment Anything Model (SAM)
SAM's architecture comprises three components that work together to return a valid segmentation mask:
Image encoder
At the highest level, an image encoder (a masked-autoencoder (MAE) pre-trained Vision Transformer, ViT) generates one-time image embeddings and can be run prior to prompting the model.
Prompt encoder
The prompt encoder encodes points, bounding boxes, masks, or text into an embedding vector in real time. SAM considers two sets of prompts: sparse (points, boxes, text) and dense (masks).
Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type. Free-form text prompts are represented with an off-the-shelf text encoder from CLIP. Dense prompts, such as masks, are embedded with convolutions and summed element-wise with the image embedding.
Mask decoder
A lightweight mask decoder predicts the segmentation masks from the embeddings produced by the image and prompt encoders. It maps the image embedding, the prompt embeddings, and an output token to a mask. All of the embeddings are updated by the decoder block, which uses prompt self-attention and cross-attention in two directions (from prompt to image embedding and back).
The predicted masks are then annotated and used to update the model weights; this feedback loop enlarges the dataset and allows the model to learn and improve over time, making it efficient and flexible.
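A plausible sketch of the SAM + OCR pipeline behind the results below is given here, assuming the official segment_anything package, a downloaded ViT-B checkpoint, EasyOCR, and a placeholder image path; the area sorting and confidence threshold are illustrative choices, not the exact settings used for the reported results.

```python
# Sketch of a SAM + OCR pipeline: segment regions, then OCR each region crop.
import cv2
import easyocr
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
reader = easyocr.Reader(["en"], gpu=False)

image = cv2.cvtColor(cv2.imread("flyer.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)            # one dict per predicted mask

for m in sorted(masks, key=lambda m: m["area"], reverse=True):
    x, y, w, h = map(int, m["bbox"])              # XYWH box around the mask
    crop = image[y:y + h, x:x + w]
    for _, text, conf in reader.readtext(crop):
        if conf > 0.4:
            print(f"region {m['bbox']}: {text!r} ({conf:.2f})")
```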
SAM Results
SAM + EasyOCR Results
Results on SynthText Images Using Different Models and Natural Augmentation Techniques
Results on SynthText using SAM + OCR:
Results on SynthText using TextSpotter:
Results on SynthText using OCR:
Texture Bias over Shape Bias Results:
TextSpotter
OCR
Data Augmentation Techniques applied with OCR, as discussed in 'The Origins and Prevalence of Texture Bias in Convolutional Neural Networks'
Results using Color Discoloration on SynthText:
Results using Cutout on SynthText:
Results using Gaussian Blur on SynthText:
Results using Gaussian Noise on SynthText:
Results using Sobel Filtering on SynthText:
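For reference, the augmentations listed above can be implemented roughly as follows with OpenCV and NumPy. The parameter values (patch size, noise sigma, kernel size, hue shift) and the sample image path are placeholders, not the settings used for the reported results.

```python
# Illustrative implementations of the augmentations above; parameters are placeholders.
import cv2
import numpy as np

def cutout(img, size=40):
    """Zero out a random square patch, removing local texture cues."""
    out = img.copy()
    h, w = out.shape[:2]
    y = np.random.randint(0, max(1, h - size))
    x = np.random.randint(0, max(1, w - size))
    out[y:y + size, x:x + size] = 0
    return out

def gaussian_noise(img, sigma=25):
    noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_blur(img, ksize=5):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def sobel_filtering(img):
    """Keep mainly edge (shape) information by combining x/y Sobel gradients."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return cv2.convertScaleAbs(cv2.magnitude(gx, gy))

def color_discoloration(img):
    """Shift the hue channel, altering colour appearance but not glyph shape."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hsv[..., 0] = (hsv[..., 0].astype(int) + np.random.randint(0, 180)) % 180
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

img = cv2.imread("synthtext_sample.jpg")
augmented = [f(img) for f in (cutout, gaussian_noise, gaussian_blur,
                              sobel_filtering, color_discoloration)]
```

Each augmented image can then be passed through the same OCR model to compare recognition accuracy against the clean SynthText input.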