Digitizing KYC information from images and performing fraud checks

Sai Manohar Boidapu
Oct 23, 2019


The requirement is to extract information from images of KYC documents (PAN card, AADHAR card, Driving Licence, Voter ID, bank statements, etc.) submitted by the customer, using publicly available OCR APIs, and to perform basic fraud checks such as verifying that the name and DOB are the same in all documents. As of today, verification of KYC data across different documents is a manual process that is costly and has a high probability of error.


With the latest advancements in technology, this problem can be solved by leveraging AI and automation, reducing or completely eliminating human intervention.


Figure: High-level design explaining the flow of the project

API call:

We used the Azure ML API to extract the text from the images. The API returns a JSON response that includes the bounding boxes and the text contained in them.
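To make the shape of that response concrete, here is a minimal sketch that flattens an OCR response into plain text lines. The nested "regions" / "lines" / "words" layout and the sample values are assumptions made for illustration; they are not copied from the project or from a real API call, so check them against the response your OCR service actually returns.

```python
from typing import Any, Dict, List


def ocr_lines(response: Dict[str, Any]) -> List[str]:
    """Join the words of every OCR line into one string, in document order."""
    lines = []
    for region in response.get("regions", []):
        for line in region.get("lines", []):
            text = " ".join(word["text"] for word in line.get("words", []))
            if text:
                lines.append(text)
    return lines


# Hand-written sample in the assumed shape (not real API output).
sample_response = {
    "regions": [
        {
            "boundingBox": "20,30,400,120",
            "lines": [
                {
                    "boundingBox": "20,30,400,40",
                    "words": [
                        {"boundingBox": "20,30,80,40", "text": "Name:"},
                        {"boundingBox": "110,30,50,40", "text": "Sai"},
                        {"boundingBox": "170,30,110,40", "text": "Manohar"},
                        {"boundingBox": "290,30,110,40", "text": "Boidapu"},
                    ],
                },
                {
                    "boundingBox": "20,80,300,40",
                    "words": [
                        {"boundingBox": "20,80,70,40", "text": "DOB:"},
                        {"boundingBox": "100,80,200,40", "text": "01/01/1990"},
                    ],
                },
            ],
        }
    ]
}

for text_line in ocr_lines(sample_response):
    print(text_line)
```

The bounding boxes are ignored here; they become useful when a field has to be located by its position on the card rather than by a label.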

Data extraction:

We extracted the required fields, such as name and DOB, from the semi-structured JSON to form structured data.
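As a rough illustration of this step, the sketch below turns OCR text lines into a small structured record. It assumes each field appears on one line as "Label: value"; the label patterns and the function name extract_fields are hypothetical, and real cards often need position-based rules instead.

```python
import re
from typing import Dict, List, Optional

# Hypothetical label patterns; real documents vary and may need more rules.
FIELD_PATTERNS = {
    "name": re.compile(r"^\s*name\s*[:\-]\s*(.+)$", re.IGNORECASE),
    "dob": re.compile(
        r"^\s*(?:dob|date of birth|year of birth)\s*[:\-]\s*(.+)$",
        re.IGNORECASE,
    ),
}


def extract_fields(lines: List[str]) -> Dict[str, Optional[str]]:
    """Return the first value found for each known field, or None."""
    record: Dict[str, Optional[str]] = {field: None for field in FIELD_PATTERNS}
    for line in lines:
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match and record[field] is None:
                record[field] = match.group(1).strip()
    return record


# Made-up sample lines for illustration.
print(extract_fields(["Name: Sai Manohar Boidapu", "Year of Birth : 1990"]))
```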

Data validation across KYC images:

Once we have the required field data from the different proof images, we perform the following validations to verify the fields.

a) Name validation:

We first check whether a direct string match passes. If not, we generate all the combinations (re-orderings) of the words in the name and score each one with Levenshtein distance on top of a cosine similarity wrapper. A score of 100 means the name strings match after re-ordering; otherwise we leave the decision to the user by displaying the match percentage.

For example, the combinations we would try for the name ‘Sai Manohar Boidapu’ could be ‘Boidapu Sai Manohar’, ‘Manohar Sai Boidapu’ and ‘B S Manohar’.

Note: We can also run a spell check using Norvig’s algorithm to correct any strings that were not identified properly during the OCR process.
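The re-order check for names can be sketched as follows. This is a simplified illustration, not the project's code: it uses only a normalised Levenshtein score over word permutations and leaves out the cosine similarity wrapper and the initials case ('B S Manohar'). The function names are hypothetical.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to a 0-100 percentage (100 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def name_match_score(name_a: str, name_b: str) -> float:
    """Best similarity between name_a and any word ordering of name_b."""
    a = " ".join(name_a.lower().split())
    words_b = name_b.lower().split()
    if a == " ".join(words_b):
        return 100.0  # direct string match
    return max(similarity(a, " ".join(p)) for p in permutations(words_b))


print(name_match_score("Sai Manohar Boidapu", "Boidapu Sai Manohar"))
print(round(name_match_score("Sai Manohar Boidapu", "B S Manohar"), 1))
```

The number of permutations grows factorially with the word count, which is acceptable for names of three or four words but would need a cap for longer strings.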

b) Date of birth validation:

For dates, the main challenge we identified is that some AADHAR cards contain only the year of birth, whereas others contain a fully formatted date. We wrote custom logic to identify whether the field contains a date of birth or a year of birth. The same data is then validated across documents by matching the strings.
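One possible shape for that custom logic is sketched below. The accepted date formats, the function names, and the rule of falling back to a year-only comparison when one document lacks a full date are all assumptions for illustration; the original project compares the strings directly.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Assumed date layouts; extend this tuple for other formats seen in practice.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y")


def parse_birth_field(value: str) -> Optional[Tuple[int, Optional[date]]]:
    """Return (year, full_date), with full_date None for a year-only field.

    Returns None when the value is neither a year nor a known date format.
    """
    value = value.strip()
    if re.fullmatch(r"(19|20)\d{2}", value):
        return int(value), None
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt).date()
        except ValueError:
            continue
        return parsed.year, parsed
    return None


def births_agree(first: str, second: str) -> bool:
    """Compare full dates when both exist, otherwise compare years only."""
    a = parse_birth_field(first)
    b = parse_birth_field(second)
    if a is None or b is None:
        return False
    if a[1] is not None and b[1] is not None:
        return a[1] == b[1]
    return a[0] == b[0]


# Made-up values for illustration.
print(births_agree("01/01/1990", "1990"))
print(births_agree("01/01/1990", "02/01/1990"))
```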

Challenges faced:

In some cases Azure ML won't identify the text properly; in such scenarios we need to apply standard OCR pre-processing techniques.

Steps for better performance:

1. Maintain an appropriate height and width in pixels.

2. De-noise the images by converting them to grayscale (salient edge map).

3. For blurry images, reduce the blur using Gaussian smoothing techniques.
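The grayscale and smoothing steps above can be illustrated with a small pure-Python sketch on nested lists. It is for illustration only; a real pipeline would normally use an image library. Note that Gaussian smoothing by itself softens noise; the unsharp_mask function shows one common way to use a smoothed copy to sharpen a blurry image, which is my assumption about what step 3 intends. All function names are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[float]]

# 3x3 Gaussian kernel; the weights sum to 16.
GAUSSIAN_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def to_grayscale(rgb_image: List[List[Tuple[int, int, int]]]) -> Gray:
    """Convert RGB pixels to luminance using the ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def gaussian_smooth(image: Gray) -> Gray:
    """Apply the 3x3 Gaussian kernel, clamping coordinates at the borders."""
    height, width = len(image), len(image[0])
    smoothed = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += GAUSSIAN_3X3[dy + 1][dx + 1] * image[yy][xx]
            row.append(total / 16)
        smoothed.append(row)
    return smoothed


def unsharp_mask(image: Gray, amount: float = 1.0) -> Gray:
    """Sharpen by adding back the difference between image and its blur."""
    blurred = gaussian_smooth(image)
    return [[pixel + amount * (pixel - soft)
             for pixel, soft in zip(row, soft_row)]
            for row, soft_row in zip(image, blurred)]


# A single bright pixel spreads to its neighbours after smoothing.
spot = [[0.0, 0.0, 0.0], [0.0, 16.0, 0.0], [0.0, 0.0, 0.0]]
print(gaussian_smooth(spot))
```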

Kreate hackathon background:

Kreate was a 24-hour hackathon organised by Kalaari Capital and Skillenza. We provided a solution to this problem, coupled it with an RDBMS, developed an awesome UI and a detailed dashboard, spent a sleepless night 😜 and finally became winners for the theme provided by MoneyOnClick.

Team name: Crons

Team members:

Manikant Kella, Data scientist at United Health Group

Sai Manohar Boidapu, Data Engineer at Affine Analytics

Renuka Poluru, Web and UI expert at Zeta

Subhash Hardaha, Full stack developer at Info Gain India Pvt. Ltd.

Git link to the project: https://github.com/Manikant92/kreate_hackathon/

Featured in YourStory (https://yourstory.com/2019/10/hackathon-developers-kalaari-capital-kstart-startups)

I hope this is helpful to those who need it. Thank you 😃