AI-Based Document Digitisation
for InsurTech
Short story
The biggest driving factor for a digital transformation of documents is the ability to access them anywhere. OCR, ICR, and machine learning technologies have proved to be an effective and efficient way of converting paper documents into fully digital assets, faster, easier, and more accurately than ever before.
Yet many organisations still rely on manual data entry. Most do not invest in setting up an automated data extraction pipeline, and instead spend more on manual data-entry processes that require minimal skill.
Client: Automotive InsurTech Company, Indonesia
CLIENT
The client is an InsurTech company in Indonesia that provides innovative automotive insurance solutions to insurance companies across the APAC region. Their main focus was to digitise the documents they handle every day: 'paper documents' (scanned and stored as electronic images), 'office documents' (stored as PDF, Word, and other formats), and 'print stream' documents (stored in printer-specific formats).
They had two main technical goals: bringing the data trapped in those documents to life, so that analytics and AI tools can surface new business insights, and allowing the organisation to comply easily with data protection and privacy regulations.
CHALLENGE
THE TOP 5 CHALLENGES:
- The primary challenge for auto-insurance companies is managing insurance policies stored as PDFs (searchable or scanned). An insurer sells hundreds of thousands of policies, and managing that data manually is time-consuming.
- Despite years of automation and new technology initiatives, insurers must still deal with huge volumes of physical documents: paper, faxes, cheques, cards, and envelopes. A typical mid-sized insurer creates and receives millions of paper documents each year. This volume of paper also slows the company's daily workflow and frequently causes delays in sales and service.
- Another major challenge was to structure the digitised data in the same structure and sequence as the original document.
- Another challenge was to improve image quality before passing an image in for digitisation, because a low-quality image produces gibberish results (a preprocessing sketch follows this list).
- One more major challenge was to collect training data for a table detection model that detects specific kinds of table structures.
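The case study does not say which preprocessing steps were used to raise image quality before OCR. As a minimal sketch, assuming OpenCV is available, the snippet below shows three common clean-up steps: grayscale conversion, denoising, and Otsu binarisation. The file names are made up for illustration.

```python
import cv2

def preprocess_for_ocr(image_path: str, output_path: str) -> None:
    """Clean up a scanned image before sending it to the OCR engine."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)           # drop colour information
    denoised = cv2.fastNlMeansDenoising(gray, h=10)          # remove scanner speckle
    # Otsu thresholding produces a clean black-on-white image, which OCR engines
    # generally handle far better than a noisy grayscale scan.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite(output_path, binary)

preprocess_for_ocr("claim_form_scan.jpg", "claim_form_clean.png")
```

Deskewing, cropping, and resolution upscaling are other steps that are often added, depending on how the source documents were scanned.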
SOLUTION
- We trained and tested a table detection model, and combined table data extraction with OCR (optical character recognition) and PDF text-extraction techniques to digitise the documents.
- The insurer provides the input as a searchable PDF file, a scanned (image-based) PDF file, or an image.
- Scanned PDF and image inputs are first converted into a searchable PDF, and then the rest of the processing starts.
- The table detection model detects the table regions on each page of the input file, rendered in a common image format.
- The input file is also passed through OCR and PDF text-extraction logic to extract meaningful text from the form.
- Both text outputs go through post-processing, which produces the final text result; this result is then used to restructure and update the table detection results.
- Once the table detection results are updated, they are passed to the table-extraction logic, which fills the final table results with the extracted table data in row-column format.
- Once extraction is done, all the extracted data is merged into a final output that contains both text-line and tabular results.
- The final output is saved in XML format (a sketch of this end-to-end flow follows the list).
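The sketch below strings these steps together under stated assumptions: pdf2image renders pages as images, pdfplumber stands in for the PDF text-extraction step, pytesseract stands in for the OCR engine, and detect_table_regions / extract_table_rows are hypothetical placeholders for the project's table detection model and table-extraction logic. The XML element names are illustrative; the case study only says the final output is XML.

```python
import xml.etree.ElementTree as ET

import pdfplumber                        # PDF text-layer extraction (stand-in)
import pytesseract                       # OCR engine (stand-in)
from pdf2image import convert_from_path  # page rendering (stand-in)
from PIL import Image

def detect_table_regions(page_image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Hypothetical placeholder for the trained table detection model.
    Returns bounding boxes (left, top, right, bottom) of detected tables."""
    return []  # plug the real model's predictions in here

def extract_table_rows(page_image: Image.Image, box: tuple[int, int, int, int]) -> list[list[str]]:
    """Hypothetical placeholder for the table-extraction logic: OCR the cropped
    table region and split it into rows of cell strings."""
    text = pytesseract.image_to_string(page_image.crop(box))
    return [line.split() for line in text.splitlines() if line.strip()]

def digitise(pdf_path: str, xml_path: str) -> None:
    """Extract text lines and tables from a searchable PDF and save them as XML."""
    root = ET.Element("document")
    page_images = convert_from_path(pdf_path)          # page renders for detection/OCR
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, (page, image) in enumerate(zip(pdf.pages, page_images), start=1):
            page_el = ET.SubElement(root, "page", number=str(page_no))
            # Text lines: use the PDF text layer if present, otherwise OCR the page image.
            text = page.extract_text() or pytesseract.image_to_string(image)
            for line in text.splitlines():
                if line.strip():
                    ET.SubElement(page_el, "textline").text = line.strip()
            # Tables: detected regions are re-extracted in row-column format.
            for box in detect_table_regions(image):
                table_el = ET.SubElement(page_el, "table")
                for row in extract_table_rows(image, box):
                    row_el = ET.SubElement(table_el, "row")
                    for cell in row:
                        ET.SubElement(row_el, "cell").text = cell
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
```

In the real system the post-processing step also reconciles the OCR output with the detected table regions; that reconciliation is omitted here to keep the sketch short.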
This is a generic solution that works on any PDF or image input: searchable PDFs (where text can be searched and copied), scanned PDFs (where every page is an image), and directly scanned images (in any common image format).
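The case study does not describe how inputs are routed; as one plausible approach, the sketch below uses PyMuPDF (the fitz module) to decide whether an input is a plain image, a searchable PDF, or a scanned PDF, so the scanned cases can be converted to a searchable PDF first (for example with an OCR tool such as Tesseract) before the rest of the pipeline runs.

```python
import os
import fitz  # PyMuPDF

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def classify_input(path: str) -> str:
    """Return 'image', 'searchable_pdf', or 'scanned_pdf' (illustrative heuristic)."""
    if os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS:
        return "image"
    with fitz.open(path) as doc:
        # If any page carries a real text layer, treat the PDF as searchable;
        # otherwise every page is an image and the PDF needs OCR first.
        has_text = any(page.get_text().strip() for page in doc)
    return "searchable_pdf" if has_text else "scanned_pdf"
```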
Examples of the documents involved include:
- Vehicle Insurance Policy
- Certificate of Motor Insurance
- Vehicle's Registration Certificate
- Insurance Claim Form
- Claim Settlement Letter
- Accident Report
- Acceptance Report
- Letter of Subrogation
- Repair Bills and Cash Receipts
- Tax Receipt
RESULT
We tested the table detection model on multiple inputs:
- On searchable and scanned PDFs, the application achieved a good accuracy of 85%, and on image input it achieved 70% (one way such accuracy can be measured is sketched after this list).
- The application is able to detect both bordered and borderless tables in all three types of input.
- The OCR and PDF text-extraction techniques also gave very good results on searchable and scanned PDFs and moderate results on image input. Extracting table data directly from the detected table regions removed the need to separately extract and structure the text into row-column pairs.
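The case study does not state how the 85% and 70% figures were computed. A common way to score a table detection model is to match predicted boxes against hand-annotated ground truth using intersection-over-union (IoU); the sketch below shows that scoring under the assumption of an IoU threshold of 0.5.

```python
Box = tuple[float, float, float, float]  # (left, top, right, bottom)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def detection_accuracy(predicted: list[Box], ground_truth: list[Box], threshold: float = 0.5) -> float:
    """Fraction of annotated tables matched by at least one predicted box."""
    if not ground_truth:
        return 1.0
    matched = sum(1 for gt in ground_truth if any(iou(p, gt) >= threshold for p in predicted))
    return matched / len(ground_truth)
```

Averaging this score over a held-out set of labelled pages gives a per-input-type accuracy figure comparable to the numbers reported above.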