- Samkit Jain
A lending company needs the answer to just one question when giving a loan: “Will the borrower be able to repay the loan?”
Arriving at a binary (yes or no) answer to that question involves a lot of work, both manual and automated. Lending companies go through various financial documents of the borrower and perform a thorough analysis before reaching a conclusion. With the advancements in artificial intelligence and machine learning, some companies have been able to reduce the time to approve a loan to a day (but they don’t share the percentage of bad loans, probably because OCR systems are still not as good as they want them to be). To understand the past spending behaviour of a borrower and predict their future ability to repay, one of the financial documents that every lending company asks for is a bank account statement. A statement can be tens or hundreds of pages long: an individual’s may contain just a few hundred transactions, while a corporate’s can run into the thousands.
The bank statements that borrowers share come in every format imaginable. You’ll find PDFs (both bank-generated and scanned copies), CSVs, images and, in rare cases, even HTML files. In this post, we’ll talk about the most common type: images. Even among images, you’ll see a lot of variety.
OCR to the rescue!
Optical Character Recognition or OCR is a technology that recognizes text within an image. Humans have the ability to easily understand the text in an image, however complex (after all, we are the masters of the sacred texts!).
Over the past few years, OCR solutions have improved considerably. They can now recognise handwritten text with a good amount of accuracy. Giants like Google and Microsoft have also invested in the field and have come up with their own text recognition products.
OCR is known to work well when the characters are printed, the image quality is high and the lighting is ideal. Bank statement images shared by borrowers have one thing going for them: they contain printed text. But that alone is not enough. Even under ideal conditions, OCR falls short of the accuracy that financial data needs.
The majority of errors in OCR systems come from incorrect classification. A system usually misclassifies a character when a letter and a number share the same visual features. There are several such ambiguous cases.
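As a minimal sketch of what such misclassification looks like in an amount column, assume a hypothetical table of look-alike pairs (O/0, l/1, I/1, S/5, B/8, Z/2). The table, the function names and the sample reading below are illustrative assumptions, not measurements from any particular OCR engine.

```python
# Hypothetical look-alike table: letters an OCR engine may emit where a digit
# was printed. Illustrative only, not exhaustive and not measured.
LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8", "Z": "2"}


def suspicious_chars(amount_text):
    """Return (index, char, likely_digit) for each look-alike letter in a numeric field."""
    return [
        (i, ch, LOOKALIKES[ch])
        for i, ch in enumerate(amount_text)
        if ch in LOOKALIKES
    ]


def repair_amount(amount_text):
    """Swap look-alike letters for digits. Only sensible for fields known to be numeric."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in amount_text)


ocr_output = "1O,5l8.S0"  # made-up OCR reading of a printed "10,518.50"
print(suspicious_chars(ocr_output))  # [(1, 'O', '0'), (4, 'l', '1'), (7, 'S', '5')]
print(repair_amount(ocr_output))     # 10,518.50
```

A letter inside a numeric field is the easy case because it is at least detectable. A digit misread as another digit, or a comma misread as a dot, leaves a value that still looks valid, so a lookup like this cannot catch it.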
Images shared by borrowers are usually far from ideal. They are poorly lit, blurry or low resolution, carry pen or pencil markings, show folded pages, and so on. All these factors compound the problem and lead to more and more incorrectly classified characters.
Some examples where OCR didn’t work for us
Current state of OCR systems
When Inkredo was into P2P lending, we too accepted images of bank account statements and manually typed every entry into an Excel sheet. As expected, this was a time-consuming process, so we tried multiple OCR solutions (in-house, open-source and paid), but none of them gave us the desired result. Some of them worked really well with bad-quality photos, but all of them struggled with ambiguous characters.
To conclude, OCR is not reliable for text recognition in financial documents, where reading a comma as a dot (or vice versa) can make a significant difference. A PDF that contains text rather than scanned images should be the preferred format because it is not as easy to manipulate as a CSV, and extracting text from it is easier than extracting text from images. Plus, you won’t have to buy expensive OCR solutions.
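To see how much a single separator matters, here is a minimal sketch. It assumes amounts are printed with a comma as the thousands separator and a dot as the decimal point; the `parse_amount` helper and the figures are made up for illustration.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse an amount that uses ',' for thousands and '.' for decimals."""
    return Decimal(text.replace(",", ""))


printed = "12,500"  # what the statement says (made-up figure)
misread = "12.500"  # the same digits with the comma read as a dot

print(parse_amount(printed))  # 12500
print(parse_amount(misread))  # 12.500
```

Both strings parse without any error, so the misread value is silently off by a factor of 1,000. The reverse mistake is just as quiet: a printed "12.50" read as "12,50" parses to 1250.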
Do you think OCR is reliable when it comes to credit risk assessment? Share your thoughts in the comments.
In case you are interested in this problem, we're hiring machine learning engineers.