How we achieved accuracy of over 90% after reading 800+ transactions
- Samkit Jain
At Inkredo, we perform flow-based credit assessment to determine the monthly repaying capacity of a customer. Our customers are small & underbanked retailers who are running a bootstrapped and consistently profitable business, yet they remain excluded from formal credit. Formal institutions have shied to lend to lower-middle income group because the cost-benefit analysis of lending and collections do not offset the cost of originating and recovery. There is no cost-effective measure to monitor income/solvency & ensure timely repayment.
The assessment involves calculating useful analysis from the bank statement of our customer. This task requires copy pasting every transaction from the PDF of bank statement (containing tens of pages with hundreds of rows) to an Excel file, cleaning the copied data, and then using Excel wizardry to perform some statistical operations. Imagine using Ctrl+C and Ctrl+V almost a thousand times every other day. As you might have guessed, this involves a lot of human interaction and typically takes us a day to complete a single bank statement.
With our growing user base, a solution was required to reduce the effort and time required. A smart solution to generate insights within seconds with minimal human interaction.
Bank statement in PDF
Before starting with the development, we tackled the problem manually. For each bank statement (from various banks), read all the transactions, highlighted keywords and assigned appropriate labels and categories to each. Then with the generated mapping created a set of keywords for each category with priorities assigned.
A bank statement containing transactions from over six months of a person running a business is usually more than 20 pages long with around 1,000 transactions. Columns are generally of date, particular, balance, deposit, withdrawal, etc. For a specific bank, the result is pretty consistent and easy to play with, but every bank has its format for bank statements. Count of columns, positioning of columns, separators, text format and abbreviations vary.
The columns we require are
These columns are found in every bank statement. Example,
Date | Narration | Chq./Ref.No. | Value Dt | Withdrawal Amt. | Deposit Amt. | Closing Balance
Srl | Txn Date | Value Date | Description | Cheque No | CR/DR | CCY | Trxn Amount | Balance
Tran Date | Value Date | Particulars | Location | Chq.No | Withdrawals | Deposits | Balance (INR)
The naming convention might be different, but the purpose of every column remains the same. Created a dictionary called BANK_DETAILS that contains the position of the required column. Example,
Reading the bank statement
Reading tables from PDF documents is not an easy task. Even copying data from tables doesn’t work properly most of the time. Thankfully, there’s an open-source library available called tabula that can extract tables from a PDF with almost accurate results. We used its Python wrapper tabula-py for the data extraction.
Making the extracted data consistent
Every page with transactions table of the ICICI bank statement consists of this header row. This row is useless for the system as we are only targeting transactions.
Aim: Remove header rows from the list of transactions.
Solution: From reading multiple transactions from numerous bank statements we realised that the closing balance column is always the last. So, a header can be considered as rows (why plural? see next task) starting from the first row till the row where closing balance is not null. Then, go through all the rows and if the row is a part of headers, remove it. In the end, we have rows without any header.
In the image, we can see that particular can be in multiple lines but belong to the same row. Tabula cannot differentiate whether multiple lines in a row belong to the same row. It will treat them as multiple rows, and as a result, we get the following output:
# First line read from HDFC statement
['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', 'XXXX7', '22/06/17', nan, '1,000.00', '14,904.08']
# Second line read from HDFC statement
[nan, '8-XXXX', nan, nan, nan, nan, nan]
Aim: Convert the same particular from multiple rows into one.
Solution: The first line of every entry contains particular, date, balance, transaction amount and cheque number. Only the particulars can be multiline. So, a multiline particular can be between two date entries.
Credit, debit and default
As we saw in the columns of various bank statements, the differentiation between credit, debit and default is based on whether the entry is in deposit column or withdrawal column or in some cases whether mentioned as CR/DR.
Aim: Classify every transaction as credit, debit or default.
Solution: Classify all deposits as credit and withdrawals as debit. An event of default is defined when a withdrawal leads to negative closing balance and then immediately followed by a deposit of the same amount.
To perform analysis on the bank transactions, we need to categorise every bank transaction. Categorizing enables us to perform category specific operations and answer questions such as “how much does he spend on operations?” or “what are the different channels of earning?”. A category can be ATM, Shopping, IMPS, NEFT, etc.
Aim: Categorise every transaction.
Solution: For every transaction, tokenise the particular and based on the occurrence and position of keywords assign a category.
Now that we have read, cleaned and categorised transactions from the bank statement, it’s time to generate some insights. After all, what’s data without information?
Cashflow analysis helps in
This analysis gives an overall view of the total number and amount of credits, debits and defaults in the bank statement. Also contains a categorical breakdown of cash and non-cash transactions.
This analysis is a month-wise breakdown of the overall analysis of the bank statement. Helps in calculating the growth of the business.
This analysis shows the total number and amount of defaults in the bank statement along with the details of every default.
To understand the spending behaviour of the user we need to know the most common transactions. To answer questions like, “Are there multiple NEFT transactions to/from the same person/company?”, “Is he an IRCTC agent?” etc., We used Ratcliff-Obershelp algorithm to club similar transactions with more than 85% similarity. For better results, removed numbers, special characters from the strings.
Note: Code snippets mentioned above are pseudocodes to demonstrate the idea and may not contain all the edge cases.