Musings on taking technology to Bharat
How we achieved accuracy of over 90% after reading 800+ transactions
- Samkit Jain
At Inkredo, we perform flow-based credit assessment to determine the monthly repaying capacity of a customer. Our customers are small and underbanked retailers who run bootstrapped, consistently profitable businesses, yet remain excluded from formal credit. Formal institutions have shied away from lending to the lower-middle income group because the returns from lending and collections do not offset the costs of origination and recovery, and there is no cost-effective way to monitor income and solvency and to ensure timely repayment.
The assessment involves deriving useful analyses from the customer's bank statement. This task requires copy-pasting every transaction from the PDF bank statement (tens of pages with hundreds of rows) into an Excel file, cleaning the copied data, and then using Excel wizardry to perform some statistical operations. Imagine using Ctrl+C and Ctrl+V almost a thousand times every other day. As you might have guessed, this involves a lot of human interaction and typically takes us a day to complete a single bank statement.
With our growing user base, we needed a solution to reduce the effort and time required: a smart solution that generates insights within seconds, with minimal human interaction.
Bank statement in PDF
Before starting with the development, we tackled the problem manually. For each bank statement (from various banks), we read all the transactions, highlighted keywords and assigned appropriate labels and categories to each. Then, using the generated mapping, we created a set of keywords for each category, with priorities assigned.
A bank statement covering six months of a person running a business is usually more than 20 pages long, with around 1,000 transactions. The columns are generally date, particular, balance, deposit, withdrawal, etc. For a specific bank the layout is fairly consistent and easy to work with, but every bank has its own format: the number of columns, their positioning, the separators, the text format and the abbreviations all vary.
The columns we require are the date, particular (narration), withdrawal amount, deposit amount and closing balance. These columns are found in every bank statement. For example:
Date | Narration | Chq./Ref.No. | Value Dt | Withdrawal Amt. | Deposit Amt. | Closing Balance
Srl | Txn Date | Value Date | Description | Cheque No | CR/DR | CCY | Trxn Amount | Balance
Tran Date | Value Date | Particulars | Location | Chq.No | Withdrawals | Deposits | Balance (INR)
The naming convention might be different, but the purpose of every column remains the same. We created a dictionary called BANK_DETAILS that contains the position of each required column for every supported bank. For example:
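A minimal sketch of such a mapping, with zero-based column indices taken from the sample headers above (only HDFC is named in the samples; the other two keys are placeholders):

# Zero-based column positions of the fields we need, per bank format.
BANK_DETAILS = {
    'HDFC':   {'date': 0, 'particular': 1, 'withdrawal': 4, 'deposit': 5, 'balance': 6},
    'BANK_B': {'date': 1, 'particular': 3, 'cr_dr': 5, 'amount': 7, 'balance': 8},
    'BANK_C': {'date': 0, 'particular': 2, 'withdrawal': 5, 'deposit': 6, 'balance': 7},
}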
Reading the bank statement
Reading tables from PDF documents is not an easy task; even copying data from tables doesn't work properly most of the time. Thankfully, there's an open-source library called tabula that can extract tables from a PDF with almost accurate results. We used its Python wrapper, tabula-py, for the data extraction.
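The extraction step itself is a one-liner with tabula-py (the file name here is hypothetical; tabula-py needs a Java runtime, since it wraps the Java tabula library):

import tabula

# Extract every table from every page; returns a list of pandas DataFrames,
# roughly one per table that tabula detects.
tables = tabula.read_pdf('statement.pdf', pages='all', multiple_tables=True)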
Making the extracted data consistent
Every page of the transactions table in an ICICI bank statement starts with the same header row. This row is useless for the system, as we are only targeting transactions.
Aim: Remove header rows from the list of transactions.
Solution: From reading multiple transactions across numerous bank statements, we realised that the closing balance column is always the last. So the header can be taken as the rows (why plural? see the next task) from the first row up to the first row whose closing balance is not null. Then go through all the rows and remove every row that is part of the header. In the end, we have rows without any header.
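A minimal sketch of the idea, assuming the extracted page is already a pandas DataFrame whose last column is the closing balance:

import pandas as pd

def drop_header_rows(df):
    # Header rows are everything before the first row whose last column
    # (the closing balance) is non-null; real transactions always carry one.
    first_txn = int(pd.notna(df.iloc[:, -1]).values.argmax())
    headers = {tuple(r) for r in df.iloc[:first_txn].itertuples(index=False)}
    # Drop every later repetition of those header rows (one set per page).
    keep = [tuple(r) not in headers for r in df.itertuples(index=False)]
    return df[keep]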
A particular can span multiple lines yet belong to the same row. Tabula cannot tell whether consecutive lines belong to the same row; it treats them as separate rows, and as a result we get the following output:
# First line read from HDFC statement
['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', 'XXXX7', '22/06/17', nan, '1,000.00', '14,904.08']
# Second line read from HDFC statement
[nan, '8-XXXX', nan, nan, nan, nan, nan]
Aim: Merge a particular spread across multiple rows into one.
Solution: The first line of every entry contains the particular, date, balance, transaction amount and cheque number. Only the particular can be multiline. So a multiline particular lies between two dated entries: any row without a date is a continuation of the previous row's particular.
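A minimal sketch of the merge, assuming rows in the list format shown above (date in column 0, particular in column 1):

import pandas as pd

def merge_multiline_particulars(rows, date_col=0, particular_col=1):
    # Fold continuation rows (no date) into the previous row's particular.
    merged = []
    for row in rows:
        if pd.isna(row[date_col]) and merged:      # continuation line
            merged[-1][particular_col] += str(row[particular_col])
        else:
            merged.append(list(row))
    return merged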
Credit, debit and default
As we saw in the columns of various bank statements, the differentiation between credit and debit is based on whether the entry appears in the deposit column or the withdrawal column, or in some cases on a CR/DR marker.
Aim: Classify every transaction as credit, debit or default.
Solution: Classify all deposits as credits and all withdrawals as debits. An event of default is a withdrawal that leads to a negative closing balance and is immediately followed by a deposit of the same amount.
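A minimal sketch, assuming each transaction is a dict with illustrative 'deposit', 'withdrawal' and 'balance' keys parsed earlier:

def classify(transactions):
    # Label each transaction, then flag defaults: a withdrawal that drives
    # the closing balance negative and is immediately reversed by a deposit
    # of the same amount.
    for txn in transactions:
        txn['type'] = 'credit' if txn['deposit'] else 'debit'
    for prev, nxt in zip(transactions, transactions[1:]):
        if (prev['type'] == 'debit' and prev['balance'] < 0
                and nxt['deposit'] == prev['withdrawal']):
            prev['type'] = 'default'
    return transactions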
To perform analysis on the bank transactions, we need to categorise every transaction. Categorising enables us to perform category-specific operations and answer questions such as “how much does he spend on operations?” or “what are the different channels of earning?”. A category can be ATM, Shopping, IMPS, NEFT, etc.
Aim: Categorise every transaction.
Solution: For every transaction, tokenise the particular and, based on the occurrence and position of keywords, assign a category.
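A minimal sketch of the keyword lookup; the mapping here is illustrative, and its ordering stands in for the priorities mentioned earlier:

import re

CATEGORY_KEYWORDS = {'IMPS': 'IMPS', 'NEFT': 'NEFT', 'ATM': 'ATM', 'POS': 'Shopping'}

def categorise(particular):
    # Tokenise the particular and return the category of the first
    # (highest-priority) keyword that appears.
    tokens = re.split(r'[-/:\s]+', particular.upper())
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in tokens:
            return category
    return 'Others'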
Now that we have read, cleaned and categorised transactions from the bank statement, it’s time to generate some insights. After all, what’s data without information?
Cashflow analysis helps in understanding the financial health of the customer. We generate three views of it.
Overall analysis: an overall view of the total number and amount of credits, debits and defaults in the bank statement, along with a categorical breakdown of cash and non-cash transactions.
Monthly analysis: a month-wise breakdown of the overall analysis, which helps in calculating the growth of the business.
Default analysis: the total number and amount of defaults in the bank statement, along with the details of every default.
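As an illustration, the month-wise view falls out of a simple pandas aggregation once the transactions are cleaned and labelled (the 'date', 'type' and 'amount' column names are assumptions):

import pandas as pd

def monthly_analysis(df):
    # Month-wise count and total amount of credits, debits and defaults.
    df = df.copy()
    df['month'] = pd.to_datetime(df['date'], dayfirst=True).dt.to_period('M')
    return df.groupby(['month', 'type'])['amount'].agg(['count', 'sum'])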
To understand the spending behaviour of the user, we need to know the most common transactions and answer questions like “Are there multiple NEFT transactions to/from the same person/company?” or “Is he an IRCTC agent?”. We used the Ratcliff-Obershelp algorithm to club similar transactions with more than 85% similarity. For better results, we removed numbers and special characters from the strings.
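Python's difflib.SequenceMatcher implements Ratcliff-Obershelp, so a minimal sketch of the clubbing step can look like this (the grouping strategy is illustrative):

import re
from difflib import SequenceMatcher

def normalise(particular):
    # Strip digits and special characters so only the textual pattern remains.
    return re.sub(r'[^A-Z ]', '', particular.upper())

def club_similar(particulars, threshold=0.85):
    groups = []                                    # list of (representative, members)
    for p in particulars:
        key = normalise(p)
        for rep, members in groups:
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                members.append(p)
                break
        else:
            groups.append((key, [p]))
    return [members for _, members in groups]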
Note: Code snippets mentioned above are pseudocodes to demonstrate the idea and may not contain all the edge cases.
India is home to the largest underbanked population in the world: 25% of all underbanked people on the planet are in India. That alone is statement enough that the existing financial tools in the market aren't designed for those in the lower income group, despite all the push by the system in the last few decades. Existing financial institutions have always focused on serving those who can give the banks enough liquidity by maintaining an average monthly balance; the credit score comes after that.
With the increasing penetration of the internet, the mobile phone is about to become the first truly universal device. The telecom industry has shown the world how to deliver services at high volume and low cost, overcoming geographical constraints along the way. Why can't banking services do the same? It's time we put the power in the hands of individuals instead of the system. Hence, we decided to build a mobile-first financial tool to empower them.
The predominant USSD interface is clumsy, text-heavy, hierarchical, and a barrier to uptake. Smartphones open a whole new range of interface options that can leverage touch screens, images, graphics, and sound. A well-designed interface can affect millions of customers in their day-to-day interactions with finance. Many market signs point to rising smartphone usage in the next 5-10 years. Smartphone interfaces could be a key to unlocking value for low-literate consumers, overcoming the communication barriers imposed by early-stage, feature-phone-based models.
While working with informal lenders in the low-income geographies where micro-entrepreneurs operate, we identified key principles of design that drive their engagement when it comes to accessing a financial tool.
This is not meant to be comprehensive, but is intended as a starting set of principles that will improve smartphone interfaces for basic mobile financial tools in low-income geographies.
By Joy Lal Chattaraj
When I accepted the Internship offer from Inkredo, little did I know that I was going to design and deploy the entire backend for their product. It was the first time I was working with Django. I was supposed to deploy a scalable web server on the cloud for the first time. And it was the first time I was about to do some image processing to recognize texts in images using computer vision. There have been a lot of learning experiences since then. Let’s begin with Django.
I still remember my first day at Inkredo. I was new to Django and fumbling with it like a 5-year-old gifted a toy more suitable for a ten-year-old. However, a challenge is one thing I have never shied away from. I had worked with MVC frameworks before, but this was different; back then we used to create controllers for each model. I was still in doubt, struggling to manage the current migrations of the app. Every now and then a migration attempt would fail on the test server. I found out that the actual reason for this was the foreign key dependencies between different apps: it was essential to run the migrations in the order they were generated. But when you run the migrations of an app that depends on a different one, it's highly likely that the database ends up in an inconsistent state. Then Tanmay, the founder, shared this blog.
I saw the same problem there; they had faced these issues too. I quickly moved all the interrelated models into a single app, and the migrations didn't trouble me anymore. Here's the gist of what I did.
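A hypothetical before/after of the consolidation (the app and model names are made up):

# core/models.py -- all interrelated models now live in one app, so Django
# can order their migrations itself instead of tripping over cross-app
# foreign key dependencies.
from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=100)

class Loan(models.Model):
    # The foreign key now stays within a single app's migration graph.
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
    amount = models.DecimalField(max_digits=12, decimal_places=2)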
As the code base grew with each passing day, it became difficult for an intern to manage lengthy pieces of code. This was when I was introduced to the concept of modules, thanks to my colleague Droan, who taught me some cool ways to manage a codebase that was growing every day. I broke the code down into several parts, each serving its own purpose, and imported those functions when needed. It was the first time I was managing a large code base; my apps in the past had been quite small because Rails used to handle most of the tasks. This time I was designing APIs from scratch, with a little help from Django. It was a big responsibility to manage almost every aspect of the application and to learn every detail of how things work in the real world, because it was all new to me. I had never felt as confident as a developer before.
Before working at Inkredo, I had only read about the challenges of working in a startup. Here I had a new problem to solve each day; some were common, and I could easily find answers on the internet, while the not-so-common ones took a bit longer to solve. It was the first time I was experiencing what it means to learn on the job. Thanks to Stack Overflow; the problems would have taken much longer to solve without it. I will share some of my important lessons in this article, and I hope they help some of you out there.
One of the problems I faced was uploading files directly to AWS S3 from an API request. The file arrives as an “InMemoryUploadedFile”, and Python's “boto” isn't well documented for uploading such a file object. So, here's a code snippet that does the task, i.e., gets a file object from a request and uploads its content to a specific S3 bucket.
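The original snippet used boto; a minimal equivalent with the newer boto3 client looks like this (the form field name is hypothetical, and credentials are assumed to come from the environment):

import boto3

def upload_to_s3(request, bucket_name, key_name):
    # Values in request.FILES are file-like objects (InMemoryUploadedFile or
    # TemporaryUploadedFile), which is all upload_fileobj needs.
    file_obj = request.FILES['document']    # 'document' is a hypothetical field
    s3 = boto3.client('s3')                 # credentials come from the environment
    s3.upload_fileobj(file_obj, bucket_name, key_name)
    return 's3://{}/{}'.format(bucket_name, key_name)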
Now came the time to deploy my code. I had deployed a few web servers before, but those were for a small audience; scaling was something I hadn't done. Most of those servers were Linux instances on the cloud with a database installed on them and my application served by a web-server application.
But this time I had to configure storage buckets for static files, load balancers, a separate database instance and an auto-scaling environment to scale the number of web servers according to need. Later, I went ahead and deployed a few more features, like monitoring and alarms for my server instances and a task queue to manage some async tasks. I also considered deploying a caching server, but since our application isn't read-heavy and almost every call changes the database, deploying one didn't make sense.
It was a no-brainer to use AWS because I was familiar with it. One of its services, Elastic Beanstalk, made the deployment process easy: once you configure everything properly, it becomes really easy to deploy the next iterations of your updated application. I used to spend an hour or two every day on the platform. For the first few weeks, I played a lot with the infrastructure, trying every day to configure things to automate the deployment process.
It is necessary to set up billing and budget alerts before you start deploying things. I learned this the hard way when I accidentally deployed an MxLarge RDS instance costing $5/hr, and it ran for the next 48 hours until I saw the huge bar on the billing page. Thanks to the awesome support from Amazon, they understood the situation and waived the bill for that month. ($250/month is a huge cost for a startup in its early days; I know another startup that is billed $13,000/month for a user base of almost 25m.) Now we spend less than $10 a month, thanks to the AWS free tier.
Although the Elastic Beanstalk console is quite limited in functionality, its command-line version gives you full control of every resource you are running; you can SSH into an EC2 instance anytime and fix something. Below are some resources and code snippets that helped me a lot during the deployment process.
Here's how to add customizations to the Apache config files on each EC2 instance.
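Beanstalk picks up .ebextensions config files bundled with the application; a minimal sketch of one that drops a custom Apache conf onto each instance (the file path and contents are illustrative, for the Amazon Linux/Apache platform):

# .ebextensions/apache.config
files:
  "/etc/httpd/conf.d/custom.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      # Example customization: raise the request timeout
      Timeout 120

container_commands:
  01_restart_apache:
    command: "service httpd restart"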
Advice on deploying an update to your application on Beanstalk: never ship new code directly to the production environment (however well tested it may be, even when the update method is set to roll out one instance at a time). It is always possible that the code will break due to some dependency.
Instead, we spin up a parallel environment, deploy the new code there and check that everything works as expected.
Once you feel everything is right, use the swap URL feature on the Elastic Beanstalk console to move users to the new environment. Ensure all the traffic has migrated to the new environment before shutting down the old one, and always create alarms to keep you updated on any issues.
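The same flow can be driven from the EB CLI; a sketch with made-up environment names:

# Spin up a parallel copy of production, deploy the new build there,
# then swap CNAMEs once it checks out.
eb clone prod-env -n prod-env-v2
eb deploy prod-env-v2
eb swap prod-env -n prod-env-v2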
The motive of the app is to automate the entire loan application and processing flow. This involved a lot of character recognition, as the target group keeps its financial documents in hard copies; enabling them to autofill their information would make the task easier and avoid human typing errors. So, I first designed an OCR pipeline using Google's OCR library Tesseract. It produced good results when reading standardized documents such as a PAN card, but as the complexity of the document grew, such as reading a cheque leaf, getting a good accuracy score was difficult. To avoid the complexities of training a custom classifier and deploying it in the cloud (which would require a significant amount of computation), we decided to use Microsoft Azure's Vision API. It provided us the coordinates of all the text, and all we had to do was look for strings resembling a PAN number, or an account number and IFSC code from a cheque leaf. I then wrote a few regular expressions that made it easy to find strings that were a close match to the results we needed.
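The exact expressions aren't preserved here; a sketch built on the standard PAN and IFSC formats could look like this (the account-number pattern is a loose assumption):

import re

PAN_RE  = re.compile(r'\b[A-Z]{5}[0-9]{4}[A-Z]\b')   # PAN: 5 letters, 4 digits, 1 letter
IFSC_RE = re.compile(r'\b[A-Z]{4}0[A-Z0-9]{6}\b')    # IFSC: 4 letters, a zero, 6 alphanumerics
ACCT_RE = re.compile(r'\b\d{9,18}\b')                # assumed: 9-18 digit account numbers

def first_match(pattern, text):
    match = pattern.search(text)
    return match.group(0) if match else None

def extract_ids(ocr_lines):
    # Pull the first PAN / IFSC / account number out of the OCR output.
    text = ' '.join(ocr_lines)
    return {'pan': first_match(PAN_RE, text),
            'ifsc': first_match(IFSC_RE, text),
            'account': first_match(ACCT_RE, text)}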
Later we extended this to read bank statements; this is where even Azure failed to read everything in the image. We had tried Google's Vision API earlier, but the output wasn't satisfactory. So, we decided to work on making the image more readable. I came across a lot of image filters whose purpose is to reduce an image to pure black and white, with no other colors. I tried out a lot of them, including mean, median and Gaussian thresholding. The one that worked best for us was a custom-designed filter based on Otsu's thresholding principle.
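The custom filter itself isn't shown here; the standard OpenCV recipe for Otsu binarisation, which such a filter builds on, looks like this:

import cv2

def binarise(image_path):
    # Binarise a scanned page with Otsu's thresholding before OCR.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.GaussianBlur(img, (5, 5), 0)   # smooth noise so Otsu finds a clean split
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bw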
Finally, a bit of cropping and rotating the image helped us improve the accuracy significantly. But as we uploaded more documents, it failed a lot of the time because of the images' orientations and because of text that showed up outside the actual statement (the cropping and rotation step malfunctioned here). Someday I wish to build a custom-designed solution for this with much better accuracy; until then I will keep improving my machine learning skills.
For their credit needs, our users trusted us with their transactional messages, which gave us a closer look into their financial health. I set up a mechanism for capturing and storing this data securely in our database, but it requires a good amount of data to start training a custom text classifier that can categorize those messages and figure out the amount spent. We are working on building a sentiment analyzer for the same. If you know anyone working on this, or are interested in working on it, do write to us.
It's been a great 8 weeks of learning at Inkredo. The best part is that the team trusted me; they allowed me to play around with the tech and come up with my own solutions. I was involved in all decision-making, and every new idea was brainstormed before executing. I joined as a backend intern but ended up doing a lot more.
In a nutshell, these are the products I contributed to:
It feels like I have applied everything I learned over the past years, from information security to web development, a bit of DevOps and even data analytics. If there is one thing I still want to improve, it is writing tests for my code, as the functional coverage of my tests was not enough to ensure nothing was broken. I hope to build some challenging stuff for Inkredo again in the future.
Real-time alternative data for holistic assessment of the financial health of businesses that aren't tax compliant
Small businesses have always faced capital constraints when it comes to scaling up. It is no secret that lenders have struggled for decades to underwrite loans for small businesses, because there's hardly any reliable data to estimate the true health of such a business. Furthermore, the benchmark set by banks to estimate a borrower's income in the moderate-income group is too high.
Though the owners transact in low-value, high-volume business, they have limited access to lump-sum money and usually resort to private lenders to scale their business. Informal lenders depend on intuition to judge the intent and ability to repay, because they have access to the borrower's circle of influence; banks, likewise, were stuck for a long time with age-old ways of underwriting a borrower before credit scoring became the norm. Even the traditional ways are now becoming passé.
The traditional approach followed by formal lenders to gauge the ability to pay typically relies on a review of tax statements or income tax returns. This approach to estimating a prospective small borrower's income suffers from major drawbacks:
How do you underwrite loans for such businesses?
An alternative approach to underwriting loans in this segment is a high-touch process of estimating the borrower's income on the ground. This approach, followed by NBFCs and MFIs, is not only time-consuming but also highly prone to errors. It also leads to a high cost of underwriting that is eventually passed on to the borrower in the form of higher lending rates.
With the rapid pace of evolution in technology, lenders too need to find more efficient ways of underwriting beyond tax returns and income statements. There are various emerging supply-chain aggregators that lenders can potentially leverage as data sources for loan underwriting.
To exemplify: fin-tech players in real-time micro-payments and remittance services form a large industry, in which big marketplaces have developed and the web has become a trading platform for small businesses. These fin-tech companies operate prepaid instruments (aka e-wallets) that convert real cash into digital money, enabling small businesses to transact with large enterprises such as telecom, railways, DTH and banks (especially for remittances). They are therefore a trove of business transaction data. This data is a highly rewarding tool for lenders: it can be used to assess the cash flow of moderate-income borrowers, based on which loans can be underwritten.
The benefits of this data counter the drawbacks of ITR-based underwriting.
Another source that offers better evaluation than tax assessment is bank statement analysis, which is also available in real time. Bank statements capture detailed descriptions of credit and debit transactions with much more granularity. We extract business turnover, loan repayments, utility bill payments, point-of-sale transactions, etc. All of these are signals far more valuable than information obtained from historical tax statements. These data points are available in digital format and are fairly easy to extract and analyse if the right technology and data model are built around them.
Almost two months ago, an Android developer working at a "Baagh Global"-funded company applied at Inkredo and got rejected without being tested for technical skills.
We believe a startup is only as good as its people; the idea is secondary. We want to know who you are as a person, what floats your boat and what fires you up. We are all individuals and we respect each other's identity; however, we are not looking for ideological clones.
We want to know you, and why you want to invest your time in Inkredo.
Why are we asking every potential hire to write?
Everyone plays with data in any organisation, because that's how decisions are made in any modern-day company. Data is more than a number; there is a story hidden behind it, else it is meaningless. We are looking for people who can communicate effectively and make things easy to understand, because we'll be bringing chaotic things to order while the entropy of data keeps increasing with time.
It was a hard decision for us to make, because programming is one of the fields where a non-English speaker can still be a great programmer. That said, the last few years of building software and meeting communities of programmers have taught me that programmers who can communicate ideas clearly are far, far more effective than those who can only communicate via compilers. Language is crucial for documenting code and for writing specifications and technical design documents that other people can understand.
"Brilliant programmers who have trouble explaining their ideas just can't make much of a contribution." - Joel Spolsky, Creator of MS Excel, Fog Creek, Trello and Stack Overflow
India is a land of non-native English speakers, and hence we try to be considerate of non-native speakers who are nonetheless excellent communicators.
Please avoid clichés: don't bore us with superstar and rockstar stories. If you consider yourself an eager learner, a kind and supportive human being who can develop trust with strangers and ambitious minds alike, then you have a home here.
Diversity is important in building great things. There are few better ways to expand your horizons and grow as a person than putting yourself in the company of diverse minds with different experiences, backgrounds and identities. So don't shy away; show us that you belong here. We'll do great things together.