A practical approach to text analysis using Natural Language Processing.
Shubham (~jainshubham)
Contify provides an advanced Market Intelligence platform that consistently delivers the intelligence you're looking for on the Internet. We offer a range of products: a newsfeed, custom dashboards, daily newsletters, and an upcoming social media monitoring platform. All of this relies largely on how well we can crawl the web and sift through the information deluge. Our edge over the competition lies in how our algorithms are designed to work alongside our analysts to deliver intelligence quickly.
To give you some perspective: for one particular customer, about 35,000 documents were sourced in the past week. Using a classical supervised learning technique, we identified about 14,000 of them as relevant. Of these, about 10,000 documents were identified as unique by an unsupervised clustering pipeline we call deduplication. With just these two steps, we reduced the number of documents needing further processing by about 60-70%. The analysts rejected about half of the remainder, and roughly 500-700 documents went up for consumption. These documents were then tagged with named entities such as companies, persons, and locations using entity tagging (out of the scope of this discussion), and with semantic tags such as topics, industries, and sentiment using supervised text classification. Each individual can thus set up their own filters to receive only what is relevant to them, via daily newsletters, our platform, or as infographics on our dashboards.
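The funnel above amounts to simple arithmetic; a quick sketch with the approximate figures quoted (these are rounded illustrations, not exact production counts):

```python
# Approximate weekly document counts for one customer, as quoted above.
sourced = 35_000   # documents crawled in a week
relevant = 14_000  # kept by the supervised relevance classifier
unique = 10_000    # kept after the deduplication (clustering) step

# Fraction of documents removed before any analyst sees them.
reduction = 1 - unique / sourced
print(f"{reduction:.0%}")  # roughly the 60-70% reduction quoted above
```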
The starting point for all our text-processing pipelines is to transform each document into a feature vector. For deduplication and relevance detection, a simple word-frequency vector returned the best results, while for semantic tagging a TF-IDF vectorizer (which gives higher weight to words that are rare across the corpus) worked best. Once we have feature vectors, we can classify or cluster as needed. For relevance detection, we trained a logistic regression classifier on a dataset labeled as relevant or irrelevant. For deduplication, cosine similarity was computed for every pair of document vectors to measure how similar they were. For semantic tagging, a one-vs-rest logistic regression classifier was trained on documents labeled with multiple tags. We will discuss how the objective was to improve the analysts' performance and work within the given computational resources, rather than to chase marquee benchmarks.
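Assuming scikit-learn as the toolkit (the talk does not name one), and with toy documents, labels, and a hypothetical duplicate threshold, the three steps above can be sketched as follows:

```python
# A minimal, self-contained sketch of the three pipelines described above,
# using scikit-learn. The documents, labels, tags, and the duplicate
# threshold are all illustrative assumptions, not Contify's actual data
# or production settings.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "acme corp acquires beta inc in cash deal",
    "acme corp buys beta inc for cash",
    "weather today is sunny with light winds",
    "beta inc stock rises after acquisition news",
]

# 1) Relevance detection: word-frequency vectors + logistic regression.
relevant = [1, 1, 0, 1]  # toy labels: is the document relevant?
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
relevance_clf = LogisticRegression().fit(X_counts, relevant)

# 2) Deduplication: cosine similarity between every pair of word-frequency
#    vectors; pairs above a (hypothetical) threshold are near-duplicates.
sims = cosine_similarity(X_counts)
DUP_THRESHOLD = 0.6
dup_pairs = [
    (i, j)
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
    if sims[i, j] > DUP_THRESHOLD
]
print(dup_pairs)  # the first two documents describe the same story

# 3) Semantic tagging: TF-IDF vectors + one-vs-rest logistic regression
#    over multi-label annotations.
tags = [["m&a"], ["m&a"], ["weather"], ["m&a", "markets"]]
Y = MultiLabelBinarizer().fit_transform(tags)  # one binary column per tag
X_tfidf = TfidfVectorizer().fit_transform(docs)
tagging_clf = OneVsRestClassifier(LogisticRegression()).fit(X_tfidf, Y)
```

Note how the same count vectors feed both the relevance classifier and the pairwise cosine-similarity check, while the TF-IDF vectors are reserved for multi-label tagging, mirroring the split described above.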
Python, NLP, Basic Machine Learning techniques
Shubham works as a Data Scientist and full-stack developer at Contify.