A practical approach to text analysis using Natural Language Processing.

Shubham (~jainshubham)




Contify provides an advanced Market Intelligence platform that consistently delivers the intelligence you are looking for on the Internet. We have an interesting range of products, including a newsfeed, custom dashboards, daily newsletters, and an upcoming social media monitoring platform. All of this relies largely on how well we can crawl the web and sift through the information deluge. Our edge over the competition lies in how we have designed our algorithms to work alongside our analysts to deliver intelligence quickly.
To give you some perspective, for one particular customer, about 35,000 documents were sourced in the past week. Using a classical supervised learning technique, we were able to identify about 14,000 relevant documents. Of those that remained, about 10,000 documents were identified as unique using an unsupervised clustering pipeline that we call deduplication. With just these two steps, we were able to reduce the number of documents for further processing by about 60-70%. The analysts further rejected about half of them, and thus roughly 500 or 700 documents went up for consumption. These documents were then tagged with named entities such as companies, persons, and locations using entity tagging (out of scope for this discussion), and with semantic tags such as topics, industries, and sentiment using supervised text classification. Each user can then set up their own filters to receive only what is relevant to them, via daily newsletters, on our platform, or as infographics on our dashboards platform.
The starting point for all text processing pipelines is to transform each document into a feature vector. For deduplication and relevance detection, a simple word frequency vector returned the best results, while for semantic tagging a TF-IDF vectorizer (which gives higher weight to words that are distinctive to a document) worked best. Once we have a feature vector, we can perform classification or clustering as needed. For relevance detection, we trained a logistic regression classifier on a labeled dataset of relevant and irrelevant documents. For deduplication, we computed the cosine similarity of every pair of document vectors to measure how similar they were. For semantic tagging, a one-vs-all logistic regression classifier was trained on documents labeled with multiple tags. We will discuss how our objectives were to improve the analysts' performance and to work within the given computational resources, rather than to chase marquee benchmarks.
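As a rough illustration of the deduplication step described above, here is a minimal sketch using only the Python standard library. It is not Contify's actual pipeline: the function names, the tokenization rule, the 0.8 similarity threshold, and the sample documents are all assumptions made for the example. It also compares each document only against the documents already kept, which is a simplification of comparing every pair.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn a document into a sparse word-frequency vector (word -> count)."""
    # Assumed tokenization: lowercase runs of letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate(documents, threshold=0.8):
    """Return indices of documents that are not near-duplicates of earlier ones.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    kept = []  # list of (index, vector) pairs for documents kept so far
    for index, text in enumerate(documents):
        vector = word_counts(text)
        if all(cosine_similarity(vector, v) < threshold for _, v in kept):
            kept.append((index, vector))
    return [index for index, _ in kept]


# Made-up sample documents, for illustration only.
docs = [
    "Acme Corp acquires Widget Inc for 2 billion dollars",
    "Acme Corp acquires Widget Inc for 2 billion dollars today",
    "Heavy rain expected across northern India this weekend",
]
unique_indices = deduplicate(docs)
print(unique_indices)
```

The second sample document differs from the first by a single word, so its cosine similarity to the first is well above the assumed threshold and it is dropped, while the unrelated third document is kept.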


Python, NLP, Basic Machine Learning techniques

Content URLs:

Resources: 1. NLTK (http://www.nltk.org/) 2. Scikit Learn (http://scikit-learn.org/stable/). Company URL: https://www.contify.com/

Speaker Info:

Shubham works as a Data Scientist/Full stack developer at Contify.

Section: Core Python
Type: Talks
Target Audience: Beginner
Last Updated:

Would be great if you could add some slides or reference to the content you would be talking about.

Amit Singh Sethi (~dusual)

Can we have some links to your slides or a general structure, something that can be put on the projector so the audience can follow along?

Please upload the slides/structure so they can be reviewed before 12th Feb.

Have you given any talks (including this one) before? Any experience of public speaking? It's not a requirement for doing the talk, but it would definitely help us gauge the experience level. We suggest going through the presentation at least once in front of a small audience to get some experience, if you have not already.

Akshay Arora (~akshayaurora)
