Automatic Data Validation and Cleaning with PySemantic

Jaidev Deshpande (~jaidevd)


Description:

Data is dirty. Any dataset that isn't properly curated and stored can suffer from many problems, such as mixed data types, improper encoding or escaping, an uneven number of fields, and so on. None of these problems are unsolvable. In fact, most of us are pretty good at cleaning data. Normally, when we know little or nothing about a given dataset, we proceed in a very predictable manner. We first try to read the data naively and see if errors are raised by the parser. If they are, we try to fix our function calls. When those are fixed, we run some sanity checks on the data, and end up filtering the dataset, sometimes quite heavily.
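The read-then-patch routine described above can be sketched with nothing but the standard library. This is only an illustration: the sample data, the column names and the helpers `read_naively` and `sanity_check` are hypothetical, and a real workflow would more likely go through the Pandas parsers.

```python
import csv
import io

# Hypothetical dirty dataset: one non-numeric age, one row with a missing field.
RAW = """id,age,city
1,34,Pune
2,unknown,Mumbai
3,29
4,41,Delhi
"""


def read_naively(text):
    """Parse every row as-is, with no checks at all."""
    return list(csv.reader(io.StringIO(text)))


def sanity_check(rows):
    """Split parsed rows into clean records and rejected (row, reason) pairs."""
    header, *body = rows
    clean, rejected = [], []
    for row in body:
        # Uneven number of fields: the row cannot be mapped onto the header.
        if len(row) != len(header):
            rejected.append((row, "wrong number of fields"))
            continue
        record = dict(zip(header, row))
        # Mixed data types: id and age are expected to be integers.
        try:
            record["id"] = int(record["id"])
            record["age"] = int(record["age"])
        except ValueError:
            rejected.append((row, "non-integer value"))
            continue
        clean.append(record)
    return clean, rejected


clean, rejected = sanity_check(read_naively(RAW))
```

Half of the rows are filtered out here, and every rule lives inside one script, which is exactly the situation the next paragraph argues against.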

The problem with this process is that it is iterative, and worse, it is reactive. Everybody in the team has to do it if they are to use the dataset. Sure, one can simply clean it up and dump it in a new file with just a few lines of code. But we shouldn't have to run that script every time we encounter a new dataset. We would be much more comfortable if data were cleaned as it is read. It is much more efficient if data cleaning is a part of data ingestion. Secondly, and more importantly, cleaning data via ad-hoc Python scripts is non-trivial. Readable as Python scripts might be, it's not always easy for everyone in the team to change the cleaning process. Moreover, there are no Python libraries that offer an abstraction at the level of cleaning and validating data.

Therefore, if one has to go through the process of data validation and cleaning in a customizable, modular way, one has to make sure that:

  • the specifications for all datasets are in one place, not in different scripts
  • datasets are grouped under a suitable name that pertains to a particular project
  • strict validation and cleaning rules are applied to all aspects of a dataset
  • the process of validation and cleaning is identically reproducible by everyone who works on the data

PySemantic is a Python module that automates all of this, and more. The purpose of this talk is to introduce this module and talk about the best practices of cleaning and validating data.
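To make these requirements concrete, here is a minimal sketch of specification-driven ingestion in plain Python. It is not PySemantic's API: the `SPECS` registry, the `load_dataset` function, and the project, dataset and column names are all hypothetical, chosen only to show rules kept in one place and applied while the data is read.

```python
import csv
import io

# Hypothetical central registry: project name -> dataset name -> rules.
# Every dataset's column types and validity checks are declared here,
# not scattered across cleaning scripts.
SPECS = {
    "survey_project": {
        "respondents": {
            "columns": {"id": int, "age": int, "city": str},
            "rules": {"age": lambda value: 0 <= value <= 120},
        }
    }
}


def load_dataset(project, name, text):
    """Read CSV text and keep only the rows that satisfy the registered spec."""
    spec = SPECS[project][name]
    columns = spec["columns"]
    rules = spec.get("rules", {})
    records = []
    for raw in csv.DictReader(io.StringIO(text)):
        # DictReader marks surplus fields with a None key and
        # missing fields with a None value; reject both.
        if None in raw or None in raw.values():
            continue
        try:
            record = {col: cast(raw[col]) for col, cast in columns.items()}
        except (KeyError, ValueError):
            continue  # missing column or value of the wrong type
        if all(check(record[col]) for col, check in rules.items()):
            records.append(record)
    return records


# Hypothetical sample: rows 2, 3 and 4 each break one rule.
SAMPLE = """id,age,city
1,34,Pune
2,unknown,Mumbai
3,29
4,410,Delhi
5,41,Delhi
"""

records = load_dataset("survey_project", "respondents", SAMPLE)
```

Because the rules are data rather than code, anyone on the team who calls `load_dataset` with the same project and dataset names gets the same cleaned records, and changing a rule means editing one entry in the registry.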

Prerequisites:

Knowledge prerequisites:

  • Basic Python data structures
  • Pandas parsers
  • NumPy ndarrays and their data types
  • Basic tabular data analysis

Software Prerequisites - See https://github.com/motherbox/pysemantic#dependencies

Content URLs:

Here's a video that explains PySemantic in some detail (Note that it was meant for an audience of non-programmers): https://www.youtube.com/watch?v=6z-18zP4hOA

Speaker Info:

I'm a data scientist at Cube26 Software (http://cube26.com) where I build large scale machine learning applications. Previously I've worked at iDataLabs & Enthought, Inc, where I was one of the developers of the Canopy data analysis platform. I've been a research assistant in the fields of machine learning and signal processing at the Tata Institute of Fundamental Research and the University of Pune. I love developing GUI apps and signal processing tools in my free time.

Section: Data Visualization and Analytics
Type: Talks
Target Audience: Intermediate

I am already bowled over. Looking forward to the talk.

Amit Singh Sethi (~dusual)

Can we have some links to your slides or a general structure, something that can be put on the projector so the audience can follow along?

Please upload the slides/structure so they can be reviewed before 12th Feb.

Have you given any talks (including this one) before? Any experience of public speaking? It's not a requirement for doing the talk but would definitely help us gauge the experience level. We suggest going through the presentation at least once in front of a small audience to get some experience if you have not already.

Akshay Arora (~akshayaurora)

Akshay, please check the video link. The structure of the slideshow will be very similar to the one in the video. However, note that the video was made for a non-programmer audience, so I'm planning to update it by making it more Pythonic and software-focused. I will also be speaking in detail about the software development practices I followed when developing this.

Jaidev Deshpande (~jaidevd)
