Two business analysts. Two web developers. Two cyber security analysts. One data analyst. And one IT architect / technologist. We were all in an EMC class on Data Science and Big Data Analytics.

It was challenging, but it was also most enlightening for me. It was challenging because of the depth of statistics, programming and business domain knowledge that one has to have. But for the technologist, it sheds ‘true’ light on what data science deals with. It helped me understand what data scientists really need to know to churn out those ‘predictive analytics’ that we, as technologists, so casually talk about. Although this is probably level 101 in the data science world, my appreciation for this domain has increased manyfold.

Here’s Mike’s Take on why the Digital Technologist must scale this data science learning curve.

From 30,000 feet. It’s all about the data science life cycle. It’s about conceptualizing the business problem into a statistical one. It’s about knowing which data and variables to include in the analysis, or not. It’s about knowing which statistical model would give you the outcome you need. It’s about testing your predictive model and operationalizing it.

The Business Challenge. We were tasked to develop a predictive analysis model for a housing loan application system. The model is supposed to generate a rating for online loan applicants based on information like gender, income, ethnicity, loan amount, geographic location, etc. Normally, financial institutions will just perform a ‘credit check’ and revert to the applicant. This typically takes days. But this online predictive service provides an immediate rating to the applicant.

1. Data Preparation. We were given actual Home Mortgage Disclosure Act (HMDA) data from the US. There were 20-odd variables, ranging from loan purpose and past application actions to loan type and geographic location. The key task of the data scientist here is to clean the data: ensuring it is in the right format, deciding whether noisy variables should be considered in the model, and determining whether the data calls for more than one model. This is where the data scientist needs to have some knowledge about housing loans to make sense of the data.

Data discovery
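
To make the cleaning step a little more concrete, here is a minimal Python sketch. It is only an illustration: the column names (`loan_amount`, `income`, `loan_purpose`) and the sample rows are made up and are not the real HMDA schema, and the class may well have used different tooling. The idea is simply to drop records with unusable numbers and to normalise the category labels.

```python
import csv
import io

# Hypothetical raw extract: column names and values are invented for
# illustration and are not taken from the real HMDA files.
RAW = """loan_amount,income,loan_purpose
250,80,purchase
,45,refinance
120,NA,purchase
90,30,Purchase
"""

def to_number(text):
    """Return a float, or None when the field is blank or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(rows):
    """Keep only rows with usable numbers and normalise the category labels."""
    cleaned = []
    for row in rows:
        amount = to_number(row["loan_amount"])
        income = to_number(row["income"])
        if amount is None or income is None:
            continue  # drop incomplete records rather than guess the values
        cleaned.append({
            "loan_amount": amount,
            "income": income,
            "loan_purpose": row["loan_purpose"].strip().lower(),
        })
    return cleaned

records = clean(csv.DictReader(io.StringIO(RAW)))
print(len(records), "usable rows out of 4")
```

Dropping incomplete rows is just one possible choice. Whether to drop, impute or flag them is exactly the kind of judgement call that needs housing loan knowledge.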

2. Selecting the Data Set. The density plot shows that the data consists of two models with statistically separate behaviors that must be analyzed separately for accurate prediction. I am understating this. Really. One of our classmates spent a lot of time manipulating his model at later stages, only to discover that a noisy variable had been included in the analysis.

Discovering the data models

3. Building the Model. We then had to define the outcome we intended to achieve, and select the right statistical model to achieve that outcome. During the theory part of the class, we were introduced to popular clustering, association and classification techniques, and their associated outcomes. K-means, linear regression, logistic regression, Naïve Bayes… don’t worry, the class will bring you up to speed. Two models were eventually used: Logistic Regression and Decision Tree. In an unrelated lab practice, we had the chance to build a market basket analysis for a grocer using association rules. Simply too cool.
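
For a feel of what association rules measure in a market basket analysis, here is a small Python sketch. The baskets are toy data I made up, and the helper names (`support`, `confidence`, `pair_rules`) are my own; this is not the lab's actual code. Support is the share of baskets containing an itemset, and confidence estimates how often the consequent shows up when the antecedent is present.

```python
from itertools import combinations

# Toy baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of both sides together divided by support of the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def pair_rules(baskets, min_support=0.5):
    """List (antecedent, consequent, support, confidence) for frequent item pairs."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support >= min_support:
            rules.append((a, b, pair_support, confidence({a}, {b}, baskets)))
            rules.append((b, a, pair_support, confidence({b}, {a}, baskets)))
    return rules

for antecedent, consequent, s, c in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={s:.2f}, confidence={c:.2f}")
```

Real tools search over larger itemsets and prune aggressively; checking pairs only keeps the sketch short.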

4. The Model. The final outcome was a series of regression coefficients that represent a prediction function. Given the inputs from an online applicant, the function calculates the probability of a successful application. This came from the Logistic Regression model. This is the output of this analytics and data science exercise.

Coefficient values from logistic regression used to predict
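
Here is roughly how such coefficients turn into a prediction, as a minimal Python sketch. The coefficient values, the variable names and the sample applicant below are placeholders I invented, not the numbers fitted in class. The function takes the intercept plus the weighted sum of the inputs and passes it through the logistic function 1 / (1 + e^(-score)) to get a probability between 0 and 1.

```python
import math

# Hypothetical coefficients: placeholders, NOT the values fitted in the class.
INTERCEPT = -1.0
COEFFICIENTS = {"income": 0.03, "loan_amount": -0.005}

def approval_probability(applicant):
    """Logistic function applied to the linear combination of the inputs."""
    score = INTERCEPT + sum(
        weight * applicant[name] for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

applicant = {"income": 80, "loan_amount": 200}  # made-up applicant, arbitrary units
print(f"Estimated approval probability: {approval_probability(applicant):.2f}")
```

A positive coefficient pushes the probability up as its variable grows, and a negative one pushes it down, which is what makes the coefficient table readable by a business user.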

The Decision Tree model produced a tree of decision rules that serves the same purpose: generating a prediction based on the applicant’s input.

Decision tree used to predict
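
A fitted tree is really just nested if/else rules, which the following Python sketch tries to show. The split variables, thresholds and labels are invented for illustration and do not come from the class exercise.

```python
# Hypothetical tree: the split variables and thresholds are invented for
# illustration and do not come from the class exercise.
TREE = {
    "feature": "income",
    "threshold": 50,
    "below": {"label": "likely declined"},
    "at_or_above": {
        "feature": "loan_amount",
        "threshold": 300,
        "below": {"label": "likely approved"},
        "at_or_above": {"label": "needs review"},
    },
}

def predict(node, applicant):
    """Walk from the root to a leaf, following one branch per split."""
    while "label" not in node:
        value = applicant[node["feature"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]

print(predict(TREE, {"income": 80, "loan_amount": 200}))
```

Compared with regression coefficients, the appeal of the tree is that each prediction can be explained by reading the path taken from the root to the leaf.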

5. Operationalizing with Continuous Development. Although we didn’t cover this, the output of the predictive model (coefficients and decision logic) would then be used as the algorithm behind the “online loan prediction system”. And as more data is collected, the data scientist would continue to assess the existing model, refine it and make changes to it. Can you then imagine the need for a DevOps + PaaS capability to keep up with the changes on the coding side? Can you imagine the need for data science skills in every business and organization?
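
Since the class did not cover this step, the following Python sketch is purely my own assumption of what "continuing to assess the model" could look like: score a fresh batch of applications whose real outcomes are now known, and flag the model for refitting when accuracy drops well below a baseline. The stand-in model, the batch, the baseline and the tolerance are all hypothetical.

```python
def accuracy(predict, labelled_records):
    """Share of records where the model's prediction matches the known outcome."""
    correct = sum(
        1 for features, outcome in labelled_records if predict(features) == outcome
    )
    return correct / len(labelled_records)

def needs_refit(predict, new_records, baseline_accuracy, tolerance=0.05):
    """Flag the model when accuracy on fresh data falls well below the baseline."""
    return accuracy(predict, new_records) < baseline_accuracy - tolerance

# Hypothetical stand-in for the deployed model: approve when income >= 50.
def current_model(features):
    return features["income"] >= 50

# Made-up batch of newly observed applications with their real outcomes.
new_batch = [
    ({"income": 80}, True),
    ({"income": 40}, False),
    ({"income": 60}, False),
    ({"income": 30}, True),
]

print("Refit needed:", needs_refit(current_model, new_batch, baseline_accuracy=0.8))
```

A check like this is the sort of thing that would run on a schedule in the DevOps pipeline, with the refit itself still in the data scientist's hands.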

The above is just a snippet of the whole experience. And this class merely exposes the tip of the data science and analytics world. The learning here was the clarity with which a business problem and the desired outcome were translated into a statistical model with an actionable outcome. So, whether you are an end user or a vendor, this is a fantastic place to start understanding data science.

Check out the class: https://education.emc.com/guest/campaign/data_science.aspx