
SC1015: Data Science Mini Project - Unconventionality & Success

School of Computer Science and Engineering
Nanyang Technological University
Lab: SC14
Group: 6

Members:

  1. Bernice Koh Jun Yan (@bernicekjy)
  2. Nepal Aaradh (@ardnep)
  3. Veeraraghavan Srivathsan Nithyasri (@Veeraraghavan-S-Nithyasri)

Description:

This repository contains all the Jupyter Notebooks, datasets, images, video presentations, and the source materials/references we have used and created as part of the Mini Project for SC1015: Introduction to Data Science and AI.

This README briefly highlights what we have accomplished in this project. For a more detailed explanation, please refer to the Jupyter Notebooks in this repository; they contain more in-depth descriptions and smaller details that are not covered here in the README. For convenience, we have divided the notebooks into 5 parts, which broadly correspond to the 5 main sections of this project.


Table of Contents:

  1. Problem Formulation
  2. Data Preparation and Cleaning
  3. Exploratory Data Analysis
  4. Dimensionality Reduction
  5. Clustering
  6. Data Driven Insights and Conclusion
  7. References

1. Problem Formulation

Our Dataset: Stack Overflow Developer Survey 2020 on Kaggle
Our Question: Does Being Unconventional Determine Success?

Success: Determined using Salary and Job Satisfaction
Unconventional Individuals: Outliers/anomalies found after clustering individuals based on the technologies they use, such as web frameworks, programming languages, operating systems, etc.

Rationale: We believe that this dataset, as well as the question we pose, is very relevant to the SCSE community at NTU. As students of SCSE, we might become developers ourselves once we graduate. By learning what kinds of developers tend to be more successful, we might better understand what it takes to succeed in the software development world.

2. Data Preparation and Cleaning

In this section of the project, we prepared and cleaned the dataset so that we could analyze it more effectively and use it for machine learning in the later sections.

We performed the following:

  1. Preliminary Feature Selection: 8 relevant variables out of 61 were selected.
  2. Dropping NaNs: All rows with NaN values in these 8 variables were dropped.
  3. Splitting the Dataset in Two: The 8 variables were then split into two DataFrames: one with the 6 variables relating to conventionality and the other with the 2 relating to success.
  4. Encoding Categorical Variables: The categorical variables in both DataFrames were encoded appropriately (see the sketch below).
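
For illustration, a minimal pandas sketch of these four steps might look like the following. The column names follow the 2020 survey schema but are illustrative picks here; the exact 8 variables and encodings are spelled out in the Jupyter Notebook on Data Preparation.

```python
import pandas as pd

# Load the Stack Overflow Developer Survey 2020 (file name as distributed on Kaggle).
df = pd.read_csv("survey_results_public.csv")

# Illustrative column choices following the 2020 survey schema; the actual
# 8 variables are listed in the data-preparation notebook.
conventionality_cols = ["LanguageWorkedWith", "DatabaseWorkedWith",
                        "PlatformWorkedWith", "WebframeWorkedWith",
                        "MiscTechWorkedWith", "OpSys"]
success_cols = ["ConvertedComp", "JobSat"]

# 2. Drop rows with NaNs in any of the selected variables.
df = df[conventionality_cols + success_cols].dropna()

# 3. Split into the two DataFrames.
conventionality = df[conventionality_cols].copy()
success = df[success_cols].copy()

# 4. Multi-select answers are ';'-separated strings, so one-hot encode each option.
conventionality_encoded = pd.concat(
    [conventionality[c].str.get_dummies(sep=";") for c in conventionality_cols],
    axis=1,
)

# JobSat is ordinal, so map it onto the 0-4 scale used in the EDA below.
jobsat_order = ["Very dissatisfied", "Slightly dissatisfied",
                "Neither satisfied nor dissatisfied",
                "Slightly satisfied", "Very satisfied"]
success["JobSat"] = success["JobSat"].map({s: i for i, s in enumerate(jobsat_order)})
```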

3. Exploratory Data Analysis

Then, we explored each of our two DataFrames further using Exploratory Data Analysis to answer questions such as: Are there any noticeable patterns? What do our success variables look like? What about the conventionality variables? Are there any underlying relationships between them? Can we make any inferences about our question at this stage?

To achieve this, we did the following (a sketch of these checks follows the list):

  1. Explored ConvertedComp: This variable is the annual compensation in USD (i.e., salary). The median was around $54k, and there were many outliers with high salaries.
  2. Explored JobSat: This variable is job satisfaction on a 0-4 scale. The most frequent ratings were 2 and 4, and the mean rating was 2.3.
  3. Explored the Relationship Between JobSat and ConvertedComp: Only a weak correlation was seen between the two.
  4. Explored the Variables Related to Conventionality: Studied which options in the 6 variables were selected most frequently by respondents.
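
A compact sketch of these checks, continuing from the `success` DataFrame prepared above (the exact plots and statistics are in the notebook):

```python
import matplotlib.pyplot as plt

# Quick summaries of the two success variables.
print(success["ConvertedComp"].describe())  # median around $54k, long right tail
print(success["JobSat"].value_counts())     # ratings 2 and 4 are the most frequent

# Weak monotonic association between job satisfaction and salary.
print(success["JobSat"].corr(success["ConvertedComp"], method="spearman"))

# Boxplot of salary per satisfaction rating to eyeball the relationship.
success.boxplot(column="ConvertedComp", by="JobSat")
plt.show()
```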

For further findings and explanations, please refer to the Jupyter Notebook on EDA.

4. Dimensionality Reduction

After encoding, our DataFrame with 6 variables became a DataFrame with 94 columns, which is very high-dimensional data.

This posed a few problems (the curse of dimensionality):

  1. It would probably not produce well-formed clusters.
  2. High-dimensional data is expensive to work with because of the space and time required to run algorithms on it.
  3. High-dimensional data is difficult to visualize.

So, Multiple Correspondence Analysis (MCA) was used to reduce these dimensions. The conventional choice for dimensionality reduction is Principal Component Analysis (PCA), but PCA does not work well with categorical data, which is what we have. MCA, in contrast, is designed for multiple columns of categorical data.

Using MCA, the dimensions were reduced from 94 columns to just 42!
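
As a rough sketch, the reduction can be reproduced with the prince library (reference 4 below); the cut-off of 42 components matches the one chosen in the notebook, though the exact MCA settings there may differ:

```python
import prince  # MCA implementation referenced below (github.com/MaxHalford/prince)

# Treat each 0/1 indicator column as a two-level categorical variable and
# keep 42 components.
indicator = conventionality_encoded.astype("category")
mca = prince.MCA(n_components=42, random_state=42)
mca = mca.fit(indicator)

# Row coordinates in the reduced 42-dimensional space, used for clustering next.
reduced = mca.transform(indicator)
print(reduced.shape)  # (n_rows, 42)
```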

5. Clustering

With these 42 columns, we then performed clustering. We chose Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).

The reasons for this are:

  1. It is a density-based clustering algorithm, which means it does not require specifying the number of clusters. The algorithm will not force any point into a cluster; instead, points that do not really belong to any cluster are labeled "noise". This is clearly useful for us, since we are doing anomaly detection (outlier/noise detection).
  2. Because it is density-based, it can handle clusters of arbitrary shape. This is useful since we are working with high-dimensional data and cannot visualize what shapes our clusters might take.
  3. With non-hierarchical DBSCAN, certain hyperparameters are difficult to tune. HDBSCAN removes the need to tune some of these parameters.
  4. Because HDBSCAN is a hierarchical clustering algorithm, even with high-dimensional data we can use dendrograms to somewhat visualize the clusters and make inferences.

More details on HDBSCAN and its parameters are presented in the Jupyter Notebook on Clustering.
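
A minimal sketch of the clustering step using the hdbscan library, assuming `reduced` holds the 42 MCA components from above; the parameter values shown are placeholders rather than our tuned ones:

```python
import hdbscan

# Cluster the 42 MCA components; min_cluster_size / min_samples are placeholders.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, min_samples=10,
                            gen_min_span_tree=True)
labels = clusterer.fit_predict(reduced)

# Points labeled -1 are "noise" -- the unconventional individuals we are after.
n_clusters = labels.max() + 1
n_outliers = (labels == -1).sum()
print(f"clusters: {n_clusters}, outliers: {n_outliers}")

# Even in 42 dimensions, the hierarchy can be inspected via the condensed tree.
clusterer.condensed_tree_.plot()
```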

In this section, we performed the following:

  1. Clustering with Random Parameters
  2. Hyperparameter Tuning with GridSearchCV using the DBCV Score
  3. Readjusting Parameters (GridSearchCV does not work well in this case; see the sketch after this list)
  4. Clustering with New Parameters
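
Since GridSearchCV did not work well here, one alternative is a plain manual grid scored with HDBSCAN's built-in DBCV-style measure (`relative_validity_`, available when `gen_min_span_tree=True`). The parameter ranges below are illustrative only; the actual search is documented in the notebook:

```python
import itertools

import hdbscan

# Manual grid search over two key HDBSCAN hyperparameters, keeping the
# combination with the best DBCV-style relative validity score.
best_score, best_params = -1.0, None
for mcs, ms in itertools.product([50, 100, 200], [5, 10, 20]):
    c = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms,
                        gen_min_span_tree=True).fit(reduced)
    if c.relative_validity_ > best_score:
        best_score, best_params = c.relative_validity_, (mcs, ms)
print(best_params, best_score)
```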

Our final clustering resulted in a total of 3 clusters and 6206 outliers (out of 19362 total points).

6. Data Driven Insights and Conclusion

Here, we re-combined our success variables with the clustered conventionality variables to see whether there are any differences between outliers and non-outliers. We performed a comparative Exploratory Data Analysis on the outliers vs. the non-outliers to see what we could infer from their similarities and differences.

In this section, we also looked at the characteristics of the individuals in our 3 clusters using the variables related to conventionality. The findings have been presented in the Jupyter Notebook on Data Driven Insights.

Most notably, however, we found that there was no difference in the distribution of salary or job satisfaction between the outliers and the non-outliers (i.e., the unconventional and conventional individuals). So, we concluded that unconventionality might NOT be an indicator of success.
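
A sketch of this comparison, continuing from the earlier snippets and assuming the rows of `success` stayed aligned with the clustered data:

```python
import matplotlib.pyplot as plt

# Attach the outlier flag (labels == -1 marks HDBSCAN noise points).
success = success.reset_index(drop=True)
success["is_outlier"] = labels == -1

# Similar summary statistics in both groups would suggest no real difference.
print(success.groupby("is_outlier")[["ConvertedComp", "JobSat"]].describe())
success.boxplot(column="ConvertedComp", by="is_outlier")
plt.show()
```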

7. References

  1. https://bookdown.org/brian_nguyen0305/Multivariate_Statistical_Analysis_with_R/what-is-mca.html
  2. https://pca4ds.github.io/mechanics.html
  3. https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/multiple-correspondence-analysis/interpret-the-results/all-statistics-and-graphs/
  4. https://github.com/MaxHalford/prince
  5. https://www.researchgate.net/post/What-should-the-minumum-explained-variance-be-to-be-acceptable-in-factor-analysis
  6. https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
  7. https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
  8. https://github.com/christopherjenness/DBCV
  9. https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
  10. https://www.youtube.com/watch?v=dGsxd67IFiU
  11. https://towardsdatascience.com/tuning-with-hdbscan-149865ac2970
