Wednesday, February 22, 2017

Neural Network To Detect Any Pattern In The US H1B Visas Program

Officially, the US government awards H1B visas through a lottery system, i.e., a random process.

So, as a random project, I decided to cross-verify the claim. Is it truly random, or is there an underlying pattern that is not apparent?

The US government releases data on all H1B applications, including whether each one was certified. You can find the data at the United States Department of Labor.

I built a neural network to model this problem, using the following features:
  • VISA_CLASS
  • EMPLOYER_NAME
  • SOC_NAME
  • NAIC_CODE
  • PREVAILING_WAGE
  • PW_UNIT_OF_PAY
  • H1-B_DEPENDANT
  • WILLFUL_VIOLATOR
  • WORKSITE_STATE
There were around 647,000 data points for 2016 alone, with the following target classes:
  • CERTIFIED
  • CERTIFIED_WITHDRAWN
  • WITHDRAWN
  • DENIED
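Most of the features above are categorical, so they have to be turned into numbers before a neural network can use them. The post does not say how this was done, so the following is only a minimal sketch of one common approach (one-hot encoding for a feature, integer IDs for the target). The column name CASE_STATUS and the example rows are assumptions made up for illustration.

```python
def build_vocab(values):
    """Map each distinct value to an integer index, in first-seen order."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def one_hot(index, size):
    """Return a list of zeros of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# The four target classes listed above, mapped to integer class IDs.
STATUSES = ["CERTIFIED", "CERTIFIED_WITHDRAWN", "WITHDRAWN", "DENIED"]
status_to_id = {status: i for i, status in enumerate(STATUSES)}

# Made-up rows; CASE_STATUS is an assumed name for the target column.
rows = [
    {"WORKSITE_STATE": "CA", "CASE_STATUS": "CERTIFIED"},
    {"WORKSITE_STATE": "TX", "CASE_STATUS": "DENIED"},
    {"WORKSITE_STATE": "CA", "CASE_STATUS": "WITHDRAWN"},
]

state_vocab = build_vocab(r["WORKSITE_STATE"] for r in rows)
X = [one_hot(state_vocab[r["WORKSITE_STATE"]], len(state_vocab)) for r in rows]
y = [status_to_id[r["CASE_STATUS"]] for r in rows]
```

A feature with many distinct values, such as EMPLOYER_NAME, would produce very wide one-hot vectors, so in practice it would probably need some other treatment, for example grouping rare values together.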
Initially, I was getting around 94% accuracy in the very first epoch. I then realised that I had failed to understand the data and to build a baseline model first. You can read about the importance of building a baseline model in my blog post, Importance of Baseline Models.

The data showed that 89% of the 647,000 data points fell under the 'CERTIFIED' category. A baseline model that always predicts that class would therefore already score 89%, so it was no wonder that the network was showing such high accuracy.
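The baseline here is simply "always predict the most common class". A minimal sketch of that calculation follows; the label names and counts in the toy list are made up so that one class covers 89% of the rows, and are not taken from the actual data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels (made up): one class covers 89 of every 100 rows.
toy_labels = ["dominant"] * 89 + ["other_a"] * 5 + ["other_b"] * 4 + ["other_c"] * 2

baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)
```

Any model trained on such data has to beat this number before its accuracy means anything.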

To handle such a skewed dataset, I decided to create my own dataset with the following distribution:
  • CERTIFIED - 40%
  • CERTIFIED_WITHDRAWN - 20%
  • WITHDRAWN - 20%
  • DENIED - 20%
From this distribution, we can clearly see that the baseline accuracy should be 40%, since always predicting CERTIFIED would be right 40% of the time. Due to the limitations of my personal computer, I was able to use only 50,000 data points for training. The model reached 61% accuracy. A gain of 21 percentage points over the baseline is nothing to be ignored. At the same time, it does not conclusively prove that the process of awarding H1B visas has any underlying pattern.
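The rebalancing step can be sketched as sampling a fixed share of rows from each class. This is only an illustration of the idea, not necessarily how the dataset was actually built: the function name, the (features, label) row layout, and the toy rows are all assumptions.

```python
import random
from collections import Counter, defaultdict

# Target class mix described above.
TARGET_MIX = {
    "CERTIFIED": 0.40,
    "CERTIFIED_WITHDRAWN": 0.20,
    "WITHDRAWN": 0.20,
    "DENIED": 0.20,
}


def rebalance(rows, total, mix=TARGET_MIX, seed=0):
    """Sample `total` rows without replacement so each label gets its share.

    `rows` is a list of (features, label) pairs (a hypothetical layout).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    sample = []
    for label, share in mix.items():
        wanted = round(total * share)
        pool = by_label[label]
        if len(pool) < wanted:
            raise ValueError(f"not enough rows for {label}: {len(pool)} < {wanted}")
        sample.extend(rng.sample(pool, wanted))

    rng.shuffle(sample)
    return sample


# Toy rows (made up): 1,000 per class, features are just an integer ID.
toy_rows = [(i, label) for label in TARGET_MIX for i in range(1000)]
balanced = rebalance(toy_rows, total=100)
mix_counts = Counter(label for _, label in balanced)

# Majority-class baseline on the rebalanced sample.
baseline = max(mix_counts.values()) / len(balanced)
```

With this mix, the majority-class baseline on the sample is 0.40, which is the number the model's 61% should be compared against.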

I have some more modifications in mind that I believe would improve the model and might point to a conclusion. I will update this post once I am done with them.

Using at least 80% of the data set for training might also throw some new light on the pattern, and considering more features might improve the model. Both would require significant computing power, which would in turn make it possible to build a deeper, more complex neural network that might capture the underlying pattern, if any.
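For reference, an 80/20 split of the data could look like the minimal sketch below. The helper name and the stand-in data are made up for illustration; a real run would pass in the encoded application rows.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: 1,000 integers in place of real rows.
train, test = train_test_split(range(1000))
print(len(train), len(test))
```

Shuffling before cutting matters if the source file is ordered in some way, for example by date or by status.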

But for now, it seems that nothing conclusive can be said about the randomness of H1B visas.
