The US government officially states that H1B visas are awarded through a lottery system, i.e. a random process.
Therefore, as a random project, I decided to cross-verify the claim. Is it truly random, or is there an underlying pattern that is not apparent?
The US government releases data on all H1B applications, including whether each one was certified or not. You can find the data at the United States Department of Labor.
I built a neural network to model this problem. The features considered were:
- VISA_CLASS
- EMPLOYER_NAME
- SOC_NAME
- NAIC_CODE
- PREVAILING_WAGE
- PW_UNIT_OF_PAY
- H1-B_DEPENDANT
- WILLFUL_VIOLATOR
- WORKSITE_STATE
The status to predict for each application was one of:
- CERTIFIED
- CERTIFIED_WITHDRAWN
- WITHDRAWN
- DENIED
Initially, I was getting around 94% accuracy in the first epoch itself. I realised that I had failed to understand the data and to build a baseline model first. You can read about the importance of building a baseline model in my blog post, Importance of Baseline Models.
The data showed that 89% of the 647,000 data points fell under the 'DENIED' category, so a baseline model that always predicts that one class already reaches 89% accuracy. It was no wonder that the network was showing such high accuracies.
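A minimal sketch of that kind of baseline check is shown here. It is not the code used for the experiment: the function name `majority_baseline_accuracy` and the toy label list are assumptions made for illustration, with the toy labels simply mirroring the 89% skew rather than coming from the real data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) returns a one-element list of (label, count).
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Toy labels that imitate the skew described above (hypothetical, not real data).
toy_labels = ["DENIED"] * 89 + ["CERTIFIED"] * 11
baseline = majority_baseline_accuracy(toy_labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Any model has to beat this number by a clear margin before its accuracy says anything about a pattern in the data.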
To handle such a skewed dataset, I decided to create my own dataset with the following class distribution:
- CERTIFIED - 40%
- CERTIFIED_WITHDRAWN - 20%
- WITHDRAWN - 20%
- DENIED - 20%
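One way to build a dataset with the shares listed above is to sample each class separately. The sketch below assumes sampling without replacement and rows stored as dictionaries with a hypothetical `status` key; the helper name `resample_to_shares` and the toy rows are also made up for illustration and are not taken from the original experiment.

```python
import random
from collections import Counter, defaultdict

# Target class shares, taken from the list above.
TARGET_SHARES = {
    "CERTIFIED": 0.40,
    "CERTIFIED_WITHDRAWN": 0.20,
    "WITHDRAWN": 0.20,
    "DENIED": 0.20,
}


def resample_to_shares(rows, label_of, shares, total, seed=0):
    """Draw about `total` rows so that each class gets its requested share.

    Samples without replacement and raises if a class has too few rows.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sample = []
    for label, share in shares.items():
        wanted = round(total * share)
        pool = by_class[label]
        if len(pool) < wanted:
            raise ValueError(f"only {len(pool)} rows for {label}, need {wanted}")
        sample.extend(rng.sample(pool, wanted))
    rng.shuffle(sample)  # avoid leaving the rows grouped by class
    return sample


# Toy rows with a hypothetical "status" key (not the real data).
toy_rows = [
    {"id": i, "status": status}
    for i, status in enumerate(
        ["DENIED"] * 500
        + ["CERTIFIED"] * 60
        + ["CERTIFIED_WITHDRAWN"] * 30
        + ["WITHDRAWN"] * 30
    )
]
balanced = resample_to_shares(
    toy_rows, lambda row: row["status"], TARGET_SHARES, total=100
)
print(Counter(row["status"] for row in balanced))
```

Because the smallest classes cap how many rows can be drawn without replacement, the resampled dataset can end up much smaller than the original one.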
I have some more modifications in mind that I believe would improve the model and might point to some conclusion. I will update this post once I am done with them.
Using at least 80% of the dataset for training might shed new light on the pattern, and considering more features might improve the model. This would require significant computing power, which would in turn let us build a deeper, more complex neural network that might be able to capture the underlying pattern, if any.
But for now, it seems that nothing can be said about the randomness of H1B visas.
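For the 80% training split mentioned above, a rough sketch could look like the following. It assumes a stratified split, so that each class keeps the same share in both parts; that choice, the helper name `stratified_split`, and the toy `(id, status)` tuples are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows into train and test parts, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy (id, status) pairs, purely hypothetical.
toy_pairs = [(i, "CERTIFIED") for i in range(50)] + [
    (i, "DENIED") for i in range(50, 70)
]
train_rows, test_rows = stratified_split(toy_pairs, lambda pair: pair[1])
print(len(train_rows), len(test_rows))
print(Counter(status for _, status in train_rows))
```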