Projects

Data Scientist, Personalized Medicine: Redefining Cancer Treatment

Predicted genetic mutations from clinical evidence, featurizing text with NLP techniques such as TF-IDF and Word2Vec and encoding categorical variables with OneHotEncoder and response coding
Developed K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, SVM, and Naive Bayes models
Tuned model hyperparameters using k-fold cross-validation and smoothing, attaining 98.9% accuracy
Awarded 2nd position at Electrofocus, an annual state-level technical symposium at Anna University

Here is the GitHub repo
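Response coding, mentioned above, replaces each category with the estimated probability of every class given that category. The following is a minimal sketch, not the project's code: it assumes the "smoothing" is Laplace (additive) smoothing, and the gene names, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_response_coding(values, labels, alpha=1.0):
    """Estimate P(class | category) with Laplace smoothing (alpha is an assumed parameter)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1

    k = len(classes)
    table = {}
    for value, class_counts in counts.items():
        total = sum(class_counts.values())
        table[value] = [(class_counts[c] + alpha) / (total + alpha * k) for c in classes]

    # Categories never seen during fitting fall back to a uniform distribution.
    default = [1.0 / k] * k
    return classes, table, default


def transform_response_coding(values, table, default):
    """Map each category to its vector of smoothed class probabilities."""
    return [table.get(value, default) for value in values]


# Hypothetical toy data: one categorical feature and a two-class target.
genes = ["BRCA1", "BRCA1", "TP53", "TP53", "TP53", "EGFR"]
labels = [1, 2, 1, 1, 2, 2]

classes, table, default = fit_response_coding(genes, labels, alpha=1.0)
encoded = transform_response_coding(["BRCA1", "TP53", "KRAS"], table, default)
```

In practice the table would be fitted on the training fold only, so that class information from validation rows does not leak into the encoding.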

Data Scientist, Apparel Recommendation System

Developed and deployed a RESTful API to recommend apparel based on text semantics in the search engine
Built a semantic analysis model using neural networks and NLP techniques such as TF-IDF, Word2Vec, and average Word2Vec (AVGW2V)
Evaluated the recommended products at 90.4% accuracy and compared them with product images using Euclidean distance

Here is the GitHub repo
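As an illustration of the average Word2Vec idea, a text can be represented by the mean of its word vectors and candidates ranked by Euclidean distance to the query. This is only a sketch under stated assumptions: the three-dimensional vectors, the catalog, and the function names are hypothetical, and a trained Word2Vec model would supply the real vectors.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec model.
toy_vectors = {
    "red": [1.0, 0.0, 0.0],
    "shirt": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
    "jeans": [0.0, 1.0, 1.0],
}


def average_w2v(text, vectors, dim=3):
    """Average the vectors of the known words in a text; zero vector if none are known."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(query, titles, vectors, top_k=2):
    """Return the top_k titles whose averaged vectors lie closest to the query."""
    q = average_w2v(query, vectors)
    ranked = sorted(titles, key=lambda t: euclidean(q, average_w2v(t, vectors)))
    return ranked[:top_k]


catalog = ["red shirt", "blue jeans", "blue shirt"]
results = recommend("red shirt", catalog, toy_vectors, top_k=2)
```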

Data Scientist, App Rating Prediction

Developed a Google Apps rating predictor using machine learning algorithms, with a best error rate of 0.13
Evaluated performance using Root Mean Squared Error, R-squared, Residual Standard Error, and Mean Absolute Error

Here is the GitHub repo
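The four evaluation measures named above can be computed directly from actual and predicted ratings. The sketch below uses the standard definitions; the ratings are made-up illustrative values, and the number of predictors used for the Residual Standard Error is an assumed parameter, not a detail from the project.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors=1):
    """Compute RMSE, MAE, R-squared, and Residual Standard Error."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)              # sum of squared errors
    mean_y = sum(y_true) / n
    sst = sum((t - mean_y) ** 2 for t in y_true)     # total sum of squares
    return {
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - sse / sst,
        # RSE divides by the residual degrees of freedom, n - p - 1.
        "rse": math.sqrt(sse / (n - n_predictors - 1)),
    }


# Hypothetical ratings, for illustration only.
actual = [4.0, 3.5, 4.5, 5.0]
predicted = [4.0, 4.0, 4.0, 5.0]
metrics = regression_metrics(actual, predicted, n_predictors=1)
```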

Data Scientist, Question Pair Similarity Problem

Implemented a real-time duplicate-question predictor on the Quora dataset using Python and identified the best features
Performed feature extraction using NLP and fuzzy-matching techniques and developed Logistic Regression, SVM, and XGBoost models
Improved accuracy from 76.3% to 89.6% through hyperparameter tuning and model optimization
Secured 2nd runner-up position for the best machine learning model at Anna University hackathon with 200+ participants

Here is the GitHub repo
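To give a sense of what fuzzy and token-based features for a question pair can look like, here is a small sketch. It uses the standard library's difflib.SequenceMatcher as a stand-in for a dedicated fuzzy-matching library, and the feature names and example questions are hypothetical rather than taken from the project.

```python
from difflib import SequenceMatcher


def pair_features(q1, q2):
    """Build a few similarity features for a pair of questions."""
    a, b = q1.lower().strip(), q2.lower().strip()
    words_a, words_b = a.split(), b.split()
    tokens_a, tokens_b = set(words_a), set(words_b)
    union = tokens_a | tokens_b
    return {
        # Character-level similarity of the raw strings.
        "char_ratio": SequenceMatcher(None, a, b).ratio(),
        # Similarity after sorting tokens, so word order does not matter.
        "token_sort_ratio": SequenceMatcher(
            None, " ".join(sorted(words_a)), " ".join(sorted(words_b))
        ).ratio(),
        # Share of distinct words that the two questions have in common.
        "word_share": len(tokens_a & tokens_b) / len(union) if union else 0.0,
        # Absolute difference in word counts.
        "len_diff": abs(len(words_a) - len(words_b)),
    }


features = pair_features("How do I learn Python", "How do I learn Python quickly")
same = pair_features("what is spark", "spark is what")
```

Features like these would then be fed to a classifier alongside any text-vector features.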

Data Scientist, Customer Churn Analysis using Apache Spark & pysparkML

Developed an ML pipeline for predicting customer churn and performed ETL on an IBM telecom dataset
Built logistic regression and random forest models and cross-validated them to select the best parameters
The model achieved an accuracy of 81.24%, with a precision of 0.65 and a recall of 0.5057

Here is the GitHub repo