Projects

Data Scientist, Personalized Medicine: Redefining Cancer Treatment
April 2022 - June 2022 | California, USA
- Predicted genetic mutations from clinical evidence using NLP techniques such as TF-IDF and Word2Vec, with categorical variables encoded using OneHotEncoder and Response Coding
- Developed K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, SVM, and Naive Bayes models
- Tuned the model hyperparameters using K-Fold Cross Validation and smoothing to attain 98.9% accuracy
- Awarded 2nd position at Electrofocus, an annual state-level technical symposium at Anna University
Here is the GitHub Repo

Data Scientist, Apparel Recommendation System
October 2021 - December 2021 | California, USA
- Developed and deployed a RESTful API that recommends apparel based on text semantics in the search engine
- Built a semantic analysis model using neural networks and NLP techniques such as TF-IDF, Word2Vec, and Average Word2Vec (AVGW2V)
- Measured recommendation accuracy at 90.4% and compared the recommended products with product images using Euclidean distance
Here is the GitHub Repo

Data Scientist, App Rating Prediction
January 2021 - March 2021 | California, USA
- Developed a Google Apps rating predictor using machine learning algorithms, with a best error rate of 0.13
- Evaluated performance using Root Mean Squared Error, R-squared, Residual Standard Error, and Mean Absolute Error
Here is the GitHub Repo

Data Scientist, Question Pair Similarity Problem
September 2019 - December 2019 | Chennai, India
- Implemented a real-time duplicate question predictor on the Quora dataset using Python and identified the best features
- Performed feature extraction using NLP and fuzzy techniques and developed Logistic Regression, SVM, and XGBoost models
- Improved accuracy from 76.3% to 89.6% through hyperparameter tuning and model optimization
- Secured 2nd runner-up position for the best machine learning model at an Anna University hackathon with 200+ participants
Here is the GitHub Repo

Data Scientist, Customer Churn Analysis using Apache Spark & pysparkML
May 2019 - July 2019 | Chennai, India
- Developed an ML pipeline to predict customer churn and performed ETL on an IBM telecom dataset
- Built and cross-validated Logistic Regression and Random Forest models to tune for the best parameters
- Achieved an accuracy of 81.24%, with a precision of 0.65 and a recall of 0.5057
Here is the GitHub Repo
