Projects


Data Scientist, Personalized Medicine: Redefining Cancer Treatment

April 2022 - June 2022 | California, USA


  • Predicted genetic mutations from clinical evidence using NLP techniques such as TF-IDF and Word2Vec, and encoded categorical variables with OneHotEncoder and Response Coding
  • Developed K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, SVM, and Naive Bayes models
  • Tuned model hyperparameters using K-fold cross-validation and smoothing to attain 98.9% accuracy
  • Awarded 2nd position at Electrofocus, an annual state-level technical symposium at Anna University
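
For readers unfamiliar with response coding, the idea can be sketched as follows. This is a minimal illustration, not the project's code: it assumes that "smoothing" means Laplace (additive) smoothing, and the function name, `alpha` parameter, and gene labels are hypothetical.

```python
from collections import Counter, defaultdict


def response_coding(categories, labels, n_classes, alpha=1.0):
    """Return an encoder mapping a category to smoothed class probabilities.

    Assumption: Laplace smoothing with strength `alpha`; the source only
    says "smoothing" without naming the variant.
    """
    # Count how often each class label occurs for each category value.
    counts = defaultdict(Counter)
    for cat, y in zip(categories, labels):
        counts[cat][y] += 1

    def encode(cat):
        c = counts.get(cat, Counter())  # unseen categories get empty counts
        total = sum(c.values())
        return [
            (c[k] + alpha) / (total + alpha * n_classes)
            for k in range(n_classes)
        ]

    return encode


# Hypothetical toy data: gene names as categories, mutation class as label.
encode = response_coding(["BRCA1", "BRCA1", "TP53"], [0, 1, 1], n_classes=2)
print(encode("BRCA1"))
print(encode("UNSEEN"))
```

With smoothing, a category never seen in training falls back to a uniform distribution instead of causing a division by zero.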

Here is the GitHub Repo


Data Scientist, Apparel Recommendation System

October 2021 - December 2021 | California, USA


  • Developed and deployed a RESTful API to recommend apparel based on text semantics in the search engine
  • Built a semantic analysis model using neural networks and NLP techniques such as TF-IDF, Word2Vec, and average Word2Vec (AVGW2V)
  • Measured recommendation accuracy at 90.4% and compared recommended products with product images using Euclidean distance
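
As an illustration of average Word2Vec with Euclidean distance, the ranking step might look like the sketch below. It is not the project's implementation: the tiny 2-dimensional embedding table, the product titles, and all function names are hypothetical stand-ins for a trained Word2Vec model and a real catalogue.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens (the AVGW2V representation)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(query, titles, embeddings, dim, k=2):
    """Return the k titles whose average vector is closest to the query."""
    q = average_vector(query.lower().split(), embeddings, dim)
    scored = [
        (euclidean(q, average_vector(t.lower().split(), embeddings, dim)), t)
        for t in titles
    ]
    return [title for _, title in sorted(scored)[:k]]


# Hypothetical toy embeddings; a real system would load trained vectors.
toy_embeddings = {
    "red": [1.0, 0.0],
    "blue": [-1.0, 0.0],
    "shirt": [0.0, 1.0],
    "jeans": [0.0, -1.0],
}
catalogue = ["blue jeans", "red shirt", "blue shirt"]
print(recommend("red shirt", catalogue, toy_embeddings, dim=2))
```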

Here is the GitHub Repo


Data Scientist, App Rating Prediction

January 2021 - March 2021 | California, USA


  • Developed a Google Apps rating predictor using machine learning algorithms, with the best-performing model reaching an error rate of 0.13
  • Evaluated performance using Root Mean Squared Error, R-squared, Residual Standard Error, and Mean Absolute Error
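
For reference, three of the metrics named above can be computed as in this minimal sketch. The ratings are made-up illustrative values, not project data, and Residual Standard Error is omitted because it also depends on the number of model parameters, which is not stated here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R-squared for paired true/predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)       # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical app ratings and predictions.
print(regression_metrics([4.0, 3.0, 5.0], [4.0, 3.5, 4.5]))
```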

Here is the GitHub Repo


Data Scientist, Question Pair Similarity Problem

September 2019 - December 2019 | Chennai, India


  • Implemented a real-time duplicate question predictor on the Quora dataset using Python and identified the best features
  • Performed feature extraction using NLP and fuzzy-matching techniques and developed Logistic Regression, SVM, and XGBoost models
  • Improved accuracy from 76.3% to 89.6% through hyperparameter tuning and model optimization
  • Secured 2nd runner-up position for the best machine learning model at Anna University hackathon with 200+ participants

Here is the GitHub Repo


Data Scientist, Customer Churn Analysis using Apache Spark & pysparkML

May 2019 - July 2019 | Chennai, India


  • Developed an ML pipeline to predict customer churn and performed ETL on an IBM telecom dataset
  • Built and cross-validated Logistic Regression and Random Forest models to select the best parameters
  • Achieved an accuracy of 81.24%, with a precision of 0.65 and a recall of 0.5057

Here is the GitHub Repo