About Me
Expertise
Data Engineering, Data Analysis, Data Warehousing, ETL, Big Data, Cloud Platforms, Data Modeling, Database Design, Data Pipeline Architecture, Version Control, Containerization
Project Categories
Data Engineering, Data Science, Data Analytics, Software Engineering, Other
Bio
Rohan is a data professional with 7 years of experience, two master’s degrees, and a proven track record of working with distributed global teams. His expertise spans the entire data lifecycle, from data engineering and analytics to visualization and machine learning, allowing him to quickly adapt to new technologies and methodologies. Rohan’s international exposure, diverse skill set, and eagerness to learn make him well-equipped to tackle complex data challenges and drive impactful solutions in fast-paced environments.
Experience
Python - 6 years, Predictive Modeling - 6 years, SQL - 5 years, BigQuery - 3 years, Data Modeling - 3 years, Keras - 3 years, Django - 2 years, Amazon Web Services (AWS) - 2 years, PySpark - 2 years, dbt - 2 years, Docker - 1 year
Education
- Master of Science in Business Consulting, Furtwangen University, Germany (Master Thesis)
- Master of Science in Big Data Analytics & AI, Novosibirsk State University, Russia (Master Thesis)
- Bachelor of Technology in Engineering Physics, Delhi Technological University, India
Certifications
- AWS Certified Data Engineer Associate (Credly)
- Databricks Certified Data Engineer Associate (Credentials Databricks)
- Dell EMC Data Science Associate (Certmetrics NCB172QTLEFQQG55)
Skills
- Libraries/APIs - Pandas, NumPy, Scikit-learn, Apache Spark, Django
- Tools - dbt, Docker, Apache Airflow
- Languages - Python, SQL, R
- Paradigms - ETL/ELT, Data Modeling, Data Warehousing, Big Data Processing
- Platforms - MySQL, PostgreSQL, Amazon Web Services (AWS), Google Cloud Platform (GCP)
- Storage - Amazon Redshift, Google BigQuery, Snowflake
- Other - Data Pipeline Design, Data Quality Management, Data Analytics, Machine Learning
Preferred Environment
Visual Studio Code (VS Code), Amazon Web Services (AWS), Google Colab, macOS
Data Engineering Projects
Engineered a data pipeline using medallion architecture for Zillow data, processing 1GB of daily scraped CSV files from AWS S3. Implemented data cleaning and standardization for 180 columns, storing results in Parquet format with Hive partitioning. Deployed as an AWS Lambda function with scheduled execution via CloudWatch, optimized for 3GB RAM allocation.
Technologies: AWS S3, AWS Lambda, AWS CloudWatch, AWS CLI, Python, Pandas, PyArrow, CSV, Parquet, Hive partitioning
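A minimal sketch of the cleaning and partition-path logic described above, assuming illustrative column names (`Price`, `SqFt`) and a hypothetical bucket path — the production pipeline standardizes 180 columns and writes Parquet via PyArrow:

```python
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and coerce numeric fields (illustrative subset)."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    for col in ("price", "sqft"):  # hypothetical columns; the real data has 180
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")  # bad values -> NaN
    return df

def hive_partition_path(base: str, scrape_date: str) -> str:
    """Hive-style partitioning: one key=value directory per scrape date."""
    return f"{base}/scrape_date={scrape_date}/part-0.parquet"

raw = pd.DataFrame({"Price ": ["450000", "n/a"], "SqFt": ["1200", "980"]})
cleaned = clean_listings(raw)
path = hive_partition_path("s3://zillow-silver", "2024-01-15")
```

Hive-style `key=value` directories let downstream engines such as Athena or Spark prune partitions by date without scanning every file.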
Developed a tool to extract and analyze abbreviations from Wiktionary dumps. The project involved parsing complex data structures, implementing custom algorithms for abbreviation detection, and handling edge cases to ensure comprehensive coverage of linguistic variations.
Technologies: Python, XML processing libraries, regular expressions, data structures, version control (Git), shell scripting
Developed a machine learning model using PySpark to predict oil prices based on comprehensive oil field data. This project leveraged big data technologies to process and analyze large-scale datasets, enhancing decision-making capabilities in the energy sector. The model incorporated various predictive techniques, including Facebook Prophet, and was deployed on AWS infrastructure for scalable performance and real-time insights.
Technologies: PySpark, AWS EC2 (r5.2xlarge instances), Jupyter Notebook, Facebook Prophet, Scikit-learn, Apache Superset, AWS S3, SSH, Python
Developed a robust Python script to automate the process of migrating data from Snowflake to PostgreSQL. This tool ensures efficient and error-free data transfer between disparate database systems. It handles schema creation, table synchronization, and large-scale data movement with built-in error handling. The script also generates timestamped schemas for version control and easy rollback if needed.
Technologies: Python, SQLAlchemy, Pandas, Snowflake, PostgreSQL, SQL
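A dependency-free sketch of the migration pattern — timestamped schema names plus batched table copies — using `sqlite3` as a stand-in for the Snowflake source and PostgreSQL target (the real script drives both through SQLAlchemy engines):

```python
import sqlite3
from datetime import datetime, timezone

def timestamped_schema(prefix: str = "snapshot") -> str:
    # e.g. snapshot_20240115_093000 -- each run lands in its own schema,
    # so rolling back means pointing consumers at an earlier snapshot
    return f"{prefix}_{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}"

def copy_table(src: sqlite3.Connection, dst: sqlite3.Connection,
               table: str, batch: int = 1000) -> None:
    cur = src.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    while rows := cur.fetchmany(batch):  # stream in batches, not all at once
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    dst.commit()

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id, name)")
src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
dst = sqlite3.connect(":memory:")
copy_table(src, dst, "users")
```

Batched `fetchmany` keeps memory flat on large tables, which is the crux of moving data between warehouses without staging everything in RAM.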
Developed a Python-based tool to automate the extraction of Google Search Console (GSC) data and seamlessly integrate it with Google BigQuery (GBQ). This project streamlines the process of collecting and storing website performance metrics, enabling efficient data analysis and reporting. The tool supports customizable date ranges and multiple GSC properties, making it a versatile solution for SEO professionals and web analysts.
Technologies: Python, Google Search Console API, Google Cloud Platform, BigQuery, OAuth 2.0
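The core of such a tool is building the Search Analytics request body and flattening the response rows before loading them into BigQuery; a sketch using an illustrative sample response rather than a live API call:

```python
def build_gsc_request(start, end, dimensions=("date", "query"), row_limit=25000):
    # Request body for the Search Console searchanalytics.query endpoint
    return {"startDate": start, "endDate": end,
            "dimensions": list(dimensions), "rowLimit": row_limit}

def flatten_rows(response, dimensions):
    # GSC returns dimension values in a "keys" list and metrics as fields;
    # flatten each row into one record ready for a BigQuery load job
    records = []
    for row in response.get("rows", []):
        rec = dict(zip(dimensions, row["keys"]))
        rec.update(clicks=row["clicks"], impressions=row["impressions"],
                   ctr=row["ctr"], position=row["position"])
        records.append(rec)
    return records

sample = {"rows": [{"keys": ["2024-01-01", "data engineer"],
                    "clicks": 10, "impressions": 200, "ctr": 0.05, "position": 3.2}]}
records = flatten_rows(sample, ("date", "query"))
```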
Engineered a Python-based system utilizing OpenAI to automate the extraction and parsing of healthcare provider information from diverse state-specific files. The system handles multiple file formats including PDF, Excel, and CSV, processing data from over 10 different states. It manages complex variations in data structure, ensuring clean and accurate output for internal compliance and monitoring use.
Technologies: Python, OpenAI API, CSV handling libraries, Time management and logging tools
Developed a Python script to extract and structure data from Forbes Russia's list of top 200 private companies. The project involved web scraping techniques to gather detailed information including company rankings, names, financial data, industry sectors, management details, and geographical information. The extracted data was then organized into a comprehensive dataset, enabling in-depth analysis of Russia’s leading private businesses.
Technologies: Python, BeautifulSoup, Pandas, NumPy, Requests, tqdm
Developed a web scraping tool to collect seasonal anime data from MyAnimeList. The script uses the MyAnimeList API to fetch information about anime releases for specified years and seasons. It then processes and organizes this data into a structured format using Pandas, saving the results as a CSV file for further analysis or use in other applications.
Technologies: Python, Pandas, Requests, MyAnimeList API
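A sketch of the processing step, assuming a payload shaped like the MyAnimeList v2 season endpoint (`data` entries wrapping a `node`) — the field names here are assumptions, not the project's actual schema:

```python
import pandas as pd

# Illustrative payload; a live call would fetch this via the Requests library
sample = {"data": [
    {"node": {"id": 1, "title": "Show A", "mean": 8.1}},
    {"node": {"id": 2, "title": "Show B", "mean": 7.4}},
]}

def to_frame(payload, year, season):
    rows = [{"id": e["node"]["id"], "title": e["node"]["title"],
             "score": e["node"].get("mean"),  # some entries lack a score
             "year": year, "season": season}
            for e in payload["data"]]
    return pd.DataFrame(rows)

df = to_frame(sample, 2024, "winter")
df.to_csv("anime_winter_2024.csv", index=False)  # structured output for analysis
```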
Data Science Projects
Developed a machine learning model to accurately predict the system cancel rate for new jobs using pre-defined features. This model significantly improves customer lifetime value (LTV) predictions, enabling more efficient paid marketing campaigns and optimizing cleaner acquisition strategies. The project directly impacts business growth and operational efficiency.
Technologies: Python, Machine Learning, Keras, Optuna, Data Analysis, Feature Engineering, Data Visualization
Conducted comprehensive data analysis on global YouTube statistics to uncover success factors for top channels. Employed machine learning techniques, including Random Forest Classifier, to identify key performance indicators. Visualized insights through geospatial analysis using GeoPandas. Explored correlations between channel metrics, subscriber growth, and earnings. Analyzed popular content categories and regional influencers’ global impact. Developed actionable insights for content creators to optimize channel performance.
Technologies: Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, GeoPandas, Google Colab
PDF Q&A Bestie is an innovative app that enables users to upload PDF files and ask questions about their content. Powered by the Llama2 AI model, it provides accurate answers extracted directly from the uploaded documents. This project leverages Replicate API to offer a computationally efficient solution with a free quota, making advanced AI capabilities accessible to users.
Technologies: Llama2, Replicate API, Langchain, Pinecone, Streamlit, Python, Hugging Face Embeddings, PyPDF Loader
Conducted data analysis and business consulting for client’s portfolio companies. Analyzed UK housing prices to provide actionable insights and propose advanced analytics strategies. Performed descriptive, exploratory, and predictive analyses to enhance business operations and decision-making. Delivered comprehensive presentations to company executives, focusing on data-driven solutions and value extraction from available datasets.
Technologies: R, Data Visualization, Statistical Analysis, Geospatial Analysis, Large Datasets, Kaggle Notebook
Developed a Python-based tool for semantic keyword clustering using natural language processing techniques. This tool processes large volumes of keywords, performs text cleaning, and groups semantically similar terms into clusters. It utilizes sentence transformers for generating embeddings and implements community detection algorithms for creating meaningful keyword groupings, enhancing SEO and content strategy efforts.
Technologies: Python, Pandas, Sentence Transformers, NLTK, Scikit-learn
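The clustering step can be approximated without heavy dependencies; a toy sketch of threshold-based greedy community detection over cosine similarity (the real tool derives its embeddings from sentence transformers, whereas the 2-D vectors below are stand-ins):

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarity of row vectors."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return norm @ norm.T

def greedy_communities(emb, threshold=0.75):
    # Seed a cluster with the first unassigned keyword, then pull in every
    # keyword whose similarity to the seed clears the threshold
    sim = cosine_matrix(emb)
    unassigned = set(range(len(emb)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        members = [i for i in unassigned if sim[seed, i] >= threshold]
        clusters.append(members)
        unassigned -= set(members)
    return clusters

# Two tight groups of toy "keyword embeddings"
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
clusters = greedy_communities(emb, threshold=0.75)
```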
Developed an algorithm to optimize Facebook’s Bid Multiplier feature for targeted advertising. The project involved analyzing ~20,000 customer data points to create an efficient bid configuration based on demographic factors like age, gender, and device platform. The goal was to maximize 12-month customer Lifetime Value (LTV) while minimizing ad spend, resulting in improved Return on Ad Spend (ROAS).
Technologies: Python, NumPy, Pandas, SciPy, Machine Learning, Data Analysis, Data Visualization
Developed a machine learning model to predict customer churn using XGBoost, an advanced gradient boosting algorithm. Achieved high accuracy and AUC scores through feature engineering and hyperparameter tuning. Implemented cross-validation for model optimization. This project enhanced the company’s ability to proactively retain customers and improve business outcomes.
Technologies: Python, XGBoost, Pandas, NumPy, Scikit-learn, Matplotlib
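AUC, one of the headline metrics above, has a simple rank interpretation — the probability that a randomly chosen churner is scored above a randomly chosen non-churner; a dependency-free sketch:

```python
def auc(y_true, scores):
    """Probability a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is O(pos x neg) and fine for a sanity check; libraries such as scikit-learn compute the same quantity from sorted ranks.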
Developed an advanced product recommendation system using machine learning techniques. The project involved data preprocessing, feature engineering, and model optimization to predict customer purchases. Implemented cross-validation for robust evaluation and utilized Optuna for hyperparameter tuning. The system aims to enhance sales predictions and improve customer targeting for a more efficient sales strategy.
Technologies: Python, pandas, scikit-learn, LightGBM, Optuna, NumPy, Tmux
Developed an advanced machine learning model to predict recruitment fill times for job postings. The project involved data preprocessing, feature engineering, and implementing multiple XGBoost models for different pay structures. The final model significantly outperformed the baseline, enabling data-driven hiring strategies and significantly enhancing workforce planning efficiency.
Technologies: Python, XGBoost, Keras-TensorFlow, Feature Engineering, RMSE, Time Series Analysis
Developed a chatbot to provide instant responses to prospective applicants’ queries about an educational program. The chatbot automates FAQ handling, enhances user satisfaction, and streamlines communication across multiple platforms, including the institution’s website and social media channels. It features intent recognition, multi-language support, and seamless integration with existing systems. The solution significantly reduces response times and improves information accessibility for potential students.
Technologies: Dialogflow, Python, HTTPS Protocol, Social Media APIs
Developed an AI-powered system to predict “Angels” for new accounts using machine learning and Google Search API. The system processes account data, extracts information from the internet, and utilizes a trained model to make predictions. Achieved a global accuracy of 62.38% on unseen data, with the ability to increase accuracy to 98.99% by implementing a confidence threshold.
Technologies: Python, scikit-learn, Google Search API, NLTK, Pandas, NumPy, Pickle
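The accuracy-versus-coverage trade-off behind that confidence threshold can be sketched in a few lines (the numbers below are toy values, not the project's data):

```python
def thresholded_accuracy(preds, labels, confs, threshold=0.0):
    """Accuracy and coverage when only predictions above `threshold` are kept."""
    kept = [(p, y) for p, y, c in zip(preds, labels, confs) if c >= threshold]
    if not kept:
        return 0.0, 0.0
    accuracy = sum(p == y for p, y in kept) / len(kept)
    coverage = len(kept) / len(preds)  # fraction of accounts we still answer
    return accuracy, coverage

preds, labels = [1, 0, 1, 1], [1, 1, 1, 0]
confs = [0.9, 0.4, 0.95, 0.3]
```

Raising the threshold trades coverage for accuracy: low-confidence accounts are deferred rather than mispredicted, which is how a 62% global accuracy can become near-99% on the subset the model answers.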
Developed a Python-based tool to calculate and analyze inter-annotator agreement using Fleiss’ Kappa and Krippendorff’s Alpha metrics. The project involved implementing statistical algorithms, handling complex data structures, and interpreting results to assess reliability in annotation tasks. Explored limitations of agreement measures and their implications for data quality assessment in machine learning applications in hematology.
Technologies: Python, NumPy, statsmodels, Krippendorff
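Fleiss’ Kappa itself is compact; a dependency-free sketch of the computation, taking a subjects-by-categories count matrix with an equal number of raters per subject:

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = raters assigning subject i to category j; rows share rater count n."""
    N, k = len(ratings), len(ratings[0])
    n = sum(ratings[0])
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]      # category proportions
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]  # per-subject agreement
    P_bar = sum(P_i) / N           # observed agreement
    P_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields 1.0; agreement no better than chance yields 0 or below, which is the interpretive baseline for annotation reliability.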
Implemented and evaluated the performance of the Berkeley and Stanford parsers on the Wall Street Journal corpus. Developed custom Python scripts to calculate parsing accuracy metrics like precision, recall, and F1 score. Conducted comparative analysis of parser outputs using the EVALB scoring program and visualized results.
Technologies: Python, NLTK, Pandas, Matplotlib, Berkeley Parser, Stanford Parser, EVALB Evaluation
Data Analytics Projects
Developed comprehensive Chartio dashboards integrating various business functions. Created interactive visualizations for conversion funnels and budget reports, enhancing data interpretation. Implemented alert systems for proactive monitoring and embedding capabilities for seamless integration. The project streamlined analytics processes, enabling data-driven decisions across departments and improving overall business intelligence. These dashboards provided a centralized platform for real-time data analysis and reporting.
Technologies: Chartio, Data Visualization, Chart Embedding, Alerts
Engineered a centralized reverse ETL customer data store, seamlessly integrating Hubspot, Klaviyo, and website tracking sources. Implemented robust data synchronization and conflict resolution mechanisms to ensure data integrity across platforms. Leveraged dbt for efficient data transformation and SQL for complex queries, creating a comprehensive, unified view of customer data, and visualization based on it in Metabase. This solution streamlined customer analytics and enabled more targeted marketing strategies.
Technologies: dbt, Metabase, SQL, BigQuery, Airbyte, RudderStack, Klaviyo, HubSpot
Engineered a cost-effective, real-time data pipeline, seamlessly integrating customer data platforms and analytics tools. The solution harnessed the power of Segment, RudderStack, and BigQuery to optimize data flow, significantly reduce latency, and enhance data accessibility. This robust infrastructure enabled real-time insights, facilitating data-driven decision-making and improving overall business intelligence capabilities.
Technologies: Segment, RudderStack, BigQuery, Amplitude, Google Analytics 4
Implemented advanced user analytics for an e-commerce platform using Amplitude. Developed custom dashboards for user journey analysis and product performance metrics. Created segment analysis to compare brand performance, focusing on flowers and drinks categories. Utilized data-driven insights to optimize user experience, increase conversions, and inform strategic decisions across various product lines.
Technologies: Amplitude Analytics, SQL, Data Visualization, E-commerce, Event Tracking, User Segmentation, Custom Metrics
Designed and implemented comprehensive Cluvio-based dashboards for a major e-commerce company, providing real-time insights into key performance indicators (KPIs). The project focused on channel performance, revenue tracking, customer behavior analysis, and item performance metrics. These dashboards enabled data-driven decision-making by visualizing MTD, QTD, and YTD comparisons, revenue trends, customer lifetime value, and top-performing items.
Technologies: Cluvio, SQL, Data Analysis, KPIs
Developed a comprehensive financial Qlik Sense dashboard integrating accounts receivable aging and time analysis features. This tool provides real-time insights into AR transaction summaries, aging distributions, and designer performance metrics. The dashboard enables efficient tracking of outstanding payments, workload distribution, and project hours, facilitating improved financial management and resource allocation for the business.
Technologies: Qlik Sense, Data Visualization, DBMS, Financial Data, Interactive Design, Business Intelligence
Developed a scoring dashboard using Apache Superset to analyze and visualize performance metrics across multiple dimensions. Implemented time-based reporting, weighted scoring, and user comparison features. Integrated with MySQL database and explored materialized views for efficient data handling. Investigated API integrations and webhook implementations to enable real-time dashboard updates and notifications.
Technologies: Apache Superset, MySQL, Flask, Jinja, Docker, Git, RESTful APIs, Webhooks
Developed a comprehensive sales tracking board to monitor and analyze sales associate performance. The dashboard features KPI tracking, scoreboard metrics, and visual representations of sales data. It allows for time-based filtering and provides a clear ranking system for sales associates. This tool enhances performance visibility and facilitates data-driven decision making in the sales department.
Technologies: Looker Studio, SQL, Google Sheets
Developed and optimized data visualization solutions using Qlik Sense and QlikView. Created interactive dashboards for financial reporting, inventory planning, and social media analytics. Improved dashboard performance through backend script optimization. Provided custom solutions for diverse industries including healthcare, finance, and marketing.
Technologies: Qlik Sense, QlikView, Qlik Web Connectors, Idevio Maps, SQL
Conducted comprehensive financial data analysis on a multi-year dataset (2013-2017) to assess investment performance and risk factors. Utilized advanced statistical methods including regression analysis, beta calculations, and Fama-MacBeth risk premium estimation. Evaluated fund flows, returns, and risk metrics across multiple product references to provide insights for informed investment decisions.
Technologies: Excel, Statistical Analysis, Time Series, Financial Modeling
Software Engineering Projects
TrendsOnUp is a freelancer-focused web application that provides real-time email notifications for Upwork job postings. It allows users to receive alerts for opportunities matching their skills and preferences, helping them stay ahead in their freelance careers. This free service empowers freelancers worldwide by enabling them to quickly respond to relevant job openings in the Upwork marketplace.
Technologies: HTML, JavaScript, CSS, Email integration, Cloud Infrastructure
Job Application Bestie is a web-based platform designed to streamline the job application process. It offers tools for crafting personalized cover letters, managing job applications, and leveraging AI assistance. The platform provides a user-friendly interface for job seekers to create professional applications, track their progress, and increase their chances of securing interviews.
Technologies: Django, HTML/CSS, JavaScript, OpenAI API, PDF Generation, MySQL
Developed a Python-based tool to visualize Link Grammar dictionaries, focusing on 5-word clusters. Transformed grammar data into ontology format using rdflib, then generated interactive graph visualizations with ontospy. This innovative approach enables exploration of linguistic structures through ontology-based representations, offering new insights into language patterns and relationships.
Technologies: Python, rdflib, Ontospy, RDF/OWL, Visualization
Developed a Django-based tool for analyzing and classifying publishers from Google Sheets data. The system automates the process of identifying known publishers, marking them in spreadsheets, and calculating publication statistics. It features data processing, API integration, and text analysis capabilities, streamlining the workflow for publisher recognition and analysis.
Technologies: Python, Django, Google Sheets API, Pandas, TextHero
Other Projects
Led several red team projects to evaluate and stress-test Large Language Models (LLMs) for Hindi language processing. Managed a team of annotators tasked with eliciting model failures, specifically targeting products like Google Gemini. This project aimed to identify vulnerabilities, biases, and limitations in Hindi language understanding and generation within cutting-edge AI systems.
Technologies: Large Language Models (LLMs), Hindi Language Datasets, AI Testing Frameworks, Data Annotation Platforms