Professional Prowess

TECHNICAL SKILLS

PROGRAMMING LANGUAGES


DATABASE TOOLS & LANGUAGES


MATHEMATICAL SOLVERS


TOOLS & IDE


WORK PROJECTS

DATA SCIENCE

1) Developed an XGBoost model on Apache Spark using PySpark to predict the audience viewing a network at any time.


The model is trained on historical television viewing data, based on the viewing behavior of television audiences as provided by Nielsen.

Decision-tree-based algorithms work well for such prediction scenarios, drawing on viewing duration, program name, program genre, daypart, program type, and numerous other features. XGBoost has proven itself in industry as a superior prediction algorithm because of its gradient-boosting and regularization capabilities.
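
A minimal sketch of the training step, assuming the Spark integration shipped with recent XGBoost releases; the column names and sample rows are hypothetical:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor  # available in recent xgboost releases

spark = SparkSession.builder.getOrCreate()
# Hypothetical stand-in for the historical Nielsen viewing DataFrame
raw_df = spark.createDataFrame(
    [(30.0, 1, 2, 0, 1500.0), (60.0, 3, 1, 1, 2300.0)],
    ["viewing_duration", "genre_id", "daypart_id", "program_type_id", "audience"],
)

feature_cols = ["viewing_duration", "genre_id", "daypart_id", "program_type_id"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(raw_df)

regressor = SparkXGBRegressor(
    features_col="features",
    label_col="audience",  # audience size to predict
    num_workers=2,         # distribute training across executors
)
model = regressor.fit(train_df)
predictions = model.transform(train_df)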

[Figures: prediction output; prediction accuracy by day]

The accuracy of the predictions improved the program and break schedule, enabling better delivery of promos and advertisements and saving 12% of the advertisement inventory.

2) Developed Lasso and Elastic-Net regularized generalized linear models using GLMNET in R to forecast the audience viewing a network based on demography, location, and frequency of viewing.


This was a research-oriented model developed to find the relationship between impressions and reach; determining this relationship precisely is still an open research problem.

Lasso and ridge regression suited our use case well: tuning the mix of the lasso and ridge penalties optimized the cost function, reduced model complexity, and helped with feature selection.

The major factors in determining the reach from a given impression count were the network, demography group, designated market area, and frequency of viewing by the audience (1+, 3+, 5+, 7+).
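
The project used glmnet in R; the sketch below shows the same idea with scikit-learn's ElasticNetCV as a rough Python analogue, on synthetic stand-in data:

import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for the encoded network / demo / DMA / frequency features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)  # reach

model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 1.0],   # near 0 -> ridge-like, 1 -> pure lasso
    alphas=np.logspace(-4, 1, 30),   # overall penalty strength
    cv=5,
)
model.fit(X, y)
print(model.l1_ratio_, model.alpha_)   # chosen penalty mix and strength
print(np.flatnonzero(model.coef_))     # lasso zeros out weak features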

[Figure: R code for GLMNET]

This model helped us create a promo scheduling model based on reach rather than impressions alone.

3) Developed a natural language processing model to extract and structure the media rights terms from the contracts.


Contracts among production houses, distributors, and networks can run from 15 to 500 pages, filled with numerous conditions and clauses written in dense legal language.

Usually, a business analyst goes through the contracts page by page and creates a summary manually to highlight the key terms. This process was streamlined using Natural Language Processing.

The contracts were ingested in PDF format. Pytesseract was used as an OCR tool to convert each PDF into a text file so it could be processed by code. A similarity-index matrix, together with a list of special keywords, was used to extract the terms from each contract and generate an Excel-sheet summary.
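
A minimal sketch of the OCR and extraction steps, assuming pdf2image for rasterizing pages and a simple keyword scan in place of the full similarity-index matrix; the keyword list is illustrative:

from pdf2image import convert_from_path  # assumed rasterizer for the PDF pages
import pytesseract
import pandas as pd

KEYWORDS = ["exclusivity", "territory", "license period", "holdback"]  # illustrative

def pdf_to_text(path):
    # OCR each page image and concatenate the recognized text
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def extract_terms(text):
    # Keep every sentence that mentions one of the special keywords
    hits = []
    for sentence in text.split("."):
        for kw in KEYWORDS:
            if kw in sentence.lower():
                hits.append((kw, sentence.strip()))
    return hits

terms = extract_terms(pdf_to_text("contract.pdf"))
pd.DataFrame(terms, columns=["term", "clause"]).to_excel("summary.xlsx", index=False)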

[Figure: contract summary]

BIG DATA & DATA ENGINEERING

1) Designed a machine learning pipeline in Databricks for efficient, distributed big-data ingestion and feature creation, used for ensembling models.


The objective was to streamline the pipeline flow and make each part modular.

This pipeline design enables us to find the best performing model and then use it for future predictions.

This design also lets the team develop different parts of the model simultaneously. Using Apache Spark, with clusters sized to the data, significantly reduces the model's runtime.
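
A minimal sketch of the modular structure using pyspark.ml's Pipeline, with hypothetical column names and a gradient-boosted-tree stage standing in for one of the ensembled models:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import GBTRegressor

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [("drama", 45.0, 1200.0), ("news", 30.0, 800.0),
     ("sports", 120.0, 2500.0), ("drama", 60.0, 1400.0)],
    ["genre", "duration", "audience"],
)

# Each stage is modular, so team members can develop stages independently
indexer = StringIndexer(inputCol="genre", outputCol="genre_idx")
assembler = VectorAssembler(inputCols=["genre_idx", "duration"], outputCol="features")
regressor = GBTRegressor(featuresCol="features", labelCol="audience")

pipeline = Pipeline(stages=[indexer, assembler, regressor])
model = pipeline.fit(train_df)
predictions = model.transform(train_df)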

[Figure: pipeline model design]

2) Merged various data sources (Nielsen, Gracenote, FYI, IMDb) from Amazon S3, Excel sheets from clients, and APIs providing JSON files, using Levenshtein distance, text distance, and schedule matching, to create a clean, structured data lake in Amazon S3 used as input for machine learning models and for client reports.


The objective of this project was to unify and leverage unique information provided by different datasets.

As all the datasets describe content telecast on different platforms, schedule matching was done on time with some tolerance to map the movies and series.

To extract more information about series and movies, IMDb was mapped to the metadata using string-matching techniques such as Levenshtein distance and overlap distance on the program and movie names.
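
A minimal sketch of the title matching, with a hand-rolled edit distance (a library such as python-Levenshtein would do the same job) and hypothetical titles:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(title, candidates, max_dist=3):
    # Map a program title to the closest IMDb title within a tolerance
    dist, match = min((levenshtein(title.lower(), c.lower()), c) for c in candidates)
    return match if dist <= max_dist else None

print(best_match("The Offce", ["The Office", "The Voice", "House"]))  # -> The Office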

3) Interpreted the Nielsen data stored in IBM DB2 and prepared the base tables for the scheduling models using PL/SQL.


The scheduling model required complex, derived data as inputs and spanned different platforms, while the available data was comparatively simple and basic.

I therefore developed PL/SQL procedures to generate the data the model required efficiently, so that the calculations consumed little computation and time.

OPERATIONS RESEARCH

1) Created and developed an optimization model using the FICO Xpress solver to improve the networks' existing break logs, shuffling breaks to deliver the impressions agreed in deals with advertisers.


The objective of this model is to fulfill the deals between the advertisers and the networks. Advertisement deals are made in terms of CPM, i.e. cost per thousand impressions.

The advertisement spots carry many constraints, such as brand separation, advertisement separation, piggy-backing, and A-Z positions.

Keeping in mind all the constraints on each advertisement, and with the objective of increasing impression delivery given each deal's end date and its pending impressions, the model suggests moves within the existing advertisement schedule that improve impression delivery for every deal.
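 
The production model ran on FICO Xpress; the toy sketch below expresses the same assignment idea with the open-source PuLP library, with all spots, breaks, and impression forecasts hypothetical:

import pulp

spots = ["s1", "s2", "s3"]   # ad spots pending on a deal (hypothetical)
breaks = ["b1", "b2"]        # candidate breaks
# Forecast impressions if a spot airs in a given break
imp = {("s1", "b1"): 90, ("s1", "b2"): 60,
       ("s2", "b1"): 50, ("s2", "b2"): 80,
       ("s3", "b1"): 40, ("s3", "b2"): 70}

x = pulp.LpVariable.dicts("assign", imp.keys(), cat="Binary")
prob = pulp.LpProblem("break_shuffle", pulp.LpMaximize)
prob += pulp.lpSum(imp[k] * x[k] for k in imp)        # maximize delivered impressions
for s in spots:                                       # each spot airs exactly once
    prob += pulp.lpSum(x[(s, b)] for b in breaks) == 1
for b in breaks:                                      # break capacity: at most 2 spots
    prob += pulp.lpSum(x[(s, b)] for s in spots) <= 2
for b in breaks:                                      # brand separation: s1 and s2
    prob += x[("s1", b)] + x[("s2", b)] <= 1          # share a brand (illustrative)
prob.solve()
print({k: x[k].value() for k in imp})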

[Figures: break log with suggested moves; improvement in impressions for advertisers]

2) Designed and developed a scheduling model in Python to get the optimal schedule of promos using the CPLEX solver.


The purpose of this model was to create promo plans for awards, shows, and movies across all platforms (local cable, national cable, broadcast) for target audiences of different demographics, with flexible objectives.

The objectives were:
1. Maximize the audience reached for a target budget.
2. Minimize the budget for a target audience.

From the available data, I designed the scheduling problem shown in the images below.

[Figures: sets and parameters of the dataset; decision variables and objective functions; constraints]

It also included calculations to skew the number of spots (promotional videos) toward the release date.
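
A minimal sketch of the first objective using the docplex API for CPLEX (a CPLEX installation is required); the slots, audience figures, and costs are hypothetical, and the second objective would swap the budget constraint and the objective:

from docplex.mp.model import Model

slots = ["mon_prime", "tue_late", "wed_prime"]                  # hypothetical promo slots
audience = {"mon_prime": 120, "tue_late": 40, "wed_prime": 90}  # expected viewers (000s)
cost = {"mon_prime": 8, "tue_late": 2, "wed_prime": 6}          # cost per spot ($000s)
budget = 30

m = Model(name="promo_schedule")
x = {s: m.integer_var(lb=0, ub=10, name=f"spots_{s}") for s in slots}

# Objective 1: maximize audience for a target budget
m.add_constraint(m.sum(cost[s] * x[s] for s in slots) <= budget)
m.maximize(m.sum(audience[s] * x[s] for s in slots))

solution = m.solve()
print({s: solution.get_value(x[s]) for s in slots})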

[Figure: scheduling output showing the skew]

ACADEMIC PROJECTS

OPERATIONS RESEARCH & ALGORITHMS

Enhancement of the University at Buffalo Bus (STAMPEDE) Schedule (2018)


This project was part of my academic curriculum, done as an individual problem under the guidance of Professor Jamie Kang. She introduced me to real-world problem solving and how to approach it, starting from the problem statement.

This project was done in two phases:
1) Literature review: studying and analyzing research papers to find the best way to solve the problem statement with the data available.
2) Implementation: applying the algorithm from the research papers best suited to our data, then generating the results and conclusions.

The UB transportation data was provided by Passio Technologies for Fall 2016 and Spring 2017. Analyses were performed on bus stops, routes, and the number of passengers at any point in time. The first objective was to determine each passenger's destination based on their origin.

[Figures: trip chain model; model formulation]

The algorithm is based on two primary assumptions:
1) A high percentage of passengers begin their next trip at the destination station of their previous trip.
2) A high percentage of passengers end their last trip of the day at the station where they began their first trip of the day.

After the data was ingested, Python was used to implement the algorithm; a sketch follows, and some of the results are shown below.
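
A minimal sketch of the destination inference under the two assumptions above, with hypothetical rider IDs and stop names:

from collections import defaultdict

# Each rider's boardings in time order: (rider_id, boarding_stop)
taps = [("r1", "Flint"), ("r1", "Capen"), ("r1", "Flint"),
        ("r2", "Greiner"), ("r2", "Hadley")]

by_rider = defaultdict(list)
for rider, stop in taps:
    by_rider[rider].append(stop)

od_pairs = []
for rider, stops in by_rider.items():
    for i, origin in enumerate(stops):
        # Assumption 1: this trip ends where the next boarding begins;
        # Assumption 2: the last trip returns to the day's first boarding stop
        dest = stops[i + 1] if i + 1 < len(stops) else stops[0]
        od_pairs.append((origin, dest))

print(od_pairs)  # inferred origin-destination pairs feed the schedule optimization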

[Figure: number of passengers traveling from origin to destination]

The schedule was then optimized based on the number of people traveling from each origin to each destination over time.

Traffic Network Flow Analysis of Buffalo (2017)


This project is based on a research paper on Bangkok traffic data.
The algorithm from the paper was implemented in Python to calculate the maximum flow and minimum cut of the road network and thereby predict possible traffic jams.
For demonstration purposes, we created nodes and traffic routes for Buffalo city traffic and displayed the minimum cut and maximum flow.
The algorithm achieved an accuracy of 90%.
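
A minimal sketch of the max-flow/min-cut computation using networkx, on a toy network with hypothetical Buffalo locations and capacities:

import networkx as nx

# Toy road network: capacities are vehicles per hour (hypothetical)
G = nx.DiGraph()
G.add_edge("Downtown", "Elmwood", capacity=400)
G.add_edge("Downtown", "Allentown", capacity=300)
G.add_edge("Elmwood", "Allentown", capacity=100)
G.add_edge("Elmwood", "Amherst", capacity=250)
G.add_edge("Allentown", "Amherst", capacity=350)

flow_value, flow_dict = nx.maximum_flow(G, "Downtown", "Amherst")
cut_value, (reachable, unreachable) = nx.minimum_cut(G, "Downtown", "Amherst")

# Edges crossing the minimum cut are the bottlenecks, i.e. likely jam locations
print(flow_value, reachable, unreachable)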

[Figure: code flow]

The GitHub link to the code is here.

Traveling Salesman routing problem for UPS using nearest neighbor, MILP & Simulated Annealing (2017)

Details

The objective of this project was to understand how different algorithms perform on the same problem and which is best in terms of efficiency and accuracy. We found that the mixed-integer linear programming formulation performed best; a sketch of the nearest-neighbor baseline follows.
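
Of the three approaches, the nearest-neighbor heuristic is the simplest to sketch; the stop coordinates are hypothetical:

import math

stops = {"depot": (0, 0), "A": (2, 1), "B": (5, 4), "C": (1, 6)}  # hypothetical stops

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbor(start="depot"):
    # Greedy heuristic: always visit the closest unvisited stop next
    route, unvisited = [start], set(stops) - {start}
    while unvisited:
        nxt = min(unvisited, key=lambda s: dist(stops[route[-1]], stops[s]))
        route.append(nxt)
        unvisited.remove(nxt)
    route.append(start)  # return to the depot
    return route

print(nearest_neighbor())  # fast but not guaranteed optimal, unlike the MILP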

To see the code and full report click on the GitHub link here.

HEALTHCARE DATA ANALYSIS & STATISTICS

Healthcare Transition for Patients Discharged from Hospitals to reduce readmissions. (2017)


Hospitalization accounts for about one-third of annual health spending ($2 trillion) in the U.S. One out of nine hospitalizations is categorized as a readmission (a hospital admission within 30 days of an original admission). Proper transition care of patients, aided by information technology, could prevent and reduce costly readmissions.

The model was run with 8,000 entries as training data and 3,734 as test data, and achieved an accuracy of 75.05%.
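
The write-up does not name the classifier, so the sketch below assumes a logistic regression and synthetic stand-in features, keeping the 8,000/3,734 split:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for patient features and 30-day readmission labels
rng = np.random.default_rng(0)
X = rng.normal(size=(11734, 10))
y = (X[:, 0] + rng.normal(size=11734) > 0).astype(int)

X_train, X_test = X[:8000], X[8000:]   # 8,000 train / 3,734 test, as in the project
y_train, y_test = y[:8000], y[8000:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))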

Analysis of hospital revenue from Medicare patients and comparison of average charges for popular medical services in the states of New York and Ohio. (2017)


Medicare scores give a basic idea of how much a particular hospital spends under Medicare. The states of New York and Ohio were analyzed on the basis of these scores to rate their performance.

The study shows that Ohio performs better than New York in this respect, even though it has comparatively fewer hospitals.
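
A minimal sketch of the state comparison, assuming a CMS-style table of average charges per medical service; the column names and figures are hypothetical:

import pandas as pd

# Hypothetical rows in the style of CMS inpatient charge data
df = pd.DataFrame({
    "state": ["NY", "NY", "OH", "OH"],
    "service": ["HEART FAILURE", "SEPSIS", "HEART FAILURE", "SEPSIS"],
    "avg_charge": [42000, 55000, 31000, 47000],
})

# Average charge per service, side by side for the two states
pivot = df.pivot_table(index="service", columns="state", values="avg_charge")
pivot["NY_minus_OH"] = pivot["NY"] - pivot["OH"]
print(pivot)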

To download the full report click here.

PROJECT MANAGEMENT

Project Management Case Study: Optimized gas sampler tips for a packed bed. (2017)


Gas samplers are inserted into adsorbent beds to collect gas samples from different locations within the bed.
To reduce the resistance faced by the sampler, its tip needs to be optimized (conical tips, stepped conical tips, etc.).
The objective of this project was to investigate the effect of different tip designs on the amount of force required.

To download the full report click here.

MANUFACTURING OPTIMIZATION

Determine an optimal production plan for an actuator assembly manufacturing cell at Sharp Tooling Company (2017)


Sharp Tooling Company is a specialty machine shop in Buffalo, NY that offers dependable, personalized service to each of its customers.

The production of the actuator is sluggish, inefficient, and expensive.

The total production time and cost need to be optimized to meet production and sales demand.

Sharp also faces heavy competition: the team is helping them compete against the most popular rowing machine currently on the market, as well as various other advanced simulators and rowing tanks.

To download the full report click here.

DESIGN OF EXPERIMENT

Analysis of the variation in cold crack resistance of a medical device to make the device stiff but not rigid (2016)


A specialty medical device was designed, but it had trouble with cracking at extreme environmental temperatures. The proprietary device must be stiff (but not rigid) under normal conditions of use, and must remain stiff without cracking in cold environmental conditions.
The device is tested by mechanically applying pressure to a prepared specimen after both the specimen and the pressure device have been conditioned in a carefully controlled cold cabinet.

A non-randomised block design with a 2^(15-11) fractional factorial structure was used, and the run order was randomised within each block.

It was observed that 6 of the 15 factors were non-significant. Recommendations were made to eliminate those factors and redesign the experiment to obtain more information about the interactions between the remaining factors.
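
As an illustration of how such a design can be generated, the sketch below uses pyDOE2's fracfact on a smaller 2^(5-2) analogue; the project's actual software and generators are not stated, so this is an assumption:

from pyDOE2 import fracfact

# 2^(5-2) analogue of the project's 2^(15-11): two factors are generated
# from interactions of the three base factors (a, b, c)
design = fracfact("a b c ab ac")
print(design)  # 8 runs x 5 factors; -1/+1 encode low/high levels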

DATA ANALYSIS OF LIVE DATA

Data analysis of live World Bank API data to create a user-driven model displaying economic parameters, such as GDP and GNI, for each country. (2016)


The objective of this project was to gain hands-on experience with live data from various sources, each sharing data through its own API. Another focus was to retrieve the data, clean and analyze it, and then visualize it effectively.

We used the World Bank API as the data source. When users run the code, they are asked to list the countries whose economic conditions they want to see. The output also plots the income category each country falls in.
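
A minimal sketch of the retrieval step against the World Bank API (v2), where NY.GDP.MKTP.CD is the indicator code for GDP in current US$; the helper name is ours:

import requests

def gdp_series(country_code, start=2015, end=2020):
    # World Bank API v2 returns a [metadata, data] JSON pair
    url = (f"https://api.worldbank.org/v2/country/{country_code}"
           f"/indicator/NY.GDP.MKTP.CD?date={start}:{end}&format=json")
    meta, data = requests.get(url, timeout=10).json()
    return {row["date"]: row["value"] for row in data}

print(gdp_series("US"))  # year -> GDP for the requested country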