We pull a variety of data to train two separate models: a collaborative filter built on user-product purchase data (ALS on Apache Spark) and a content-based system built on product metadata. Depending on the age of the product, we route between the two models. Next, we calculate cosine similarities between the input product and the candidate recommendations. Finally, we rank products based on similarity and other factors such as cross-category exposure.
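As a rough sketch of how these pieces fit together (the age threshold, data structures, and function below are illustrative, not our production code):

```python
import numpy as np

NEW_PRODUCT_AGE_DAYS = 30  # assumed cutoff for routing "young" products

def recommend(product_id, product_age_days, cf_vectors, content_vectors, top_k=10):
    """Return the top_k most similar product ids for a given input product."""
    # Route: products too new to have purchase history use the content-based
    # vectors; everything else uses the collaborative-filtering (ALS) vectors.
    vectors = content_vectors if product_age_days < NEW_PRODUCT_AGE_DAYS else cf_vectors
    query = vectors[product_id]

    # Cosine similarity between the input product and each candidate.
    scores = {}
    for candidate_id, vec in vectors.items():
        if candidate_id == product_id:
            continue
        scores[candidate_id] = float(
            np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        )

    # Rank by similarity; business rules such as cross-category exposure are
    # applied as a re-ranking step on top of this ordering.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```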
Technology Stack
Brandless has a relatively small (and highly utilized) engineering team, so we needed a system that could be built and maintained by the data team. We had a fairly straightforward list of requirements:
- Easy sandbox to test a variety of models and complex pre- and post-processing.
- Model-management system to keep track of different model versions and iterations.
- Easy, engineering-minimal deployment capabilities.
- Simple A/B testing frameworks.
- Open-source software that we could contribute back to in the future (if possible).
After exploring a variety of different systems and tools, we landed on the following toolkit:
- Databricks Unified Analytics Platform: To develop, iterate, and test custom-built models.
- MLflow: To log models and metadata, compare performance, and deploy to production.
- Amazon SageMaker: To host production models and run A/B tests on different models.
The solution feeds raw data from Amazon Redshift into the Databricks Unified Analytics Platform, where we train the recommendation models and develop custom pre- and post-processing logic. We use Databricks notebooks to collaborate in real time on model development, and we also perform a bit of offline testing within the Databricks platform.
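As an illustration of that first step, the snippet below reads a purchase table from Redshift into a Spark DataFrame inside a Databricks notebook (where `spark` is predefined); the connector options, table name, and S3 temp path are placeholders rather than our actual configuration.

```python
# Load raw purchase data from Redshift for model training; the connector
# unloads data to S3 under the hood before Spark reads it.
orders = (
    spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>")
    .option("dbtable", "analytics.user_product_purchases")   # hypothetical table
    .option("tempdir", "s3a://<bucket>/redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)
```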
Next, we push the trained models and their metadata to our MLflow tracking server, which acts as the source of truth for our models. MLflow stores the model hyperparameters and metadata as well as the actual model artifacts. Once we have two different models sitting on the tracking server, we use the MLflow deploy commands to push them into Amazon SageMaker.
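A rough sketch of the logging and deploy steps is below. The experiment name, hyperparameter values, app name, and region are placeholders, `als_model` is assumed to be the Spark model trained in the previous step, and the exact deploy API depends on the MLflow version in use.

```python
import mlflow
import mlflow.spark
import mlflow.sagemaker

mlflow.set_tracking_uri("https://<our-mlflow-tracking-server>")
mlflow.set_experiment("/recommendations/als")               # placeholder experiment

with mlflow.start_run() as run:
    # Hyperparameters and offline metrics logged alongside the model artifact.
    mlflow.log_param("rank", 10)                            # example ALS settings
    mlflow.log_param("regParam", 0.1)
    mlflow.log_metric("holdout_rmse", 0.87)                 # example offline metric
    mlflow.spark.log_model(als_model, artifact_path="model")  # als_model: trained earlier

# Push the logged model to a SageMaker endpoint (older MLflow releases expose this
# as mlflow.sagemaker.deploy / the `mlflow sagemaker deploy` CLI command).
mlflow.sagemaker.deploy(
    app_name="product-recommender",                         # placeholder endpoint name
    model_uri=f"runs:/{run.info.run_id}/model",
    region_name="us-west-2",
    mode="replace",
)
```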
Our environment is stored in a Docker container, and the custom-packaged model pipeline is sent to Amazon SageMaker's inference platform. This final step also enables us to run multiple models in parallel, as shown below.
Once we have pushed a model to production with Amazon SageMaker, we can use the A/B testing functionality to iterate through different models and understand which version performs best. We push two different model variants to a single endpoint and use the UpdateEndpointWeightsAndCapacities API to assign a specific traffic weight to each variant. Users on Brandless.com are then randomly assigned a model variant. We record which variant each user receives, as well as the actions taken after the recommendations are displayed to the customer. Finally, we calculate model performance for each variant.
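A minimal sketch of that traffic split using the boto3 SageMaker client is below; the endpoint name, variant names, and weights are placeholders rather than our actual configuration.

```python
import boto3

# Split traffic between two variants hosted on one SageMaker endpoint.
# Weights are relative: 0.9 / 0.1 sends roughly 90% of requests to the control.
sm = boto3.client("sagemaker", region_name="us-west-2")

sm.update_endpoint_weights_and_capacities(
    EndpointName="product-recommender",            # placeholder endpoint name
    DesiredWeightsAndCapacities=[
        {"VariantName": "recs-control", "DesiredWeight": 0.9},
        {"VariantName": "recs-candidate", "DesiredWeight": 0.1},
    ],
)
```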
Discussion
In the time Brandless has been using the Databricks-MLflow-Amazon SageMaker combination, the deployment process has evolved and become more efficient.
It originated as a process that required manual checks as we trained the model, pushed to MLflow, and deployed to Amazon SageMaker. We now have a one-click process that automatically checks for errors at each step. We run this process roughly once a week.
Since our initial deployment, we have tested and iterated through approximately 10 different models in production. These versions have different model hyperparameters, utilize increasingly complex post-processing, or combine multiple models in a hierarchy.
We measure online performance by calculating the percentage of Brandless.com visitors who interact with our recommended product carousels. We have seen performance increases from all but one of these model versions, with an estimated 15% improvement overall compared to our original model.
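For illustration only, the interaction-rate metric boils down to something like the following; the data is a toy example and the column names are hypothetical.

```python
import pandas as pd

# One row per visitor, flagging whether they interacted with a recommendation carousel.
events = pd.DataFrame({
    "visitor_id": [1, 2, 3, 4, 5, 6],
    "variant": ["control", "control", "control", "candidate", "candidate", "candidate"],
    "clicked_carousel": [1, 0, 1, 1, 1, 0],
})

# Fraction of visitors who interacted with the carousel, per model variant.
interaction_rate = events.groupby("variant")["clicked_carousel"].mean()
print(interaction_rate)
```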
The team has also used the Databricks-MLflow-Amazon SageMaker combination to speed up development of other ML models. These use cases range from customer service improvements to logistics optimization, and all of the models follow the same process.
Challenges and Learnings
We ran into a couple of challenges along the way! Below, we outline a few of these and what we learned:
- Leave time for DevOps and bug fixes: Whether we were setting up proper AWS permissions or debugging an issue while alpha testing MLflow, we needed to leave more time for general debugging the first time we set up the system. Each of the fixes and changes we made applies to the system as a whole, so new models generally take less time from start to finish.
- Optimize for latency: When we first deployed our model, we saw latency above 500ms due to complex Spark operations, which was too slow for our use cases on Brandless.com. To handle this, we built a system that pre-computes model outputs, ultimately bringing latency down below 100ms (a sketch of this approach follows this list). We now consider and plan for latency at the beginning of model development.
- Dependency management: Because our process spans a variety of environments (multiple systems executing code, different Spark clusters, etc.), we occasionally run into problems managing our library and dataset dependencies. To solve this, we wrap all of our custom code into a Python egg and upload it to each system (a minimal packaging sketch also follows this list). This creates an extra step and can cause confusion across egg versions, so we hope to move to a fully integrated container system in the future.
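A minimal sketch of the pre-computation approach mentioned in the latency item above; the function names and the in-memory cache are assumptions standing in for our actual scoring job and storage layer.

```python
def precompute_recommendations(product_ids, recommend_fn, top_k=10):
    """Run the (slow) model offline for every product and cache the results."""
    return {pid: recommend_fn(pid, top_k=top_k) for pid in product_ids}

def serve(product_id, cache, fallback=()):
    # At request time the endpoint only performs a key-value lookup, which is
    # what brought latency from ~500ms down to under 100ms in our case.
    return cache.get(product_id, list(fallback))
```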
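And a minimal sketch of the egg packaging step from the dependency-management item; the package name and version are hypothetical.

```python
# setup.py: built with `python setup.py bdist_egg` and uploaded to each system.
from setuptools import setup, find_packages

setup(
    name="brandless_ml_utils",   # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
)
```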