Deploying machine learning to automate online ad filtering
Never satisfied with the status quo, we’re experimenting with machine learning projects to further boost our world-class ad-filtering technology. The team behind Project Moonshot wants to automate ad filtering so that we can effectively eliminate even the stickiest of disruptive online ads.
The internet needs online advertising, which helps fund quality content. However, advertising should not come at the expense of the user experience. At eyeo, we are dedicated to finding innovative ways to foster an open internet and provide tools for our tech partners to enhance their own products, giving them a competitive advantage that keeps users and attracts new ones.
We are using machine learning for the first time to automate online ad detection, so that our technology remains best in class and we can continue offering reliable, accurate and efficient ad filtering for our partners and their users.
The following is part of a series detailing our journey to automate online ad filtering. (You can check out Part 1 of Project Moonshot on the eyeo blog to find out how we came up with the idea.) Let’s dive into the methods we use to collect data and deploy machine learning.
Machine learning (ML) is a branch of artificial intelligence that automates data analysis: algorithms are trained on raw data to identify patterns, and those patterns are then used to make predictions that solve practical problems. To implement ML in our ad filtering, and before collecting any data, we first defined the problem: aggressive and intrusive online ads sometimes circumvent our efforts to filter them out. With the problem defined, we could specify the predictions (where these annoying ads would appear) and the data we would need to make those predictions.
Data, data, data
Machine learning is a hungry beast that needs tons of data. But the data needs to be clean, otherwise the machine is not going to identify real patterns. Just as people can’t make sound decisions with wrong information, a model can’t make sound predictions from bad data.
With machine learning, the more data, the better. So we used a Chromium-based application to crawl thousands of websites and collect as much data as possible for training the machine learning models. We were interested in website structure (DOM/HTML) and Cascading Style Sheets (CSS), basically the entire website without the images and videos. HTML and CSS together describe content, content layout and styling, so they cover a lot of ground for the machine learning algorithms to predict the type of content (such as advertisements) they are looking at. Once the data was collected, we stored it on our servers for later analysis.
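To make the "structure and styling, no images or videos" idea concrete, here is a minimal sketch of how one fetched page could be reduced to such a record. It is not our crawler: it assumes the HTML has already been downloaded as a string, it uses only Python's standard library, and the names PageSnapshot, snapshot and MEDIA_TAGS are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "images and videos" is our guess for this sketch.
MEDIA_TAGS = {"img", "picture", "video", "audio", "source", "track"}


class PageSnapshot(HTMLParser):
    """Collects element tags/attributes and inline CSS, skipping media tags."""

    def __init__(self):
        super().__init__()
        self.elements = []      # (tag, attribute dict) in document order
        self.css = []           # text found inside <style> blocks
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            return              # leave media elements out of the record
        if tag == "style":
            self._in_style = True
        self.elements.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only keep text that belongs to a <style> block.
        if self._in_style and data.strip():
            self.css.append(data.strip())


def snapshot(html_text):
    """Return the structure and styling of one page as a plain dict."""
    parser = PageSnapshot()
    parser.feed(html_text)
    parser.close()
    return {"elements": parser.elements, "css": "\n".join(parser.css)}


page = (
    '<html><head><style>.ad { color: red; }</style></head>'
    '<body><div class="ad" style="top:0"><img src="x.png"><p>Hi</p></div></body></html>'
)
snap = snapshot(page)
print([tag for tag, _ in snap["elements"]])
print(snap["css"])
```

A record like this keeps class names, inline styles and stylesheet text, which is the kind of signal the later models can learn from, while staying far smaller than a full page capture.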
Detecting sneaky ads
Now it was time to make the predictions of where the ads would appear on a website. We experimented with the best approaches for automating ad detection by training various machine learning models with the data we collected. Two models are working successfully to varying degrees: natural language processing (NLP) and graph-based.
The NLP approach first finds the textual elements on a website and then works out which texts belong to ads and which do not. We trained an NLP-based model on contrasting example texts: organic content versus ad content. Our prototype is currently yielding good results.
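As an illustration of telling ad text from organic text, here is a toy bag-of-words Naive Bayes classifier. This is a stand-in technique, not a description of our actual model, and the six training sentences are invented for this example. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented examples, for illustration only.
TRAINING = [
    ("Buy now and save 50% on all shoes", "ad"),
    ("Limited offer: click here for a free trial", "ad"),
    ("Sponsored: the best deals on flights", "ad"),
    ("The city council approved the new budget on Tuesday", "organic"),
    ("Researchers published a study on sleep and memory", "organic"),
    ("The recipe calls for two cups of flour", "organic"),
]
texts, labels = zip(*TRAINING)
clf = NaiveBayesTextClassifier().fit(texts, labels)

print(clf.predict("Click here to save on flights"))
print(clf.predict("The council published the budget study"))
```

A real system would need far more data and a stronger model, but the shape is the same: extract text, turn it into features, and score it against what ad and non-ad text usually look like.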
For the graph-based model, we used an algorithm to transform a website’s structure into a graph and fed that graph to the model to predict where ads would appear based on layout, positioning and styling. Our hypothesis was that this metadata alone would be enough to make proper predictions. Turns out we were right. This method has been very successful. We’re considering taking this a step further and developing a hybrid of the NLP and graph-based models.
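One simple way to picture "transforming a website's structure into a graph" is to treat every element as a node and every parent-child relationship as an edge. The sketch below does that with Python's standard library. It is an assumption-laden illustration: the post does not specify the actual algorithm or node features, and DomGraphBuilder, dom_to_graph and the chosen features (tag, depth, classes, inline style) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Builds a node list and a parent -> child edge list from HTML."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # index in this list is the node id
        self.edges = []    # (parent id, child id)
        self._stack = []   # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        node_id = len(self.nodes)
        self.nodes.append({
            "tag": tag,
            "depth": len(self._stack),
            "classes": (attr.get("class") or "").split(),
            "style": attr.get("style") or "",
        })
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag name;
        # stray end tags with no matching open element are ignored.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break


def dom_to_graph(html_text):
    """Return (nodes, edges) for the given HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.nodes, builder.edges


html_text = (
    '<div class="sidebar"><p>News</p>'
    '<div class="ad banner" style="position:fixed"><a href="#">Buy</a></div></div>'
)
nodes, edges = dom_to_graph(html_text)
print(edges)
print(nodes[2])
```

Note that no text content is stored on the nodes, only layout and styling metadata, which mirrors the hypothesis described above. A graph-based model would then take node features and edges like these as input.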
As the available tools and technology advance, we are continuing to experiment with new ways to optimize our models and innovate with existing technology and techniques.
One giant leap
We took something tried and tested (the model architecture and machine learning methods) and deployed it in a brand new way. In short, the way we collect the data and train for our specific use case has never been done commercially before. Going where no other company has ever gone allows us to offer ad-filtering technology that no one else can, and to be truly unique and revolutionary on the market. These are the first small steps for (automating) ad filtering, but one giant leap for the online world.
We are excited to announce that Machine Learning Engineering Lead, Humera Minhas, and AI Product Manager, Parinitha Hirehal, will be hosting an in-depth session about the journey to automate online ad filtering at the 2022 We Are Developers World Congress, June 14-15 in Berlin. Come back for Part 3 of the Project Moonshot series to find out about the challenges we faced and how we overcame them (or in some cases simply had to change the rocket’s trajectory).