Wed 09 March 2022

Domain Spoofing on Gannett Sites

Category: adtech

Tags: domain-spoofing ads.txt

This research was done in collaboration with Krzysztof Franaszek. Read his post here.

Domain spoofing — where ad inventory is misrepresented as being from a different site — is often talked about as a solved problem by adtech insiders. Despite this, USA Today and hundreds of local newspapers owned by Gannett were sending spoofed bid requests to multiple ad exchanges for over 9 months.

The various companies involved didn’t appear to notice this behaviour — or didn’t act to resolve it if they did — until late last week when it suddenly stopped.

Tweet reads: "You ask what prevents an intermediary account holder from spoofing a site? This is table stakes on the web for anti-fraud vendors. Web spoofing effectively ended with the introduction of ads.txt. In 2018, almost all spoofing moved to in-app due to a lack of ads.txt (at the time)." — A tweet from a senior engineer at a major DSP.

The primary weapon against domain spoofing is the ads.txt standard from IAB Tech Lab. This allows a site to authorize specific ad accounts to sell its inventory, preventing other accounts from spoofing it. But ads.txt has its limitations, and one is that it only protects against domain spoofing by unauthorized accounts. What if the spoofing is coming from inside the house?

Gannett owns a large network of newspaper sites that share a set of DIRECT ad accounts — everything from USA Today to the Wisconsin Rapids Tribune. As they’re the “publisher” for all these sites, technically they’re allowed to do this — the accounts aren’t considered mislabelled.

However, the shared accounts are authorized to sell inventory for all the sites, so they can theoretically also spoof inventory for any of the sites without failing ads.txt validations. This kind of “authorized spoofing” is exactly what appeared to be happening until Friday the 4th of March 2022.
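To make the loophole concrete, here is a minimal Python sketch of the kind of ads.txt check a buyer might run. The domains, account IDs, and function names are all hypothetical, and real validators handle more of the spec than this. The point is only that a check keyed on the declared domain and the seller account cannot tell which of the sharing sites actually served the page.

```python
def parse_ads_txt(text):
    """Return a set of (exchange_domain, account_id, relationship) records."""
    records = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3:  # skips blank lines and variable lines
            records.add((fields[0].lower(), fields[1], fields[2].upper()))
    return records


def is_authorized(ads_txt, exchange, account_id):
    """True if this ads.txt lists the given seller account on the exchange."""
    return any(
        rec[0] == exchange.lower() and rec[1] == account_id
        for rec in parse_ads_txt(ads_txt)
    )


# Hypothetical ads.txt files: the first two sites share one DIRECT account.
ADS_TXT = {
    "bignews.example": "exchange.example, 1001, DIRECT\n",
    "localpaper.example": "exchange.example, 1001, DIRECT\n",
    "unrelated.example": "exchange.example, 2002, DIRECT\n",
}


def validate_bid_request(declared_domain, exchange, account_id):
    """Check the seller account against the ads.txt of the DECLARED domain."""
    return is_authorized(ADS_TXT[declared_domain], exchange, account_id)
```

In this sketch a page actually served by localpaper.example but declared as bignews.example still validates, because account 1001 is listed in both files. Declaring unrelated.example with the same account fails, which is the only case ads.txt is designed to catch.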

I’m not going to speculate as to whether this spoofing was deliberate on Gannett’s part, or if they derived any benefit from it. Instead, I want to focus on how the spoofing worked, and the implications of it.

Finding the Spoofing

Gannett sites use header bidding to run ad auctions on different exchanges when you load a page. Scripts running in your browser make the requests that trigger those auctions, so it’s possible to inspect the requests your browser sends to see how Gannett is representing their ad inventory. In September 2021 I noticed that the domains and page URLs included in some of those requests didn’t seem to match the actual page being loaded.

Let’s look at the Detroit Free Press, a local newspaper owned by Gannett. I loaded this local news article about mistreated dogs on the 17th of February 2022.

Screenshot of a Detroit Free Press article. Title reads: "More than 160 dogs found on blighted property in northern Michigan". — The actual article on the Detroit Free Press.

Screenshot of a USA Today article. Title reads: "Purdy throws for 3 TDs, No. 14 Iowa State routs UNLV 48-3". — The spoofed article on USA Today.

Header bidding requests to multiple ad exchanges, such as Pubmatic and IndexExchange, reported the page as a USA Today college football article. That looks a whole lot like domain spoofing.

Screenshot of Chrome developer tools on a freep.com page, showing a request to hbopenbid.pubmatic.com. The "page" field in the request payload is highlighted --- its value is the URL of an article on usatoday.com. — A header bidding request to Pubmatic.

Screenshot of Chrome developer tools on a freep.com page, showing a request to htlb.casalemedia.com. A JSON object in the request payload is highlighted --- it contains a "page" field where value is the URL of an article on usatoday.com. — A header bidding request to IndexExchange.

A search of the page’s source code shows where the spoofed URL is coming from: a minified inline script. With the help of a JavaScript formatter, we can more easily see the offending code.

Screenshot of the source code for the Detroit Free Press article. A usatoday.com URL is highlighted. — Source code of the actual article.

Screenshot of a formatted version of a section of the article source code. An object called "pbjs" is being constructed. A "setConfig" function is called, passing in a large data object that includes the usatoday.com article URL. — Part of the inline script, formatted.

pbjs stands for Prebid.js, a JavaScript library that’s commonly used to implement header bidding. The a.setConfig({...}) method call configures Prebid, and the ortb2 configuration field describes the inventory being sold, using the OpenRTB 2.5 format.

In OpenRTB, the site field should contain data about the page the ad will be shown on. So why was Gannett providing data about a totally different page?
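The check I was effectively doing by eye can be sketched in a few lines of Python. This is an illustration rather than the tooling used for this research: the function names and example URLs are hypothetical, and the ortb2 value is assumed to have already been extracted from the page as a plain dictionary.

```python
from urllib.parse import urlsplit


def normalize_host(url_or_host):
    """Lower-case a hostname (or the hostname of a URL) and drop a leading 'www.'."""
    host = urlsplit(url_or_host).hostname or url_or_host
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def site_mismatches(actual_page_url, ortb2):
    """List which OpenRTB site fields disagree with the page actually loaded."""
    site = ortb2.get("site", {})
    problems = []
    declared_domain = site.get("domain")
    if declared_domain and normalize_host(declared_domain) != normalize_host(actual_page_url):
        problems.append("domain")
    declared_page = site.get("page")
    if declared_page and declared_page != actual_page_url:
        problems.append("page")
    return problems


# Hypothetical example: a freep.com page whose config describes a usatoday.com page.
actual = "https://www.freep.com/story/news/example-article/"
spoofed_config = {
    "site": {
        "domain": "usatoday.com",
        "page": "https://www.usatoday.com/story/sports/example-article/",
    }
}
honest_config = {"site": {"domain": "freep.com", "page": actual}}
```

With these inputs, site_mismatches(actual, spoofed_config) reports both the domain and the page, while site_mismatches(actual, honest_config) reports nothing.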

Further, it wasn’t only the domain and page being spoofed. Data about the article section and subsection, keywords, and brand safety was also being provided based on the spoofed page, not the actual page. This spoofed data was all showing up in requests to affected exchanges.

Screenshot of documentation for the site field in OpenRTB 2.5. Description of the domain field: "Domain of the site". Description of the page field: "URL of the page where the impression will be shown". — Definition of fields in the `site` object from the OpenRTB 2.5 spec.

The mapping between actual pages and spoofed pages in the Prebid config seemed to be cached. Loading a given page multiple times within a short period, or from different browsers or computers, usually yielded the same spoofed page. The mappings did eventually change after some period of time — presumably after the page expired from the cache.

The spoofed data didn’t show up in requests to every exchange. Some exchanges received only spoofed page data, while others received actual page data or a mixture of the two.

That’s because each exchange has its own Prebid adapter that constructs requests from available data. It’s up to these adapters to decide how to use the ortb2 field from the Prebid config, and some didn’t end up incorporating the spoofed data.
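As a rough illustration of why results differed by exchange, the Python sketch below models three made-up adapters that build a request from the same inputs. None of this is real Prebid.js code (actual adapters are written in JavaScript and are far more involved), and the adapter names, payload fields, and URLs are assumptions chosen for the example.

```python
def adapter_config_first(ortb2, location_url):
    """Hypothetical adapter that trusts the ortb2 config over the browser location."""
    site = ortb2.get("site", {})
    return {"page": site.get("page", location_url)}


def adapter_location_only(ortb2, location_url):
    """Hypothetical adapter that ignores ortb2 and reports the browser location."""
    return {"page": location_url}


def adapter_mixed(ortb2, location_url):
    """Hypothetical adapter that takes the page from the browser but keywords from ortb2."""
    site = ortb2.get("site", {})
    return {"page": location_url, "keywords": site.get("keywords", "")}


ADAPTERS = [adapter_config_first, adapter_location_only, adapter_mixed]


def build_requests(ortb2, location_url):
    """Build one request payload per adapter from the same page and config."""
    return {adapter.__name__: adapter(ortb2, location_url) for adapter in ADAPTERS}


# Hypothetical inputs: the page really loaded, and a config describing another page.
actual_url = "https://www.freep.com/story/news/example-article/"
config = {
    "site": {
        "page": "https://www.usatoday.com/story/sports/example-article/",
        "keywords": "college football",
    }
}
requests_by_adapter = build_requests(config, actual_url)
```

Here the first adapter sends only the spoofed page, the second sends only the actual page, and the third sends the actual page alongside keywords taken from the spoofed one.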

Similarly, exchanges that were sent spoofed data may not have passed it on to advertisers. To test this, Krzysztof Franaszek of Adalytics examined ads served in response to header bidding requests that included spoofed data. He found evidence suggesting advertisers were bidding on spoofed ad inventory via multiple exchanges.

Scoping the Problem

Based on archived versions of the USA Today homepage on The Internet Archive’s Wayback Machine, the ortb2 field was first added to the Prebid config sometime between 11:15am and 5:22pm EDT on the 25th of May 2021.

That first observed ortb2 field contained data for the article “Arizona has seen the worst of its winter weather — for now” from Canton Repository — i.e. the data was spoofed right from the start.

The USA Today homepage started consistently including correct data in the ortb2 field sometime between 9:16am and 2:37pm EST on the 4th of March 2022.

In the months since I noticed the spoofing I’ve run dozens of scans checking thousands of articles on Gannett sites to see how widespread the issue was. These scans excluded paywalled articles and a small number of special features that have different ad setups, but all public “regular” articles — ones where the path starts with /story/ — were candidates.
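The path-based part of that filter is simple enough to show as a short Python sketch. The function name and example URLs are hypothetical, and the paywall and special-feature exclusions are not modelled here because they can’t be decided from the URL alone.

```python
from urllib.parse import urlsplit


def is_candidate_article(url):
    """True for 'regular' article URLs, i.e. those whose path starts with /story/."""
    return urlsplit(url).path.startswith("/story/")


# Hypothetical URLs for illustration.
urls = [
    "https://www.freep.com/story/news/example-article/",
    "https://www.freep.com/in-depth/news/example-feature/",
    "https://www.freep.com/",
]
candidates = [u for u in urls if is_candidate_article(u)]
```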

The last scan completed before the spoofing stopped was run on the 3rd of March 2022, and checked articles listed on the homepages of Gannett sites. Of 6,983 articles checked across 275 sites, the domain in the Prebid config was different to the actual domain 99.46% of the time. The page URL in the Prebid config was different to the actual page URL 99.99% of the time. Only a single page wasn’t spoofed.

These results are consistent with the many other scans I ran. The rare cases where a page wasn’t spoofed were so infrequent as to likely be flukes — i.e. the spoofed page happened to match the actual page.

There didn’t appear to be any relationship between the actual and spoofed pages. However, there was a clear pattern to which domains were spoofed. In this scan, spoofed pages were from USA Today 19.05% of the time. The next most common domain was AZ Central at 2.36%, with other domains decreasing from there.

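Here are the scan’s top 10 spoofed domains, with the number of each domain’s own articles that were checked (Actual), the number of times the domain appeared as the spoofed domain (Spoofed), and that count as a share of all 6,983 checked articles (Spoof Freq.):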

Domain	Actual	Spoofed	Spoof Freq.
usatoday.com	48	1330	19.05%
azcentral.com	29	165	2.36%
desmoinesregister.com	30	125	1.79%
heraldtribune.com	20	101	1.45%
delawareonline.com	164	92	1.32%
app.com	26	88	1.26%
theledger.com	23	81	1.16%
freep.com	25	78	1.12%
seacoastonline.com	21	76	1.09%
dispatch.com	49	73	1.05%

Once again, these results are consistent with other scans done on other days, including ones using different methods of sampling articles, e.g. selecting up to 25 articles from each day of a month for each site, or all articles from a single site for a whole year.

USA Today was always the most frequently spoofed domain by a wide margin. It was typically spoofed on 20% of checked articles, plus or minus a few percentage points.
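The percentages from the 3rd of March scan can be reproduced with some simple arithmetic. The Python sketch below assumes the denominator is the 6,983 articles checked, which matches the published figures. The counts come from the table above, and the variable and function names are my own.

```python
TOTAL_CHECKED = 6983

# Times each domain appeared as the spoofed domain (from the table above).
SPOOFED_COUNTS = {
    "usatoday.com": 1330,
    "azcentral.com": 165,
    "desmoinesregister.com": 125,
    "heraldtribune.com": 101,
    "delawareonline.com": 92,
    "app.com": 88,
    "theledger.com": 81,
    "freep.com": 78,
    "seacoastonline.com": 76,
    "dispatch.com": 73,
}


def percent_of_checked(count, total=TOTAL_CHECKED):
    """Share of all checked articles, as a percentage rounded to 2 places."""
    return round(100 * count / total, 2)


spoof_freq = {domain: percent_of_checked(n) for domain, n in SPOOFED_COUNTS.items()}

# Only a single page wasn't spoofed, so every other page URL was a mismatch.
page_mismatch_rate = percent_of_checked(TOTAL_CHECKED - 1)
```

This gives 19.05 for usatoday.com, 2.36 for azcentral.com, and 99.99 for the page URL mismatch rate, in line with the numbers reported above.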

The Impact

Gannett’s ad inventory was being misrepresented to advertisers using affected exchanges for over 9 months.

This might not be a concern for programmatic campaigns that purely target users, but most advertisers do care to some degree about the sites and content their ads end up on.

If you thought you had bought ad space on USA Today, would you be OK with your ad actually displaying on Detroit Free Press? At the time of writing, USA Today is ranked #189 on the Tranco list of top sites. Detroit Free Press is #1,983. How about another Gannett site that was spoofed a handful of times in each scan, The Galesburg Register Mail? It’s #98,198. A significant proportion of Gannett’s sites don’t even make the top 100,000.

Is ad space on an article about animal neglect in northern Michigan the same as ad space on a college football article? Not if you’re trying to reach sports fans, for example. Then there are brand safety and suitability concerns: I suspect certain brands would not be happy to be placed next to the animal neglect article.

The Big Picture

This is unlikely to be the only case of this kind of authorized spoofing in the wild. Exchanges, DSPs, and anti-fraud vendors need to take a good look at why it seemingly went undetected for so long, and where else it might be happening.

Part of what allowed Gannett’s ad inventory to be spoofed is their use of shared DIRECT accounts — if the sites didn’t share accounts the spoofing could have been detected by buyers when validating the inventory against ads.txt files.

Many other multi-site publishers have similar setups, and even more sales houses and other adtech intermediaries regularly mislabel their accounts as DIRECT across large networks of sites. Any of them could be engaging in spoofing — accidentally or otherwise — that won’t be caught by ads.txt validations.

Updating the definition of DIRECT in the ads.txt spec to close the multi-site publisher loophole, along with enforcing it to stamp out mislabelling, would make it easier to detect this kind of authorized domain spoofing (or at least make it harder to implement).

Finally, it would be helpful if certain segments of the adtech industry were a little more cautious about unfurling the “Mission Accomplished” banner.