Mon 04 October 2021

Ad Specs Part 1: Ads.txt Ambiguity

Category: adtech

Tags: adspecs ads.txt

Programmatic advertising is a massive, complicated, and largely opaque industry. Thousands of companies buy and sell ad space across the web in real-time auctions, determining what ads you see and how much publishers get paid.

This system funds a large proportion of the web as we know it, but has produced a variety of harms along the way: from the defunding of journalism in favour of clickbait and disinformation, to encouraging and facilitating ever more invasive user tracking, to the prevalence of outright ad fraud.

The ads.txt and sellers.json standards from the IAB Tech Lab have been touted as tools to provide much-needed transparency and help tackle some of these issues, but problems with the interpretation, implementation, and enforcement of the standards have blunted their impact.

This is the first in a series of posts diving into these standards and their problems, and suggesting some potential solutions (or at least improvements).

The Ads.txt Specification

The obvious place to start is with the oldest standard, ads.txt. The first version of the ads.txt specification was released in June 2017, and it has had 3 minor revisions since then. The latest is v1.0.3, released in March 2021.

Its stated goal is to tackle domain spoofing — a type of ad fraud where someone sells ad inventory (ad space) claiming it’s for one site, but the buyer’s ad is actually displayed on a different site. (Yes, most details about ad inventory are self-declared by the seller, including the domain the inventory is on.)

Screenshot of a paragraph under the title "2. Introduction". Paragraph reads: "Fraud can come in various forms, here we are concentrating on the form wherein ad inventory is being offered to buyers with a misrepresented label and account during the real-time bidding process. Typically the domain of the webpage, or the ID of the mobile app has been falsified to look like a site or app they do not have authorization to sell." — The introduction section of the ads.txt spec.

Ads.txt addresses this problem by getting publishers to add a plaintext file (called ads.txt) to their site listing all the ad accounts that are authorized to sell inventory on that site. Before a buyer bids on ad inventory, they can check that the seller’s account is authorized by the relevant site’s ads.txt. If it isn’t, the domain is probably being spoofed.

Here’s an example of an ads.txt file, from the New York Times:

amazon-adsystem.com, 3030, DIRECT
appnexus.com, 3661, DIRECT
google.com, pub-4177862836555934, DIRECT
google.com, pub-9542126426993714, DIRECT
indexexchange.com, 184733, DIRECT
liveintent.com, 130, DIRECT
openx.com, 537145107, DIRECT
openx.com, 539936340, DIRECT
openx.com, 539052954, DIRECT
openx.com, 544071378, DIRECT, 6a698e2ec38604c6
rubiconproject.com, 12330, DIRECT
rubiconproject.com, 17470, DIRECT
triplelift.com, 746, DIRECT
pubmatic.com, 158573, DIRECT, 5d62403b186f2ace
pubmatic.com, 158945, DIRECT, 5d62403b186f2ace
media.net, 8CU2553YN, DIRECT
aol.com, 55861, DIRECT, e1a5b5b6e3255540
yahoo.com, 55861, DIRECT, e1a5b5b6e3255540
aol.com, 55792, DIRECT, e1a5b5b6e3255540
yahoo.com, 55792, DIRECT, e1a5b5b6e3255540
google.com, pub-1793726897772453, DIRECT, f08c47fec0942fa0
aps.amazon.com, 3030, DIRECT
indexexchange.com, 196165, DIRECT, 50b1c356f2c5c8fc
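The buyer-side check described above can be sketched in a few lines of Python. This is a simplified illustration rather than a complete parser: the names `AdsTxtRecord`, `parse_ads_txt`, and `is_authorized` are hypothetical, and the sketch assumes the ads.txt content has already been fetched. It only handles comma-separated data records, and skips comments and variable lines.

```python
from typing import List, NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    ad_system: str                    # field #1: domain of the advertising system
    account_id: str                   # field #2: seller account ID on that system
    relationship: str                 # field #3: DIRECT or RESELLER (upper-cased here)
    cert_authority_id: Optional[str]  # field #4: optional, None when absent


def parse_ads_txt(text: str) -> List[AdsTxtRecord]:
    """Parse data records from ads.txt content (simplified)."""
    records = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        fields = [field.strip() for field in line.split(",")]
        # Skip blank lines, malformed lines, and variables such as CONTACT=...
        if len(fields) < 3 or "=" in fields[0]:
            continue
        cert_id = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(
            AdsTxtRecord(fields[0].lower(), fields[1], fields[2].upper(), cert_id)
        )
    return records


def is_authorized(records: List[AdsTxtRecord], ad_system: str, account_id: str) -> bool:
    """True if the seller account appears anywhere in the site's ads.txt."""
    return any(
        r.ad_system == ad_system.lower() and r.account_id == account_id
        for r in records
    )


# Two lines taken from the example file above.
sample = """\
google.com, pub-4177862836555934, DIRECT
openx.com, 544071378, DIRECT, 6a698e2ec38604c6
"""
records = parse_ads_txt(sample)
```

A buyer would call `is_authorized(records, seller_system, seller_account)` before bidding, and treat a `False` result as a sign that the domain is probably being spoofed.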

Beyond listing the authorized ad accounts, ads.txt provides another piece of information: the “relationship” between the publisher and account. This can be “DIRECT” or “RESELLER”.

DIRECT means the “…publisher (content owner) directly controls the account…”. RESELLER means the “…publisher has authorized another entity to…resell their ad space…”.

Screenshot of the definition of "Field #3", named "Type of Account/Relationship". Description: "(Required) An enumeration of the type of account. A value of 'DIRECT' indicates that the Publisher (content owner) directly controls the account indicated in field #2 on the system in field #1. This tends to mean a direct business contract between the Publisher and the advertising system. A value of 'RESELLER' indicates that the Publisher has authorized another entity to control the account indicated in field #2 and resell their ad space via the system in field #1. Other types may be added in the future. Note that this field should be treated as case insensitive when interpreting the data." — The definition of the relationship field in the ads.txt spec.

That all sounds fairly straightforward, right? If you’re a publisher, just add an ads.txt file that lists your own ad accounts as DIRECT, and the accounts of any resellers you work with as RESELLER. However, despite the apparent simplicity, there are at least two major contentious issues with ads.txt.

Authorized Spoofing

When checked by buyers (not a given), ads.txt does prevent spoofing by unauthorized accounts. I can’t open an ad account and sell spoofed nytimes.com ad inventory, as my account isn’t in their ads.txt. But what if the spoofer is in the spoofed site’s ads.txt?

Adtech vendors (sales houses, SSPs, resellers, etc.) act as intermediaries for multiple sites, so their ad accounts are authorized by all of them. For example, if a vendor works with a highly prestigious news site that commands high CPMs, what’s to stop them spoofing that prestigious site when selling ad inventory from their other sites?

Checking a site’s ads.txt won’t help buyers identify this kind of spoofing; the vendor’s ad accounts are in the spoofed site’s ads.txt, so they are authorized to sell the inventory. The same possibility arises when a publisher owns multiple sites, or colludes with other publishers (i.e. a dark pool sales house).
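To make the gap concrete, here’s a toy sketch of authorized spoofing. All of the domains and the account ID are made up for illustration, and `passes_ads_txt_check` is a hypothetical stand-in for whatever check a buyer actually runs.

```python
# Hypothetical data: the (ad system, account ID) pairs each site's ads.txt authorizes.
# The same vendor account is listed by both sites.
ADS_TXT = {
    "prestigious-news.example": {("vendor-ssp.example", "1001")},
    "low-quality-site.example": {("vendor-ssp.example", "1001")},
}


def passes_ads_txt_check(declared_domain: str, ad_system: str, account_id: str) -> bool:
    """The only thing a buyer can verify: is this account listed by the declared site?"""
    return (ad_system, account_id) in ADS_TXT.get(declared_domain, set())


# The impression is really on the low-quality site, but the vendor declares
# the prestigious one. The buyer only ever sees the declared domain.
actual_domain = "low-quality-site.example"
declared_domain = "prestigious-news.example"

spoof_passes = passes_ads_txt_check(declared_domain, "vendor-ssp.example", "1001")
```

In this sketch `spoof_passes` is `True`: nothing in the check involves `actual_domain`, so the spoofed bid request looks identical to a legitimate one.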

But let’s put a pin in this for a minute, and move on to…

Account Mislabelling

Ad inventory sold by DIRECT accounts is frequently preferred by buyers. Programmatic ad campaigns can be configured to prioritise buying from DIRECT accounts, or exclude RESELLER accounts entirely.

Each intermediary between the buyer and the publisher site increases the risk of fraud (e.g. domain spoofing), so “buying direct” is seen as a way to reduce that risk. It also has the potential to reduce costs by bypassing intermediaries that clip the ticket.

Screenshot of a tweet that reads: "I used to work in ad fraud, and resellers are full of fraudulent activity to boost their numbers and money, and the blame is often ascribed to the publisher vs the intermediary so they get to just keep on keeping on." — Response to a question on Twitter about why marketers prefer buying direct.

Screenshot of a tweet that reads: "Not all intermediaries are bad actors, but as an advertiser or marketer you're certainly taking a bit of gamble any time you introduce more links in your digital advertising chain." — Follow-on to above response.

This makes inventory sold by DIRECT accounts more valuable, providing a clear incentive for publishers (often under instruction from the intermediaries) to label accounts as DIRECT, regardless of what the real relationship is.

Mislabelling accounts is ad fraud, as the buyers aren’t getting what they paid for. It’s also rampant. Almost any ad-enabled site you can think of will likely have multiple DIRECT accounts in their ads.txt file that are shared by tens, hundreds, or thousands of other unrelated sites.

They’re easy to spot with Well-Known, an open ads.txt index I run. Here’s a particularly bad case (log in to see counts of sites sharing each account), but the problem isn’t limited to disinformation sites.

Screenshot of the "Intermediary Direct Sellers" table from breitbart.com's ads.txt details on well-known.dev. An alert is shown with the text "Intermediaries shouldn't usually be listed as Direct". Each of the 23 listed ad accounts is shared by multiple sites, often by 10s of thousands of sites. — Sample of mislabelled accounts from breitbart.com’s ads.txt. The numbers in parentheses are the number of sites that list each account as DIRECT.

Tackling this fraud requires three things: an agreed definition of when an account can be labelled DIRECT, an entity with the power and will to enforce this definition, and the ability for violations to be detected at scale. Currently, none of those things exist.

The Definition of DIRECT

For simple cases, the definition of DIRECT in the ads.txt spec seems clear. For example, if I create an account with Google AdX to sell ads on a site I own, I have direct control of the account, so it’s a DIRECT account.

But what if a publisher owns multiple sites? They still “directly control” the account, so can list it as DIRECT on all of them, right? What about a parent company that owns hundreds of companies that each have their own sites?

If an account isn’t DIRECT, does that make it RESELLER? It must, as there isn’t any other option, but RESELLER isn’t defined in opposition to DIRECT — it has its own separate definition.

Publishers often contract a vendor to manage their ad inventory for them. The publisher doesn’t “directly control” the vendor’s accounts, but the vendor arguably isn’t re-selling the inventory, as they never bought it — they’re selling it on behalf of the publisher. Should the vendor’s accounts be DIRECT or RESELLER? You can guess what most of these vendors argue.

Screenshot of a tweet that reads: "In the cases where Freestar is gateway to accessing monetizing on a site, a DIRECT ads.txt for Freestar's direct SSPs is perfectly legitimate (even as the appropriate sellers.json seller type for Freestar is INTERMEDIARY)" — A creative interpretation of DIRECT from an adtech vendor.

The term “publisher” is an essential part of the definitions, but it’s never explicitly defined. The spec refers to “content owners” (abstract and introduction sections), “publisher content distributors” (start of the specification section), and “publishers” (everywhere else) seemingly interchangeably.

The problem with this lack of consistency is that there are cases (e.g. syndicated content) where the content owner isn’t the site owner (“content distributor”). Either could be reasonably called the “publisher”. It’s the site owner that controls the ads.txt file, but should the content owner determine if an account is DIRECT? While there would be many problems trying to implement that interpretation, the definition of DIRECT explicitly includes “content owner” in parentheses after “publisher”, confusing the issue.

The RESELLER definition even seems to include an outright mistake. When I quoted it above I removed a section from the middle to help it make sense. It actually says: “…the Publisher has authorized another entity to control the account indicated in field #2 and resell their ad space…” (the section I removed is “control the account indicated in field #2 and”). This implies that the publisher controls the RESELLER accounts and has delegated that control to the reseller. This simply isn’t true in any case I know of. RESELLER accounts are owned and controlled by the reseller itself.

At this point, it would be helpful to look at the intention behind the account relationship field as a guide in interpreting the definition. However, the spec contains no information about the purpose of the field, nor how it should be used by buyers.

These fundamental issues with the ads.txt spec result in legitimate ambiguity about which accounts can be labelled DIRECT. This ambiguity is then leveraged to manufacture reasonable doubt whenever mislabelling is highlighted by researchers and activists. While everyone is stuck arguing about whose interpretation is correct — a clash of opinion that’s never resolved — the fraud continues.

Enforcement

This is where an enforcement body could step in. Even without a clear definition of DIRECT in the ads.txt spec itself, some interpretation could be selected and enforced, becoming the de facto “correct” interpretation.

The IAB Tech Lab itself is the obvious candidate here, as they developed the ads.txt standard in the first place. However, they constantly insist they aren’t an enforcement body, and would likely argue they don’t have power to actually enforce their standards even if they wanted to.

Screenshot of a tweet from @shails that reads: "All our standards are voluntary. Thats how a trade organization works. There is no mandate or enforcement. We don't endorse anyone or defend any one company. Every company is treated equally and all their work, participation and adoption is voluntary." — A Tweet from the SVP, Product Management & Global Programs at IAB Tech Lab.

Large-scale DSPs and ad exchanges could act as enforcers by not buying from mislabelled accounts. There’s precedent for this: the general adoption of ads.txt was largely driven by Google’s announcement in September 2017 that their products would only buy from ads.txt-authorized accounts.

So far, Google and the other big players don’t seem to be interested in seriously addressing account mislabelling. But even if they were, how would they detect it reliably enough to take enforcement action?

Under any reasonable interpretation of the spec, determining whether a publisher has “direct control” over an ad account requires knowing who the site’s publisher is, who owns the account, the relationship between the two, and how much “control” that relationship affords the publisher. That simply can’t be done programmatically.

Well-Known currently has ads.txt data for around 467k sites, which collectively list 537k unique accounts as DIRECT. Then there’s the long tail of ad-enabled sites that aren’t in Well-Known’s dataset. Manual investigation can’t scale to that volume; it wouldn’t make a dent in the problem.

A Better Definition of DIRECT

Now that we have some idea of the shortcomings of the ads.txt standard, what can we do about it? Here’s where the pin we put in authorized domain spoofing comes in.

We don’t know what the original purpose of the relationship field was, but given that the purpose of the standard as a whole was to tackle domain spoofing, and ad buyers are trying to use it to limit fraud by avoiding intermediaries, why not use it to tackle authorized domain spoofing?

With this purpose in mind, here’s a new definition of the relationship field:

A value of DIRECT indicates that the ad account is only authorized to sell ad inventory on this site.

A value of RESELLER must be used in all other cases.

(Clearly the term “RESELLER” is far from ideal, but keeping it for backwards compatibility is worth the definitional dissonance.)

We also need a definition of “site” to go with this new relationship definition:

A site is determined by its root domain, with the exception of declared subdomains.

A subdomain that is declared with a SUBDOMAIN variable in its root domain’s ads.txt is its own separate site. Its ads.txt file therefore can’t share any DIRECT accounts with the root domain.

Undeclared subdomains are part of the root domain’s site.

Note that the ads.txt spec already defines the “root domain”, and how to handle SUBDOMAIN variables:

Screenshot of a paragraph under the title "3.1 Access Method". Paragraph reads: "Publishers should post the /ads.txt file on their root domain and any subdomains as needed. For the purposes of this document the “root domain” is defined as the public suffix plus one string in the name. Crawlers should incorporate Public Suffix list to derive the root domain." — The definition of root domain from the ads.txt spec.

Screenshot of the definition of variable "SUBDOMAIN". Value: "Pointer to a subdomain file." Description: "(Optional) A machine readable subdomain pointer to a subdomain within the root domain, on which an ads.txt can be found. The crawler should fetch and consume associate the data to the subdomain, not the current domain. This referral should be exempt from the public suffix truncation process. Only root domains should refer crawlers to subdomains. Subdomains should not refer to other subdomains." — The definition of the SUBDOMAIN variable in the ads.txt spec.
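The site definition above can be sketched as a small lookup. This is a rough illustration under two loud assumptions: `PUBLIC_SUFFIXES` is a tiny hardcoded stand-in for the real Public Suffix List, and `declared_subdomains` is assumed to have been collected already from the SUBDOMAIN variables in the root domain’s ads.txt. The function names are hypothetical.

```python
from typing import Set

# Stand-in for the Public Suffix List; a real crawler would load the full list.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def root_domain(hostname: str) -> str:
    """Return the public suffix plus one label (the spec's "root domain")."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check longer candidate suffixes first, so "co.uk" is matched before "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    raise ValueError(f"no known public suffix for {hostname!r}")


def site_for(hostname: str, declared_subdomains: Set[str]) -> str:
    """Map a hostname to its site under the proposed definition."""
    host = hostname.lower().rstrip(".")
    if host in declared_subdomains:
        return host           # declared subdomain: its own separate site
    return root_domain(host)  # undeclared: part of the root domain's site


declared = {"blog.example.com"}
blog_site = site_for("blog.example.com", declared)
shop_site = site_for("shop.example.com", declared)
```

Here `blog_site` is `blog.example.com` and `shop_site` is `example.com`. The sketch only matches declared subdomains exactly; how to treat hosts nested beneath a declared subdomain is left open.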

The Benefits

This new definition is enforceable, less prone to misinterpretation, meaningfully reduces opportunities for domain spoofing, and is more in line with what ad buyers expect.

It doesn’t matter who the “publisher” is, what sites they own, or how much “control” they have over the account. All that matters is whether the account is exclusive to the site or not.

Mislabelling is easily detectable — every DIRECT account must only appear in the ads.txt for a single site. This can be automatically checked by anyone with a database of ads.txt files. Any ads.txt file can be checked for mislabelled accounts on Well-Known right now, and DSPs and exchanges could enforce this at any point with minimal effort.
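As a minimal sketch of that automated check, assume a database of ads.txt files has already been reduced to a mapping from each site to the set of accounts it lists as DIRECT. The function name and all of the data below are hypothetical, and this is not how Well-Known itself is implemented.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Account = Tuple[str, str]  # (ad system domain, account ID)


def find_mislabelled_direct(
    direct_accounts_by_site: Dict[str, Set[Account]],
) -> Dict[Account, Set[str]]:
    """Return every account that more than one site lists as DIRECT."""
    sites_by_account: Dict[Account, Set[str]] = defaultdict(set)
    for site, accounts in direct_accounts_by_site.items():
        for account in accounts:
            sites_by_account[account].add(site)
    # Under the proposed definition, a DIRECT account may belong to one site only.
    return {
        account: sites
        for account, sites in sites_by_account.items()
        if len(sites) > 1
    }


# Made-up example: one exclusive account and one shared vendor account.
index = {
    "site-a.example": {("ssp.example", "100"), ("vendor.example", "999")},
    "site-b.example": {("vendor.example", "999")},
}
violations = find_mislabelled_direct(index)
```

With this data, `violations` contains only the shared `vendor.example` account, along with the two sites that list it. The site keys would need to follow the site definition above, so that an undeclared subdomain is counted as part of its root domain.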

With an unambiguous definition and wide enforcement, buyers could be sure that when they buy from a DIRECT account their money is going to the site the seller claims the inventory is from, and isn’t going through any intermediaries. This doesn’t entirely remove the possibility of domain spoofing — that would require enabling the buyer to validate the inventory details independently of the seller — but it’s far better than what we have currently.

The main limitation with this approach is that it’s a blunt tool. It’s based on what’s essentially a boolean field, so accounts are either DIRECT or they’re not. Buyers can’t choose how direct they want to buy. Many sites can’t be bought DIRECT at all, as they’re too small to have their own accounts.

Thankfully, newer standards provide tools to address these limitations, which I’ll cover in a future post.

The second post in this series is now available here.