
How Online User Data Is Collected, Shared, and Sold: A Plain-Language Guide


Before the Internet era, advertisers reached their audiences through broad, blunt instruments — TV spots timed to the right demographic, newspaper ads placed in the right section, billboards positioned in the right neighbourhood. The targeting logic was sound, but the precision was limited.

Today, the online world makes granular audience targeting possible by enabling the collection of enormous volumes of user data. That data has become a significant economic force in its own right.

According to Forrester Research, U.S. companies spend more than $2 billion annually to tap into consumer data.

Online users are being tracked more than ever before, and the online display advertising industry continues to translate this abundance of user data into revenue.

Understanding what that means — for users and for businesses — requires looking at the underlying mechanics.

How User Data Is Collected, Shared, and Traded

Most intermediate Internet users know that their online activity is being tracked, typically via cookies, and that this tracking feeds into advertising and marketing. What fewer people understand is the scale of the data being collected and the number of intermediaries it passes through inside the online advertising ecosystem.

The process involves scripts, cookies, and a chain of technology platforms — and it plays out on virtually every page on the Internet.

While most companies operating in the online advertising industry collect some data as a byproduct of their core function, there is a category of company whose primary business model is collecting and selling consumer data. These are known as data brokers (sometimes also called data suppliers).

What Is a Data Broker?

Data brokers aggregate user profiles obtained from publishers, combine them, and segment them into audiences. They can also provide additional user information on demand to programmatic ad buying platforms.

Information offered by data brokers typically includes:

  • User segments
  • Ad viewability data
  • Ad fraud detection signals
  • Contextual information about publishers

How Data Brokers Collect Online Consumer Data

Scripts

The process begins the moment a user accesses a website — say, a popular tech news publication — and the page starts loading its content: text, images, videos, and other visible elements. Alongside those visible elements, the site also loads hidden items called scripts (commonly referred to as pixels).

Every website uses first-party scripts to render its own content. However, it is extremely common for sites to also load third-party scripts. The distinction is straightforward: anything served to the browser by an organization other than the website itself is considered third-party content.
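As a rough illustration of that distinction, the sketch below labels a resource as first-party or third-party by comparing its host with the host of the page that loaded it. The function name and URLs are hypothetical, and the rule is deliberately naive: browsers generally compare registrable domains using the Public Suffix List, which this sketch does not attempt.

```python
from urllib.parse import urlsplit


def is_third_party(page_url: str, resource_url: str) -> bool:
    """Return True when the resource host is neither the page host nor a subdomain of it.

    Simplifying assumption: hosts are compared as plain strings, with no
    Public Suffix List lookup, so sibling subdomains count as third-party.
    """
    page_host = (urlsplit(page_url).hostname or "").lower()
    resource_host = (urlsplit(resource_url).hostname or "").lower()
    same_site = resource_host == page_host or resource_host.endswith("." + page_host)
    return not same_site


page = "https://news.example.com/article"
print(is_third_party(page, "https://news.example.com/app.js"))       # False: same host
print(is_third_party(page, "https://tracker.example.net/pixel.js"))  # True: different host
```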

Third-party scripts fall into a few broad categories, each serving a different function:

  • Ads: Display advertisements such as banner ads.
  • Tracking and analytics: Web analytics services like Google Analytics and Piwik PRO.
  • Social media: Widgets like sharing buttons and like buttons.
  • Fonts: Typography resources rendered across different browsers.

Of these categories, advertising and analytics scripts are the most prevalent by a considerable margin.

Some scripts simply execute their intended action. Others use a technique called piggybacking, in which loading one third-party script triggers the loading of additional third-party scripts and web trackers. A social media widget, for example, may load its own script plus several additional tracking scripts beneath it.
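Piggybacking can be pictured as a chain of loads. The sketch below walks a made-up load map to list every script that ends up on the page once a single widget is embedded. The script names and the map itself are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical map: each script -> the additional scripts it loads when executed.
LOADS = {
    "social-widget.js": ["tracker-a.js", "tracker-b.js"],
    "tracker-a.js": ["tracker-c.js"],
    "tracker-b.js": [],
    "tracker-c.js": [],
}


def scripts_loaded(entry: str, loads: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of the load chain, visiting each script once."""
    seen = [entry]
    queue = deque([entry])
    while queue:
        current = queue.popleft()
        for child in loads.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


print(scripts_loaded("social-widget.js", LOADS))
# ['social-widget.js', 'tracker-a.js', 'tracker-b.js', 'tracker-c.js']
```

Here one embedded widget results in four scripts running on the page, three of which the site owner never added directly.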

Beyond the privacy implications, this layering of third-party scripts has a measurable impact on page performance, which is why browser extensions and mobile apps that block scripts have become increasingly popular. iOS 9 was the first version of iOS to support content blocking natively, letting users install apps that block trackers, ads, and other unwanted content in Safari.

On desktop, tools like 3D Tilt — a Firefox add-on — provide a graphical overview of the different layers of a webpage, including the ad and script elements embedded within it.

Collecting and Sharing Data

Not all third-party scripts are there to harvest data — some exist purely to execute a function, like displaying a social sharing button. But a significant subset are there specifically to collect data about the website and the user visiting it. These are known as trackers or web bugs, and they are typically operated by companies in the data collection and resale business (data suppliers).

When a user accesses a page that loads trackers, those trackers collect two categories of information:

Information About the Website

  • URL
  • Page title
  • Taxonomies (the site's content category)
  • Metadata about the displayed article or product

Information About the User

  • Web browser type
  • Enabled plugins
  • Screen resolution
  • Browser language
  • Web history
  • Geolocation
  • Profile data
  • Online transactional history (e.g., items purchased)

It's worth noting that data can also be shared directly between data brokers and companies — meaning brokers may receive data sets that trackers alone cannot capture, such as demographic information like income, gender, and age.

When a tracker runs on a page, it searches for its own third-party cookie. If it cannot find one — because the user is new, or has deleted their cookies — the tracker generates a UUID (universally unique identifier) and saves it as a third-party cookie in its own domain (e.g., tracker.examplesite.com). That cookie is then used to identify the same user across any future website that loads the same tracker.
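The lookup-or-create logic described above can be sketched as follows. This is a simplified model, not any particular tracker's code: the cookie name `tracker_uid` is hypothetical, and a plain dictionary stands in for the cookies stored under the tracker's domain.

```python
import uuid

COOKIE_NAME = "tracker_uid"  # hypothetical cookie name


def get_or_create_user_id(cookies: dict[str, str]) -> tuple[str, bool]:
    """Return (user_id, created). `cookies` stands in for the tracker-domain cookie jar."""
    existing = cookies.get(COOKIE_NAME)
    if existing:
        return existing, False
    new_id = str(uuid.uuid4())  # random (version 4) UUID
    cookies[COOKIE_NAME] = new_id
    return new_id, True


jar: dict[str, str] = {}
first_id, created = get_or_create_user_id(jar)         # new user: an ID is generated
second_id, created_again = get_or_create_user_id(jar)  # returning user: same ID is reused
print(created, created_again, first_id == second_id)   # True False True
```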

Once the tracker has created its third-party cookie, it can sync that cookie with other companies in the display advertising ecosystem — such as data management platforms (DMPs) — making the cookie "active" and allowing collected data to flow into the bidding and targeting process.
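The article does not spell out how syncing works internally. One way to picture the result is a match table that records which tracker ID and which DMP ID belong to the same browser. The class and the IDs below are hypothetical and meant only to show the mapping, not a real platform's API.

```python
from typing import Optional


class MatchTable:
    """Maps a tracker's user ID to a partner platform's ID for the same browser."""

    def __init__(self) -> None:
        self._tracker_to_dmp: dict[str, str] = {}

    def sync(self, tracker_id: str, dmp_id: str) -> None:
        # Assumption: both IDs are known at the moment of the sync call.
        # How they reach the same place is outside the scope of this sketch.
        self._tracker_to_dmp[tracker_id] = dmp_id

    def dmp_id_for(self, tracker_id: str) -> Optional[str]:
        return self._tracker_to_dmp.get(tracker_id)


table = MatchTable()
table.sync(tracker_id="t-123", dmp_id="d-987")
print(table.dmp_id_for("t-123"))  # d-987
print(table.dmp_id_for("t-999"))  # None (never synced)
```

Once such a mapping exists, data collected under the tracker's ID can be attached to the DMP's profile for the same browser.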

Selling the Data

Once data suppliers have collected their data, they typically sell it to data brokers (such as DMPs) through one of two commercial models:

Revenue share model: The broker sells the data to other intermediaries in the ecosystem — DSPs, ad exchanges, ad networks, and so on — and then gives the original supplier a cut of the revenue.

The fundamental problem with this model is transparency: the data supplier has no way of knowing when their data was sold, to whom, or for how much. This opacity is one of the more persistent structural issues in the display advertising industry.

Cookie CPM model: The broker pays the supplier a fixed amount — for example, 30 cents — for every 1,000 unique cookies generated by the supplier's site(s). This is more predictable for suppliers but still limits visibility into downstream use.
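The arithmetic behind the two models is simple. The sketch below uses the 30-cent rate from the example above. The cookie volume, broker revenue, and 40% share are hypothetical figures chosen for illustration.

```python
def cookie_cpm_payout(unique_cookies: int, rate_per_thousand: float) -> float:
    """Fixed payment per 1,000 unique cookies."""
    return unique_cookies / 1000 * rate_per_thousand


def revenue_share_payout(broker_revenue: float, supplier_share: float) -> float:
    """Supplier's cut of what the broker earned from reselling the data."""
    return broker_revenue * supplier_share


# 250,000 unique cookies at $0.30 per 1,000 cookies.
print(round(cookie_cpm_payout(250_000, 0.30), 2))   # 75.0
# Hypothetical: the broker earned $500 from the data and passes on 40%.
print(round(revenue_share_payout(500.0, 0.40), 2))  # 200.0
```

Under cookie CPM the supplier can compute the payout from its own traffic figures. Under revenue share the first argument, `broker_revenue`, is known only to the broker.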

Once brokers have acquired the data, they process it and organize it into audience segments — often thousands of them — built around attributes such as:

  • Relationship status (e.g., in a relationship)
  • Interests (e.g., gardening)
  • Ethnicity (e.g., Native American)
  • Age group (e.g., 35–39)
  • Gender (e.g., male)
  • Connected devices (e.g., Xbox 360)
  • Home value (e.g., between $200k–$400k)
  • Annual income (e.g., between $60k–$90k)

Advertisers can then layer multiple segments together to define and reach the precise audiences they want to target.
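Layering segments amounts to a set intersection: the target audience is made up of the users who appear in every chosen segment. The segment names and user IDs below are hypothetical.

```python
# Hypothetical segment membership: segment name -> set of cookie IDs.
SEGMENTS = {
    "interest:gardening": {"u1", "u2", "u3", "u5"},
    "age:35-39": {"u2", "u3", "u4"},
    "income:60k-90k": {"u2", "u3", "u5"},
}


def layer_segments(segments: dict[str, set[str]], names: list[str]) -> set[str]:
    """Return the users who belong to every named segment (set intersection)."""
    if not names:
        return set()
    return set.intersection(*(segments[name] for name in names))


audience = layer_segments(SEGMENTS, ["interest:gardening", "age:35-39", "income:60k-90k"])
print(sorted(audience))  # ['u2', 'u3']
```

Each added layer can only shrink the audience, which is the usual trade-off between precision and reach.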

Despite their utility, these segments come with some well-known limitations.

Problem 1: Data Accuracy

There is generally no way to determine the age of the data underpinning a segment. Some attributes are relatively stable — gender, for instance — but others shift frequently. Buying intent, for example, may only be relevant for a two-week window. Stale data means ads may be served to users whose circumstances have changed, reducing campaign relevance and performance.

Problem 2: Revenue Attribution Across Multiple Sources

A single audience segment may be constructed from data contributed by hundreds of different suppliers. When that segment is sold, the revenue — after the broker's commission — theoretically needs to be distributed among those contributors in proportion to their contribution. In practice, this attribution process lacks transparency and cannot be independently verified.
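To make the "in proportion to their contribution" idea concrete, here is a minimal sketch of one possible split, weighting each supplier by the number of records it contributed. The weighting rule, the supplier names, and all figures are assumptions. The article does not say how brokers actually perform this calculation.

```python
def split_revenue(sale_price: float, commission_rate: float,
                  contributions: dict[str, int]) -> dict[str, float]:
    """Split post-commission revenue in proportion to each supplier's record count."""
    total = sum(contributions.values())
    if total == 0:
        return {supplier: 0.0 for supplier in contributions}
    pool = sale_price * (1 - commission_rate)  # what is left after the broker's cut
    return {supplier: pool * count / total for supplier, count in contributions.items()}


# Hypothetical: a $1,000 sale, 20% broker commission, contributions counted in records.
payouts = split_revenue(1000.0, 0.20, {"supplier_a": 600, "supplier_b": 300, "supplier_c": 100})
print({supplier: round(amount, 2) for supplier, amount in payouts.items()})
# {'supplier_a': 480.0, 'supplier_b': 240.0, 'supplier_c': 80.0}
```

Counting records is only one choice of weight. Recency, quality, or overlap between suppliers would each produce a different split.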

The questions that remain unresolved in this space include:

  • Should more recent data carry a higher weighting in the attribution model?
  • Should data quality factor into how revenue is distributed?
  • Should the volume of information contributed affect the payout?
  • How should conflicts be resolved when the same data originates from multiple suppliers?

There are no clean answers to these questions, and the industry has yet to converge on a standard approach.

Selling Data Through Technology Platforms

In addition to selling data directly to advertisers, brokers can route it through technology platforms — DSPs, ad exchanges, ad networks, SSPs, and others. When data is sold via a technology platform, it is sold on a CPM basis and billed on top of the cost of the ad inventory itself.

In practice, this means that for every 1,000 impressions purchased through a DSP using audience data from a data broker, the advertiser is billed an additional CPM fee (e.g., $1) on top of the media cost.
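A short worked example of that billing, using the $1 data fee mentioned above. The impression count and the $2.50 media CPM are hypothetical.

```python
def campaign_cost(impressions: int, media_cpm: float, data_cpm: float) -> dict[str, float]:
    """Break a campaign bill into media cost and the data fee billed on top of it."""
    thousands = impressions / 1000
    media = thousands * media_cpm
    data = thousands * data_cpm
    return {"media": media, "data": data, "total": media + data}


# 500,000 impressions at a $2.50 media CPM, plus a $1.00 data CPM.
print(campaign_cost(500_000, 2.50, 1.00))
# {'media': 1250.0, 'data': 500.0, 'total': 1750.0}
```

In this example the data fee adds 40% to the media cost, which shows why the data line item matters to advertisers.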

This model has its own distinct problems.

Problem 1: Lack of Transparency

In the real-time bidding (RTB) auction model, data is typically included in every bid request sent to the DSP. The DSP's bidder evaluates those requests and places bids on behalf of advertisers — but there is no mechanism to verify whether the bidder actually used the data during that exchange. Typically, it is the DSP that self-reports to the DMP on how much data was consumed, with no independent audit possible.

Problem 2: Static Pricing

Data pricing through technology platforms is largely static. The only variation tends to be a premium applied to segments deemed to be of higher value. There is no mechanism for dynamic pricing based on demand or data quality, which means publishers, data suppliers, brokers, and advertisers alike may all be leaving money on the table.

Overview of the Data Flow

The movement of user data through the online display advertising ecosystem is illustrated below:

It's also important to note that data can be repackaged and resold from one DMP to another, provided the cookies across platforms are mapped to each other.

Cookies remain, by a wide margin, the dominant method for tracking user activity on desktop devices. It follows that users seeking to protect their privacy often delete or block third-party cookies. What many don't realize is that deletion is not always permanent.

Cookie respawning is the process by which a deleted cookie reappears — or "respawns" — by drawing on backed-up data stored in additional files, which are then used to recreate the original cookie when the user revisits the site.

The sequence looks like this:

  1. A user accesses a website.
  2. The website creates a cookie.
  3. The cookie tags the user's browser with a unique identifier that is not straightforward to delete.
  4. The user leaves the site and deletes their cookies.
  5. The user returns to the site; the site detects the identifier still present in the browser and respawns the original cookie.

There are two primary technical methods for achieving this:

Flash cookies: Stored by Adobe Flash Player on the user's computer, these cookies are invisible in standard browser cookie settings. Deleting them requires navigating specifically to the Adobe Flash Player settings panel — a step most users are unaware of.

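HTML5 local storage and ETags: The identifier can also be kept in HTML5 local storage or in an entity tag (ETag) held in the browser's HTTP cache. Both sit outside the cookie jar, so they can survive when a user deletes cookies alone. The stored value serves as the persistent identification element (PIE) created by JavaScript or Flash: even after the original cookie is deleted, the PIE can trigger its recreation.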
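The respawning sequence can be simulated with a toy model. In the sketch below, `backup_store` is a hypothetical stand-in for whichever storage holds the backup copy (Flash storage, local storage, or a cached ETag). No real browser API is involved.

```python
class Browser:
    """Toy browser: a cookie jar plus a separate store that cookie deletion leaves alone."""

    def __init__(self) -> None:
        self.cookies: dict[str, str] = {}
        self.backup_store: dict[str, str] = {}  # stand-in for Flash storage, local storage, or an ETag

    def clear_cookies(self) -> None:
        self.cookies.clear()


def visit(browser: Browser, new_id: str) -> str:
    """Simulate a site visit; return the ID the site ends up associating with the browser."""
    user_id = browser.cookies.get("uid") or browser.backup_store.get("uid") or new_id
    browser.cookies["uid"] = user_id       # create, or respawn, the cookie
    browser.backup_store["uid"] = user_id  # keep the backup copy in step
    return user_id


b = Browser()
first = visit(b, new_id="id-1")   # steps 1-3: cookie and backup are written
b.clear_cookies()                 # step 4: the user deletes cookies only
second = visit(b, new_id="id-2")  # step 5: the backup restores the original ID
print(first, second)              # id-1 id-1
```

The fresh ID offered on the second visit is never used, so from the site's point of view the user never left.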

Who's Tracking You?

From a user perspective, the idea that one's browsing data is flowing through a complex chain of intermediaries is unsettling — and with piggybacking amplifying the number of active trackers on any given page, identifying exactly who is collecting what is genuinely difficult. The information isn't easy to surface, and it changes from site to site.

Two browser-based tools make this more visible.

Ghostery

Ghostery is a browser add-on available for Firefox, Chrome, Safari, Internet Explorer, and Opera, as well as on mobile platforms (Android, iOS, and Firefox for Android).

Once installed, a small ghost icon with a number appears in the browser toolbar. That number reflects the count of trackers active on the current page. Clicking the icon reveals the names and categories of those trackers.

Here's an example of what Ghostery surfaces on a major technology news homepage:

Most trackers will be unfamiliar to the average user, but a few, such as Facebook and DoubleClick (Google's advertising platform), tend to appear across a wide range of sites.

LightBeam

LightBeam is a Firefox add-on that visualizes how many third-party tracking requests are generated while browsing. Visiting just two popular news sites — nytimes.com and techcrunch.com — resulted in 110 requests to third-party services, of which seven were triggered by both sites. LightBeam's visualization of this data is shown below:

Differing Views on Data Collection

Opinions on how online user data is collected, shared, and sold vary considerably. Some users accept tracking as a reasonable trade-off for free content and relevant advertising. Others view it as a fundamental violation of privacy and take active steps to limit their exposure.

Online privacy has been a prominent public concern since the NSA surveillance revelations in 2013, and as global Internet usage continues to grow, so does the scale of the data collection apparatus.

Regardless of where one stands on the ethics of data collection, it remains an area of the online advertising ecosystem with significant structural challenges — around transparency, accuracy, attribution, and user consent — that the industry has yet to fully resolve.