How to Test For Incremental Impact: Paid Media & Incrementality Testing

Updated June 2026

How to Measure Whether Paid Media Is Driving Incremental Revenue

Your paid media dashboard says revenue is up.

Your CFO asks one harder question: would you have gotten that revenue anyway?

That is the question attribution cannot answer. Platform ROAS, Google Analytics, and last-click can show which ads claimed revenue. They cannot show which sales would have disappeared if the media never ran.

Direct answer: Paid media drives incremental revenue when revenue in the exposed group comes in higher than what you would have expected without the media. To measure it, you compare actual revenue against a credible counterfactual, usually built with a holdout, a matched-market geo test, a conversion lift study, or an MMM calibrated with experiments.

The formula is simple:

Incremental revenue = actual revenue - expected revenue without the media.

Your paid media dashboard shows the revenue your ads claimed. A test shows the revenue they actually caused.
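As a minimal sketch, the formula above can be written out in a few lines of Python. The function names and every figure here are illustrative assumptions, not numbers from any study, and iROAS is taken in its usual sense of incremental revenue divided by the media spend that produced it.

```python
def incremental_revenue(actual_revenue: float, expected_revenue: float) -> float:
    """Actual revenue minus the counterfactual (expected revenue without the media)."""
    return actual_revenue - expected_revenue


def iroas(actual_revenue: float, expected_revenue: float, media_spend: float) -> float:
    """Incremental revenue per dollar of media spend."""
    if media_spend <= 0:
        raise ValueError("media_spend must be positive")
    return incremental_revenue(actual_revenue, expected_revenue) / media_spend


# Illustrative figures only (hypothetical test window).
actual = 500_000.0    # revenue observed with the media running
expected = 440_000.0  # counterfactual estimate from the control group
spend = 30_000.0      # media spend during the test

lift = incremental_revenue(actual, expected)
print(f"Incremental revenue: {lift:,.0f}")
print(f"iROAS: {iroas(actual, expected, spend):.2f}x")
```

The hard part is not the subtraction. It is producing an `expected` figure that finance will accept, which is what the test designs later in this post are for.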



Your dashboard says	Incrementality asks
What revenue did ads claim?	What revenue did ads cause?
Which touchpoint got credit?	What would have happened without the spend?
What was reported ROAS?	What was incremental ROAS?
Did conversions happen?	Did conversions happen because of the media?
Should the channel get credit?	Should the budget change?


What counts as incremental revenue?

Incremental revenue is the extra revenue a campaign caused, not the revenue credited to it. Revenue with the ads running, minus the revenue you would have earned anyway. Platform ROAS, Google Analytics, and last-click all report claimed revenue. None of them show that a single dollar was incremental.

Attribution counts what happened and hands out credit. It cannot estimate what would have happened without the ads.

So you carry two revenue figures that look alike and mean different things. One is the revenue your ads were near. The other is the revenue your ads created. The gap between them is where budget quietly leaks.


Why does paid media get too much credit?

For lower-funnel channels, much of your reported ROAS is demand your ads captured, not demand they created. Retargeting bills you for buyers already at checkout. Branded search catches people who typed your name. View-through windows credit impressions nobody noticed. The reported number often flatters the channel, because the channel is also the measurement system.

Branded Google Search

0.70x median incremental ROAS

The platform often reports it five to ten times higher. Most of that revenue was already yours.

Stella 225-test DTC incrementality benchmark

Branded search is the clearest case, though not a verdict on every brand. If competitors bid on your name, some of that spend is genuinely defensive. The benchmark says the typical branded-search test mostly captures demand the brand already had, and the platform number cannot tell you which kind you are. Run your own reported ROAS through it and see what the range looks like.

What is your reported ROAS actually worth?

Stella's 225-test benchmark ranges give a rough estimate of how much of a platform-reported ROAS may be truly incremental. For example, at a reported ROAS of 9.0x:

What the platform reports: 9.0x
Benchmark-adjusted estimate, likely incremental: 0.9x to 1.8x
Likely attribution inflation: 80% to 90%

Your actual number requires a geo holdout test.

Gordon and colleagues put numbers on this in a 2019 Marketing Science study of 15 Facebook experiments. Observational methods usually overstated ad effects, and in half the experiments the estimate was off by a factor of three or more.

Stella's own Meta work shows the platform number is unreliable in both directions. Across 46 inverse-holdout studies on DTC brands that opted into testing, a small and self-selected sample, the reported ROAS did not predict the true result. The mean incrementality factor, true iROAS divided by platform-reported ROAS, was 1.21, so on average Meta drove about 21% more than the dashboard claimed. But that average hides the spread. For some brands the platform over-credited, for others it under-credited, and the reported number told you nothing about which. Prospecting and upper-funnel media in particular are usually under-credited by last-click, because their payoff lands later and through other channels. You cannot assume the dashboard is inflated and subtract a fixed haircut. You have to test to know which case you are in.
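A short sketch shows why a mean incrementality factor says little about any one brand. The factor is computed as defined above, true iROAS divided by platform-reported ROAS. The five study results are invented for illustration and chosen so their mean happens to equal 1.21. They are not Stella's data.

```python
from statistics import mean


def incrementality_factor(true_iroas: float, platform_roas: float) -> float:
    """True iROAS divided by platform-reported ROAS."""
    if platform_roas <= 0:
        raise ValueError("platform_roas must be positive")
    return true_iroas / platform_roas


# Hypothetical (true iROAS, platform-reported ROAS) pairs, one per brand.
studies = [(1.2, 3.0), (2.1, 3.0), (3.3, 3.0), (4.8, 3.0), (6.75, 3.0)]

factors = [incrementality_factor(true, reported) for true, reported in studies]
over_credited = [f for f in factors if f < 1.0]   # platform claimed more than it caused
under_credited = [f for f in factors if f > 1.0]  # platform claimed less than it caused

print(f"Mean factor: {mean(factors):.2f}")
print(f"Range: {min(factors):.2f} to {max(factors):.2f}")
print(f"Over-credited: {len(over_credited)}, under-credited: {len(under_credited)}")
```

In this made-up set the same 3.0x dashboard number sits on top of true returns ranging from well below to well above it, which is the reason a fixed haircut does not work.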

The full breakdown of that study is worth reading if you want the methodology: "How incremental is Meta really."


Which revenue number should you measure?

The KPI you point the test at changes the answer. Total revenue tells you if ads drove sales. New-customer revenue tells you if ads grew the base. Gross profit tells you if the growth was worth buying. The same test can read like a win on total revenue and a loss on new customers, so pick the metric before you run it.

Take a retargeting-heavy test. Measured on total revenue it might read 8x, because it sweeps up every returning buyer who was coming back anyway. Measured on new-customer revenue, the same test on the same markets could land near 1.4x. The revenue definition changed the answer, not the campaign. Those figures illustrate the gap, they are not a specific study.

Here is the menu a paid media test can point at: total revenue, new-customer revenue, gross profit, contribution margin, orders, qualified leads, pipeline, closed-won revenue, retention, and lifetime value. Each one answers a different business question, and a test optimized for one can look bad on another.

Agree on the revenue definition with finance before the test runs. Otherwise you will spend the post-test meeting arguing about which number was the real one, and the result will die in that room.


How do you measure paid media's incremental revenue?

Measure it in a clear sequence: define the business outcome, pick a test design, build a credible counterfactual, validate the pre-period fit, run long enough to capture the buying cycle, calculate the result with its uncertainty, then decide what changes. The seven steps below are the whole job.

1. Choose the business outcome. Total revenue, new-customer revenue, gross profit, pipeline, or closed-won. This decision changes everything downstream, so make it first.
2. Pick the test design. Geo holdout, audience holdout, platform lift, switchback, or an MMM calibrated with experiments. The next section ranks them.
3. Build the counterfactual. Withhold the media from a comparable control group, or construct a weighted comparison group from markets that behaved like your test market before the test. That stand-in is your estimate of the world where the ads never ran.
4. Check the pre-period fit before launch. If the model cannot predict your test market's past, do not trust it to estimate the present. Look at actual versus predicted revenue and the error metrics, not just a single fit score.
5. Run long enough to capture the purchase cycle. A higher-consideration product needs a longer window, because the revenue it drives shows up later.
6. Calculate the result with its uncertainty. Incremental revenue, iROAS, the confidence interval, and the minimum detectable effect. A point estimate with no range is a guess.
7. Decide what changes. Scale, cut, hold, retest, or change the KPI. A test measures the return at the spend level you ran, so a large scale-up assumes the response curve still holds and usually warrants a retest. If the budget does not move the month after the test, the test was just reporting.
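The counterfactual, pre-period fit, and calculation steps can be sketched in a few lines. This is a deliberately simplified stand-in: it maps control revenue onto the test market with a single scaling ratio, not a fitted synthetic control, and the 10% MAPE gate, the function names, and all revenue figures are assumptions made for illustration.

```python
def fit_scale(pre_test, pre_control):
    """Ratio that maps control revenue onto test-market revenue in the pre-period."""
    return sum(pre_test) / sum(pre_control)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def run_readout(pre_test, pre_control, test_actual, test_control, spend, max_mape=0.10):
    """Check the pre-period fit, then estimate incremental revenue and iROAS."""
    scale = fit_scale(pre_test, pre_control)
    # In-sample check for brevity; a stricter version would hold out part of the pre-period.
    pre_error = mape(pre_test, [scale * c for c in pre_control])
    if pre_error > max_mape:
        raise ValueError(f"pre-period MAPE {pre_error:.1%} is too high to trust the baseline")
    expected = sum(scale * c for c in test_control)  # revenue without the media
    actual = sum(test_actual)
    incremental = actual - expected
    return {
        "pre_period_mape": pre_error,
        "expected_revenue": expected,
        "actual_revenue": actual,
        "incremental_revenue": incremental,
        "iroas": incremental / spend,
    }


# Hypothetical weekly revenue, in thousands.
readout = run_readout(
    pre_test=[100.0, 110.0, 90.0, 105.0],
    pre_control=[200.0, 215.0, 185.0, 210.0],
    test_actual=[130.0, 125.0, 135.0],
    test_control=[205.0, 200.0, 210.0],
    spend=40.0,
)
for name, value in readout.items():
    print(f"{name}: {value:.3f}")
```

The gate comes before the estimate on purpose. If the baseline cannot reproduce the pre-period, the sketch refuses to report a lift at all, rather than printing a number nobody should act on.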


Which test method should you use?

Five methods, ranked by how much you can audit and defend, not by some absolute claim to truth: geo holdout, audience holdout, platform conversion lift, switchback on-and-off tests, and an MMM calibrated with experiments. None of them hands you certainty. A holdout gives you a control group you can inspect, which is the strongest auditable evidence on this list, but a geo holdout is still a quasi-experiment that leans on how well your control markets match. The quality of that control decides the quality of the answer.

Method	What it does	Best for	Watch out for
Geo holdout / matched market	Runs ads in test markets and compares them to matched control markets held ad-free	Channels you can isolate by geography, including offline media like TV and CTV	Media spillover between regions, weak market matching
Audience holdout	Randomly withholds ads from a slice of the addressable audience	Addressable channels with clean audience splits	Leakage across overlapping audiences and platforms
Platform conversion lift	The platform runs its own holdout and reports the lift back to you	A quick directional read	You cannot inspect the counterfactual; the grader and the student are the same party
Switchback / on-off	Turns spend on and off over time and compares the periods	Single-channel reads on a fast cycle	Seasonality and outside shocks contaminate the periods
MMM calibrated with experiments	Models the full media mix and calibrates it against holdout results	Always-on allocation across every channel at once	Correlational and overconfident if it is never calibrated with a real test

Geo holdouts and inverse holdouts run the same logic in opposite directions. A geo holdout turns ads on in test markets and compares them to matched markets held ad-free. An inverse holdout turns ads off in markets that were already running and measures the revenue that disappears. They estimate a channel's incremental contribution from opposite ends, and they will not always land on the identical number, because ad effects carry over and turning spend off decays differently than turning it on. A clean version of either is still a real causal estimate.

If you want the deeper method breakdown, including which holdout design fits which decision, we cover it in our complete guide to incrementality testing platforms.

Platform conversion lift studies sit in the middle for a reason, and it is not that the method is weak. Google's Conversion Lift and Meta's equivalent are real randomized experiments, and at the user level their internal validity can beat a geo test. The problem is that the platform designs it, runs it, and grades it, and you cannot open the box. A number you cannot audit, from the party that profits from it, is hard to defend to finance no matter how clean the experiment underneath.

A single test is a snapshot of one channel in one window. To keep a continuous read between tests, feed the results into a media mix model and let it carry the estimate forward. That is the difference between measuring incrementality once and running it as a practice.

Here is what "good" looks like in the data. Across 225 DTC incrementality tests Stella has run, a self-selected set of brands that chose to measure rather than a market average, median iROAS landed at 2.31x, with the middle half falling between 1.36x and 3.24x, and 88.4% of tests cleared significance at the 90% level. That last number partly reflects Stella screening out underpowered tests before running them, so read it as "most tests we agreed to run detected something," not "88% of campaigns are incremental." There is no single number for whether paid media works. It is a wide distribution, and your channel could sit anywhere along it.


When is a geo test the wrong tool?

A geo holdout is not always the answer, and pretending it is would be the same overreach this post is arguing against. It is coarse. It measures a channel over weeks, not a creative or a keyword over days, so it cannot run your daily optimization. It is slow, and it costs real media, because the test needs enough spend to move a market and throw a detectable signal. And it needs geographic variation you can actually split.

Some cases break it outright. Campaigns you cannot split by geography, like Meta Advantage+ Shopping, need an account-level inverse holdout instead. Budgets too small to move a market will not produce a statistically sound signal, so wait until spend clears the threshold or test a bigger channel first. When you need an always-on read across every channel at once, that is a job for an MMM calibrated by the occasional test, not a standalone geo experiment.

Attribution still has a place. It is fine for daily, granular reporting, for watching what happens hour to hour. It is a poor tool for deciding where the next dollar goes. Use each for what it is good at, and do not let anyone sell you one tool as the answer to every question.


How does this work for lead gen?

For lead gen, do not stop at form fills. Measure incremental lift through the funnel: MQL, SQL, opportunity, pipeline, and closed-won revenue. Campaigns that lift leads but not qualified pipeline are scaling cost, not growth. If paid media raises form submissions without raising closed-won revenue, you proved incremental forms, not incremental business.

More leads is the easiest lift to manufacture and the easiest to misread. Loosen targeting and lead volume climbs. Lead quality usually falls at the same time, and the two can cancel out. A campaign that doubles form fills and halves the close rate closes the same number of deals at a higher cost.
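The arithmetic behind that last point is worth spelling out. The sketch below uses invented figures (form-fill counts, close rates, and cost per lead are all assumptions) to show how lead volume can double while closed-won deals stay flat and cost per deal rises.

```python
def closed_won_deals(form_fills: int, close_rate: float) -> float:
    """Expected closed-won deals from a volume of form fills."""
    return form_fills * close_rate


def cost_per_deal(form_fills: int, close_rate: float, cost_per_lead: float) -> float:
    """Media cost divided by the deals those leads are expected to close."""
    return (form_fills * cost_per_lead) / closed_won_deals(form_fills, close_rate)


# Hypothetical before/after for a campaign with loosened targeting.
before = {"form_fills": 1_000, "close_rate": 0.04}
after = {"form_fills": 2_000, "close_rate": 0.02}
cost_per_lead = 50.0  # assumed constant for simplicity

deals_before = closed_won_deals(**before)
deals_after = closed_won_deals(**after)

print(f"Incremental form fills: {after['form_fills'] - before['form_fills']}")
print(f"Incremental closed-won deals: {deals_after - deals_before:.1f}")
print(f"Cost per deal before: {cost_per_deal(**before, cost_per_lead=cost_per_lead):,.0f}")
print(f"Cost per deal after: {cost_per_deal(**after, cost_per_lead=cost_per_lead):,.0f}")
```

Measured at the form fill, this campaign looks like a doubling. Measured at closed-won, it is flat, and each deal costs twice as much.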

So measure the lift as far down the funnel as your data reaches. If you can only see MQLs cleanly, measure there and treat it as directional. If you can tie tests to closed-won revenue, do that, because it is the only stage that pays the bills.

The catch is lag. Closed-won revenue can land months after the click, so a lead-gen test has to run long enough to follow the sales cycle. Measure on the form fill and you will scale the wrong campaigns confidently.


What should a real readout include?

A real incremental-revenue result is not a single iROAS number. It is a readout you can audit. If a result is just "iROAS was 3.2x," you cannot tell whether the baseline held, whether the test was powered to find the effect, or what you are supposed to do next. The number without the receipt is marketing, not measurement.

Hold any result to this:

Test and control regions
Pre-period fit, shown as actual versus predicted revenue
Expected revenue without the media
Actual revenue
Incremental revenue
iROAS with a confidence interval
Model error
The revenue definition used
The budget recommendation that followed

A real one, from a $17M athletic apparel brand. Its Google Ads looked strong on the dashboard at a reported 6.0 ROAS, but leadership suspected the real number was higher. The test built synthetic control regions from 120 days of pre-period sales, then ran 45 days against total revenue.

The readout came back at a true iROAS of 10.11x, an incrementality factor of 1.67, meaning Google Ads was driving 67% more revenue than the platform reported. Significance was p < 0.01, model error (MAPE) was 8.9%, and the R-squared was 0.82. The decision was concrete: reset the platform ROAS target from 6.0 to about 3.6, scale Google spend roughly 20% a week, and switch the north-star metric to MER. That 10.11x was the average return at their tested spend level, not a guarantee it would hold as they scaled, which is exactly why they lowered the target gradually and watched MER rather than assume the next dollar paid like the last. Over the following months weekly spend rose from about $19k to $30k while MER climbed from 3.0 to 4.5. The full case study is here.
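One plausible reading of how the new platform target follows from the readout: if the platform under-reports by the stated factor of 1.67, a platform-reported 3.6 corresponds to roughly the 6.0 true return the old target was meant to protect. That interpretation is an inference, not something the case study states, and the function name is hypothetical. Note also that dividing the rounded headline figures gives a factor of about 1.685, close to the stated 1.67 but not identical, presumably because of rounding in the published numbers.

```python
def platform_roas_target(true_roas_goal: float, incrementality_factor: float) -> float:
    """Platform-reported ROAS that corresponds to a desired true return."""
    if incrementality_factor <= 0:
        raise ValueError("incrementality_factor must be positive")
    return true_roas_goal / incrementality_factor


# Figures quoted in the case study above.
reported_roas = 6.0
measured_iroas = 10.11
stated_factor = 1.67

# Factor implied by the rounded headline numbers (about 1.685).
implied_factor = measured_iroas / reported_roas

# Assumption: the old 6.0 target is treated as the true-return goal.
new_target = platform_roas_target(reported_roas, stated_factor)

print(f"Implied factor from rounded figures: {implied_factor:.3f}")
print(f"Platform target under the stated factor: {new_target:.2f}")
```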

That is what a readout looks like: the design, the fit, the uncertainty, the revenue definition, and the budget move that followed, not just a headline multiple. And note the R-squared was 0.82, not 0.95. A usable result does not need a perfect fit, only a disclosed one.


What should you ask an incrementality vendor?

Make the vendor show the diagnostics under the number. Ask for the pre-period fit, the holdout design, the minimum detectable effect, the confidence interval, the outside shocks they controlled for, what they excluded, the revenue source, and what budget decision should change. If a vendor cannot show the counterfactual, they are asking you to trust a black box.

Ask the vendor	Why it matters	A good answer looks like
Show me the pre-period fit	The test is only as good as the baseline it predicts	A documented fit you can see: actual versus predicted revenue, error metrics, and where the baseline is strong or weak
What was the holdout design	The control group is your counterfactual	A clean geo or audience split with no exposure leakage
What was the minimum detectable effect	A test can miss real lift simply because it was underpowered	An MDE smaller than the lift you actually care about
What is the confidence interval	A point estimate with no uncertainty is a guess	A reported range, not just a single headline number
What outside shocks did you control for	Promotions, inventory, holidays, and other channels can contaminate the result	Named confounders, modeled as covariates
What was excluded from the test	Unstable markets or leaked audiences can bias the result	A clear list of excluded regions or audiences, and why
What revenue did you measure	Total and new-customer revenue answer different questions	Both, split out, never blended into one figure
What should we change	The test is worthless if the budget does not move	A specific call: scale, cut, hold, or retest

Two terms a vendor may hide behind. Minimum detectable effect is the smallest lift the test could have caught; if it is larger than the lift you care about, a "no effect" result means nothing. Confidence interval is the range the true number probably sits in; a single figure with no range is false precision.
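Both terms reduce to simple formulas under a normal approximation, sketched below. The 90% confidence level matches the significance level quoted earlier in this post. The 80% power figure, the standard error, and the lift estimate are assumptions for illustration, and a real vendor's model may compute these differently.

```python
from statistics import NormalDist


def confidence_interval(estimate: float, std_error: float, confidence: float = 0.90):
    """Two-sided interval around an estimate, assuming a normal sampling distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_error, estimate + z * std_error


def minimum_detectable_effect(std_error: float, alpha: float = 0.10, power: float = 0.80) -> float:
    """Smallest true lift a two-sided test would detect at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * std_error


# Hypothetical readout: incremental revenue estimate and its standard error.
estimate = 60_000.0
std_error = 10_000.0

low, high = confidence_interval(estimate, std_error)
mde = minimum_detectable_effect(std_error)

print(f"90% confidence interval: {low:,.0f} to {high:,.0f}")
print(f"Minimum detectable effect: {mde:,.0f}")
```

If the lift you care about is smaller than the printed MDE, a flat result from this test would not tell you the channel does nothing. It would only tell you the test could not see that far.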

If you are evaluating a measurement partner more broadly, we wrote a full guide on how to vet a marketing measurement consultancy.


How does Stella show the receipt?

Stella produces exactly that readout. Every result shows the test and control regions, the pre-period fit, the expected and actual revenue, the incremental revenue, the iROAS with a confidence interval and an error rate, and a budget recommendation. The number arrives with its receipt.

Underneath, the holdouts are validated across more than one model, so a result does not hang on a single set of assumptions, and a location selection step picks statistically sound test and control regions before you spend. Between formal tests, always-on analysis gives a daily read and a Bayesian MMM keeps the estimate current.

Stella can also close the loop. Once it knows which conversions were actually incremental, it can pass that signal back to the ad platforms through reverse ETL, so they optimize toward the conversions your ads caused instead of the ones attribution would have claimed anyway.

Run it self-serve with your own data, or have Stella's data scientists run it with you. Either way you get the full readout, not a headline number.


FAQ

What is the difference between attributed revenue and incremental revenue? Attributed revenue is credited by a model after the fact. Incremental revenue is what your ads caused, measured against a holdout. Attribution counts what happened. Incrementality estimates what would have happened anyway, then subtracts it. The first is reporting. The second is measurement.

How long does a paid media incrementality test take? Most run a few weeks. Across Stella's 225 tests, durations ran 20 to 59 days, with a median of 33. Match the window to your purchase cycle. A slower, higher-consideration product needs a longer test, because the revenue it drives shows up later.

Can you measure incremental revenue without a holdout? Not cleanly. A media mix model can estimate it from historical patterns, but the strongest answer calibrates that model against a real holdout test. Without an experiment to anchor it, an MMM rests entirely on modeling assumptions, which are far easier to get wrong than a holdout's control group.

What iROAS counts as a good result? It depends on your margin, not a benchmark. For context, the median across Stella's self-selected 225-test set was 2.31x, with the middle half between 1.36x and 3.24x. A 2x on a high-margin product can beat a 4x on a thin one. The benchmark is a reference point, not a target.

Does incrementality testing work for lead generation? Yes, but you have to measure past the form fill. Track lift to MQL, pipeline, and closed-won revenue. A campaign that lifts leads but not qualified pipeline is buying volume, not growth, and the dashboard will not tell you the difference.
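The margin point in the iROAS answer above can be checked with one line of arithmetic: profit per ad dollar is iROAS times gross margin, minus the dollar spent. The margins below are hypothetical, and the sketch ignores fixed costs and lifetime value.

```python
def profit_per_ad_dollar(iroas: float, gross_margin: float) -> float:
    """Gross profit returned per dollar of media, net of that dollar."""
    return iroas * gross_margin - 1.0


def break_even_iroas(gross_margin: float) -> float:
    """iROAS at which incremental gross profit exactly covers the media spend."""
    return 1.0 / gross_margin


# Hypothetical products: one high-margin, one thin-margin.
high_margin = profit_per_ad_dollar(iroas=2.0, gross_margin=0.70)
thin_margin = profit_per_ad_dollar(iroas=4.0, gross_margin=0.30)

print(f"2x iROAS at 70% margin: {high_margin:+.2f} per ad dollar")
print(f"4x iROAS at 30% margin: {thin_margin:+.2f} per ad dollar")
print(f"Break-even iROAS at 30% margin: {break_even_iroas(0.30):.2f}x")
```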


See what your paid media actually caused

You have seen what the platform claims your ads did. A holdout tells you what they actually caused. If you want that number for your own spend instead of taking the dashboard's word for it, book a demo.