While the Hecuba leadership team is still working through which LA neighborhoods make sense for the next café location, Reina Daniels has been busy on a different front. She’s brokered a deal to place Hecuba whole-bean coffee on the shelves at Erewhon. This could be a meaningful step for a brand that has, until now, only existed as a café experience.
The Erewhon merchandise buyer said she loved the simplicity of the packaging used at Hecuba stores but she worries it might need something more for her shoppers, especially those that might not be familiar with the brand. Erewhon shoppers are sophisticated, values-driven, and willing to pay a premium, but they’re also choosing between a wall of beautifully packaged specialty products from brands with their own compelling stories. She suggested two potential changes to the package design. The first leans into the Colombian sourcing, which she thinks will align with Erewhon shoppers’ focus on holistic ingredients. She also thinks the authenticity of the brand’s small-lot farming relationships will resonate. The second idea she suggested was to lean into the founders’ story and signal that this is a female-founded business.
Rather than crowd the label with too many messages competing for attention, Denise, Adria, and Reina suggest a concept test with Erewhon shoppers to see which idea has the strongest lift in purchase intentions.
What a Concept Test Is—And Isn’t
A concept test is a survey-based experiment that measures whether exposure to a marketing concept changes how people respond to your brand or product. A “concept” can be many things—an ad campaign direction, a product description, a positioning statement, a package design, or a brand name. What makes it a concept test is the experimental structure underneath it. Some respondents see a specific concept, some don’t.
Concept tests are usually implemented in a monadic design in which respondents are randomly assigned to a control or a test scenario, and they then only see one iteration. They don’t know that the stimulus has been manipulated. In this way, a concept test can reveal whether a direction actually lifts the desired outcome above a baseline.
The Survey Architecture
Most concept tests adhere to a simple three-part structure, but it always depends on the nuances of your study.
The pre-exposure block captures important data before the stimulus is presented to the respondent. This includes screener information used to confirm eligibility (do they buy whole-bean coffee? Would they consider a new brand?), as well as any baseline measures you might want to include in your analysis, such as category involvement, prior brand awareness, purchase frequency, etc. Frequently, questions are asked here that could be affected by order bias or priming effects after the concept is presented. Everyone sees exactly the same content here.
The stimulus is where the groups split. Respondents see one concept, and only one. The control group sees something neutral, perhaps a brief product description with no positioning argument, or the existing creative used in packaging or advertising. These are your counterfactual respondents. They tell you what people say when nothing has changed. It’s the baseline that makes it possible to measure “lift” from proposed changes.
The post-exposure block brings everyone back to a single path. Your outcome measures like purchase intent live here. These are the same questions and in the same order for every respondent regardless of what they just saw.
In Qualtrics, the mechanics are a Survey Flow randomizer block placed after screeners but before any stimulus content. Each branch holds the materials for one condition. The outcome block lives downstream of all branches.
What the Stimulus Actually Looks Like
Concept tests vary enormously in practice. Sometimes the stimulus is a finished package design. Sometimes it’s only a headline or changes in an illustration. Sometimes it’s a paragraph of copy with no image at all. There is no universal template. The right stimulus is whatever faithfully represents the idea you’re trying to evaluate.
What does matter is stimulus isolation. The differences between your control and treatment conditions should be deliberate and limited enough that you can reasonably attribute any difference in outcomes to the manipulation. If too many things change at once, you might not know what actually moved the needle.
This gets genuinely tricky with visual concepts. When you’re testing two package designs, multiple elements can shift simultaneously, including the color palette, typography, iconography, and layout. You can incrementally test each, but that can get cumbersome and expensive. That’s sometimes unavoidable, and experienced researchers accept it as a known limitation rather than a problem to solve. What you want to avoid is accidental variation, such as when differences between conditions creep in through sloppy execution rather than intentional design. That produces uninterpretable data.
For the Hecuba test, the manipulation is deliberately minimal. The bag shape, the label proportions, the typographic system, and the Hecuba figure all remain consistent across conditions. What changes is a coffee bean icon and origin language in Concept A, and a partnership icon and founder language in Concept B.
Here’s what the three conditions look like:
Control Condition
Hecuba Coffee — Colombian Whole Bean
Hecuba Coffee is a specialty roaster based in Venice, California. Their whole-bean coffee is available in a range of varieties and roast levels, sourced from small farms and roasted at their Abbot Kinney facility.

Concept A — Origin/Provenance
Sourced from one of the rarest and most beloved coffee beans.
100% Colombian Coffee
Hecuba sources directly from small-lot growers in Colombia. Our relationships have been built over years from family ties. Limited harvests mean limited supply. Our roasters work each lot until the bean tells them it’s ready. It’s why 100% Colombian coffee has been cherished by coffee lovers the world over.

Concept B — Women Founded
Built from scratch by two women with a bold vision for coffee.
Female Founded and Operated
Hecuba was founded by Adria Marquez and Denise Shaw in 2018 with a single store on Abbot Kinney and a conviction that great coffee starts small. Eight years later, they still source every bean, approve every roast, and run every location. This bag is the product of their diligence, creativity, and love.

Notice what’s consistent across all three conditions: the bag, the label, the typography, the figure. The production value is deliberately equivalent. If one condition looked more finished or more premium than the others, you’d be measuring design quality rather than the positioning argument (the affect bias we mentioned in previous class discussions) and you’d never know which one drove the result.
What’s distinct is the argument each treatment makes. Concept A is about provenance (rare beans, direct relationships, limited supply). Concept B is about the people behind the brand (two female founders who still make every decision). Those are two genuinely different reasons to believe.
Measuring the Right Things
The choice of primary outcome depends on what decision the concept test is meant to inform. Is it consideration? Appeal? Intention? For a packaging test like this one, the question we probably want to answer is how much the label design might move a shopper at the shelf. Consideration or purchase intent are natural dependent variable options.
How likely would you be to consider Hecuba whole-bean coffee the next time you’re shopping for coffee? 1 = Definitely would not consider — 5 = Definitely would consider
Did the Stimulus Land? Manipulation Checks
Before you analyze the outcome, you need to confirm the manipulation worked. This is the step most often skipped, and the one most likely to embarrass you later.
A manipulation check is a question (or set of questions) asked after the stimulus that confirms respondents actually perceived the concept the way you intended. The logic is simple. If you wrote Concept A to argue for premium provenance, respondents who saw it should rate “this brand sounds like it sources high-quality coffee” higher than respondents who saw the control. If they don’t, then either the stimulus didn’t communicate what you thought it communicated, or respondents weren’t paying attention. Either way, you have a problem — and you’d rather find it now than after the deck is built.
For the Hecuba test, useful manipulation checks might look like this:
Based on what you just saw, how strongly does Hecuba emphasize the origin and sourcing of its coffee? 1 = Not at all — 5 = Very strongly
Based on what you just saw, how strongly does Hecuba emphasize the founders’ personal story? 1 = Not at all — 5 = Very strongly
If randomization worked and the manipulations succeeded, the Origin condition should score higher than the others on the first question, and the Founders condition should score higher on the second. If the differences aren’t there, your test is potentially compromised before you even look at purchase intent.
A few things worth knowing about manipulation checks. They go in the post-exposure block, after the primary outcome measures. You don’t want to prime respondents to think about provenance or founders before you measure consideration — you want them to react to the stimulus first, then tell you what they perceived.
Also, manipulation checks are diagnostic, not decorative. If they fail, you have to make a real call. Sometimes the manipulation didn’t work because the stimulus was too subtle, in which case the test needs to be redesigned. Sometimes respondents just weren’t engaged enough to notice the difference, which suggests an attention problem. Sometimes the manipulation worked but the differences are smaller than you hoped, which is itself a finding. Your concept may not be as distinctive as the team thought.
What you don’t do is wave the failure away and report the purchase intent comparison anyway. A null or weak manipulation check means whatever you measure downstream is hard to attribute to the concept itself. That’s exactly the situation the test was supposed to prevent.
As a best-practice, ask your manipulation check question(s) immediately after you ask your dependent variable questions. This ensures that you don’t influence the most important test results. Remember, questions can “prime” respondents and change the way they answer subsequent questions.
Reading the Result
Compare mean consideration scores across conditions. The difference between a treatment mean and the control mean is your lift. But knowing the direction of the result is not enough. You also need to know whether the difference is statistically significant.
Several analytic approaches work here depending on your design. If you’re running a simple two-cell test (one treatment against a control), a simple t-test is perfectly sufficient. With multiple treatment conditions, as in the Hecuba test, ANOVA is the standard choice, followed by post-hoc paired comparisons. Tukey’s HSD is the most common. If your pre-exposure block collected covariates worth controlling for (i.e., category involvement, purchase frequency, prior brand awareness, etc.), regression gives you a cleaner estimate of the treatment effect with those influences held constant.
But before you run any comparison, check that randomization worked. Pull baseline characteristics by condition and look for systematic differences. For example, if treatment respondents skew younger or more category-involved than control respondents, your random assignment broke down somewhere. You’ll want to find this before you interpret anything downstream.
As you analyze the lift, remember that it is relative, not absolute. A mean of 3.8 in a treatment group looks good, but against a control mean of 3.6, it may be a modest and insignificant finding. Against a control mean of 2.9, it’s probably a meaningful one. Always report the comparison, not just the score. Also, concept tests reduce uncertainty but they don’t eliminate it. A concept that performs well in survey conditions still has to survive pricing, shelf placement, and the gap between stated consideration intent and actual behavior. The test is the start of a go/no-go process, not the final word.
Working with AI on this material
The point of these prompts is to make you think harder, not to outsource the thinking. Concept testing is full of small judgment calls — what to control, what to vary, what to measure — and AI will happily generate any of them for you. That’s the trap. Use these prompts to pressure-test your own design decisions, not to skip making them.
| Teach it back. | Without re-reading this note, explain to an AI — in your own words — why a concept test needs a control condition and what specifically you’d lose if you only tested two competing concepts against each other. Then ask AI to identify any place your explanation was vague or where you confused “control” with “comparison.” (The goal is to find the gap between “I read about controls” and “I understand what work the control is doing.”) |
| Defend a stimulus decision. | Look at the three Hecuba concept boards in this note. Without consulting AI yet, identify the single design choice you’d push back on if you were the researcher running this study — something that introduces a confound, an unintended difference, or a fairness problem between conditions. Write your critique in two or three sentences. Then paste it into a chat and ask AI to steelman the original choice — what’s the strongest case for keeping it as is? Decide which side is right. |
| Find the weakest link. | Ask AI: “Read the Hecuba concept test description in this note. Identify the weakest design decision the team made. Where is the test most likely to produce a misleading result, and what would a more rigorous version of the same test look like?” Evaluate the answer. If you think the note’s design holds up, articulate why. If you agree with the critique, explain what specific change would address it without blowing up the team’s budget or timeline. |
| Sit with the ambiguity. | The note says concept tests “reduce uncertainty but don’t eliminate it” and that stated purchase intent is not the same as actual behavior. Without using AI yet, write down two or three specific reasons a concept that wins a survey-based test might still fail at retail. Then ask AI to critique your list — what did you miss, and what on your list isn’t actually a meaningful risk? (The point is to discover what you don’t yet know, not to start with AI’s list.) |
| Generate a test case. | Describe a real concept decision from your own work, a brand you follow, or a category you know well. Sketch what the control and treatment conditions would look like and what your primary outcome measure would be. Ask AI to play the role of a skeptical client: where is your design over-engineered, and where is it under-powered? Where are you measuring something that won’t actually inform the decision the test is supposed to support? |
If you get stuck building the survey logic in Qualtrics or another tool, paste the specific problem (with what you’ve already tried) into AI and ask it to help you debug your flow. Don’t ask it to design the study for you — ask it to help you fix the one you’ve designed.
AI gets less useful, not more, as the question gets more interesting. Wiring up a randomizer is shallow work it’s good at. Deciding what counts as a meaningful lift, and what a finding actually means for a brand decision, is deep work that’s still yours.