Building a real software review operation from nothing
Forbes, Capterra, G2, PCMag — none of them are actually hiring practitioners to use software, evaluate it without a vendor briefing, and follow up with real users to validate what they found. We decided to build that instead.
The market gap we were solving for
Black & White Zebra is a digital media company with 15 publications covering specific professional verticals. Three of those — The Digital Project Manager, People Managing People, and CPO Club — compete directly for B2B SaaS tool review traffic. Project management software, HR and payroll platforms, product management tools. High commercial intent, high competition.
The problem: most software review content is produced by writers who've never logged into the product. They're synthesizing vendor websites, aggregating G2 scores, and calling it research. The results are interchangeable — the same seven criteria, the same four pros, the same vague caveat about “it depends on your needs.”
The thesis was simple: if we could build an operation where analysts actually used the software, talked to the people buying and implementing it, and captured that perspective at scale, we'd have something no one else in our competitive set had. Original data. Real authority. Content that could actually support our advertising relationships because we could say, with specificity, who each tool was right for.
I designed and ran this program end-to-end over the course of 2024 — from the hiring brief to the review methodology to the weekly management cadence to the editorial handoff workflow.
How we actually reviewed software
Each review was built in three distinct phases. The combination of first-hand experience, structured user conversations, and community-level sentiment data was what separated this from a typical editorial operation.
Hands-on Product Trial
Each analyst either set aside budget for a paid plan or worked directly with vendors to secure account access — disclosing upfront that gifted access wouldn't influence the review, which every vendor agreed to. They then built a populated demo environment using AI-generated dummy data (names, tasks, projects, employee records — realistic for the vertical) and ran a structured evaluation against 7 criteria: onboarding, core functionality, standout features, usability, customer support, value, and customer reviews. Two business days per product, max.
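A minimal sketch of how a scorecard against those seven criteria might be recorded. The 1-5 scale, the unweighted average, and the names Scorecard, rate, missing, and overall are assumptions for illustration; the program's actual scoring scale and weighting aren't specified here.

```python
from dataclasses import dataclass, field
from statistics import mean

# The seven criteria named in the review methodology.
CRITERIA = (
    "onboarding",
    "core functionality",
    "standout features",
    "usability",
    "customer support",
    "value",
    "customer reviews",
)


@dataclass
class Scorecard:
    """One analyst's scores for one product (hypothetical structure)."""

    product: str
    scores: dict = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        # Assumption: a 1-5 integer scale. The source does not name a scale.
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.scores[criterion] = score

    def missing(self) -> list:
        # Criteria the analyst has not scored yet, in rubric order.
        return [c for c in CRITERIA if c not in self.scores]

    def overall(self) -> float:
        # Assumption: unweighted mean. Refuses to summarize a partial review.
        if self.missing():
            raise ValueError(f"unscored criteria: {self.missing()}")
        return round(mean(self.scores.values()), 2)
```

Refusing to compute an overall score until every criterion is rated is one way to keep a two-day review from shipping with a gap in the rubric.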
ICP Conversations
Analysts held ideal customer profile (ICP) conversations with three distinct buyer profiles per product, reached via cold LinkedIn and email outreach. These weren't vendor-supplied references: analysts sourced people who had left public reviews on G2, posted in relevant communities, or were actively talking about the software on LinkedIn, then reached out directly.
Sentiment Analysis
Analysts scanned Reddit forums, Slack communities, and user groups to surface recurring themes around pricing, feature gaps, implementation friction, and competitive comparisons. Smaller tools yielded roughly 20 data points each; market leaders like Monday.com, BambooHR, and ClickUp yielded 100–200.
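As a rough illustration, the theme counts could be tallied as below once an analyst has tagged each data point by hand. The function name tally_themes and the (source, themes) input shape are assumptions, not the team's actual tooling.

```python
from collections import Counter

# The four recurring themes named in the sentiment phase.
THEMES = (
    "pricing",
    "feature gaps",
    "implementation friction",
    "competitive comparisons",
)


def tally_themes(data_points):
    """Count how many data points mention each theme.

    data_points: iterable of (source, themes) pairs, where themes is the
    set of theme labels an analyst assigned to that post or comment.
    """
    data_points = list(data_points)
    counts = Counter()
    for _source, themes in data_points:
        for theme in set(themes):  # count each theme once per data point
            if theme not in THEMES:
                raise ValueError(f"unknown theme: {theme}")
            counts[theme] += 1
    total = len(data_points)
    return {
        theme: {
            "mentions": counts[theme],
            "share": round(counts[theme] / total, 2) if total else 0.0,
        }
        for theme in THEMES
    }
```

Reporting a share alongside the raw count makes a 20-point sample for a smaller tool comparable with a 200-point sample for a market leader.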
Market Report
All three phases fed into a structured market report: a single research asset per product containing scores, original screenshots, call recordings, and quantitative sentiment data. This report became the source of truth for content writers and editors upgrading existing review pages.
Three buyer profiles, per product
The goal wasn't just to collect opinions — it was to understand how the same software performs at different organizational sizes and use cases. Each analyst targeted three conversation types per review.
Owner-operators or department leads at smaller organizations — typically evaluating for the first time, weighing cost and implementation lift heavily, often moving off a manual or legacy solution.
Practitioners with five to ten years in their field, often managing a transition from one platform to another. Most useful for surfacing why people switch, what breaks down in practice, and what workarounds users develop on their own.
Rapidly scaling or established enterprise accounts where the questions shift — not "does this work" but "does this scale, integrate, and survive procurement?" Higher stakes, more nuanced on implementation complexity, globalization, and total cost of ownership.
How I built and ran the analyst team
I hired three analysts, each aligned to one publication and vertical: one for project management (DPM), one for HR and people management (PMP), and one for product management and UX (CPO Club). None of them came in with a ready-made review methodology — that was part of the design. Building the muscles was the point.
The first two weeks were structured onboarding: identifying best-in-class exemplars to study, defining evaluation criteria, designing their dummy testing scenario, building their outreach contact list, writing interview scripts, and building out their analyst bio and profile. I graded each deliverable and gave written feedback before they started live software reviews.
Management cadence: Monday standups to validate progress and priorities, Wednesday reviews to discuss work quality, interview outcomes, and any scope adjustments. I managed this team directly, with no middle layer. That proximity mattered because the methodology was still being figured out as we went, and I needed direct feedback loops to course-correct quickly.
Tool selection was a shared decision. I worked with the editorial director to identify which software categories and specific products would generate the most value — mixing market leaders (the high-traffic targets) with emerging tools worth getting ahead of. From there, each analyst selected 10 triable products per batch, with preference for free trials over vendor-gated demos.
What worked. What didn't.
I kept internal retrospectives on this program throughout the year. Here's the honest read.
One day wasn't enough to set up a real test scenario, evaluate against criteria, and identify anything non-obvious. Two days gave us enough depth to write with authority without becoming an inefficient crawl.
January through April was a learning curve on outreach mechanics. By June the team was consistently booking 5–10 calls per week. Year-end totals: 1,500+ outreach contacts and 120+ recorded interviews, more than half with industry subject-matter experts (SMEs). The calls with real users were consistently the richest research in any market report.
Early on we tested whether AI could do the research. It couldn't: it returned vendor marketing copy. The pivot was to use AI differently, as a starting point for investigation that surfaces where analysts should focus, identifies relevant sources, and assesses their quality, with humans taking it from there. That worked well.
One analyst built a custom ChatGPT workflow that followed specific instructions to parse an interview transcript and pull key quotes into a structured quote bank. All three analysts adopted it. It's now the team's default process for converting interview recordings into usable blog content and surfacing the best moments from any conversation.
Pages upgraded with original screenshots from the program saw 2–3× increases in on-page time on The Digital Project Manager. The data was directionally consistent enough that it justified the research overhead.
We spent nearly two months before accepting that the creative team didn't have a clear enough brief to execute video reviews reliably. There were multiple reshoots, the output was inconsistent, and only one analyst delivered fully finished video reviews. We cut scope to screen walkthroughs and moved on. I'd approach those two months differently.
Originally analysts were supposed to infuse their own research into existing content pages. That was too slow and misaligned with what they were hired to do. We pivoted to handing market reports to editors — but adoption was inconsistent. Infusion needed a clearer owner and a defined process before the first review shipped, not after.
The early instinct was to automate as much as possible. That failed in consistent and instructive ways. Personalized connection outreach written by AI didn't convert — the messages were detectable and response rates confirmed it. Scraping a product's website to generate a review produced a polished summary of the vendor's own marketing copy. Nothing original. AI had no ability to form an opinion on whether a product was a good or bad fit for a specific user — that judgment requires domain experience, real context, and time actually inside the software. We also tried using AI to summarize qualitative data in a vacuum: it stripped out everything that made the data worth having. We eventually built a structured quote bank, scored entries for relevance and quality, and used that as the basis for the model — which worked, but it's a different solution to a different problem than what we started with.
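To make the quote bank idea concrete, here is a minimal sketch of entries scored for relevance and quality and then filtered before any summarizing happens. The field names, the 1-5 scales, the relevance threshold, and the top_quotes helper are all assumptions; the real bank's schema isn't described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One entry in a quote bank (hypothetical schema)."""

    buyer_profile: str  # which of the three buyer profiles said it
    product: str
    text: str
    relevance: int  # assumption: 1-5, assigned by the analyst
    quality: int  # assumption: 1-5, assigned by the analyst


def top_quotes(bank, product, min_relevance=3, limit=3):
    """Return the strongest quotes for one product.

    Drops entries below the relevance threshold, then ranks the rest by
    combined relevance and quality score, highest first.
    """
    candidates = [
        q for q in bank
        if q.product == product and q.relevance >= min_relevance
    ]
    candidates.sort(key=lambda q: q.relevance + q.quality, reverse=True)
    return candidates[:limit]
```

Filtering on human-assigned scores first means whatever is summarized afterwards starts from material an analyst already judged worth keeping, rather than from raw transcripts in a vacuum.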
AI did help in four places. First, finding investigation starting points: surfacing contacts, community discussions, and reference material for an analyst to evaluate, not use wholesale. Second, translating messy interview recordings and transcripts into structured first drafts that analysts could shape into op-eds or summaries. Third, building dummy data quickly: realistic names, task structures, employee records, and project scenarios ready to load into a platform before a walkthrough. Fourth, when information on a product was genuinely scarce, identifying new angles and sources to investigate, pointing analysts toward where real data might exist rather than filling gaps with invented data. The consistent pattern: AI worked as a force multiplier for human judgment. It didn't work as a substitute for it.
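For a sense of what loadable dummy data looks like, here is a small sketch that builds task records and serializes them to CSV for import into a trial account. It swaps in a seeded random generator in place of the AI-generated data the analysts used, and every name, word list, and field in it is a hypothetical placeholder.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical vocabulary; a real scenario would be tailored to the vertical.
FIRST_NAMES = ["Alex", "Priya", "Jordan", "Mei", "Sam"]
LAST_NAMES = ["Rivera", "Okafor", "Lindqvist", "Tanaka", "Brooks"]
TASK_VERBS = ["Draft", "Review", "Approve", "Schedule", "Ship"]
TASK_OBJECTS = ["launch plan", "budget", "vendor contract", "sprint backlog"]
STATUSES = ["To do", "In progress", "Blocked", "Done"]


def make_tasks(count, seed=0, start=date(2024, 1, 8)):
    """Generate `count` task records; the same seed gives the same records."""
    rng = random.Random(seed)
    tasks = []
    for task_id in range(1, count + 1):
        due = start + timedelta(days=rng.randint(1, 60))
        tasks.append({
            "id": task_id,
            "title": f"{rng.choice(TASK_VERBS)} {rng.choice(TASK_OBJECTS)}",
            "assignee": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            "status": rng.choice(STATUSES),
            "due": due.isoformat(),
        })
    return tasks


def to_csv(rows):
    """Serialize records to CSV text with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A fixed seed keeps the demo environment reproducible, so two analysts testing competing tools can load the same scenario into each.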
What the program shipped
The market reports fed multiple content formats downstream: individual in-depth reviews, comparison articles (X vs. Y), roundup lists, and five state-of-the-industry reports that synthesized findings across all reviewed tools in each vertical. Those larger reports became the highest-effort, highest-differentiation content we published that year.
Note on metrics: Traffic and SEO ranking data from content infusion isn't included here; the 2–3× on-page time figure is from The Digital Project Manager during the first content infusion cycle.
What we used to run it
No custom infrastructure. The operation ran on a practical stack that an editorial team could actually maintain.
What I took from it
This was a program I built without a blueprint. There was no internal model to reference, no industry equivalent doing it quite this way. The first half of the year was heavy on iteration — figuring out what the two-day review cycle should produce, how much structure the analysts needed versus how much they could define themselves, where the editorial team fit in the workflow.
The thing that held up was the core thesis. Software review content that comes from people who've actually used the product — who've talked to three different kinds of buyers, who've read the forums — is categorically different from content that doesn't. Advertisers felt it. The on-page data confirmed it. The analysts who got in front of real users consistently came back with material that couldn't have been generated any other way.
What I'd do differently: front-load the editorial infusion workflow design before the first review ships, not six weeks after. And build the outreach playbook on day one — the analysts figured out cold outreach mechanics, but it took four months to hit a reliable weekly cadence. That's time you could recover with the right training structure from the start.