Metadata is back: The battleground now is meaning and provenance

By Bert Kok, MBA. The author works on the AI transition in media and communication, focusing on strategy, governance, and hands-on training, grounded in journalism and shaped by product and business reality. This article was originally published on LinkedIn and is republished here with permission.

Anyone who has been in media for a couple of years, like me, will recognise the pattern. Big changes rarely arrive with a bang. They often arrive as something “technical.” Something you can park at IT or product development. Until the moment you realise that the foundation is shifting precisely there.

I now live close to Budapest. The first time I came here was in 2004, for a gathering of European news agencies in an innovation program funded by the European Commission. That program ultimately led to the founding of MINDS International, now a network of 26 news agencies.

Many of the conversations that later formed that network were already happening in Budapest. They were about infrastructure. About standards. About how journalism retains meaning in a digital environment with fewer fixed forms. The central theme of that Hungarian meeting was metadata.

I attended on behalf of the Dutch news agency ANP, together with Peter Maarten Bakker, who at the time was interim head of IT. It did not take long before the group started calling him “Mister Metadata.” Not because he was the biggest specialist in the room. On the contrary. Because he was one of the few who genuinely understood what the discussion was actually about.

While many conversations drifted into tools, systems, and implementation, Bakker kept returning to the underlying concept. What is fact. What is context. What is relationship. And how do you capture that before you even start thinking about publication or distribution.

For some people, that sounded abstract and dry. For others, almost school-like. But if you listened closely, you could hear that he was not talking about systems. He was talking about journalism. About meaning. About the conditions under which information keeps its value once it detaches from paper, broadcast channels, and fixed formats.

Metadata as a legitimacy problem, not an efficiency problem

What was being discussed in Budapest was not optimisation. It was legitimacy, I realise now. It was about the role of news agencies and newsrooms in a world where information was becoming easier to copy, reuse, remix, and re-interpret. Metadata was not an accessory. It was an attempt to make journalistic logic explicit, so that it would not simply disappear into the digital flow.

More than twenty years later, that conversation has returned, but now with more urgency. I recently read an article by Dietmar Schantin, in which he describes “Language Model Optimisation” (LMO or LLMO) as the next major shift for news media.

The picture is straightforward. The dominant interface for news is moving from search and scroll to asking and answering. AI models increasingly operate as an active intermediary layer. They select, combine, and phrase information on behalf of the user. Journalism is not only found in that environment. It is processed. That sounds abstract until you bring it back to the newsroom floor.

When publication is no longer the finish line

It is Monday morning. A journalist is working on a story about a dossier that has been ongoing for months. In the past, they could assume some prior knowledge in the reader. A few lines of background would be enough.


Now something changes. Not as a formal instruction, but as a shift in instinct. This piece will not only be read by people. It will also be read by systems that summarise it, compare it, and blend it with other sources.

The journalist pauses at a familiar phrase. “According to sources close to the dossier.” For a reader, that may be acceptable. For a language model, it is close to meaningless. The sentence gets rewritten. Who are those sources? Civil servants, direct stakeholders, political strategists. Not because the journalist wants to write more “technically,” but because vagueness suddenly has downstream consequences.

A few desks away, an editor is reviewing copy. Not only for style, but for coherence across coverage. This topic appears in multiple articles. Are the same terms used consistently? Is it clear what is fact and what is analysis? Are actors described in stable, unambiguous roles? The editor understands that inconsistency does not just confuse readers. It also confuses the systems that will rephrase and redistribute the story later.

This is what LMO means in practice. Publication is no longer the end point. It is the start of a second life. A life in which the article is read by models that do not share implicit cultural context, yet have outsized influence on how meaning is carried forward.

From storytelling to explicit structure

A different newsroom is working on an analysis piece with multiple perspectives. In the past, that layered tension was the strength of the article. Now the question arises: how does that tension survive when a model compresses the story into a short answer?

The solution is not simplification. It is explicit structure. Clearly indicating who is speaking, from what position, and with what degree of certainty. Journalistic work shifts from only telling to also structuring. Facts, context, interpretation and contestation need to be distinguished more explicitly, because the new intermediaries do not infer these boundaries reliably.
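To make that shift tangible, here is a minimal, hypothetical sketch of how a content system might capture such explicit structure. None of the class or field names below (Statement, StatementType, certainty) come from an existing newsroom schema or standard; they are illustrative assumptions only.

```python
# A minimal sketch, not a standard: one hypothetical way a newsroom system
# could make facts, context, interpretation, and contestation explicit.
# All class and field names here are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class StatementType(Enum):
    FACT = "fact"                      # verified, sourced information
    CONTEXT = "context"                # background the reader may lack
    INTERPRETATION = "interpretation"  # the newsroom's own analysis
    CONTESTED = "contested"            # claims disputed between actors


@dataclass
class Attribution:
    actor: str   # who is speaking, e.g. "Ministry of Finance spokesperson"
    role: str    # stable, unambiguous role, e.g. "official source"


@dataclass
class Statement:
    text: str
    kind: StatementType
    certainty: float                   # 0.0 (speculative) to 1.0 (established)
    attribution: Attribution | None = None


# The vague "according to sources close to the dossier" rewritten as
# statements a downstream model can keep apart when it compresses the story.
statements = [
    Statement(
        text="The ministry confirmed the budget overrun on Friday.",
        kind=StatementType.FACT,
        certainty=0.95,
        attribution=Attribution("Ministry of Finance spokesperson", "official source"),
    ),
    Statement(
        text="Opposition parties argue the overrun was foreseeable.",
        kind=StatementType.CONTESTED,
        certainty=0.6,
        attribution=Attribution("Opposition finance spokespersons", "political actor"),
    ),
]

for s in statements:
    print(f"[{s.kind.value} | certainty {s.certainty}] {s.text}")
```

The point is not the particular fields, but that the distinctions described above become data rather than something a model has to guess at.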

Over time, this changes the craft. Journalists remain storytellers, but they also become designers of context. Editors evolve from linguistic gatekeepers to curators of meaning. Newsrooms start to function more like knowledge organisations. Not because they stop publishing, but because publication becomes a component in a larger system. A shared memory that machines increasingly consult.

Meaning is not enough without provenance

And this is where the conversation needs one more layer. Because in an AI-mediated environment, structure inside the article is necessary, but not sufficient. LMO is about preventing the loss of meaning when journalism is summarised, remixed, and redeployed by models. But there is another failure mode. A model can preserve meaning and still erase legitimacy if it cannot reliably track where information came from, and what standards sit behind that source.

That is why source identification and source labelling have become inseparable from the broader metadata question. In a recent post, Vincent Peyrègne proposes a framework designed precisely for this ecosystem. His core distinction is useful because it separates two different needs that often get conflated.

First, machines need to know who a source is. Second, humans, and increasingly machines as well, need to understand what a source stands for in professional terms.

  • Layer 1. Machine-readable source identity

The first layer is technical and machine-readable. It concerns clearly identifiable source identifiers, built on standards such as the Coalition for Content Provenance and Authenticity (C2PA) and the Global Media Identifier. These identifiers are not a quality seal. They don’t claim that a source is reliable. They simply state who the source is.


In an AI workflow, that matters more than it might sound. If provenance becomes fuzzy, systems lose the trail of origin. Once origin is lost, authority becomes guesswork. The model will still produce fluent answers, but the chain of legitimacy that journalism depends on becomes harder to preserve.
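As a rough illustration of this first layer, the sketch below attaches an identity record to an article payload so that the origin can travel with the content. The scheme labels, identifier values, and URL are placeholders; this does not reproduce the actual C2PA manifest or Global Media Identifier formats.

```python
# A hypothetical illustration of Layer 1: an identity record that travels
# with the article so downstream systems can keep its origin intact.
# Scheme labels, identifier values, and the URL are placeholders and do not
# reproduce the real C2PA or Global Media Identifier formats.
import json

article_payload = {
    "headline": "Budget overrun confirmed after months of speculation",
    "body": "...",
    "source_identity": {
        "publisher_name": "Example News Agency",           # assumed publisher
        "identifier_scheme": "global-media-identifier",    # placeholder scheme label
        "identifier_value": "gmi:example-news-agency",     # placeholder value
        "content_credentials": "https://example.org/credentials/article-123",
    },
}

# A system that summarises or re-publishes the piece would carry this block
# forward unchanged, so the chain of origin survives reprocessing.
print(json.dumps(article_payload["source_identity"], indent=2))
```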

  • Layer 2. Professional trust signals

The second layer is qualitative and professional. It indicates whether a source adheres to standards such as transparency, editorial independence, and journalistic reliability.

Think of initiatives like the Journalism Trust Initiative and similar frameworks in which news organisations disclose their methods, governance, and accountability mechanisms.

This layer helps human users judge information. It can also provide machine-facing cues about which sources should be treated as higher-quality inputs, and which outputs should carry more uncertainty or require corroboration.

A more robust architecture for an answer-first world

Put these layers together and the earlier Budapest logic returns, now with sharper edges.

Back then, metadata was a way to make journalism findable and reusable without losing its internal logic. Today, it becomes a way to make journalism usable inside answer interfaces without losing its provenance.

This is the combined picture:

  • LMO pushes newsrooms toward explicit internal structure. Facts, context, interpretation, and degrees of certainty must be made legible to systems that will reprocess the work.
  • Source identification adds an external spine. Journalism needs standardised identifiers so machines can keep provenance intact.
  • Trust labelling adds legitimacy. Journalism needs transparent standards so audiences, and the systems serving them, can distinguish between a source that is merely known and a source that is professionally accountable (see the sketch after this list).
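The sketch below is a hypothetical illustration of how an answering system could combine these layers when weighting a source, rather than inferring authority from patterns alone. The record shape, field names, and the disclosed-standards flag are assumptions made for illustration; they are not a real API or the Journalism Trust Initiative’s actual data model.

```python
# A hypothetical sketch of how an answer system could use the layers above
# instead of inferring "authority" from patterns in its training data.
# The record shape and field names are illustrative assumptions, not a real API.

def weight_source(record: dict) -> tuple[float, bool]:
    """Return (weight, needs_corroboration) for a candidate source record."""
    weight = 0.2                  # baseline for sources of unknown origin
    needs_corroboration = True

    if record.get("identifier_value"):                              # Layer 1: traceable identity
        weight += 0.3
    if record.get("trust_signals", {}).get("disclosed_standards"):  # Layer 2: accountability
        weight += 0.4
        needs_corroboration = False

    return min(weight, 1.0), needs_corroboration


known_and_accountable = {
    "identifier_value": "gmi:example-news-agency",
    "trust_signals": {"disclosed_standards": ["editorial independence", "corrections policy"]},
}
merely_known = {"identifier_value": "gmi:some-aggregator", "trust_signals": {}}

print(weight_source(known_and_accountable))  # higher weight, no forced corroboration
print(weight_source(merely_known))           # traceable origin, but still needs corroboration
```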

Without this, models will substitute probability for provenance. They will infer “authority” from patterns in the data rather than from traceable origin and declared editorial practice. The end result is not necessarily misinformation. It is something more corrosive. A blurring of content, origin, and reliability that erodes trust without triggering obvious alarms.

The return of an old insight

This is also why “Mister Metadata” reads differently in hindsight. It was never really a joke about technology. It was a sign that someone had spotted where the real shift would land. Not in tools, but in how meaning and legitimacy get encoded. Not in publication, but in the infrastructure beneath it.

Metadata is back. Not as a buzzword, but as daily newsroom practice. At least, it should be. This time not at the margins of the newsroom, but in the centre of the editorial workflow.

Because what began in Budapest as a discussion about fields and definitions has become a question of who gets to define how journalism is read, understood, and carried forward in a world where machines increasingly co-author the public memory.

This article was published earlier in Dutch on my Substack, The News Stack. Translated to English with the help of ChatGPT.
