Why Structured Data, Not Tokenization, is the Future of LLMs

For months, skeptics have argued that Large Language Models (LLMs) like ChatGPT don’t truly “understand” structured data formats like JSON-LD Schema Markup. A common critique is that LLMs process everything as tokenized text, reducing structured data to “statistical soup” rather than treating it as a well-defined knowledge layer.

While early LLMs struggled with structured data because of tokenization limitations, advances in 2025 are improving how models handle structured data even though tokenization remains central to their architecture.

Modern LLMs are increasingly capable of leveraging structured data sources like JSON-LD Schema Markup, especially when paired with reasoning models, retrieval-based architectures and knowledge graphs.

Let's break down why the tokenization argument is obsolete and why structured data, rather than text-only retrieval-augmented generation (RAG), is the key to future AI performance.

What is Tokenization?

Tokenization is the process by which Large Language Models (LLMs) break down text into smaller units called tokens. These tokens can be words, subwords, or even individual characters, depending on the model's tokenizer and vocabulary. For example, the word "schema" might be split into multiple tokens such as "sche" + "ma", while a common word like "data" could remain a single token.

Tokenization allows LLMs to process text efficiently by converting it into numerical representations, which are then used for pattern recognition and prediction. However, it flattens complex formats like JSON-LD Schema Markup into a linear sequence of tokens, so their explicit structure is not carried through as structure.
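
As a rough illustration of both points, the sketch below uses the open-source tiktoken tokenizer (an assumption made for demonstration; different models use different tokenizers, and exact splits vary by vocabulary) to show a word and a small JSON-LD snippet reduced to flat sequences of integer token IDs:

    import tiktoken  # pip install tiktoken; one example tokenizer, not what every LLM uses

    enc = tiktoken.get_encoding("cl100k_base")

    # A single word may map to one token or to several, depending on the vocabulary.
    word_tokens = enc.encode("schema")
    print(word_tokens)                              # list of integer token IDs
    print([enc.decode([t]) for t in word_tokens])   # the text each ID represents

    # JSON-LD is tokenized the same way: braces, keys, and values all become
    # items in one flat sequence, with no tree structure preserved explicitly.
    jsonld = '{"@context": "https://schema.org", "@type": "Organization", "name": "Schema App"}'
    jsonld_tokens = enc.encode(jsonld)
    print(len(jsonld_tokens), jsonld_tokens[:10])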

Tokenization Issues Are Becoming Less Significant

Advancements in LLMs have addressed many of the earlier limitations of subword tokenization. While tokenization remains a core aspect of LLM architecture, newer models can now call external tools, incorporate logical reasoning components, and use self-verification mechanisms. These improvements enhance their ability to handle tasks requiring character-level precision, reducing errors and improving reliability.

  • Multimodal and Hybrid AI: Newer LLMs can call external tools (like Python) or use character-level processing when needed.
  • Symbolic Reasoning: Some models integrate logical reasoning components that allow for more structured interpretation of input.
  • Self-Verification: Chain-of-thought reasoning and self-checking mechanisms enable models to detect and correct their own errors in tasks that require precise detail.

In short, the argument that LLMs cannot work with the individual characters inside a word, and that they therefore reduce Schema Markup to undifferentiated text, carries little weight in 2025. Tokenization quirks may still exist, but they are no longer a meaningful limitation on AI capabilities.
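
For example, rather than estimating character counts from token statistics, a tool-using model can delegate the task to ordinary code. The following is a minimal, hypothetical tool function; the name and wiring are illustrative, not any specific vendor's tool-calling API:

    def count_characters(word: str, character: str) -> int:
        # Exact character-level arithmetic, independent of how the word was tokenized.
        return word.lower().count(character.lower())

    # The model emits a call such as count_characters("strawberry", "r"), the runtime
    # executes it, and the exact result is fed back into the model's response.
    print(count_characters("strawberry", "r"))  # 3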

LLMs Don’t Need to “Guess” When They Have Structured Data

The real problem isn't tokenization; it's hallucination. When an LLM extracts financial figures from a report, it may produce something that looks right based on patterns in its training data, but it cannot guarantee the numbers are accurate.

The solution? Structured data integration.
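
As a minimal sketch of what that integration means (the record and field names below are hypothetical, purely for illustration), the figure is taken from a structured source and placed directly into the prompt, so the model writes around the fact instead of recalling it:

    # Hypothetical structured record, e.g. parsed from a filing or a database row.
    quarterly_report = {
        "company": "ExampleCo",
        "quarter": "Q4 2024",
        "revenue_usd": 1_254_000,
    }

    # The model is asked to summarize the fact, never to reproduce it from memory.
    prompt = (
        f"In one sentence, summarize {quarterly_report['company']}'s "
        f"{quarterly_report['quarter']} revenue of ${quarterly_report['revenue_usd']:,}."
    )
    print(prompt)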

Schema Markup is Not “Statistical Soup” — It’s Data

A critical misunderstanding in outdated views of LLMs is the assumption that JSON-LD Schema Markup is just another type of text.

In reality, Schema.org markup is structured data: a predefined, machine-readable vocabulary of types and properties that search engines, Knowledge Graphs, and AI systems can use for reasoning.
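
A small example makes the distinction concrete. The JSON-LD below (a minimal Organization snippet of the kind embedded in a page's script tag) parses into explicitly typed fields rather than free-running text; Python's standard json module is used here for simplicity, where a production pipeline might use a full JSON-LD processor:

    import json

    markup = """
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Schema App",
      "founder": {"@type": "Person", "name": "Mark van Berkel"}
    }
    """

    data = json.loads(markup)
    # Every value sits at a well-defined, machine-readable position.
    print(data["@type"])             # Organization
    print(data["founder"]["name"])   # Mark van Berkel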

Beyond RAG: Structured Data as the Answer

Moving beyond the text-based retrieval of RAG, LLMs that integrate structured data can interact with information in a more meaningful way. They can retrieve and reason over formal data representations, leading to deeper understanding and more accurate outputs.

  • Knowledge Graphs & Schema.org: Instead of treating JSON-LD as unstructured text, LLMs can retrieve schema data from a structured knowledge base (e.g., Google Knowledge Graph, Wikidata, YAGO, or internal graph databases).
  • Ontology-Driven Understanding: By aligning LLMs with ontology-backed structured data, we eliminate ambiguity and enforce precision.
  • Hybrid Reasoning Systems: The best AI models combine structured symbolic knowledge with statistical models, reducing the need for probabilistic guesswork.

With structured data, LLMs don’t have to hallucinate—they retrieve and reason over real-world facts and defined relationships.
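
As a rough sketch of that retrieve-then-reason pattern, the example below pulls a single fact from a public knowledge graph (Wikidata's SPARQL endpoint, using Q90 for Paris and P1082 for population) and hands it to the model as grounded context; the prompt wiring is an illustrative assumption, not any specific product's pipeline:

    import requests

    # Ask a public knowledge graph for one specific, typed fact.
    query = """
    SELECT ?population WHERE {
      wd:Q90 wdt:P1082 ?population .   # Q90 = Paris, P1082 = population
    }
    LIMIT 1
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "structured-data-demo/0.1"},
    )
    population = resp.json()["results"]["bindings"][0]["population"]["value"]

    # The retrieved fact is injected into the prompt, so the answer is grounded in
    # the knowledge graph rather than in the model's probabilistic recall.
    prompt = f"Using only this fact, answer the question. Fact: the population of Paris is {population}."
    print(prompt)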

The Future of LLMs is Data Quality, Not Tokenization

The largest gains in LLM performance are increasingly driven by the integration of high-quality structured data, such as Knowledge Graphs, which enhances precision, reasoning, and retrieval capabilities. While advancements in tokenization and NLP techniques refine language understanding, structured data provides the foundation for more reliable, context-aware, and scalable AI applications.

A More Forward-Looking View

Instead of saying:
“LLMs can (sort of) generate schema but don’t understand it.”

The real 2025 perspective is:
“LLMs integrated with structured data sources don’t just generate schema—they use it as a foundation for real-world reasoning.”

Final Thoughts: The Future of AI is Structured Knowledge

The debate about tokenization flaws is increasingly outdated. What really matters is how LLMs use structured data to improve accuracy, reduce hallucinations, and enhance decision-making.

For those working in SEO, content strategy, and AI-driven insights, the takeaway is clear:

  • Schema Markup is not just “text”—it’s structured data that AI can use for deeper understanding.
  • Future AI systems will move beyond RAG and rely on Knowledge Graphs for accuracy.
  • Tokenization quirks are minor compared to the benefits of structured data integration.

Instead of worrying about how many R’s an LLM can count within a word, we should focus on how well our structured data supports AI-driven insights. The future of LLMs isn’t in their statistical tricks—it’s in their ability to reason over structured knowledge.

What’s Next?

For companies and teams using Schema.org, such as Schema App, the next step is aligning AI strategies with structured data best practices. Whether it’s SEO, content automation, or entity-based search optimization, the real gains will come from leveraging structured knowledge—not just text-based retrieval.

Are you ready to build AI that actually understands your data? Get in touch with us to learn how.

Mark van Berkel

Mark van Berkel is the Chief Technology Officer and Co-founder of Schema App. A veteran in semantic technologies, Mark has a Master of Engineering – Industrial Information Engineering from the University of Toronto, where he helped build a semantic technology application for SAP Research Labs. Today, he dedicates his time to developing products and solutions that allow enterprise teams to leverage Schema Markup to boost their AI strategy and drive results.
