
Ethics in AI: Data Usage and Embeddings — The Invisible Architecture of AI

By Michelle Collins, Chief Revenue Officer, CodeBaby

As AI becomes more integrated into our everyday lives—from virtual educators to real-time customer service avatars—there’s a growing conversation about ethics. But much of it focuses on the surface: what AI says, what it looks like, how it behaves. That’s important, of course. But what often gets missed is what’s underneath: the data.

Behind every AI interaction is a web of inputs—massive datasets, curated (or not), processed (hopefully), and embedded into mathematical representations of our world. It’s easy to forget that even the most lifelike avatar is ultimately powered by numbers and patterns. And those numbers? They come from us.

At CodeBaby, we work with conversational, generative and agentic AI. These systems rely heavily on embeddings—compressed representations of language, meaning and behavior that allow AI to “understand” and respond to humans. Embeddings are the translation layer between human thought and machine logic. But that translation isn’t neutral. It’s built from data, and data is never just data. It’s decisions, assumptions and often, bias.
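To make that concrete, here is a minimal sketch of the mechanics, using hand-made toy vectors rather than a real model: an embedding maps text to a vector of numbers, and "understanding" reduces to geometry, with related phrases landing close together. The phrases, values, and dimensions below are invented for illustration; production embeddings come from trained models and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means 'same direction'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors, hand-made for illustration.
embeddings = {
    "refund":     np.array([0.9, 0.1, 0.0, 0.2]),
    "money back": np.array([0.8, 0.2, 0.1, 0.3]),
    "weather":    np.array([0.1, 0.9, 0.7, 0.0]),
}

# Related phrases land close together; unrelated ones don't.
print(cosine_similarity(embeddings["refund"], embeddings["money back"]))  # ~0.98
print(cosine_similarity(embeddings["refund"], embeddings["weather"]))     # ~0.17
```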

The Ethics of Embedding Human Experience

When we create an embedding, we’re essentially deciding what matters. Which words, ideas, and associations get represented—and which don’t. If a model is trained mostly on Western, English-language content, for instance, then its embedding space is going to reflect that worldview. The result? An AI that may sound global but thinks local.
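That skew can be made visible with a simple probe. The sketch below assumes some sentence-embedding function (the `embed` argument is a stand-in, not any particular library’s API): embed culturally parallel terms, compare how close each sits to a shared concept, and treat lopsided similarities as a hint of a lopsided training corpus.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def probe_skew(embed, concept: str, variants: list[str]) -> dict[str, float]:
    """Score how close each culturally specific variant sits to a shared concept.

    `embed` is any text -> vector function. Large gaps between variants
    suggest the training corpus represented some cultures far more than others.
    """
    anchor = embed(concept)
    return {v: cosine(anchor, embed(v)) for v in variants}

# Hypothetical usage: if "wedding" sits much closer to "white dress" than to
# "red sari" or "hanbok", the model has learned one wedding, not all of them.
# probe_skew(model.encode, "wedding", ["white dress", "red sari", "hanbok"])
```

Probes like this are cheap, but they only surface what you already think to look for, which is why broader transparency practices matter.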

That’s why transparency matters. It’s not enough to know what a model can do—we have to know how it does it. What data was used? Who approved it? How were edge cases handled? At CodeBaby, we’ve built processes that audit both the sources and the structures of the embeddings our models use. We also design for human-in-the-loop feedback, because the people using our tools need a way to push back when the data gets it wrong.
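The details of any given team’s tooling will differ, so the following is only an illustrative sketch of what a human-in-the-loop feedback path can look like: every flagged response is logged together with the retrieved source documents that produced it, so the audit trail points back at the data. All names, fields, and IDs here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackFlag:
    """A user correction tied back to the data that produced a response."""
    query: str
    response: str
    source_ids: list[str]   # which embedded documents were retrieved for this answer
    reason: str             # the user's objection, in their own words
    flagged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

review_queue: list[FeedbackFlag] = []

def flag_response(query: str, response: str, source_ids: list[str], reason: str) -> None:
    """Record pushback so the offending sources get re-audited by a human."""
    review_queue.append(FeedbackFlag(query, response, source_ids, reason))

flag_response(
    query="What holidays do families celebrate in December?",
    response="Everyone celebrates Christmas.",
    source_ids=["doc-0412", "doc-0933"],   # hypothetical document IDs
    reason="Assumes one tradition; ignores Hanukkah, Kwanzaa, and others.",
)
```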

Consent Isn’t Just for Interfaces

Another ethical concern is the use of personal or proprietary content in building embeddings. In the rush to scale, some companies have scraped the internet without consent, embedding not just the words of millions, but their identities, labor and ideas. That’s not innovation. It’s appropriation.

We believe consent needs to be foundational in data collection—not an afterthought. If your content helped train a system, you should know. Better yet, you should be asked. At CodeBaby, we’ve turned down data sources we couldn’t verify and built clear contractual protections to ensure our partners’ data doesn’t become someone else’s training set.
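As a sketch of what consent as a foundation can mean in practice (again illustrative, not any specific production pipeline), provenance can be carried as metadata on every document and enforced as a hard filter before anything is embedded:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceDocument:
    doc_id: str
    text: str
    consent_verified: bool          # did the owner agree to training use?
    license: Optional[str] = None   # e.g. a signed data agreement or an open license

def consented_only(corpus: list[SourceDocument]) -> list[SourceDocument]:
    """Drop anything unverifiable before it ever reaches the embedding step."""
    return [d for d in corpus if d.consent_verified and d.license is not None]

corpus = [
    SourceDocument("a1", "Partner FAQ content", consent_verified=True,
                   license="signed data agreement"),
    SourceDocument("b2", "Scraped forum post", consent_verified=False),
]
print([d.doc_id for d in consented_only(corpus)])  # ['a1'] -- the scraped post is out
```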

Embeddings That Empower

Done ethically, embeddings have incredible power. They allow us to create AI systems that respond to nuance, context and emotion. They let an AI tutor recognize when a student is frustrated. They help a virtual health coach mirror not just words, but tone and intent. They are the architecture of empathy at scale.

But to get there, we need vigilance. Regular audits. Diverse input. Community involvement. And above all, humility—the recognition that any system built on human data will reflect human flaws unless we work intentionally to correct them.

Looking Forward

As AI continues to evolve, the ethics of data usage and embeddings will become even more critical. These are not backend technicalities. They are the moral scaffolding on which all AI rests.

At CodeBaby, we’re committed to building AI that doesn’t just work—it respects. That means treating data not as raw material, but as the accumulated experience of real people. And embedding those values into every layer of our systems.

Because at the end of the day, if AI is going to reflect us, let’s make sure it reflects our best.
