What is a Vector Database?

Structured, Semi-Structured, & Unstructured Data

Data professionals often categorize information into three broad types: structured, semi-structured, and unstructured.

Structured data fits neatly into tables. Think of the file browser on your computer: each row is a file, and each column is a property — filename, file type, size, date created, date modified. You can sort by any of these columns with a click: A–Z, largest to smallest, newest to oldest.

Example: A relational database (SQLite), with predictable columns and predictable queries.
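
As a rough illustration of the file-browser idea, the sketch below builds a small table with Python's built-in sqlite3 module and sorts it by one column. The table name, column names, and rows are invented for this example.

```python
import sqlite3

# In-memory database; the table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (name TEXT, file_type TEXT, size_kb INTEGER, modified TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("notes.txt", "text", 4, "2024-03-01"),
        ("puppy_flowerbed.jpg", "image", 2048, "2024-05-12"),
        ("budget.csv", "text", 16, "2024-01-20"),
    ],
)

# Sorting by any column is a one-line query, like clicking a column header.
largest_first = [
    row[0] for row in conn.execute("SELECT name FROM files ORDER BY size_kb DESC")
]
print(largest_first)  # ['puppy_flowerbed.jpg', 'budget.csv', 'notes.txt']
```

Because every row has the same columns, the same query works no matter which file is stored.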

Semi-structured data starts to bend the rules. Not every entry has the same set of properties, so rigid columns start filling up with “NULL” (empty) values. For example:

A text file might have a word count.
An image might have a resolution and GPS location.
Both still have common fields like name, size, and date.

You could make a giant table with a column for every possible property, but most cells would be empty. Object-based databases (like MongoDB) solve this by letting each entry store only the properties it needs — no wasted columns.
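
The difference can be sketched with plain Python dictionaries standing in for documents in an object-based store. The records and field names below are hypothetical, and the code does not use any MongoDB API; it only counts how many empty cells a single wide table would need.

```python
import json

# Hypothetical file records; each one carries only the fields that apply to it.
records = [
    {"name": "notes.txt", "size_kb": 4, "word_count": 812},
    {
        "name": "puppy_flowerbed.jpg",
        "size_kb": 2048,
        "resolution": [4032, 3024],
        "gps": [52.37, 4.89],
    },
]

# Forcing the same records into one wide table needs a column for every
# property, and every missing value becomes None (NULL in SQL).
all_columns = sorted({key for record in records for key in record})
wide_rows = [[record.get(col) for col in all_columns] for record in records]
null_count = sum(cell is None for row in wide_rows for cell in row)

print(json.dumps(records[0]))  # the object view stores three fields, no gaps
print(all_columns)             # the table view needs five columns
print(null_count)              # 3 empty cells across just two rows
```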

Try It Out! Toggle table columns on and off to see how structured data stays consistent while semi-structured data fills with empty values (NULLs).

Shortcomings of Semi-Structured Data

Semi-structured data works great when you already know which properties you want to store — but sometimes what you care about isn’t explicit metadata at all.

Instead of asking “What is the resolution of this photo?” you might ask:

“What’s in this photo?”
“Which of my photos are similar to this one?”

This is where unstructured data and machine learning come in.

The Role of Machine Learning

Unstructured data — images, audio, free-form text — doesn’t fit neatly into rows and columns. To work with it meaningfully, we need a way to turn it into structured form without manually tagging every detail ourselves.

This is where machine learning comes in.

A trained model can scan raw data, detect patterns, and output features — measurable characteristics of the data. This step is called feature extraction. Each feature is something we can express as a number, such as:

How confident the model is that an image contains a dog
How confident it is that there are flowers
A numeric value for the image’s predominant color

The Puppy-in-a-Flowerbed Analogy

Suppose we printed every image in our dataset and laid them on the floor:

Left to right based on “dogness” (more dog to the right)
Near to far based on “floweriness” (more flowers farther away)

Now imagine we tie each photo to a piece of yarn from the ceiling, raising or lowering them based on their predominant color.

A picture of a puppy lying in a flowerbed would end up somewhere in the middle horizontally (moderate dogness), far back (high floweriness), and at the height matching its color.

Once positioned, we could record each photo’s location as coordinates:

(x, y, z) = (dogness, floweriness, color)

This list of numbers is called a vector (or embedding). In a computer, this is the structured representation of our unstructured photo.
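
In code, those coordinates can be held in something as simple as a named tuple. The sketch below is one possible representation; the class name and the second photo are made up, and in practice the scores would come from a feature-extraction model rather than being typed in by hand.

```python
from typing import NamedTuple


class PhotoVector(NamedTuple):
    dogness: float      # left-to-right position on the floor
    floweriness: float  # near-to-far position on the floor
    color: float        # height on the piece of yarn


# Hypothetical scores standing in for a model's output.
photos = {
    "puppy_flowerbed.jpg": PhotoVector(0.72, 0.91, 0.34),
    "tulip_field.jpg": PhotoVector(0.02, 0.97, 0.80),
}

x, y, z = photos["puppy_flowerbed.jpg"]
print((x, y, z))  # (0.72, 0.91, 0.34)
```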

Try It Out! Switch between a JSON-style object view and a table view to compare how semi-structured data stores only the fields it needs.

Determinism and Consistency

A good model is both deterministic and consistent:

Deterministic: the same input produces the same vector every time.
Consistent: similar inputs produce similar vectors.

Determinism ensures that if we re-run the puppy photo through the same model tomorrow, we get the exact same coordinates, making them a reliable “address” for that image in vector space. Consistency is what makes nearby addresses meaningful.
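
Both properties can be checked on a toy example. The function below is a deliberately simple stand-in for a real model: it maps text to a 26-number vector of letter frequencies. The name toy_embed and the sample phrases are invented for this sketch, and real embedding models capture far more than letter counts.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def toy_embed(text: str) -> list[float]:
    """Map text to letter frequencies (a toy stand-in, not a real model)."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
    return [counts[ch] / total for ch in ALPHABET]


first_run = toy_embed("puppy in a flowerbed")
second_run = toy_embed("puppy in a flowerbed")
similar = toy_embed("puppies in a flowerbed")
different = toy_embed("quarterly tax summary")

# Same input, same vector.
print(first_run == second_run)  # True
# A similar input lands closer than an unrelated one.
print(math.dist(first_run, similar) < math.dist(first_run, different))  # True
```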

From Exact Matches to Nearest Neighbors

If all we wanted was to store and retrieve that exact puppy photo, a normal database or your computer’s file system could do the job — as long as we knew its name or folder.

But vectors give us more:

We can find other entries whose vectors are close to our query’s vector in multi-dimensional space.
This is called a nearest neighbor search.

In our photo analogy, it’s like saying:

“Bring me pictures near the puppy-in-a-flowerbed in our layout — same general area, even if they aren’t identical.”

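A vector database is optimized to make this fast. Instead of scanning every row, as a relational database without a vector index would, it uses specialized indexes to skip most comparisons and jump to likely matches.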
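
To make the idea concrete, here is a minimal nearest neighbor search over four hypothetical photos. It checks every stored vector, which is exactly the slow approach a vector database tries to avoid, but it shows what "closest" means. The file names and vectors are assumptions made for illustration.

```python
import math

# Hypothetical (dogness, floweriness, color) vectors for a tiny photo library.
photos = {
    "puppy_flowerbed.jpg": (0.72, 0.91, 0.34),
    "golden_retriever_tulips.jpg": (0.80, 0.85, 0.40),
    "cat_on_sofa.jpg": (0.05, 0.02, 0.55),
    "rose_garden.jpg": (0.03, 0.95, 0.70),
}


def nearest_neighbor(query_name: str) -> str:
    """Return the photo closest to the query photo, excluding the query itself."""
    query = photos[query_name]
    others = (name for name in photos if name != query_name)
    return min(others, key=lambda name: math.dist(query, photos[name]))


print(nearest_neighbor("puppy_flowerbed.jpg"))  # golden_retriever_tulips.jpg
```

The rose garden photo has similar floweriness, but it sits far away on the dogness axis, so the other dog-in-flowers photo wins.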

Try It Out! Click an image to highlight its nearest neighbors in vector space and see how similar items cluster together.

Understanding Dimensionality

Each feature we measure becomes a dimension in our vector space:

2D → dogness and floweriness
3D → dogness, floweriness, and color
768D → a typical text embedding from a modern NLP model

More dimensions allow us to capture richer meaning, but they also make searches harder. This is the curse of dimensionality:

Distances become less meaningful because points spread out, so the nearest and farthest neighbors end up almost equally far away.
Algorithms need more computation to compare vectors.

This is why practical systems aim for enough dimensions to capture meaning without creating unnecessary complexity.
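
The first effect can be observed with a small experiment. The sketch below scatters random points in a unit cube of a given dimension and measures how much farther the farthest point is than the nearest one, relative to the nearest. Uniform random points are an assumption made for simplicity (real embeddings are not uniform), and the function name distance_contrast is made up.

```python
import math
import random


def distance_contrast(dimensions: int, n_points: int = 200, seed: int = 0) -> float:
    """Relative gap between the farthest and nearest point from a random query."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for dims in (2, 3, 768):
    print(dims, round(distance_contrast(dims), 2))
```

A large value means the nearest neighbor clearly stands out. In this setup the value shrinks sharply as the dimension grows, which is why "closest" carries less information in very high-dimensional spaces.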

Try It Out! Adjust the number of dimensions with the slider to see how point clusters become easier or harder to separate.

How “Closeness” is Measured

When we say “find the closest matches,” we’re talking about distance and similarity measures, which are mathematical ways to compare vectors:

Euclidean distance → straight-line distance between points.
Cosine similarity → the cosine of the angle between two vectors, ignoring magnitude.
Dot product → how aligned vectors are, scaled by their magnitudes.

Different applications choose different metrics. For example:

Image similarity might favor Euclidean distance.
Text search often uses cosine similarity.
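
The three measures are short enough to write out by hand. The sketch below uses only the standard library, and the sample vectors are chosen so the results are easy to verify: two vectors point the same way but differ in length, and a third is perpendicular.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; smaller means closer."""
    return math.dist(a, b)


def dot_product(a, b):
    """Grows with both alignment and magnitude; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))


short_vec = (1.0, 0.0)
long_vec = (3.0, 0.0)        # same direction as short_vec, three times as long
perpendicular = (0.0, 1.0)   # at a right angle to short_vec

print(euclidean_distance(short_vec, long_vec))      # 2.0
print(cosine_similarity(short_vec, long_vec))       # 1.0
print(dot_product(short_vec, long_vec))             # 3.0
print(cosine_similarity(short_vec, perpendicular))  # 0.0
```

Euclidean distance treats the short and long vectors as two units apart, while cosine similarity treats them as identical in direction. That difference is the main reason the choice of metric matters.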

Try It Out! Select two points and switch between different distance metrics to see how their similarity changes based on the metric.

Approximate Nearest Neighbor (ANN) Search

Searching for the closest vector exactly means comparing against every stored vector — which is slow at scale.

ANN algorithms speed things up by:

Skipping most comparisons
Using graph or tree structures to focus only on promising candidates
Accepting that results are almost the closest matches, but much faster to compute

Common ANN methods include:

HNSW (Hierarchical Navigable Small World): a graph-based search
IVF (Inverted File Index): groups vectors into clusters
Product Quantization: compresses vectors for faster distance checks
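
The cluster-based idea behind IVF can be sketched in a few lines. This is a toy version under strong assumptions: the three cluster centers are fixed by hand (a real index would typically learn them from the data, for example with k-means), the vectors are random 2-D points, and only the single nearest bucket is searched. None of the names below come from a real library.

```python
import math
import random

rng = random.Random(42)

# Hypothetical 2-D vectors scattered around three made-up cluster centers.
centers = [(0.1, 0.1), (0.9, 0.1), (0.5, 0.9)]
vectors = [
    (cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
    for cx, cy in centers
    for _ in range(100)
]


def closest_center(vector):
    """Index of the center nearest to the vector."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))


# Build the inverted file: one bucket of vectors per center.
buckets = {i: [] for i in range(len(centers))}
for vector in vectors:
    buckets[closest_center(vector)].append(vector)


def exact_search(query):
    """Compare the query against every stored vector."""
    return min(vectors, key=lambda v: math.dist(query, v))


def approximate_search(query):
    """Compare the query only against vectors in the nearest bucket."""
    candidates = buckets[closest_center(query)]
    return min(candidates, key=lambda v: math.dist(query, v))


query = (0.85, 0.15)
print(len(vectors), len(buckets[closest_center(query)]))  # 300 100
print(exact_search(query) == approximate_search(query))   # True
```

Here the approximate search does a third of the comparisons and still finds the same answer, because the clusters are well separated. When a query falls near the boundary between two buckets, the true nearest neighbor can sit in the bucket that was skipped, which is the accuracy trade-off described above.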

Try It Out! Switch between exact search and approximate search to see the trade-off between speed and accuracy.

Embeddings Can Overlap

Just as two different people can accidentally have the same phone number in a poorly managed system, two different data items can produce identical or nearly identical embeddings.

This is why vector search results express a degree of similarity, not a guarantee of identity or uniqueness.

In a relational database, you’d give each row a primary key to ensure uniqueness. In a vector database, the vector itself is not a primary key — it’s more like a search hint.
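
A small sketch of that separation: each record gets its own id, and the vector is stored alongside it as ordinary data. The ids, file names, and vectors are hypothetical.

```python
# Hypothetical records: the id is the unique key, the vector is only a search hint.
records = {
    "img-001": {"file": "puppy_flowerbed.jpg", "vector": (0.72, 0.91, 0.34)},
    "img-002": {"file": "puppy_flowerbed_copy.png", "vector": (0.72, 0.91, 0.34)},
    "img-003": {"file": "cat_on_sofa.jpg", "vector": (0.05, 0.02, 0.55)},
}


def ids_with_vector(vector):
    """Return every record id whose stored vector equals the given one."""
    return [rid for rid, rec in records.items() if rec["vector"] == vector]


matches = ids_with_vector((0.72, 0.91, 0.34))
print(matches)  # ['img-001', 'img-002']
```

Two different files share one vector, so looking up by vector returns both, while looking up by id always returns exactly one.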

Try It Out! Compare two different images that produce nearly identical vectors to understand how different inputs can share similar embeddings.

More Than Just Images

Although our puppy-in-a-flowerbed example focuses on images, embeddings work for many data types:

Text → semantic search, topic clustering, question answering
Audio → song recognition, speaker identification, sound classification
Video → scene similarity, event detection
Multimodal → embeddings that combine vision, text, and audio in a single vector space

The same principles apply: extract features, turn them into vectors, and store them for fast similarity search.

Try It Out! Switch between image, text, and audio inputs to see how each produces its own type of embedding vector.

What Is a Vector Database?

A vector database is a database built to store and retrieve items based on their vectors.

Instead of looking up puppy_flowerbed.jpg by name, you could query:

dogness: 0.72
floweriness: 0.91
color: 0.34

and get the exact match, even if you don’t remember the file name, location, or format.

Try It Out! Run a vector search for an item and see how it returns the exact stored match, even without using the file name or ID.

Why Not Just Store Vectors in a Regular Database?

You could put vectors into a regular SQL table as a list of numbers. But the real power of a vector database comes from nearest neighbor search.

If you have a photo of a golden retriever in tulips, you might not want that exact image — you might want others that are visually similar. A vector database can:

Compare the query vector against stored vectors, using an index to avoid checking every one.
Find the ones “closest” in multi-dimensional space.
Return the top N matches.

Doing this efficiently is tricky. Vector DBs use specialized indexes (like HNSW, IVF, or Product Quantization) to skip most comparisons and jump straight to likely matches. A relational database without such an index would need to brute-force scan every row, which doesn’t scale.
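
For comparison, this is roughly what the brute-force approach looks like when vectors live in an ordinary SQL table. The sketch stores each vector as JSON text in SQLite and scores every row in Python. The table layout, file names, and the top_n helper are assumptions made for illustration.

```python
import heapq
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (name TEXT PRIMARY KEY, vector TEXT)")
conn.executemany(
    "INSERT INTO photos VALUES (?, ?)",
    [
        ("puppy_flowerbed.jpg", json.dumps([0.72, 0.91, 0.34])),
        ("golden_retriever_tulips.jpg", json.dumps([0.80, 0.85, 0.40])),
        ("cat_on_sofa.jpg", json.dumps([0.05, 0.02, 0.55])),
        ("rose_garden.jpg", json.dumps([0.03, 0.95, 0.70])),
    ],
)


def top_n(query, n):
    """Brute force: read every row, score it, and keep the n closest."""
    scored = (
        (math.dist(query, json.loads(vector)), name)
        for name, vector in conn.execute("SELECT name, vector FROM photos")
    )
    return [name for _, name in heapq.nsmallest(n, scored)]


print(top_n([0.75, 0.90, 0.35], 2))
# ['puppy_flowerbed.jpg', 'golden_retriever_tulips.jpg']
```

This works for four rows, but the work grows in step with the table: every query parses and compares every stored vector. A specialized index exists to avoid exactly that cost.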

Try It Out! Watch a side-by-side animation of a SQL table scan and a vector database search to compare how quickly each finds a match.

Why Vector Databases Matter

Beyond photo search, vector databases power:

Image similarity search (find pictures like this one)
Semantic text search (find passages with the same meaning, not just the same words)
Recommendation engines (users who liked this also liked…)
Audio fingerprinting (Shazam-style song matching)
Anomaly detection (flagging transactions unlike previous ones)

Whenever you need to store the meaning of unstructured data and retrieve “things like this,” a vector database is the right tool.

Try It Out! Click on a use-case card to see a mini demo of how vector search powers real-world applications like image search, recommendations, or anomaly detection.

Review

Structured data → Tables (SQL)
Semi-structured data → Flexible schemas (NoSQL)
Unstructured data → Feature extraction → Vectors → Vector databases

With a good embedding model, you can turn unstructured data into structured, searchable form. With a vector database, you can find not just exact matches, but the closest neighbors in meaning — efficiently, at scale.
