Introduction to Vector Databases

1. Why Vector Databases Are Gaining Attention

The database landscape has evolved significantly over time. Relational, object-oriented, and time-series databases each had their moment; vector databases are the newest frontier, driven largely by the explosion of unstructured data and the rise of machine learning.

2. The Data Explosion

Global data generation is accelerating at an unprecedented pace. A few key facts set the stage:

Sources everywhere: Wearables, GPS fleets, social media uploads, and IoT sensors all generate data constantly.
IDC Projection: The global data sphere is expected to reach 400 zettabytes by 2028 (1 zettabyte = 10²¹ bytes).
Critical insight: At that scale, over 80% of all data will be unstructured, with roughly 30% generated in near real-time.

Takeaway: The future is unstructured. Understanding how to store, search, and make sense of it is the core challenge vector databases address.

3. Three Types of Data

Type	Definition	Examples	Storage Options
Structured	Fixed format, fits neatly into tables	Books catalog (ISBN, year, author)	PostgreSQL, MySQL
Semi-structured	Has keys/markers but no rigid schema	JSON documents with optional fields	MongoDB, Cassandra, Redis
Unstructured	No fixed format, arbitrary size	Images, emails, audio, sensor logs	Needs ML to process

Structured Data

Rows and columns with a well-defined schema. Easy to sort, filter, and query in SQL, e.g., ORDER BY author to sort or WHERE year > 2000 to filter.
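As a rough illustration of why structured data is easy to query, here is a minimal sketch using Python's built-in sqlite3 module. The books table, its columns, and the three sample rows are made up for this example and simply mirror the catalog mentioned in the table above.

```python
import sqlite3

# Hypothetical in-memory book catalog; table, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (isbn TEXT PRIMARY KEY, year INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("isbn-001", 2019, "Okafor"),
        ("isbn-002", 2003, "Alvarez"),
        ("isbn-003", 2011, "Chen"),
    ],
)


def authors_sorted():
    """Return all authors in alphabetical order (sorting on a known column)."""
    rows = conn.execute("SELECT author FROM books ORDER BY author").fetchall()
    return [row[0] for row in rows]


def published_after(year):
    """Return authors of books published after `year`, oldest first."""
    rows = conn.execute(
        "SELECT author FROM books WHERE year > ? ORDER BY year", (year,)
    ).fetchall()
    return [row[0] for row in rows]


print(authors_sorted())
print(published_after(2005))
```

Because every row has the same columns with known types, the database can sort and filter without inspecting the content of each record.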

Semi-structured Data

Uses key-value pairs (like JSON), but fields may be missing or vary across records. Flexible for evolving data models.
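The following sketch shows what "fields may be missing or vary across records" looks like in practice. The three JSON records and their field names are invented for illustration; the point is only that readers of semi-structured data have to tolerate absent keys.

```python
import json

# Hypothetical records: every record has "id" and "name", but "email" and "tags" are optional.
raw = """[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Lin", "tags": ["vip"]},
  {"id": 3, "name": "Sam", "email": "sam@example.com", "tags": []}
]"""

records = json.loads(raw)


def emails(recs):
    """Collect the optional "email" field; dict.get returns None when the key is absent."""
    return [rec.get("email") for rec in recs]


def field_names(recs):
    """Return the union of keys seen across all records, sorted alphabetically."""
    return sorted({key for rec in recs for key in rec})


print(emails(records))
print(field_names(records))
```

There is no schema to consult here: the set of fields is only discovered by scanning the records themselves.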

Unstructured Data

Unstructured data has no predefined structure and can be arbitrarily large or small, so it cannot simply be dropped into a traditional database table. It has to be transformed and indexed before it becomes searchable.

4. Common Sources of Unstructured Data

Machine-generated:

Sensor data (temperature, humidity, GPS, motion)
System/application/event logs
IoT device streams (smart thermostats, wearables)
Computer vision output (image recognition, object detection, video analysis)

Human-generated:

Emails (free-form text, images, attachments)
Text messages (informal language, abbreviations, emojis)
Social media posts (text, images, videos, hashtags)
Audio/video recordings (calls, voicemails, video notes)

5. The Core Problem — and the Solution

Problem: How do you search and analyze data that has no fixed format and no predictable size?

Solution: Machine Learning (specifically Deep Learning)

ML/DL models can process raw unstructured data and convert it into a numerical representation that can be stored, indexed, and searched. This conversion is the foundation of how vector databases work.
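To make "convert it into a numerical representation" concrete, here is a deliberately simplified sketch. It is not a deep learning model: toy_embed is a hypothetical stand-in that hashes words into a small fixed-length vector, and the 8-dimension size is arbitrary. It only demonstrates the key property that inputs of any length map to vectors of the same length.

```python
import hashlib
import math

DIM = 8  # arbitrary size for this sketch; learned embeddings are typically much larger


def toy_embed(text, dim=DIM):
    """Map text of any length to a fixed-length, unit-normalized vector.

    This is a word-hashing trick used as a stand-in, NOT a learned embedding:
    it captures which words occur, not what they mean.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib gives a hash that is stable across runs, unlike the built-in hash()
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


short = toy_embed("sensor offline")
long = toy_embed("the temperature sensor in building four went offline at noon")
print(len(short), len(long))
```

A real system would replace toy_embed with a trained model, but the output contract is the same: a list of numbers of fixed length that a database can store and index.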

6. The Big Picture

Unstructured Data → Deep Learning Model → Vector (numerical representation) → Vector Database

A vector database stores these numerical representations (called embeddings) so that you can search by meaning: not just keyword matches, but semantic similarity searches that return the items whose vectors lie closest to the query vector.
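A similarity search can be sketched in a few lines. The example below assumes cosine similarity as the distance measure and uses hand-written 3-dimensional vectors in place of model-produced embeddings; the store dictionary and the search function are hypothetical names for this illustration, not the API of any particular vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written vectors standing in for embeddings; the values are made up.
store = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "picture of a kitten": [0.8, 0.2, 0.1],
    "quarterly sales report": [0.0, 0.1, 0.9],
}


def search(query_vec, k=2):
    """Return the k stored items most similar to query_vec, best match first."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vec, store[name]),
        reverse=True,
    )
    return ranked[:k]


print(search([1.0, 0.0, 0.0]))
print(search([0.0, 0.0, 1.0], k=1))
```

Note that "cat" and "kitten" share no keywords, yet they rank together because their vectors point in a similar direction. This sketch compares the query against every stored vector; production systems generally rely on index structures to avoid a full scan.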

7. What’s Coming Next

The next topic will cover:

How a vector database actually stores data internally
What vectors/embeddings look like
How similarity search works in practice
