Introduction to Vector Databases

1. Why Vector Databases Are Gaining Attention
The database landscape has evolved significantly over time. While relational, object-oriented, and time-series databases each had their peak moments, vector databases represent a new frontier — driven largely by the explosion of unstructured data and the rise of machine learning.
2. The Data Explosion
Global data generation is accelerating at an unprecedented pace. A few key facts set the stage:
- Sources everywhere: Wearables, GPS fleets, social media uploads, IoT sensors — all constantly generating data.
- IDC Projection: The global data sphere is expected to reach 400 zettabytes by 2028 (1 zettabyte = 10²¹ bytes).
- Critical insight: At that scale, over 80% of all data will be unstructured, with roughly 30% generated in near real-time.
Takeaway: The future is unstructured. Understanding how to store, search, and make sense of it is the core challenge vector databases address.
3. Three Types of Data
| Type | Definition | Examples | Storage Options |
| Structured | Fixed format, fits neatly into tables | Books catalog (ISBN, year, author) | PostgreSQL, MySQL |
| Semi-structured | Has keys/markers but no rigid schema | JSON documents with optional fields | MongoDB, Cassandra, Redis |
| Unstructured | No fixed format, arbitrary size | Images, emails, audio, sensor logs | Needs ML to process |
Structured Data
Rows and columns with a well-defined schema. Easy to sort, filter, and query — e.g., ORDER BY author or SORT BY year.
Semi-structured Data
Uses key-value pairs (like JSON), but fields may be missing or vary across records. Flexible for evolving data models.
Unstructured Data
No predefined structure. Can be arbitrarily large or small. Cannot simply be dropped into a traditional database table. Requires transformation and indexing before it becomes searchable.
4. Common Sources of Unstructured Data
Machine-generated:
- Sensor data (temperature, humidity, GPS, motion)
- System/application/event logs
- IoT device streams (smart thermostats, wearables)
- Computer vision output (image recognition, object detection, video analysis)
Human-generated:
- Emails (free-form text, images, attachments)
- Text messages (informal language, abbreviations, emojis)
- Social media posts (text, images, videos, hashtags)
- Audio/video recordings (calls, voicemails, video notes)
5. The Core Problem — and the Solution
Problem: How do you search and analyze data that has no fixed format and no predictable size?
Solution: Machine Learning (specifically Deep Learning)
ML/DL models can process raw unstructured data and convert it into a numerical representation that can be stored, indexed, and searched — this is the foundation of how vector databases work.
6. The Big Picture
Unstructured Data → Deep Learning Model → Vector (numerical representation) → Vector Database
A vector database stores these numerical representations (called embeddings) so that you can perform intelligent searches — not just keyword matches, but semantic similarity searches.
7. What’s Coming Next
The next topic will cover:
- How a vector database actually stores data internally
- What vectors/embeddings look like
- How similarity search works in practice