Introduction to Vector Databases

1. Why Vector Databases Are Gaining Attention

The database landscape has evolved significantly over time. While relational, object-oriented, and time-series databases each had their peak moments, vector databases represent a new frontier — driven largely by the explosion of unstructured data and the rise of machine learning.


2. The Data Explosion

Global data generation is accelerating at an unprecedented pace. A few key facts set the stage:

  • Sources everywhere: Wearables, GPS fleets, social media uploads, IoT sensors — all constantly generating data.
  • IDC projection: The global data sphere is expected to reach roughly 400 zettabytes by 2028 (1 zettabyte = 10²¹ bytes).
  • Critical insight: At that scale, over 80% of all data will be unstructured, with roughly 30% generated in near real-time.

Takeaway: The future is unstructured. Understanding how to store, search, and make sense of it is the core challenge vector databases address.


3. Three Types of Data

Type            | Definition                            | Examples                            | Storage Options
Structured      | Fixed format, fits neatly into tables | Books catalog (ISBN, year, author)  | PostgreSQL, MySQL
Semi-structured | Has keys/markers but no rigid schema  | JSON documents with optional fields | MongoDB, Cassandra, Redis
Unstructured    | No fixed format, arbitrary size       | Images, emails, audio, sensor logs  | Needs ML to process

Structured Data

Rows and columns with a well-defined schema. Easy to sort, filter, and query, e.g., ORDER BY author to sort or WHERE year > 2000 to filter.
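
As a rough illustration of why a fixed schema makes sorting and filtering easy, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample rows are made up for this example, loosely following the books catalog (ISBN, year, author) mentioned in the table above.

```python
import sqlite3

# Hypothetical books table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (isbn TEXT PRIMARY KEY, year INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("isbn-001", 1999, "Knuth"),
        ("isbn-002", 2011, "Adams"),
        ("isbn-003", 2005, "Lovelace"),
    ],
)

# Sort: every row has an author column, so the ordering is well defined.
by_author = [row[0] for row in conn.execute("SELECT author FROM books ORDER BY author")]

# Filter: the schema guarantees year is an integer we can compare against.
recent = [
    row[0]
    for row in conn.execute("SELECT isbn FROM books WHERE year > 2000 ORDER BY year")
]

print(by_author)
print(recent)
```

Because every row has the same columns with the same types, the database can answer both queries without inspecting the shape of individual records.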

Semi-structured Data

Uses key-value pairs (like JSON), but fields may be missing or vary across records. Flexible for evolving data models.
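
To make "fields may be missing or vary across records" concrete, here is a small sketch using Python's standard json module. The records and the summarize helper are hypothetical; the point is only that code reading semi-structured data has to supply defaults for absent keys.

```python
import json

# Three made-up records: the second has no "tags", the third has no "title".
raw = """[
  {"id": 1, "title": "Intro", "tags": ["db", "ml"]},
  {"id": 2, "title": "Notes"},
  {"id": 3, "tags": []}
]"""
records = json.loads(raw)


def summarize(record):
    """Build a one-line summary, falling back to defaults for missing keys."""
    title = record.get("title", "(untitled)")
    tags = record.get("tags", [])
    return f"{record['id']}: {title} [{len(tags)} tags]"


summaries = [summarize(r) for r in records]
for line in summaries:
    print(line)
```

Each record still carries its own keys, so it can be queried by field, but nothing forces every record to have every field.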

Unstructured Data

No predefined structure. Can be arbitrarily large or small. Cannot simply be dropped into a traditional database table. Requires transformation and indexing before it becomes searchable.


4. Common Sources of Unstructured Data

Machine-generated:

  • Sensor data (temperature, humidity, GPS, motion)
  • System/application/event logs
  • IoT device streams (smart thermostats, wearables)
  • Computer vision output (image recognition, object detection, video analysis)

Human-generated:

  • Emails (free-form text, images, attachments)
  • Text messages (informal language, abbreviations, emojis)
  • Social media posts (text, images, videos, hashtags)
  • Audio/video recordings (calls, voicemails, video notes)

5. The Core Problem — and the Solution

Problem: How do you search and analyze data that has no fixed format and no predictable size?

Solution: Machine Learning (specifically Deep Learning)

ML/DL models can process raw unstructured data and convert it into a numerical representation that can be stored, indexed, and searched — this is the foundation of how vector databases work.
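
The sketch below shows only the shape of that conversion: input of any length goes in, a fixed-length list of numbers comes out. It is NOT a deep learning model. The toy_embed function is a hypothetical stand-in that hashes words into buckets, so its vectors capture word overlap rather than meaning, and the dimension of 8 is an arbitrary choice for readability.

```python
import hashlib
import math

DIM = 8  # arbitrary for this sketch; real embedding models use far more dimensions


def toy_embed(text, dim=DIM):
    """Map text of any length to a fixed-length, unit-norm vector.

    Stand-in for a learned model: each word is hashed into one of `dim`
    buckets and the bucket counts are normalized. No meaning is learned.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


short = toy_embed("sensor offline")
long = toy_embed("the temperature sensor in building four went offline at noon")
print(len(short), len(long))
```

Both inputs produce vectors of the same length, which is what makes them storable and comparable in a uniform way regardless of the size of the original data.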


6. The Big Picture

Unstructured Data  →  Deep Learning Model  →  Vector (numerical representation)  →  Vector Database

A vector database stores these numerical representations (called embeddings) so that you can perform intelligent searches — not just keyword matches, but semantic similarity searches.
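
One common way to compare embeddings is cosine similarity, which measures the angle between two vectors. The sketch below assumes that measure and uses a plain dictionary as the "database". The 3-dimensional vectors are hand-written placeholders rather than output from any real model, and the store and search names are hypothetical. Real vector databases use specialized indexes instead of scanning every entry.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of each item.
store = {
    "a photo of a kitten": [0.9, 0.1, 0.0],
    "a picture of a cat": [0.8, 0.2, 0.1],
    "quarterly sales report": [0.0, 0.1, 0.9],
}


def search(query_vector, k=2):
    """Return the k stored items most similar to the query (brute-force scan)."""
    scored = [(cosine_similarity(query_vector, vec), key) for key, vec in store.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Pretend this vector is the embedding of a query such as "feline".
results = search([0.85, 0.15, 0.05])
print(results)
```

The query shares no keyword with either cat-related entry, yet both rank above the sales report because their vectors point in a similar direction. That is the difference between keyword matching and similarity search.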


7. What’s Coming Next

The next topic will cover:

  • How a vector database actually stores data internally
  • What vectors/embeddings look like
  • How similarity search works in practice