
Apache Druid is an open-source, high-performance, real-time analytics database designed for fast slice-and-dice analytics on large datasets. It is well-suited for applications requiring low-latency queries on large amounts of data, such as business intelligence, operations, and monitoring.

Key Features

  • Real-Time and Batch Data Ingestion: Supports both real-time streaming data and batch data ingestion from various sources.
  • Column-Oriented Storage: Each column is stored separately, so a query reads only the columns it needs, and values of the same type compress well.
  • Time-Based Partitioning: Data is partitioned based on time, enabling fast time-based queries and data retention management.
  • Distributed Architecture: Ingestion, querying, and coordination run as separate services that can be scaled, and can fail, independently.
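
To make the time-based partitioning idea concrete, here is a minimal sketch in plain Python. It is not Druid code: the daily chunk size, the row layout, and the helper names (`day_bucket`, `partition_by_day`, `chunks_for_interval`) are assumptions chosen for illustration. It shows why a query restricted to a time interval only needs to touch the chunks that overlap that interval.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day (the assumed chunk size)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def partition_by_day(rows):
    """Group rows into time chunks keyed by the start of each day."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[day_bucket(row["timestamp"])].append(row)
    return dict(chunks)


def chunks_for_interval(chunks, start, end):
    """Return the chunk keys that overlap the half-open interval [start, end)."""
    one_day = timedelta(days=1)
    return sorted(k for k in chunks if k < end and k + one_day > start)


# Hypothetical sample rows spanning three days.
rows = [
    {"timestamp": datetime(2024, 1, 1, 9, 30, tzinfo=UTC), "page": "/home"},
    {"timestamp": datetime(2024, 1, 1, 17, 5, tzinfo=UTC), "page": "/docs"},
    {"timestamp": datetime(2024, 1, 2, 8, 0, tzinfo=UTC), "page": "/home"},
    {"timestamp": datetime(2024, 1, 3, 12, 0, tzinfo=UTC), "page": "/blog"},
]
chunks = partition_by_day(rows)

# A query for 2 January only has to read one of the three chunks.
hit = chunks_for_interval(
    chunks,
    datetime(2024, 1, 2, tzinfo=UTC),
    datetime(2024, 1, 3, tzinfo=UTC),
)
print(len(chunks), "chunks in total,", len(hit), "chunk scanned")
```

The same layout is what makes retention cheap in this sketch: dropping old data means discarding whole chunks by key rather than deleting individual rows.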

Architecture Components

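  1. Data Servers:

    • Historicals: Store immutable data segments and serve queries on them. They do not ingest data.
    • MiddleManagers: Run ingestion tasks that build segments from incoming data. These tasks also serve queries on real-time data until it is handed off to Historicals. The standalone Real-Time Nodes of older releases have been replaced by this task-based model.
  2. Query Servers:

    • Brokers: Route each query to the data servers that hold the relevant segments and merge the partial results.
  3. Master Servers:

    • Coordinator: Manages segment availability by assigning segments to Historicals and balancing them across the cluster.
    • Overlord: Accepts ingestion tasks and assigns them to MiddleManagers.
  4. External Dependencies:

    • Metadata Storage: Stores metadata about data segments and tasks, typically in a relational database.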
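
The way a Broker fans a query out to data servers and merges the partial results can be sketched as a toy scatter-gather in Python. Everything here is hypothetical: the `DataServer` and `Broker` classes, the string chunk labels, and the per-page count are stand-ins for illustration, and the sketch assumes each chunk lives on exactly one server, whereas a real cluster may hold several replicas of a segment.

```python
from collections import Counter


class DataServer:
    """Hypothetical stand-in for a data server holding rows for some time chunks."""

    def __init__(self, name, segments):
        self.name = name
        self.segments = segments  # {chunk_label: [row, ...]}

    def count_by_page(self, chunk_labels):
        """Compute a partial per-page count over the requested local chunks."""
        partial = Counter()
        for label in chunk_labels:
            for row in self.segments.get(label, []):
                partial[row["page"]] += 1
        return partial


class Broker:
    """Hypothetical broker: routes a query to servers and merges partial results."""

    def __init__(self, servers):
        self.servers = servers

    def count_by_page(self, chunk_labels):
        merged = Counter()
        for server in self.servers:
            # Only contact servers that hold at least one requested chunk.
            local = [c for c in chunk_labels if c in server.segments]
            if local:
                merged.update(server.count_by_page(local))
        return dict(merged)


server_a = DataServer("a", {"2024-01-01": [{"page": "/home"}, {"page": "/docs"}]})
server_b = DataServer(
    "b",
    {
        "2024-01-02": [{"page": "/home"}],
        "2024-01-03": [{"page": "/blog"}],
    },
)
broker = Broker([server_a, server_b])
result = broker.count_by_page(["2024-01-01", "2024-01-02"])
print(result)
```

Each server aggregates its own data first, so only small partial results travel to the broker, which is the main reason this pattern keeps query latency low as data grows.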

Data Ingestion

Druid supports multiple ways to ingest data:

  • Real-Time Ingestion: Data is ingested from streaming sources like Apache Kafka or Amazon Kinesis.
  • Batch Ingestion: Data is loaded from static files stored in HDFS, Amazon S3, or other similar storage systems.
  • Indexing Service: In both cases, the work is done by indexing tasks running on MiddleManager nodes, which create segments from the raw data.
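
As an illustration of what a streaming ingestion definition can look like, the sketch below builds a Kafka supervisor spec as a Python dictionary and serializes it to JSON. The field layout follows the Kafka supervisor spec as commonly documented, but it should be checked against the documentation for the Druid version in use. The datasource name, topic, column names, and broker address are all placeholders, and `kafka_supervisor_spec` is a hypothetical helper.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers):
    """Build a Kafka supervisor spec as a plain dict (all values are placeholders)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumed column names; adjust to the actual event schema.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["page", "country"]},
                "metricsSpec": [
                    {"type": "count", "name": "count"},
                    {"type": "longSum", "name": "bytes_sum", "fieldName": "bytes"},
                ],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",   # one time chunk per day
                    "queryGranularity": "HOUR",    # timestamps truncated to the hour
                    "rollup": True,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


spec = kafka_supervisor_spec("clicks", "clickstream", "localhost:9092")
payload = json.dumps(spec, indent=2)
print(payload)
```

The sketch stops at producing the JSON payload. In a typical deployment, a spec like this is submitted over HTTP to the Overlord's supervisor endpoint, which then starts and manages the indexing tasks.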

Querying

Druid supports a variety of query types:

  • Timeseries: Aggregates values over time intervals, returning one result per time bucket.
  • TopN: Finds the top N values for a given dimension.
  • GroupBy: Groups data by one or more dimensions and applies aggregations.
  • Search: Returns the dimension values that match a search condition, such as a substring.
  • Scan: Returns raw rows without aggregation, similar to a plain SQL SELECT statement.
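
Native queries are expressed as JSON objects. The sketch below builds a Timeseries and a TopN query as Python dictionaries. The datasource `clicks`, the `page` dimension, and the `bytes_sum` metric are assumed names, and `timeseries_query` and `topn_query` are hypothetical helpers. Field names follow the native query format as commonly documented and are worth verifying against the version in use.

```python
import json


def timeseries_query(datasource, interval, metric):
    """Hourly sum of one metric over an ISO-8601 interval."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric},
        ],
    }


def topn_query(datasource, interval, dimension, metric, n):
    """Top n values of a dimension, ranked by the summed metric."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "dimension": dimension,
        "metric": metric,      # the aggregation used for ranking
        "threshold": n,        # how many values to return
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric},
        ],
    }


interval = "2024-01-01/2024-01-02"
ts_query = timeseries_query("clicks", interval, "bytes_sum")
top_query = topn_query("clicks", interval, "page", "bytes_sum", 5)
print(json.dumps(ts_query, indent=2))
print(json.dumps(top_query, indent=2))
```

Queries of this shape are normally sent to a Broker over HTTP. The sketch only constructs the request bodies, so it runs without a cluster.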

Performance

  • Segment Caching: Historical nodes keep the segments they serve on local disk and memory-map them, so queries do not have to fetch data from remote storage.
  • Roll-Up: Pre-aggregates data during ingestion to reduce storage and speed up queries.
  • Indexing: Compressed bitmap indexes on dimension values allow efficient filtering and retrieval.
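
Roll-up and bitmap filtering can both be illustrated in a few lines of plain Python. This is a conceptual sketch, not Druid's implementation: the row layout, the hourly truncation, and the helper names are assumptions, and Python integers stand in for the compressed bitmap formats a real segment would use.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(rows, dimensions):
    """Pre-aggregate rows that share the same truncated hour and dimension values."""
    totals = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
    for row in rows:
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(row[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["bytes_sum"] += row["bytes"]
    return [
        {"timestamp": key[0], **dict(zip(dimensions, key[1:])), **agg}
        for key, agg in sorted(totals.items(), key=lambda kv: kv[0])
    ]


def build_bitmap_index(rows, dimension):
    """Map each dimension value to an int whose set bits mark matching row positions."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[dimension]] |= 1 << position
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical raw events.
raw = [
    {"timestamp": datetime(2024, 1, 1, 10, 5), "page": "/home", "country": "US", "bytes": 100},
    {"timestamp": datetime(2024, 1, 1, 10, 40), "page": "/home", "country": "US", "bytes": 250},
    {"timestamp": datetime(2024, 1, 1, 10, 55), "page": "/docs", "country": "DE", "bytes": 80},
    {"timestamp": datetime(2024, 1, 1, 11, 10), "page": "/home", "country": "DE", "bytes": 40},
]

rolled = roll_up(raw, ["page", "country"])
print(len(raw), "raw rows ->", len(rolled), "stored rows")

# Filter page == "/home" AND country == "DE" with a single bitwise AND.
page_index = build_bitmap_index(rolled, "page")
country_index = build_bitmap_index(rolled, "country")
hits = matching_rows(page_index["/home"] & country_index["DE"])
print([rolled[i] for i in hits])
```

The two events for `/home` from `US` in the same hour collapse into one stored row with `count` 2, which is the storage saving roll-up provides. The trade-off is that the individual events can no longer be recovered from the rolled-up data.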

Use Cases

  • Business Intelligence: Real-time dashboards and reporting.
  • Operational Analytics: Monitoring and analyzing operational metrics.
  • Clickstream Analytics: Analyzing web and mobile user behavior.
  • Network Performance Monitoring: Tracking and analyzing network performance data.

Integrations

Apache Druid integrates with various tools and platforms:

  • Data Ingestion: Apache Kafka, Amazon Kinesis, HDFS, Amazon S3.
  • Querying and Visualization: Apache Superset, Tableau, Grafana.
  • Data Processing: Apache Hadoop, Apache Spark, Apache Flink.

Advantages

  • Scalability: Easily scales horizontally to handle increasing data volumes.
  • Flexibility: Supports both real-time and batch data, accommodating various data ingestion needs.
  • Performance: Optimized for fast querying and high throughput.