Concept of indexing in Apache Cassandra

Last Updated : 23 Aug, 2024

Indexing in Apache Cassandra

In Apache Cassandra, data can be accessed using attributes that are part of the partition key. For example, if Emp_id is a column in an Employee table and it serves as the partition key, you can filter or search data using this key. In this case, the WHERE clause can be used to define conditions over the attribute to retrieve data efficiently.

However, if you want to filter or search using a column that is not a partition key, the query will not execute and will return an error.

You need to define an index to enable efficient data lookup using attributes other than the partition key. Indexes can be used for various purposes, such as indexing collections, static columns, and other columns except for counter columns.

When to use an Index

Large Tables: Indexes are beneficial on tables with many rows where the indexed column values are not unique.
Unique Values: Indexing is effective on columns with many unique values.
Overhead Reduction: If a table experiences overhead due to large entries, indexing can improve performance.
Query Optimization: Indexing is useful for optimizing queries and maintaining performance.

Example

Suppose you have a cricket match table with millions of entries for players across hundreds of matches. If you want to look up players' ranks by the number of matches played, and many players share the same match_year, this match_year column is a good candidate for indexing.

Syntax to create an Index

CREATE INDEX [ IF NOT EXISTS ] index_name
 ON [keyspace_name.]table_name
 ([ ( KEYS | FULL ) ] column_name) 
 (ENTRIES column_name);

Example

Consider a table in keyspace1 called Task:

CREATE TABLE keyspace1.Task 
(
 Task_id text,
 Task_name text,
 Task_time timestamp,
 T_location text,
 PRIMARY KEY (Task_id, Task_name)
);

Cassandra is a distributed and decentralized database, and data is organized by partition key. Typically, WHERE clause queries must include a partition key.

Valid Queries

SELECT * 
FROM Task 
WHERE Task_id = 'T210';

This query would work fine.

SELECT * 
FROM Task 
WHERE Task_id = 'T210' AND Task_name = 'set alarm';

Note: In above table Task_id and Task_name columns are the part of primary key.

Invalid Query

SELECT * 
FROM Task 
WHERE Task_time = '2019-09-30 15:02:56';

This query will not work because Task_time is not part of the partition key.

Error:

Bad Request: no indexed columns present in by columns clause with Equal operator.

To resolve this error, you can create an index on the clustering column. Define a table with a composite partition key, and then create an index on the clustering column.

CREATE TABLE keyspace1.Task (
 Task_id text,
 Task_name text,
 Task_time timestamp,
 T_location text,
 PRIMARY KEY ((Task_id, Task_name), Task_time) 
);

CREATE INDEX ON keyspace1.Task(Task_time);

Now, this query will work:

SELECT * 
FROM Task 
WHERE Task_time = '2019-09-30 15:02:56';

Note:

Creating secondary indexes does not necessarily increase the speed of queries in Cassandra. One of the main advantages of secondary indexes is that they allow WHERE clauses referencing columns beyond the primary and clustering keys to execute successfully.

However, for better query performance, it's often preferable to create a table specifically designed for the query pattern.

Another Example:

Consider a table Student_record in keyspace1:

CREATE TABLE Student_record 
(
 Stu_state text,
 Stu_zip text,
 Stu_address text,
 PRIMARY KEY (Stu_state, Stu_zip)
);

In this table, Stu_state and Stu_zip might be the same for multiple records. To uniquely define each record, you can add Stu_id as a primary key.

ALTER TABLE Student_record ADD Stu_id int PRIMARY KEY;

Output:

👁 Image

Table :

To describe the table structure, use

DESCRIBE TABLE Student_record;

To query data using Stu_id

SELECT * 
FROM Student_record 
WHERE Stu_id = '107';

Output:

👁 Image

Creating a Composite Partition Key

A composite partition key involves more than one column. Here's an example:

CREATE TABLE Registration (
 Name text,
 Date timestamp,
 Email text, 
 Query text,
 PRIMARY KEY((Name, Date), Email) 
) WITH CLUSTERING ORDER BY (Date DESC);

To query the table

SELECT * 
FROM Registration LIMIT 2;

Note:

When creating an index in Cassandra, each row has a row key. It's common to have denormalized data in Cassandra. However, it is a misconception that secondary indexes make queries faster. The use of parentheses is essential when specifying a composite partition key.

Comment

Article Tags:

DBMS

Databases

Apache

Explore

Basics of DBMS

ER & Relational Model

Relational Algebra

Functional Dependencies & Normalisation

Transactions & Concurrency Control

Advanced DBMS

Practice Questions

Courses

URL: https://www.geeksforgeeks.org/dbms/concept-of-indexing-in-apache-cassandra/