Skip to main content

Apache Cassandra indexing without having to say I’m sorry

Recently, there’s been a new change proposal for Cassandra indexing that attempts to reduce the tradeoff between usability and stability: Making the WHERE clause much more interesting and useful for end-users. This new method is called Storage-Attached Indexing (SAI). It’s not the flashiest name, but what do you expect? Engineers aren’t known for naming things, but cool technology is never a joke. SAI has captured the attention of the Cassandra community, but why? Indexing data is not a new concept in the database world. 

How we index our data can change over time based on the desired use cases and deployment models. Cassandra was built combining aspects of Dynamo and Big Table to reduce the complexity of read and write overhead by keeping things simple. The complexity of Cassandra has been mostly reserved to its distributed nature and as a result, created a tradeoff for developers. If you want the incredible scale of Cassandra, you have to spend the time learning how to data model. Database indexes are meant to enhance your data model and make your queries more efficient. For Cassandra, they have existed in some form since the early days of the project. The unfortunate reality is that they haven’t matched well with user requirements. Any use of indexing comes with a long list of tradeoffs and warnings to the point that they are mostly avoided and for some, just a hard no. As a result, users have learned how to data model with basic queries to get the best performance. 

Those days may be getting behind us and features like SAI are helping us get there.

Secondary indexes in distributed databases
Not all indexes are created equal. Primary indexes are also known as the unique key, or in Cassandra vocabulary, partition key. As a primary access method on the database, Cassandra utilizes the partition key to identify the node holding the data then the data file that stores the partition of data. Primary index reads in Cassandra are fairly simple but beyond the scope of this article. You can read more about them here

Secondary indexes create a completely different and unique challenge in a distributed database. Let’s look at an example table to make a few points:

Secondary indexes create a completely different and unique challenge in a distributed database. Let’s look at an example table to make a few points:

Secondary indexes create a completely different and unique challenge in a distributed database. Let’s look at an example table to make a few points:

Secondary indexes create a completely different and unique challenge in a distributed database. Let’s look at an example table to make a few points:

CREATE TABLE users (
id                           long,
firstName            text,
lastName             text,
country text,
created timestamp,
PRIMARY KEY (id)
);

A primary index lookup would be pretty simple like this:
SELECT firstName, lastName FROM users WHERE id = 100;

What if I wanted to find everyone in France? As someone familiar with SQL, you would expect this query to work:
SELECT firstName, lastName FROM users WHERE country = ‘FR’;

Without creating a secondary index in Cassandra, this query will fail. The fundamental access pattern in Cassandra is by partition key. In a non-distributed database like a traditional RDBMS, every column of the table is easily visible to the system. You can still access the column even if there is no index since they all exist in the same system and data files. Indexes in this case help reduce query time by making the lookup more efficient. 

In a distributed system like Cassandra, the column values are on each data node and must be included in the query plan. This sets up what we call the “Scatter-Gather” scenario where a query is sent to every node, data is collected, merged, and returned to the user. Even though this operation can be done over multiple nodes at once, the latency management is down to how fast the node can find the column value. 

Quick review of Cassandra data writes
You may be thinking that adding indexes is about reading data, which is certainly the end goal. However, when building a database the technical challenges on indexing are biased at the point where data is written. Accepting the data at the fastest speed while formatting the indexes in the most optimal form for reads is a huge challenge. It’s worth doing a quick review of how data is written in a Cassanda database at the individual node level. Refer to the following diagram as I explain how it works.

When data is presented to a node, which we call a mutation, the write path for Cassandra is very simple and optimized for that operation. This is also true for many other databases based on Log-Structured Merge(LSM) trees.

  1. Validate data is the correct format. Type check against the schema. 
  2. Write data into the tail of a commit log. No seeks, just the next spot on the file pointer. 
  3. Write data into a memtable, which is just a hashmap of the schema in memory. 

Done! The mutation is acknowledged when those things happen. I love how simple this is compared to other databases that require a lock and seek to perform a write. 

Later, as the memtables fill physical memory, a flush process writes out segments in a single pass on disk to a file called an SSTable (Sorted Strings Table). The accompanying commit log is deleted now that the persistence has moved to the SSTable. This process keeps repeating as data is written to the node.

Important detail: SSTables are immutable. Once they are written they never get updated, just replaced. Eventually, as more data is written, a background process called compaction merges and sorts sstables into new ones which are also immutable. There are a lot of compaction schemes, but fundamentally, they all perform this function.

You now have enough basic foundation on Cassandra so we can get sufficiently nerdy with indexes. Any further depth of information is left as an exercise for the reader

Issues with previous indexing
Cassandra has had two previous secondary indexing implementations. Storage Attached Secondary Indexing(SASI) and Secondary Indexes, which we refer to as 2i. Again, my point about engineers not being flashy with names holds up here. Secondary indexes have been a part of Cassandra from the beginning, but the implementations have made them troublesome for end-users with their long list of tradeoffs. The two main concerns we have constantly dealt with as a project are write amplification and index size on disk. As a result, they can be frustratingly tempting to new users only to have them fail later in the deployment. Let’s look at each.

Secondary Indexes (2i) — This original work in the project started out as a convenience feature for early Thrift data models. Later, as Cassandra Query Language replaced Thrift as the preferred query method for Cassandra, 2i functionality was retained with the “CREATE INDEX” syntax. If you had come from SQL, this was a really easy way to learn the law of unintended consequences. Just like in SQL indexing, the more you add the more you affect write performance. However, with Cassandra, this triggered the larger issue with write-amplification. Referring to the write path above, Secondary Indexes added a new step into the path. When a mutation on an indexed column occurs, an indexing operation is triggered that re-indexes data in a separate index file. More indexes on a table can dramatically increase disk activity in a single row write operation. When a node is taking a high amount of mutations, the result can be saturated disk activity which can make the individual nodes unstable, giving 2i the deserved guidance of “use sparingly.” Index size is fairly linear in this implementation but with re-indexing, the amount of disk space needed can be hard to plan for in an active cluster.

Storage Attached Secondary Indexing (SASI) — SASI was originally designed by a small team at Apple to solve a specific query problem and not the general problem of secondary indexes. To be fair to that team, it got away from them in a use case that it was never designed to solve. Welcome to open source everyone. The two query types that SASI was designed to address:

  • Finding rows based on partial data matching. Wildcard, or LIKE queries. 
  • Range queries on sparse data, specifically timestamps. How many records fit in a time range type queries.

It did both of those operations quite well and it also addressed the issue of write amplification with legacy 2i. As mutations are presented to a Cassandra node, the data is indexed in-memory during the initial write, much like how memtables are used. No disk activity is required on a permutation. A huge improvement on clusters with a lot of write activity. When memtables are flushed to sstables, the corresponding index for the data is flushed. Every index file written is immutable and attached to the sstable, hence the name Storage Attached. When compaction occurs, data is reindexed and written to a new file as new sstables are created. From a disk activity standpoint, this was a major improvement. The downside of SASI was primarily in the size of the indexes created. The on-disk index format caused an enormous amount of disk space used for each column indexed. This makes them very difficult to manage for operators. In addition, SASI was marked as experimental and not much has happened with regards to feature improvement. Many bugs have been found over time with expensive fixes that have brought on the discussion of whether SASI should be removed altogether. If you need the deepest dive on this feature, Duy Hai Doan did an amazing job of breaking down how SASI works. 

What makes SAI better
The first, best answer to that question is that SAI is evolutionary in nature. Engineers at DataStax realized that the core architecture of Secondary Indexing needed to be addressed from the ground up but with solid lessons that have been learned from previous implementations. Addressing the issues of write-amplification and index file size while creating a path for better query enhancements in Cassandra has been the primary mission. How does SAI address both of these topics?

Write amplification — As we learned from SASI, in-memory indexing and flushing indexes with SSTables was the right way to keep in line with how the Cassandra write-path works, while adding new functionality. With SAI, when the mutation is acknowledged, meaning fully committed, the data is indexed. With optimizations and a lot of testing, the impact on write performance has vastly improved. You should see better than a 40% increase in throughput and over 200% better write latencies over 2i. That being said, you should still plan on an increase of 2x latency and throughput on indexed tables compared to non-indexed tables. To quote Duy Hai Doan, “There is no magic,” just good engineering. 

Index size — This is the most dramatic improvement and arguably where most work has been done. If you follow the world of database internals, you know that data storage is still a lively field filled with continuously evolving improvements. SAI uses two different types of indexing schemes based on the data type. 

  • Text – Inverted indexes are created with terms broken into a dictionary. The biggest  improvement is from the use of Trie based indexing which offers much better compression which means smaller index sizes. 
  • Numeric – Utilizing a data structure called block kd-trees, taken from Lucene, which offers excellent range query performance. A separate row ID list is maintained to optimize for token order queries. 

With a strong emphasis on index storage, the result was a massive improvement in the volume vs the number of table indexes. As you can see in the graph below, the fast indexing brought by SASI was quickly eclipsed by the explosion of disk usage. Not only does it make operational planning a pain, but the index files had to be read during compaction events which could saturate disks leading to node performance issues. 

Outside of write amplification and index size, the internal architecture of SAI allows for further expansion and added functionality in the future. This is in line with project goals to be more modular in future builds. Take a look at some of the other CEPs that are pending and you can see that this is only the beginning. 

Where does SAI go from here?
DataStax has offered SAI to the Apache Cassandra project through the Cassandra Enhancement Process as CEP-7. The discussion is on now for inclusion in the 4.x branch of Cassandra.

If you want to try this now before it’s a part of the Apache Cassandra project, we have a couple places for you to go.  For operators or people who like a little more technical hands-on, you can download the latest DataStax Enterprise 6.8. If you are a developer, SAI is now enabled in DataStax Astra, our Cassandra as a Service. You can create a free-forever tier to play around with syntax and new where clause functionality. With that, learn how to use this feature by going to the Cassandra Indexing Skills Page  and included documentation

The post Apache Cassandra indexing without having to say I’m sorry appeared first on SD Times.



from SD Times https://ift.tt/2FTtMp8

Comments

Popular posts from this blog

Difference between Web Designer and Web Developer Neeraj Mishra The Crazy Programmer

Have you ever wondered about the distinctions between web developers’ and web designers’ duties and obligations? You’re not alone! Many people have trouble distinguishing between these two. Although they collaborate to publish new websites on the internet, web developers and web designers play very different roles. To put these job possibilities into perspective, consider the construction of a house. To create a vision for the house, including the visual components, the space planning and layout, the materials, and the overall appearance and sense of the space, you need an architect. That said, to translate an idea into a building, you need construction professionals to take those architectural drawings and put them into practice. Image Source In a similar vein, web development and design work together to create websites. Let’s examine the major responsibilities and distinctions between web developers and web designers. Let’s get going, shall we? What Does a Web Designer Do?

A guide to data integration tools

CData Software is a leader in data access and connectivity solutions. It specializes in the development of data drivers and data access technologies for real-time access to online or on-premise applications, databases and web APIs. The company is focused on bringing data connectivity capabilities natively into tools organizations already use. It also features ETL/ELT solutions, enterprise connectors, and data visualization. Matillion ’s data transformation software empowers customers to extract data from a wide number of sources, load it into their chosen cloud data warehouse (CDW) and transform that data from its siloed source state, into analytics-ready insights – prepared for advanced analytics, machine learning, and artificial intelligence use cases. Only Matillion is purpose-built for Snowflake, Amazon Redshift, Google BigQuery, and Microsoft Azure, enabling businesses to achieve new levels of simplicity, speed, scale, and savings. Trusted by companies of all sizes to meet

2022: The year of hybrid work

Remote work was once considered a luxury to many, but in 2020, it became a necessity for a large portion of the workforce, as the scary and unknown COVID-19 virus sickened and even took the lives of so many people around the world.  Some workers were able to thrive in a remote setting, while others felt isolated and struggled to keep up a balance between their work and home lives. Last year saw the availability of life-saving vaccines, so companies were able to start having the conversation about what to do next. Should they keep everyone remote? Should they go back to working in the office full time? Or should they do something in between? Enter hybrid work, which offers a mix of the two. A Fall 2021 study conducted by Google revealed that over 75% of survey respondents expect hybrid work to become a standard practice within their organization within the next three years.  Thus, two years after the world abruptly shifted to widespread adoption of remote work, we are declaring 20