> For the complete documentation index, see [llms.txt](https://avinandanbanerjee99.gitbook.io/system-design-notes/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://avinandanbanerjee99.gitbook.io/system-design-notes/databases/is-cassandra-a-column-oriented-database.md).

# Is Cassandra a Column Oriented Database?

Suppose we have a table **Bonuses**

```
ID      Last    First   Bonus
1       Doe     John    8000
2       Smith   Jane    4000
3       Beck    Sam     1000
```

* In a **row-oriented** database management system, the data would be stored like this: `1,Doe,John,8000 ; 2,Smith,Jane,4000 ; 3,Beck,Sam,1000;`
* In a **column-oriented** database management system, the data would be stored like this:\
  `1,2,3 ; Doe,Smith,Beck ; John,Jane,Sam ; 8000,4000,1000;`

Row-oriented databases like traditional SQL are good for OLTP applications, i.e when we want to get details of a specific row or group of rows, i.e we want most or all columns of a row/group of rows.

Column-oriented databases shine when we want OLAP or analytical operations across a huge number of rows. If we wanted the average bonus for all males, in row-oriented DBMS, we would have to scan across all rows (or even if indexed, a good amount of rows) and then filter out only the bonuses column. In column oriented databases, we have direct access to the Bonus column. However, Inserts and Updates are slower compared to Row-oriented databases, unlike adding a single tuple at the end in OLTP, we have to added a new entry at the end of each column.

Column-oriented databases compress the data better too, as each entry (the column) is of the same datatype.

Notice that this specification of column-oriented DB is still **relational**. Eg. - Sybase IQ

Cassandra is a **column-family** database. It stores as following

```
"Bonuses" : {
     row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
     row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
     ...
}
```

Note that row1 could have had an extra column, "Extra Bonus", and row2 could have omitted "Last"

So each row contains a group of columns,  a column-family. It is possible to access each column family quickly for each row, i.e unlike column-oriented DBs, **columns are not independent**.

The storage layer also varies in Cassandra. As columns may be missing from rows, each row can have a different column family. Storing NULLs for missing columns like in column oriented DBs would result in most of the database full of NULLs. Thus we have very sparse data model, which stores the row name, column name, or both along with the data entry.

So Cassandra is essentially row-oriented, but treats columns as a first class citizen.
