Big data – From Modeling to Ingesting (1)

In this article, we take a closer look at the importance of data modeling. I will also share some tips and tricks to help you improve your data models and make your next project rock! Most of the terms mentioned here come up in data engineering interviews.

I have divided this article into three main topics as below:

  1. Data Modeling Overview
  2. RDBMS Data Modeling
  3. NoSQL Data Modeling

Remember, the critical thinking you employ when modeling data is a skill that is, and should be, independent of the type of database you’re working with. Once you have this skill, you’ll be able to transfer it to any database. The skill is universal, but data models should be carefully reviewed before any kind of blind copy-paste between systems.

Data Modeling Overview

What is data modeling? Data modeling is an abstraction that organizes elements of data and how they relate to each other. It’s an iterative process: most data engineers continually reorganize, restructure, and optimize data models to fit the needs of the organization.

Let’s make things simple: who hasn’t made a spreadsheet to organize their bills or a trip? Do you remember the process you went through to create that first set of rows and columns? That is data modeling. The data model is critical for accurately representing each data object.

Consider it the final state of your data, its home. The process of data modeling is organizing data into a database system to ensure that it is persisted, easily usable by you and your organization, and that its connections are clear.

Who does data modeling?

Well, almost anyone involved in analyzing and using data should learn how to model it (e.g., data scientists, software engineers, and data engineers).

Data Modeling vs Machine Learning Modeling

Machine learning involves a lot of data wrangling to create the inputs for machine learning models, but data modeling is more about how to structure data so it can be used by different people within an organization.

Data Modeling Types

Before starting work on any data model, you should ask yourself a few questions about your data. Below are the main types of data models.


  1. Online Analytical Processing (OLAP): Databases optimized for workloads that allow for complex analytical and ad hoc queries, including aggregations. These types of databases are optimized for reads. Example: an OLAP query will tell you how many shoes were sold last week in a particular store.

  2. Online Transactional Processing (OLTP): Databases optimized for workloads that allow for less complex queries in large volume. The types of queries for these databases are read, insert, update, and delete. Example: an OLTP query will tell you the price of the shoes.

This Stack Overflow post describes the difference well.
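To make the OLAP/OLTP distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical shoe-store `sales` table (the table, columns, and data are made up for illustration):

```python
import sqlite3

# Hypothetical shoe-store table to illustrate the two workload styles.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        store TEXT,
        product TEXT,
        price REAL,
        sold_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("downtown", "sneaker", 59.99, "2023-05-01"),
        ("downtown", "sneaker", 59.99, "2023-05-02"),
        ("uptown",   "boot",    89.99, "2023-05-02"),
    ],
)

# OLTP-style query: a simple, targeted read ("what is the price of the boot?").
price = conn.execute(
    "SELECT price FROM sales WHERE product = 'boot' LIMIT 1"
).fetchone()[0]

# OLAP-style query: an aggregation over many rows
# ("how many sneakers were sold per store?").
per_store = conn.execute(
    "SELECT store, COUNT(*) FROM sales WHERE product = 'sneaker' GROUP BY store"
).fetchall()

print(price)      # 89.99
print(per_store)  # [('downtown', 2)]
```

The same table serves both query styles here only because it is tiny; at scale, the two workloads are usually served by differently optimized systems.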

RDBMS Data Modeling – Important terms and facts

Let’s agree that all information in a relational database is represented explicitly at the logical level and in exactly one way: by values in tables. Once data is transformed into the rows-and-columns format, it is standardized and you can query it with SQL. Data systematically stored and modeled in tabular format: that is a relational database.

When to use RDBMS?

  • Need of ACID transactions.
  • Will use many joins.
  • Will do many aggregations and analytics.
  • Small data volumes, where simplicity matters.
  • Easily changing business requirements.

Note: some NoSQL databases offer a form of ACID transactions. As of v4.0, MongoDB added multi-document ACID transactions within a single replica set.

Normalization and Denormalization

Have you heard about the terms Normalization and Denormalization before? Well they are the two biggest concepts of RDBMS data modeling. I will give a brief intro here about these two important concepts; however, I suggest trying it yourself with many examples you can find on web.

Normalization: trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be changed in as few places as possible.

Denormalization: trying to increase read performance by reducing the number of joins between tables (as joins can be slow). Data integrity takes a potential hit, as there will be more copies of the data (to reduce joins).
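The trade-off can be sketched with sqlite3; the table and column names below are hypothetical, chosen only to show both shapes side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customer data lives in exactly one place; orders reference it
# by key, so an address change is a single UPDATE in one table.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL,
                         FOREIGN KEY (customer_id) REFERENCES customers(id));

    -- Denormalized: the customer's city is copied onto every order row,
    -- avoiding the join at read time at the cost of duplicated data.
    CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, customer_name TEXT,
                                customer_city TEXT, amount REAL);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Sam', 'Cairo')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
conn.execute("INSERT INTO orders_denorm VALUES (10, 'Sam', 'Cairo', 25.0)")

# The normalized read requires a join...
joined = conn.execute("""
    SELECT c.city, o.amount FROM orders o
    JOIN customers c ON c.id = o.customer_id
""").fetchone()

# ...while the denormalized read is a single-table lookup.
flat = conn.execute(
    "SELECT customer_city, amount FROM orders_denorm"
).fetchone()

print(joined, flat)  # ('Cairo', 25.0) ('Cairo', 25.0)
```

Both reads return the same answer; the difference is where the copies live and what it costs to keep them consistent when the city changes.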

Some tips to better model your data
  1. Star schema: the simplest style of data mart schema. It consists of one or more fact tables referencing any number of dimension tables, and it is one of the most widely used schemas in the industry (the layout resembles a star, hence the name).

Facts and dimensions tables:

Let’s start with an example: imagine we have a database for a store. The following questions give a heads-up about what fact and dimension tables are:

Dimension tables provide the following information:

  • Where the product was bought?
  • When the product was bought?
  • What product was bought?

The fact table provides the metrics of the business process:

  • How many units of products were bought?

While fact and dimension tables are not created differently in the DDL, the conceptual distinction between them is extremely important for organization.
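A minimal star schema for the store example above might look like the following sqlite3 sketch (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables answer the "where / when / what" questions...
conn.executescript("""
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

    -- ...while the fact table holds the business metric (units sold)
    -- plus foreign keys pointing at each dimension.
    CREATE TABLE fact_sales (
        store_id INTEGER, date_id INTEGER, product_id INTEGER,
        units_sold INTEGER
    );
""")
conn.execute("INSERT INTO dim_store VALUES (1, 'Cairo')")
conn.execute("INSERT INTO dim_date VALUES (1, '2023-05-01')")
conn.execute("INSERT INTO dim_product VALUES (1, 'sneaker')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 3)")

# A typical star-schema query: join the fact table out to its dimensions.
row = conn.execute("""
    SELECT s.city, d.day, p.name, f.units_sold
    FROM fact_sales f
    JOIN dim_store s   ON s.store_id = f.store_id
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
""").fetchone()
print(row)  # ('Cairo', '2023-05-01', 'sneaker', 3)
```

Note how every join goes from the central fact table straight out to one dimension; that one-hop shape is what gives the schema its star layout.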

Read also: Snowflake Schema

NoSQL Data Modeling (with a focus on Apache Cassandra)

Remember: you must know your queries in advance and model your data based on those queries. You can’t run ad hoc queries the way you can in an RDBMS.
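This "one table per query" mindset can be sketched with plain Python dicts standing in for tables; the data and names here are made up for illustration:

```python
# Query-first design: instead of joining, you duplicate data into one
# structure per query pattern, each keyed by that query's lookup column.

songs = [
    {"title": "Song A", "artist": "X", "year": 2020},
    {"title": "Song B", "artist": "Y", "year": 2020},
]

# Query 1: "give me all songs for a year" -> a table keyed by year.
songs_by_year = {}
for s in songs:
    songs_by_year.setdefault(s["year"], []).append(s["title"])

# Query 2: "give me all songs for an artist" -> a second table, same data.
songs_by_artist = {}
for s in songs:
    songs_by_artist.setdefault(s["artist"], []).append(s["title"])

print(songs_by_year[2020])   # ['Song A', 'Song B']
print(songs_by_artist["X"])  # ['Song A']
```

In Cassandra this translates to literally creating two tables with different primary keys for the two queries; the duplication is a feature, not a mistake.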

When not to use RDBMS? When to use a NoSQL Database?

  • Large amounts of data.
  • Need to store different types of formats - relational databases are not designed to handle unstructured data.
  • Need of fast reads.
  • Need a flexible schema – a flexible schema allows columns to be added as needed.
  • Need horizontal scalability.
  • Need high availability: because relational databases are not distributed (and even when they are, they use a coordinator/worker architecture), they have a single point of failure. When that database goes down, failing over to a backup system takes time.

Distributed database design

In a distributed database, in order to have high availability, you need copies of your data. It’s not just a single system but a database made up of multiple machines (e.g., Teradata on the RDBMS side and Apache Cassandra on the NoSQL side).
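The "copies of your data" idea can be sketched as a toy replica-placement function, loosely modeled on a token ring; the node names and replication factor are hypothetical:

```python
# Toy sketch of replica placement in a distributed database: each row is
# stored on REPLICATION_FACTOR consecutive nodes of a (simplified) ring,
# so losing one node does not lose the data.

NODES = ["node-a", "node-b", "node-c"]
REPLICATION_FACTOR = 2

def replicas(partition_key: str) -> list:
    # Pick a starting node from the key, then take RF consecutive nodes.
    start = hash(partition_key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

homes = replicas("user-42")
print(len(set(homes)))  # 2 distinct nodes each hold a copy
```

Real systems add vnodes, consistency levels, and repair on top of this, but the core availability mechanism is exactly this duplication across machines.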

What is Apache Cassandra?

Apache Cassandra provides scalability and high availability without compromising performance. All kinds of companies use it; for example, Uber uses Apache Cassandra for their entire backend. Good use cases for NoSQL (and more specifically Apache Cassandra) are:

  1. Transaction logging (retail, healthcare)
  2. Internet of things IoT (sensors)
  3. Time series data

NoSQL: Read about Primary Key, Partition Key, and Clustering Column
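As a starting point, here is a toy model of how those three concepts relate, for a table declared roughly as `PRIMARY KEY ((user_id), event_time)`; the node names, hashing, and data are simplified stand-ins, not Cassandra's actual internals:

```python
from collections import defaultdict

# For PRIMARY KEY ((user_id), event_time):
#   - user_id is the partition key: it decides which node stores the row;
#   - event_time is the clustering column: it defines the sort order of
#     rows *within* a partition.

NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str) -> str:
    # Stand-in for Cassandra's token ring: hash the partition key to a node.
    return NODES[hash(partition_key) % len(NODES)]

partitions = defaultdict(list)
rows = [
    ("alice", 3, "logout"),
    ("alice", 1, "login"),
    ("bob",   2, "login"),
]
for user_id, event_time, action in rows:
    partitions[user_id].append((event_time, action))

# Within each partition, rows are kept ordered by the clustering column.
for user_id in partitions:
    partitions[user_id].sort()

print(partitions["alice"])  # [(1, 'login'), (3, 'logout')]
# The same partition key always lands on the same node:
print(node_for("alice") == node_for("alice"))  # True
```

This is why the partition key must appear in your WHERE clause: without it, the database would not know which node to ask.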

In my next article, I will go through some important tips you will need when ingesting data from RDBMS / NoSQL databases into a big data ecosystem.
