Using SAP Analytics Cloud

May 9, 2021

DOING DATA SCIENCE FROM SCRATCH TASK BY TASK

A tool that got me excited

Recently, I was pleasantly surprised by SAP Analytics Cloud. I was getting an overview of the product while watching a classic Formula 1 race, one in which Mika Hakkinen was utterly brilliant. If you are an F1 fan watching a golden oldie, you know that something must be pretty special to catch your eye. That is my view of SAP Analytics Cloud. It is a tool that got me excited for SAP customers and users who want to combine B.I. with Machine Learning powered predictions, and it also does Financial Planning and Modelling. If you know the industry offerings, you will know that you usually need many different tools to do such things. SAP Analytics Cloud gives it all to you in one place, backed by the HANA data appliance.

So why did I get excited about a product during a classic emotional F1 race? Read on, and let me explain! If you haven’t read my work before, you should know that I love building things from scratch and therefore have a heightened awareness of databases, software development, and Cloud hosting. Data Science people will probably not use SAC, but I suspect the no-code fans will be super excited, if not spoiled by the array of available products in this space.

What is SAP Analytics Cloud?

As I mentioned earlier, SAP Analytics Cloud (SAC) is a bundle of business tools. SAP puts it well: "Get all the cloud analytics features you need – business intelligence (B.I.), augmented analytics, predictive analytics, and enterprise planning – in a single solution." So SAC is a complete analytics environment hosted in the SAP cloud using HANA technology.

SAC is cloud-based, and you can sign up for a free trial and get hands-on, which is what I did! I used my Mac Mini M1 (8 GB of RAM) with the Google Chrome browser for the trial.

What is HANA, and why is it important?

HANA is an in-memory database designed by SAP. SAP defines HANA as "a database that accelerates real-time, data-driven decisions." Traditional database products were tremendous inventions, but they came with snags. Those products include:

DB2 – a family of data management products, including database servers, developed by IBM
MySQL – an open-source relational database management system (RDBMS)
PostgreSQL – a powerful, open-source object-relational database system
Oracle – a database commonly used for running online transaction processing (OLTP), data warehousing (D.W.) and mixed (OLTP & D.W.) database workloads.

Typically, these products manifested as a database server with lots of physical storage disks and core database software, all running in a silo. DW experts worked on these systems, which would be optimized either for transaction processing (Online Transaction Processing, OLTP) or for analytics (Online Analytical Processing, OLAP, or data warehousing, D.W.) but rarely for both (mixed OLTP and D.W. workloads). User experience (responsiveness) was influenced by:

The power and performance of the bare metal server used as the host with the memory, storage drive speed, and I/O port throughputs all throttling user performance.
The location of the server and network bandwidth availability
The enterprise workload levels, and generally bad user queries!

Therefore, performance and user query response times could vary, which tended to force batch or offline processing of workloads. Connecting products like Tableau, IBM Cognos, Power BI, Looker, and many others over TCP/IP with ODBC or another protocol was then a potentially horrific experience if things weren’t hyper-organized. The BI server connecting to the data server for each request was pretty heavy, with lots of processing overhead.

HANA is different. It was designed from scratch to eliminate these problems and combines data held in large amounts of expensive RAM with indexing and persistence on solid-state drives. HANA is extremely fast and addresses the bottlenecks and overhead of the traditional SQL databases. Figure 1, below, is the SAP architecture view.

Figure 1 – The HANA system from SAP’s publicly available documentation. Image captured by the author from this site.

The concept of in-memory database technology isn’t new. My previous work always included a Redis database for caching. I always found incredible performance improvements in my web applications when Redis in-memory caching supported MongoDB or Postgres long-term persistence.

In fact, my first experience with in-memory databases and database accelerators was with Netezza. Netezza produced "high-performance data warehouse appliances and advanced analytics applications for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning." – Wikipedia. Netezza was renamed IBM PureData, and like HANA, it is speedy. Connecting Tableau, Power BI, Looker and the other B.I. tools to PureData provided tremendous performance over the traditional model, and it is a powerful data warehousing solution. Naturally, the downside was the need for Extract Transform Load (ETL) strategies, leading to an entire industry of DW/ETL specialists.

Analytics Cloud

Having discussed HANA and why we care about it, we can see that Analytics Cloud is an SAP hosting environment that combines HANA technology and SAP’s analytics software in one integrated environment. When we sign up for SAC, we get a tenancy in an environment preconfigured for optimum performance out of the box.

Having signed up for an account, I decided to go for a test drive. Since I doubt I will ever get a Formula 1 test drive, I had to settle for the prospect of a blazing fast experience from SAP.

The test drive

I suppose, in my emotional state with F1 classics, my mind drifted back to the Titanic and that now-famous learner’s dataset. Figure 2 is an image from NOAA of the classic lady in her resting place. "A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic." – from IMDB. And what about those haunting lyrics of the theme song, My Heart Will Go On? What an emotional rollercoaster!

Figure 2: Photo by NOAA on Unsplash

The Titanic dataset is used in many machine learning courses; indeed, I have often used it myself. Having signed into SAC, I created a new data folder, as shown in Figure 3.

Figure 3: Image by the author of SAC folder – ‘Titantic’

Next, I imported the training and test files I retrieved from Kaggle, as shown in Figure 4.

Figure 4: Author’s screenshot of imported Titanic data files.

Clicking the train.csv file, we can see how SAC works. We have measures and dimensions. The ‘measures’ are the numbers, whilst ‘dimensions’ are the categories and other referential data. I added Figure 5 to demonstrate the view. You will notice that anything that looks like a number is assumed to be a measure, so Age, Parch, SibSp and others might have to be changed to dimensions.

Figure 5 – The fields in our train data set.
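To make the measure/dimension idea concrete, here is a minimal Python sketch of the same kind of guess: a column is treated as a measure when every non-empty value parses as a number, and you then override the columns that are really categories. This is not how SAC implements it. The sample rows, the function names, and the choice of which columns to override are all my own assumptions for illustration.

```python
import csv
import io

# Hypothetical rows shaped like Kaggle's train.csv; the values are illustrative only.
SAMPLE = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
1,0,3,male,22,1,0,7.25
2,1,1,female,38,1,0,71.28
3,1,3,female,,0,0,7.93
"""


def looks_numeric(value):
    """Return True if the text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def guess_roles(csv_text):
    """Label a column 'measure' when every non-empty value is numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    roles = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column] != ""]
        numeric = bool(values) and all(looks_numeric(v) for v in values)
        roles[column] = "measure" if numeric else "dimension"
    return roles


def apply_overrides(roles, dimensions):
    """Force the named columns to be dimensions and leave the rest alone."""
    return {
        column: ("dimension" if column in dimensions else role)
        for column, role in roles.items()
    }


guessed = guess_roles(SAMPLE)
# Assumed overrides: codes and small counts that behave like categories.
corrected = apply_overrides(guessed, {"Survived", "Pclass", "SibSp", "Parch"})
print(guessed)
print(corrected)
```

The naive guess marks Pclass and Survived as measures even though they are really codes, which is the behaviour I ran into with the file import.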

As I mentioned, Titanic is a very well used and studied data set. Here on Towards Data Science, I found a previous article by Niklas Donges.

Predicting the Survival of Titanic Passengers

Niklas provided a handy description of each field, as shown in Figure 6.

Figure 6 – Screenshot of a table from Niklas Donges, Towards Data Science

It turned out that going straight from a file import to a Story (dashboard) has some limitations, which left me unable to update the data types. To control the data types, you need to build a model. Using a model, I was able to make better progress. Figure 7 gives a view of the model.

Figure 7: A screenshot from SAC showing the model dimensions. Image by the author

For my purpose, I made little change to the defaults, but in an actual model, you would need to challenge each column and ensure it has the correct data type. With a model built, I was able to create some visuals.

Visuals

Figure 8 shows a stacked bar chart with the Trellis function. We can see a chart showing Female and Male, survival (0 = false, 1 = true), distribution by age group, and numbers of individuals. According to the data, most of the women and children got off, but few male passengers survived.

Figure 8: A stacked bar chart of who was on the Titanic by the author.

Figure 9 adds the ticket class to see how the social classes fared.

Figure 9: An additional stacked Trellis chart adding ticket class.

I suppose it is well known by now that passengers in 1st and 2nd class did better. Figure 9 shows that almost all female passengers from 1st and 2nd class got off, whilst only some in 3rd class did. It seems more adult males got off from 3rd class than from 2nd or 1st class.

Despite searching, I was unable to find a way to add a label for the legend; the title should be Survival. The imputation of missing fields is a bit clunky, relying on a simple change function based on a formula. I guess you could try to do mean imputation. Grouping or binning categorical values appears to be done by a ‘formula’ in the model during the file import process. I had to manually group the age dimension into adult, teen, child, and retired based on arbitrary cutoff values.
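For readers who prefer to see the two steps written out, here is a small Python sketch of mean imputation followed by age binning. It is not the SAC formula I used. The cutoffs in AGE_BANDS, the function names, and the sample ages are assumptions made up for this example, and the cutoffs are just as arbitrary as mine were.

```python
from statistics import mean

# Hypothetical cutoffs: (exclusive upper bound, label). Purely arbitrary.
AGE_BANDS = [(13, "child"), (20, "teen"), (65, "adult")]


def impute_mean(ages):
    """Replace missing ages (None) with the mean of the known ages."""
    known = [age for age in ages if age is not None]
    fill = mean(known)
    return [fill if age is None else age for age in ages]


def age_group(age):
    """Map an age to the first band whose upper bound it falls below."""
    for upper, label in AGE_BANDS:
        if age < upper:
            return label
    return "retired"


# Made-up ages for illustration, with one missing value.
ages = [4, 16, None, 40, 70]
filled = impute_mean(ages)
groups = [age_group(age) for age in filled]
print(filled)
print(groups)
```

Note that mean imputation pushes every missing passenger into whichever band contains the average age, so it can inflate that group in a chart like Figure 8.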

Machine Learning

SAP SAC has a limited Machine Learning environment. Figures 10 and 11 show the training results for a classification model.

Figure 10: Training results for a classification model. Image by the author

Figure 11: The confusion matrix showing classification accuracy. 6% false negative/positive.
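If the confusion matrix is new to you, the sketch below shows how its four cells and the overall accuracy are counted for a binary classifier such as survived (1) versus did not survive (0). The labels and predictions are invented for the example and have nothing to do with the SAC results in Figure 11; the function names are my own.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for 0/1 labels."""
    pairs = Counter(zip(actual, predicted))
    return {
        "tp": pairs[(1, 1)],  # survived, predicted survived
        "tn": pairs[(0, 0)],  # did not survive, predicted did not survive
        "fp": pairs[(0, 1)],  # did not survive, predicted survived
        "fn": pairs[(1, 0)],  # survived, predicted did not survive
    }


def accuracy(cells):
    """Share of predictions that were correct."""
    return (cells["tp"] + cells["tn"]) / sum(cells.values())


# Invented labels for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cells = confusion_counts(actual, predicted)
print(cells, accuracy(cells))
```

The false positive and false negative shares are simply the fp and fn counts divided by the total number of rows.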

I was able to apply the model and get a new output. It wasn’t clear how to deploy the ML model to the data model to get continuous inference, but I didn’t spend a lot of time on the feature. For now, the Machine Learning options are limited to those in Financial Analysis, as shown in Figure 12.

Figure 12: The currently available machine learning algorithms.

Financial Planning

SAC offers two types of data models:

An analytical model for the traditional BI use case
A Planning Model for the multi-dimensional financial planning use cases typically using products like TM1 with Excel add-in.

The trial license did not allow me to make progress on the Planning model functionality, but based on the overview I took, it is awe-inspiring. If you want to dip your toe in the water and be inspired like me, I left some tips and links in the Inspiration section at the end of this document.

The test drive is over

The test drive is over. The product is shiny and new and is very tempting. As with most large purchases, though, a 20-minute test drive is not going to convince you to make a large financial investment.

Photo by Zakaria Zayane on Unsplash

With any analytics tool, you would need to sit down and plan a data model, implement that model, and then take a real test drive on your data volumes with user-defined use cases. Go ahead and do a proof of concept and demonstrate the value to yourself.

Photo by Scott Graham on Unsplash