DOING DATA SCIENCE FROM SCRATCH TASK BY TASK
A tool that got me excited
Recently I was pleasantly surprised by SAP Analytics Cloud. I was getting an overview of it while watching a classic Formula 1 race, one in which Mika Hakkinen was utterly brilliant. If you are an F1 fan watching a golden oldie, you know that something must be pretty special to catch your eye. Indeed that is my view of SAP Analytics Cloud. It is a tool that got me excited for SAP customers and users who want to combine B.I. with Machine Learning powered predictions, and it also does Financial Planning and Modelling. If you know the industry offerings, you will know that you usually need many different tools to do such things. SAC gives it all to you in one place, backed by the HANA data appliance.
So why did I get excited about a product during a classic emotional F1 race? Read on, and let me explain! If you haven’t read my work before, you should know that I love building things from scratch and therefore have a heightened awareness of databases, software development, and Cloud hosting. Data Science people will probably not use SAC, but I suspect the no-code fans will be super excited, if not spoiled by the array of available products in this space.
What is SAP Analytics Cloud?
As I mentioned earlier, SAP Analytics Cloud (SAC) is a bundle of business tools. SAP puts it ideally: "Get all the cloud analytics features you need – business intelligence (B.I.), augmented analytics, predictive analytics, and enterprise planning – in a single solution." So SAC is a complete environment for analytics hosted in the SAP cloud using HANA technology.
SAC is cloud-based, and you can sign up for a free trial and get hands-on, which is what I did! I used my Mac Mini M1 (8 gigs of RAM) with the Google Chrome browser for the trial.
What is HANA, and why is it important?
HANA is an in-memory database designed by SAP. SAP defines HANA as "a database that accelerates real-time, data-driven decisions." Traditional database products were tremendous inventions, but they came with snags. Those products include:-
- DB2 – a family of data management products, including database servers, developed by I.B.M.
- MySQL – an open-source relational database management system (RDBMS)
- PostgreSQL – a powerful, open-source object-relational database system
- Oracle – a database commonly used for running online transaction processing (OLTP), data warehousing (D.W.) and mixed (OLTP & D.W.) database workloads.
Typically, these products manifested as a database server with lots of physical storage disks and core database software, all running in a silo. DW experts worked on these systems, which would be optimized either for transaction processing (Online Transaction Processing, OLTP) or for analytics (Online Analytical Processing, OLAP, or data warehousing, D.W.), but rarely for both at once. User experience (responsiveness) was influenced by:-
- The power and performance of the bare metal server used as the host with the memory, storage drive speed, and I/O port throughputs all throttling user performance.
- The location of the server and network bandwidth availability
- The enterprise workload levels, and generally bad user queries!
Therefore performance and user query response times could vary, which tended to force batch or offline processing of workloads. Connecting products like Tableau, IBM Cognos, Power BI, Looker, and many others over TCP/IP with ODBC or another protocol was then a potentially horrific experience if things weren’t hyper-organized. Each request from the BI server to the data server was pretty heavy, with lots of processing overhead.
HANA is different. It was designed from scratch to eliminate these problems and uses a combination of data held in large amounts of expensive RAM, indexing, and persistence on solid-state drives. HANA is extremely fast and is designed to remove much of the bottlenecking and overhead of the traditional SQL databases. Figure 1, below, is the SAP architecture view.
The concept of in-memory database technology isn’t new. My previous work always included a Redis database for caching. I always found incredible performance improvements in my web applications when Redis in-memory caching sat in front of MongoDB or Postgres for long-term persistence.
In fact, my first experience with in-memory/database accelerators was with Netezza. Netezza made "high-performance data warehouse appliances and advanced analytics applications for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning." – Wikipedia. Netezza was renamed IBM PureData, and like HANA, it is speedy. Connecting Tableau, Power BI, Looker and the other B.I. tools to PureData provided tremendous performance over the traditional model, and it is a powerful data warehousing solution. Naturally, the downside was the need for Extract Transform Load (ETL) strategies, leading to an entire industry of DW/ETL specialists.
Analytics Cloud
Having discussed HANA and why we care about it, the Analytics Cloud is then an SAP-hosted environment that integrates HANA technology with SAP’s analytics software. When we sign up for SAC, we get a tenancy in an environment preconfigured for optimum performance out of the box.
Having signed up for an account, I decided to go for a test drive. Since I doubt I will ever get a Formula 1 test drive, I had to settle for the prospect of a blazing fast experience from SAP.
The test drive
I suppose, in my emotional state with F1 classics, my mind drifted back to the Titanic and that now-famous learner’s dataset. Figure 2 is an image from NOAA of the classic lady in her resting place. "A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic." – from IMDB. What about those haunting lyrics of the theme song, My Heart Will Go On? What an emotional rollercoaster!
The Titanic dataset is used in many machine learning courses; indeed, I have often used it myself. Having signed into SAC, I created a new data folder, as shown in Figure 3.
Next, I imported a training and a test file that I retrieved from Kaggle, as shown in Figure 4.
Clicking the train.csv file, we can see how SAC works. We have measures and dimensions. The ‘measures’ are the numbers, whilst ‘dimensions’ are the categories and other referential data. I added Figure 5 to demonstrate the view. You will notice that anything that looks like a number is assumed to be a measure. So Age, Parch, SibSp and others might have to be changed to dimensions.
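The "anything that looks like a number is a measure" behaviour can be imitated outside SAC. The sketch below is my own approximation of that rule, not SAP's actual logic: it uses the Kaggle column names, but the three rows are made up for illustration, and the override list beyond Age, Parch and SibSp is an assumption about which "others" you would reclassify.

```python
import csv
import io

# Hypothetical rows using the Kaggle column names; these are not real passenger records.
SAMPLE = """Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,male,30,1,0,8.05
1,1,female,41,0,1,80.0
1,2,female,,0,0,13.0
"""


def is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_roles(rows):
    """Naive rule: a column is a 'measure' if every non-empty value parses as a number."""
    roles = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] != ""]
        numeric = bool(values) and all(is_number(v) for v in values)
        roles[column] = "measure" if numeric else "dimension"
    return roles


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inferred = infer_roles(rows)

# Manual corrections: numeric codes and small counts that should be treated as categories.
# Survived and Pclass are my assumption; the article only names Age, Parch and SibSp.
OVERRIDES = {name: "dimension" for name in ("Survived", "Pclass", "Age", "SibSp", "Parch")}
roles = {**inferred, **OVERRIDES}
```

With the overrides applied, Fare is the only column left as a measure, which is roughly the review you have to do by hand on each column in the tool.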
As I mentioned, Titanic is a well-used and well-studied dataset. Here on Towards Data Science, I found a previous article by Niklas Donges.
Niklas provided a handy description of each field, as shown in Figure 6.
It turned out that going straight from a file import to a Story (dashboard) has some limitations, which left me unable to update the data types. To control the data types, you need to build a model. Using a model, I was able to make better progress. Figure 7 gives a view of the model.
For my purpose, I made little change to the defaults, but in an actual model, you would need to challenge each column and ensure it has the correct data type. With a model built, I was able to create some visuals.
Visuals
Figure 8 shows a stacked bar chart with the Trellis function. We can see a chart showing Female and Male, survival (0 = false, 1 = true), distribution by age group, and numbers of individuals. According to the data, most of the women and children got off, but few male passengers survived.
Figure 9 adds the ticket class to see how the social classes fared.
I suppose it is well known now that passengers in 1st and 2nd class did better. Figure 9 shows that almost all female passengers from 1st and 2nd class got off, whilst only some in 3rd class did. It seems more adult males in the 3rd class got off than from the 2nd or 1st class.
Despite searching, I was unable to find a way to add a label for the legend; the title should be Survival. The imputation of missing fields is a bit clunky, relying on a simple change function based on a formula. I guess you could try to do mean imputation. Grouping or binning categorical values appears to be done by a ‘formula’ in the model during the file import process. I had to manually group the age dimension into adult, teen, child, and retired based on arbitrary cutoff values.
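For comparison, this is roughly what mean imputation and age binning look like in plain Python. It is a sketch, not the SAC formula: the ages are hypothetical, and the cutoffs (13, 20 and 65) are arbitrary values chosen for illustration, not necessarily the ones I used in the model.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [4, 15, 30, None, 70, None, 41]


def impute_mean(values):
    """Replace missing values with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Arbitrary cutoffs for illustration: (exclusive upper bound, label), checked in order.
BINS = [(13, "child"), (20, "teen"), (65, "adult")]


def age_group(age):
    """Map an age to the first group whose upper bound it falls below."""
    for upper, label in BINS:
        if age < upper:
            return label
    return "retired"


filled = impute_mean(ages)
groups = [age_group(a) for a in filled]
```

Note that mean imputation pushes every missing age into whichever group contains the average, so the grouped chart will overstate that group. That is worth keeping in mind when reading the visuals.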
Machine Learning
SAC has a limited Machine Learning environment. Figures 10 and 11 show the training results for a classification model.
I was able to apply the model and get a new output. It wasn’t clear how to deploy the ML model to the data model to get a continuous inference, but I didn’t spend a lot of time on the feature. The Machine Learning options are limited to those in Financial Analysis just now, as shown in Figure 12.
Financial Planning
SAC offers two types of data models. Those are:-
- An analytical model for the traditional BI use case
- A planning model for the multi-dimensional financial planning use cases that typically rely on products like TM1 with an Excel add-in.
The trial license did not allow me to make progress on the planning model functionality, but it looks awe-inspiring based on the overview I took. If you want to dip your toe in the water and be inspired like me, I have left some tips and links in the Inspiration section at the end of this article.
The test drive is over
The test drive is over. The product is shiny and new, and it is very tempting. As with most large purchases, though, a 20-minute test drive is not going to convince you to make a large financial investment.
With any analytics tool, you would need to sit down and plan a data model, implement that model, and then take a real test drive on your data volumes with user-defined use cases. Go ahead and do a proof of concept and demonstrate the value to yourself.
Inspiration
If you want to see a little more of SAP Analytics Cloud, you could use the following resources.
Udemy
SAP training
https://training.sap.com/course/sac01-introduction-to-sap-analytics-cloud-classroom-026-ie-en/?
https://training.sap.com/trainingpath/Analytics-SAP+Analytics+Cloud-SAP+Analytics+Cloud