About Preparing Datasets for Predictive Scenarios

Predictive Scenarios supports acquired datasets and live dataset. However, datasets for predictive scenarios must have a certain structure and must contain some mandatory information depending on the type of predictive scenario you are creating and where you are in the modeling process.

To create Predictive Scenarios, you can use one of the following types of input datasets:

Acquired: Data is imported (copied) and stored in SAP Analytics Cloud. Acquired dataset have already been prepared on your computer (supported formats are .TXT, .CSV and .XLSX ).
Live: Data is stored in the source system. It isn't copied to SAP Analytics Cloud, so any changes in the source data are available immediately if no structural changes are brought to the table or SQL view. You can connect to live data and create a live dataset.

Dataset Prerequisites

Depending on the type of predictive model that you are creating, you need to provide some essential information for the input dataset.

Classification and Regression Predictive Model

Your training dataset contains at least one column for the target variable, and multiple columns for other variables that you think may have an influence on the target variable. These variables are called Influencers.
The datasets used to train and apply a predictive model must come from the same type of data source (acquired or live). You can't apply a predictive model on a live dataset if it was trained with an acquired dataset, nor can you apply a predictive model on an acquired dataset if it was trained using a live one. However, you can have several predictive models trained and applied with live and acquired datasets in the same predictive scenario.
Note
While using live datasets, both live datasets (training and apply datasets) must come from the same SAP HANA system: you cannot train a predictive model with a live dataset with data from SAP HANA system 1 and then apply this predictive model on a live dataset with data coming from SAP HANA system 2.

Times Series Forecasting Predictive Model

Your training dataset contains a column for the signal variable. This contains the value that you want to forecast. The signal variable has to be continuous, with no missing values.
Your dataset contains a column for the date variable.
Note
The date formats must be:
- YYYY-MM-DD
- YYYY/MM/DD
- YYYY/MM-DD
- YYYY-MM/DD
- YYYYMMDD
- YYYY-MM-DD hh:mm:ss
where YYYY stands for the year, MM stands for the month, DD stands for the day of the month, hh stands for hour, mm stands for minutes, and ss stands for seconds.
Example
January 25, 2018 will take one of the following supported formats:
- 2018-01-25
- 2018/01/25
- 2018/01-25
- 2018-01/25
- 20180125
Forecast accuracy can be improved when the training dataset includes influencers. These are other variables that you think may have an influence on the signal variable. While values would normally be available for these over the observation period, you must ensure that values for influencers are also provided for the period you want to forecast. If values for influencers to cover the forecasted periods are not available, the predictive model won’t be successful, as you need observations to cover all of the requested forecast dates.

For example, if you want to forecast chocolate sales for the year, you could add the specific dates of festive occasions to your dataset, for example, Easter, Christmas, Mother’s Day, or Valentine's day.
Note

The predictive forecasts will be influenced by the quality of the data you provided: both the historical data but also the future values for the influencers can impact your predictive model's accuracy. In some cases, it can be difficult to provide data with high quality for future values. For example: Future values for weather information can only be forecasts.

Dataset Structure

To be used in SAP Analytics Cloud, your dataset must contain columns (corresponding to the variables) and rows (corresponding to the observations):

When the goal is to create a classification or regression predictive scenario, the corresponding training dataset must contain at least one column for the target variable and multiple columns for influencer variables.

Whereas, if you want to create a time series predictive model, the training dataset should contain:

A column with the date variable (one date per period of time). The date formats must be one amongst:
- YYYY-MM-DD
- YYYY/MM/DD
- YYYY/MM-DD
- YYYY-MM/DD
- YYYYMMDD
- YYYY-MM-DD hh:mm:ss
where YYYY stands for the year, MM stands for the month, DD stands for the day of the month, hh stands for the hour, mm stands for the minutes and ss stands for the seconds.
Example
January 25, 2018 will take one of the following supported formats:
- 2018-01-25
- 2018/01/25
- 2018/01-25
- 2018-01/25
- 20180125
A column with the value that you want to forecast, being the signal variable.
While this is not mandatory, we highly recommend also including influencer variables as part of the training dataset for the past dates and for the forecasted period, in order to improve the forecast accuracy.

If your goal is to create a segmented time series predictive model, make sure that one column includes the segment information.

Example

For example, say that you work for an energy supplier, and you want to estimate the energy consumption over the next 24 months, by dwelling sector of a given district. Your dataset should then contain at least three columns, including the following data:

The consumption (signal variable) for the past date.
The date (date variable), including one date per month over several years.
The consumption by sector (segmented by).
Eventually influencer variables (for example, various calendar events) that happen in the past and in the next 24 months.

To summarize your dataset must contain the right data in the right format, depending on the type of predictive scenario.

Creating an Acquired Dataset

You create an acquired dataset following these step from Datasets start page, select Start of the navigation path From a CSV or Excel File End of the navigation path .

Creating a Live Dataset

You can create a live dataset using data from your SAP HANA on-premise system. However you must take care that the following prerequisites are mets:

Ensure that a connection has been established between your SAP HANA on-premise system and SAP Analytics Cloud. For more information, see Live Data Connection Overview Diagram.
Ensure that you have one of the following security roles: Predictive Admin,BI Admin, or Admin.

To create a live Dataset:

From the Datasets start page, select Create Dataset. And select Data from a data source.

From the Connect to Live Data area, select your existing SAP HANA Live Data Repository and fulfill the steps to complete connection:

Information requested	What does it mean
1- Select a Data Repository	Select the Data Repository, which has been set up to create your live datasets.
2- Select a table	The existing schemas available for this Data repository are displayed. Select the one that points to the actual data stored in your SAP HANA on premise system.

Give a name to your dataset in SAP Analytics Cloud and save it.

The dataset is now created and appears in the Files menu of SAP Analytics Cloud.

Note

As the data stays in your on-premise SAP HANA system, the data is available in real time: you don't need to manually upload the dataset each time you update your data. Any data changes you make to your tables and SQL views in your SAP HANA on-premise system appear immediately in the live dataset stored in SAP Analytics Cloud, as long as you do not change the list of columns used. To update the predictions, you need only two steps:

Retrain your predictive model if you want to get it updated.
Note
This step is recommended for Time series predictive models.
Re-apply the updated predictive model to get the output SAP HANA table updated. The updated predictions will be right-away visible into your stories.
Note
When applying classification and regression predictive models, make sure to select the exact same list of predictive outputs so that the existing output table can be updated in SAP HANA.

Differences Between Acquired and Live Datasets

In SAP Analytics Cloud you can use one of the following types of input datasets:

Acquired: Data is imported (copied) and stored in SAP Analytics Cloud. Changes made to the data in the source system don't affect the imported data.
Live: Data is stored in the source system. It isn't copied to SAP Analytics Cloud, so any changes in the source data are available immediately if no structural changes are brought to the table or SQL view.

When you select an input dataset to create a predictive model, you won't see any distinction between acquired and live datasets, although it is a good practice for the administrator creating the connection to a dataset to include "live" in the dataset name, or its description. You can also see live datasets in the list of objects available to you on the page Start of the navigation path Browse Next navigation step Files <User> End of the navigation path as they include (Live) in the file type.

There are some differences between the two types of input datasets that concern mainly the display of results in the predictive model debrief, and how some types of data are handled, when training or applying a predictive model. These differences will be resolved over time, as predictive models are improved with each release of SAP Analytics Cloud, but for the immediate future, please take into account the following differences that can be observed when using either an acquired or live input dataset:

Element	Acquired dataset	Live dataset
Support of data type conversion In the column Storage under Edit Variable Metadata, you can specify the data type for each variable.	Available If necessary, the physical data type, for example, a string, is automatically converted to the data type specified by the user, for example, a number.	Not available Note Training a predictive model won't complete if a user specified data type doesn’t match the physical data type.

Caution

The input datasets used to train and apply a predictive model must come from the same data source type. You can't apply a predictive model on a live dataset if it was trained with an acquired dataset, nor can you apply a predictive model on an acquired dataset if it was trained using a live one. However, you can have several predictive models trained and applied with live and acquired datasets in the same predictive scenario.

Note

While using live datasets, both live datasets (training and apply datasets) must come from the same SAP HANA system: you cannot train a predictive model with a live dataset with data from SAP HANA system 1 and then apply this predictive model on a live dataset with data coming from SAP HANA system 2.

The type of dataset you have used to train and apply your predictive model will impact the way you will consume the generated predictions in SAP Analytics Cloud. For more information, see Using the Generated Dataset in SAP Analytics Cloud.

Datasets - Examples

Your dataset must have a specific structure so that it can be relevant for Smart Predict.

How a classification or regression predictive scenario training dataset must be structured?

Example of a training dataset used for a classification predictive scenario

In the example below, you want to predict if a customer will buy a product or not. You have prepared a training dataset containing the historical data on customers that have previously bought a similar product. In this dataset, the values of the target variable ("did the customer buy the product?") are known .

Once you have created your predictive model, you apply it to an application dataset. This dataset contains the same information about the customers you want to target with the new product. The target variable column ("Will_Buy?") is empty or even doesn't exist because this is what you are expecting to predict:

Smart Predict will use the predictive model to calculate the probability that each customer will buy the product. The column "Will_Buy?" is now filled in the generated dataset:

Example of a training dataset used for a regression predictive scenario

In the example below, you want to predict the number of complaints that your customer support will receive this week. You have prepared a training dataset containing historical values for several previous weeks. In this dataset, the values of your target variable ("how many complaints per week") are known:

You apply the regression predictive model to a new dataset which contains the same influencers. The values of the target variable for this week ("number of complaints this week") are unknown.

The predictive model makes a prediction of the number of complaints that can be expected this week.

How a time series predictive scenario training dataset must be structured?

Example of a dataset used for a time series predictive scenario (Non-segmented):

In the example below, you want to forecast the "product sales" for the next 3 months. You have prepared a dataset containing historical data of the previous sales statistics for the last 6 months.

Note

To give you a quick and clear overview of the structure, this example shows a 2:1 ratio, or 6 months of historical data for 3 months of forecast. When creating your own time series predictive scenarios with a month granularity, we recommend you use a 5:1 ratio, or 5 months of historical data for each month of forecast. This ratio ensures that the engine can detect enough cycles to create a forecast.

The training and the application are done at the same time: it consists of generating N forecasts in the future. In this example, we want to generate product sales forecasts for the next 3 months.

The generated dataset will look like the one displayed below:

You can include influencers in your data model to refine the forecasts. In fact, this is highly recommended for improving forecast accuracy. The values for these influencers are filled in for the next 3 months as well:

Example of a training dataset used for a time series predictive scenario with entities:

You can also forecast product sales for multiple products. You will need to add the product-related information to the dataset. In Smart Predict, you will specify that the variable "Product Name" will be used to split your predictive model into distinct entities:

In the generated dataset, the observations are divided into "Product Name" and you will get forecasts per product:

Of course, you can combine entities and influencers. In fact, this is recommended if you want to increase the accuracy of your time series models.