Stop! That Data Needs To Be Cleaned

Written by Axella Yusuf · 1 min read

Resources are usually scarce, so we as managers need to plan and budget continually.

We should be concerned about the future, and budgeting helps us both estimate it and take care of it.

To understand what the future is, you need to understand what has happened in the past.

There are three terms we consider: forecasting, prediction and projection.

                                                           Forecasting

Forecasting is the process of estimating the future based on past and present data. It helps in decision making. A related term is backcasting, which involves estimating backwards. There is also nowcasting, which involves estimating what will happen between the present minute and the next 24 hours; examples are weather forecasts and stock prices.

Note, however, that the longer the period you are forecasting, the less accurate the forecast will be.

Anything beyond 24 hours is forecasting. Whatever the period, ensure that all relevant data is available before you forecast.

                                  Forecasting approaches

In forecasting, there are four approaches we need to consider:

  1. Naïve approach
  2. Moving average
  3. Exponential smoothing
  4. Trend projection
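To make the four approaches concrete, here is a minimal sketch of each one producing a forecast for the next period. The function names, the window of 3, the smoothing factor `alpha=0.5` and the sales figures are all assumptions made up for illustration; they do not come from any particular library or data set.

```python
from statistics import mean


def naive_forecast(history):
    """Naive approach: the next value equals the last observed value."""
    return history[-1]


def moving_average_forecast(history, window=3):
    """Moving average: the next value is the mean of the last `window` values."""
    return mean(history[-window:])


def exponential_smoothing_forecast(history, alpha=0.5):
    """Simple exponential smoothing: recent values get more weight.

    `alpha` (between 0 and 1) controls how quickly older values fade.
    """
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def trend_projection_forecast(history):
    """Trend projection: fit a least-squares straight line and extend it one step."""
    n = len(history)
    xs = list(range(n))
    x_mean, y_mean = mean(xs), mean(history)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


# Made-up monthly sales figures, used only to illustrate the calls.
monthly_sales = [100, 110, 120, 130, 140, 150]

print("Naive:", naive_forecast(monthly_sales))
print("Moving average:", moving_average_forecast(monthly_sales))
print("Exponential smoothing:", exponential_smoothing_forecast(monthly_sales))
print("Trend projection:", trend_projection_forecast(monthly_sales))
```

On this steadily rising series the four approaches give 150, 140, 140.3125 and 160 for the next month. Only trend projection continues the upward pattern, which is why the choice of approach matters.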

                            Techniques in Forecasting

The choice of forecasting technique usually depends on:

  • Availability of accurate historic data
  • Simplicity of the model
  • Cost consideration

Prediction is not 100 percent accurate. It uses two types of data: training data and test data. It is usually used when companies need to find patterns in data. In prediction, it is usually best to estimate a period for which we already have the actual data, then compare the estimates with the actuals to check how close they are.
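One way to check how close the estimates are to the actual data is to compute an error measure over the period. The sketch below uses two common measures, mean absolute error and mean absolute percentage error. The function names and the figures are assumptions chosen for illustration.

```python
def mean_absolute_error(actual, estimated):
    """Average size of the gap between actual and estimated values."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def mean_absolute_percentage_error(actual, estimated):
    """Average gap as a percentage of the actual value (actuals must be non-zero)."""
    return 100 * sum(abs(a - e) / abs(a) for a, e in zip(actual, estimated)) / len(actual)


# Made-up figures for a period where the actual data is already known.
actual = [200, 210, 190, 220]
estimated = [190, 215, 200, 210]

print("MAE:", mean_absolute_error(actual, estimated))
print("MAPE (%):", mean_absolute_percentage_error(actual, estimated))
```

The smaller these numbers are, the closer the estimates are to what actually happened.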

                                                       Training data

Training data is the portion of the actual data set that is fed into the machine learning model so that it can discover and learn the patterns in the set. This sets a precedent for the model. It is larger than the testing data set because the model should get as much data as possible so that a pattern can be discovered.

                                               Testing Data

Once a model has been established with the training data above, we require ‘never seen’ data to test the model. This data is referred to as the testing data.

It is used to test and evaluate the performance and progress of the algorithm's training, and it also helps optimize the model for improved results.

Two criteria should be considered for testing data:

  • It should represent the actual data set
  • It should be large enough to generate meaningful predictions

Remember that the dataset used here needs to be new and one that has not been seen by the model because the model already knows the training data.

The performance of the model after the introduction of this new test data will let you know if it is working accurately or if more training data is needed in order to perform according to specifications.
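As a rough sketch of that check, the example below fits a straight line to the training data only, then measures its error on both the training data and the unseen test data. The straight-line model, the function names and the numbers are all assumptions for illustration; a real model would be chosen to suit the data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    x_mean, y_mean = mean(xs), mean(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return slope, y_mean - slope * x_mean


def mean_absolute_error(xs, ys, slope, intercept):
    """Average gap between the actual values and the line's estimates."""
    return mean([abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)])


# Made-up data: ten periods, roughly following y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9]

# The first eight periods train the model; the last two are never shown to it.
train_x, test_x = xs[:8], xs[8:]
train_y, test_y = ys[:8], ys[8:]

slope, intercept = fit_line(train_x, train_y)
train_error = mean_absolute_error(train_x, train_y, slope, intercept)
test_error = mean_absolute_error(test_x, test_y, slope, intercept)

print("Error on training data:", train_error)
print("Error on test data:", test_error)
```

If the error on the test data is much larger than the error on the training data, the model has not learned a pattern that carries over, and more or better training data may be needed.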

Test data provides a final, realistic check on unseen data to confirm that the machine learning algorithm was trained effectively.

In data science, data is typically split 80-20: 80 percent is used for training and 20 percent for testing.
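A minimal sketch of such a split is shown below. The function name `train_test_split`, its parameters and the seed value are assumptions for this example, not a specific library's API. For data ordered in time, such as sales history, it is common to pass `shuffle=False` so that the model trains on the earlier periods and is tested on the later ones.

```python
import random


def train_test_split(rows, train_fraction=0.8, shuffle=True, seed=42):
    """Split rows into a training part and a testing part.

    With shuffle=False the original order is kept, so the test part
    is simply the last 20 percent of the rows.
    """
    rows = list(rows)
    if shuffle:
        # A fixed seed makes the split repeatable.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# One hundred made-up records, numbered 0 to 99.
records = list(range(100))

train, test = train_test_split(records)
print(len(train), "training rows,", len(test), "testing rows")

# Time-ordered split: the first 80 records train, the last 20 test.
train_ordered, test_ordered = train_test_split(records, shuffle=False)
```

Every record ends up in exactly one of the two parts, so the model is never tested on data it has already seen.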

Projection is driven solely by assumptions: it estimates what would happen if those assumptions held.
