Data preparation is a formal component of many enterprise systems and applications maintained by IT, such as data warehousing and business intelligence. However, prepared data is also needed by business users for other purposes, such as analysis and ad hoc reporting. As a result, IT and other data-savvy employees (e.g., data scientists) have to handle a large volume of requests for specific data sets. Business users are therefore increasingly interested in self-service preparation tools that let them access and work with data sources without delving into the intricacies of SQL, Python, or SAS code.

  1. Search
    The essence of this stage is finding the data best suited to a specific task. Many users find this incredibly difficult and time-consuming. To search for data effectively, a complete, well-documented data catalog (i.e., a metadata repository) must be created and maintained. In addition to data profiling statistics and other content, the catalog stores a descriptive index indicating where the available data is located.
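
As a rough sketch, a catalog entry and a keyword search over it might look like the Python below; the `CatalogEntry` fields and the sample datasets are hypothetical and do not refer to any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal catalog record: where a dataset lives, plus its profile."""
    name: str
    location: str                  # e.g. a table name, file path, or URI
    description: str
    profile: dict = field(default_factory=dict)  # profiling statistics

def search_catalog(catalog: list, term: str) -> list:
    """Return entries whose name or description mentions the term."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower() or term in e.description.lower()]

# Usage: find candidate datasets for a customer analysis.
catalog = [
    CatalogEntry("crm_customers", "s3://warehouse/crm/customers.parquet",
                 "Customer master records from the CRM system"),
    CatalogEntry("web_sessions", "s3://warehouse/web/sessions.parquet",
                 "Clickstream sessions from the customer portal"),
]
hits = search_catalog(catalog, "customer")  # matches both entries
```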

Data profiling is key to understanding the data because it provides high-level statistics on data quality (number of rows, column data types, the minimum, maximum, and average values per column, the number of null values, etc.). Profiling makes it easy to choose among several candidate datasets.
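
A minimal profiling pass producing exactly these statistics can be sketched with pandas; the sample DataFrame is invented for illustration.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row of high-level quality statistics per column."""
    numeric = df.select_dtypes("number")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # data type of each column
        "nulls": df.isna().sum(),         # number of null values
        "min": numeric.min(),             # min/max/mean apply to numeric
        "max": numeric.max(),             # columns only; others show NaN
        "mean": numeric.mean(),
    })

df = pd.DataFrame({"age": [34, 41, None, 29],
                   "city": ["Oslo", "Lima", "Pune", None]})
print("rows:", len(df))                   # number of rows
print(profile(df))
```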

Data preparation should become a formalized corporate practice. Shared metadata, a continuously managed repository, and reusable transformation and cleansing logic make data preparation an efficient, consistent, and repeatable process.

  2. Retention
    The essence of the retention phase is consolidating the data selected during the search phase. Figuratively speaking, it is a temporary hold placed on the data needed in the later stages of preparation. In many organizations, however, data remains “locked up” in spreadsheets forever, even after preparation is complete. The cleansing phase requires a temporary workspace or holding area for the data. To hold intermediate or delivered data continuously, you need a shared, managed repository: a relational database, a network file system, or a “big data” repository such as a “data lake” on the Hadoop platform. A newer trend is to place data directly in RAM (or in the cloud), which greatly speeds up the real-time consolidation and formatting of data prior to further processing.
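
The sketch below stands in for such a holding area, using pandas and a SQLite staging database as the shared repository; the table names and CSV paths are assumptions made for the example.

```python
import sqlite3
import pandas as pd

# A SQLite file stands in for the shared staging repository; connecting
# to ":memory:" instead would mirror the in-RAM variant mentioned above.
staging = sqlite3.connect("staging.db")

# Consolidate the datasets selected during the search phase.
for table, path in [("crm_customers", "customers.csv"),
                    ("web_sessions", "sessions.csv")]:
    df = pd.read_csv(path)
    # if_exists="replace" keeps the hold temporary: each run supersedes the last.
    df.to_sql(table, staging, if_exists="replace", index=False)
```
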
  3. Cleansing
    During cleansing, it is necessary to determine whether the data is suitable for the task at hand. Cleansing also serves to evaluate data quality, which is why it is an integral part of data preparation. The effort this assessment requires (it usually includes data validation, deduplication, and enrichment) often depends on the ability to reuse components from other deployed systems.
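
A simplified pass covering all three activities might look like this; the column names, the email pattern, and the region lookup table are assumptions for illustration.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    # Validation: keep only rows whose email field looks plausible.
    valid = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    df = df[valid]
    # Deduplication: one row per email, keeping the most recent record.
    df = df.sort_values("updated_at").drop_duplicates("email", keep="last")
    # Enrichment: derive a region from the country code via a reference table.
    regions = {"NO": "EMEA", "PE": "LATAM", "IN": "APAC"}
    return df.assign(region=df["country"].map(regions))
```
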
  4. Documentation
    Documentation is the process of recording business and technical metadata about retrieved, consolidated, and cleansed data. Metadata includes:

Identified master data administrators.
All of this metadata is available in the data catalog. Preparing data manually, usually in spreadsheets, is not only time-consuming but also duplicative and inconsistent: different users (or even the same person) may get different results for the same task. Shared metadata makes data preparation faster and repeatable when needed. In addition, shared metadata lets multiple users responsible for different aspects of data preparation collaborate effectively.
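
As a sketch, recording this metadata can be as simple as appending one JSON record per dataset to a shared catalog file; every field name below is illustrative, not a standard schema.

```python
import json
from datetime import datetime, timezone

# A catalog record for the cleansed dataset.
entry = {
    "name": "crm_customers_clean",
    "location": "deliveries/crm_customers_clean.parquet",
    "source_systems": ["CRM"],
    "steward": "jane.doe",  # the identified master data administrator
    "prepared_at": datetime.now(timezone.utc).isoformat(),
    "cleansing_steps": ["email validation", "deduplication",
                        "region enrichment"],
}

# One JSON record per line in a shared file stands in for the managed repository.
with open("catalog.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```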

  5. Delivery
    Delivery means bringing the cleansed data into a format suitable for use by people or processes. At this point, the need to retain the delivered data sets permanently must be assessed. If there is such a need, the appropriate metadata is placed in the catalog so that other users can find the data as well.
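
A minimal delivery step, reusing the staging database and table names assumed in the earlier sketches, might export the cleansed data in formats convenient for people (CSV) and for processes (Parquet); note that `to_parquet` requires a Parquet engine such as pyarrow.

```python
from pathlib import Path
import sqlite3
import pandas as pd

Path("deliveries").mkdir(exist_ok=True)
staging = sqlite3.connect("staging.db")
clean = pd.read_sql("SELECT * FROM crm_customers_clean", staging)

# CSV for analysts, Parquet for downstream processes.
clean.to_csv("deliveries/crm_customers_clean.csv", index=False)
clean.to_parquet("deliveries/crm_customers_clean.parquet", index=False)
```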

Data governance policies must be followed during delivery, for example to minimize the risk of leaking sensitive information. It is also important to note that delivery is not always a one-time event: when new or changed data must be delivered repeatedly, preparation runs are scheduled or triggered on request. In addition, use of the delivered data must be tracked, and unused data must be deleted after a certain period (along with the corresponding entries in the data catalog).
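
A simple retention sweep along those lines is sketched below; the 90-day policy, the line-delimited catalog file from the documentation sketch, and the use of file access time as a proxy for "last used" are all assumptions.

```python
import json
import time
from pathlib import Path

MAX_IDLE_DAYS = 90  # assumed retention policy for unused deliveries

def purge_unused(catalog_path: str = "catalog.json") -> None:
    """Delete deliveries unread for MAX_IDLE_DAYS and drop their catalog entries."""
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    kept = []
    for line in Path(catalog_path).read_text().splitlines():
        entry = json.loads(line)
        delivered = Path(entry["location"])  # assumes location is a file path
        # st_atime approximates "last used"; a real tracker would log reads.
        if delivered.exists() and delivered.stat().st_atime < cutoff:
            delivered.unlink()  # remove the unused delivery ...
            continue            # ... and drop its catalog entry
        kept.append(line)
    Path(catalog_path).write_text("".join(l + "\n" for l in kept))
```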