The concept of multidimensional data analysis is closely associated with operational analysis, which is performed using OLAP systems.
OLAP (On-Line Analytical Processing) is a technology for operational analytical data processing that uses methods and tools for collecting, storing and analyzing multidimensional data to support decision-making processes.
The main purpose of OLAP systems is to support analytical activities and arbitrary (the term ad-hoc is often used) requests from user analysts. The purpose of OLAP analysis is to test emerging hypotheses.
At the origins of OLAP technology is the founder of the relational approach, E. Codd. In 1993, he published an article entitled "OLAP for User Analysts: What It Should Be." This paper outlines the basic concepts of online analytics and identifies the following 12 requirements that must be met by products that enable online analytics. Tokmakov G.P. Database. Database concept, relational data model, SQL languages. P. 51
Listed below are the 12 rules outlined by Codd that define OLAP.
1. Multidimensionality -- an OLAP system at the conceptual level should present data in the form of a multidimensional model, which simplifies the processes of analysis and perception of information.
2. Transparency -- the OLAP system must hide from the user the real implementation of the multidimensional model, the method of organization, sources, processing and storage means.
3. Availability -- An OLAP system must provide the user with a single, consistent and holistic data model, providing access to data regardless of how or where it is stored.
4. Consistent performance when developing reports - the performance of OLAP systems should not decrease significantly as the number of dimensions on which analysis is performed increases.
5. Client-server architecture -- the OLAP system must be able to work in a client-server environment, because Most of the data that today needs to be subjected to operational analytical processing is stored distributedly. The main idea here is that the server component of the OLAP tool should be sufficiently intelligent and allow the construction of a common conceptual scheme based on the generalization and consolidation of various logical and physical schemes of corporate databases to provide the effect of transparency.
6. Dimensional equality -- The OLAP system must support a multidimensional model in which all dimensions are equal. If necessary additional characteristics can be given to individual dimensions, but this capability must be given to any dimension.
7. Dynamic management of sparse matrices -- the OLAP system must provide optimal processing of sparse matrices. The access speed must be maintained regardless of the location of the data cells and be constant for models with different numbers of dimensions and varying degrees of data sparsity.
8. Support for multi-user mode - the OLAP system must provide the ability for several users to work together with one analytical model or create different models for them from single data. In this case, both reading and writing data is possible, so the system must ensure its integrity and security.
9. Unlimited cross operations -- the OLAP system must ensure that the functional relationships described using a certain formal language between the cells of the hypercube are preserved when performing any slice, rotation, consolidation or drill-down operations. The system should independently (automatically) perform the transformation of established relationships, without requiring the user to redefine them.
10. Intuitive data manipulation -- An OLAP system must provide a way to perform slicing, rotating, consolidating, and drilling operations on a hypercube without the user having to do a lot of interface manipulation. The dimensions defined in the analytical model must contain all the necessary information to perform the above operations.
11. Flexible options for obtaining reports -- the OLAP system must support various methods of data visualization, i.e. reports should be presented in any possible orientation. Reporting tools must present synthesized data or information resulting from the data model in any possible orientation. This means that rows, columns or pages must show 0 to N dimensions at a time, where N-- number measurements of the entire analytical model. Additionally, each content dimension shown in a single post, column, or page must allow any subset of the elements (values) contained in the dimension to be shown in any order.
12. Unlimited dimension and number of aggregation levels - research on the possible number of necessary dimensions required in the analytical model showed that up to 19 dimensions can be used simultaneously. Hence, it is strongly recommended that the analytical tool be able to provide at least 15, and preferably 20, measurements simultaneously. Moreover, each of the common dimensions should not be limited in the number of user-analyst-defined aggregation levels and consolidation paths.
Codd's Additional Rules.
The set of these requirements, which served as the de facto definition of OLAP, quite often causes various complaints, for example, rules 1, 2, 3, 6 are requirements, and rules 10, 11 are unformalized wishes. Tokmakov G.P. Database. Database concept, relational data model, SQL languages. P. 68 Thus, Codd's listed 12 requirements do not allow us to accurately define OLAP. In 1995, Codd added the following six rules to the above list:
13. Batch Retrieval vs. Interpretation -- An OLAP system must provide access to both its own and external data equally effectively.
14. Support for all OLAP analysis models -- An OLAP system must support all four data analysis models defined by Codd: categorical, interpretative, speculative and stereotypical.
15. Processing of non-normalized data -- the OLAP system must be integrated with non-normalized data sources. Data modifications made in the OLAP environment should not result in changes to data stored in the original external systems.
16. Saving OLAP results: storing them separately from the source data - an OLAP system operating in read-write mode must save the results separately after modifying the source data. In other words, the security of the original data is ensured.
17. Elimination of missing values - An OLAP system, when presenting data to the user, must discard all missing values. In other words, missing values must be different from null values.
18. Handling Missing Values -- The OLAP system must ignore all missing values without regard to their source. This feature is related to the 17th rule.
In addition, Codd divided all 18 rules into the following four groups, calling them features. These groups were named B, S, R and D.
The main features of (B) include the following rules:
Multidimensional conceptual representation of data (rule 1);
Intuitive data manipulation (rule 10);
Availability (rule 3);
Batch extraction vs. interpretation (rule 13);
Support for all OLAP analysis models (rule 14);
Client-server architecture (rule 5);
Transparency (rule 2);
Multi-user support (rule 8)
Special Features (S):
Processing of non-normalized data (rule 15);
Storing OLAP results: storing them separately from the source data (rule 16);
Elimination of missing values (rule 17);
Handling missing values (rule 18). Reporting Features (R):
Flexibility of reporting (rule 11);
Standard reporting performance (rule 4);
Automatic physical layer configuration (modified original rule 7).
Dimension Control (D):
Universality of measurements (rule 6);
Unlimited number of dimensions and aggregation levels (rule 12);
Unlimited operations between dimensions (rule 9).
The concept of OLAP technology was formulated by Edgar Codd in 1993.
This technology is based on the construction of multidimensional data sets - so-called OLAP cubes (not necessarily three-dimensional, as one might conclude from the definition). The purpose of using OLAP technologies is to analyze data and present this analysis in a form convenient for management personnel to understand and make decisions based on them.
Basic requirements for applications for multivariate analysis:
OLAP system tools provide the ability to sort and select data according to specified conditions. Various qualitative and quantitative conditions can be specified.
The main data model used in numerous tools for creating and maintaining databases - DBMS - is the relational model. The data in it is presented in the form of a set of two-dimensional relation tables connected by key fields. To eliminate duplication, inconsistency, and reduce labor costs for maintaining databases, a formal apparatus for normalizing entity tables is used. However, its use is associated with additional time spent on generating responses to database queries, although memory resources are saved.
A multidimensional data model represents the object under study in the form of a multidimensional cube; a three-dimensional model is more often used. Dimensions or attribute details are plotted along the axes or faces of the cube. The base attributes are the filling of the cube cells. A multidimensional cube can be represented by a combination of three-dimensional cubes in order to facilitate perception and presentation when generating reporting and analytical documents and multimedia presentations based on analytical work materials in a decision support system.
Within the framework of OLAP technologies, based on the fact that a multidimensional representation of data can be organized both by means of relational DBMSs and multidimensional specialized tools, three types of multidimensional OLAP systems are distinguished:
In multidimensional DBMSs, data is organized not in the form of relational tables, but in the form of ordered multidimensional arrays in the form of hypercubes, when all stored data must have the same dimension, which means the need to form the maximum full basis measurements. Data can be organized in the form of polycubes; in this option, the values of each indicator are stored with its own set of dimensions, and data processing is carried out by the system’s own tool. The storage structure in this case is simplified, because there is no need for a data storage area in a multidimensional or object-oriented form. Huge labor costs for creating models and systems for converting data from a relational model to an object model are reduced.
The advantages of MOLAP are:
Limitations of MOLAP include:
Labor costs for creating multidimensional data increase sharply, because... In this situation, there are practically no specialized means of objectifying the relational model of data contained in the information repository. Query response times often cannot meet the requirements for OLAP systems.
The advantages of ROLAP systems are:
The disadvantage of ROLAP systems is lower productivity, the need for careful development of database schemas, special tuning of indexes, analysis of query statistics and taking into account analysis findings when modifying database schemas, which leads to significant additional labor costs.
Fulfillment of these conditions allows, when using ROLAP systems, to achieve indicators similar to MOLAP systems in terms of access time, as well as surpass them in memory savings.
Hybrid OLAP systems are a combination of tools that implement a relational and multidimensional data model. This allows you to dramatically reduce the resource costs of creating and maintaining such a model and the response time to requests.
This approach uses the advantages of the first two approaches and compensates for their disadvantages. In the most developed software products For such purpose this principle is implemented.
The use of hybrid architecture in OLAP systems is the most appropriate way to solve problems associated with the use of software tools in multidimensional analysis.
The pattern detection mode is based on intelligent data processing. The main task here is to identify patterns in the processes under study, relationships and mutual influence of various factors, search for large “unusual” deviations, and predict the course of various significant processes. This area relates to data mining.
The OLAP mechanism is one of the popular methods of data analysis today. There are two main approaches to solving this problem. The first of them is called Multidimensional OLAP (MOLAP) - implementation of the mechanism using a multidimensional database on the server side, and the second Relational OLAP (ROLAP) - building cubes on the fly based on SQL queries to a relational DBMS. Each of these approaches has its pros and cons. Their comparative analysis is beyond the scope of this article. We will describe our implementation of the core of the desktop ROLAP module.
This task arose after using a ROLAP system built on the basis of Decision Cube components included in Borland Delphi. Unfortunately, the use of this set of components showed poor performance on large amounts of data. This problem can be mitigated by trying to cut out as much data as possible before feeding it into cubes. But this is not always enough.
You can find a lot of information about OLAP systems on the Internet and in the press, but almost nowhere is it said about how it works inside. Therefore, the solution to most problems was given to us by trial and error.
The general scheme of operation of a desktop OLAP system can be represented as follows:
The operating algorithm is as follows:
Let's consider how similar system can be arranged inside. We will start this from the side that can be seen and touched, that is, from the displays.
The displays used in OLAP systems most often come in two types: cross-tabs and charts. Let's look at a crosstab, which is the basic and most common way to display a cube.
In the figure below, the rows and columns containing aggregated results are shown in yellow, the cells containing facts are in light gray, and the cells containing dimensional data are in dark gray.
Thus, the table can be divided into the following elements, which we will work with in the future:
When filling out the matrix with facts, we must proceed as follows:
It should be noted that the resulting matrix will be very sparse, which is why its organization in the form of a two-dimensional array (the option lying on the surface) is not only irrational, but, most likely, impossible due to the large dimension of this matrix, for storing which there is no no volume is enough random access memory. For example, if our cube contains information about sales for one year, and if it has only 3 dimensions - Customers (250), Products (500) and Date (365), then we will get a fact matrix of the following dimensions:
Number of elements = 250 x 500 x 365 = 45,625,000
And this despite the fact that there may be only a few thousand filled elements in the matrix. Moreover, the greater the number of dimensions, the more sparse the matrix will be.
Therefore, to work with this matrix, you need to use special mechanisms for working with sparse matrices. Various options for organizing a sparse matrix are possible. They are quite well described in the programming literature, for example, in the first volume of the classic book "The Art of Programming" by Donald Knuth.
Let us now consider how we can determine the coordinates of a fact, knowing the dimensions corresponding to it. To do this, let's take a closer look at the header structure:
In this case, you can easily find a way to determine the numbers of the corresponding cell and the totals in which it falls. Several approaches can be proposed here. One of them is to use a tree to find matching cells. This tree can be constructed by traversing the selection. In addition, an analytical recurrence formula can be easily defined to calculate the required coordinate.
The data stored in the table needs to be transformed in order to be used. Thus, in order to improve performance when constructing a hypercube, it is desirable to find unique elements stored in columns that are dimensions of the cube. In addition, you can perform preliminary aggregation of facts for records that have the same dimension values. As mentioned above, the unique values available in the measurement fields are important to us. Then the following structure can be proposed for storing them:
By using this structure, we significantly reduce the memory requirement. Which is quite relevant, because... To increase operating speed, it is advisable to store data in RAM. In addition, you can only store an array of elements, and dump their values to disk, since we will only need them when displaying the cross-tab.
The ideas described above were the basis for creating the CubeBase component library.
TCubeSource carries out caching and conversion of data into an internal format, as well as preliminary aggregation of data. Component TCubeEngine carries out calculations of the hypercube and operations with it. In fact, it is an OLAP engine that transforms a flat table into a multidimensional data set. Component TCubeGrid displays the cross-tab and controls the display of the hypercube. TCubeChart allows you to see the hypercube in the form of graphs, and the component TCubePivote controls the operation of the cube core.
This set of components showed much higher performance than Decision Cube. So, on a set of 45 thousand records, the Decision Cube components required 8 minutes. to build a pivot table. CubeBase loaded data in 7 seconds. and building a pivot table in 4 seconds. When testing on 700 thousand Decision Cube records, we did not receive a response within 30 minutes, after which we canceled the task. CubeBase loaded data in 45 seconds. and building a cube in 15 seconds.
On data volumes of thousands of records, CubeBase processed tens of times faster than Decision Cube. On tables with hundreds of thousands of records - hundreds of times faster. And high performance is one of the most important indicators of OLAP systems.
Purpose of the report
This report will focus on one of the categories of intelligent technologies that are a convenient analytical tool - OLAP technologies.
The purpose of the report: to reveal and highlight 2 issues: 1) the concept of OLAP and their applied importance in financial management; 2) implementation of OLAP functionality in software solutions: differences, opportunities, advantages, disadvantages.
I would like to note right away that OLAP is a universal tool that can be used in any application area, and not just in finance (as can be understood from the title of the report), which requires data analysis using various methods.
Financial management
Financial management is an area in which analysis is more important than any other. Any financial and management decision arises as a result of certain analytical procedures. Today, financial management is becoming important for the successful functioning of an enterprise. Despite the fact that financial management is an auxiliary process in an enterprise, it requires special attention, since erroneous financial and managerial decisions can lead to large losses.
Financial management is aimed at providing the enterprise with financial resources in the required volumes, at the right time and in the right place in order to obtain the maximum effect from their use through optimal distribution.
It is perhaps difficult to define the level of “maximum resource efficiency”, but in any case,
The CFO should always know:
To obtain reasonable answers to these questions, it is necessary to have, analyze and know how to analyze a sufficiently large number of performance indicators. In addition, FU covers a huge number of areas: analysis of cash flows (movements Money), analysis of assets and liabilities, profitability analysis, marginal analysis, profitability analysis, assortment analysis.
Knowledge
Therefore, a key factor in the effectiveness of the financial management process is the availability of knowledge:
If the first lies in the scope of actions of this financier (or the HR director who hired this employee), then the second should be purposefully created at the enterprise by the joint efforts of employees of financial and information services.
What is there now
However, now a paradoxical situation is typical in enterprises: there is information, there is a lot of it, too much. But it is in a chaotic state: unstructured, inconsistent, fragmented, not always reliable and often erroneous, it is almost impossible to find and obtain. Long-term and often useless generation of mountains is carried out financial statements, which is inconvenient for financial analysis and difficult to understand, since it is not created for internal management, but for provision to external regulatory authorities.
According to the results of a study conducted by the company Reuters Among 1,300 international managers, 38% of respondents say they spend a lot of time trying to find the information they need. It turns out that a highly qualified specialist spends highly paid time not on data analysis, but on collecting, searching and systematizing the information necessary for this analysis. At the same time, managers are overloaded with data that is often irrelevant, which again reduces the effectiveness of their work. The reason for this situation: excess information and lack of knowledge.
What to do
Information must be turned into knowledge. For modern business, valuable information, its systematic acquisition, synthesis, exchange, use is a kind of currency, but in order to receive it, it is necessary to manage information, like any business process.
The key to information management is delivering the right information in the right form to stakeholders within the organization at the right time. The goal of such management is to help people work better together using increasing amounts of information.
Information technology in this case acts as a means by which it would be possible to systematize information in an enterprise, provide certain users with access to it and give them the tools to transform this information into knowledge.
OLAP technologies (from the English On-Line Analytical Processing) is the name not of a specific product, but of an entire technology for the operational analysis of multidimensional data accumulated in a warehouse. In order to understand the essence of OLAP, it is necessary to consider the traditional process of obtaining information for decision making.
Here, of course, there can also be many options: complete information chaos or the most typical situation when the enterprise has operational systems with the help of which the facts of certain operations are recorded and stored in databases. To extract data from databases for analytical purposes, a system of queries for specific data samples has been built.
But this method of decision support lacks flexibility and has many disadvantages:
If query reports are presented in a one-dimensional format, then business problems are usually multidimensional and multifaceted. If you want to get a clear picture of a company's business, then you need to analyze data from various perspectives.
Many companies create excellent relational databases data, ideally sorting out mountains of unused information, which in itself does not provide either a quick or sufficiently competent reaction to market events. YES - relational databases were, are and will be the most suitable technology for storing corporate data. It's not about new technology DB, but rather about analysis tools that complement the functions of existing DBMSs and are flexible enough to provide and automate different types intellectual analysis inherent in OLAP.
What does OLAP provide?
To get the effect of using OLAP technologies, you must: 1) understand the essence of the technologies themselves and their capabilities; 2) clearly define what processes need to be analyzed, what indicators they will be characterized by and in what dimensions it is advisable to see them, i.e. create an analysis model.
The basic concepts that OLAP technologies operate on are as follows:
To understand the multidimensionality of the data, you should first present a table showing, for example, the performance of Enterprise Costs by economic elements and business units.
This data is presented in two dimensions:
This table is not informative, as it shows sales for one specific period of time. For different time periods, analysts will have to compare several tables (for each time period):
The figure shows a 3rd dimension, Time, in addition to the first two. (Article, business unit)
Another way to show multidimensional data is to represent it in the form of a cube:
OLAP cubes allow analysts to obtain data at various slices to obtain answers to questions posed by the business:
Answers to such questions are necessary for making management decisions: on the reduction of certain cost items, the impact on their structure, identifying the reasons for changes in costs over time, deviations from the plan and their elimination - optimizing their structure.
In this example, only 3 dimensions are considered. It's difficult to depict more than 3 dimensions, but it works in the same way as with 3 dimensions.
Typically, OLAP applications allow you to obtain data on 3 or more dimensions, for example, you can add one more dimension - Plan-Actual, Cost Category: direct, indirect, by Orders, by Months. Additional dimensions allow you to obtain more analytical slices and provide answers to questions with multiple conditions.
OLAP also allows analysts to organize each dimension into a hierarchy of groups, subgroups, and totals that reflect the measure across the entire organization—the most logical way to analyze a business.
For example, it is advisable to group costs hierarchically:
OLAP allows analysts to look at the overall summary metric (at the top level) and then drill down to the bottom and subsequent levels to discover the exact reason why the metric changed.
By allowing analysts to use multiple dimensions in a data cube, with the ability to hierarchically construct dimensions, OLAP provides a picture of the business that is not compressed by the information warehouse structure.
As a rule, they operate in concepts: dimensions specified in columns, rows (there may be several of them), the rest form slices, the contents of the table form dimensions (sales, costs, cash)
Typically, OLAP allows you to change the orientation of cube dimensions, thereby presenting the data in different views.
The display of cube data depends on:
Thus, OLAP allows you to carry out various types of analysis and understand their relationships with their results.
Efficiency: we can say that OLAP is based on the laws of psychology: the ability to process information requests in “real time” - at the pace of the process of analytical comprehension of data by the user.
If a relational database can read about 200 records per second and write 20, then a good OLAP server, using calculated rows and columns, can consolidate 20,000-30,000 cells (equivalent to one record in a relational database) per second.
Visibility: It should be emphasized that OLAP provides advanced means of graphical presentation of data to the end user. The human brain is capable of perceiving and analyzing information that is presented in the form of geometric images, in a volume that is several orders of magnitude greater than information presented in alphanumeric form. Example: Let's say you need to find a familiar face in one of a hundred photographs. I believe this process will take you no more than a minute. Now imagine that instead of photographs you will be offered a hundred verbal descriptions of the same persons. I think that you will not be able to solve the proposed problem at all.
Simplicity: The main feature of these technologies is that they are intended for use by non-specialists in the field information technologies, not an expert statistician, but a professional in the applied field - a credit department manager, a budget department manager, and finally a director. They are designed for the analyst to communicate with the problem, not with the computer..
Despite the great capabilities of OLAP (in addition, the idea is relatively old - the 60s), its actual use is practically never found in our enterprises. Why?
Our approach and Western ones to the use of OLAP
In addition, we also have a specific understanding of the application utility of OLAP even while understanding its technological capabilities.
Our and Russian authors of various materials devoted to OLAP express the following opinion regarding the usefulness of OLAP: most perceive OLAP as a tool that allows you to expand and collapse data simply and conveniently, carrying out the manipulations that come to the analyst’s mind during the analysis process. The more “slices” and “sections” of data the analyst sees, the more ideas he has, which, in turn, require more and more “slices” for verification. It is not right.
The Western understanding of the usefulness of OLAP is based on a methodological analysis model that must be incorporated when designing OLAP solutions. The analyst should not play with the OLAP cube and aimlessly change its dimensions and levels of detail, data orientation, graphical display of data (and this really takes!), but clearly understand what views he needs, in what sequence and why (of course, the elements " there may be discoveries here, but it is not fundamental to the usefulness of OLAP).
One of the most fertile areas of application of OLAP technologies. It is not for nothing that no modern budgeting system is considered complete without the presence of OLAP tools for budget analysis. Most budget reports are easily built on the basis of OLAP systems. At the same time, the reports answer a very wide range of questions: analysis of the structure of expenses and income, comparison of expenses for certain items in different divisions, analysis of the dynamics and trends of expenses for certain items, analysis of costs and profits.
OLAP will allow you to analyze cash inflows and outflows in the context of business operations, counterparties, currencies and time in order to optimize their flows.
If you have the appropriate data, you can find various applications of OLAP technology.
This section will discuss OLAP as a software solution.
There are many ways to implement OLAP applications, so no particular technology should have been required, or even recommended. Under different conditions and circumstances, one approach may be preferable to another. The implementation techniques include many different proprietary ideas that vendors are so proud of: variations of client-server architecture, time series analysis, object orientation, data storage optimization, parallel processes, etc. But these technologies cannot be part of the definition of OLAP.
There are characteristics that must be observed in all OLAP products (if it is an OLAP product), which is the ideal of the technology. These are the 5 key definitions that characterize OLAP (the so-called FASMI test): Fast Analysis of Shared Multidimensional Information.
In other words, these 5 key definitions are the goals that OLAP products are designed to achieve.
An OLAP system includes certain components. There are various schemes for their operation that this or that product can implement.
Typically, an OLAP system includes the following components:
It should be noted that not all components are required. There are desktop OLAP systems that allow you to analyze data stored directly on the user's computer and do not require an OLAP server.
However, what element is required is the data source: data availability is an important issue. If they exist, in any form, such as an Excel table, in the accounting system database, or in the form of structured reports from branches, the IT specialist will be able to integrate with the OLAP system directly or with intermediate conversion. OLAP systems have special tools for this. If this data is not available, or it is of insufficient completeness and quality, OLAP will not help. That is, OLAP is only a superstructure over the data, and if there is none, it becomes a useless thing.
Most data for OLAP applications originates in other systems. However, in some applications (for example, planning or budgeting), data can be created directly in OLAP applications. When data comes from other applications, it is usually necessary for the data to be stored in a separate, duplicate form for the OLAP application. Therefore, it is advisable to create data warehouses.
It should be noted that the term “OLAP” is inextricably linked with the term “data warehouse” (Data Warehouse). A data warehouse is a domain-specific, time-based, and immutable collection of data to support management decision-making. Data in the warehouse comes from operational systems (OLTP systems), which are designed to automate business processes; the warehouse can be replenished from external sources, for example, statistical reports.
Despite the fact that they contain obviously redundant information that is already in databases or operating system files, data warehouses are necessary because:
The purpose of the repository is to provide the “raw material” for analysis in one place and in a simple, understandable structure. That is, the concept of Data Warehousing is not a concept of data analysis, rather it is a concept of preparing data for analysis. It involves the implementation of a single integrated data source.
When using OLAP products, two questions are important: how and where keep And process data. Depending on how these two processes are implemented, OLAP architectures are distinguished. There are 3 ways to store data for OLAP and 3 ways to process this data. Many manufacturers offer several options, some try to prove that their approach is the single most prudent one. This is, of course, absurd. However, very few products can operate in more than one mode efficiently.
OLAP data storage options
Storage in this context means keeping data in a constantly updated state.
Multidimensional databases: assume that data is stored in a multidimensional database on a server. It can include data extracted and summarized from other systems and relational databases, end-user files, etc. In most cases, multidimensional databases are stored on disk, but some products allow you to use RAM, calculating the most frequently used data on the fly " Very few products based on multidimensional databases allow multiple editing of data; many products allow single editing but multiple readings of data, while others are limited to reading only.
These three storage locations have different storage capabilities, and they are arranged in descending order of capacity. They also have different query performance characteristics: relational databases are much slower than the latter two options.
Options for processing OLAP data
There are 3 of the same data processing options:
Server-side multidimensional processing: This is a popular place to perform multidimensional calculations in client-server OLAP applications and is used in many products. Performance is usually high because most of the calculations have already been done. However, this requires a lot of disk space.
Matrix of OLAP architectures
Accordingly, by combining storage/processing options, it is possible to obtain a matrix of OLAP system architectures. Accordingly, theoretically there can be 9 combinations of these methods. However, since 3 of them lack common sense, in reality there are only 6 options for storing and processing OLAP data.
Multidimensional storage options Options |
Relational database |
Server-side multidimensional database |
Client computer |
Cartesis Magnitude |
|||
Multidimensional server processing |
Crystal Holos (ROLAP mode) IBM DB2 OLAP Server CA EUREKA:Strategy Informix MetaCube Speedware Media/MR Microsoft Analysis Services Oracle Express (ROLAP mode) Pilot Analysis Server Applix iTM1 |
Crystal Holos Comshare Decision Hyperion Essbase Oracle Express Speedware Media/M Microsoft Analysis Services PowerPlay Enterprise Server Pilot Analysis Server Applix iTM1 |
|
Multidimensional processing on the client computer |
Oracle Discoverer Informix MetaCube |
Dimensional Insight Hyperion Enterprise |
Cognos PowerPlay Personal Express iTM1 Perspectives |
Since it is storage that determines processing, it is customary to group by storage options, that is:
MOLAP products – in sectors 4 and 5
HOLAP products (allowing both multidimensional and relational data storage options) – in 2 and 4 (in italics)
There are more than 40 OLAP vendors, although they cannot all be considered competitors because their capabilities are very different and, in fact, they operate in different market segments. They can be grouped into 4 fundamental categories, the differences between which are based on the following concepts: complex functionality - simple functionality, performance - disk space. It is useful to depict categories in the shape of a square because it clearly shows the relationships between them. The distinctive feature of each category is represented on its side, and the similarities with others are represented on the adjacent sides, therefore, the categories on opposite sides are fundamentally different.
Peculiarities |
Advantages |
Flaws |
Representatives |
|
Applied OLAP |
Complete applications with rich functionality. Almost all require a multidimensional database, although some work with a relational one. Many of this category of applications are specialized, such as sales, manufacturing, banking, budgeting, financial consolidation, sales analysis |
Possibility of integration with various applications High level of functionality High level of flexibility and scalability |
Application complexity (user training required) High price |
Hyperion Solutions Crystal Decisions Information Builders |
The product is based on a non-relational data structure that provides multidimensional storage, processing and presentation of data. During the analysis process, data is selected exclusively from a multidimensional structure. Despite the high level of openness, suppliers persuade buyers to purchase their own tools |
High performance (fast calculations of summary indicators and various multidimensional transformations for any of the dimensions). The average response time to an ad hoc analytical query when using a multidimensional database is usually 1-2 orders of magnitude less than in the case of an RDB High level of openness: a large number of products with which integration is possible They easily cope with the tasks of including various built-in functions in the information model, carrying out specialized analysis by the user, etc. |
The need for large disk space to store data (due to redundancy of data that is stored). This is an extremely inefficient use of memory - due to denormalization and pre-executed aggregation, the volume of data in a multidimensional database corresponds to 2.5-100 times less than the volume of the original detailed data. In any case, MOLAP does not allow you to work with large databases data. The real limit is a database of 10-25 gigabytes The potential for a database “explosion” is an unexpected, sharp, disproportionate increase in its volume Lack of flexibility when it comes to modifying data structures. Any change in the structure of dimensions almost always requires a complete restructuring of the hypercube For multidimensional databases, there are currently no uniform standards for the interface, languages for describing and manipulating data |
Hyperion (Essbase) |
|
DOLAP (Desktop OLAP) |
Client OLAP products that are fairly easy to implement and have a low cost per seat We are talking about such analytical processing where hypercubes are small, their dimension is small, the needs are modest, and for such analytical processing a personal machine on a desktop is sufficient The goal of the producers of this market is to automate hundreds and thousands of jobs, but users must perform a fairly simple analysis. Buyers are often encouraged to buy more jobs than necessary |
Good integration with databases: multidimensional, relational Possibility of making complex purchases, which reduces the cost of implementation projects Ease of use of applications |
Very limited functionality (not comparable in this regard with specialized products) Very limited power (small data volumes, small number of measurements) |
Cognos (PowerPlay) Business Objects Crystal Decisions |
This is the smallest sector of the market. Detailed data remains where it was originally - in the relational database; some aggregates are stored in the same database in specially created service tables |
Capable of handling very large amounts of data (cost-effective storage) Provide a multi-user mode of operation, including editing mode, and not just reading Higher level of data protection and good options for differentiating access rights Frequent changes to the measurement structure are possible (do not require physical reorganization of the database) |
Low performance, significantly inferior in terms of response speed to multidimensional ones (response to complex queries measured in minutes or even hours rather than seconds). These are better report builders than interactive analytics tools Complexity of products. Requires significant maintenance costs from information technology specialists. To provide performance comparable to MOLAP, relational systems require careful design of the database schema and configuration of indexes, that is, a lot of effort on the part of database administrators Expensive to implement The limitations of SQL remain a reality, which prevents the implementation in RDBMS of many built-in functions that are easily provided in systems based on a multidimensional representation of data |
Information Advantage Informix (MetaCube) |
It should be noted that consumers of hybrid products that allow the choice of ROLAP and MOLAP mode, such as Microsoft Analysis Services, Oracle Express, Crystal Holos, IBM DB2 OLAPServer, almost always select MOLAP mode.
Each of the presented categories has its own strengths and weaknesses, there is no one optimal choice. The choice affects 3 important aspects: 1) performance; 2) disk space for data storage; 3) capabilities, functionality and especially the scalability of the OLAP solution. In this case, it is necessary to take into account the volume of data being processed, the power of the equipment, the needs of users and seek a compromise between speed and redundancy of disk space occupied by the database, simplicity and versatility.
Classification of Data Warehouses in accordance with the volume of the target database
Disadvantages of OLAP
Like any technology, OLAP also has its drawbacks: high requirements for hardware, training and knowledge of administrative personnel and end users, high costs for the implementation of the implementation project (both monetary and time, intellectual).
Choosing the right OLAP product is difficult, but very important if you want the project to not fail.
As you can see, product differences lie in many areas: functional, architectural, technical. Some products are very limited in settings. Some are created for specialized subject areas: marketing, sales, finance. There are products for general purposes, which do not have an application specific use, which must be quite flexible. As a rule, such products are cheaper than specialized ones, but the implementation costs are higher. The range of OLAP products is very wide - from the simplest tools for building pivot tables and charts that are part of office products, to tools for analyzing data and searching for patterns, which cost tens of thousands of dollars.
As in any other field, in the field of OLAP there cannot be clear guidelines for choosing tools. You can only focus on a number of key points and compare the proposed software capabilities with the needs of the organization. One thing is important: without properly thinking about how you are going to use OLAP tools, you risk creating a huge headache for yourself.
During the selection process, there are 2 questions to consider:
Then compare all this and, in fact, make a choice.
You can't make a rational product choice without understanding what it will be used for. Many companies want the “best possible product” without a clear understanding of how it should be used.
In order for the project to be successfully implemented, the financial director must, at a minimum, competently formulate his wishes and requirements to the manager and automation service specialists. Many problems arise due to insufficient preparation and awareness for the choice of OLAP; IT specialists and end users experience communication difficulties simply because they manipulate different concepts and terms during conversation and put forward conflicting preferences. There needs to be consistency in goals within the company.
Some factors have already become obvious after reading the overview of OLAP product categories, namely:
Technical aspects
Amounts of input information that are available and that will appear in the future
Users
Key Features: Data Writeback Needs, Distributed Computing, Complex Currency Conversions, Report Printing Needs, Spreadsheet Interface, Application Logic Complexity, Dimensions Required, Analysis Types: Statistical, Goal Search, What-If Analysis
Implementation
Despite the fact that the cost of analytical systems remains quite high even today, and the methodologies and technologies for implementing such systems are still in their infancy, today the economic effect they provide significantly exceeds the effect of traditional operational systems.
The effect of proper organization, strategic and operational planning of business development is difficult to quantify in advance, but it is obvious that it can exceed the costs of implementing such systems by tens and even hundreds of times. However, one should not be mistaken. The effect is ensured not by the system itself, but by the people working with it. Therefore, declarations like: “a system of Data Warehousing and OLAP technologies will help the manager make the right decisions” are not entirely correct.” Modern analytical systems are not artificial intelligence systems and they can neither help nor hinder decision making. Their goal is to promptly provide the manager with all the information necessary to make a decision in a convenient form. And what information will be requested and what decision will be made based on it depends only on the specific person using it.
All that remains to be said is that these systems can help solve many business problems and can have far-reaching positive effects. It remains to be seen who will be the first to realize the benefits of this approach and be ahead of the others.
OLAP(from the English OnLine Analytical Processing - operational analytical data processing, also: analytical data processing in real time, interactive analytical data processing) - an approach to analytical data processing based on their multidimensional hierarchical representation, which is part of the broader field of information technology - business -analysts ().
For a catalog of OLAP solutions and projects, see the OLAP section on TAdviser.
From the user's point of view, OLAP-systems provide tools for flexible viewing of information in various sections, automatic receipt of aggregated data, performing analytical operations of convolution, detailing, and comparison over time. All this makes OLAP systems a solution with obvious advantages in the field of data preparation for all types of business reporting, involving the presentation of data in various sections and different levels of hierarchy - for example, sales reports, various forms of budgets, and so on. The advantages of such a representation in other forms of data analysis, including forecasting, are obvious.
The key requirement for OLAP systems is speed, which allows them to be used in the process of an analyst’s interactive work with information. In this sense, OLAP systems are contrasted, firstly, with traditional RDBMS, samples from which with typical queries for analysts using grouping and aggregation of data are usually expensive in terms of waiting time and loading of the RDBMS, so interactive work with them for any significant volumes data is complex. Secondly, OLAP systems are also opposed to the usual flat-file presentation of data, for example, in the form of frequently used traditional spreadsheets, the presentation of multidimensional data in which is complex and not intuitive, and operations for changing the slice - points of view on the data - also require time and complicate interactive work with data.
At the same time, on the one hand, data requirements specific to OLAP systems usually imply storing data in special structures optimized for typical OLAP tasks; on the other hand, directly extracting data from existing systems during the analysis process would lead to a significant drop in their performance.
Therefore, an important requirement is to ensure the most flexible import-export link between existing systems, acting as a data source and an OLAP system, as well as an OLAP system and external data analysis and reporting applications.
Moreover, such a combination must satisfy the obvious requirements of supporting import-export from several data sources, implementing procedures for cleaning and transforming data, unifying the used classifiers and reference books. In addition to these requirements is added the need to take into account various data update cycles in existing information systems and unification of the required level of data detail. The complexity and versatility of this problem led to the emergence of the concept of data warehouses, and, in a narrow sense, to the identification of a separate class of data conversion and transformation utilities - ETL (Extract Transform Load).
We indicated above that OLAP assumes a multidimensional hierarchical representation of data, and, in a sense, is opposed to RDBMS-based systems.
This, however, does not mean that all OLAP systems use a multidimensional model to store active, “working” system data. Since the active data storage model influences all requirements dictated by the FASMI test, its importance is emphasized by the fact that it is on this basis that OLAP subtypes are traditionally distinguished - multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP).
However, some experts, led by the aforementioned Nigel Pends, indicate that classification based on one criterion is not complete enough. Moreover, the vast majority of existing OLAP systems will relate to hybrid type. Therefore, we will dwell in more detail on active data storage models, mentioning which of them correspond to which of the traditional OLAP subtypes.
In this case, OLAP data is stored in multidimensional DBMSs that use designs optimized for this type of data. Typically, multidimensional DBMSs support all typical OLAP operations, including aggregation along the required hierarchy levels and so on.
This type of data storage, in a sense, can be called classic for OLAP. However, all steps for preliminary data preparation are fully necessary for it. Typically, multidimensional DBMS data is stored on disk, however, in some cases, to speed up data processing, such systems allow data to be stored in RAM. For the same purposes, storing pre-calculated aggregate values and other calculated values in the database is sometimes used.
Multidimensional DBMSs that fully support multi-user access with competing read and write transactions are quite rare; the usual mode for such DBMSs is single-user with write access with multi-user read access, or multi-user read-only.
Among the shortcomings characteristic of some implementations of multidimensional DBMSs and OLAP systems based on them, one can note their susceptibility to an unpredictable growth in the amount of space occupied by the database from a user point of view. This effect is caused by the desire to minimize the reaction time of the system, which dictates storing pre-calculated values of aggregate indicators and other quantities in the database, which causes a nonlinear increase in the volume of information stored in the database with the addition of new data values or measurements.
The extent to which this problem manifests itself, as well as the related problems of efficiently storing sparse data cubes, is determined by the quality of the approaches and algorithms used for specific implementations of OLAP systems.
OLAP data can also be stored in a traditional RDBMS. In most cases, this approach is used when trying to “painlessly” integrate OLAP with existing accounting systems or RDBMS-based data warehouses. At the same time, this approach requires some additional capabilities from the RDBMS to ensure effective fulfillment of the requirements of the FASMI test (in particular, ensuring minimal system response time). Typically, OLAP data is stored in denormalized form, and some of the pre-calculated aggregates and values are stored in special tables. When stored in a normalized form, the effectiveness of RDBMS as a method for storing active data decreases.
The problem of choosing effective approaches and algorithms for storing precomputed data is also relevant for OLAP systems based on RDBMS, so manufacturers of such systems usually focus on the merits of the approaches used.
In general, it is believed that RDBMS-based OLAP systems slower systems, based on multidimensional DBMSs, including due to data storage structures that are less efficient for OLAP tasks, but in practice this depends on the characteristics of a particular system.
Among the advantages of storing data in an RDBMS is the greater scalability of such systems.
This approach involves storing chunks of data in regular files. It is usually used as an addition to one of the two main approaches in order to speed up work by caching current data on disk or in the RAM of the client PC.
Most manufacturers of OLAP systems that promote their comprehensive solutions, often including, in addition to the OLAP system itself, DBMS, ETL (Extract Transform Load) and reporting tools, currently use a hybrid approach to organizing the storage of active system data, distributing it in one way or another between RDBMS and specialized storage, as well as between disk structures and in-memory caching.
Since the effectiveness of such a solution depends on the specific approaches and algorithms used by the manufacturer to determine whether what data and where to store, then hastily draw conclusions about the initially greater efficiency of such solutions as a class without assessing the specific features of the system under consideration.
OLAP(eng. on-line analytical processing) – a set of methods for dynamic processing of multidimensional queries in analytical databases. Such data sources are usually quite large, and one of the most important requirements in the tools used to process them is high speed. In relational databases, information is stored in separate tables that are well normalized. But complex multi-table queries are executed quite slowly in them. Much best performance processing speed in OLAP systems is achieved due to the peculiarities of the data storage structure. All information is clearly organized, and two types of data storage are used: measurements(contain directories divided into categories, for example, points of sale, clients, employees, services, etc.) and data(characterize the interaction of elements various measurements, for example, on March 3, 2010, seller A provided a service to customer B in store C for the amount of D monetary units). Measures are used to calculate results in the Analysis Cube. Measures are collections of facts aggregated by corresponding selected dimensions and their elements. Thanks to these features, complex queries with multidimensional data take much less time than with relational sources.
One of the main vendors of OLAP systems is Microsoft Corporation. Let's look at the implementation of OLAP principles using practical examples of creating an analytical cube in the Microsoft SQL Server Business Intelligence Development Studio (BIDS) and Microsoft Office PerformancePoint Server Planning Business Modeler (PPS) applications and get acquainted with the possibilities of visually representing multidimensional data in the form of graphs, charts and tables.
For example, in BIDS it is necessary to create an OLAP cube using data about an insurance company, its employees, partners (clients) and points of sale. Let's assume that the company provides one type of service, so measurement of services will not be needed.
First let's define the measurements. The following entities (data categories) are associated with the company’s activities:
As you can see, the fact table is connected to the dimension tables through a one-to-one correspondence of identifier fields (PartnerID, EmployeeID, etc.).
Let's look at the result. On the cube explorer tab, by dragging measures and dimensions into the totals, rows, columns and filters fields, we can get a view of the data of interest (for example, concluded transactions on insurance contracts concluded by a certain employee in 2005).