Data mining

Data mining or data exploration (the analysis stage of "knowledge discovery in databases", or KDD) is a field of statistics and computer science that refers to the process of trying to discover patterns in large volumes of data. It uses methods from artificial intelligence, machine learning, statistics, and database systems. The general objective of the data mining process is to extract information from a data set and transform it into an understandable structure for later use. In addition to the raw analysis stage, it involves aspects of data and database management, data preprocessing, model and inference considerations, interestingness metrics, computational complexity considerations, post-processing of the discovered structures, visualization, and online updating.

The term is a buzzword and is frequently misused to refer to any form of large-scale data or information processing (collection, extraction, storage, analysis, and statistics); it has also been generalized to any type of computer decision support system, including artificial intelligence, machine learning, and business intelligence. In this usage, the key term is discovery, commonly defined as "the detection of something new". Even the popular book "Data Mining: Practical Machine Learning Tools and Techniques with Java" (which covers mostly machine learning material) was originally going to be called simply "Practical Machine Learning", and the term "data mining" was added for marketing reasons. Often the more general terms "(large-scale) data analysis" and "analytics", or, when referring to the actual methods, "artificial intelligence" and "machine learning", are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large amounts of data to extract previously unknown interesting patterns, such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves the use of database techniques such as spatial indexes. These patterns can then be seen as a kind of summary of the input data and can be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step could identify several groups in the data, which can then be used to obtain more accurate prediction results in a decision support system. Neither data collection, data preparation, nor the interpretation of results and information are part of the data mining step; they belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample portions of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any pattern discovered. These methods can, however, be used in the creation of new hypotheses to be tested against larger data populations.

Process

A typical data mining process, understood as a set of successive stages, consists of the following general steps:

  1. Selection of the dataset: both in terms of the target variables (those to be predicted, calculated or inferred) and the independent variables (those used to make the calculation or the process), and possibly the sampling of the available records.
  2. Analysis of the properties of the data: especially histograms, scatter plots, the presence of outliers and the absence of data (null values).
  3. Transformation of the input dataset: carried out in various ways depending on the previous analysis, with the aim of preparing it for the data mining technique that best suits the data and the problem; this step is also known as preprocessing of the data. A substantial problem associated with the development of such systems when they contain English text is the size of their vocabulary, which is larger than that of any other language in the world. One method now used in these cases is simplification before proceeding with the process, converting the text into Basic English, which contains only 1,000 words that are also used to describe, in footnotes, the meaning of the more than 30,000 words defined in the "Basic Sciences Dictionary".
  4. Selection and application of the data mining technique: the predictive, classification or segmentation model is built.
  5. Extraction of knowledge: by means of a data mining technique, a knowledge model is obtained that represents patterns of behavior observed in the values of the variables of the problem, or relationships of association between those variables. Several techniques can also be used at the same time to generate different models, although each technique usually requires a different preprocessing of the data.
  6. Data interpretation and evaluation: once the model has been obtained, it must be validated by verifying that the conclusions it yields are valid and sufficiently satisfactory. If several models have been obtained using different techniques, they should be compared to find the one that best fits the problem. If none of the models achieves the expected results, any of the previous steps should be altered to generate new models.

If the final model does not pass this evaluation, the process could be repeated from the beginning or, if the expert considers it appropriate, from any of the previous steps. This feedback can be repeated as many times as deemed necessary until a valid model is obtained.
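As an illustration only, the following sketch walks through steps 1 to 6 with pandas and scikit-learn; the file name, the column names and the choice of a decision tree are assumptions made for the example, not part of any particular methodology.

```python
# Hypothetical walk-through of the six steps above; "customers.csv" and the
# column names are assumptions made for the sake of the example.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# 1. Selection of the dataset: target variable and independent variables.
data = pd.read_csv("customers.csv")
X = data[["age", "monthly_spend", "visits"]]   # independent variables (assumed)
y = data["target"]                             # variable to be predicted (assumed)

# 2. Analysis of data properties: distributions, outliers, missing values.
print(data.describe())
print(data.isna().sum())

# 3. Transformation / preprocessing of the input dataset.
X = X.fillna(X.median())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4 and 5. Selection and application of a technique; extraction of a knowledge model.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# 6. Interpretation and evaluation: validate the model on held-out data.
print(classification_report(y_test, model.predict(X_test)))
```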

Once the model has been validated, if it turns out to be acceptable (it provides adequate outputs and/or with admissible margins of error), it is ready for exploitation. The models obtained by data mining techniques are applied by incorporating them into the information analysis systems of organizations, and even into transactional systems. In this regard, it is worth noting the efforts of the Data Mining Group, which is standardizing the PMML language (Predictive Model Markup Language), so that data mining models are interoperable on different platforms, regardless of the system with which they have been built. The main manufacturers of database systems and information analysis programs make use of this standard.

Traditionally, data mining techniques were applied to information contained in data warehouses. In fact, many large companies and institutions have created and feed databases specially designed for data mining projects in which they centralize potentially useful information from all their business areas. However, unstructured data mining such as information contained in text files, on the Internet, etc., is currently gaining increasing importance.

Protocol of a data mining project

A data mining project essentially consists of five phases:

  • Understanding of the business and the problem to be solved.
  • Determination, obtaining and cleaning of the necessary data.
  • Creation of mathematical models.
  • Validation and communication of the results obtained.
  • Integration, if applicable, of the results into a transactional or similar system.

The relationship between all these phases is only linear on paper. In reality, it is much more complex and hides a whole hierarchy of sub-phases. Through the accumulated experience in data mining projects, methodologies have been developed to manage this complexity in a more or less uniform way.

Data mining techniques

As already mentioned, data mining techniques come from artificial intelligence and statistics. These techniques are nothing more than algorithms, more or less sophisticated, that are applied to a data set to obtain results.

The most representative techniques are:

  • Neural networks.- They are a machine learning and processing paradigm inspired by the way the nervous system of animals works. It is a system of interconnected neurons in a network that collaborate to produce an output stimulus. Some examples of neural networks are:
    • The perceptron.
    • The multilayer perceptron.
    • Self-organizing maps, also known as Kohonen networks.
  • Linear regression.- It is the most widely used technique for relating data. Fast and effective, but insufficient in multidimensional spaces where more than two variables may be related.
  • Decision trees.- A decision tree is a prediction model used in the fields of artificial intelligence and predictive analytics. Given a database, these logical construction diagrams, very similar to rule-based prediction systems, are built to represent and categorize a series of conditions that occur successively in order to solve a problem. Examples:
    • The ID3 algorithm.
    • The C4.5 algorithm.
  • Statistical models.- A symbolic expression in the form of an equality or equation, used in experimental design and regression to indicate the different factors that modify the response variable.
  • Grouping or clustering.- A procedure for grouping a series of vectors according to criteria that are usually based on distance; it tries to arrange the input vectors so that those with common characteristics end up closer together (a minimal sketch follows this list). Examples:
    • The K-means algorithm.
    • The K-medoids algorithm.
  • Association rules.- They are used to discover facts that occur together within a given set of data.
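As announced in the clustering item above, the following minimal sketch uses K-means from scikit-learn on synthetic two-dimensional vectors; the data and the number of clusters are chosen purely for demonstration.

```python
# Minimal clustering sketch with K-means (scikit-learn); the data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of 2-D input vectors with common characteristics.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)               # one centroid near (0, 0), the other near (5, 5)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # samples from each group share a label
```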

According to the objective of the data analysis, the algorithms used are classified as supervised and unsupervised (Weiss and Indurkhya, 1998):

  • Supervised (or predictive) algorithms: predict the value of an attribute (or a set of attributes) that is unknown a priori from other, known attributes.
  • Unsupervised (or knowledge-discovery) algorithms: discover patterns and trends in the data.

Examples of use of data mining

Business

Data mining can make a significant contribution to customer relationship-based business management applications. Instead of contacting the customer indiscriminately through a call center or by sending emails, only those who are perceived to have a higher probability of responding positively to a certain offer or promotion will be contacted.

Companies that employ data mining typically see a return on investment, but also recognize that the number of predictive models developed can grow very quickly. Instead of building a single model to predict which customers may leave, the company could build separate models for each region and/or each type of customer. It may also want to determine which customers are going to be profitable during a window of time (a fortnight, a month, ...) and only send offers to the people who are likely to be profitable. To maintain this number of models, it is necessary to manage the versions of each model and move toward data mining that is as automated as possible.

In such a changing environment, where the volumes of measurable data grow exponentially thanks to digital marketing, "the waits caused by technical departments and expert statisticians end up making the analysis results useless" for business users and decision makers. This explains why data mining tool providers are working on applications that are easier to use, in what is known as visual data mining, and why demand for this type of business analyst has skyrocketed in recent years. According to Gartner, it is foreseeable that during 2016-2017 there will only be "qualified professionals to cover a third of the positions".

Shopping cart analysis

The classic example of the application of data mining has to do with the detection of purchasing habits in supermarkets. A widely cited study found that on Fridays an unusually high number of customers were purchasing diapers and beer at the same time. It was found that this was because on that day young parents went to the supermarket, their plan for the weekend being to stay at home taking care of their child and watching television with a beer in hand. The supermarket was able to increase its beer sales by placing it next to the diapers to encourage impulse purchases.
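The diapers-and-beer rule can be expressed with the usual association-rule measures, support and confidence. The following sketch computes both by hand on a made-up list of transactions; real market-basket analysis would use an algorithm such as Apriori over millions of receipts.

```python
# Hand-computed support and confidence for a rule like {diapers} -> {beer};
# the transactions below are invented for illustration only.
transactions = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))        # 0.4
print(confidence({"diapers"}, {"beer"}, transactions))   # ~0.67
```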

Churn patterns

A more common example is the detection of churn patterns. In many industries, such as banking and telecommunications, there is an understandable interest in detecting as soon as possible those customers who may be thinking of terminating their contracts, possibly to switch to the competition. These customers, depending on their value, could be given personalized offers, special promotions, etc., with the ultimate goal of retaining them. Data mining helps determine which customers are most likely to unsubscribe by studying their behavior patterns and comparing them with samples of customers who have actually unsubscribed in the past.
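As a rough sketch of this approach, a classifier can be trained on historical customers whose outcome is already known and then used to score current customers; the file names and feature columns below are assumptions, and a real system would validate the model carefully before acting on the scores.

```python
# Hedged sketch: score current customers by churn risk using a model fitted
# on historical churn data; all file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

history = pd.read_csv("past_customers.csv")      # customers whose outcome is known
current = pd.read_csv("current_customers.csv")   # customers still under contract

features = ["calls_to_support", "months_active", "monthly_spend"]  # assumed columns
model = LogisticRegression(max_iter=1000)
model.fit(history[features], history["cancelled"])  # 1 = terminated the contract

# Probability that each current customer will cancel; the highest-risk
# customers are the candidates for a personalized retention offer.
current["churn_risk"] = model.predict_proba(current[features])[:, 1]
print(current.sort_values("churn_risk", ascending=False).head(10))
```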

Frauds

An analogous case is the detection of money laundering transactions or fraud in the use of credit cards or mobile phone services, and even fraud by taxpayers against the tax authority. Generally, these fraudulent or illegal operations tend to follow characteristic patterns that allow them, with a certain degree of probability, to be distinguished from legitimate ones, so that mechanisms can be developed to take quick action against them.

Human resources

Data mining can also be useful for human resources departments in identifying the characteristics of their most successful employees. The information obtained can help in recruiting staff, by focusing on the efforts of employees and the results they obtain. In addition, the help offered by applications for strategic management in a company translates into advantages at the corporate level, such as improving the profit margin or sharing objectives, and into better operational decisions, such as the development of production plans or workforce management.

Internet behavior

An area that is also in vogue is the analysis of the behavior of visitors, especially when they are potential customers, on a website; or the use of information about them, obtained by more or less legitimate means, to offer them advertising specifically adapted to their profile; or, once they purchase a certain product, knowing immediately what else to offer them, taking into account the historical information available about the customers who have purchased the first one.

Terrorism

Data mining has been cited as the method by which the US Army's Able Danger unit had identified the leader of the September 11, 2001 attacks, Mohammed Atta, and three other "9/11" hijackers as possible members of an al Qaeda cell operating in the US more than a year before the attack. It has been suggested that both the Central Intelligence Agency and its Canadian counterpart, the Canadian Security Intelligence Service, have also employed this method.

Games

Since the early 1960s, the availability of oracles for certain combinatorial games, also called board game endgame databases (for example, for tic-tac-toe or chess endgames), for any starting configuration has opened a new area in data mining that consists of extracting the strategies people use from these oracles. Current approaches to pattern recognition do not seem to be successfully applicable to the operation of these oracles. Instead, the production of insightful patterns is based on extensive experimentation with databases of those endgames, combined with intensive study of the endgames themselves in well-designed problems and with knowledge of the technique (prior data about the endgame). Notable examples of researchers working in this field are Berlekamp on the game of dots-and-boxes (or Timbiriche) and John Nunn on chess endgames.

Video Games

Over the years, technologies and advances related to data mining have been incorporated into different business processes, and the video game industry has not been left behind in this field. The need to know its consumers and their tastes is fundamental to surviving in an environment as competitive as this one, and different data is needed even before the project idea for a new video game is started. Large developer companies have suffered cancellations, losses, failures and, in some cases, even bankruptcy due to mishandling of information. In recent years, these video game development companies have understood the great importance of the content they handle and of how the consumer perceives it, which is why they have turned to hiring the services of companies specialized in this data mining sector, in order to present quality products that the public really likes, based on the analysis of the information gathered about their target audience over the years.

Science and Engineering

In recent years, data mining has been widely used in various areas related to science and engineering. Some application examples in these fields are:

Genetics

In the study of human genetics, the primary goal is to understand the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. In simpler terms, it is about knowing how changes in an individual's DNA sequence affect the risk of developing common diseases (such as cancer). This is very important in helping to improve the diagnosis, prevention and treatment of diseases. The data mining technique used to accomplish this task is known as "multifactor dimensionality reduction".

Electrical Engineering

In the field of electrical engineering, data mining techniques have been widely used to monitor the conditions of high voltage installations. The purpose of this monitoring is to obtain valuable information on the state of the insulation of the equipment. For the monitoring of vibrations or the analysis of load changes in transformers, certain techniques are used for data grouping (clustering) such as self-organizing maps (SOM: Self-organizing map). These maps are used to detect abnormal conditions and to estimate the nature of those anomalies.

Gas analysis

Data mining techniques have also been applied for the analysis of dissolved gases (DGA: Dissolved gas analysis) in electrical transformers. Dissolved gas analysis has long been known as the tool for diagnosing transformers. Self-organizing maps (SOM) are used to analyze data and determine trends that could be missed using classical techniques (DGA).
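To illustrate how a self-organizing map can flag abnormal readings, the sketch below trains a small SOM written in plain NumPy on synthetic gas-measurement vectors and uses the quantization error to single out suspect samples; real DGA or vibration monitoring would rely on domain-specific data and careful tuning, so this is only a sketch of the technique.

```python
# Illustrative self-organizing map (SOM) in plain NumPy, trained on synthetic
# "dissolved gas" vectors; the data and all hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 4))      # mostly normal readings
anomalies = rng.normal(6.0, 1.0, size=(5, 4))     # a few abnormal readings
data = np.vstack([normal, anomalies])

rows, cols, dim = 10, 10, data.shape[1]
weights = rng.normal(0.0, 1.0, size=(rows, cols, dim))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

def best_matching_unit(x):
    # Coordinates of the map unit whose weight vector is closest to x.
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

# Training: pull the winning unit and its neighbours toward each sample.
for t in range(2000):
    x = data[rng.integers(len(data))]
    lr = 0.5 * np.exp(-t / 1000)          # decaying learning rate
    radius = 3.0 * np.exp(-t / 1000)      # decaying neighbourhood radius
    bmu = np.array(best_matching_unit(x))
    dist2 = np.sum((grid - bmu) ** 2, axis=-1)
    h = np.exp(-dist2 / (2 * radius ** 2))[..., None]
    weights += lr * h * (x - weights)

# Quantization error per sample: unusually large values flag abnormal readings.
errors = np.array([np.linalg.norm(x - weights[best_matching_unit(x)]) for x in data])
print("typical error:", errors[:500].mean(), "suspect samples:", np.argsort(errors)[-5:])
```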

Data mining and other similar disciplines

There is some controversy about defining the boundaries between data mining and analogous disciplines, such as statistics, artificial intelligence, etc. There are those who argue that data mining is nothing more than statistics wrapped in business jargon that turns it into a salable product. Others, on the other hand, find in it a series of specific problems and methods that make it different from other disciplines.

The fact is that, in practice, all the models and algorithms in general use in data mining, such as neural networks, regression and classification trees, logistic models and principal component analysis, have a relatively long tradition in other fields.

From statistics

Certainly, data mining draws on statistics, from which it takes the following techniques:

  • Analysis of variance: evaluates the existence of significant differences between the means of one or more continuous variables in different populations.
  • Regression: defines the relationship between one or more variables and a set of predictor variables of the former.
  • Chi-square test: used to test the hypothesis of dependence between variables (a small example follows this list).
  • Grouping or clustering analysis: allows the classification of a population of individuals characterized by multiple attributes (binary, qualitative or quantitative) into a certain number of groups, based on the similarities or differences between individuals.
  • Discriminant analysis: allows the classification of individuals into previously established groups; it makes it possible to find the classification rule for the elements of these groups, and therefore a better identification of the variables that define group membership.
  • Time series: allows the study of the evolution of a variable over time in order to make predictions, from that knowledge and under the assumption that no structural changes will occur.
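As a small example of the chi-square test mentioned in the list, the snippet below applies SciPy's independence test to an invented contingency table; the counts have no real-world source.

```python
# Chi-square test of independence between two categorical variables (SciPy);
# the contingency table is made up for the example.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: customer segment; columns: bought the product yes/no (invented counts).
table = np.array([[30, 70],
                  [55, 45]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)   # a small p-value suggests the variables are not independent
```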

From computer science

From computer science it takes the following techniques:

  • Genetic algorithms: numerical optimization methods in which the variable or variables to be optimized, together with the study variables, are encoded as a segment of information. Those configurations of the analysis variables that obtain better values for the response variable correspond to segments with greater reproductive capacity. Through reproduction, the best segments persist and their proportion grows from generation to generation. Random elements can also be introduced to modify the variables (mutations). After a certain number of iterations, the population consists of good solutions to the optimization problem, because the bad solutions have been discarded, iteration after iteration (a toy example follows this list).
  • Artificial intelligence: the available data is analyzed by means of a computer system that simulates an intelligent system. Artificial intelligence systems include expert systems and neural networks.
  • Expert systems: systems created from practical rules extracted from the knowledge of experts, mainly based on inferences or cause-effect relationships.
  • Intelligent systems: similar to expert systems, but better able to cope with new situations unknown to the expert.
  • Neural networks: generically, parallel numerical processing methods in which the variables interact by means of linear or non-linear transformations to produce outputs. These outputs are compared with the expected ones, based on test data, in a feedback process through which the network is reconfigured until an appropriate model is obtained.
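The following toy genetic algorithm mirrors the description in the list: bit-string segments, fitness-proportional reproduction, one-point crossover and random mutation. The fitness function (maximizing the number of 1 bits) is chosen only to keep the example short and is not tied to any real data mining problem.

```python
# Toy genetic algorithm: bit-string individuals, fitness-proportional
# reproduction, one-point crossover and random mutation.
import random

random.seed(0)
LENGTH, POP_SIZE, GENERATIONS, MUTATION = 20, 30, 50, 0.01

def fitness(bits):
    # Illustrative objective: maximize the number of 1 bits in the segment.
    return sum(bits)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Reproduction: fitter segments are selected more often.
    weights = [fitness(ind) + 1 for ind in population]
    parents = random.choices(population, weights=weights, k=POP_SIZE)
    next_gen = []
    for i in range(0, POP_SIZE, 2):
        a, b = parents[i], parents[i + 1]
        cut = random.randint(1, LENGTH - 1)          # one-point crossover
        for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if random.random() < MUTATION else bit for bit in child]
            next_gen.append(child)
    population = next_gen

best = max(population, key=fitness)
print(fitness(best), best)   # after enough iterations, close to all ones
```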

Data mining based on information theory

All traditional data mining tools assume that the data they will use to build the models contains the information necessary to achieve the intended purpose: to obtain knowledge that can be applied to the business (or the problem) in order to obtain a benefit (or a solution).

The downside is that this is not necessarily true. In addition, there is another, even bigger, problem: once the model is built, it is not possible to know whether it has captured all the information available in the data. For this reason, common practice is to run several models with different parameters to see whether any of them achieves better results.

A relatively new approach to data analysis addresses these problems by making the practice of data mining more like a science than an art.

In 1948 Claude Shannon published a paper called "A Mathematical Theory of Communication". His work later became known as Information Theory and laid the foundations of communication and information coding. Shannon proposed a way of measuring the amount of information, expressed in bits.

In 1999 Dorian Pyle published a book called "Data Preparation for Data Mining" in which he proposed a way of using Information Theory to analyze data. In this approach, a database is a channel that transmits information. On one side is the real world, which captures data generated by the business. On the other are all the important business situations and problems. And the information flows from the real world, through the data, to the business problem.

With this perspective, and using Information Theory, it is possible to measure the amount of information available in the data and what portion of it can be used to solve business problems. As a practical example, the data might be found to contain 65% of the information needed to predict which customers will terminate their contracts. In this way, if the final model is capable of making predictions with 60% accuracy, it can be concluded that the tool that generated the model did a good job of capturing the available information. If, on the other hand, the model had only a 10% hit rate, trying other models, or even other tools, might be worthwhile.
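A rough way to make this idea concrete is to compare the entropy of the target variable with the mutual information that each available variable shares with it. The sketch below does this with scikit-learn on a bundled example dataset; the resulting percentages are illustrative estimates, not the exact procedure Pyle describes.

```python
# Rough sketch: estimate how much of the information about the target
# (measured in bits) is carried by each available variable.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Entropy of the target: the total information a perfect model would need.
p = np.bincount(y) / len(y)
target_entropy = -np.sum(p * np.log2(p))

# Estimated mutual information of each variable with the target (nats -> bits).
mi_bits = mutual_info_classif(X, y, random_state=0) / np.log(2)

best = np.argsort(mi_bits)[::-1][:5]
print("target entropy (bits):", round(target_entropy, 3))
print("top variables and their estimated share of that information:")
for i in best:
    print(data.feature_names[i], round(mi_bits[i] / target_entropy, 2))
```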

The ability to measure information contained in data has other important advantages.

By analyzing the data from this new perspective, an information map is generated that makes the prior preparation of the data unnecessary; that preparation is otherwise an absolutely essential task if good results are desired, but one that takes an enormous amount of time.

It is possible to select an optimal group of variables that contains the necessary information to carry out a prediction model.

Once the variables are processed in order to create the information map, and those that provide the most information are then selected, the choice of the tool used to create the model ceases to be important, since most of the work has already been done in the previous steps.

Trends

Data mining has undergone transformations in recent years in accordance with technological changes, marketing strategies, the spread of online purchase models, etc. The most important of these are:

  • The importance of informal data (text, Internet pages, etc.).
  • The need to integrate the algorithms and results obtained into operational systems, Internet portals, etc.
  • The demand that processes work practically online (e.g., in cases of credit card fraud).
  • Response times. The large volume of data to be processed in order to obtain a valid model is, in many cases, a drawback; it involves large amounts of processing time, and there are problems that require a real-time response.

Detractors

Predictive problems, whose goal is to predict the value of a particular attribute based on the values of other attributes. The attribute that is predicted is commonly called the target attribute (or dependent variable), while the attributes used for the prediction are known as explanatory attributes (or independent variables). Classification and value-estimation problems stand out here, and among the techniques we can highlight approaches based on statistics, regression, decision trees and neural networks.

Descriptive problems, whose objective is to derive patterns (correlations, trends, groupings or clusters, trajectories and anomalies) that summarize the characteristics inherent in the data.

Software Tools

There are many software tools, both free and commercial, for developing data mining models, such as:

  • RapidMiner
  • KNIME
  • Neural Designer
  • OpenNN
  • Orange
  • R
  • SPSS Modeler
  • SAS
  • STATISTICA
  • Weka
