Data mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. [1] It is an essential process where intelligent methods are applied to extract data patterns. [1] [2] It is an interdisciplinary subfield of computer science. [1] [3] [4] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. [1] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. [1] Data mining is the analysis step of the “knowledge discovery in databases” process, or KDD. [5]

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. [6] It is also a buzzword [7] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java [8] (which covers mostly machine learning material) was originally to be named simply Practical machine learning, and the term data mining was only added for marketing reasons. [9] Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.


In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term “data mining” was used in a similarly critical way by economist Michael Lovell in an article published in the Review of Economic Studies in 1983. Lovell indicates that the practice “masquerades under a multitude of aliases, ranging from ‘experimentation’ (positive) to ‘fishing’ or ‘snooping’ (negative)”. [10]

The term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in the 1980s, the phrase “database mining”™ was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation, [11] researchers consequently turned to data mining. Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term “knowledge discovery in databases” for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities. [12] Currently, the terms data mining and knowledge discovery are used interchangeably.

In the academic community, the major forums for research started in 1995 when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was held in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the journal Data Mining and Knowledge Discovery, published by Kluwer, as its founding editor-in-chief. Later he started the SIGKDD Newsletter SIGKDD Explorations. [13] The KDD International conference became the primary highest-quality conference in data mining with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes’ theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct “hands-on” data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns [14] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

There are, however, many variations on this theme, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.
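The simplified three-stage process can be sketched as a minimal pipeline. This is purely illustrative: the function names, the toy records, and the trivial "mining" step (finding the most common value per field) are our own, not part of any standard.

```python
# Minimal sketch of the simplified process: (1) pre-processing,
# (2) data mining, (3) results validation. All names are illustrative.
from collections import Counter

def preprocess(records):
    """Pre-processing: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def mine(records):
    """A trivial 'mining' step: the most common value per field."""
    fields = records[0].keys()
    return {f: Counter(r[f] for r in records).most_common(1)[0][0]
            for f in fields}

def validate(patterns, holdout):
    """Results validation: how often the mined values occur in unseen data."""
    hits = sum(1 for r in holdout
               for f, v in patterns.items() if r.get(f) == v)
    return hits / (len(holdout) * len(patterns))

data = [{"city": "A", "buys": "milk"},
        {"city": "A", "buys": None},
        {"city": "A", "buys": "milk"},
        {"city": "B", "buys": "milk"}]
clean = preprocess(data)
patterns = mine(clean)
print(patterns)  # → {'city': 'A', 'buys': 'milk'}
```

Each stage feeds the next, which is the essential point of the staged process models above: mining is only one step, bracketed by preparation before it and validation after it.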

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. [15] The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, [16] [17] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008. [18]

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
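As a rough illustration of this cleaning step (a sketch, not any particular tool's API; the readings and plausibility bounds are invented), observations with missing data can be dropped and out-of-range noise filtered before mining begins:

```python
# Sketch of data cleaning: drop observations with missing data, then drop
# values outside a plausible range (a simple form of noise removal).
def clean(values, lo, hi):
    present = [v for v in values if v is not None]  # remove missing data
    return [v for v in present if lo <= v <= hi]    # remove noisy outliers

readings = [10.1, 10.3, None, 9.8, 10.0, 999.0, 10.2]
print(clean(readings, 0.0, 100.0))  # → [10.1, 10.3, 9.8, 10.0, 10.2]
```

Real pre-processing tools use more principled outlier criteria than a fixed range, but the effect is the same: only observations plausibly belonging to the target set reach the mining step.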

Data mining

Data mining involves six common classes of tasks: [5]

  • Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
  • Association rule learning (dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
  • Clustering – is the task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data.
  • Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or as “spam”.
  • Regression – attempts to find a function that models the data with the least error, that is, for estimating the relationships among data or datasets.
  • Summarization – providing a more compact representation of the data set, including visualization and report generation.
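The supermarket example under association rule learning can be made concrete with a toy market-basket count. The baskets below are invented; counting how often each pair of products is bought together yields the raw support counts on which association rules are built:

```python
# Toy market-basket analysis: count pair co-occurrences across baskets.
# These support counts are the raw ingredient of association rule learning.
from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"beer", "bread"},
           {"bread", "butter", "beer"}]

pair_support = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_support[pair] += 1

# The most frequently co-purchased pair:
print(pair_support.most_common(1))  # → [(('bread', 'butter'), 3)]
```

A full association-rule algorithm such as Apriori additionally prunes infrequent itemsets and derives rules with confidence scores, but this pair counting shows the kind of pattern being sought.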

Results validation

Data mining can unintentionally be misused, and can then produce results which appear to be significant, but which do not actually predict future behaviour, cannot be reproduced on a new sample of data, and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split – when applicable at all – may not be sufficient to prevent this from happening. [19]

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish “spam” from “legitimate” e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
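The holdout evaluation just described can be sketched with a deliberately naive keyword-based spam rule. Everything here is invented for illustration: the "training" merely collects words seen only in spam, and the e-mails are toy strings; the point is that accuracy is measured on e-mails the rule never saw.

```python
# Holdout evaluation: 'train' a naive spam rule on one set of e-mails,
# then measure its accuracy on a test set it has never seen.
def train(examples):
    """Collect words that appear only in spam e-mails of the training set."""
    spam_words = {w for text, label in examples if label == "spam"
                  for w in text.split()}
    ham_words = {w for text, label in examples if label == "legitimate"
                 for w in text.split()}
    return spam_words - ham_words

def predict(spam_markers, text):
    return "spam" if spam_markers & set(text.split()) else "legitimate"

train_set = [("win free prize now", "spam"),
             ("meeting agenda attached", "legitimate"),
             ("free offer just for you", "spam")]
test_set = [("claim your free prize", "spam"),
            ("lunch meeting tomorrow", "legitimate")]

markers = train(train_set)
accuracy = sum(predict(markers, text) == label
               for text, label in test_set) / len(test_set)
print(accuracy)  # → 1.0
```

On a realistically large corpus such a rule would overfit badly to the quirks of its training e-mails, which is exactly why the accuracy must be read off the held-out test set rather than the training set.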

If the learned patterns do not meet the desired standards, it is subsequently necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

The premier professional body in the field is the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). [20] [21] Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings, [22] and since 1999 it has published a biannual academic journal titled “SIGKDD Explorations”. [23]

Computer science conferences on data mining include:

There have been some efforts to define standards for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG. [24]

Data mining is used wherever there is digital data available today. Notable examples of data mining can be found across business, medicine, science, and surveillance.

While the term “data mining” itself may have no ethical implications, it is often associated with the mining of information in relation to people’s behavior (ethical and otherwise). [25]

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. [26] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns. [27] [28]

Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). [29] This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual’s privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous. [30] [31] [32]

It is recommended that an individual is made aware of the following before data are collected: [29]

  • the purpose of the data collection and any (known) data mining projects,
  • how the data will be used,
  • who will be able to mine the data and use the data and their derivatives,
  • the status of security surrounding access to the data,
  • how collected data can be updated.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified. [29] However, even “de-identified”/“anonymized” data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL. [33]

The inadvertent revelation of personally identifiable information by the provider violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the indicated individual. In one instance of a privacy breach, the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies who in turn provided the data to pharmaceutical companies. [34]

Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.–E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden’s global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed. [citation needed]

Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their “informed consent” regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, “‘[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,’ says the AAHC. More importantly, the rule’s goal of protection through informed consent is approaching a level of incomprehensibility to average individuals.” [35] This underscores the necessity for data anonymity in data aggregation and mining practices.

U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.

Situation in Europe

Due to a lack of flexibilities in European copyright and database law, the mining of in-copyright works (such as web mining) without the permission of the copyright holder is not legal. Where a database is pure data in Europe, there is likely to be no copyright, but database rights may exist, so data mining becomes subject to regulations under the Database Directive. On the recommendation of the Hargreaves review, this led the UK government to amend its copyright law in 2014 [36] to permit content mining as a limitation and exception. The UK was only the second country in the world to do so, after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Copyright Directive, the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not permit this provision to be overridden by contractual terms and conditions. The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe. [37] The focus on a solution to this legal issue being licences rather than limitations and exceptions led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013. [38]

Situation in the United States

By contrast to Europe, the flexible nature of US copyright law, and in particular fair use, means that content mining in America, as well as in other fair use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As content mining is transformative, that is, it does not supplant the original work, it is viewed as lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google’s digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitisation project displayed – one being text and data mining. [39]

Free open-source data mining software and applications

The following applications are available under free/open source licenses. Public access to application source code is also available.

  • Carrot2: Text and search results clustering framework.
  • A chemical structure miner and web search engine.
  • DataMelt (DMelt): A framework written in the Java language with support for scripting languages.
  • ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
  • GATE: a natural language processing and language engineering tool.
  • KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
  • Massive Online Analysis (MOA): a real-time big data stream mining with concept drift tool in the Java programming language.
  • MEPX: cross-platform tool for regression and classification problems based on a Genetic Programming variant.
  • ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
  • MLPACK library: a collection of ready-to-use machine learning algorithms written in the C++ language.
  • NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
  • OpenNN: Open neural networks library.
  • Orange: A component-based data mining and machine learning software suite written in the Python language.
  • R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
  • scikit-learn: an open-source machine learning library for the Python programming language.
  • Torch: An open-source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
  • UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM.
  • Weka: A suite of machine learning software applications written in the Java programming language.

Proprietary data-mining software and applications

The following applications are available under proprietary licenses.

  • Angoss KnowledgeSTUDIO: data mining tool.
  • Clarabridge: text analytics product.
  • KXEN Modeler: data mining tool provided by KXEN Inc.
  • LIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
  • Megaputer Intelligence: data and text mining software is called PolyAnalyst.
  • Microsoft Analysis Services: data mining software provided by Microsoft.
  • NetOwl: suite of multilingual text and entity analytics products that enable data mining.
  • OpenText Big Data Analytics: Visual Data Mining & Predictive Analysis by Open Text Corporation.
  • Oracle Data Mining: data mining software by Oracle Corporation.
  • PSeven: platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by DATADVANCE.
  • Qlucore Omics Explorer: data mining software.
  • RapidMiner: An environment for machine learning and data mining experiments.
  • SAS Enterprise Miner: data mining software provided by the SAS Institute.
  • SPSS Modeler: data mining software provided by IBM.
  • STATISTICA Data Miner: data mining software provided by StatSoft.
  • Tanagra: Visualisation-oriented data mining software, also for teaching.
  • Vertica: data mining software provided by Hewlett-Packard.

Marketplace surveys

Several researchers and organizations have conducted reviews of data mining tools and surveys of data miners. These identify some of the strengths and weaknesses of the software packages. They also provide an overview of the behaviors, preferences and views of data miners. Some of these reports include:

