
 

TOPICS IN DATA SCIENCE

CP-8210


 

FINAL REPORT

STRUCTURED AND UNSTRUCTURED DATA

 

 

Submitted to :- Abdolreza Abhari

 

 

 

 

Submitted by :- Gurpreet Singh

Student Number:-  500802475

 

DATE 21/Dec/2017

Introduction

 

Data mining is a process that companies use to turn raw data into useful information. With the help of data mining, companies can look for patterns in their data, understand their customers better, and develop more effective strategies, which can in turn increase sales and lower prices.

 

In data mining, the data is stored electronically and the search is carried out automatically by computer. The idea is not new: statisticians and engineers have long worked on the premise that patterns in data can be found automatically, validated, and used for prediction. The amount of data held in databases is growing rapidly, by some estimates almost doubling every 20 months, although such a figure is difficult to justify in any quantitative sense. As the world grows in complexity and generates ever more data, the opportunities for data mining will increase, and data mining is the only hope for elucidating the hidden patterns. Intelligently analysed data is a very valuable resource that can lead to new insights and, in turn, to various advantages.

 

Data mining is about solving problems by analysing data that is already present in databases. Consider, for instance, the problem of customer loyalty in a highly competitive market. The key to this problem is a database of customer choices together with customer profiles. The behaviour patterns of former customers can be analysed to distinguish the characteristics of those who remain loyal from those who change products. Once such characteristics are found, they can be used to identify present customers who are likely to jump ship. That group can then be targeted for special treatment. The same technique can be used to identify customers who might be attracted to other services. So, in today's competitive world, data is the raw material that can fuel the growth of any business, but only if it is mined.

 

 

 

 

 

And how are the patterns expressed?

Useful patterns allow nontrivial predictions to be made on new data. There are two ways to express a pattern: as a black box whose inner workings are incomprehensible, or as a transparent box whose construction reveals the structure of the pattern. Both, we assume, can make good predictions. The difference between them is whether or not the mined patterns are represented as a structure that can be examined and used to inform future decisions. Such patterns are called structural because they capture the decision structure in an explicit way. In other words, they help to explain something about the data.

 

 

Data Mining

 

The techniques used for learning, where "learning" is understood in a practical sense that raises no conceptual problems, are known as machine learning. Data mining involves learning in a practical rather than a theoretical sense. We are interested in techniques for finding structural patterns in data and for making predictions from it. The knowledge is extracted from the data, which takes the form of examples such as clients who have switched loyalties.

The output may be a prediction of whether a particular customer will switch loyalty under given circumstances, but it may also include an explicit description of a structure that can be used to classify unknown examples.

In addition, it is useful to supply an explicit representation of the knowledge that is acquired. In essence, this reflects two senses of learning: the acquisition of knowledge and the ability to use it. Many learning techniques look for structural descriptions of what is learned. These descriptions can become fairly complex and are typically expressed as sets of rules or as decision trees. Because people can understand them, these descriptions serve to explain what has been learned, in other words, to explain the basis for new predictions.

 

 

Experience shows that in many applications of data mining, the explicit knowledge structures that are acquired, the structural descriptions, are at least as important as the ability to perform well on new instances. People frequently use data mining to gain knowledge, not just predictions. Gaining knowledge from the available data certainly sounds like a good idea.

 

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:

Descriptive
Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list of descriptive functions:

Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters

Class/Concept Description

Class/concept description refers to the data associated with classes or concepts. For instance, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:

· Data Characterization: This means summarizing the data of the class under study. The class under study is called the target class.

· Data Discrimination: This refers to comparing the general features of the target class with those of one or more predefined contrasting groups or classes.

 

 

Mining of Frequent Patterns

Frequent patterns are patterns that occur frequently in transactional data. The kinds of frequent patterns are:

· Frequent Item Set: a set of items that frequently appear together, for example, milk and bread.

· Frequent Subsequence: a sequence of patterns that occurs frequently, for example, the purchase of a camera followed by the purchase of a memory card.

· Frequent Substructure: substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

 

 

 

Mining of Association

Associations are used in retail sales to identify items that are frequently purchased together. Association mining is the process of uncovering relationships among data and determining association rules.

For instance, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
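The percentages in such a rule correspond to what is usually called the confidence of the rule. As a minimal sketch, assuming a small hypothetical list of shopping baskets (not data from this report), support and confidence could be computed like this:

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets, used only for illustration.
transactions: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "biscuits"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "biscuits"}),
    frozenset({"milk"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(frozenset(antecedent) | frozenset(consequent), baskets)
    return both / base


print("bread -> milk:", confidence({"bread"}, {"milk"}, transactions))
print("bread -> biscuits:", confidence({"bread"}, {"biscuits"}, transactions))
```

A rule is normally reported only when both its support and its confidence exceed thresholds chosen by the analyst.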

 

Mining of Correlations

This is an additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, in order to determine whether they have a positive, negative, or no effect on each other.
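One common way to quantify such an effect between two items is the lift measure, which is swapped in here as an illustration; the report itself does not name a specific measure. The sketch below assumes hypothetical baskets: a lift above 1 suggests a positive effect, below 1 a negative effect, and exactly 1 no effect.

```python
from typing import List, Set

# Hypothetical baskets, used only for illustration.
baskets: List[Set[str]] = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread", "butter", "tea"},
    {"tea"},
    {"tea", "coffee"},
    {"coffee"},
]


def lift(a: str, b: str, data: List[Set[str]]) -> float:
    """lift(a, b) = P(a and b) / (P(a) * P(b)), estimated from basket counts."""
    n = len(data)
    count_a = sum(1 for basket in data if a in basket)
    count_b = sum(1 for basket in data if b in basket)
    count_ab = sum(1 for basket in data if a in basket and b in basket)
    if count_a == 0 or count_b == 0:
        raise ValueError("both items must occur at least once")
    return (count_ab * n) / (count_a * count_b)


def describe(value: float) -> str:
    """Translate a lift value into the three cases named in the text."""
    if value > 1:
        return "positive"
    if value < 1:
        return "negative"
    return "none"


for pair in (("bread", "butter"), ("tea", "coffee"), ("bread", "coffee")):
    value = lift(pair[0], pair[1], baskets)
    print(pair, value, describe(value))
```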

 

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
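As an illustration of cluster analysis, the following is a minimal sketch of the k-means algorithm, one of many clustering methods and not one the report prescribes. The points and starting centroids are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_means(points: Sequence[Point], centroids: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[Point], List[List[Point]]]:
    """Group points around the nearest centroid, then move each centroid to
    the mean of its group, until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    clusters: List[List[Point]] = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Coordinate-wise mean of the members.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep a centroid that attracted no points
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centres, groups = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
for centre, group in zip(centres, groups):
    print(centre, group)
```

The result depends on the starting centroids, which is why they are passed in explicitly here instead of being chosen at random.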

 

 

 

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data. It can be presented in the following forms:

Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks

The functions involved in these processes are as follows:

· Classification: It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.

· Prediction: It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
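Of the model forms listed above, classification (IF-THEN) rules are the easiest to show directly. The sketch below is only an illustration: the attribute names (`income`, `has_default`), the rules, and the default label are all hypothetical, and in practice the rules would be learned from training data, not written by hand.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Tuple[Callable[[Record], bool], str]

# Hypothetical rules, checked in order: IF condition THEN label.
RULES: List[Rule] = [
    (lambda r: r["income"] == "low" and bool(r["has_default"]), "risky"),
    (lambda r: r["income"] == "high", "safe"),
    (lambda r: r["income"] == "medium" and not r["has_default"], "safe"),
]
DEFAULT_LABEL = "risky"  # assumed fallback when no rule fires


def classify(record: Record, rules: List[Rule] = RULES,
             default: str = DEFAULT_LABEL) -> str:
    """Return the label of the first rule whose condition holds."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


print(classify({"income": "high", "has_default": False}))
print(classify({"income": "low", "has_default": True}))
```

Because each rule can be read on its own, a model in this form also explains why a given object received its label.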

Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query, which is input to the system. A data mining query is defined in terms of data mining task primitives.

Note: These primitives allow us to communicate with the data mining system in an interactive manner. Here is the list of data mining task primitives:

Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.

 

1. Set of task relevant data to be mined

This is the portion of the database in which the user is interested. This portion includes the following:

Database Attributes
Data Warehouse dimensions of interest

2. Kind of knowledge to be mined

It refers to the kind of functions to be performed. These functions are:

Clustering
Discrimination
Classification
Characterization
Prediction
Evolution Analysis

 

3. Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

4. Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns discovered by the knowledge discovery process. There are different interestingness measures for different kinds of knowledge.

5. Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following:

Rules
Decision Trees
Tables
Graphs
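To make the five primitives concrete, they can be pictured as the fields of a single query object. The sketch below is purely illustrative: the class `DataMiningQuery`, its field names, and the example values are hypothetical and do not correspond to any real data mining system or query language.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Values taken from the lists in the text.
KNOWLEDGE_KINDS = {"clustering", "discrimination", "classification",
                   "characterization", "prediction", "evolution analysis"}
PRESENTATIONS = {"rules", "decision trees", "tables", "graphs"}


@dataclass
class DataMiningQuery:
    """Hypothetical container with one field per task primitive."""
    relevant_attributes: List[str]                 # 1. task relevant data
    knowledge_kind: str                            # 2. kind of knowledge
    concept_hierarchies: Dict[str, List[str]] = field(default_factory=dict)  # 3. background knowledge
    thresholds: Dict[str, float] = field(default_factory=dict)               # 4. interestingness thresholds
    presentation: str = "rules"                    # 5. representation of results

    def validate(self) -> None:
        """Raise ValueError if a primitive has a value outside the lists above."""
        if not self.relevant_attributes:
            raise ValueError("at least one relevant attribute is required")
        if self.knowledge_kind.lower() not in KNOWLEDGE_KINDS:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind}")
        if self.presentation.lower() not in PRESENTATIONS:
            raise ValueError(f"unknown presentation: {self.presentation}")


query = DataMiningQuery(
    relevant_attributes=["age", "income", "item_bought"],
    knowledge_kind="classification",
    concept_hierarchies={"location": ["street", "city", "country"]},
    thresholds={"min_support": 0.05, "min_confidence": 0.7},
    presentation="decision trees",
)
query.validate()
print(query)
```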

 

 

 

There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:

Classification
Prediction

Classification models predict categorical class labels, and prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

What is classification?

The following are examples of cases where the data analysis task is classification:

· A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

· A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

The following is an example of a case where the data analysis task is prediction:

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.

Note: Regression analysis is a statistical methodology that is most often used for numeric prediction.
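As a minimal sketch of numeric prediction by regression, the code below fits a straight line by ordinary least squares. The income and spending figures are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x: float, slope: float, intercept: float) -> float:
    """Numeric prediction for a new customer."""
    return slope * x + intercept


# Hypothetical training data: income in thousands, spending during a sale in dollars.
incomes = [20.0, 40.0, 60.0, 80.0]
spending = [300.0, 500.0, 700.0, 900.0]

slope, intercept = fit_line(incomes, spending)
print("slope:", slope, "intercept:", intercept)
print("predicted spending at income 50:", predict(50.0, slope, intercept))
```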

How Does Classification Work?

Let us use the bank loan application discussed above to understand how classification works. The data classification process includes two steps:

Building the Classifier or Model
Using Classifier for Classification

Building the Classifier or Model

· This step is the learning step or the learning phase.

· In this step the classification algorithms build the classifier.

· The classifier is built from the training set made up of database tuples and their associated class labels.

· Each tuple in the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the classification rules can be applied to new data tuples.
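The two steps can be sketched as follows. The classifier here is deliberately trivial (it predicts the most common label seen for each income level), and the training and test tuples are hypothetical; the point is only to show a learning step followed by an accuracy estimate on separate test data.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

LabeledTuple = Tuple[str, str]  # (income level, class label)


def build_classifier(training_set: List[LabeledTuple]) -> Callable[[str], str]:
    """Step 1 (learning): remember the most common label for each income level."""
    votes = defaultdict(Counter)
    for income, label in training_set:
        votes[income][label] += 1
    overall = Counter(label for _, label in training_set).most_common(1)[0][0]
    table = {income: counts.most_common(1)[0][0] for income, counts in votes.items()}
    # Unseen income levels fall back to the overall majority label.
    return lambda income: table.get(income, overall)


def accuracy(classifier: Callable[[str], str], test_set: List[LabeledTuple]) -> float:
    """Step 2 (evaluation): fraction of test tuples labelled correctly."""
    correct = sum(1 for income, label in test_set if classifier(income) == label)
    return correct / len(test_set)


# Hypothetical loan data; the test tuples are kept separate from the training tuples.
training = [("low", "risky"), ("low", "risky"), ("low", "safe"),
            ("medium", "safe"), ("medium", "safe"),
            ("high", "safe"), ("high", "safe")]
test = [("low", "risky"), ("medium", "safe"), ("high", "safe"), ("low", "safe")]

classifier = build_classifier(training)
estimated = accuracy(classifier, test)
print("estimated accuracy:", estimated)
```

Only if the estimated accuracy is acceptable would the classifier be applied to new loan applicants.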

Classification and Prediction Issues

The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:

1. Data Cleaning

2. Relevance Analysis

3. Data Transformation and Reduction: normalization and generalization

Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
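Two of the preparation steps named above, normalization and binning, are simple enough to sketch directly. The code assumes min-max normalization and equal-width bins, which are common choices but not the only ones, and the income values are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float], new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # all values identical
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Replace each value by the index of the equal-width interval it falls into."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


incomes = [20000.0, 35000.0, 50000.0, 80000.0]  # hypothetical attribute values
print(min_max_normalize(incomes))
print(equal_width_bins(incomes, n_bins=3))
```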

 

 

 

 

 

 

 

 

 

 

Issues

Data mining is not a simple task: the algorithms used can get very complex, and the data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. In this section, we discuss the major issues regarding:

Mining Methodology and User Interaction
Performance
Diverse Data Types

The following diagram describes the major issues.

 

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

•        Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.

•        Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

 

 

Performance Issues

There can be performance-related issues such as the following:

• Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel. The results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
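The partition, process, and merge pattern described above can be sketched with item counting as the mining task. This is an illustration under simplifying assumptions: the transactions are hypothetical, and a thread pool stands in for the separate processes or machines that a real parallel or distributed system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def count_items(partition: Sequence[Sequence[str]]) -> Counter:
    """Mine one partition: count in how many transactions each item appears."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def split(transactions: List[List[str]], n_parts: int) -> List[List[List[str]]]:
    """Divide the data into n_parts partitions (round-robin)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def parallel_count(transactions: List[List[str]], n_parts: int = 2) -> Counter:
    """Process the partitions concurrently, then merge the partial results."""
    partitions = split(transactions, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


def incremental_update(total: Counter, new_transactions: List[List[str]]) -> Counter:
    """Fold newly arrived transactions into existing counts without recounting old data."""
    total.update(count_items(new_transactions))
    return total


data = [["milk", "bread"], ["bread"], ["milk", "tea"], ["bread", "tea"], ["milk"]]
totals = parallel_count(data, n_parts=2)
print(totals)
totals = incremental_update(totals, [["tea"], ["milk", "bread"]])
print(totals)
```

Counting works well here because partial counts can simply be added; mining tasks whose results do not combine this easily need more elaborate merge steps.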

 

Diverse Data Types Issues

· Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

· Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.

 

 

 

 

 

Applications

Data Mining Applications in Sales/Marketing

Data mining helps businesses better understand the hidden patterns inside historical purchasing transaction data, which enables new marketing campaigns to be launched in a cost-efficient way. The data mining applications are as follows:

Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought, and in what sequence. This information helps businesses promote their most profitable products and maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

Retail companies use data mining to identify customers' buying patterns.

 

Data Mining Applications in Banking / Finance

Data mining techniques are used to help detect credit card fraud.

Customer loyalty is identified with data mining techniques, i.e. by analysing customers' purchasing activities: for example, the frequency of purchases in a given period, the total monetary value of all purchases, and the date of the last purchase. After analysing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is relative to others.

Credit card spending by customers can be identified using data mining.

Data mining also helps in identifying stock trading rules from historical data.
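The loyalty measure described above, based on how recently, how often, and how much a customer buys, can be sketched as a simple rank-based score. The scoring scheme is an assumption made for illustration (the report does not specify one), and the customers and purchases are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Purchase:
    day: date
    amount: float


def rfm_metrics(purchases: List[Purchase], today: date) -> Tuple[int, int, float]:
    """Return (days since last purchase, number of purchases, total value)."""
    recency = min((today - p.day).days for p in purchases)
    frequency = len(purchases)
    monetary = sum(p.amount for p in purchases)
    return recency, frequency, monetary


def loyalty_scores(customers: Dict[str, List[Purchase]], today: date) -> Dict[str, int]:
    """Assumed scheme: on each dimension the worst customer gets rank 0 and the
    best gets rank n-1; a customer's score is the sum of the three ranks.
    Ties are broken arbitrarily by sort order."""
    metrics = {name: rfm_metrics(ps, today) for name, ps in customers.items()}
    scores = {name: 0 for name in metrics}
    # Index 0 is recency (lower is better); 1 and 2 are frequency and value (higher is better).
    for index, higher_is_better in ((0, False), (1, True), (2, True)):
        worst_first = sorted(metrics, key=lambda n: metrics[n][index],
                             reverse=not higher_is_better)
        for rank, name in enumerate(worst_first):
            scores[name] += rank
    return scores


today = date(2017, 12, 21)
customers = {
    "alice": [Purchase(date(2017, 12, 1), 50.0),
              Purchase(date(2017, 11, 10), 70.0),
              Purchase(date(2017, 10, 5), 30.0)],
    "bob": [Purchase(date(2017, 3, 15), 200.0)],
}
scores = loyalty_scores(customers, today)
print(scores)
```

The score is relative: it only orders the customers in the given group and has no meaning in isolation.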

 

 

 

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge or information about customers, competitors, and its markets. Data mining has only recently been applied in the insurance industry, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are as follows:

•            Data mining is applied in claims analysis, for example to identify which medical procedures are claimed together.

•            Data mining enables forecasting of which customers will potentially purchase new policies.

•            Data mining allows insurance companies to detect risky customers' behaviour patterns.

•            Data mining helps detect fraudulent behaviour.