TOPICS IN DATA SCIENCE
CP-8210 FINAL REPORT
STRUCTURED AND UNSTRUCTURED DATA

Submitted to :- Abdolreza Abhari
Submitted by :- Gurpreet Singh
Student Number :- 500802475
DATE 21/Dec/2017

Introduction

Data mining is a process that companies use to turn raw data into useful information. With its help, companies can look for patterns, understand their customers better, and develop more effective strategies, which in turn increase sales and decrease costs. In data mining the data is stored electronically and the search is carried out automatically by computer. The idea is not new: statisticians and engineers have long worked on the premise that patterns in data can be found automatically, validated, and used for prediction. The amount of data held in databases is estimated to almost double every 20 months, although such a figure is very difficult to justify in any quantitative sense. The opportunities for data mining will certainly increase: as the world grows in complexity, so does the data it generates, and data mining is the only hope for elucidating the hidden patterns. Intelligently analysed data is a very valuable resource that can lead to new insights and, further, to various advantages.
Data mining is all about solving problems by analysing data that is already present in databases. Consider, for instance, the problem of customer loyalty in a highly competitive market. The key to this problem is a database of customer choices together with customer profiles.
The behaviour patterns of former customers can be used to analyse the characteristics of those who remain loyal and those who change products. Once these characteristics are known, present customers who are willing to jump ship can easily be identified. Those groups can then be targeted with special treatment. The same technique can be used to identify customers who are likely to be attracted to other services. So, in today's competitive world, data is the raw material that can fuel the growth of any business, but only if it is mined.
And how are the patterns expressed? Useful patterns allow nontrivial predictions to be made on new data. There are two ways to express a pattern: as a black box whose innards are incomprehensible, or as a transparent box whose construction reveals the structure of the pattern. Both, we assume, can make good predictions. The difference between them is whether or not the mined patterns are represented in terms of a structure that can be used to inform future decisions. Patterns of this kind are called structural because they capture the decision structure in an explicit way.
They essentially help to explain something about the data.

Data Mining

Learning techniques that are practical and do not raise conceptual problems are known as machine learning. Data mining involves learning in a practical rather than a theoretical sense. We are interested in techniques for finding structural patterns in data and for making predictions from it.
The information, or knowledge, is collected from the data, for example from records of clients who have switched loyalties. The output may be a prediction of whether a customer will switch loyalty under different circumstances, but it may also include an explicit description of a structure that can be used to classify unknown examples. In addition, it is useful to supply an explicit representation of the knowledge that is acquired.
In essence, this reflects the two meanings of learning: the acquisition of knowledge and the ability to use it. Many learning techniques look for structural descriptions of what is learned, descriptions that can become fairly complex and are typically expressed as sets of rules or as decision trees. Because people can understand them, these descriptions serve to explain what has been learned, in other words, to explain the basis for new predictions.
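As a minimal sketch of what a structural ("transparent box") pattern might look like, the rule set below is written out as explicit IF-THEN rules in Python. The attribute names (`complaints`, `years_as_customer`, `monthly_spend`) and the thresholds are hypothetical, chosen only for illustration and not mined from any real data. The point is that the structure that makes the prediction can also be read as its explanation.

```python
# Hypothetical IF-THEN rules for customer loyalty. The attributes and
# thresholds are illustrative assumptions, not results from real data.
RULES = [
    ("more than 2 complaints",
     lambda c: c["complaints"] > 2,
     "likely to switch"),
    ("customer for under 1 year and low spend",
     lambda c: c["years_as_customer"] < 1 and c["monthly_spend"] < 20,
     "likely to switch"),
]
DEFAULT_LABEL = "likely to stay"


def predict_with_reason(customer):
    """Return (prediction, reason); the first matching rule decides."""
    for description, condition, label in RULES:
        if condition(customer):
            return label, description
    return DEFAULT_LABEL, "no rule matched"


customer = {"complaints": 3, "years_as_customer": 5, "monthly_spend": 80}
prediction, reason = predict_with_reason(customer)
print(prediction, "because:", reason)
```

A black-box model could return the same label, but it would not be able to return the second element of the pair.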
Experience shows that in most applications of data mining, the explicit knowledge structures, that is, the structural descriptions, are at least as important as the ability to perform well on new instances. People usually use data mining to gain knowledge, not only predictions, and gaining knowledge from the available data certainly sounds like a good idea.

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:

· Descriptive
· Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. The descriptive functions are:

· Class/Concept Description
· Mining of Frequent Patterns
· Mining of Associations
· Mining of Correlations
· Mining of Clusters

Class/Concept Description

Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived in the following two ways:

· Data Characterization: summarizing the data of the class under study. This class under study is called the target class.
· Data Discrimination: mapping or classifying a class against some predefined group or class.

Mining of Frequent Patterns

Frequent patterns are patterns that occur frequently in transactional data. The kinds of frequent patterns are:

· Frequent Item Set: a set of items that frequently appear together, for example, milk and bread.
· Frequent Subsequence: a sequence of patterns that occurs frequently, for example, the purchase of a camera followed by the purchase of a memory card.
· Frequent Substructure: substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of Association

Associations are used in retail sales to identify items that are frequently purchased together.
Association mining refers to the process of uncovering relationships among data and determining association rules. For example, a retailer might generate an association rule showing that milk is sold with bread 70% of the time, while biscuits are sold with bread only 30% of the time.

Mining of Correlations

This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, in order to determine whether they have a positive, negative, or no effect on each other.

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
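The association and correlation measures described above can be sketched in a few lines of Python. The five baskets below are made up for illustration and do not reproduce the 70% and 30% figures quoted in the text. The function names `support`, `confidence`, and `lift` follow common usage, and lift is only one of several possible correlation measures. The sketch assumes the antecedent and the consequent each occur in at least one transaction.

```python
# Toy transactions; the baskets are invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "biscuits"},
    {"bread", "biscuits"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate how often the consequent is bought when the antecedent is."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Above 1: positive correlation; below 1: negative; about 1: none."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


print(f"confidence(bread -> milk)     = {confidence({'bread'}, {'milk'}, transactions):.2f}")
print(f"confidence(bread -> biscuits) = {confidence({'bread'}, {'biscuits'}, transactions):.2f}")
print(f"lift(bread -> milk)           = {lift({'bread'}, {'milk'}, transactions):.2f}")
```

Confidence corresponds to the "sold together X% of the time" statement in an association rule, while lift indicates whether the two item sets have a positive, negative, or no effect on each other.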
Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of sets of training data, and can be presented in the following forms:

· Classification (IF-THEN) Rules
· Decision Trees
· Mathematical Formulae
· Neural Networks

The functions involved in these processes are as follows:

· Classification: predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.
· Prediction: predicts missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on the available data.

Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the system.
A data mining query is defined in terms of data mining task primitives.

Note: These primitives allow us to communicate with the data mining system in an interactive manner.

The data mining task primitives are:

· Set of task-relevant data to be mined.
· Kind of knowledge to be mined.
· Background knowledge to be used in the discovery process.
· Interestingness measures and thresholds for pattern evaluation.
· Representation for visualizing the discovered patterns.

1. Set of task-relevant data to be mined

This is the portion of the database in which the user is interested.
This portion includes the following:

· Database attributes
· Data warehouse dimensions of interest

2. Kind of knowledge to be mined

This refers to the kind of functions to be performed. These functions are:

· Clustering
· Discrimination
· Classification
· Characterization
· Prediction
· Evolution Analysis

3. Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction.
For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

4. Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns found by the knowledge discovery process. Different kinds of knowledge have different interestingness measures.

5. Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following:

· Rules
· Decision Trees
· Tables
· Graphs

There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:

· Classification
· Prediction

Classification models predict categorical class labels, and prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

What is classification?

The following are examples of cases where the data analysis task is classification:

· A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
· A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

The following is an example of a case where the data analysis task is prediction. Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction.
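The contrast between the two tasks can be made concrete with a short sketch. It assumes a single input attribute (income, in thousands) and invented training values. `fit_threshold` is a deliberately simple stand-in for a classifier that outputs a categorical label, and `fit_line` is ordinary least-squares regression for numeric prediction. Neither is claimed to be the method a real bank or retailer would use.

```python
def fit_threshold(xs, labels, low_label, high_label):
    """Classifier: pick the cut-off on x with the fewest training errors."""
    best_cut, best_errors = None, len(xs) + 1
    for cut in sorted(set(xs)):
        errors = sum((high_label if x >= cut else low_label) != label
                     for x, label in zip(xs, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def fit_line(xs, ys):
    """Predictor: ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


# Invented training data: income in thousands, loan label, sale spending.
incomes = [20, 35, 50, 65, 80]
loan_labels = ["risky", "risky", "safe", "safe", "safe"]
spending = [200, 350, 500, 650, 800]

cut = fit_threshold(incomes, loan_labels, "risky", "safe")
a, b = fit_line(incomes, spending)

new_income = 60
print("class label:", "safe" if new_income >= cut else "risky")  # categorical
print("predicted spend:", a + b * new_income)                    # numeric
```

The first output is one of a fixed set of labels, while the second is a number on a continuous scale, which is the distinction between classification and numeric prediction.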
For numeric prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.

Note: Regression analysis is a statistical methodology that is most often used for numeric prediction.

How Does Classification Work?

Let us use the bank loan application discussed above to understand how classification works. The data classification process includes two steps:

· Building the Classifier or Model
· Using the Classifier for Classification

Building the Classifier or Model

· This step is the learning step or the learning phase.
· In this step the classification algorithms build the classifier.
· The classifier is built from the training set, made up of database tuples and their associated class labels.
· Each tuple in the training set is assumed to belong to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules.
The classification rules can be applied to new data tuples if the accuracy is considered acceptable.

Classification and Prediction Issues

The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:

1. Data Cleaning
2. Relevance Analysis
3. Data Transformation and Reduction: Normalization and Generalization

Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.

Issues

Data mining is not an easy task, as the algorithms used can get very complex and the data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues.
This report discusses the major issues regarding:

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

This refers to the following kinds of issues:

• Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. It is therefore necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive, because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Performance Issues

There can be performance-related issues such as the following:

• Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel, and the results from the partitions are merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
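The partition, process, and merge pattern described above, together with an incremental update, can be sketched as follows. Item counting stands in for a real mining algorithm, and `ThreadPoolExecutor` stands in for genuinely parallel or distributed workers. Both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts


def mine_in_partitions(transactions, n_partitions=2):
    """Split the data, process the partitions concurrently, merge results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds the counts together
    return total


def update_incrementally(total, new_transactions):
    """Fold new transactions into existing counts without a full re-scan."""
    total.update(count_items(new_transactions))
    return total


history = [["milk", "bread"], ["bread"], ["milk", "biscuits"]]
totals = mine_in_partitions(history)
totals = update_incrementally(totals, [["bread"]])
print(totals.most_common())
```

Merging works here because counts from separate partitions can simply be added. Algorithms whose partial results cannot be combined this easily need a more careful merge step.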
Diverse Data Types Issues

· Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
· Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.

Applications

Data Mining Applications in Sales/Marketing

Data mining helps businesses better understand the hidden patterns inside historical purchasing transaction data, which enables new campaigns to be launched in the market in a cost-efficient way. The data mining applications are as follows:

· Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought, and in what sequence. This information helps businesses promote their most profitable products and maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.
· Retail companies use data mining to identify customers' buying behaviour patterns.

Data Mining Applications in Banking/Finance

· Data mining techniques are used to help detect credit card fraud.
· Data mining techniques are used to identify customer loyalty by analysing customers' purchasing activities, for example the frequency of purchases in a period of time, the total monetary value of all purchases, and the date of the last purchase. After analysing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is relative to others.
· By using data mining, credit card spending by customers can be identified.
· Data mining also helps identify stock trading rules from historical data.
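The loyalty measure described above (frequency of purchases, total monetary value, and date of the last purchase) can be sketched as a simple scoring function. The point thresholds and the equal weighting of the three measures are hypothetical choices made for illustration. A real scheme would normally derive them from the distribution of values across all customers.

```python
from datetime import date


def loyalty_measures(purchases, today):
    """purchases: list of (purchase_date, amount) pairs for one customer."""
    last_purchase = max(d for d, _ in purchases)
    recency_days = (today - last_purchase).days
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    return recency_days, frequency, monetary


def loyalty_score(recency_days, frequency, monetary):
    """Hypothetical scoring: 1 to 3 points per measure, higher is more loyal."""
    r = 3 if recency_days <= 30 else 2 if recency_days <= 90 else 1
    f = 3 if frequency >= 10 else 2 if frequency >= 3 else 1
    m = 3 if monetary >= 1000 else 2 if monetary >= 200 else 1
    return r + f + m


purchases = [
    (date(2017, 11, 1), 120.0),
    (date(2017, 12, 10), 80.0),
    (date(2017, 12, 18), 60.0),
]
measures = loyalty_measures(purchases, today=date(2017, 12, 21))
print(measures, loyalty_score(*measures))
```

Customers can then be ranked by score, so that the higher the score, the more loyal the customer is considered relative to the others.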
Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information, or intelligence about customers, competitors, and its markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are as follows:

• Data mining is applied in claims analysis, for example to identify which medical procedures are claimed together.
• Data mining enables forecasting of which customers will potentially purchase new policies.
• Data mining allows insurance companies to detect risky customers' behaviour patterns.
• Data mining helps detect fraudulent behaviour.