Data Mining in Telecom 2009024
Prof. V Nagadevara
Quantitative Methods & Information Systems
Table of Contents
Introduction
Privacy Concerns in datamining in telecommunication industry
Customer Segmentation and Analysis using Clustering technique
Reducing Customer Churn
Customer Fraud Prevention
Network Operation and Maintenance Management
Telecommunication Software Quality
Classification of telecommunications companies
Insights for future
Conclusion
Reference
Introduction
Datamining is the process of discovering interesting patterns and knowledge from large amounts of data by applying intelligent methods to extract data patterns [26]. The telecommunications industry was one of the earliest adopters of datamining. Telecom companies (telcos) generate huge amounts of data, ranging from customer details to call patterns to network usage. This data is collected into the telecom data warehouse. A data warehouse can be described as a data set that is subject-oriented, integrated, relatively stable, and reflective of historical changes, and that is used to support management decision-making [2]. Telcos can therefore carry out datamining on the huge data sets held in their data warehouses. The basic data mining techniques include association rules, classification, clustering, regression analysis, sequence analysis, discriminant analysis, outlier analysis, neural networks, and fuzzy logic.
I have been involved in the development of products for the telecom industry for close to two decades, on products ranging from consumer devices to network products. Currently I am working on optical network management software which is used to monitor the critical backbone infrastructure. The amount of network data captured through alarms and operational logs is large. Making meaningful real-time decisions based on it requires appropriate data mining techniques.
The telecommunication industry has grown by leaps and bounds in the past few decades and generates a vast amount of data. Call detail records contain information about every call made, and billions of such records are created worldwide every month. According to the literature, AT&T long distance customers generated over 300 million call detail records per day in 2001 [1].
Such volumes of data offer a large opportunity for data mining, which is useful in marketing and fraud prevention. Highly evolved network management systems keep a pulse on the operation of the network and collect a vast amount of data in the form of alarms and network usage. This information can be mined to support better network deployment and utilization, and network planning tools benefit immensely from predictive models developed by analyzing existing data. It is therefore easy to appreciate why the telecommunication industry was one of the early adopters of datamining techniques.
The nature of the data collected from telecommunication networks poses many interesting challenges for datamining. The scale of the data, with billions of raw records in time-series form, makes analysis difficult, and the data has to be summarised into a form that is useful for analysis and for building rules and models. The events that must be predicted, such as network failure (networks have a 99.999% uptime requirement) and telephone fraud, are rare, so the datamining technique must be chosen to cope with rare events in such vast amounts of data. Further, applying the resulting models and rules in real time, as in fraud detection, demands real-time performance.
The steps of data preparation are quite important in analyzing the data generated from telecommunication networks. The basic forms of data collected are broadly classified as call detail records, network records, and customer records.
Figure 1: The general structure of data mining in the telecommunications industry (diagram labels: Decision Makers, Data Mining, Model Base, Data Warehouse, External Data, Network Records, Customer Records)
The call detail records hold descriptive information about individual calls, and their number tends to be huge. As the purpose of datamining is to extract information at the customer level and not at the level of individual calls, the call detail records are not used as they are. Instead, each customer is summarized into a single record describing his or her calling pattern. How useful this description is depends on a suitable choice of summary variables and features. Based on the calls received and made by a customer in a time frame P, Weiss [1] lists a few features which may summarise the customer:
- average call duration
- % no-answer calls
- % calls to/from a different area code
- % of weekday calls (Monday-Friday)
- % of daytime calls (9am-5pm)
- average number of calls received per day
- average number of calls originated per day
- number of unique area codes called during P
These can be used to characterise the customer as a business or a residential caller, as a telemarketer, to identify the tip-over from peak business use to residential use, etc. Sometimes these summary descriptions, also called signatures, have to be updated in real time for millions of phone lines, which means the features (summaries) must be relatively short and simple so that they can be updated quickly and efficiently.
The network records are generated by the performance and fault management reports of the thousands of components connected in a network. Nowadays the network is a heterogeneous combination of different technologies and interconnections, which adds complexity to the isolation of problems. The data collected include strings indicating the nature of the problem, the timestamp, the network elements involved, etc. Due to the critical nature of the operations, rule-based corrective actions are automatically initiated in many circumstances.
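The customer signature built from the summary features listed by Weiss [1] can be illustrated with a minimal sketch. The `Call` record layout, the field names, and the choice to average duration over answered calls only are assumptions made for this example; real call detail records differ by operator.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical, simplified call detail record (not an operator's real layout).
Call = namedtuple("Call", "start duration_s answered originated own_area other_area")

def signature(calls, days_in_period):
    """Summarise one customer's calls over period P into a single record."""
    n = len(calls)
    if n == 0:
        return None
    answered = [c for c in calls if c.answered]
    originated = [c for c in calls if c.originated]

    def pct(k):
        return 100.0 * k / n

    return {
        # Assumption: duration is averaged over answered calls only.
        "avg_call_duration_s": (sum(c.duration_s for c in answered) / len(answered)
                                if answered else 0.0),
        "pct_no_answer": pct(n - len(answered)),
        "pct_other_area_code": pct(sum(c.other_area != c.own_area for c in calls)),
        "pct_weekday": pct(sum(c.start.weekday() < 5 for c in calls)),
        "pct_daytime": pct(sum(9 <= c.start.hour < 17 for c in calls)),
        "avg_received_per_day": (n - len(originated)) / days_in_period,
        "avg_originated_per_day": len(originated) / days_in_period,
        "unique_area_codes_called": len({c.other_area for c in originated}),
    }

calls = [
    Call(datetime(2009, 3, 2, 10, 15), 120, True, True, "080", "080"),  # Monday, daytime
    Call(datetime(2009, 3, 3, 20, 5), 0, False, True, "080", "044"),    # Tuesday, evening
    Call(datetime(2009, 3, 7, 11, 0), 300, True, False, "080", "022"),  # Saturday, received
    Call(datetime(2009, 3, 8, 18, 30), 60, True, True, "080", "044"),   # Sunday, evening
]
sig = signature(calls, days_in_period=7)
```

Because every feature is a simple count or ratio, a signature of this kind could in principle be maintained incrementally as new calls arrive rather than recomputed from scratch.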
Datamining techniques can be used to identify faults by automatically extracting knowledge from the network data. This involves real-time data streams that have to be summarised appropriately for effective datamining-based predictive and/or corrective action.
Customer records maintained by the telcos hold many attributes of the customer, from name and address to plan and average usage. Demographic and econometric data also form part of the customer record; used in conjunction with the call detail records, they help form rules for cross-selling and up-selling opportunities and for fraud detection.
Datamining applications for telcos include subscription fraud detection and superimposition fraud detection, using deviation methods and outlier analysis on real-time data, call detail records, and customer data. For example, a fraud rule might flag a customer whose international calls average five minutes but who has a call running for more than an hour, since this suggests that he or she is a victim of cloning fraud. The cost of misclassification is very high in such cases, so the datamining technique and the rules developed have to be sensitive to the cost of letting a suspected fraudulent call go through versus blocking it. The rarity of fraud makes it harder for datamining techniques to evolve robust rules. In such cases, the training dataset must be oversampled to compensate for the rarity of frauds and reduce the skew.
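One simple way to oversample a training set is to duplicate randomly chosen fraud records until the two classes are the same size. The sketch below shows this plain random oversampling; the function name, the toy data, and the 50/50 target ratio are assumptions for illustration, and other resampling schemes may suit real fraud data better.

```python
import random

def oversample_minority(records, label_of, seed=0):
    """Randomly duplicate minority-class records until both classes are equal in size."""
    rng = random.Random(seed)
    fraud = [r for r in records if label_of(r) == 1]
    legit = [r for r in records if label_of(r) == 0]
    minority, majority = (fraud, legit) if len(fraud) < len(legit) else (legit, fraud)
    if not minority:
        raise ValueError("need at least one example of each class")
    # Draw with replacement from the minority class to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy data: (call duration in minutes, is_fraud); 2 fraud cases among 100 calls.
data = [(5, 0)] * 98 + [(75, 1), (90, 1)]
balanced = oversample_minority(data, label_of=lambda r: r[1])
```

Oversampling would normally be applied to the training portion only, so that the test data keeps the true fraud rate and gives a realistic picture of misclassification cost.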
Datamining is also used for marketing purposes and customer profiling. Such usage raises privacy and anti-trust concerns, which laws in certain countries address. Within those limits, telcos can use datamining techniques to devise programs that retain customers, increase customer lifetime value and profitability, and reduce churn through predictive models. Classification and clustering are the two main techniques in customer segmentation; classification, clustering, and neural-network-based techniques are all used to mine telco data for customer segmentation and related customer analysis. Association rule learning can, for example, be used on call detail data to identify pairs of customers that frequently call each other, which in turn can be used to identify so-called calling circles.
Network fault isolation and corrective action models are developed using classification techniques and correlation models built from fault and performance management records. The time-series data is summarized as a set of classified examples from which datamining develops effective predictive models. Datamining methods (Table 1) are also used for optimizing network deployment and international roaming agreements. In short, datamining can help the telcos handle the four key challenges that they face today, summarized by Pareek (2007) as the 4 Cs: consolidation, competition, commoditisation, and customer service [10].
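The idea of finding pairs of customers that frequently call each other, and grouping them into calling circles, can be sketched as follows. This is a simplified pair-counting illustration (the support-counting step of association mining restricted to pairs), not a full association rule learner; the function names, the threshold, and the toy call list are assumptions.

```python
from collections import Counter

def frequent_pairs(calls, min_calls):
    """Count calls between each unordered pair; keep pairs seen at least min_calls times."""
    counts = Counter(frozenset((a, b)) for a, b in calls if a != b)
    return {tuple(sorted(p)): n for p, n in counts.items() if n >= min_calls}

def calling_circles(pairs):
    """Group numbers linked by frequent pairs into connected components."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, circles = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, circle = [start], set()
        while stack:
            node = stack.pop()
            if node in circle:
                continue
            circle.add(node)
            stack.extend(neighbours[node] - circle)
        seen |= circle
        circles.append(sorted(circle))
    return circles

# Toy (caller, callee) records; letters stand in for phone numbers.
calls = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "C"), ("C", "B"), ("C", "B"),
         ("D", "E"), ("E", "D"), ("D", "E"), ("A", "F")]
pairs = frequent_pairs(calls, min_calls=3)
circles = calling_circles(pairs)
```

Here the single call between A and F falls below the threshold, so F is not drawn into any circle.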
Table 1
Application area: Marketing, Sales and CRM
Data mining techniques: Association; Classification; Clustering; Forecasting; Regression; Sequence discovery; Visualization
Data mining methods: Association rules; Decision trees; Genetic algorithms; Multiple factor analysis; Neural networks; K-Nearest Neighbour; Linear/logistic regression; Visualization methods
Application area: Fraud Detection
Data mining techniques: Outliers detection; Deviation detection; Statistical modeling; Dynamic clustering; Classification; Visualization for pattern recognition
Data mining methods: Anomaly detection techniques; Rule discovery; Clustering algorithms; Bayesian rules; Visualization methods for recognizing unusual patterns
Application area: Network Management
Data mining techniques: Classification; Prediction; Sequence analysis; Time-series data analysis
Data mining methods: Markov models; Neural networks; Genetic algorithms; Bayesian Belief Networks; Rough sets; Classification trees; Self-organizing maps
Privacy Preserving Association Rule Learning uses data transformation/randomization and secure multi-party distributed computation techniques to ensure privacy preservation.
Privacy Preserving Clustering Techniques use data transformation/aggregation so that the clustering performed on the distorted data is still valid. The Multidimensional Scaling (MDS) technique, in which a set of points in a high-dimensional space is transformed to a lower-dimensional one while preserving the relative distances between pairs of points, is also used for privacy preservation.
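To see why clustering on transformed data can remain valid, consider one simple distance-preserving transformation: rotating the data by a secret angle. This is only an illustrative stand-in (it is neither MDS nor a technique attributed to the sources cited here), the attribute values are hypothetical, and rotation on its own offers weak privacy protection.

```python
import math

def rotate(points, angle_deg):
    """Rotate 2-D points about the origin; rotation preserves pairwise distances."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical customer attributes: (monthly minutes, monthly bill).
original = [(120.0, 30.0), (125.0, 32.0), (600.0, 150.0), (610.0, 148.0)]
masked = rotate(original, angle_deg=37.0)  # the angle acts as the secret

# Every pairwise distance is unchanged, so any distance-based clustering
# would group the masked points exactly as it groups the originals.
for i in range(len(original)):
    for j in range(i + 1, len(original)):
        assert math.isclose(dist(original[i], original[j]), dist(masked[i], masked[j]))
```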
Frequency of the consumers: Frequency refers to the number of transactions in a certain period of time, for example, twice a month or twice a week. The greater the frequency, the bigger F is.
Monetary value of the consumers: Monetary value refers to the amount of money spent in a particular period. The greater the monetary value, the bigger M is.
According to research, bigger R and F values indicate an increased likelihood of new transactions, and a bigger M indicates an increased likelihood of repeat purchases. In the telecom space, the payment made by a customer is a very important factor in deciding the importance of the customer. Thus, in the RFM model, the M value gets the largest weight when centroids are used to cluster customers by importance for targeting.
Case Study: Qin, Zheng, and Huang applied an improved K-means algorithm to a practical dataset from a mobile communication company for customer segmentation. They obtained eight classes of customers, with payment fees as the biggest indicator of customer importance. The mobile company can then use different strategies for different classes of customers. The results of the cluster analysis are given in Appendix 1.
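A weighted RFM clustering of this kind can be sketched with plain K-means. This is not the improved K-means of Qin, Zheng, and Huang; the weights, the deterministic initialisation, and the toy customers are all assumptions chosen to show how a larger weight on M lets payment dominate the distance between customers.

```python
WEIGHTS = (0.2, 0.3, 0.5)  # hypothetical (R, F, M) weights; M is weighted highest

def scale(rows):
    """Min-max scale each column to [0, 1], then apply the RFM weights."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple(w * (v - l) / (h - l) if h > l else 0.0
                  for v, l, h, w in zip(r, lo, hi, WEIGHTS)) for r in rows]

def kmeans(points, k, iters=50):
    """Plain K-means with a simple deterministic start (an assumption of this sketch)."""
    ordered = sorted(points, key=sum)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return labels, centroids

# Toy customers as (R score, F count, M amount); bigger means better on each axis.
customers = [(1, 2, 100), (2, 3, 120), (1, 2, 110),
             (5, 20, 900), (4, 18, 950), (5, 22, 1000)]
labels, centroids = kmeans(scale(customers), k=2)
```

Ranking the resulting centroids by their weighted M component gives an ordering of clusters by customer importance.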
As per research, the churn prediction factors for telcos include not having a discounted package, incoming local and long distance calls, being on a prepaid plan, and the standard deviation of calls from other operators. Richter, Yom-Tov, and Slonim [6] attempted churn prediction by analysing social groups, using key performance indicators to predict churn even before the first churn has happened in the group. They did this using only call records, not financial or demographic details, and used regression methods for data mining.
Data understanding and data preparation are extremely important in churn prediction for telcos, as the number of records is vast and the dataset must be free of duplicate, incorrect, or incomplete records. It is also very important to identify the dependent variable for churn analysis. Logistic regression is seen as an appropriate method for churn analysis. It can be used to find similarities, in terms of the predictor values, between the observations within each class, which lends itself to classifying an observation as churn or no churn. It is crucial to balance the training dataset on the dependent variable (here, whether the customer has churned or not) to get an unbiased predictor model. After balancing, the variables which have no effect on the target variable are removed, and the remaining variables are considered significant. These are then checked for correlation with each other. For a given level of significance, the logistic regression analysis is then run to determine the reasons for customer churn.
Decision tree analysis is also used in customer churn analysis. Analysis done by Gürsoy [4] using the records of a mobile operator yielded the following results.
In the logistic regression analysis, the accuracy of correctly predicting non-churners is 74.3% and the accuracy of correctly predicting churners is 66%. For the C&RT decision tree algorithm, the calculated accuracy of the model is 71.76%, which is considered very high; the probability of a wrong outcome is 28.24%.
In a study by Pan [7] on churn among broadband customers using three methods (C5.0, logistic regression, and a neural network algorithm), C5.0 had the best performance for that dataset. If the model trained by C5.0 is used as an early-warning system, about 80 percent of the potential churn customers can be found.
Oseman et al. [14] studied customer churn through data classification and construction of a decision tree using the ID3 (Iterative Dichotomiser 3) algorithm. In the decision tree, the first classification attribute that contributes to churn is the area of the subscriber, which is related to the length of service and total minutes. It was surmised that if the area is rural and the length of service is more than 20 years, subscribers are not likely to churn to other providers, and if the area is sub-urban and the total minutes that the customer spends on the line is less than 10, the subscribers are likely to churn.
Fuzzy correlation analysis can also be used to deduce the right kind of marketing for a particular group of customers to reduce churn. Using attributes like time to contract expiry and bill payment amount, the marketing method yielding the highest retention rate can be deduced [22]. Thus the marketing and customer data can be used to extract key factors of telecom churn, which helps in choosing appropriate marketing techniques to reduce churn.
Case study: Two instances of the use of datamining to handle customer churn are cited in [4]. Vodafone, which bought Telsim, applies data mining for sales, marketing, financial management, future prediction, and many other needs. Vodafone detects peak hours from its databases and readies more staff to avoid any disruption in communication. Vodafone also determines the average prepaid minutes purchased and finds subscribers who are likely to churn. Turkcell, which obtains customer information through business intelligence and data mining techniques, offers new tariffs and develops campaigns for existing customers. Turkcell has also been developing programs that increase customer loyalty.
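The logistic regression approach to churn described earlier in this section can be sketched on a small balanced training set. The two binary predictors (prepaid plan, discounted package) echo the churn factors mentioned above, but the data, learning rate, and epoch count are made up for illustration, and a statistics package would normally be used instead of hand-written gradient descent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def churn_probability(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Hypothetical balanced training set: 8 churners and 8 non-churners.
# Each row is (is_prepaid, has_discount_package).
churners = [(1, 0)] * 3 + [(0, 1)] * 1 + [(0, 0)] * 2 + [(1, 1)] * 2
stayers = [(1, 0)] * 1 + [(0, 1)] * 3 + [(0, 0)] * 2 + [(1, 1)] * 2
X = churners + stayers
y = [1] * len(churners) + [0] * len(stayers)

w = fit_logistic(X, y)
p_high = churn_probability(w, (1, 0))  # prepaid, no discount package
p_low = churn_probability(w, (0, 1))   # postpaid, with discount package
```

The signs of the fitted weights indicate the direction in which each predictor moves the churn probability, which is how such a model is used to read off reasons for churn.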
connection weight vectors consistent with the probability distribution of the input model, which means the density of the connection weight vectors reflects the statistical features of the input model. The layout of neurons in the output layer can take many forms, the typical one being a two-dimensional plane matrix [8]. The Kohonen neural network algorithm is used to form clusters, after which classification techniques can be used to predict potentially malicious customers. Clustering can extract the outliers from the entire data set but cannot state the features of this group, so a classification decision tree is needed to extract the classification rules of this special group and apply them to business forecasting.
Once a telco identifies customer fraud as an important business problem to solve, the relevant data can be obtained by regularly processing the bills and call records of the customers. Attributes like calling charge, time of call, call duration, and call type, together with the customer information table and arrears table, can be used to build the feature record set of target customers. Attributes like customer ID, gender, age, time in network, state, long-distance call fees, roaming fees, average call length, number of calls, fees in arrears, service change type, etc. also become crucial. The data preprocessing phase converts the raw data into reliable data for data mining through sampling, setting the label of the decision variable, etc. Subsequently, the neural network algorithm can be used to form clusters, and classification rules can be extracted to identify and predict malicious usage. The rules are verified using test data, and a confusion matrix can be generated to test their effectiveness.

                              Actually not in arrears (0)   Actually in arrears (1)
Forecast not in arrears (0)               A                            C
Forecast in arrears (1)                   B                            D

Accuracy rate of forecasting malicious arrears = D/(B+D)
Accuracy rate of forecasting not in arrears maliciously = A/(A+C)
Hitting rate of forecasting malicious arrears = D/(C+D)
Hitting rate of forecasting not in arrears maliciously = A/(A+B)

Table 3: Forecasting effectiveness using confusion matrix [8]
Once the confusion matrix analysis suggests that the rules are suitable under business policies and strategies, they can be applied to real-time customers. From customers' call duration and cost, account information, and state changes, it becomes possible to forecast the probability of customers becoming malicious accounts in later months.
Sequential pattern mining algorithms can be used to solve the problem of discovering the maximum frequent sequences in a given database. Genetic algorithms are one of the best ways to solve a problem about which little is known; they use the principles of selection and evolution to produce several solutions to a given problem. Sequential Pattern Clustering with a Genetic Algorithm (GA) has been used on a database of telephone numbers to detect fraud and identify misuse and malpractice customers in the telecommunication area [18]. Before the technique is applied, preliminary activities such as defining the fitness value, initial population size, mutation point, crossover point, and ordering of the data are carried out. The fitness function value is the criterion for generating new solutions or candidates. Initial sample size, maximum generations, threshold value, and minimum fitness are the inputs, and the output is the list of malpractice user numbers. This study illustrates the use of clustering and GA techniques for datamining the telco database to detect and prevent potential customer fraud.
Sanver and Karahoca [23] applied the Adaptive Neuro-Fuzzy Inference System (ANFIS) structure and optimization processes to call detail records to detect customer fraud. The ANFIS approach uses Gaussian functions for fuzzy sets and linear functions for the rule outputs. The initial parameters of the network are the mean and standard deviation of the membership functions (antecedent parameters) and the coefficients of the output linear functions (consequent parameters). The ANFIS learning algorithm then obtains these parameters by recursively updating the rule parameters until an acceptable level of error is achieved. Each iteration has two steps, one forward and one backward. In the forward pass, the antecedent parameters are fixed, and the consequent parameters are obtained using the linear least-squares estimate. In the backward pass, the consequent parameters are fixed, the output error is back-propagated through the network, and the antecedent parameters are updated accordingly using the gradient descent method [23]. As per the study, the total duration of calls, international calls, and total cost of calls are important factors in fraud identification.
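The four rates defined with Table 3 can be computed directly from actual and forecast labels. The sketch below follows the cell naming A, B, C, D used in that table (1 = in arrears, 0 = not in arrears); the function name and the toy label vectors are assumptions for illustration.

```python
def forecasting_rates(actual, forecast):
    """Compute the Table 3 accuracy and hitting rates from 0/1 labels."""
    pairs = list(zip(actual, forecast))
    A = sum(a == 0 and f == 0 for a, f in pairs)  # forecast not in arrears, actually not
    B = sum(a == 0 and f == 1 for a, f in pairs)  # forecast in arrears, actually not
    C = sum(a == 1 and f == 0 for a, f in pairs)  # forecast not in arrears, actually in arrears
    D = sum(a == 1 and f == 1 for a, f in pairs)  # forecast in arrears, actually in arrears

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is zero

    return {
        "accuracy_arrears": ratio(D, B + D),
        "accuracy_not_arrears": ratio(A, A + C),
        "hitting_arrears": ratio(D, C + D),
        "hitting_not_arrears": ratio(A, A + B),
    }

# Toy test set: 6 customers not in arrears, 4 in arrears.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
forecast = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
rates = forecasting_rates(actual, forecast)
```

In this toy case A = 5, B = 1, C = 1, and D = 3, so both rates for malicious arrears come out at 3/4 and both rates for customers not in arrears at 5/6.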
optimization. Data mining begins with preprocessing the data, which reduces noise, copes with extremes, deals with missing values, and balances variables and their value ranges. Subsequently, analysis methods are chosen depending on the statistical information available and the overall goal of the analysis. The phase following the analysis consists of interpretation, presentation, and validation of the information and knowledge based on the features discovered in the previous step.
Frequent patterns are identified from the dataset generated by the network. Frequent sets consist of value combinations that occur inside data records such as alarm entries. Frequent episodes describe common value sequences, such as sequences of alarm message types, that occur in the network. Frequent sets can be used to calculate association rules. In telecommunications management systems, the event creation, transfer, and logging mechanisms result in unordered entries. The Apriori algorithm for frequent sets ([15], page 27) can be modified to compute frequent unordered episodes, for example by using a set of log/alarm sequence windows. Frequent patterns capture the information in these repetitive entries. The Comprehensive Log Compression (CLC) method uses frequent patterns to summarise and remove repetitive entries from log data [15], which makes it easier for a human observer to analyse the log contents. CLC also supports on-line analysis of large log files. It analyses the data and identifies frequently occurring value or event type combinations by searching for frequent patterns.
For the purposes of the OMC in the telecom space, frequent episodes and association rules can be used to assist in defining rules for an alarm-correlation engine. However, the space of maintenance and security events is much larger: some of the event types are unknown, and new types keep appearing when new components are added to the network.
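The frequent-set search over alarm windows can be sketched with a small level-wise, Apriori-style procedure. This is a generic illustration rather than the modified algorithm described in [15]; the alarm type names, the windows, and the support threshold are hypothetical.

```python
from itertools import combinations

def frequent_sets(windows, min_support):
    """Level-wise search for alarm-type sets that occur in at least min_support windows."""
    windows = [frozenset(w) for w in windows]

    def support(s):
        return sum(s <= w for w in windows)

    items = sorted({a for w in windows for a in w})
    level = [frozenset([a]) for a in items if support(frozenset([a])) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        size = len(level[0]) + 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, size - 1))
                 and support(c) >= min_support]
    return result

# Each window is the unordered set of alarm types seen in one time slice of the log.
windows = [
    {"LINK_DOWN", "LOS", "HIGH_BER"},
    {"LINK_DOWN", "LOS"},
    {"LINK_DOWN", "LOS", "FAN_FAIL"},
    {"HIGH_BER"},
    {"HIGH_BER", "LOS"},
]
result = frequent_sets(windows, min_support=3)

# Confidence of the candidate association rule LINK_DOWN -> LOS.
pair = frozenset({"LINK_DOWN", "LOS"})
confidence = result[pair] / result[frozenset({"LINK_DOWN"})]
```

A rule with high confidence, such as the one computed at the end, is the kind of candidate that could be reviewed for inclusion in an alarm-correlation engine.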
A method that addresses this kind of changing data set cannot rely on a pre-defined knowledge base alone. Hence predictive algorithms become necessary, and these can be built with self-learning methods. Large volumes of daily network log data force network operators to compress and archive the data to offline storage. Whenever an incident occurs, for example in system security monitoring, that immediately requires detailed analysis of recent history data, the data has to be fetched from the archiving system. Typically the data also has to be decompressed before it can be analysed. These kinds of on-line decision-making tasks at the tactical level are challenging for data mining methods and the knowledge discovery process.
Log files that telecommunications networks produce are typically archived in compressed form. Closed sets can be used to create a representation that reduces the size of the stored log by coding the frequently occurring value combinations. The coding can be done without any prior knowledge about the entries. Using datamining techniques to identify the critical attributes to be stored can be a key differentiator in implementing efficient algorithms for real-time access to data without the need for decompression. Queryable Lossless Log data Compression (QLC) [15] is one way to archive data so that it can be accessed without decompression.
Burn-Thornton et al. [16] evaluated the efficacy of several datamining methods for proactive network management. The algorithms chosen were k-NN (statistical), C4.5 (machine learning, decision tree), CN2 (machine learning, rule induction), RBF (neural networks), and OC1 (machine learning, decision tree). The study surmised that C4.5 could prove to be the best algorithm for accurately and rapidly mining multi-set, multivariate, small-class performance data, and also for performing this task over the smaller size of data set. However, k-NN appeared to be the most suitable algorithm as the basis for proactive system management, owing to its better classification accuracy, although it was slower than C4.5.
Data mining can be used to build a probabilistic network by correlating offline alarm event data, and then to deduce the cause of live alarm events using this network. The cause-and-effect graph that forms a Bayesian network can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Induction has to be used to deduce this structure from the data, but this is an NP-hard problem because the vast number of variables gives a very large number of potential graph structures. In practice, when learning the cause-and-effect graph, the volume of event traffic and correlation of alarms can be reduced by a simple first-stage correlation (generally pattern matchers). The expert-system approach (e.g. deduction from the probabilistic network) can then handle the remaining, more complex problems, taking advantage of the much reduced and enriched stream of events. Rules for the system can be written from results mined with tools such as Clementine [17]. Based on the datamining results, additional rules could be adapted to extend the existing correlation system in an element or network management system.
Another use of datamining on the huge amount of network data generated is to monitor the delivery of Service Level Agreements (SLAs) to customers.
Data Exceptions (DEs), specific indications of unexpected performance, mark periods where the delay data differs from some expectation for reasons such as spikes, delay step changes, and changes in time-of-day delay variation. By mining weeks of collected network data, the network operator can classify unseen exceptions as DEs and take corrective action as required to maintain the SLAs. Neural networks can be used on the data to create decision engines which help the operator do this. In the study by Phillips et al. [19], DEs are automatically detected using a two-stage process. First, a statistical test (Kolmogorov-Smirnov) is performed, which indicates that an exception has occurred in the data. Subsequently, the delay data immediately surrounding the exception is passed to a neural network for classification. The neural network is initially trained with a set of DEs of various types taken from the communication network. In Phillips et al.'s evaluation of this approach, a classification accuracy of 99% was eventually achieved on this training set. On unseen exceptions, the trained neural network achieved a classification accuracy of 79%, averaged over all exception types. The KS test, which preceded the neural network, correctly identified 99.5% of the exceptions presented to it [19].
The Classification And Regression Trees (CART) algorithm can be used to classify QoS based on the Key Performance Indicators (KPIs) of telecommunication networks and their elements, e.g. a cell site. CART is a binary decision tree which can be applied to both numerical and nominal data. Classification with CART is based on observations of a set of variables, which are used as predictors, and a classification variable (also called the target value) attached to these observations. Tree construction determines the binary splits of data set X, using training data set L, so that X is cut into smaller and smaller subsets. The solution is to search over every possible threshold in every variable for the split that best improves the tree structure according to a specified score function [20]. The terminal nodes can be written out as linguistic classification rules which the NOC operator can understand relatively easily and act upon based on the QoS classification.
The Self-Organizing Map (SOM), an unsupervised neural network, can also be used to manage telecommunication traffic. SOM has been analysed for various telecommunication purposes such as call admission control, controlling a router system, creating geographical clusters from calling patterns, adaptive resource allocation, user profiling to detect fraud in mobile telecommunications networks, visualizing the behavior of network cells, and visualizing the performance, detecting the anomalies, and analyzing the trends of a mobile network [20].
Case Study: The Telecommunication Alarm Sequence Analyzer (TASA) is a data mining tool that helps in fault identification by automatically discovering recurrent patterns of alarms within the network data. The patterns discovered by the tool are used to construct a rule-based alarm correlation system. TASA is also capable of finding episodic rules that depend on temporal relationships between the alarms [21].
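The CART split search described in this section, which tries every threshold of every variable and keeps the best-scoring one, can be sketched for a single tree node. Gini impurity is assumed here as the score function, and the KPI names and values are hypothetical; the cited work may use a different score.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Search every threshold of every variable for the lowest weighted Gini impurity."""
    best = None  # (score, variable index, threshold)
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# Hypothetical cell-site KPIs: (dropped call rate %, handover success %).
rows = [(0.5, 99), (0.8, 97), (1.0, 98), (2.5, 96), (3.0, 99), (4.2, 95)]
labels = ["good", "good", "good", "bad", "bad", "bad"]
score, variable, threshold = best_split(rows, labels)
```

The chosen split reads naturally as a linguistic rule of the form "if dropped call rate is at most the threshold, then QoS is good", which is the kind of rule a NOC operator can act on. A full tree would apply the same search recursively to each resulting subset.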
Machine classifier (SVM), which attempts to identify optimal hyperplanes with nonlinear boundaries in the variable space in order to minimize misclassification.
There are certain challenges in using datamining techniques to detect the fault proneness of software. When building models to predict faulty components or files, the process tends to be rather exploratory, which results in a large number of predictors with low correlation. The number of training instances needed for instance-based learning increases exponentially with the number of irrelevant variables present in the data set. At the same time, strong inter-correlations among variables affect variable selection heuristics in regression analysis. Variable selection schemes like Correlation-based Feature Selection (CFS) help in selecting better variables for analysis.
Arisholm et al. [25] used a large Java-based legacy system maintained by Telenor in Norway, consisting of more than 2600 Java classes amounting to about 148K SLOC (source lines of code). Using as the dependent variable the occurrence of corrections, due to field error reports, in the classes of a specific release, the aim was to focus unit testing and inspections on the most fault-prone modules and so reduce the cost of quality. They found that C4.5 decision trees perform very well overall (for different percentages of code and test sets) and that the 20% most fault-prone classes account for around 70% of test sets. When the model was put to use, feedback from developers showed that they were able to uncover many new faults by investing extra days of unit testing on the classes predicted as most fault prone.
Turhan, Koçak and Bener [23] analyzed twenty-five projects of a large telecommunication system in their study. To predict the defect proneness of modules, they trained models on publicly available NASA MDP data. They used projects implemented in Java and gathered 29 static code metrics from each.
In total, there are approximately 48,000 modules spanning 763,000 lines of code. All projects are from presentation and application layers. In their experiments they used Static Call Graph Based Ranking as well as Nearest Neighbor Sampling for constructing method level defect predictors. They found that Nave Bayes methodology achieves significantly better results than many other mining algorithms for defect prediction
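To make the Naive Bayes approach to defect prediction concrete, the following is a minimal sketch of a Gaussian Naive Bayes classifier over static code metrics. It is not the setup used in [23]: the two metrics (lines of code and cyclomatic complexity), the training values, and the function names fit_gaussian_nb and predict are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior, assuming the
    features are independent and normally distributed within each class."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (lines of code, cyclomatic complexity).
# Label 1 = module had a reported defect, 0 = no defect.
metrics = [(20, 2), (35, 3), (15, 1), (40, 4),
           (300, 25), (250, 30), (400, 28), (320, 22)]
defective = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_gaussian_nb(metrics, defective)
print(predict(model, (30, 3)))    # 0
print(predict(model, (350, 27)))  # 1
```

Real static code metrics are usually heavily skewed, so a practical predictor would likely transform the metrics (for example, by taking logarithms) before fitting.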
into one of the six categories. Both the classification methods used similar variables for classification. The DT model relies heavily on ROE, Interest Coverage, and Equity to Capital, while the MLR technique, in each model equation, generally assigns large positive coefficients to ROE, Interest Coverage, and Equity to Capital and negative or small coefficients to the other variables. This is an interesting case study of how to predict the financial health of the telecommunications industry using datamining techniques.
Location-based advertisement services can be made more effective by using clustering techniques to create clusters based on attributes derived from the connection between the demographics, location, and product purchase history of the customer. This would involve combining data sets from different players, and issues of privacy will have to be addressed here. Methods of anonymisation will have to be used, along with advanced clustering methods such as probabilistic model-based clustering or flexible fuzzy clustering using the Expectation-Maximisation algorithm. Due to the large amount of multi-dimensional data, subspace search methods may have to be deployed to build engines for targeted advertising. Analysing the large volume of data and text flowing through the networks will involve text mining techniques, both to understand customer trends and to assess network planning and deployment needs. Such text mining capability at the telco end would greatly enhance the competitiveness of telcos with respect to companies which use resident data to attract advertisement revenue, and would enable telcos to become a more important player in the communication industry value chain. The use of mobile devices and the availability of fast internet have shifted the usage patterns of customers. There are reports of networks being clogged due to high usage of multimedia content over wireless, and of operators having to withdraw unlimited usage plans. With landlines being outstripped by mobile connections and the need to increase Average Revenue Per User (ARPU), data applications have become an important part of the ecosystem. Datamining techniques applied to the type of traffic, plans, and profitability would help the telcos offer the right kind of plan in a customized manner to individual customers.
Online Analytical Processing systems can be developed to classify new or existing customers based on patterns of usage, ARPU generated, technology adoption pattern, etc., and to offer real-time plans which would greatly enhance customer retention and loyalty. Discriminant Analysis can be used to predict customer acceptability and so reduce customer fraud. Logistic Regression techniques can be used to predict and ensure better quality of service based on service level agreements (SLAs) and the evolving patterns of network traffic. Datamining techniques used for predictive analysis will play a critical role in the success of telcos, especially with the changing landscape of technology and players, customer choices, and lifestyles.
Conclusion
The various applications of datamining in the telecommunications industry have been presented. Even though the telecommunications industry has been one of the earliest adopters of datamining, the changing trends of technology and the focus on customer delight are making datamining an integral part of operators' business. The evolving techniques for predicting future trends from past data are critical to making datamining a powerful tool, especially since the uncertainty of adoption is high and is coupled with opportunities for substantial business gains.
Reference
1 Data Mining in Telecommunications - Gary M. Weiss, Department of Computer and Information Science, Fordham University, Copyright 2009, IGI Global
2 The Research of Data Mining in Telecom Data Warehouse - SHI Jun-yong, LI Ling-ling, 2010 IEEE Computer Society, DOI 10.1109/ICSEM.2010.160
3 Improved K-Means algorithm and application in customer segmentation - Xiaoping Qin, Shijue Zheng, Ying Huang, Guangsheng Deng (Vietnam), Department of Computer Science, Huazhong Normal University, 2010 IEEE Computer Society, 2010 Asia-Pacific Conference on Wearable Computing Systems, DOI 10.1109/APWCS.2010.68
4 Customer churn analysis in telecommunication sector - Umman Tuba Şimşek Gürsoy, Istanbul University Journal of the School of Business Administration, Cilt/Vol: 39, Sayı/No: 1, 2010, 35-49, ISSN: 1303-1732, www.ifdergisi.org 2010
5 Clustering Analysis of Telecommunication Customers - H. Ren, Y. Zheng, Y. Wu, The Journal of China Universities of Posts and Telecommunications, 16(2), 114-116 (2009)
6 Predicting customer churn in mobile networks through analysis of social groups - Yossi Richter, Elad Yom-Tov, Noam Slonim, IBM Haifa Research Lab, 165 Aba Hushi st., Haifa 31905, Israel
7 On Customer Churn and Early Warning Model of Telecom Broadband - Ding Pan, Center for Business Intelligence Research, School of Management, Jinan University, Guangzhou, China, 978-0-7695-4261-4/10, 2010 IEEE Computer Society
8 Fraudulent Behavior Forecast in Telecom Industry Based on Data Mining Technology - Sen Wu, Naidong Kang, Liu Yang, School of Economics and Management, University of Science and Technology Beijing, Communications of the IIMA, 2007, Volume 7, Issue 4
9 Designing an Expert System for Fraud Detection in Private Telecommunications Networks - C.S. Hilas, Expert Systems with Applications, 36, 11559-69 (2009)
10 Business Intelligence Applications and Data Mining Methods in Telecommunications: A Literature Review - Dorina Kabakchieva, Sofia University, 125 Tzarigradsko shosse Blvd., 1113 Sofia, Bulgaria
11 Self Organization of a Massive Document Collection - Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, Antti Saarela, IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000
12 A Framework for Predictive Data Mining in the Telecommunications Sector - Adrian Costea, Tomas Eklund, Turku Centre for Computer Science and IAMSR / Åbo Akademi University, Lemminkäisenkatu 14 B, FIN-20520 Turku, Finland
13 Privacy Preserving Data Mining in Telecommunication Services - Ole-Christoffer Granmo, Vladimir A. Oleshchuk, ISSN 0085-7130, Telenor ASA 2007, Telektronikk 2.2007
14 Data Mining in Churn Analysis Model for Telecommunication Industry - Khalida binti Oseman, Sunarti binti Mohd Shukor, Norazrina Abu Haris, Faizin bin Abu Bakar, Journal of Statistical Modeling and Analytics, Vol. 1 No. 19-27, 2010
15 Data mining for telecommunications network log analysis - Kimmo Hätönen, University of Helsinki, Finland
16 Pro-Active Network Management Using Data Mining - K.E. Burn-Thornton, J. Garibaldi, A.E. Mahdi, School of Electronic, Communication and Electrical Engineering, University of Plymouth, 0-7803-4984-9/98, 1998 IEEE
17 Data Mining telecommunications network data for fault management and development testing - R. Sterritt, K. Adamson, C.M. Shapcott, E.P. Curran, Faculty of Informatics, University of Ulster, Northern Ireland
18 A Fraud Detection Approach in Telecommunication using Cluster GA - V. Umayaparvathi, Dr. K. Iyakutti, MKU, International Journal of Computer Trends and Technology, May to June Issue 2011, ISSN: 2231-2803
19 Architecture for the Management and Presentation of Communication Network Performance Data - Iain Phillips, David Parish, Mark Sandford, Omar Bashir, Anthony Pagonis, IEEE Transactions on Instrumentation and Measurement, Vol. 55, No. 3, June 2006, 0018-9456, 2006 IEEE
20 Data Mining for Managing Intrinsic Quality of Service in Digital Mobile Telecommunications Networks - Pekko Vehviläinen, Tampere University of Technology, Publications 458
21 Computational Intelligence in Data Mining and Prospects in Telecommunication Industry - Isinkaye O. Folasade, Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) 2 (4): 601-605 (ISSN: 2141-7016)
22 Analysis of marketing data to extract key factors of telecom churn management - Hao-En Chueh, African Journal of Business Management, Vol. 5(20), pp. 8242-8247, 16 September, 2011
23 Data mining source code for locating software bugs: A case study in telecommunication industry - Burak Turhan, Gozde Kocak, Ayse Bener, Expert Systems with Applications 36 (2009) 9986-9990
24 Fraud Detection Using an Adaptive Neuro-Fuzzy Inference System in Mobile Telecommunication Networks - Mert Sanver, Adem Karahoca, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, 94305, USA, and Department of Computer Engineering, Bahcesehir University, Besiktas, Istanbul, 34900, Turkey
25 Data Mining Techniques for Building Fault-proneness Models in Telecom Java Software - Erik Arisholm, Lionel C. Briand, Magnus Fuglerud, Simula Research Laboratory, 18th IEEE International Symposium on Software Reliability Engineering, 1071-9458/07, 2007 IEEE, DOI 10.1109/ISSRE.2007.22
26 Data Mining: Concepts and Techniques, Third Edition - Jiawei Han, Micheline Kamber, Jian Pei, Morgan Kaufmann Series
Appendix 4: A cluster hierarchy of an operator business model and responsibility areas. [15]