Research issues of mining big data streams



The current stage of information technology has shown that big data concepts are effective for a wide range of problems. To support competitive decision-making, huge amounts of data, including data types not previously available for analysis, must be processed and analyzed with new intelligent data mining methods. Stream data mining is one of the important directions, because methods for evolving data streams are becoming the most efficient way to perform real-time prediction and analysis.

Introduction

Traditional databases store relatively static records with no predefined notion of time: you can insert, update, delete, or select any record at any time if you have the authorization. Traditional databases have been used in applications that require persistent data storage and complex querying. Usually a database consists of a set of objects, with insertions, updates, and deletions occurring less frequently than queries. However, the past few years have witnessed the emergence of applications that do not fit this data model and querying paradigm. Instead, information naturally occurs in the form of a sequence of data values [1].

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are typically designed for well-behaved, controlled problem settings, overlooking important challenges imposed by real-world applications [2].

Streaming data have gained considerable attention in the database and data mining communities because of the emergence of a class of applications that produce these data. Data streams have some unique characteristics that are not exhibited by traditional data: they are unbounded, fast-arriving, and time-changing. Traditional data mining techniques that make multiple passes over data or that ignore distribution changes are not applicable to dynamic data streams. Mining data streams has been an active research area that addresses the requirements of streaming applications [3].

Big data mining

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk [4].

In general terms, as a common denominator of the various definitions available, ‘big data’ refers to the practice of combining huge volumes of diversely sourced information and analysing them, using more sophisticated algorithms to inform decisions. Big data relies not only on the increasing ability of technology to support the collection and storage of large amounts of data, but also on its ability to analyse, understand and take advantage of the full value of data [5].

Doug Laney was the first to describe the three V’s of big data management:

· volume: there is more data than ever before, and its size continues to increase, but the percentage of data that our tools can process does not;

· variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more;

· velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time.

Nowadays, there are two more V’s:

· variability: the structure of the data changes, as does the way users want to interpret that data;

· value: the business value that gives an organization a compelling advantage, due to the ability to make decisions based on answering questions that were previously considered beyond reach.

Big data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before because of their volume, variability, and velocity [6].

Data mining involves exploring and analyzing large amounts of data to find patterns. The techniques came out of the fields of statistics and artificial intelligence (AI), with a bit of database management thrown into the mix.

Generally, the goal of data mining is either classification or prediction. In classification, the idea is to sort data into groups. For example, a marketer might be interested in the characteristics of those who responded versus those who didn’t respond to a promotion. These are two classes.

In prediction, the idea is to predict the value of a continuous variable. For example, a marketer might be interested in predicting those who will respond to a promotion.

Typical algorithms used in data mining include the following:

Classification trees: A popular data-mining technique that is used to classify a dependent categorical variable based on measurements of one or more predictor variables. The result is a tree with nodes and links between the nodes that can be read to form if-then rules.

Logistic regression: A statistical technique that is a variant of standard regression but extends the concept to deal with classification. It produces a formula that predicts the probability of the occurrence as a function of the independent variables.

Neural networks: A software algorithm that is modeled after the parallel architecture of animal brains. The network consists of input nodes, hidden layers, and output nodes. Each unit is assigned a weight. Data is given to the input node, and by a system of trial and error, the algorithm adjusts the weights until it meets a certain stopping criterion. Some people have likened this to a black-box approach.

Clustering techniques like K-nearest neighbors: A technique that identifies groups of similar records. The K-nearest neighbor technique calculates the distances between the record and points in the historical (training) data. It then assigns this record to the class of its nearest neighbor in a data set [7].
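The K-nearest neighbor step described above can be illustrated with a minimal Python sketch. The function name `knn_classify`, the toy `history` records, and the use of Euclidean distance with a majority vote are assumptions made for illustration only; they are not taken from the cited source.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Assign `query` to the majority class among its k nearest training records.

    training: list of (features, label) pairs, where features is a numeric tuple.
    """
    # Order the historical records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    # Majority vote among the k closest records (k=1 is the plain nearest neighbor).
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical data: (age, purchases last year) -> promotion outcome.
history = [
    ((25, 1), "responded"),
    ((30, 2), "responded"),
    ((55, 9), "ignored"),
    ((60, 8), "ignored"),
]
print(knn_classify(history, (28, 1), k=3))
```

Note that this version scans all stored records for every query, which is one reason such batch techniques need adaptation before they can be applied to unbounded streams.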

Data stream mining

The development of information and communication technologies has dramatically changed data collection and processing methods. What distinguishes current datasets from earlier ones is automatic data feeds. We do not just have people entering information into a computer. We have computers entering data into each other [14]. Moreover, advances in miniaturization and sensor technology lead to sensor networks that collect highly detailed spatiotemporal data about the environment.

Data mining in this context requires continuous processing of the incoming data, monitoring trends, and detecting changes. Traditional one-shot systems (memory-based, trained from fixed training sets, and generating static models) are not prepared to process the highly detailed data available. Nor are they able to continuously maintain a predictive model consistent with the actual state of nature, or to react quickly to changes [8].
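As one illustration of detecting changes while processing data continuously, the sketch below uses the Page-Hinkley test, a well-known sequential change detector. It is offered only as an example of the idea, not as the method of the cited work; the class name, the parameters `delta` and `threshold`, and the synthetic stream are assumptions.

```python
class PageHinkley:
    """Sequential detector for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of deviations from the mean
        self.minimum = 0.0          # smallest cumulative sum seen so far

    def update(self, x):
        """Consume one value; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold

# Synthetic stream: the mean jumps from 0 to 10 after 100 observations.
stream = [0.0] * 100 + [10.0] * 100
detector = PageHinkley()
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
print("first alarm at index", alarms[0] if alarms else None)
```

The detector keeps only four numbers of state, so it can run alongside a model and trigger retraining when the data distribution shifts.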

Mining big data streams faces three principal challenges: volume, velocity, and volatility. Volume and velocity require a high volume of data to be processed in limited time. Starting from the first arriving instance, the amount of available data constantly increases from zero to potentially infinity. This requires incremental approaches that incorporate information as it becomes available, and online processing if not all data can be kept [9].
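A minimal sketch of such an incremental approach is a running mean and variance that is updated one instance at a time, using Welford's method, without storing the stream. The class name `RunningStats` and the sample values are illustrative assumptions.

```python
class RunningStats:
    """Mean and sample variance maintained incrementally in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: fold in one new value, then discard it.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Each arriving instance is incorporated once and never revisited, which is the property that makes the approach suitable when not all data can be kept.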

The widespread dissemination and rapid increase of data stream generators, coupled with high demand to utilize these streams of data in critical real-time data analysis tasks, have led to the emerging focus on stream processing. Data stream processing is broadly classified into two main categories according to the type of processing, namely:

· data stream management: this represents querying and summarization of data streams for further processing

· data stream mining: performing traditional data mining techniques with linear/sublinear time and space complexity
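To make the sublinear space requirement concrete, the sketch below shows reservoir sampling, a standard one-pass technique that keeps a uniform random sample of fixed size k from a stream of unknown length. It is given as an example of stream summarization in general; the function name and parameters are assumptions, not something prescribed by the cited handbook.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(42))
print(sample)
```

Time is linear in the number of items seen, while memory stays at k items regardless of how long the stream runs.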

The next table shows the major differences between data stream processing and traditional data processing. The objective of this table is to clearly differentiate between traditional stored data processing and stream processing as a step towards focusing on the data mining aspects of data stream processing systems [10].

Research issues of Big Data Stream Mining

Most traditional data mining methods originated in statistics and developed progressively from there. They tend to focus on the correctness and availability of the algorithm, and lack in-depth study of and attention to processing large-scale data sets, high-dimensional data processing capabilities, and the execution efficiency of algorithms. In addition, there are no high standards on the space and time complexity of the algorithm.

With the development of information technology, big data problems have gradually appeared. It is necessary to process data at the terabyte (TB) or even petabyte (PB) scale. Furthermore, the growth of big data will outpace the growth of the corresponding data processing capacity [11].

Data stream mining is a stimulating field of study that has raised many challenges and research issues. The following is a brief discussion of some crucial open research issues:

Memory management: The first fundamental issue we need to consider is how to optimize the memory space consumed by the mining algorithm. Memory management is a particular challenge when processing streams because many real data streams are irregular in their rate of arrival, exhibiting burstiness and variation of data arrival rate over time. Fully addressing this issue in the mining algorithm can greatly improve its performance [12].
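One simple way to bound memory regardless of arrival rate is a fixed-capacity sliding window that evicts the oldest item when a new one arrives. The sketch below is an illustrative assumption, not a technique taken from the cited paper; `SlidingWindowMean` and its capacity are hypothetical names and values.

```python
from collections import deque

class SlidingWindowMean:
    """Mean over the most recent `capacity` items, using bounded memory."""

    def __init__(self, capacity):
        self.window = deque(maxlen=capacity)
        self.total = 0.0

    def add(self, x):
        if len(self.window) == self.window.maxlen:
            # The oldest value is about to be evicted by append().
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x

    def mean(self):
        return self.total / len(self.window) if self.window else 0.0

w = SlidingWindowMean(capacity=3)
for reading in [1, 2, 3, 4, 5]:  # a burst of arrivals
    w.add(reading)
print(list(w.window), w.mean())
```

However bursty the input is, the structure never holds more than `capacity` items, so memory use is fixed in advance.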

Data pre-processing: Data pre-processing is an important and time-consuming phase in the knowledge discovery process and must be taken into consideration when mining data streams. The challenge here is to automate such a process and integrate it with the mining techniques.

Compact data structure: Due to bounded memory size and the huge volume of data arriving continuously, an efficient and compact data structure is needed to store, update, and retrieve the collected information.
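A Count-Min sketch is one widely used example of such a compact structure: it approximates item frequencies in a fixed-size table and never underestimates a count. The sketch below is a simplified illustration; the class name, the default `width` and `depth`, and the SHA-256-based hashing scheme are assumptions chosen for clarity rather than speed.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter using depth x width integer cells."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a", "b"]:
    cms.add(word)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("z"))
```

Memory use depends only on `width` and `depth`, not on the number of distinct items in the stream, at the cost of possible overestimation when items collide.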

Resource aware: This is a fundamental issue that considers how limited resources, e.g., memory space and computation power, can be well utilized to produce accurate estimates. Data will be lost when the memory is used up, and this would lead to inaccurate mining results and thus degrade the performance of the mining algorithm [13].

Conclusion

Data stream mining applications address the same tasks as traditional data mining but over unbounded, continuous, fast-arriving, and time-changing data streams. These characteristics impose many new challenges for even the simplest task in traditional data mining. Most of the existing techniques cannot be adopted for the data stream environment.

In this regard, there is a need to investigate and improve real-time data mining algorithms and adapt them for possible use in a wide range of industries. Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends, helping to improve their performance.

References

  1. Issues in Data Stream Management, https://tianyesite.com/2016/01/25/issues-in-data-stream-management/ [access: 14.05.2016].

  2. Krempl, G., Zliobaite, I., Brzezinski, D., Hullermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., Stefanowski, J.: Open challenges for data stream mining research. 16(1), 1–10 (June 2014).

  3. P. Boedihardjo: Efficient Algorithms for Mining Data Streams. PhD thesis (2010).

  4. Big data - Wikipedia, the free encyclopedia, https://en.wikipedia.org/wiki/Big_data [access: 14.05.2016].

  5. European Data Protection Supervisor, Opinion 7/2015: Meeting the challenges of big data, 19 November 2015.

  6. Wei Fan, Albert Bifet: Mining Big Data: Current Status and Forecast to the Future. SIGKDD Explorations, 14(2), pp. 1-5.

  7. Data Mining for Big Data, http://www.dummies.com/how-to/content/data-mining-for-big-data.html [access: 14.05.2016].

  8. Lalit S. Agrawal and Dattatraya S. Adane: Models and Issues in Data Stream Mining.

  9. T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi: An information-theoretic approach to detecting changes in multi-dimensional data streams. In 38th Symposium on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2005.

  10. O. Maimon, L. Rokach (eds.): Data Mining and Knowledge Discovery Handbook, 2nd ed.

  11. Wang, Yuan-Zhuo, Jin, Xiao-Long, Cheng, Xue-Qi: Network Big Data: Present and Future [J]. Chinese Journal of Computers, 2013.

  12. L. Golab and M. T. Ozsu: Issues in Data Stream Management. SIGMOD Record, Volume 32, Number 2, June 2003.

  13. Elena I, Suzana L, Dejan G: A survey of stream data mining. In: Proceedings of 8th national conference with international participation, ETAI, Ohrid (2007).

  14. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Now Publishers (2005).
May 27, 2016