Stock Price Prediction Based On Stock Big Data and Pattern Graph Analysis
Keywords: Stock Price Prediction, Hierarchical Clustering, Pattern Matching, Feature Selection, Artificial Neural
Network.
Abstract: Stock price prediction is extremely difficult owing to the irregularity of stock prices. Because stock prices sometimes show similar patterns and are determined by a variety of factors, we present a novel concept of finding similar patterns in historical stock data for high-accuracy daily stock price prediction, together with potential rules for simultaneously selecting the main factors that have a significant effect on the stock price. Our objective is to propose a new composite methodology that finds, for each stock item, the optimal historical dataset with similar patterns according to various algorithms and provides a more accurate prediction of the daily stock price. First, we use hierarchical clustering to easily find, according to the hierarchical structure, similar patterns in the layer adjacent to the current pattern. Second, we use feature selection to select the determinants that most influence the stock price. Moreover, we generate an artificial neural network model that provides numerous opportunities for predicting the best stock price. Finally, to verify the validity of our model, we use the root mean square error (RMSE) as a measure of prediction accuracy. The forecasting results show that the proposed model can achieve high prediction accuracy for each stock by this measure.
Jeon, S., Hong, B., Kim, J. and Lee, H-j.
Stock Price Prediction based on Stock Big Data and Pattern Graph Analysis.
DOI: 10.5220/0005876102230231
In Proceedings of the International Conference on Internet of Things and Big Data (IoTBD 2016), pages 223-231
ISBN: 978-989-758-183-0
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
Table 1: Example of stock raw data.

  Attribute                           Value
  Date (yyyymmddhhmmss)               20140813090024
  Type                                0
  Completion price (won)              77,500
  Completion amount                   37
  Opening price (won)                 78,900
  High price (won)                    78,900
  Low price (won)                     76,600
  Price just before (won)             77,400
  Accumulated completion amount       475,021
  Accumulated completion price (won)  36,770,000,000

3 DATA SPECIFICATION

In this study, stock data gathered over twelve consecutive months (August 2014 to July 2015) from the Korea Composite Stock Price Index (KOSPI) was used as the input. The stock data was provided by Koscom. A data sample is listed in Table 1; it consists of the date, type, completion price, completion amount, opening price, high price, low price, price just before, accumulated completion amount, and accumulated completion price. Because there are four types (domestic purchase price (0), domestic selling price (1), foreign purchase price (2), and foreign selling price (3)), the stock price is the sum of thirty-two items. The size of each data set was 168 GB, and the data was collected during the one-year period from August 2014 to July 2015.

4 OUTLINE OF PROPOSED MODEL

In this section, we describe the overall process from the perspective of data analysis and processing: data preprocessing for making continuous data, the search for similar pattern data, and the selection of input data through to the generation of the prediction model.

4.1 Aggregation of Stock Data

Because the tick-by-tick data we have are generated per transaction, the completion price at a given time is zero if no transaction is carried out, as shown in Figure 2 (a). In other words, because the data are non-continuous, it is difficult to predict the price. Consequently, we generate aggregated data at five-minute intervals to obtain a continuous flow of data, as shown in Figure 2 (b).

(b) Completion price after aggregation.
Figure 2: The need for aggregation.

4.2 Searching for Similar Patterns

Above all, it is necessary to generate patterns from the aggregated data before searching for similar patterns. Figure 3 shows the process of patterning the aggregated data. The length of a pattern is one day, and patterns are generated at five-minute intervals, e.g., by the sliding window method, so that pattern matching analysis can draw on various patterns; thus, twelve patterns are generated per hour.

Figure 4 shows similar patterns in a graph of real stock prices. Similar patterns can be found by comparing historical patterns with the current pattern. Among the various methods for pattern matching, we use a hierarchical clustering algorithm that can find similar patterns quickly and simultaneously. The patterns are structured by hierarchical clustering, and similar patterns are neighbor or sibling nodes of the current pattern. If there are only a limited number
Table 2: Results of stepwise regression in real stock data of Hyundai Motor Company.

  Variable                       Domestic purchase price  Domestic selling price  Foreign purchase price  Foreign selling price
  Completion price               O                        O                       O                       O
  Completion amount              O                        O                       O                       O
  Opening price                  X                        X                       O                       O
  High price                     O                        O                       O                       O
  Low price                      O                        O                       O                       O
  Price just before              O                        X                       O                       O
  Accumulated completion amount  X                        O                       O                       O
  Accumulated completion price   X                        X                       X                       X
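The O/X choices in Table 2 come from R's step() function. As a rough stdlib-only illustration of the underlying idea, the sketch below performs backward elimination: variables whose removal barely changes the residual sum of squares (RSS) are dropped. The RSS criterion, threshold, and toy data are simplifications of my own, not the paper's p-value-based procedure.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y, cols):
    """Fit OLS on the selected columns via normal equations; return residual SS."""
    Xs = [[row[c] for c in cols] for row in X]
    XtX = [[sum(r[i] * r[j] for r in Xs) for j in range(len(cols))]
           for i in range(len(cols))]
    Xty = [sum(r[i] * yy for r, yy in zip(Xs, y)) for i in range(len(cols))]
    beta = solve(XtX, Xty)
    return sum((yy - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yy in zip(Xs, y))

def backward_eliminate(X, y, tol=1e-6):
    """Drop variables whose removal increases RSS by at most tol (X in Table 2)."""
    cols = list(range(len(X[0])))
    while len(cols) > 1:
        base = rss(X, y, cols)
        # candidate: the column whose removal increases RSS the least
        worst = min(cols, key=lambda c: rss(X, y, [k for k in cols if k != c]))
        if rss(X, y, [k for k in cols if k != worst]) - base <= tol:
            cols.remove(worst)  # contributes ~nothing: eliminate it
        else:
            break               # every remaining variable matters (O)
    return cols
```

For example, if y depends exactly on the first two columns of X, the third column is eliminated while columns 0 and 1 survive.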
one or more hidden layers. The learning rate increases with the number of hidden layers. However, the connection between input and output could be lost if there are too many hidden layers, and the learning could be disturbed (Dominic et al., 1991).

We employed up to five hidden layers to ensure that the system can bear the processing load, and we created the final model with the number of hidden layers that shows the highest explanatory power (R-squared value) by performing learning in sequence from hidden layer 1 to hidden layer 5 for each stock item. Table 3 summarizes the explanatory power for each hidden layer; the layer with the highest value is layer 3.

Table 3: Explanatory powers according to hidden layers.

  Hidden layer 1  Hidden layer 2  Hidden layer 3  Hidden layer 4  Hidden layer 5
  37.6%           95.5%           95.9%           94.2%           95.3%

5 SYSTEM ARCHITECTURE FOR STOCK PRICE PREDICTION

This section describes the series of operations that were implemented when generating the final artificial neural network model. All the processes were conducted on a cluster consisting of four connected computers (one master and three slaves) with Hadoop and RHive installed.

5.1 Series of Operations for Generating Predicted Stock Data

We propose the following steps to generate a prediction model with big data processing and analysis tools, as shown in Figure 6.

Step 1 (Stock Data Aggregation and Pattern Generation as Data Preprocessing): We stored the one-year stock data provided by Koscom in the Hadoop distributed file system (HDFS) of the Hadoop-based cluster. Because we could not manually modify the MapReduce source code for extracting the desired data from each HDFS of the Hadoop cluster, we used the RHive tool, which provides HiveQL and facilitates the search for the desired data, e.g., through a select query as in an RDBMS. After the data was extracted, it was aggregated at five-minute intervals using R, based on the tick-by-tick data. Then, patterns were generated from the aggregated data for the concatenation of similar patterns in R on the master computer. The size of a pattern was one day and the generation unit was five minutes. The total number of patterns was 17,323.

Step 2 (Pattern Matching with Hierarchical Clustering): To retrieve patterns similar to the current pattern, we used the hclust function in R, which offers two advantages: it can quickly autodetect similar patterns and, at the same time, freely determine the range of similar patterns. Algorithm 1 describes the procedure for finding similar patterns. After inserting the current pattern into the aggregated patterns as a historical dataset, clustered patterns were generated via the hclust function. Then, similar patterns at the same level as the current pattern could be found.
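The clustering step above relies on R's hclust with average linkage. As a stdlib-only sketch of the same idea, the code below builds clusters by agglomerative average-linkage merging and returns the patterns grouped with the current one; cutting at k clusters is my stand-in for choosing a level in the dendrogram, and the names and the value of k are illustrative.

```python
def dist(p, q):
    """Euclidean distance between two equal-length price patterns."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage_clusters(patterns, k):
    """Agglomerative clustering (average linkage), stopping at k clusters.

    Returns a list of clusters, each a list of indices into patterns.
    """
    clusters = [[i] for i in range(len(patterns))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(dist(patterns[a], patterns[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

def similar_patterns(patterns, current_idx, k):
    """Indices clustered together with the current pattern (its siblings)."""
    for cluster in average_linkage_clusters(patterns, k):
        if current_idx in cluster:
            return [i for i in cluster if i != current_idx]
```

After appending the current pattern to the historical patterns, its cluster mates play the role of the "neighbor or sibling nodes" described in Section 4.2.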
Figure 6: Dependent and independent variables should be defined in stepwise regression analysis.

Algorithm 1: Algorithm for pattern matching.
  input : Aggregated_patterns, a list of aggregated patterns;
          current_pattern, the current pattern
  output: similar_patterns, a list of similar patterns after clustering
  1  int last = Aggregated_patterns.length() - 1;
  2  foreach count in current_pattern.length() do
  3      Aggregated_patterns[last][count] = current_pattern[count];
  4  run('sink()');
  5  run('hc <- hclust(dist(Aggregated_patterns), method="ave")');
  6  run('sink("out.txt")');
  7  List result_patterns = Read_File('out.txt');
  8  foreach index in result_patterns.length() do
  9      if result_patterns[index] == current_pattern then
 10          similar_patterns = find_SP(index);
 11  return similar_patterns;

Step 3 (Feature Selection using Stepwise Regression): Given several similar patterns of stock price, insignificant variables among all the variables constituting the price were removed. Algorithm 2 describes the steps for feature selection using stepwise regression. Before selecting the variables, the time of similar patterns was determined, and then the variables at that time were retrieved. Variables with a p-value below a specified threshold were judged to be significant.

Algorithm 2: Algorithm for feature selection in stepwise regression.
  input : similar_patterns, a list of similar patterns;
          variables, a list of all variables constituting the price
  output: remainder, a list of variables excluding the insignificant variables
  1  boolean flag = false;
  2  variables = getVariables(similar_patterns.atTime());
  3  while flag == false do
  4      remainder = run('step(variables, direction="both")');
  5      flag = true;
  6      foreach i in remainder.length() do
  7          if remainder[i].p_value > 0.05 then
  8              flag = false;
  9  return remainder;
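The ANN generation and RMSE verification described in Steps 4 and 5 can be sketched as a toy one-hidden-layer tanh network in pure Python. This is a simplified stand-in for R's neuralnet (which the paper actually uses); the network size, learning rate, epoch count, and toy data are illustrative assumptions, not the paper's configuration.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean square error, the paper's accuracy measure (Step 5)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

class TinyTanhNet:
    """One-input, one-output network with a single tanh hidden layer,
    trained by plain stochastic gradient descent on squared error."""

    def __init__(self, hidden=4, seed=1):
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(w * x + b) for w, b in zip(self.w1, self.b1)]
        out = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return out, hidden

    def train(self, xs, ys, lr=0.05, epochs=500):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                out, hidden = self.forward(x)
                err = out - y  # gradient of 0.5 * (out - y)**2 w.r.t. out
                for i, h in enumerate(hidden):
                    grad_h = err * self.w2[i] * (1.0 - h * h)  # tanh derivative
                    self.w2[i] -= lr * err * h
                    self.w1[i] -= lr * grad_h * x
                    self.b1[i] -= lr * grad_h
                self.b2 -= lr * err
```

In the paper's terms, the training pairs correspond to independent variables at historical time ht and the dependent variable at ht + 1; RMSE before and after training shows the fit improving.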
Step 4 (Predicted Data Generation on an Artificial Neural Network): To create the predicted data, we used an ANN after feature selection. Algorithm 3 describes the steps for generating the predicted data using an ANN. Among the input data, we prepared dependent and independent variables as training data from another time zone, because we would predict the day after the current pattern. Specifically, given the historical time ht of a similar pattern, the time of the dependent variable is ht + 1 and the time of the independent variable is ht. After the independent and dependent variables were bound, we generated an ANN-based model using the neuralnet function provided by R. Then, the independent variables at the current time t were input into the model and the predicted data were generated.

Algorithm 3: Algorithm for generation of predicted data.
  input : tr_dependent, the total completion price at historical time ht + 1;
          tr_independent, the remaining variables, excluding the total
          completion price, at historical time ht;
          te_dependent, the remaining variables, excluding the total
          completion price, at current time t
  output: predicted, a dataset generated by the ANN
  1  run('training <- cbind(tr_dependent, tr_independent)');
  2  run('colnames(training) <- c("output", "input")');
  3  run('ANN_result <- neuralnet(output ~ input, training, hidden=1~5, act.fct="tanh")');
  4  run('predicted <- prediction(ANN_result, te_dependent)');
  5  return predicted;

Step 5 (Verification using RMSE): To verify the validity of the proposed model, we selected RMSE as a measure of prediction accuracy; the function is also provided in R. The measure was computed from comparisons between real and predicted data.

6 EVALUATION

In this section, we describe the one-year test data provided by Koscom and evaluate the accuracy of each stock item by computing the RMSE.

6.1 Dataset and Test Scenario

To prove the effectiveness of the proposed model, we used a real historical stock dataset consisting of various items for the one-year period from August 2014 to July 2015. To measure the prediction accuracy, we prepared three items (Hyundai Motor Company, KIA Motors, and Samsung Electronics) as companies representing the Republic of Korea, with their stock data for August 1, 2014, to July 28, 2015, as the training data, and their stock data for July 29-31, 2015, as the test data. As a test scenario, first, two sets of one-day predicted stock data were generated, one by the proposed model and one by feature selection alone. Then, we checked the prediction accuracy by using the RMSE values to compare the predicted and real stock data.

6.2 Evaluation of Prediction Accuracy

We performed experiments to compute the accuracy of the proposed method. Figure 7 compares the actual data with the two data values predicted by the proposed model and by feature selection alone for July 31, 2015. The x-axis represents the time at five-minute intervals, and the y-axis represents the total completion price, i.e., the stock price over time. First, Figure 7 (a) compares the results for Hyundai Motor Company stock; we can see that the stock movement of the proposed model is closer to the real stock data than that of feature selection alone. This is especially clear in the rising curve of the morning and the declining curve of the afternoon. Figure 7 (b) shows the stock data derived from the real and predicted data for KIA Motors. In contrast to Figure 7 (a), there are slight differences between the stock movement of the proposed model and the real data, and there is no clear view of the rising and declining curves in the graph. Lastly, Figure 7 (c) depicts the stock data derived from the real and predicted data for Samsung Electronics. Compared with the feature-selection-only graph, the stock movement of the proposed model follows the real data closely despite a slight difference in price.

In this study, we selected RMSE as the measure of prediction accuracy to verify the validity of our model because this measure is frequently used in the stock domain. Figure 8 shows the experimental results of the proposed model and of feature selection alone using RMSE. In Figure 8 (a) and (b), we can see that the predictions are good except on July 30, and it is interesting that the affected item is the same. For this reason, we can surmise that there are variables affecting the same theme rather than variables that affect individual stocks; it is necessary to make up for this point. Unlike Figure 8 (a) and (b), Figure 8 (c) shows good prediction for all days. In particular, all the graphs show good predictions on the last day.

(c) Comparison results for Samsung Electronics stock.
Figure 8: RMSE results.

7 CONCLUSIONS

In this paper, we determined that stock prices sparsely show similar patterns and that not all variables have a significant impact on the price. For short-term prediction, we proposed a novel method based on a combination of hierarchical clustering, stepwise regression, and an ANN model in order to find similar historical patterns for each stock item and predict the daily stock price using the optimal significant variables obtained through feature selection. Moreover, we handled the overall process using a big data processing framework based on Hadoop and R. Finally, we demonstrated the prediction accuracy for three stock items using RMSE.

In the future, we plan to enhance the reliability of our model by further investigating big and small pattern matching and analysis. In addition, we will develop a distributed parallel algorithm and predict all the stock items instead of only some of them.

ACKNOWLEDGEMENTS

This work was supported by the Research Program funded by the Korea Centers for Disease Control and Prevention (fund code #2015-E33016-00).