Friday, February 17, 2017

Answer to Problem Data Analysis Competition (DAC) 2017

How are you, friends? It has been a while since I shared anything here; I have been busy with work, deadlines, my thesis, conferences, and competitions. This time I want to share one of the data analysis competitions that is quite popular, especially among Statistics students. The Data Analysis Competition, better known as DAC, is a data analysis competition organized by HIMASTA and HIMADATA ITS. This is my first time writing about DAC, even though I joined in previous years and never made it through (I never worked on it seriously, and my skills were not there yet).
One thing that really surprised me this time is that the problems were far from what I expected. In the three previous years the problems stopped at regression, but now they have moved into other areas that are quite surprising. One of the problems involves Machine Learning; if you already know what that is, great! If not, ask Google. Okay, without further ado, here are the answers:

The data concerns film ratings from IMDb: 4,974 raw records covering films from 1920 to 2016. I am sharing it exactly as I answered it, in English, so consider it extra practice!

Firstly !!
1.             Net Profit

The first step in describing the Net Profit of each film from 1920-2015 is to subtract the Budget variable from the Gross variable, which gives the Net Profit variable. Net Profit is then displayed with ggplot to see the spread for each film, with Year on the x-axis and Net Profit on the y-axis. The second step is to sum Net Profit by Year and display the totals as a line plot. The results of this descriptive analysis are shown in Figure 1.1 and Figure 1.3.
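Here is a minimal R sketch of how the Net Profit variable and the two plots could be produced, assuming the raw data is in a data frame named "movie" with columns "gross", "budget", and "title_year" (these names are assumptions, not part of the competition data dictionary):

```r
library(dplyr)
library(ggplot2)

# Net Profit = Gross - Budget for every film
movie <- movie %>% mutate(net_profit = gross - budget)

# Figure 1.1 style scatter plot: Year on the x axis, Net Profit on the y axis
# (for Figure 1.2 the same plot is drawn on a log-transformed Net Profit)
ggplot(movie, aes(x = title_year, y = net_profit)) +
  geom_point(alpha = 0.4) +
  labs(x = "Year", y = "Net Profit")

# Figure 1.3 style line plot: Net Profit summed per year
profit_by_year <- movie %>%
  group_by(title_year) %>%
  summarise(total_net_profit = sum(net_profit, na.rm = TRUE))

ggplot(profit_by_year, aes(x = title_year, y = total_net_profit)) +
  geom_line() +
  labs(x = "Year", y = "Sum of Net Profit")
```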

Figure 1.1 Net Profit Plot
To clarify the spread of the data in Figure 1.1, the data can be log-transformed and displayed again with ggplot, as shown in Figure 1.2.

Figure 1.2 Transformation (Log) Net Profit Plot
Based on Figure 1.2, the number of films keeps increasing from 1920 to 2015, as shown by the spread of the points in the plot.

Figure 1.3 Sum of Net Profit by Year
The highest total net profit belongs to films produced in 2012, at $2,999,664,363, while the lowest belongs to films produced in 2006, at -$12,108,032,279. This information is read from the line plot in Figure 1.3.
            From year to year the total net profit can be said to be stable, but it began to fluctuate in the 1990s, and in 2005-2006 net profit differed sharply from previous years. In 2007 the film industry bounced back and restored the stability of net profit seen in earlier years. The highest net profit occurred in 2012.

2.             Film Category Based on IMDb Score
To build film categories by how well liked the films are, based on IMDb Score, a plot was made of the spread of IMDb Score against Content Rating for each film, shown in Figure 2.1.

Figure 2.1. Content Rating VS IMDb Score
            To classify films by content rating and by how liked they are, while still using the IMDb Score, look at the plot above. Most films carry the content ratings Parental Guidance (PG), Parents Strongly Cautioned (PG-13), or Restricted (R), with a mean IMDb Score above 6.45. Only a few films fall outside PG, PG-13, and R: 4,436 of the 4,974 films have content rating R, PG-13, or PG. More in the following figure:

Figure 2.2 Order of film category by count of movie. 
This graph orders the movie categories (content_rating) by the most liked movies based on imdb_score. The authors selected the well liked films using the average imdb_score, then filtered to keep the films with an imdb_score above the overall average. Of the 4,974 films, 2,722 have an imdb_score above the overall average. Based on the chart above, 2,408 of those 2,722 films come from the content ratings R, PG-13, and PG. Most films are produced with these content ratings, so when the IMDb Scores within each content rating are summed, these three content ratings (R, PG-13, PG) come out on top.
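A hedged sketch of the filtering described above, again assuming a data frame "movie" with columns "imdb_score" and "content_rating":

```r
library(dplyr)

# Overall mean IMDb score (about 6.45 according to the text)
mean_score <- mean(movie$imdb_score, na.rm = TRUE)

# Keep only the films rated above the overall average (2722 films in the text)
liked <- movie %>% filter(imdb_score > mean_score)

# Figure 2.2: number of above-average films per content rating, largest first
liked %>% count(content_rating, sort = TRUE)

# Figure 2.3: total imdb_score per content rating, largest first
liked %>%
  group_by(content_rating) %>%
  summarise(total_score = sum(imdb_score)) %>%
  arrange(desc(total_score))
```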

Figure 2.3 Order of film category by imdb_score
The graph above shows that, of the films produced during 1920-2016, almost all of those selected are above 5.9 in IMDb Score, and the films with an IMDb Score over 5.9 are mostly films with content rating R, PG-13, or PG. This is consistent with the earlier ordering of content ratings by IMDb Score.

3.             Making Groups Using Cluster Analysis
To group the movies based on their income and IMDb Score, we used K-means clustering with the Davies-Bouldin index to determine the number of clusters to form from the data. The analysis gives four clusters, shown in Figure 3.1.
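One way this could be set up in R is with kmeans() together with the Davies-Bouldin index from the "clusterSim" package; the package choice, the scaling step, and the range of k are assumptions, not necessarily what produced Figure 3.1:

```r
library(clusterSim)  # provides index.DB, the Davies-Bouldin index

# Cluster the films on income (gross) and IMDb score; scale both variables
# so they contribute comparably to the distances
X <- scale(na.omit(movie[, c("gross", "imdb_score")]))

# Try k = 2..8 clusters and record the Davies-Bouldin index of each solution
set.seed(123)
db <- sapply(2:8, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)
  index.DB(X, cl$cluster)$DB
})

k_best <- (2:8)[which.min(db)]  # a smaller DB index means better separation
final  <- kmeans(X, centers = k_best, nstart = 25)

# Cluster plot in the spirit of Figure 3.1
plot(X, col = final$cluster, xlab = "Gross", ylab = "IMDb Score")
```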

Figure 3.1 Cluster Plot
The analysis shows Cluster 1 in red, Cluster 2 in green, Cluster 3 in blue, and Cluster 4 in purple. Each cluster has characteristics that distinguish it from the others. Clusters 1, 2, and 4 have lower Gross; the difference between them is that Cluster 1 has the highest IMDb Score, Cluster 2 the lowest, and Cluster 4 a medium IMDb Score, since Cluster 4 lies between Clusters 1 and 2. Cluster 3 has the highest Gross of all but an intermediate IMDb Score, as can be seen from the spread of the data in Figure 3.1.

4.             Variables with the Strongest to Weakest Correlation with IMDb Score (Correlation Matrix)
To find the variables that are correlated with the IMDb Score, we use the Pearson correlation coefficient, developed by the British statistician Karl Pearson.
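For reference, the sample Pearson correlation coefficient between two variables x and y over n films is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
```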

The correlation formula is implemented in R version 3.3.2 as a correlation routine. The correlation analysis in R gives the correlation matrix shown in Figure 4.1 below:

Figure 4.1 Correlation Matrix
After obtaining the correlation of each variable with the IMDb Score, the variables are ordered from the largest to the smallest correlation. The larger the correlation of a variable with the IMDb Score, the stronger its relationship with the IMDb Score, and vice versa. The ordering is shown in the table below:
Table 3.1 Correlation Table

Ranking | Variable                   | Correlation Value | Kind of Correlation
1       | gross                      | 0.09              | Positive
2       | director_facebook_likes    | 0.07              | Positive
3       | num_voted_users            | 0.05              | Positive
4       | title_year                 | -0.04             | Negative
5       | num_user_for_reviews       | 0.03              | Positive
5       | budget                     | 0.03              | Positive
7       | num_critic_for_reviews     | 0.02              | Positive
7       | movie_facebook_likes       | 0.02              | Positive
7       | cast_total_facebook_likes  | 0.02              | Positive
7       | actor_1_facebook_likes     | 0.02              | Positive
11      | facenumber_in_poster       | 0.01              | Positive
11      | aspect_ratio               | 0.01              | Positive
11      | actor_3_facebook_likes     | 0.01              | Positive
14      | actor_2_facebook_likes     | 0.00              | No Correlation
14      | duration                   | 0.00              | No Correlation

According to the table above, of the 15 variables whose correlation with the IMDb Score was tested, 12 variables are positively correlated, 1 variable ("title_year") is negatively correlated, and 2 variables ("actor_2_facebook_likes" and "duration") show no correlation with the IMDb Score.
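A minimal sketch of how this correlation matrix and ranking could be reproduced in R (variable names follow the table above; the data frame name "movie" is an assumption):

```r
# Pearson correlation of every numeric variable with imdb_score
num_vars <- movie[, sapply(movie, is.numeric)]
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")

# Correlations with imdb_score, dropping imdb_score itself
cor_with_score <- cor_mat[, "imdb_score"]
cor_with_score <- cor_with_score[names(cor_with_score) != "imdb_score"]

# Rank by the strength (absolute value) of the correlation, as in Table 3.1
ranking <- sort(abs(cor_with_score), decreasing = TRUE)
round(cor_with_score[names(ranking)], 2)
```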

5.             Machine Learning Methods to Determine the IMDb Rating of the Films
There are many estimation methods that use Machine Learning; here the authors compare several of them (K-Nearest Neighbor, Neural Network, and Deep Learning) and determine the best method for estimating the IMDb Score of each movie based on the Sum of Square Error. To obtain the error value, the authors trained the models on the training data and then tested them at the validation stage.
The authors partitioned the data so that the training data is 80% of the total and the validation data is 20% of the total, which gives 3,981 films for training and 993 films for validation.
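A minimal sketch of such an 80/20 split in base R; the data frame name "movie_clean" and the seed are assumptions:

```r
# 80% training data, 20% validation data
set.seed(2017)
n         <- nrow(movie_clean)
train_id  <- sample(seq_len(n), size = floor(0.8 * n))
train_set <- movie_clean[train_id, ]
valid_set <- movie_clean[-train_id, ]

nrow(train_set)  # about 3981 films in the text
nrow(valid_set)  # about 993 films in the text
```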
After estimating with K-Nearest Neighbor, Neural Network, and Deep Learning, the Sum of Square Error of each method is as follows:
Table 3.2 Sum of Square Error Table

K Nearest Neighbor | Neural Network | Deep Learning
2125.13            | 1208.63        | 1491.052965
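The criterion itself is simple: for each method, the predictions on the validation films are compared with the observed scores. A small illustration (the numbers are made up, not taken from the competition data):

```r
# Sum of Square Error between observed and predicted IMDb scores
sse <- function(actual, predicted) sum((actual - predicted)^2)

# Tiny illustration with made-up values
sse(c(6.5, 7.0, 5.8), c(6.2, 7.1, 6.0))  # = 0.14
```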
Based on the Sum of Square Error table, the most accurate method for estimating IMDb_Score is the Neural Network with 5 hidden neurons. The Neural Network gives the smallest Sum of Square Error, 1208.63, meaning the lowest error, so it is more accurate than the other two methods tried. To continue to the prediction, the weights from each input to each hidden neuron are used, as in the following table:
Table 3.3 Weights from Input Layer to Hidden Layer

Input Layer                | Hidden 1 | Hidden 2 | Hidden 3 | Hidden 4 | Hidden 5
Intercept                  | -2.5845  | -0.2245  | -0.1704  | -1.9172  | 0.60954
num_user_for_reviews       | -0.4841  | 0.59466  | 0.32564  | 0.32365  | 1.62698
num_critic_for_reviews     | -1.9198  | 0.42584  | -2.2413  | 0.45193  | -0.5645
duration                   | -1.0219  | -0.4442  | 0.64074  | 1.71278  | -0.0026
director_facebook_likes    | 0.54627  | -0.4448  | -0.5659  | 0.89154  | 0.2535
movie_facebook_likes       | 1.98427  | 0.15636  | 0.74031  | -0.6089  | 0.34619
cast_total_facebook_likes  | 1.06183  | 0.57583  | -1.4384  | -1.7641  | -0.883
actor_1_facebook_likes     | -0.4298  | -0.2786  | 0.13606  | -1.1822  | -1.6882
actor_2_facebook_likes     | 1.00763  | 0.10771  | -2.54    | -0.5619  | -2.172
actor_3_facebook_likes     | 0.46108  | -0.3232  | 2.09136  | 0.59369  | 1.79963
gross                      | 1.90723  | 0.01976  | 0.56595  | -0.4777  | 0.05529
num_voted_users            | -0.3271  | -1.1465  | 0.082    | 0.65271  | 1.32115
facenumber_in_poster       | 0.45345  | -1.161   | 1.10857  | 1.87662  | 0.42533
budget                     | -0.3141  | 0.24404  | 1.43579  | 0.22287  | -1.3692
title_year                 | -0.1203  | 0.34752  | 0.38051  | -0.9772  | 1.9438
aspect_ratio               | 0.0197   | 0.0705   | 1.54709  | 1.16785  | 0.72714
After obtaining the weights into the hidden layer, we look for the weights used to make the prediction of the IMDb Score. Here are the weights from each hidden neuron to the output layer:
Table 3.4 Weights from Hidden Layer to Output Layer

Hidden Layer    | IMDb Score
Intercept       | 2.760557
Hidden Layer 1  | 0.164642
Hidden Layer 2  | -0.05371868
Hidden Layer 3  | 3.661651
Hidden Layer 4  | -0.2461656
Hidden Layer 5  | 0.1176617
Based on the table above, a network architecture can be created to predict the value of the IMDb Score. The architecture has 15 input neurons, 5 hidden neurons in one hidden layer, and 1 output neuron, as shown in Figure 5.1.

Figure 5.1 Network Architecture
The network architecture above is a simple feedforward network, in which the signal moves from the input units through the hidden layer and finally reaches the output unit (a stable, well-behaved structure). Feedforward networks consist of neurons arranged in several layers. The input layer is not made of neurons; it only supplies the values of the variables. The hidden layer and the output layer consist of neurons connected to the previous layer, either to some of its units or to all of them. This network structure is used to predict the value of imdb_score and converged after 17,300 iterations (steps).
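A minimal sketch of how such a feedforward network could be fitted in R with the "neuralnet" package, using the 15 inputs of Table 3.3 and one hidden layer with 5 neurons; the seed and other settings are assumptions, so the weights will not reproduce Tables 3.3 and 3.4 exactly:

```r
library(neuralnet)

# imdb_score as the response, the 15 inputs of Table 3.3 as predictors
nn_formula <- imdb_score ~ num_user_for_reviews + num_critic_for_reviews +
  duration + director_facebook_likes + movie_facebook_likes +
  cast_total_facebook_likes + actor_1_facebook_likes +
  actor_2_facebook_likes + actor_3_facebook_likes + gross +
  num_voted_users + facenumber_in_poster + budget + title_year +
  aspect_ratio

# One hidden layer with 5 neurons, numeric output
set.seed(2017)
nn_fit <- neuralnet(nn_formula, data = train_set, hidden = 5,
                    linear.output = TRUE)

nn_fit$weights  # weights of the kind reported in Tables 3.3 and 3.4
plot(nn_fit)    # network diagram similar to Figure 5.1
```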

6.             Testing the Machine Learning Method
The machine built with the learning algorithm, which uses pattern recognition on the data, is used to make predictions of the IMDb Score. Here are the IMDb_Score predictions produced by the machine that was tested:

Figure 6.1 Diagram of Predicted IMDb Scores
Having obtained the network model in the previous discussion, that model is used to predict the value of IMDb_Score. After the IMDb_Test data is fed as the input layer into the machine that was built and trained on the IMDb_Clean data, we obtain IMDb_Score values for the 20 movies listed in the chart above. All predicted IMDb_Score values are above 6.00, and the average over the 20 predicted films is about 6.4, not much different from the previous data (4,974 films), whose average is 6.45. "L.A. Confidential" is the film with the highest predicted IMDb_Score from the feedforward neural network.
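A hedged sketch of how the trained network could be applied to the test films with neuralnet::compute(); the data frame name "imdb_test" and the column "movie_title" are assumptions:

```r
# Predict IMDb scores for the films in the test data
newdata <- imdb_test[, all.vars(nn_formula)[-1]]   # same 15 inputs as training
pred    <- neuralnet::compute(nn_fit, newdata)
imdb_test$imdb_score_pred <- as.vector(pred$net.result)

# Average predicted score (about 6.4 in the text) and the top film
mean(imdb_test$imdb_score_pred)
imdb_test[which.max(imdb_test$imdb_score_pred),
          c("movie_title", "imdb_score_pred")]
```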



7.             Conclusion and Suggestion
Based on the results above, we can draw several conclusions. The detailed conclusions are as follows:
a.       The year with the highest total net profit (surplus) is 2012 and the year with the lowest total net profit (deficit) is 2006.
b.    The mean IMDb_Score over all films from 1920-2016 is 6.45, and 2722 movies are above this average. Of those 2722 films, 2408 have content rating R, PG-13, or PG.
c.       Four clusters were obtained. Clusters 1, 2, and 4 have lower Gross; the difference between them is that Cluster 1 has the highest IMDb Score, Cluster 2 the lowest, and Cluster 4 a medium IMDb Score, since Cluster 4 lies between Clusters 1 and 2. Cluster 3 has the highest Gross of all but an intermediate IMDb Score.
d.    Of the 15 variables whose correlation with the IMDb Score was tested, "gross" has the strongest positive correlation, one variable ("title_year") is negatively correlated, and two variables ("actor_2_facebook_likes" and "duration") show no correlation with the IMDb Score.
e.   Of the three methods compared (KNN, Neural Network, and Deep Learning), the Neural Network has the best accuracy, with the lowest Sum of Square Error.
f.        The predictions show that the average IMDb Score of the 20 films (imdb_test_data) is 6.4, and the film "L.A. Confidential" has the highest predicted IMDb Score, 6.65.
Suggestions for further analysis are as follows:
a.    To reduce the subjectivity of these results, more machines (models) are needed, so that the analysis becomes more accurate.
b.       Additional methods should be tried as comparisons to the ones tested here, in order to obtain the best method for predicting the IMDb Score.
c.        More variables, such as film genre, are needed to obtain representative results.

Thank you!!
Don't hesitate to leave suggestions and comments!
