How are you, friends? It has been a long time since I last shared anything. I have been busy with work, deadlines, my thesis, conferences, and competitions. This time I will share one of the data analysis competitions that is quite popular, especially among Statistics students. The Data Analysis Competition, often known as DAC, is a data analysis competition organized by HIMASTA and HIMADATA ITS. This is the first time I am sharing about DAC, although I took part in previous years and never made it through (I never worked on it seriously, and my knowledge was not up to it).
One thing that really surprised me in this competition was the problem set, which was far from what I expected. In the previous 3 years the problems topped out at regression, but now they have expanded into other areas, which was quite a surprise. One of the problems concerns Machine Learning. You know what that is? Good! If you don't, please ask Grandpa Google. OK, let's go straight to the answers:
The data concerns film ratings from IMDb: 4974 rows of data covering films from 1920 to 2016. I am sharing the answers as I submitted them. They are in English, so, well, consider it practice, haha!
Firstly !!
1. Net Profit
The first step in describing the Net Profit of each film from
1920 to 2015 is to subtract the Budget variable from the Gross variable, which
gives the Net Profit variable. Net Profit is then plotted with ggplot to
see the spread across films, with Year on the "x" axis and Net
Profit on the "y" axis. The second step is to sum Net
Profit by Year and display it as a line plot. The results of this descriptive analysis
are shown in Figure 1.1 and Figure 1.3.
Figure 1.1 Net Profit Plot
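To make the two steps concrete, here is a minimal sketch in Python (my original work was done in R, so this is an illustration, not the code I submitted). The three sample films and the helper names `net_profit` and `net_profit_by_year` are made up for the example; only the column names `gross`, `budget`, and `title_year` come from the data set.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data set has 4974 films.
films = [
    {"title": "A", "title_year": 2005, "gross": 120_000_000, "budget": 80_000_000},
    {"title": "B", "title_year": 2005, "gross": 30_000_000, "budget": 50_000_000},
    {"title": "C", "title_year": 2006, "gross": 10_000_000, "budget": 90_000_000},
]


def net_profit(film):
    """Step 1: net profit of one film is gross minus budget."""
    return film["gross"] - film["budget"]


def net_profit_by_year(rows):
    """Step 2: sum the net profit of all films released in the same year."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["title_year"]] += net_profit(row)
    return dict(totals)


yearly = net_profit_by_year(films)
best_year = max(yearly, key=yearly.get)   # year with the highest total
worst_year = min(yearly, key=yearly.get)  # year with the lowest total
```

The yearly totals are what a line plot like Figure 1.3 would display, and the best and worst years are simply the maximum and minimum of those totals.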
To make the
spread of the data in Figure 1.1 clearer, the data can be transformed with
"log" and plotted again with ggplot, as shown in Figure 1.2.
Figure 1.2 Transformation (Log) Net Profit Plot
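One caveat about the log transformation: a plain logarithm is undefined for zero or negative values, and net profit can be negative. I did not record how that was handled, so the sketch below shows one common workaround, a signed log, purely as an assumption. The function name `signed_log10` and the sample values are hypothetical.

```python
import math


def signed_log10(value):
    """Compress large magnitudes but keep the sign: sign(v) * log10(1 + |v|)."""
    return math.copysign(math.log10(1.0 + abs(value)), value)


# Hypothetical net profit values, including losses and zero.
profits = [-80_000_000, -20, 0, 999, 40_000_000]
transformed = [signed_log10(p) for p in profits]
```

The transformation keeps the ordering of the films and maps zero to zero, so losses still plot below the axis while the huge range of values becomes readable.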
Figure
1.2 shows that the number of films keeps increasing from
1920 to 2015, as seen in the spread of the data in the plot.
Figure 1.3 Sum of Net Profit by Year
The highest net profit is for films produced in 2012, with a total of $2,999,664,363, and
the lowest is for films produced in 2006, with a total of -$12,108,032,279 (a deficit).
This information comes from the line plot in Figure 1.3.
From year to year the total net
profit is fairly stable, but it starts to fluctuate in the 1990s, and
from 2005 to 2006 it differs sharply from previous years. After
that, 2007 marks a recovery, with net profit returning to the stable level of
previous years. The highest net profit is in 2012.
2. Film Category Based on IMDb Score
To categorize films
by how much they are liked, based on IMDb Score, we plotted
IMDb Score against Content Rating for each film, as shown
in Figure 2.1.
Figure 2.1. Content Rating VS IMDb Score
To classify films by
content rating and by the most liked movies on Facebook, while still following IMDb
Score, we can read off the plot above. Most
films have a Content Rating of Parental Guidance (PG), Parents Strongly Cautioned
(PG-13), or Restricted (R), with a mean IMDb Score above 6.45. Only a
few films fall outside PG, PG-13, and R: it turns out that
4436 of the 4974 films are rated R, PG-13, or PG. More detail is in the following
figure:
Figure 2.2 Order of film category by count of movie.
This graph orders the movie categories
(content_rating) by how many well-liked movies they contain, based on imdb_score.
We used the average imdb_score as the threshold for a well-liked film,
then filtered for the films
above that average. Of the 4974 films, 2722
have an IMDb_Score above the overall average. According to the chart above,
2408 of those 2722 films are rated R, PG-13, or PG.
Most films are produced with these content ratings, so when the
IMDb Scores within each content rating are summed, these three ratings (R,
PG-13, PG) rank highest.
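The filtering step described above can be sketched as follows (Python for illustration; the original was done in R). The six sample films and the function name `above_average_by_rating` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical (content_rating, imdb_score) pairs; the real data has 4974 films.
films = [
    ("R", 7.1), ("R", 5.2), ("PG-13", 6.8),
    ("PG", 6.9), ("G", 4.0), ("PG-13", 7.5),
]


def above_average_by_rating(rows):
    """Count, per content rating, the films scoring above the overall mean."""
    overall = mean(score for _, score in rows)
    counts = Counter(rating for rating, score in rows if score > overall)
    return overall, counts


overall_mean, counts = above_average_by_rating(films)
ordered = counts.most_common()  # ratings ordered by number of well-liked films
```

On the real data, the same counting is what puts R, PG-13, and PG at the top of the ordering.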
Figure 2.3 Order of film category by imdb_score
The graph above shows that, of the films produced from
1920 to 2016, almost all have an IMDb_Score above 5.9. Most of the films
with an IMDb_Score above 5.9 are rated R, PG-13, or
PG. This is consistent with the previous explanation of the ordering of
content ratings by well-liked films based on IMDb Score.
3. Make a Group Using Cluster Analysis
To group movies by their income and
IMDb Score, we used K-means clustering, with the Davies-Bouldin index to determine the number
of clusters to form from the data. The analysis gives
four clusters, shown in Figure 3.1.
Figure 3.1 Cluster Plot
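For readers who have not met the Davies-Bouldin index: it compares how spread out each cluster is with how far apart the cluster centers are, and a lower value means a better grouping. The usual approach is to run K-means for several values of k and keep the k with the lowest index. Below is a small pure-Python sketch of the index itself (not the R code I used); the function names and the toy points are made up for the example.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance from each point to its own cluster center.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy example: the same four points grouped well and grouped badly.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
```

Here `davies_bouldin(tight)` is much smaller than `davies_bouldin(loose)`, which is the behavior used to pick the number of clusters.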
In the plot, Cluster 1 is red,
Cluster 2 is green, Cluster 3 is blue, and Cluster 4 is purple. Each cluster
has characteristics that distinguish it from the
others. Cluster 1, Cluster 2, and Cluster 4 all have lower Gross;
the difference is that Cluster 1 has the highest IMDb Score, Cluster 2 has the
lowest IMDb Score, and Cluster 4 has a medium IMDb Score, sitting
between Cluster 1 and Cluster 2. Cluster 3, in contrast,
has the highest Gross of all but an intermediate
IMDb Score, as can be seen from the spread of the data in Figure 3.1.
4. Variables with Greatest to Weakest Correlation with IMDb Score (Correlation Matrix)
To find the variables that are correlated with IMDb
Score, we used the correlation formula devised by the British scientist
Pearson.
The correlation
formula was implemented as a correlation routine in R version 3.3.2.
The correlation analysis in R produced the
correlation matrix shown in Figure 4.1 below:
Figure 4.1 Correlation Matrix
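As a reminder of what the Pearson correlation computes, here is a minimal pure-Python sketch (the matrix above came from R, so this is only an illustration). The function name `pearson_r` and the sample numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: correlate each one with the score, then rank by strength.
score = [6.0, 6.5, 7.0, 8.0]
columns = {"up": [1, 2, 3, 5], "down": [9, 7, 6, 2], "mixed": [3, 1, 4, 2]}
ranking = sorted(columns, key=lambda name: abs(pearson_r(columns[name], score)),
                 reverse=True)
```

Ranking by the absolute value puts the strongest relationships first regardless of sign, and the sign then tells whether the relationship is positive or negative.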
After obtaining each variable's correlation with
IMDb Score, we ordered the variables from largest to smallest. The larger
a variable's correlation value, the stronger its
correlation with IMDb Score, and vice versa. The ordering is shown in the
table below:
Table 3.1 Correlation Table
| Ranking | Variables | Correlation Value | Kind of Correlation |
|---|---|---|---|
| 1 | gross | 0.09 | Positive |
| 2 | director_facebook_likes | 0.07 | Positive |
| 3 | num_voted_users | 0.05 | Positive |
| 4 | title_year | 0.04 | Negative |
| 5 | num_user_for_reviews | 0.03 | Positive |
| 5 | budget | 0.03 | Positive |
| 7 | num_critic_for_reviews | 0.02 | Positive |
| 7 | movie_facebook_likes | 0.02 | Positive |
| 7 | cast_total_facebook_likes | 0.02 | Positive |
| 7 | actor_1_facebook_likes | 0.02 | Positive |
| 11 | facenumber_in_poster | 0.01 | Positive |
| 11 | aspect_ratio | 0.01 | Positive |
| 11 | actor_3_facebook_likes | 0.01 | Positive |
| 14 | actor_2_facebook_likes | 0 | No Correlation |
| 14 | duration | 0 | No Correlation |

According to the table above, of the 15
variables tested for correlation with IMDb Score, 12 are
positively correlated, 1 is negatively correlated ("title_year"),
and 2 have no correlation with IMDb Score ("actor_2_facebook_likes"
and "duration").
5. Machine Learning Method for IMDb to Determine the Rate of the Films
Many Machine Learning methods
can be used for estimation. Here the authors compare
several of them (K Nearest Neighbor, Neural Network, and Deep
Learning) and determine the best method for
estimating the IMDb Score of each movie based on the Sum of Square Error.
To obtain the error, the authors first trained each model on the training data until
its accuracy was high, then tested it at the validation stage.
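Of the three methods, K Nearest Neighbor is the easiest to explain in code: to estimate a film's score, find the k most similar films in the training data and average their scores. The sketch below is a bare-bones Python illustration with made-up data, not the R routine used in the competition; `knn_predict` and the default `k=3` are my own choices for the example.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return sum(y for _, y in neighbors) / len(neighbors)


# Hypothetical one-feature training set: (feature,) -> score.
train_x = [(0,), (1,), (2,), (10,)]
train_y = [5.0, 6.0, 7.0, 9.0]
estimate = knn_predict(train_x, train_y, query=(1,), k=3)
```

With several features on very different scales (likes, budget, duration), the features would normally be rescaled first so that one large-valued column does not dominate the distance.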
The authors partitioned the
data into training data (80% of the total) and validation data
(20% of the total). This gives 3981 films for
training and 993 films for validation.
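A plain random 80/20 partition can be sketched like this in Python. The function name `train_validation_split` and the seed are assumptions for the example. A simple cut at 80% of 4974 rows gives 3979 and 995 films, slightly different from the 3981 and 993 reported above, because the exact counts depend on the partitioning routine, which I did not record here.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 4974 films.
train, valid = train_validation_split(range(4974))
```

The important property is that no film appears in both parts, so the validation error is measured on films the model has never seen.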
After estimating with K Nearest Neighbor, Neural
Network, and Deep Learning, we obtained the following Sum of Square Error values
for the three methods:
Table 3.2 Sum of Square Error
| K Nearest Neighbor | Neural Network | Deep Learning |
|---|---|---|
| 2125.13 | 1208.63 | 1491.052965 |
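The Sum of Square Error adds up the squared gap between each actual score and its estimate, so a smaller value means the estimates are closer. A minimal Python sketch, using the three reported values to pick the winner (the function name is my own):

```python
def sum_of_squared_errors(actual, predicted):
    """Sum over all films of (actual score - predicted score) squared."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Values reported in Table 3.2.
reported_sse = {
    "K Nearest Neighbor": 2125.13,
    "Neural Network": 1208.63,
    "Deep Learning": 1491.052965,
}
best_method = min(reported_sse, key=reported_sse.get)
```

If these sums were taken over the 993 validation films on the original score scale (an assumption, since the scale is not stated here), 1208.63 would correspond to a root mean squared error of about 1.10 IMDb points per film.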

Based on the Sum of Square Error table, we
conclude that the most accurate method for estimating IMDb_Score is the Neural
Network with 5 hidden units. The Neural Network estimates
have the smallest Sum of Square Error, 1208.63. This means its
predictions have the smallest error, so it is more accurate than the other two methods tried.
To continue with the prediction, the network uses the following weights from each Input Layer unit to each Hidden
Layer unit:
Table 3.3 Weight of Input Layer to Hidden Layer
| Input Layer | Hidden 1 | Hidden 2 | Hidden 3 | Hidden 4 | Hidden 5 |
|---|---|---|---|---|---|
| Intercept | 2.5845 | 0.2245 | 0.1704 | 1.9172 | 0.60954 |
| num_user_for_reviews | 0.4841 | 0.59466 | 0.32564 | 0.32365 | 1.62698 |
| num_critic_for_reviews | 1.9198 | 0.42584 | 2.2413 | 0.45193 | 0.5645 |
| duration | 1.0219 | 0.4442 | 0.64074 | 1.71278 | 0.0026 |
| director_facebook_likes | 0.54627 | 0.4448 | 0.5659 | 0.89154 | 0.2535 |
| movie_facebook_likes | 1.98427 | 0.15636 | 0.74031 | 0.6089 | 0.34619 |
| cast_total_facebook_likes | 1.06183 | 0.57583 | 1.4384 | 1.7641 | 0.883 |
| actor_1_facebook_likes | 0.4298 | 0.2786 | 0.13606 | 1.1822 | 1.6882 |
| actor_2_facebook_likes | 1.00763 | 0.10771 | 2.54 | 0.5619 | 2.172 |
| actor_3_facebook_likes | 0.46108 | 0.3232 | 2.09136 | 0.59369 | 1.79963 |
| gross | 1.90723 | 0.01976 | 0.56595 | 0.4777 | 0.05529 |
| num_voted_users | 0.3271 | 1.1465 | 0.082 | 0.65271 | 1.32115 |
| facenumber_in_poster | 0.45345 | 1.161 | 1.10857 | 1.87662 | 0.42533 |
| budget | 0.3141 | 0.24404 | 1.43579 | 0.22287 | 1.3692 |
| title_year | 0.1203 | 0.34752 | 0.38051 | 0.9772 | 1.9438 |
| aspect_ratio | 0.0197 | 0.0705 | 1.54709 | 1.16785 | 0.72714 |

After obtaining the weights into the Hidden Layer, we look
at the weights used to predict IMDb Score. Here are the weights
from each Hidden Layer unit to the Output Layer:
Table 3.4 Weight of Hidden Layer to Output Layer
| Hidden Layer | IMDb Score |
|---|---|
| Intercept Hidden Layer | 2.760557 |
| Hidden Layer 1 | 0.164642 |
| Hidden Layer 2 | 0.05371868 |
| Hidden Layer 3 | 3.661651 |
| Hidden Layer 4 | 0.2461656 |
| Hidden Layer 5 | 0.1176617 |

Based on the table above, we can draw the network
architecture used to predict IMDb Score. The architecture has 15 input
nodes, 5 hidden nodes, and 1 output node, as shown in Figure 5.1.
Figure 5.1 Network Architecture
The network above is a simple feedforward
network: the signal moves from the input units through the
hidden layer and finally reaches the output unit, which gives the structure
stable behavior. Feedforward networks have neurons arranged
in several layers. The input layer does not consist of neurons; it only
feeds in the values of the variables. Neurons in the hidden layer and output
layer are connected to the previous layer,
either to some of its units or
to all of them. This
network structure is used to predict the value of imdb_score, with 17,300
iterations (steps).
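To show how a prediction flows through a 15-5-1 network, here is a small forward-pass sketch in Python. Several things in it are assumptions rather than facts from my analysis: the logistic activation in the hidden layer, the linear output node, and inputs that have already been scaled. The weights are placeholders chosen so the arithmetic is easy to follow; they are not the values from the tables above.

```python
import math


def logistic(z):
    """Assumed hidden-layer activation (not confirmed by the write-up)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_bias, output_weights, output_bias):
    """One pass: inputs -> hidden nodes -> single linear output."""
    hidden = [
        logistic(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_bias)
    ]
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))


# Placeholder network with the same shape as Figure 5.1: 15 inputs, 5 hidden, 1 output.
inputs = [0.0] * 15
hidden_weights = [[0.0] * 15 for _ in range(5)]  # one row of 15 weights per hidden node
hidden_bias = [0.0] * 5                          # plays the role of the "Intercept" row
output_weights = [0.2] * 5
output_bias = 6.0
prediction = forward(inputs, hidden_weights, hidden_bias, output_weights, output_bias)
```

With all hidden weights at zero, every hidden node outputs 0.5, so the placeholder prediction is 6.0 plus five times 0.2 times 0.5. Training is what replaces these placeholders with weights that fit the data.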
6. Test of Machine Learning Method
The machine
built by the learning algorithm, which uses pattern recognition on the data,
is used to make predictions of IMDb Score. Here are the predicted
IMDb_Score values from the tested machine:
Figure 6.1 Diagram of IMDb Score Predicted
With
the network model from the previous section, we
predict the value of IMDb_Score. The IMDb_Test data is fed as the
input layer into the machine that was built and trained on the
IMDb_Clean data, which gives IMDb_Score values for the 20 movies listed in the
chart above. All predicted IMDb_Score values are above 6.00, and the average
for the 20 films is about 6.4, not much different from
the previous data (4974 films), whose average is 6.45. "L.A.
Confidential" has the highest predicted IMDb_Score
according to the feedforward
neural network.
7. Conclusion and Suggestion
Based on the results
above, we can draw some conclusions. The detailed conclusions are as follows:
a. The highest total
net profit (surplus) is in 2012 and the lowest net profit (deficit) is in 2006.
b. The mean IMDb_Score
of all films from 1920 to 2016 is 6.45, and 2722 films are
above the average. Of those 2722 films, 2408 have a content
rating of R, PG-13, or PG.
c. There are 4
clusters. Cluster 1, Cluster 2, and Cluster 4 have lower Gross;
the difference is that Cluster 1 has the highest IMDb Score, Cluster 2 has the
lowest IMDb Score, and Cluster 4 has a medium IMDb Score, sitting
between Cluster 1 and Cluster 2. Cluster 3 has the highest Gross of
all but an intermediate IMDb Score.
d. Of the 15 variables
tested for correlation with IMDb Score, "gross" has the
strongest positive correlation, one variable
("title_year") is negatively correlated, and two variables
("actor_2_facebook_likes" and
"duration") have no correlation with IMDb Score.
e. Of the three methods
compared (KNN, Neural Network, and Deep Learning), the
Neural Network has the best accuracy, with the lowest Sum of Square Error.
f. The
predictions show an average IMDb Score of 6.4 for the 20 films
(imdb_test_data), with "L.A. Confidential"
having the highest predicted IMDb Score, 6.65.
Suggestions for further analysis are as follows:
a. To reduce the
subjectivity of the results of this analysis, more machines (models) are needed, so the
analysis will be more accurate.
b. Other methods should
be added as a comparison to the methods already tested, in order to find the
best method for predicting IMDb Score.
c. More variables are needed,
such as film genre, to get representative results.
Thank you!
Don't hesitate to leave suggestions and comments!