---
title: "Text Classification"
author: "Nirzaree"
date: "14/07/2020"
output: html_document
---
Based on the tutorial at: https://blogs.rstudio.com/ai/posts/2017-12-07-text-classification-with-keras/

**Dataset:**

IMDB: 50000 reviews

* Training data: 25000 reviews
* Testing data: 25000 reviews
* Balanced datasets: equal numbers of positive and negative reviews

```{r setup,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(data.table)
library(keras)
library(dplyr)
library(ggplot2)
library(purrr)
```
```{r loadData,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
dtIMDB <- dataset_imdb(num_words = 10000)
dtTrain <- dtIMDB$train$x
dtTrainLabels <- dtIMDB$train$y
dtTest <- dtIMDB$test$x
dtTestLabels <- dtIMDB$test$y
# word_index <- dataset_imdb_word_index()
# dtWordIndex <- data.table(
# word = names(word_index),
# id = unlist(word_index,use.names = F)
# )
```
```{r ExploreData,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# reverse_word_index <- names(word_index)
# names(reverse_word_index) <- word_index
# decoded_review <- sapply(dtTrain[[1]],function(index) {
# word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
# if (!is.null(word)) word else "?"
# })
#
# cat(decoded_review)
```
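
The commented-out chunk above maps a review's integer indices back to words. As a minimal sketch of that idea, the example below uses a tiny made-up word index in place of `dataset_imdb_word_index()`; `toy_word_index`, `decode_review` and its `offset` argument are hypothetical names, not part of keras. It assumes the usual IMDB convention that indices are shifted by 3, because 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words.

```r
# Hypothetical toy word index (word -> rank); the real one has tens of thousands of entries
toy_word_index <- list(the = 1L, movie = 2L, was = 3L, great = 4L)

# Reverse lookup: rank (as a character name) -> word
reverse_index <- names(toy_word_index)
names(reverse_index) <- unlist(toy_word_index)

decode_review <- function(indices, reverse_index, offset = 3L) {
  words <- vapply(indices, function(index) {
    key <- as.character(index - offset)
    # Reserved indices and unknown ranks are shown as "?"
    if (index > offset && key %in% names(reverse_index)) reverse_index[[key]] else "?"
  }, character(1))
  paste(words, collapse = " ")
}

# 1 is the start-of-sequence marker and 2 the unknown-word marker
decoded <- decode_review(c(1L, 4L, 5L, 6L, 7L, 2L), reverse_index)
print(decoded)
```

The guard on `key %in% names(reverse_index)` avoids a subscript error when an index has no entry in the lookup.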
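**Preprocess:**

* Integers cannot be fed to a neural network directly; the lists of word indices first need to be converted to tensors.

There are 2 techniques that can be used:

1. One-hot encode each review into a 10000-dimensional vector of 0s and 1s, where a 1 marks a word index that occurs in the review (the approach used here).
2. Pad the reviews to a common length and feed the resulting integer tensor to an embedding layer.
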
```{r PrepareData,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
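# One-hot (multi-hot) encode: one row per review, one column per word index;
# a cell is 1 if that word index occurs in the review
onehotencode <- function(sequence, dimension = 10000) {
  encoding <- matrix(data = 0, nrow = length(sequence), ncol = dimension)
  for (index in seq_along(sequence)) {
    encoding[index, sequence[[index]]] <- 1
  }
  return(encoding)
}
# Naming used throughout: x_train / x_labels are the training features / labels,
# y_test / y_labels are the test features / labels
x_train <- onehotencode(dtTrain)
y_test <- onehotencode(dtTest)
# Convert labels from integer to numeric (floating point), as in the source tutorial
x_labels <- as.numeric(dtTrainLabels)
y_labels <- as.numeric(dtTestLabels)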
```
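
To make the encoding above concrete, here is a small sketch on made-up data: two toy "reviews" over a vocabulary of 6 indices instead of 10000. `toy_reviews` and `encode_multi_hot` are hypothetical names used only for illustration. A repeated index still yields a single 1, so the encoding records presence rather than counts.

```r
# Two made-up reviews as lists of word indices (index 3 occurs twice in the first)
toy_reviews <- list(c(1L, 3L, 3L), c(2L, 5L))

encode_multi_hot <- function(sequences, dimension) {
  encoding <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    # Set every column named by the review's indices to 1
    encoding[i, sequences[[i]]] <- 1
  }
  encoding
}

toy_encoded <- encode_multi_hot(toy_reviews, dimension = 6)
print(toy_encoded)
```

At the real size the training matrix has 25000 rows and 10000 columns of doubles, roughly 2 GB, so a sparse representation may be worth considering on memory-constrained machines.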
[Architecture of the model](https://blogs.rstudio.com/ai/posts/2017-12-07-text-classification-with-keras/images/3_layer_network.png)
```{r Model,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# loss function and optimizer
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```
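
As a quick sanity check on the size of this network, the sketch below counts trainable parameters from the layer sizes alone, using the fact that a dense layer has one weight per input-unit pair plus one bias per unit. The variable names are illustrative; the total should agree with what `summary(model)` reports.

```r
# Layer sizes taken from the model above: 10000 inputs -> 16 -> 16 -> 1
layer_sizes <- c(10000, 16, 16, 1)

n_inputs <- head(layer_sizes, -1)  # inputs feeding each dense layer
n_units <- tail(layer_sizes, -1)   # units in each dense layer

# weights (inputs * units) + biases (units), per layer
params_per_layer <- n_inputs * n_units + n_units
total_params <- sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 10000-column one-hot input.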
```{r ModelValidation,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# split train into train + validation
validation_indices <- 1:10000
x_val <- x_train[validation_indices, ]
x_train_partial <- x_train[-validation_indices, ]
x_labels_val <- x_labels[validation_indices]
x_labels_partial <- x_labels[-validation_indices]

history <- model %>% fit(
  x_train_partial,
  x_labels_partial,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, x_labels_val)
)
```
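Batch size is the number of samples used for each gradient update: with `batch_size = 512`, the 15000 training samples are processed in 30 mini-batches per epoch.

The model overfits beyond epoch 4, so the final model is trained for 4 epochs only.
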
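
As a rough illustration of what `batch_size = 512` means for the validation run above, the arithmetic below works out how many weight updates happen per epoch. It assumes the split used in this document (10000 of the 25000 training reviews held out for validation); the variable names are illustrative.

```r
n_train_partial <- 25000 - 10000  # samples left after holding out 10000 for validation
batch_size <- 512
epochs <- 20

# Each epoch walks through the data once, one mini-batch per weight update
steps_per_epoch <- ceiling(n_train_partial / batch_size)

# The last mini-batch of an epoch holds whatever is left over
last_batch_size <- n_train_partial - (steps_per_epoch - 1) * batch_size

total_updates <- steps_per_epoch * epochs

print(c(steps_per_epoch, last_batch_size, total_updates))
```

A smaller batch size gives more, noisier updates per epoch; a larger one gives fewer, smoother updates at a higher memory cost per step.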
```{r finalmodel,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
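# Caution: `model` already holds the weights from the 20-epoch run above, and fit() continues from them.
# Re-run the Model chunk first so that this 4-epoch fit starts from freshly initialised weights.
model %>% fit(x_train, x_labels, epochs = 4, batch_size = 512)
# Evaluate on the test set (y_test = test features, y_labels = test labels)
results <- model %>% evaluate(y_test, y_labels)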
```
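
`evaluate()` reports the loss and the metrics configured in `compile()`. As a sketch of what the `binary_crossentropy` loss measures, the function below computes the mean of -[y log(p) + (1 - y) log(1 - p)] in base R. `binary_crossentropy_mean`, the clipping constant `eps` and the example values are assumptions for illustration; Keras' internal implementation may differ in details such as clipping.

```r
binary_crossentropy_mean <- function(y_true, p_pred, eps = 1e-7) {
  # Clip probabilities away from 0 and 1 so that log() stays finite
  p <- pmin(pmax(p_pred, eps), 1 - eps)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

# Made-up labels and predicted probabilities
toy_labels <- c(1, 0, 1, 0)
confident_right <- binary_crossentropy_mean(toy_labels, c(0.9, 0.1, 0.9, 0.1))
uninformative <- binary_crossentropy_mean(toy_labels, c(0.5, 0.5, 0.5, 0.5))

print(c(confident_right, uninformative))
```

Predicting 0.5 for every review gives a loss of log(2), about 0.693, which is a useful baseline when reading the reported test loss.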
```{r predictonnewdata,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# Predicted probability of a positive review for the first 10 test reviews
model %>% predict(y_test[1:10, ])
```
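
`predict()` on this model returns one sigmoid output per review, a value between 0 and 1 that can be read as the estimated probability of a positive review (label 1). A minimal sketch of turning such outputs into labels and an accuracy figure, assuming a 0.5 cutoff and using made-up numbers (`toy_probs`, `toy_truth` and `threshold` are hypothetical):

```r
# Made-up sigmoid outputs and true labels (1 = positive, 0 = negative)
toy_probs <- c(0.92, 0.08, 0.55, 0.40)
toy_truth <- c(1, 0, 0, 0)
threshold <- 0.5

# Probabilities above the cutoff are classed as positive
predicted_class <- as.numeric(toy_probs > threshold)
predicted_label <- ifelse(predicted_class == 1, "positive", "negative")

# Share of reviews whose predicted class matches the true label
toy_accuracy <- mean(predicted_class == toy_truth)

print(predicted_label)
print(toy_accuracy)
```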