batching huge data in tensorflow




I am trying to perform binary classification using the code/tutorial from
https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py


print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")

print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable,
                                 maxlen=sentence_size,
                                 padding='post',
                                 value=0)
x_test = sequence.pad_sequences(x_test_variable,
                                maxlen=sentence_size,
                                padding='post',
                                value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()



def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)

    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)

    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)

    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)

    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)

    logits = tf.layers.dense(inputs=dropout_hidden, units=1)

    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])

    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)

cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                        model_dir=os.path.join(model_dir, 'cnn'),
                                        params=params)
train_and_evaluate(cnn_classifier)



The example here loads data from the IMDB movie reviews dataset. I have my own dataset in the form of text, which is approximately 2 GB. In this example the line (x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size) tries to load the whole dataset into memory. If I try to do the same, I run out of memory. How can I restructure this logic to read the data from my disk in batches?






1 Answer



You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset - from_tensor_slices is the easiest, but it won't work on its own if you can't load the entire dataset into memory.
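For instance, one alternative that streams from disk without materializing everything is tf.data.Dataset.from_generator. A minimal sketch, assuming a hypothetical read_my_corpus_from_disk() generator that you would write yourself to yield one (padded sequence, length, label) triple at a time, with sentence_size as in the question:

import tensorflow as tf

def example_generator():
    # Hypothetical: walk the 2 GB corpus on disk, tokenize and pad each
    # review, and yield one example at a time.
    for seq, seq_len, label in read_my_corpus_from_disk():  # placeholder
        yield seq, seq_len, label

dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_types=(tf.int32, tf.int32, tf.int32),
    output_shapes=(tf.TensorShape([sentence_size]),
                   tf.TensorShape([]),
                   tf.TensorShape([])))
dataset = dataset.shuffle(10000).batch(100).repeat()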





The best way depends on how your data is stored, or how you want to store and manipulate it. The simplest in my opinion, with very little downside (unless you're running on multiple GPUs), is to have the original dataset just give indices into the data, and write a normal numpy function for loading the ith example (a sketch of such a loader follows the pipeline code below).




dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))

def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)

    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.

    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2

dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)

dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # start loading data as the GPU trains on the previous batch

inp1, inp2 = dataset.make_one_shot_iterator().get_next()



Here I assume your outputs are float32 tensors (Tout=...). The set_shape calls aren't strictly necessary, but if you know the shapes it allows better error checking.


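For concreteness, a hypothetical load_ith_example could read pre-saved numpy arrays from disk, one pair of files per example. This is only a sketch: it assumes you have already tokenized, padded and saved your text as .npy files, and the directory layout and file names are made up.

import numpy as np

DATA_DIR = '/path/to/preprocessed'  # assumed layout: features_<i>.npy and label_<i>.npy per example

def load_ith_example(i):
    # Returns (padded token ids, label) for example i, both as float32
    # to match the Tout declared in the tf.py_func call above.
    features = np.load('%s/features_%06d.npy' % (DATA_DIR, i)).astype(np.float32)
    label = np.load('%s/label_%06d.npy' % (DATA_DIR, i)).astype(np.float32)
    return features, label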



So long as your preprocessing doesn't take longer than your network to run, this should run just as fast as any other method on a single GPU machine.



The other obvious way is to convert your data to tfrecords, but that'll take up more space on disk and is more of a pain to manage if you ask me.
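If you do go the tfrecords route, the rough shape of it is: serialize each example once, offline, into one or more .tfrecord files, then stream them back with tf.data.TFRecordDataset and a parse function. A minimal sketch, assuming each example is a padded integer sequence plus an integer label; the feature names and the file name are made up, and sentence_size is the fixed length from the question.

import tensorflow as tf

# Writing (done once, offline): one tf.train.Example per review.
def write_examples(filename, examples):
    with tf.python_io.TFRecordWriter(filename) as writer:
        for tokens, label in examples:  # tokens: list of ints, label: int
            ex = tf.train.Example(features=tf.train.Features(feature={
                'tokens': tf.train.Feature(int64_list=tf.train.Int64List(value=tokens)),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(ex.SerializeToString())

# Reading: stream and parse inside the input_fn.
def parse_example(serialized):
    parsed = tf.parse_single_example(serialized, features={
        'tokens': tf.FixedLenFeature([sentence_size], tf.int64),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    return parsed['tokens'], parsed['label']

dataset = tf.data.TFRecordDataset(['train.tfrecord'])
dataset = dataset.map(parse_example).shuffle(10000).batch(100).repeat()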







Is there an example I can implement? BTW, how would this map function work in this case if we consider the IMDB dataset? Here is the implementation of the load function in Keras: github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
– Rohit
Aug 14 at 19:29






I posted a similar, more extended answer here. I'm not familiar with imdb, but the example in this answer only requires you to implement load_ith_example. You may have to change how you store the data on disk to do so, or consider writing it as tfrecords as explained in the other answer just linked.
– DomJack
Aug 14 at 23:06








By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard