I've been looking into using iterators for batch processing in Doctrine (http://docs.doctrine-project.org/en/2.0.x/reference/batch-processing.html). I've got a database with 20,000 images which I would like to iterate over.
I understand that using an iterator is supposed to prevent Doctrine from loading every row into memory. However, the memory usage of the two examples below is almost exactly the same. I am measuring the memory usage before and after each run with memory_get_usage() / 1024.
$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
$iterable = $query->iterate();
while (($image = $iterable->next()) !== false) {
    // Do something here!
}
Memory usage for the iterator.
Memory usage before: 2823.36328125 KB
Memory usage after: 50965.3125 KB
This second example loads the entire result set into memory using the findAll() method.
$images = $this->em->getRepository('Acme\Entities\Image')->findAll();
Memory usage for findAll().
Memory usage before: 2822.828125 KB
Memory usage after: 51329.03125 KB
Batch processing with Doctrine is trickier than it seems, even with the help of iterate() and IterableResult.
As you expected, the greatest benefit of IterableResult is that it does not load all of the elements into memory at once. A second benefit is that it doesn't hold references to the entities you load, so IterableResult itself doesn't prevent the GC from freeing the memory of your entities.
However, there is another object, Doctrine's EntityManager (more specifically its UnitOfWork), which holds a reference to every object you query, explicitly or implicitly (through EAGER associations).
Put simply, whenever entities are returned to you by findAll(), findOneBy(), a DQL query, or an IterableResult, a reference to each of those entities is saved inside Doctrine. The reference is simply stored in an associative array; here's pseudocode:
$identityMap['Acme\Entities\Image'][0] = $image0;
So on each iteration of your loop, the previous images are still present inside this identityMap, even though they are no longer in the loop's scope or in IterableResult's scope. The GC cannot clean them up, and your memory consumption ends up the same as when you were calling findAll().
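To make the effect of that array concrete, here is a minimal sketch in plain PHP (no Doctrine involved). The `Image` class and the `$identityMap` variable are hypothetical stand-ins for an entity and for the map kept by the UnitOfWork, and `WeakReference` (PHP 7.4+) is used only to observe whether the object is still alive. It is an illustration of the principle, not Doctrine's actual implementation.

```php
<?php
// Hypothetical stand-in for a Doctrine entity.
final class Image
{
    public $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }
}

// Hypothetical stand-in for the identity map kept by the UnitOfWork.
$identityMap = [];

$image = new Image(1);
$identityMap[Image::class][$image->id] = $image;

// A weak reference lets us observe the object without keeping it alive.
$weak = WeakReference::create($image);

// The loop variable goes out of scope...
unset($image);
// ...but the map still references the object, so it cannot be freed.
var_dump($weak->get() !== null); // bool(true)

// Only once the map entry is removed can PHP release the object.
unset($identityMap[Image::class][1]);
var_dump($weak->get() === null); // bool(true)
```

As long as the map entry exists, dropping your own variable makes no difference to memory usage.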
Now let's go through the code and see what is actually happening:
$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
// here Doctrine only creates the Query object, no entities are loaded yet
$iterable = $query->iterate();
// unlike findAll(), this call does not hydrate any entities;
// the result is simply wrapped in an IterableResult
while (($image_row = $iterable->next()) !== false) {
    // each call to next() fetches ONE row from the result
    // the row is hydrated into an Image object
    // ----> A REFERENCE TO THE OBJECT IS SAVED INSIDE $identityMap <----
    // the row is returned to you via next()
    // to access the actual Image object, take element [0] of the array
    $image = $image_row[0];
    // Do something here!
    write_image_data_to_file($image, 'myimage.data.bin');
    // as the iteration ends, $image (and $image_row) will be overwritten
    // and, from what we see, the object should be ready for GC
    // however, because a reference to this specific Image object is still held
    // by the EntityManager (inside $identityMap), the GC will NOT clean it up
}
// and by the end of your loop you will consume as much memory
// as you would have by using `findAll()`.
So the first solution is to tell Doctrine's EntityManager to detach the object from the $identityMap. I also replaced the while loop with a foreach to make it more readable.
foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // do something with the image
    write_image_data_to_file($image);
    $entity_manager->detach($image);
    // this line tells Doctrine to remove the _reference_to_the_object_
    // from the identity map, and thus the object will be ready for GC
}
However, the example above has a few flaws, even though it is featured in Doctrine's documentation on batch processing. It works well as long as your Image entity doesn't EAGERly load any of its associations. But if it does EAGERly load an association, e.g.:
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 */
class Image
{
    /**
     * @ORM\Column(type="integer")
     * @ORM\Id
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private $imageName;

    /**
     * This association will be automatically (EAGERly) loaded by Doctrine
     * every time you query the Image entity from the DB, whether via
     * findXXX(), DQL or iterate().
     *
     * @ORM\ManyToOne(targetEntity="Acme\Entity\User", fetch="EAGER")
     */
    private $owner;

    // getters/setters left out for clarity
}
So if we use the same piece of code as above:
foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, the owner entity is already in memory
    // and can be accessed via $image->getOwner()
    // do something with the image
    write_image_data_to_file($image);
    $entity_manager->detach($image);
    // here we detach the Image entity, but its `$owner` (a `User` entity) is still
    // referenced in Doctrine's `$identityMap`. Thus we are still leaking memory.
}
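The leak through the association can be reproduced with a small plain-PHP sketch. As before, `User`, `Image` and `$identityMap` are hypothetical stand-ins rather than Doctrine classes, and `WeakReference` (PHP 7.4+) is only there to check which objects survive. Removing the `Image` entry mimics what detach() does conceptually.

```php
<?php
// Hypothetical stand-ins for two related entities.
final class User
{
    public $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }
}

final class Image
{
    public $id;
    public $owner;

    public function __construct(int $id, User $owner)
    {
        $this->id = $id;
        $this->owner = $owner;
    }
}

// Hypothetical identity map: both the Image and its EAGERly loaded owner
// are registered, just as if they had been hydrated together.
$identityMap = [];
$owner = new User(7);
$image = new Image(1, $owner);
$identityMap[Image::class][$image->id] = $image;
$identityMap[User::class][$owner->id] = $owner;

$weakImage = WeakReference::create($image);
$weakOwner = WeakReference::create($owner);
unset($image, $owner); // the loop variables go out of scope

// Conceptual equivalent of detach($image): only the Image entry is removed.
unset($identityMap[Image::class][1]);

var_dump($weakImage->get() === null); // bool(true)  - the Image was freed
var_dump($weakOwner->get() === null); // bool(false) - the User is still held
```

With 20,000 images, every distinct owner loaded this way would stay in the map until it is detached or cleared as well.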
A possible solution is to use EntityManager::clear() instead of EntityManager::detach(), which clears the identity map COMPLETELY.
foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, the owner entity is already in memory
    // and can be accessed via $image->getOwner()
    // do something with the image
    write_image_data_to_file($image);
    $entity_manager->clear();
    // now `$identityMap` is cleared of ALL the entities it held:
    // the `Image` and the `User` loaded in this loop iteration and, as a
    // SIDE EFFECT, all OTHER entities which you may have loaded earlier.
    // Thus when you start this loop you must NOT rely on any entities
    // you have `persist()`ed or `remove()`d without flushing:
    // all changes since the last `flush()` will be lost.
}
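Calling clear() on every single row is not required; a common variation is to clear once every N rows. Below is a rough sketch of that idea. The helper `processInBatches()` and its parameters are hypothetical (they are not part of Doctrine), and a generator stands in for the query result so the snippet runs without a database.

```php
<?php
/**
 * Hypothetical helper: run $handle on every row and call $clear after each
 * $batchSize rows, plus once at the end if a partial batch is left over.
 * Returns the number of rows processed.
 */
function processInBatches(iterable $rows, callable $handle, callable $clear, int $batchSize): int
{
    $count = 0;
    foreach ($rows as $row) {
        $handle($row);
        $count++;
        if ($count % $batchSize === 0) {
            $clear(); // end of a full batch
        }
    }
    if ($count % $batchSize !== 0) {
        $clear(); // leftover partial batch
    }

    return $count;
}

// Stand-in data: 250 fake rows shaped like IterableResult rows ([0] => item).
$rows = (function () {
    for ($i = 1; $i <= 250; $i++) {
        yield [$i];
    }
})();

$clears = 0;
$processed = processInBatches(
    $rows,
    function (array $row): void {
        // write_image_data_to_file($row[0]) would go here
    },
    function () use (&$clears): void {
        $clears++; // a real caller would clear the EntityManager here
    },
    100
);

echo $processed, ' rows, ', $clears, " clears\n"; // prints "250 rows, 3 clears"
```

With Doctrine you would presumably pass `$query->iterate()` as `$rows` and `[$entity_manager, 'clear']` as `$clear`. The same warning applies: anything not flushed before a clear() is lost.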
I hope this helps you understand Doctrine iteration a little better.