Software Engineering
garbage-collection
Updated Mon, 19 Sep 2022 10:28:23 GMT

Does Garbage Collection Scan The Entire Memory?


I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it? If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

Sorry for the rookie question, I am not a software engineer and this is pure out of my curiosity




Solution

I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it?

That depends on the garbage collector. There are many different kinds of garbage collectors.

For example, Reference Counting Garbage Collectors don't "scan" anything at all! In a Reference Counting Garbage Collector, the system counts references to objects, something like this:

SomeObject foo = new SomeObject();

Let's say, this new object was allocated at memory address 42. The GC records "there is 1 reference to the object at address 42".

SomeObject bar = foo;

Now, the GC records "there are 2 references to the object at address 42".

foo = null;

Now, the GC records "there is 1 reference to the object at address 42".

bar = null;

Now, the GC says "there are 0 references to the object at address 42, therefore, I can collect it".

At no point did the GC "scan" anything.

What you are probably thinking about is an extremely simplistic implementation of a so-called "Tracing Garbage Collector", namely the Mark-and-Sweep GC.

Any Tracing GC starts off with a set of objects that they know are always reachable. This is called the root set. The root set typically includes all global variables, the local variables, CPU registers, the stack, and some other stuff. For all of these objects, the GC looks at the instance variables and checks the objects that the instance variables point to. Then it checks those objects' instance variables, and so on and so forth.

This way, the GC "sees" all "live" objects, i.e. the objects that are reachable from the root set. What the GC does with those "live" objects depends on the kind of GC.

As I mentioned above, what you are thinking of is the most simplistic kind of Tracing GC, which is the Mark-and-Sweep GC. During the tracing phase I described above, the GC will "mark" all live objects by either setting a flag directly in the object header itself, or in a separate data structure.

Then, the GC will indeed "scan" the entire memory and find all objects and do one of two things:

  • If the object is marked, remove the mark.
  • If the object is unmarked, de-allocate the object.

After this "sweep" phase, you end up with all unreachable objects destroyed and all live objects unmarked, ready for the next "mark" phase.

But, as I mentioned, this is only one of many different kinds of Tracing GCs, and is a very simple one with many disadvantages. The two major disadvantages are that scanning the entire memory is expensive and leaving the live objects where they are and only collecting the dead objects in between leads to memory fragmentation.

Another very simple but much faster Tracing GC is Henry Baker's Semi-Space GC. The Semi-Space GC "wastes" half of the available memory, but gains a lot of performance for it. The way the Semi-Space GC works is that it divides the available memory into two halves, let's call them A and B. Only one of the two halves is active at any one time, meaning new objects only get allocated in one of the two halves.

We start out with half A: The GC "traces" the "live" objects just as described above, but instead of "marking" them, it copies them to half B. That way, once the "tracing" phase is done, there is no need to scan the entire memory anymore. We know that all live objects are in half B, so we can simply forget about half A completely. From now on, all new objects are allocated in half B. Until the next garbage collection cycle, when all live objects are copied to half A, and we forget about half B.

These are just two examples of Tracing GCs, and only one of those two scans the entire memory.

If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

This is simply impossible. No modern Operating System allows a process to read another process's memory. (And when I say "modern", I mean "since the 1960s or so".)

But even if it were possible, it would not make sense. If the memory belongs to another process, then the GC has no idea what the objects that are in this memory even look like. But it needs to know what the objects look like in order to find all the instance variables and to know how to interpret those references. If it uses an internal marker flag inside the object itself, it also needs to know how to find that marker flag and how to set it. And that is assuming that the marker flag is even there at all! What happens if the application that owns that memory doesn't use marker flags?

Or, worse: what happens if the application that owns that memory does use a GC which uses marker flags. Now, the one GC is overwriting the other GC's markers!

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

There are two popular approaches.

The first approach is that there is flag in the object header of each object reserved for marking. During the "mark" phase, the GC sets this flag. The major advantage of this approach is that there is no separate bookkeeping involved and it is thus very simple: the mark is right there on the object itself. The major disadvantage is that objects are scattered all through the memory, and thus during the marking phase, the GC writes all over the entire memory. This means that there are "dirty" pages all over memory, in a multiprocessor system (which almost all systems are nowadays) this means that we have to notify the other CPU cores that we have modified some memory, we have polluted the cache with tons of writes that we will never need again, and so on.

The alternative is to keep a separate data structure where we keep a table of all marked objects. This has the disadvantage of more bookkeeping (we need to keep a relationship between the mark table and the objects) but it has the major advantage that we are only writing to one place in memory, which means we can keep this one piece of data in the cache all the time.

But again, not all GCs even have a concept of "marking" at all.





Comments (5)

  • +6 – "Then, the GC will indeed "scan" the entire memory and find all objects." This is not really true or necessary. There's nothing to be done about 'dead' objects unless there was some sort of hook to notify about it's 'death'. Also, I consider reference counting to be much more simplistic than mark-sweep. I grant that it can be more performant but it's unable to guarantee all 'garbage' is freed. — Aug 15, 2022 at 20:22  
  • +5 – The "Mark-and-Sweep GC" also has one advantage over simple reference counting: It removes objects with circular references, that are otherwise not referenced anywhere. Imagine a parent object having a reference to a child object, and vice versa. If both parent and child lose all external references, a simple ref counter GC would still mark them alive, since they reference each other. A Mark-and-Sweep GC would find no path from the root to the objects, and would delete them. (Note that weakrefs would solve this) — Aug 16, 2022 at 11:24  
  • +4 – Of course operating systems allow reading the memory of other processes. Not as a matter of course, but for instance the win32 debugging functions let you do just that. I think what you mean is there's no way for this to happen by accident. — Aug 16, 2022 at 12:08  
  • +3 – @MasonWheeler My understanding is that the standard CPython runtime uses reference counting primarily with a generational collector as a backup to catch cycles. The combination is pretty pragmatic and performant. Cycles are relatively rare in most applications. — Aug 16, 2022 at 20:22  
  • +3 – @MasonWheeler it may be down to the specific definition you use for GC, but Python has this to say: "The main garbage collection algorithm used by CPython is reference counting." so they consider it a GC. And you can disable the automated cycle counter and run it on demand, which will make it completely temporally deterministic. — Aug 17, 2022 at 11:11