Software Engineering
algorithms graph search graph-traversal graph-databases
Updated Mon, 23 May 2022 08:56:41 GMT

Searching, storing, and finding graph attributes and vertices


I've been reading the 3rd edition of [Algorithms][1] by Cormen, Leiserson, Rivest and Stein. For DFS and BFS their algorithm loops through all the vertices first and colors them white.

1) If the color attribute were part of the node/vertex you would have to traverse the graph before you traverse the graph. And if you color every vertex white and your graph is cyclical then you would have an infinite loop.

clarification: CLRS' algorithm for DFS is

for each vertex u 'in' G.V
    u.color = WHITE
    u.Parent = NIL
time = 0
for each vertex u 'in' G.V
    if u.color == WHITE
    DFS-VISIT(G.u)

If the u.color is part of the node in the graph how do you actually do that loop? Do you have information on all the nodes stored elsewhere? And if you have an adjacency-list as a linked list with over a billion nodes wouldn't that take too long to iterate through the linked list every time you look for an adjacent vertex?

2) When I thought about storing attributes in an adjacency-matrix or adjacency-list I have a hard time contemplating what that data structure looks like. At least in C I can only make arrays so big. And if I use a linked list then I couldn't index into it. I would have to traverse a potentially huge list to get a vertex attribute. I'm having a hard time thinking of how Facebook would store 1 billion users or Google storing all the addresses for Maps.

3) As a follow up to the Facebook and Google thoughts, assuming that information is stored in a graph, I imagine they don't traverse the entire graph every time somebody searches for an address or person. For example if I enter my address and a destination for directions somehow I would think they find the address and get a pointer to the vertex in the graph that represents my address. From there they would do a BFS or something to calculate my address. They wouldn't start by traversing the whole world map to find my address.

*) I apologize if this isn't the correct way to ask my questions. Really these are just concepts I am having a hard time understanding. Hopefully I have communicated what I am confused by and can get some additional clarification. Thanks!

edit: In reading coding examples of DFS I notice people don't even implement the graph. They implement the adjacency-matrix or list and traverse that. In a tree I may have a struct with a "left" and "right" pointer to its children. While I always assumed the graph nodes would contain links to other nodes (edges) I suppose the nodes could be more simple (just contain the data) and the edges are ONLY stored in the adjacency-matrix or list. Is that a common way to implement graphs?




Solution

TL;DR

A graph is the concept you need to have in mind. How it is actually implemented depends on many factors and might not look like a graph at all.


[..] For DFS and BFS their algorithm loops through all the vertices first and colors them white. [..]

You need to differentiate between the - as I'd call it - procedural description of the algorithm and an actual implementation. The description here is actually emphasizing that a node that has not been visited yet is to be considered as 'white'.

How this is implemented depends on the situation. For a graph that you know cannot have cycles, isn't accessed concurrently, etc. it might be a good idea to actually store the color in the graph node.

Another option is to have a buffer storing the nodes that you consider being 'white'. When you start your search, you add only the start node to that buffer. Then you visit the "first" node of that buffer, removing it from the buffer and adding all its direct neighbors to the 'white' buffer. Depending on what kind of buffer you use you can implement different search strategies: Using a stack (adding == pushing an element on top of the stack, the first == the top most element on the stack) gives you a DFS, using a queue (adding == appending to the end, the first == the first on the front) you get BFS. Add a buffer to store already visited nodes to get cycle detection.

If you can get hold of it the book Artificial Intelligence: A Modern Approach has a really detailed and IMO good explanation of search strategies (and of course much more).

[..] Do you have information on all the nodes stored elsewhere? [..]

I think you're too focused on this kind of node:

struct node {
  struct node * neighbors;
  int data1;
  char * data2;
  // ...
};

But your node can also be ...

typedef int node_t;

.. and information about neighbors, the data the nodes contain etc is stored in tables (as simple as int **neighbors;, or in a SQL database, or whatever's appropriate).

[..] As a follow up to the Facebook and Google thoughts, assuming that information is stored in a graph, I imagine they don't traverse the entire graph every time somebody searches for an address or person. [..]

I don't know how it is actually implemented, but wouldn't it be kind of weird to even consider (= have in memory) information about streets in Canada when you're looking for a route from Paris to Rome?

A possible solution is to split a huge graph into multiple levels of detail. For example: A top most level, where nodes are countries (France is a neighbor of Italy, so consider only those two, or perhaps their neighbors as well to also consider routes through Switzerland), a mid level where nodes are "regions"/cities and a lowest level where nodes are actually street crossings.

[..] At least in C I can only make arrays so big. [..]

int * buffer = malloc(10 * sizeof(int)); // space for 10 int, add error handling
// ... some code ...
// need more space
void * new_buffer = realloc(buffer, 12 * sizeof(int)); // space for 12 int
if (new_buffer) {
  buffer = new_buffer;
  buffer[11] = 42; // perfectly fine
} else { /* error handling */ }






External Links

External links referenced by this document: