Filesystem Walking: Diving Deeper Than Beginner Tutorials, Pt. 1

Walking the filesystem, Tutorial style (use it, why not?)

Go docs and tutorials usually show a very simple file-walking example that leaves you in a pickle when it's time to do actual work. For example:

	...
	err := filepath.WalkDir(path, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err // stop the walk on the first error
		}
		if !d.IsDir() {
			// do your work on each file here
		}
		return nil // keep walking
	})
	if err != nil {
	...

Downsides

The example conveys the idea of walking and how it is done, but it has some clear downsides.

  • Work has to be done in the same goroutine that is walking the filesystem
  • If goroutines are used to do the work, one is spawned per file, with no limit on how many run at once
  • The calling function doesn’t know anything that happened until the entire walk has completed

Solution ideas/goals

To fix the downsides in the “stock” example, it helps to frame the goals first. If a project doesn’t demand more than a simple walk that does quick work per file, the example may be good enough. The goals for the file walker I am building are well defined.

  1. File walking should be non-blocking: find the next file, don’t wait for processing
  2. File walking shouldn’t have dependencies unrelated to finding files
  3. File processing should be handled by a fixed pool of workers
  4. Walking should slow down if there are not enough workers to keep up

Walking the filesystem (better maybe?, eh. not exactly)

Introducing a channel to the code feels like a solid approach. Walker() creates a channel and returns it to the caller. A goroutine runs filepath.WalkDir, and each time it encounters a DirEntry that is not a directory it places the path on the channel.
This solution is called the Generator concurrency pattern. Decoupling the file walking from the processing will allow callers to introduce any worker/job pattern that fits the needs of their own task.

Note: To let callers know when the walking is done, the Walker will close the channel.

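import (
	"log"
	"os"
	"path/filepath"
)

// Walker walks the tree rooted at path in its own goroutine and sends every
// non-directory path on the returned channel.
func Walker(path string) <-chan string {
	rch := make(chan string) // unbuffered: each send waits for a receiver
	go func() {
		defer close(rch) // Close when walk is complete to signal walking has finished

		// Use WalkDir, introduced in Go 1.16, which is faster than Walk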
		err := filepath.WalkDir(path, func(path string, d os.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !d.IsDir() {
				rch <- path
			}

			return nil
		})
		if err != nil {
			log.Printf("encountered error: %v\n", err)
		}
	}()

	return rch
}

This looks great! The Walker function can find files decoupled from processing and users have the freedom to handle the results in any way they prefer. After integrating the Walker into my project, I did notice a few additional downsides that are worth considering.

  • No cancellation. The walk keeps putting results on the channel until it is completely done, and if the caller stops reading, the goroutine blocks on the send forever
  • Does not follow symlinks

Further development will continue…

Conclusion

Decoupling processing from traversal with the Generator pattern met the stated goals. However, this filesystem walker still has limitations: it cannot be cancelled and it does not follow symlinks. In Part 2, we will explore how to enhance Walker() into a robust and fully-featured file walker for use in production environments.

Filesystem Walking: Cancellation, Pt. 2 - Next Article
Go Concurrency Patterns - Rob Pike - Generator Pattern slide #25
Go Standard Library: filepath.Walk - filepath.Walk example
Go Standard Library: filepath.WalkDir - notes on filepath.WalkDir