Filesystem Walking: Diving Deeper Than Beginner Tutorials, Pt. 1
Walking the filesystem, Tutorial style (use it, why not?)
Most file-walking examples in the Go docs and tutorials are deliberately simple, which leaves you in a pickle when it's time to do actual work. For example:
...
err := filepath.WalkDir(path, func(path string, d os.DirEntry, err error) error {
	if err != nil {
		return err
	}
	if !d.IsDir() {
		// do your work on each file here
	}
	return nil // keep walking
})
if err != nil {
...
Downsides
The example conveys the idea of walking and how it is done, but it has some clear downsides.
- Work has to be done in the goroutine that is also doing the filesystem walking
- When goroutines are used to do the work, one is spawned per file, so nothing bounds how many run at once
- The calling function doesn’t learn anything about what happened until the entire walk is completed
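To make the second downside concrete, here is a rough sketch of what usually happens when concurrency is bolted onto the stock example. NaiveWalk and the process callback are hypothetical names I am using for illustration, not part of any library. Every file gets its own goroutine, so a very large tree can mean a very large number of goroutines in flight, and the caller still hears nothing until the whole walk returns.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// NaiveWalk is a hypothetical helper showing the "one goroutine per file"
// approach. Nothing limits how many goroutines are running at once.
func NaiveWalk(root string, process func(path string)) error {
	var wg sync.WaitGroup
	err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			wg.Add(1)
			go func() { // spawned for every single file
				defer wg.Done()
				process(path)
			}()
		}
		return nil
	})
	wg.Wait() // the caller only gets control back after everything has finished
	return err
}

func main() {
	var mu sync.Mutex
	count := 0
	err := NaiveWalk(".", func(string) {
		mu.Lock()
		count++
		mu.Unlock()
	})
	fmt.Println("files:", count, "err:", err)
}
```

It works for small trees, but the walking code now also owns the concurrency decisions, which is exactly the coupling I want to get rid of.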
Solution ideas/goals
To fix the downsides in the “stock” example, it helps to frame the goals first. If the project doesn’t demand more than a simple walk that does quick work per file, perhaps the example is good enough. The goals I want from the file walker I am building are pretty well defined.
- File walking should be non-blocking: find the next file, don’t wait for processing
- File walking shouldn’t have dependencies unrelated to finding files
- File processing should be taken on by a set pool of processing workers
- Walking should slow down if there are not enough workers
Walking the filesystem (better maybe?, eh. not exactly)
Introducing a channel to the code feels like a solid approach. A channel is created by Walker() and returned to the caller. When the goroutine running filepath.WalkDir encounters a DirEntry that is not a directory, its path is placed on the channel.
This solution is called the Generator concurrency pattern. Decoupling the file walking from the processing will allow callers to introduce any worker/job pattern that fits the needs of their own task.
Note: To let callers know when the walking is done, the Walker will close the channel.
func Walker(path string) <-chan string {
	rch := make(chan string)
	go func() {
		defer close(rch) // Close when walk is complete to signal walking has finished
		// Use WalkDir, introduced in Go 1.16, which is faster than Walk
		err := filepath.WalkDir(path, func(path string, d os.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !d.IsDir() {
				rch <- path
			}
			return nil
		})
		if err != nil {
			log.Printf("encountered error: %v\n", err)
		}
	}()
	return rch
}
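As one way a caller might use it, here is a minimal sketch of a fixed pool of workers draining the channel. ProcessAll and its parameters are hypothetical names of mine, and Walker is repeated only so the sketch compiles on its own. Because rch is unbuffered, the send inside the walk blocks whenever every worker is busy, which is what slows the walk down when there are not enough workers.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sync"
)

// Walker is the generator from the article, repeated so this sketch is self-contained.
func Walker(root string) <-chan string {
	rch := make(chan string)
	go func() {
		defer close(rch)
		err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !d.IsDir() {
				rch <- path // blocks until some worker is ready to receive
			}
			return nil
		})
		if err != nil {
			log.Printf("encountered error: %v", err)
		}
	}()
	return rch
}

// ProcessAll is a hypothetical helper: it starts a fixed number of workers
// that all receive from paths, and returns once the channel has been closed
// and every worker has finished.
func ProcessAll(paths <-chan string, workers int, process func(path string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths { // loop ends when Walker closes the channel
				process(p)
			}
		}()
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	count := 0
	ProcessAll(Walker("."), 4, func(string) {
		mu.Lock()
		count++
		mu.Unlock()
	})
	fmt.Println("files processed:", count)
}
```

Swapping in a different worker/job pattern only means changing the consumer; Walker itself stays untouched.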
This looks great! The Walker function can find files decoupled from processing and users have the freedom to handle the results in any way they prefer. After integrating the Walker into my project, I did notice a few additional downsides that are worth considering.
- No cancellation. The walk keeps putting results on the channel until it is completely done, and it blocks forever if the caller stops receiving
- Does not follow symlinks
Further development will continue…
Conclusion
The Generator pattern, by decoupling processing from traversal, met the stated goals. However, this filesystem walker implementation still has limitations. In Part 2, we will explore how to enhance Walker() into a robust and fully featured file walker for use in production environments.
Links
Filesystem Walking: Cancellation, Pt. 2 - Next Article
Go Concurrency Patterns - Rob Pike - Generator Pattern slide #25
Go Standard library: filepath.Walk - filepath.Walk example
Go Standard Library: filepath.WalkDir - notes on filepath.WalkDir