DirectoryIterator and SymLinks


#1

This kind link (from Digidesign's dev build!) throws DirectoryIterator into an infinite loop!

lrwxr-xr-x@ 1 jim  staff         2  2 May 18:24 Frameworks -> ./

I'm scratching my head about the right fix for this ...


#2

Yeah, that's a tricky one. I'm not keen to make the iterator responsible for finding recursions.. And what would be the correct thing for it do in the case above? At what point should it stop iterating, and how would it know that? There could easily be situations where you would want it to follow a link, but there's no way of knowing.


#3

Unfortunately I think it needs to be responsible.  Otherwise it could be responsible for weird and hard to identify hangs on user systems. 

A cheap safety fix would be to stop after a certain depth was reached.  A better fix is to get the inode number of any sym-links and check them against a stack of parent folder inodes.  

I had a look at GNU find and it has the following approach (and also a shit load of security stuff I wasn't aware of!):


 /* If we've already seen this directory on this branch,
     don't descend it again.  */
  for (i = 0; i <= dir_curr; i++)
    if (stat_buf.st_ino == dir_ids[i].ino &&
        stat_buf.st_dev == dir_ids[i].dev)
      {
        state.stop_at_current_level = true;
        issue_loop_warning(name, pathname, i);
      }
 

I don't know enough about Windows to know if there's a similar problem to be had there!

Incidentially - GNU find is about 5x quicker on my system I think.  I didn't do a scientific analysis - but i could run several GNU finds to completion while DirectoryIterator was building a list of the same folder structure.  I suspect the massive cascade of function calls for every call to next() probably doesn't help. 

I was thinking of re-writing it. But I've moved to using Spotlight and Windows Search instead...


#4

Sure, but the problem's not detecting recursion, it's the fact that it may already be too late when that happens. In your example above, if you start searching from a deep sub-folder which then links back to the root folder, then once it has gone back to the root, it's already too late - it could be scanning folders for hours before it gets around to hitting the original start folder and realising that it has recursed. Ideally you'd want it to not take that link in the first place, but I can't think of a good way to do that.

Incidentially - GNU find is about 5x quicker on my system I think.  I didn't do a scientific analysis - but i could run several GNU finds to completion while DirectoryIterator was building a list of the same folder structure.  I suspect the massive cascade of function calls for every call to next() probably doesn't help. 

No, the function calls will be negligible in terms of performance.

The most likely reason for it being slower would be that you're asking for details about the files, e.g. their size, etc. That'd require a lot of extra overhead for each one.


#5

Sure, but the problem's not detecting recursion, it's the fact that it may already be too late when that happens.

Well, keeping a stack of inodes and, when you get a symlink, not following it if it's already in the stack fixes that I think?

No, the function calls will be negligible in terms of performance.

The most likely reason for it being slower would be that you're asking for details about the files, e.g. their size, etc. That'd require a lot of extra overhead for each one.

Fair point :)


#6

Well, keeping a stack of inodes and, when you get a symlink, not following it if it's already in the stack fixes that I think?

Nope - I don't think you understood what I was trying to say. My point was that it could go for hours before it can possibly detect a recursion.


#7

I think I understand what you're trying to say.   But to explain I've made some code.  This Mac only tweak refuses to follow (or list) symlinks that are circular loops. At least all the obvious ones I could manage to create :) 

It still needs the root folder details adding to the linkDetectionStack, so it doesn't currently protect against a loop back to the starting point. But that's a couple of lines that need adding which can wait till after breakfast!  


diff --git i/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.cpp w/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.cpp
index 7d6dd4b..205cde1 100644
--- i/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.cpp
+++ w/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.cpp
@@ -27,7 +27,8 @@
 */
 
 DirectoryIterator::DirectoryIterator (const File& directory, bool recursive,
-                                      const String& pattern, const int type)
+                                      const String& pattern, const int type,
+                                      LoopDetectionStack * lds)
   : wildCards (parseWildcards (pattern)),
     fileFinder (directory, (recursive || wildCards.size() > 1) ? "*" : pattern),
     wildCard (pattern),
@@ -36,15 +37,23 @@ DirectoryIterator::DirectoryIterator (const File& directory, bool recursive,
     totalNumFiles (-1),
     whatToLookFor (type),
     isRecursive (recursive),
-    hasBeenAdvanced (false)
+    hasBeenAdvanced (false),
+    loopDetectionStack(lds)
 {
     // you have to specify the type of files you're looking for!
     jassert ((type & (File::findFiles | File::findDirectories)) != 0);
     jassert (type > 0 && type <= 7);
+    /* Create a loop detection stack if we haven't been called recursively. */
+    if (loopDetectionStack == nullptr) {
+        loopDetectionStackMaster = new LoopDetectionStack();
+        loopDetectionStack = loopDetectionStackMaster;
+        /* Need to add my master inod/dev onto the stack here (TODO) */
+    }
 }
 
 DirectoryIterator::~DirectoryIterator()
 {
+    if (loopDetectionStack) loopDetectionStack->removeLast(); // pop.
 }
 
 StringArray DirectoryIterator::parseWildcards (const String& pattern)
@@ -81,14 +90,17 @@ bool DirectoryIterator::next (bool* const isDirResult, bool* const isHiddenResul
             return true;
 
         subIterator = nullptr;
+
     }
 
     String filename;
     bool isDirectory, isHidden = false;
+    
+    int64 inod, dev;
 
     while (fileFinder.next (filename, &isDirectory,
                             (isHiddenResult != nullptr || (whatToLookFor & File::ignoreHiddenFiles) != 0) ? &isHidden : nullptr,
-                            fileSize, modTime, creationTime, isReadOnly))
+                            fileSize, modTime, creationTime, isReadOnly, &inod, &dev))
     {
         ++index;
 
@@ -98,10 +110,22 @@ bool DirectoryIterator::next (bool* const isDirResult, bool* const isHiddenResul
 
             if (isDirectory)
             {
-                if (isRecursive && ((whatToLookFor & File::ignoreHiddenFiles) == 0 || ! isHidden))
+                bool isLoop = false;
+                jassert(loopDetectionStack);
+                for (int i = 0; i<loopDetectionStack->size(); ++i) {
+                    if (inod == (*loopDetectionStack)[i].inod &&
+                        dev == (*loopDetectionStack)[i].dev) isLoop = true;
+                }
+                
+                if (isRecursive && !isLoop && ((whatToLookFor & File::ignoreHiddenFiles) == 0 || ! isHidden)) {
+                    LinkLoopDetectionData ldetect;
+                    ldetect.inod = inod;
+                    ldetect.dev = dev;
+                    loopDetectionStack->add(ldetect); // push.
                     subIterator = new DirectoryIterator (File::createFileWithoutCheckingPath (path + filename),
-                                                         true, wildCard, whatToLookFor);
+                                                         true, wildCard, whatToLookFor, loopDetectionStack);
 
+                }
                 matches = (whatToLookFor & File::findDirectories) != 0;
             }
             else
diff --git i/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.h w/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.h
index 3ca2265..126e385 100644
--- i/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.h
+++ w/JuceLibraryCode/modules/juce_core/files/juce_DirectoryIterator.h
@@ -46,6 +46,12 @@
 class JUCE_API  DirectoryIterator
 {
 public:
+    struct LinkLoopDetectionData {
+        int64 inod;
+        int64 dev;
+    };
+    typedef Array<LinkLoopDetectionData> LoopDetectionStack;
+    
     //==============================================================================
     /** Creates a DirectoryIterator for a given directory.
 
@@ -68,11 +74,13 @@ public:
                             separated by a semi-colon or comma, e.g. "*.jpg;*.png"
         @param whatToLookFor    a value from the File::TypesOfFileToFind enum, specifying
                                 whether to look for files, directories, or both.
+        @param loopDetectionStack @internal
     */
     DirectoryIterator (const File& directory,
                        bool isRecursive,
                        const String& wildCard = "*",
-                       int whatToLookFor = File::findFiles);
+                       int whatToLookFor = File::findFiles,
+                       LoopDetectionStack * = nullptr); /* Perhaps split this into two diff. constructors. */
 
     /** Destructor. */
     ~DirectoryIterator();
@@ -126,7 +134,8 @@ private:
 
         bool next (String& filenameFound,
                    bool* isDirectory, bool* isHidden, int64* fileSize,
-                   Time* modTime, Time* creationTime, bool* isReadOnly);
+                   Time* modTime, Time* creationTime, bool* isReadOnly,
+                   int64* const inod, int64* const dev);
 
         class Pimpl;
 
@@ -137,6 +146,10 @@ private:
 
         JUCE_DECLARE_NON_COPYABLE_WITH_LEAK_DETECTOR (NativeIterator)
     };
+    
+
+    ScopedPointer<LoopDetectionStack> loopDetectionStackMaster;
+    LoopDetectionStack * loopDetectionStack;
 
     friend struct ContainerDeletePolicy<NativeIterator::Pimpl>;
     StringArray wildCards;
diff --git i/JuceLibraryCode/modules/juce_core/native/juce_mac_Files.mm w/JuceLibraryCode/modules/juce_core/native/juce_mac_Files.mm
index 0a00e92..29c513c 100644
--- i/JuceLibraryCode/modules/juce_core/native/juce_mac_Files.mm
+++ w/JuceLibraryCode/modules/juce_core/native/juce_mac_Files.mm
@@ -343,7 +343,8 @@ public:
 
     bool next (String& filenameFound,
                bool* const isDir, bool* const isHidden, int64* const fileSize,
-               Time* const modTime, Time* const creationTime, bool* const isReadOnly)
+               Time* const modTime, Time* const creationTime, bool* const isReadOnly,
+               int64* const inod, int64* const dev)
     {
         JUCE_AUTORELEASEPOOL
         {
@@ -365,7 +366,7 @@ public:
                     continue;
 
                 const String fullPath (parentDir + filenameFound);
-                updateStatInfoForFile (fullPath, isDir, fileSize, modTime, creationTime, isReadOnly);
+                updateStatInfoForFile (fullPath, isDir, fileSize, modTime, creationTime, isReadOnly, inod, dev);
 
                 if (isHidden != nullptr)
                     *isHidden = FileHelpers::isHiddenFile (fullPath);
@@ -393,9 +394,10 @@ DirectoryIterator::NativeIterator::~NativeIterator()
 
 bool DirectoryIterator::NativeIterator::next (String& filenameFound,
                                               bool* const isDir, bool* const isHidden, int64* const fileSize,
-                                              Time* const modTime, Time* const creationTime, bool* const isReadOnly)
+                                              Time* const modTime, Time* const creationTime, bool* const isReadOnly,
+                                              int64* const inod, int64* const dev)
 {
-    return pimpl->next (filenameFound, isDir, isHidden, fileSize, modTime, creationTime, isReadOnly);
+    return pimpl->next (filenameFound, isDir, isHidden, fileSize, modTime, creationTime, isReadOnly, inod, dev);
 }
 
 
diff --git i/JuceLibraryCode/modules/juce_core/native/juce_posix_SharedCode.h w/JuceLibraryCode/modules/juce_core/native/juce_posix_SharedCode.h
index 23a5334..bc389a4 100644
--- i/JuceLibraryCode/modules/juce_core/native/juce_posix_SharedCode.h
+++ w/JuceLibraryCode/modules/juce_core/native/juce_posix_SharedCode.h
@@ -224,7 +224,8 @@ namespace
     }
 
     void updateStatInfoForFile (const String& path, bool* const isDir, int64* const fileSize,
-                                Time* const modTime, Time* const creationTime, bool* const isReadOnly)
+                                Time* const modTime, Time* const creationTime, bool* const isReadOnly,
+                                int64* const inod, int64* const dev)
     {
         if (isDir != nullptr || fileSize != nullptr || modTime != nullptr || creationTime != nullptr)
         {
@@ -235,6 +236,8 @@ namespace
             if (fileSize != nullptr)      *fileSize     = statOk ? info.st_size : 0;
             if (modTime != nullptr)       *modTime      = Time (statOk ? (int64) info.st_mtime * 1000 : 0);
             if (creationTime != nullptr)  *creationTime = Time (statOk ? (int64) info.st_ctime * 1000 : 0);
+            if (inod != nullptr)          *inod = info.st_ino;
+            if (dev != nullptr)           *dev = info.st_dev;
         }
 
         if (isReadOnly != nullptr)

#8

Hmm, That'll be a bit of a non-starter unless there's an equivalent to inodes on win32, I'm afraid, and I'm not sure there is.

Just a quick thought: it could actually be possible to get this right if instead of just having a list of folders that have been visited, it was to check the canonical path of each sub-folder to see if it's a parent folder of the start location.

So if you started from e.g. /a/b/c and there was a link in there back to /a, then it'd see that /a/b/c is a subfolder of /a, and would avoid that branch. But this means working out the correct canonical path of each sub-folder rather than just basing its pathname on the name of the parent folder that's being recursed, and that'd take a bit of investigation.


#9

I've had a bit of digging on Windows (turned out I knew very little about NTFS!).  Apparently it's got:

  • Shortcuts 
  • Hard links (fsutil.exe hardlink) - these only refer to files so they aren't such a problem
  • Symbolic Links (mklink.exe) - 
  • Junction Boxes (linkd.exe) - a bit like hard links for directories.  Perhaps the same as reparse points, I'm starting to get lost. 

 

There's no such thing as canonical path in Windows unfortunately!   The function PathCanonicalize(..) just removes ./ and /../ but doesn't deal with links, reparse-points and so on. 

But there is a function GetFileInformationByHandle(..) returns something which is loosely the NTFS equivalent of an inode.  

FAT shouldn't be a problem. Perhaps returning -1 as the inod number and not applying the test if inod == -1 can deal with that. 

J.

PS. apparently some versions of Windows have loops built in :-

[In response to a complaint about a recursive folder iteration not completing]

And here, this special folder is a Reparse-Point (a junction point) that points to ... its father.

So when an old program wants to write a file to ApplicationData, he will in fact writes the file in the parent's folder. This is because the whole profil organization has been changed in Vista.

So when you recursely browse, of course you should test the attribute in THIS SPECIAL CASE.

PPS. And also it appears, to get an honest view of the file system, it's necessary to wrap calls to FindFirstFile in Wow64DisableWow64FsRedirection.  Some horrible 32-bit/64-bit hack of Microsoft's.  Probably doesn't affect most applications mind you!