In the previous post in this series, I talked about how you might begin organizing and migrating large collections of files into a content management system. When you start to manage your content in a controlled system, one of the terms you will hear and read a lot about is metadata.
The Value of Metadata
Metadata is data about data. In our context, it can be thought of as labels or tags applied to our files to make them easier to find, easier to organize, and easier to manage. One of the big barriers to adoption of content management systems is the need to label and tag all the files as they are submitted – this can be cumbersome for submitters, and particularly so for people or processes that bulk upload files to the system.
In the past, the way around this was to assign some fields automatically (username of uploader, file name, date of upload, etc.), but this still left fields that needed to be manually assigned – things like security group, class, department, etc. The problem with bulk uploads is that content had to be either uploaded in small batches, each with the same tags applied, or loaded with blank values and then handed to a person to tag each file.
You Already Have Metadata
If you have files, you already have metadata. All files stored on a Windows file system (NTFS; we will focus on this as it’s the most common, but most file systems store similar data) have the following metadata tags applied to them:
- File Name
- Security descriptor
- Position in folder hierarchy
- Author or owner of the file
- File extension (which maps to file type in the OS)
- File size
- When the file was created
- When it was last updated
- When it was last accessed
- Previous versions
So, you already have quite a bit of information about your file.
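To make this concrete, here is a minimal sketch of reading several of those fields with Python's standard library. The function name `basic_metadata` and the dictionary keys are my own choices for illustration, not part of any content management product. The owner and security descriptor are left out because reading them needs platform-specific APIs.

```python
from datetime import datetime, timezone
from pathlib import Path


def basic_metadata(path: Path) -> dict:
    """Collect metadata the file system already stores for one file."""
    info = path.stat()

    def as_utc(timestamp: float) -> datetime:
        # Convert a raw epoch timestamp into a timezone-aware datetime.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)

    return {
        "file_name": path.name,
        "extension": path.suffix.lower(),
        "folder": str(path.parent),
        "size_bytes": info.st_size,
        # st_ctime has traditionally meant creation time on Windows;
        # on most Unix systems it is the last metadata change instead.
        "created_or_changed": as_utc(info.st_ctime),
        "last_updated": as_utc(info.st_mtime),
        "last_accessed": as_utc(info.st_atime),
    }
```

Calling `basic_metadata(Path("some/file.docx"))` on an existing file returns a dictionary you could map onto the fields of whatever system you are importing into.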
Deriving Additional Metadata from These Standard Fields
The area where it’s easiest to do this is the existing folder hierarchy. Most shared file systems start with some good intentions and organization and then tend to decay from that point on. But people do (generally) use the folder hierarchy to organize files. Say you have a folder hierarchy like this:
Shared File System Home
System access policy v1.2.docx
Server procurement policy.docx
Joe Resume 11-24-15.docx
Security Policy draft.docx
In this example, then, you can use the hierarchy to derive department = IT, location = Boulder, document type = Policies.
This won’t work for all your content; as the example shows, some may be misfiled and you’ll need to do a manual sort or sanity check before or after uploading. You may also find folders in the hierarchy that are not organized – the dreaded “needs sorting” container. But this will work for some of your data and remove the need for manual tagging in this case – it will always be an iterative process.
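The folder-to-tag idea can be sketched in a few lines of Python. This assumes a hypothetical layout of `<location>/<department>/<document type>/<file>`; your own hierarchy will almost certainly differ, so treat `LEVELS` and `tags_from_path` as placeholders. Files that do not sit at the expected depth (the “needs sorting” case) are flagged for manual review rather than tagged.

```python
from pathlib import PurePosixPath

# Hypothetical folder layout: <location>/<department>/<document type>/<file>
LEVELS = ("location", "department", "document_type")


def tags_from_path(relative_path: str) -> dict:
    """Derive tags from the folders above a file, relative to the share root."""
    # Accept Windows-style separators as well as forward slashes.
    parts = PurePosixPath(relative_path.replace("\\", "/")).parts
    folders = parts[:-1]  # everything except the file name

    # Anything filed too shallow or too deep is left for a person to check.
    if len(folders) != len(LEVELS):
        return {"needs_review": True}

    return {**dict(zip(LEVELS, folders)), "needs_review": False}
```

Note that a path-based rule like this cannot spot a misfiled document such as the resume in the example; that still needs the manual sanity check.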
Now look at the other fields and see how you can use these to assign other metadata: security groups often indicate department, location, or functional areas; timestamps can be used to exclude content from import (whether creation date, update date, or last accessed date); file types can be used to organize content – particularly when they are specialized types (not so useful for .xlsx or .docx).
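As a rough illustration of those rules, the sketch below combines an age cutoff, a security-group lookup, and a file-type lookup. Every value in it is an assumption made up for the example: the group names, the seven-year cutoff, the single specialized type, and the function name `plan_import`.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookups; replace with your own groups and file types.
GROUP_TO_DEPARTMENT = {"IT-Staff": "IT", "HR-Staff": "Human Resources"}
SPECIALIZED_TYPES = {".dwg": "CAD drawing"}
MAX_AGE = timedelta(days=7 * 365)  # assumed cutoff, not a recommendation


def plan_import(extension: str, security_group: str,
                last_updated: datetime, now: datetime) -> dict:
    """Decide whether to import a file and which tags can be assigned for free."""
    # Timestamp rule: skip content that has not been updated within the cutoff.
    if now - last_updated > MAX_AGE:
        return {"import": False, "tags": {}}

    tags = {}
    # Security group rule: a known group implies a department.
    if security_group in GROUP_TO_DEPARTMENT:
        tags["department"] = GROUP_TO_DEPARTMENT[security_group]
    # File type rule: only specialized extensions tell us anything useful.
    content_type = SPECIALIZED_TYPES.get(extension.lower())
    if content_type:
        tags["content_type"] = content_type
    return {"import": True, "tags": tags}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test against a fixed date.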
Hopefully this is helpful – the main goal is to show you that you already have many of the tags needed to import content into a managed system. You may choose to add more at a later date, but don’t let the metadata question prevent you from starting to get better control of your content.
If you do need more detailed metadata / content tagging, there are a number of options available to you. Historically, the only way to get accurate tags was for knowledgeable readers (those with experience in the areas covered by the document) to manually assign metadata. However, with recent improvements in textual analysis and machine learning, there are now automated methods for assigning metadata based on rules and defined taxonomies and ontologies. TEAM has partnered with SmartLogic and M-Files to work on these and I’ll be following up with a blog post on this topic early next year (2020).
As always, the TEAM Informatics Content Advisory practice would be happy to help you with planning the process of migrating content to a management system and labelling and tagging that content, and our other practices can help you undertake the project. If you have immediate needs to discuss autoclassification and ML autotagging and can’t wait till the spring blog post, please reach out and I’d be happy to talk through the options.
Next in the series – can I use SharePoint as my records management platform?