When you train an AI model, the accuracy of the resulting application depends on the quality of the training data it receives. Feeding the model more data than it needs is costly, and feeding it too little produces a poor model. When you use AI, you want results quickly and at minimal cost, and the best way to get them is to supply only the data the task needs. Yet given the size of unstructured data (multiple petabytes in most enterprises) and its distribution across storage silos, it’s difficult to curate and segment specific data sets.
Enter metadata, which is data about data. Metadata is created automatically by storage technologies and offers better insights on your data, such as: who owns the data, what file type it is, where it lives, who accessed it and so on. This system-level information is extremely useful for managing data, but it lacks the additional context that users and applications often have.
Additional metadata can enrich this information. Examples include tagging data by its contents (clinical images showing breast cancer versus pancreatic cancer, or images of celebrities or alumni), tagging sensitive information, and tagging data related to a project, geography or demographic (research on females in the Northeast region) or to a particular initiative (manufacturing test data from product X in 2022). Metadata brings structure to unstructured data, which can vastly aid the effort of finding the right data for use in AI tools.
Managing and enriching metadata is a time-consuming process that requires collaboration between IT and the departments that work with the data (data scientists and data owners) to tag data accurately. Tagging adds metadata to your file data in the form of key-value pairs, which give context to your data. One example of multiple tags on a file is: Country = US, Project ID = 123, HIPAA = TRUE. Yet manually tagging large data sets is virtually impossible. Machine learning-based automation will play a growing and important role in these efforts. Here’s how:
Enriching metadata is much more effective with a data management system that can persist that information no matter where the data lives. This way, you do not have to run the AI/ML algorithm repeatedly each time you need the additional context. The enriched metadata lives as long as the data lives. A storage-agnostic data management system can maintain an index of this metadata as your data moves from one storage system to another and provides a simple way to search, curate and extract the right data based on this enhanced metadata.
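To make the idea of persistent, storage-agnostic tags concrete, here is a minimal sketch in Python. It is an in-memory illustration only, not how any particular data management product works. The `MetadataIndex` and `FileRecord` names, the `file_id` values and the storage URIs are all hypothetical. The one point it shows is that tags are stored against a stable file identifier rather than a path, so they survive a move between storage systems.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One indexed file. file_id is a hypothetical stable identifier."""
    file_id: str
    location: str
    tags: dict[str, str] = field(default_factory=dict)


class MetadataIndex:
    """In-memory stand-in for a storage-agnostic metadata index."""

    def __init__(self) -> None:
        self._records: dict[str, FileRecord] = {}

    def add(self, file_id: str, location: str) -> FileRecord:
        record = FileRecord(file_id, location)
        self._records[file_id] = record
        return record

    def tag(self, file_id: str, tags: dict[str, str]) -> None:
        # Enrichment is stored against the file ID, not the path.
        self._records[file_id].tags.update(tags)

    def move(self, file_id: str, new_location: str) -> None:
        # Only the location changes; the tags stay attached.
        self._records[file_id].location = new_location

    def search(self, criteria: dict[str, str]) -> list[FileRecord]:
        # Return records whose tags match every key-value pair given.
        return [
            r for r in self._records.values()
            if all(r.tags.get(k) == v for k, v in criteria.items())
        ]


index = MetadataIndex()
index.add("scan-001", "nas://site-a/imaging/scan-001.dcm")
index.tag("scan-001", {"Country": "US", "Project ID": "123", "HIPAA": "TRUE"})

# Simulate a migration to another storage system (made-up URIs).
index.move("scan-001", "s3://archive-bucket/imaging/scan-001.dcm")

matches = index.search({"Project ID": "123", "HIPAA": "TRUE"})
```

After the move, the search still finds the file by its tags, which is the behavior the paragraph above describes. A real system would also need durable storage and a way to detect moves, both of which are out of scope here.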
Name an industry and you can imagine how metadata augmentation can deliver powerful benefits. Let’s look at the auto sector. Electric and autonomous vehicles collect large quantities of sensor data, which helps the car adjust and take actions on the fly or issue alerts to the driver. Analyzing this data is extremely valuable to manufacturers for product enhancements and customer behavior analysis.
Using an unstructured data management system, a car manufacturer could create a workflow like this:
Here are other examples:
A metadata augmentation project can get out of hand quickly. If you create too many new tags, you must store and manage them appropriately to avoid performance issues with user access. Most IT organizations will need to implement automation for metadata management, given the volume and variety of metadata today.
It’s best to use software that combines queries and tags. Queries deliver results for common inquiries such as: “Show me all data owned by this department that has been accessed in the last six months.” Users can create custom queries based on the available metadata. Tags are not needed to save these queries; they are used only to enrich the available metadata through machine learning or user-driven input. This query-plus-tag approach maximizes efficiency, saves time and helps avoid tag proliferation.
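The query-plus-tag idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any product’s query language. The record layout, the sample paths and the `query` function are hypothetical, and “six months” is approximated as 182 days. System metadata (department, last access time) answers the common question directly, and a tag filter is layered on only when extra context is needed.

```python
from datetime import datetime, timedelta

# Hypothetical sample records: system metadata plus optional tags.
records = [
    {"path": "/finance/q1-report.xlsx", "department": "finance",
     "last_accessed": datetime(2024, 5, 1), "tags": {}},
    {"path": "/finance/claims.csv", "department": "finance",
     "last_accessed": datetime(2024, 4, 10), "tags": {"HIPAA": "TRUE"}},
    {"path": "/finance/old-budget.xlsx", "department": "finance",
     "last_accessed": datetime(2023, 9, 15), "tags": {}},
    {"path": "/research/trial.csv", "department": "research",
     "last_accessed": datetime(2024, 6, 1), "tags": {"HIPAA": "TRUE"}},
]


def query(records, department, now, days=182, tags=None):
    """Paths owned by a department and accessed within `days` of `now`.

    If `tags` is given, a record must also match every key-value pair.
    """
    cutoff = now - timedelta(days=days)
    wanted = tags or {}
    return [
        r["path"] for r in records
        if r["department"] == department
        and r["last_accessed"] >= cutoff
        and all(r["tags"].get(k) == v for k, v in wanted.items())
    ]


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 6, 30)
recent_finance = query(records, "finance", now)
recent_finance_hipaa = query(records, "finance", now, tags={"HIPAA": "TRUE"})
```

The first query needs no tags at all, which is why a query-first approach keeps the tag count down. The second narrows the same result using a single tag.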
It’s also wise to be selective about metadata augmentation. Even with the help of machine learning tools and other systems, it takes time and resources to curate the right data for enrichment, monitor the results for accuracy and safeguard the data from misuse. It also takes work with data stakeholders to ensure that the added metadata serves their needs rather than making an AI project more complex or producing false or inaccurate findings. Yet by spending time and using the right tools and resources to understand and properly leverage metadata, IT leaders and data stakeholders can lay the groundwork for a stronger, more relevant AI and big data analytics program.