Discovery, Not All Tools Are the Same

Written by
Matt Brown

A group of passionate Technologists, Consultants, and Trusted Disruptors focused on the maelstrom that is Cloud services and the IT industry. Get decisions close to the data, be disruptive, and design for cloud at scale. You've been warned.

It’s not a secret that any experienced migration consultancy is going to recommend performing some level of electronic discovery for your project.  It’s the initial step in every migration methodology, and it’s hard to plan adequately and execute flawlessly without a thorough understanding of the source environment.  Furthermore, there’s no shortage of tools on the market today to aid with discovery.  In this article, we will discuss the various electronic discovery methods and the situations in which each applies.

Let’s start with Automated Asset Discovery 

Agent-Based 

First things first…let’s start with the simple truth that the “agent” in any agent-based tool does not perform automated discovery.  It can collect data, but you must already know where to deploy the agent for it to work.

Subnet Scan + Credentialed Discovery 

Most truly automatic asset discovery tools perform two levels of discovery.  They start with an initial sweep of a subnet to gather the host name and operating system of the asset responding to each IP address on that subnet.  Using that information, the tool then executes a credentialed scan to collect configuration details on those assets.  Most of these tools are deployed as a virtual appliance and require access to all network segments; some instead deploy a collector in each segment that communicates with a centralized analysis engine.

Subnet Scan 

This does require the client to provide a list of active subnets.  Once the tool is deployed, the initial scan can generally be run without concern, as most tools produce negligible host or network load.  The objective of this scan is to capture a baseline set of assets residing on each subnet and to prepare for credentialed discovery.
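
To make the sweep stage concrete, here is a minimal sketch in Python of what a subnet sweep is doing conceptually. It is a hypothetical example, not any vendor’s implementation: it walks every address in a subnet, attempts a reverse DNS lookup, and probes a common management port (an assumption; real tools lean on ICMP, ARP, SNMP, and OS fingerprinting to identify hosts and operating systems).

```python
# Minimal sketch of a subnet sweep: enumerate a subnet, try reverse DNS,
# and probe a well-known port to see whether a host answers.
# Hypothetical example only -- real tools use ICMP, ARP, SNMP, and OS
# fingerprinting rather than a single TCP probe.
import ipaddress
import socket

def sweep_subnet(cidr: str, probe_port: int = 22, timeout: float = 0.5):
    """Yield (ip, hostname, responded) for every address in the subnet."""
    for addr in ipaddress.ip_network(cidr, strict=False).hosts():
        ip = str(addr)
        try:
            hostname = socket.gethostbyaddr(ip)[0]   # reverse DNS, if registered
        except OSError:
            hostname = None
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            responded = sock.connect_ex((ip, probe_port)) == 0
        yield ip, hostname, responded

if __name__ == "__main__":
    for ip, hostname, up in sweep_subnet("10.0.10.0/24"):
        if up or hostname:
            print(f"{ip:15} {hostname or '-':40} {'responding' if up else 'dns only'}")
```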

Credentialed Discovery 

The objective here is to gather configuration information on each asset found on the subnet.  In order to collect that information, the tool needs to log into the asset, typically with a service account granted Local Admin privileges on Windows hosts or sudo access on Unix workloads.  While most tools that perform this activity create only 1-2% CPU overhead (similar to an admin logging into a server), it is recommended to follow a change management process and schedule these scans around other deployments to the assets.
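
For illustration, a credentialed collection pass might look something like the sketch below. It assumes a Linux target, a hypothetical svc_discovery service account and key path, and the third-party paramiko SSH library; commercial tools gather far more detail and use WinRM/WMI for Windows hosts.

```python
# Minimal sketch of a credentialed scan over SSH, assuming a hypothetical
# "svc_discovery" service account with sudo rights and the third-party
# paramiko library. Real tools collect far more (installed packages, storage,
# running services) and handle Windows hosts over WinRM/WMI instead of SSH.
import paramiko

COMMANDS = {
    "os_release": "cat /etc/os-release",
    "cpu_count":  "nproc",
    "memory":     "grep MemTotal /proc/meminfo",
    "disks":      "lsblk -d -o NAME,SIZE,TYPE",
}

def collect_config(host: str, username: str = "svc_discovery",
                   key_filename: str = "/etc/discovery/id_discovery") -> dict:
    """Log in to one host and return a dictionary of configuration facts."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, key_filename=key_filename, timeout=10)
    facts = {}
    try:
        for name, cmd in COMMANDS.items():
            _stdin, stdout, _stderr = client.exec_command(cmd)
            facts[name] = stdout.read().decode().strip()
    finally:
        client.close()
    return facts
```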

Also, don’t forget that for highly virtualized environments (above 80%), configuration information for each guest can be extracted directly from the hypervisor and can be just as accurate as any credentialed tool, without the need to scan.
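
As a sketch of that hypervisor-based approach, the example below assumes a VMware vCenter and the pyVmomi SDK with read-only credentials; other hypervisors expose similar inventory APIs.

```python
# Minimal sketch of pulling guest inventory straight from a VMware vCenter,
# assuming the pyVmomi SDK and read-only credentials.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def list_guests(vcenter: str, user: str, pwd: str):
    # Lab-only TLS settings; verify certificates properly in production.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    si = SmartConnect(host=vcenter, user=user, pwd=pwd, sslContext=context)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            summary = vm.summary
            yield {
                "name": summary.config.name,
                "guest_os": summary.config.guestFullName,
                "vcpus": summary.config.numCpu,
                "memory_mb": summary.config.memorySizeMB,
                "power_state": str(summary.runtime.powerState),
            }
    finally:
        Disconnect(si)

if __name__ == "__main__":
    for guest in list_guests("vcenter.example.com", "readonly@vsphere.local", "secret"):
        print(guest["name"], guest["guest_os"], guest["vcpus"], "vCPU",
              guest["memory_mb"], "MB", guest["power_state"])
```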

Once you’ve completed the subnet scans and credentialed discovery, the next step is to compare the results against the physical inventory to compile a Master Inventory.
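
A simple way to picture that reconciliation step: if both the discovery output and the physical inventory can be exported to CSV keyed by hostname (the file and column names below are hypothetical), a few lines of Python surface the gaps in either direction.

```python
# Minimal sketch of reconciling discovered assets against a physical inventory,
# assuming both can be exported to CSV with a "hostname" column (hypothetical
# file names and columns). The goal is to surface gaps in either direction.
import csv

def load_hostnames(path: str) -> set:
    with open(path, newline="") as f:
        return {row["hostname"].strip().lower() for row in csv.DictReader(f)}

discovered = load_hostnames("discovered_assets.csv")   # subnet + credentialed scans
physical   = load_hostnames("physical_inventory.csv")  # CMDB / asset register export

print("In inventory but never discovered (possibly offline or on a missed subnet):")
print(sorted(physical - discovered))

print("Discovered but not in inventory (candidates for CMDB cleanup):")
print(sorted(discovered - physical))

print("Confirmed in both sources (master inventory baseline):",
      len(discovered & physical), "assets")
```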

Now let’s review Interdependency Mapping 

A thorough interdependency mapping will define asset-to-asset, application-to-asset, database-to-asset, database-to-application, and application-to-service relationships.  The goal should not be just to learn that a dependency exists, but to understand the frequency of the communication, the size of the workload, the protocols being used, and the patterns that emerge.
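
Before looking at the techniques, it helps to picture what a single dependency record should capture, regardless of which technique produced it. The sketch below is illustrative only; the field names are not any particular tool’s schema.

```python
# Minimal sketch of what a single interdependency record might capture.
# Field names are illustrative, not any particular tool's schema.
from dataclasses import dataclass
from enum import Enum

class RelationshipType(Enum):
    ASSET_TO_ASSET = "asset-to-asset"
    APP_TO_ASSET = "application-to-asset"
    DB_TO_ASSET = "database-to-asset"
    DB_TO_APP = "database-to-application"
    APP_TO_SERVICE = "application-to-service"

@dataclass
class Dependency:
    source: str                 # e.g. application server hostname
    target: str                 # e.g. database server hostname
    relationship: RelationshipType
    protocol: str               # e.g. "TCP/1433"
    calls_per_hour: float       # frequency of the communication
    avg_payload_bytes: int      # size of the workload
    observed_pattern: str       # e.g. "nightly batch", "constant chatter"
```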

There are many products that perform some type of mapping; however, all of them follow one or more of three techniques: Flow Data Aggregation, Agent-Based, and Service Call.

Flow Data Aggregation 

Many legacy tools collect packet header information from network devices that support flow data.  These tools do not typically store client data, only the packet header, which shows source IP, destination IP, frequency, packet size, and protocol used.  The collectors are generally very easy to deploy and begin collecting data immediately.  From there, resources begin categorizing assets into host groups based on the known application components.  Tools like this, however, are quickly becoming antiquated because they can struggle to capture east-west traffic within virtual and highly shared environments.  While this approach requires more manual effort, it is a great fit for environments with a low virtualization ratio (below 50%).
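
If the collector can export its records (the CSV layout below is an assumption), rolling flow data up into dependency edges is straightforward, which is part of why this technique has been around for so long.

```python
# Minimal sketch of aggregating flow records into dependency edges, assuming
# the collector can export records to CSV with these (hypothetical) columns:
# src_ip, dst_ip, dst_port, protocol, bytes.
import csv
from collections import defaultdict

def summarize_flows(path: str):
    """Roll up individual flow records into (src, dst, port, proto) edges."""
    edges = defaultdict(lambda: {"flows": 0, "bytes": 0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["src_ip"], row["dst_ip"], row["dst_port"], row["protocol"])
            edges[key]["flows"] += 1
            edges[key]["bytes"] += int(row["bytes"])
    return edges

if __name__ == "__main__":
    for (src, dst, port, proto), stats in sorted(
            summarize_flows("flow_export.csv").items(),
            key=lambda item: item[1]["bytes"], reverse=True):
        print(f"{src} -> {dst}:{port}/{proto}  "
              f"{stats['flows']} flows, {stats['bytes']} bytes")
```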

Agent-Based 

Agent-based tools are typically the most comprehensive interdependency mapping tools on the market.  Many also maintain a database of common COTS application signatures and can auto-detect some applications, which minimizes the amount of manual manipulation required.  The agents are deployed on each server and typically work by watching one process (PID) call another, which yields very granular dependency and utilization data.  However, there are two main hindrances that lead many clients to avoid using them for temporary purposes.

Security Vetting: With stricter security requirements, many IT organizations need to thoroughly test the agent and the tool under a variety of measures before deploying them into the environment.  Depending on the client’s security policy, this can take up to a couple of months, and many clients don’t see the value in going through that cycle for a temporary implementation.

Deployment and Tuning: Client resources must package and deploy the agent during change windows, so it can be a lengthy process before data collection begins.  Furthermore, once the agent is deployed, the tools typically require tuning to gather the appropriate information, which often calls for specialized talent and additional consulting resources.

While there are many advantages to using agent-based tools for interdependency mapping, we typically find that the only time it makes financial or timeline sense is when a client already has one deployed and minimal configuration/tuning is required.   
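
Setting the business case aside, the PID-watching mechanic itself is easy to picture. The sketch below uses the third-party psutil library to map established TCP connections back to their owning processes on a single host, which is a rough, local-only approximation of what a real agent samples continuously.

```python
# Minimal sketch of the PID-level observation an agent performs locally,
# using the third-party psutil library: map established TCP connections
# back to the owning process so asset-to-asset edges carry a process name.
# A real agent also samples over time and correlates parent/child PID calls.
import psutil

def snapshot_connections():
    """Return (process_name, pid, local, remote) for established TCP sessions."""
    edges = []
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status != psutil.CONN_ESTABLISHED or conn.pid is None:
            continue
        try:
            name = psutil.Process(conn.pid).name()
        except psutil.NoSuchProcess:
            continue
        edges.append((name, conn.pid,
                      f"{conn.laddr.ip}:{conn.laddr.port}",
                      f"{conn.raddr.ip}:{conn.raddr.port}"))
    return edges

if __name__ == "__main__":
    for name, pid, local, remote in snapshot_connections():
        print(f"{name} (pid {pid}): {local} -> {remote}")
```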

Service Call 

These tools act similarly to agent-based tools, but there is nothing to deploy.  Most of them work by running a script package on each host on demand, tracking PID-level calls and auto-detecting common COTS application signatures.  Furthermore, these tools can also collect configuration information just like the credentialed inventory tools, which reduces both the number of tools required to gather asset and dependency information and the overall time required to perform a thorough electronic discovery.
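
As a rough illustration of the service call model, the sketch below runs a single command over SSH and returns the host’s established-socket table with the owning process attached to each connection. It assumes the paramiko library, the same hypothetical svc_discovery account used earlier, and a Linux target with the ss utility; real tools run a broader script package and parse the results centrally.

```python
# Minimal sketch of the "service call" model: nothing is installed on the
# target; a short command runs over an existing SSH session and the output
# is parsed centrally. Assumes the paramiko library, a hypothetical
# svc_discovery account, and a Linux target with `ss` available.
import paramiko

def remote_connection_snapshot(host: str, username: str = "svc_discovery",
                               key_filename: str = "/etc/discovery/id_discovery") -> str:
    """Run `ss -tnp` on the remote host and return its established-socket table."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, key_filename=key_filename, timeout=10)
    try:
        _stdin, stdout, _stderr = client.exec_command("ss -tnp state established")
        return stdout.read().decode()
    finally:
        client.close()

# Each line pairs a local/remote socket with the owning process -- the same
# PID-level evidence an agent would collect, just gathered on demand.
if __name__ == "__main__":
    print(remote_connection_snapshot("app01.example.com"))
```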

Now that you have a solid understanding of the techniques used by all discovery tools, let’s take a look at the potential use cases for each: 

Asset Discovery 

Subnet Scan = when there is a low virtualization ratio (less than 50%) and detailed configuration information is not required 

Credentialed Inventory = Anytime there are standalone physical servers in the environment and detailed configuration information is required 

Interdependency Mapping 

Flow Data Aggregation = Can be used when there is a low virtualization ratio (less than 50%); however, a Service Call tool can also be used in this situation and will provide more granular data at similar pricing 

Agent-Based  = Can be used in any situation, but typically the only time it makes financial or timeline sense is when a client already has it thoroughly deployed.   

Service Call = Preferred for all environment types and typically the best fit for highly virtualized and heavily segmented architectures 
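
Pulling those rules of thumb together, here is a small helper that encodes the thresholds above (a 50% virtualization ratio and whether an agent is already deployed). Treat it as a conversation starter rather than a product selector; the function and its inputs are purely illustrative.

```python
# Encodes the article's rules of thumb for choosing discovery techniques.
# Illustrative only -- every environment deserves its own assessment.
def recommend_tools(virtualization_ratio: float,
                    need_detailed_config: bool,
                    has_physical_servers: bool,
                    agent_already_deployed: bool) -> dict:
    asset_discovery = []
    if virtualization_ratio < 0.5 and not need_detailed_config:
        asset_discovery.append("Subnet Scan")
    if has_physical_servers and need_detailed_config:
        asset_discovery.append("Credentialed Inventory")

    if agent_already_deployed:
        dependency_mapping = "Agent-Based (already in place, minimal tuning required)"
    elif virtualization_ratio < 0.5:
        dependency_mapping = "Service Call (or Flow Data Aggregation)"
    else:
        dependency_mapping = "Service Call"

    return {"asset_discovery": asset_discovery,
            "interdependency_mapping": dependency_mapping}

print(recommend_tools(virtualization_ratio=0.85, need_detailed_config=True,
                      has_physical_servers=True, agent_already_deployed=False))
```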

We invite you to continue learning about all the facets of environmental discovery in the next article in this series, Context Gathering.