Blackfish Troubleshooting Research

Last tended: July 29, 2021

2021-08-13 - Gene Liverman lost nodes follow up

Don't recreate the UI
Just give enough information to "determine if I care"
Lost node alert
- Node name (may not have pretty name, depending on setup, so OS helpful?)
- OS (if not filtered by team)
- How long missing (if not determined by alert setting)
- Link
Can I include/exclude fields from webhook? Start with minimum as default.
Alerts based on teams (at minimum, Windows). Wants to know what he can rule in/out at a glance as not having to worry about.

X days thershold for lost nodes
Filter by kernel/team (ex: I only manage windows so only want alerts about them)
Include/exclude fields for webhook (do I want tons of data in alert or just name)
Alerting on a single fact not really helpful? Really need a compound filter (if this fact changes and this = X and this = Y, etc)

OS major version would be helpful
Can we store last report for that node?
- Entire last report + (if different) last report that contained an change (intentional/corrective)
- At most that stores 2 reports for every node. Could start analysing those across nodes. (ex: correlate when PE changed versions to when nodes stop reporting)
Anything older than 2 weeks, cannot view report in PE
A node might be lost but still connected. Can we check connection and show if it is connected/not connected to PE PXP connection? That changes troubleshooting options.
1. Top priority is to be able to ignore and retire nodes 2) Prefilter to remove/automatically mark as retired nodes whose certificates have been revoked in Puppet (CRL?).
When retiring, can we take an action to clean up certificate on puppet-side? Certificates which are never revoked have potential security implications: anything with a cert can potentially establish connection.
Might want to ignore but not retire (ex: they took a set of compiliers offline for 2 months and are now bringing some back online ready to go with certs and all. They were never 'lost' so don't want to see them in table).
Still need to be able to see hidden (retired/ignored) nodes.

Phrasing is key. Educated guesses are good, but don't act like this is 100% the problem.
Under promise, over deliver.
Wouldn't trust it to automatically take a corrective action.
First step when applying a fix: do a no-op run to see if it stops the error.
Could we kick off a task (needs to be task if it's outside of management) and default to no-op? Then once we've established trust, users might trust to kick off corrective tasks. "I'm more likely to give it a whirl if it's no-op"

A patch is a patch. There’s a million different ways you can apply it, but it’s still a patch.
End result: an updated piece of software is applied to the box
Really interested in patching

Relies on running a vulnerability scanner to validate has been fixed or not.
Sometimes there is a way of validating (looking for a particular key, package, etc.) but varies by patch. Remote SSH to validate. Validation comes from OS vendors (what to look for).
How do you check across multiple across entire system? Difficult if you don’t have PE or Bolt (to run same thing in multiple places). Also why vulnerability scanners are useful, to use it to validate and find servers.
If it’s something that’s managed by PE, can enforce a minimum version.

If something doesn’t standout, just try again (either Puppet run or taking that subset)
Will either fix it or spit out some info about why it’s not working.
==Rerunning is faster than root-causing. For critical vulnerabilities speed is more important.==

Having defaults is helpful.
Can’t imagine looking at packages based on facts. Going to use an interface that’s designed for this purpose (package inventory in PE or a scanner). — to me sounds like all packages, not specific nodes
Probably using whatever told me nodes don’t have it to start with it.
Not something I intuitively think about. It may take another puppet run before a patch shows up.
This is more for historical context, not checking if a patch is being applied
Not for real-time needs (can we compete on troubleshooting?), but useful for ‘when was something done’/audits.
Alternative source: Look at tickets and hope there is enough info there to explain historical change.

Being able to kick off another puppet run would be helpful here
Try to figure out why they were falling off in the first place
==Looks at logs to investigate guesses + previous last run. ==
In lost nodes case want to get them back online + root-cause
So many possibilities.
==Attempting to run again brings back useful error messages. Could we surface those automatically if we auto-ping?==
Likes is helpful because of longer retention history and dedicated to it (instead of a busy page)
Lost nodes changes “It wouldn’t hurt”
Biggest problem is that last puppet run report is no longer there because it’s not there. Having a copy of the last report would be helpful to pair with node. If it’s less than 2 weeks, it’s still in PE and I’m going to go look at the run report in PE.
Recently applied changes would also be helpful (did I have a successful/partly successful run before)

Free and Open Source Software - FOSS, open source puppet have some

For the "print nightmare" vulnerability which machines had that resource and for which machines it was changed?
Before I deploy this module to my fleet how will it affect the CPU of those nodes? (CD4PE)

Who owns 40% of our VMs and which applications access them. Can we decommission these?
Which nodes/VMs are not managed by Puppet? What will happen if we release on these servers which are not yet managed?

Our current, 90-day, exports of PDB data in Splunk are limited. What is the context behind the CRQ closed on this date?
We are strong Windows users and need to know which user applied a patch, was it SCCM?
We want to group nodes using Trusted Facts, such as the "Puppet Role"

We want to group nodes using "intelligent server names" so that the correct support team can address issues with those nodes.