This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.

  • breakingcups@lemmy.world
    link
    fedilink
    English
    arrow-up
    1
    ·
    4 months ago

    Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.

    Not saying the rest of your post is wrong, but this stood out as easily glossed over.

    • LrdThndr@lemmy.world
      link
      fedilink
      English
      arrow-up
      0
      ·
      edit-2
      4 months ago

      A decade ago I worked for a regional chain of gyms with locations in 4 states.

      I was in TN. When a system would go down in SC or NC, we originally had three options:

      1. (The most common) have them put it in a box and ship it to me.
      2. I go there and fix it (rare)
      3. I walk them through fixing it over the phone (fuck my life)

      I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little optiplex 160s running a software client that I shipped to each club. Then each machine at each club was configured to PXE boot from the fog client.

      The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

      If everything was okay, it would chain the boot to the os on the machine. But I could flag a machine for reimage and at next boot, the machine would check in with the local FOG client via PXE and get a complete reimage from premade images on the fog server.

      The corporate office was physically connected to one of the clubs, so I trialed the software at our adjacent club, and when it worked great, I rolled it out company wide. It was a massive success.

      So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the os didn’t matter at all. It never loaded the os when it was flagged for reimage. It would even join the computer to the domain and set up that locations printers and everything. All I had to tell the low-tech gymbro sales guy on the phone to do was reboot it.

      This was free software. It saved us thousands in shipping fees alone. And brought our time to fix down from days to minutes.

      There ARE options out there.

        • LrdThndr@lemmy.world
          link
          fedilink
          English
          arrow-up
          0
          ·
          edit-2
          4 months ago

          How would it not have? You got an office or field offices?

          “Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

          The damn thing’s even boot looping, so you don’t even have to reboot it.

          I’m sure the user saved all their data in one drive like they were supposed to, right?

          I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

          But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

          But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

          • Brkdncr@lemmy.world
            cake
            link
            fedilink
            English
            arrow-up
            0
            ·
            4 months ago

            Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.

            Or your imaging process over the wan takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is only ssd and imaging slows down once you get 10+ going at once.

            I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the crowdstike outage, what crowdstrike even is beyond antivirus, and the workflow needed to recover from it.

            • John Richard@lemmy.worldOP
              link
              fedilink
              English
              arrow-up
              0
              arrow-down
              1
              ·
              4 months ago

              Imaging environment down? If a sysadmin can’t figure out how to boot a machine into recovery to remove the bad update file then they have bigger problems. The fix in this instance wasn’t even re-imaging machines. It was merely removing a file. Ideal DR scenario would have a recovery image already on the system that can be booted into remotely, so there is minimal strain on the network. Furthermore, we don’t live in dial-up age anymore.

      • John Richard@lemmy.worldOP
        link
        fedilink
        English
        arrow-up
        0
        arrow-down
        1
        ·
        4 months ago

        Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.

    • John Richard@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      0
      arrow-down
      1
      ·
      4 months ago

      I’d issue IPMI or remote management commands to reboot the machines. Then I’d boot into either a Linux recovery environment (yes, Linux can unlock BitLocker-encrypted drives) or a WinPE (or Windows RE) and unlock the drives, preferably already loaded on the drives, but could have them PXE boot - just giving ideas here, but ideal DR scenario would have an environment ready to load & PXE would cause delays.

      I’d either push a command or script that would then remove the update file that caused the issue & then reboots. Having planned for a scenario like this already, total time to fix would be less than 2 hours.

  • computergeek125@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    4 months ago

    Getting production servers back online with a low level fix is pretty straightforward if you have your backup system taking regular snapshots of pet VMs. Just roll back a few hours. Properly managed cattle, just redeploy the OS and reconnect to data. Physical servers of either type you can either restore a backup (potentially with the IPMI integration so it happens automatically), but you might end up taking hours to restore all data, limited by the bandwidth of your giant spinning rust NAS that is cost cut to only sustain a few parallel recoveries. Or you could spend a few hours with your server techs IPMI booting into safe mode, or write a script that sends reboot commands to the IPMI until the host OS pings back.

    All that stuff can be added to your DR plan, and many companies now are probably planning for such an event. It’s like how the US CDC posted a plan about preparing for the zombie apocalypse to help people think about it, this was a fire drill for a widespread ransomware attack. And we as a world weren’t ready. There’s options, but they often require humans to be helping it along when it’s so widespread.

    The stinger of this event is how many workstations were affected in parallel. First, there do not exist good tools to be able to cover a remote access solution at the firmware level capable of executing power controls over the internet. You have options in an office building for workstations onsite, there are a handful of systems that can do this over existing networks, but more are highly hardware vendor dependent.

    But do you really want to leave PXE enabled on a workstation that will be brought home and rebooted outside of your physical/electronic perimeter? The last few years have showed us that WFH isn’t going away, and those endpoints that exist to roam the world need to be configured in a way that does not leave them easily vulnerable to a low level OS replacement the other 99.99% of the time you aren’t getting crypto’d or receive a bad kernel update.

    Even if you place trust in your users and don’t use a firmware password, do you want an untrained user to be walked blindly over the phone to open the firmware settings, plug into their router’s Ethernet port, and add https://winfix.companyname.com as a custom network boot option without accidentally deleting the windows bootloader? Plus, any system that does that type of check automatically at startup makes itself potentially vulnerable to a network-based attack by a threat actor on a low security network (such as the network of an untrusted employee or a device that falls into the wrong hands). I’m not saying such a system is impossible - but it’s a super huge target for a threat actor to go after and it needs to be ironclad.

    Given all of that, a lot of companies may instead opt that their workstations are cattle, and would simply be re-imaged if they were crypto’d. If all of your data is on the SMB server/OneDrive/Google/Nextcloud/Dropbox/SaaS whatever, and your users are following the rules, you can fix the problem by swapping a user’s laptop - just like the data problem from paragraph one. You just have a team scale issue that your IT team doesn’t have enough members to handle every user having issues at once.

    The reality is there are still going to be applications and use cases that may be critical that don’t support that methodology (as we collectively as IT slowly try to deprecate their use), and that is going to throw a Windows-sized monkey wrench into your DR plan. Do you force your uses to use a VDI solution? Those are pretty dang powerful, but as a Parsec user that has operated their computer from several hundred miles away, you can feel when a responsive application isn’t responding quite fast enough. That VDI system could be recovered via paragraph 1 and just use Chromebooks (or equivalent) that can self-reimage if needed as the thin clients. But would you rather have annoyed users with a slightly less performant system 99.99% of the time or plan for a widespread issue affecting all system the other 0.01%? You’re probably already spending your energy upgrading from legacy apps to make your workstations more like cattle.

    All in trying to get at here with this long winded counterpoint - this isn’t an easy problem to solve. I’d love to see the day that IT shops are valued enough to get the budget they need informed by the local experts, and I won’t deny that “C-suite went to x and came back with a bad idea” exists. In the meantime, I think we’re all going to instead be working on ensuring our update policies have better controls on them.

    As a closing thought - if you audited a vendor that has a product that could get a system back online into low level recovery after this, would you make a budget request for that product? Or does that create the next CrowdStruckOut event? Do you dual-OS your laptops? How far do you go down the rabbit hole of preparing for the low probability? This is what you have to think about - you have to solve enough problems to get your job done, and not everyone is in an industry regulated to have every problem required to be solved. So you solve what you can by order of probability.

    • John Richard@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      0
      arrow-down
      1
      ·
      4 months ago

      I upvoted because you actually posted technical discussion and details that are accurate. PXE and remote power management is the way. Most workstation BIOS will have IPMI functionality already included. I agree thought that being that these are remote endpoints, it can be more challenging. Having a script to reboot their endpoints into a recovery environment though would be a basic step though in any DR scenario. Mounting the OS partition to delete a file & reboot wouldn’t be a significant endeavor, although one that they’d need to make sure they got right. Still though, it would be hard to mess up for anyone with intermediate computer skills… and you’d hope these companies at least have someone trained to do that rather quickly. They’d have to spend more time writing up a CR explaining all the steps, and then joining a conference call with like 100 people with babies crying in the background… and managers insisting they remain on the call while they write the script.

  • Leeks@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    4 months ago

    bloated IT budgets

    Can you point me to one of these companies?

    In general IT is run as a “cost center” which means they have to scratch and save everywhere they can. Every IT department I have seen is under staffed and spread too thin. Also, since it is viewed as a cost, getting all teams to sit down and make DR plans (since these involve the entire company, not just IT) is near impossible since “we may spend a lot of time and money on a plan we never need”.

    • John Richard@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      0
      arrow-down
      1
      ·
      4 months ago

      With most corporations, especially Fortune 500s… audit their budgets. The problem doesn’t start with IT. but with bad management from top down. This “cost center” you speak of is mostly what I’d expect to hear do-nothing middle-level managers tell their in-house employees when asking for a raise.

      • Leeks@lemmy.world
        link
        fedilink
        English
        arrow-up
        0
        ·
        4 months ago

        It feels like you have an agenda that you are trying to apply to the CrowdStrike event and just so happen to slandering IT as an innocent bystander to the agenda you are putting forward.

        If you had to summarize the goal of your initial post in less then 10 words, what would it be?