LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.    

Resume

  January 30, 2024

Jack Neely
jjneely at gmail dot com
https://linuxczar.net
https://github.com/jjneely

Summary

Large scale systems architect with twenty-plus years of experience in heterogeneous and global enterprise environments. Subject matter expert in Observability, including designing, coding, deploying, and operating streaming data science pipelines. Architect of a global Prometheus solution ingesting more than 8 million data points per second. Proven ability to lead, learn from, integrate with, and mentor others across a wide cultural span. Teacher, trainer, and conference presenter.

Goals

A Systems Architect position utilizing a computer science background, data science skills, strong coding and operations skills, and knowledge of computer architecture to create highly efficient and robust cloud solutions. Leading an organization by solving problems with data, mathematics, and the scientific method.

Conference Presentations

Monitorama PDX 2023

Portland, Oregon | June 2023

Observability Data Engineering: A Story About Math, Four Golden Signals, and Business Intelligence. A brief tour of the Google SRE Four Golden Signals and the mathematical concepts that inform them, including enterprise use cases of tracking individual customers in observability data. Includes advanced techniques for aggregating and rolling up percentiles.

All Things Open 2020

Raleigh, North Carolina | October 2020

Finding the Golden Signals with Prometheus. Scale a business by using Prometheus. A walkthrough of identifying how to instrument applications, using that data to create Service Level Objectives, and creating Burn Rate style alerting while avoiding common challenges with large data sets in Prometheus.

Monitorama PDX 2019 Lightning Talks

Portland, Oregon | June 2019

5 Neat Prometheus Tricks: PromQL and the Power of 1. A 5-minute lightning talk covering Prometheus’s query language and 5 useful ways to build advanced alerting and Service Level Objective hacks.

Professional Experience

Sr. Principal DevOps Observability Architect

Palo Alto Networks, Raleigh, NC | December 2020 - Present

Technical lead for Observability in the Prisma Cloud (public cloud security) division.

  • Technical lead of the global observability team.
  • Build road maps, plans, and timelines to project manage the roll out, implementation, and upgrades of metrics, traces, logs/events, real user monitoring, OpenTelemetry, and synthetic monitoring through an enterprise culture.
  • Implement a unified and global Observability Platform for Prisma Cloud covering metrics, traces, logging, and alerting based on Grafana Mimir, Grafana Tempo, and a combination of Grafana Loki and ElasticSearch.
  • Designed the Observability Platform to support 60+ Kubernetes clusters running in multiple cloud service providers in all regions as well as integration of serverless applications.
  • Create required isolation zones for supporting Observability in compliance based partitions and deployments such as China and FedRAMP High.
  • Created a migration plan away from a SignalFx metric solution with custom agents to a Cloud Native approach with Prometheus, Thanos, Spring Framework, and Micrometer.
  • Initiated an Alert Review process to address alert fatigue and discover ongoing and unnoticed customer facing issues.
  • Project management of migration away from Splunk toward a globally designed ElasticSearch and Grafana Loki based logging and eventing solution saving $2.5 million per year.
  • Define a schema and plan for applications to adopt Structured Logs. Support and configure structured log usage in Spring Boot, Logback, and similar Java tools.
  • Manage vendor relationships.
  • Roll out PagerDuty and migrate PagerDuty account to corporate level account for better costing opportunities. Report on every On Call rotation each week.
  • Worked with global teams to design and solve high cardinality business intelligence use cases with streaming data pipelines utilizing AWS Kinesis and Apache Flink. Created interactive Grafana Dashboards to display and alert on results.
  • Develop an automated score card tool with Grafana that tracks each functional area, team, and service. Report on observability adoption and Service Level Objectives at each level.

Senior Operations Engineer

42 Lines, Inc., Raleigh, NC | December 2013 - December 2020

Systems Architect, 42 Lines. Support a young SaaS product and scale operations and reliability with modern load balancing and telemetry. February 2020 - December 2020.

  • Constructed a scalable load balancing solution using AWS Network Load Balancers, Auto Scaling Groups, and HAProxy that is able to manage stateful user sessions for existing custom software and newly developed SaaS products.
  • Built relationships with partner companies to create a referral network, working with the marketing team.
  • Created an array of presentations and webinars showcasing Service Reliability Engineering best practices and how to use observability tools to make better business decisions.
  • Gathered data for capacity planning and costs per customer by extracting meaningful value from log events in Elastic Stack and deploying a Prometheus based metrics solution.
  • Introduced Prometheus and Grafana and built dashboards around the Four Golden Signals to visualize SaaS product behaviors.

Consulting for Mutations Limited, Los Angeles, California. Build and maintain a cloud agnostic Kubernetes ecosystem as a production environment for an IoT startup. November 2019 - February 2020.

  • Integrated visibility services from DataDog into Amazon Elastic Kubernetes Service, including management of all log sources and metrics from many different sources and protocols.
  • Deployed and maintained mission critical services in Kubernetes via Helm Charts including Confluent Kafka endpoints and streaming data manipulation with Confluent ksqlDB.
  • Managed 3 Kubernetes clusters for multiple development, staging, and production environments.

Consulting for Fitbit, Inc., San Francisco, California. Designed and implemented a Prometheus and Thanos observability platform ingesting 8 million data points per second. October 2014 - November 2019.

  • Implemented Thanos as a solution for Prometheus clustering and long term storage of data in GCS. Worked with upstream developers in Go to fix and merge TSDB block repair routines, bug fixes for pointer math, and several command line options to help build a migration path from a large Prometheus environment.
  • Worked with client’s teams all over the globe in an effort to sunset all StatsD and Graphite metric instrumentation in favor of Prometheus. Taught the software engineering teams the Prometheus APIs and libraries for Python, Go, and Java as required.
  • Shifted client’s entire Prometheus monitoring stack from bare metal hardware in IBM SoftLayer into the Google Cloud Platform. Containerized all components.
  • Designed, planned, and implemented a distributed Prometheus based monitoring and telemetry infrastructure for a service oriented architecture supporting more than a hundred teams and more than 8 million samples per second.
  • Patched Prometheus’s Histogram routines in Go to ensure buckets always increase monotonically when estimating quantiles to handle scrape consistency issues.
  • Designed, managed, and upgraded a client’s Graphite and Grafana cluster supporting more than 30 million incoming metrics per minute, 300 terabytes of storage, and over 130 million unique time series.
  • Contributor to the Open Source Graphite project with merge access. Multiple Python based patches written to increase the efficiency of Graphite and improve failure modes.
  • Coded tools in Go to manage large Graphite clusters including rebalancing metrics and merging duplicate metrics. Code was an order of magnitude faster than similar Python tools.
  • Architected a solution to ingest more than 2.5 million StatsD metrics per second and feed aggregate metrics to Graphite.
  • Coded StatsRelay as an Open Source StatsD load balancer using Google’s Jump consistent hashing algorithm in Go. A single instance could handle 700,000 UDP packets per second.
  • Replaced Etsy StatsD NodeJS daemon with Statsite written in C. Patched Statsite to add configuration options and to set socket options required for higher throughput. Additional patches to fix bugs in the event driven architecture.

Consulting for the Academy of Art University, San Francisco, California. December 2013 - October 2014.

  • Built Nagios/Merlin monitoring systems to achieve single pane of glass monitoring for infrastructures spanning the globe and multiple cloud providers. Multiple bug fixes in C submitted and accepted by the OP5 team developing Merlin.
  • Migrated an in-house Amazon EC2 provisioning system away from Chef to an Ansible based system.

Operations and Systems Specialist

NC State University Office of Information Technology, Raleigh, NC | April 2006 - November 2013

  • Architect of NC State University’s Linux deployment. Continued project lead of NCSU Realm Linux. Support of over 2,000 workstations and servers and more than 100,000 users.
  • Technical lead for the deployment of Red Hat Enterprise Linux throughout the University of North Carolina System’s 16 universities.
  • Designed automated tools to better support Realm Linux including hands-off installs using PXE and Red Hat Kickstart.
  • Designed and built a configuration management solution using Bcfg2 that was used by system administrators for a campus of 100,000 users. The solution was also effective where IT groups needed to be partitioned from one another. Planned and implemented a migration of this system to Puppet.
  • Coded and implemented a dynamic Kickstart system for all of campus using Python, Genshi Templates, and XMLRPC.
  • Created and maintained many RPM packages including OpenAFS packages.
  • Built an RPM package build system using Subversion and Mock.
  • Contributed to a number of Open Source projects such as MoinMoin, Yum, Anaconda, Bcfg2, Up2date, and others.
  • Wrote production quality PAM modules in C to implement LDAP based authorization.
  • Upgraded and took primary responsibility for the campus Kerberos authentication system. Moved the campus off of the Kerberos 4 protocol.
  • Upgraded and took primary responsibility for NC State University’s public NTP service.
  • Implemented an inexpensive load balancing and high availability solution using LVS, Keepalived, and spanned network VLANs through the data centers. This system load balanced NC State’s main website, LDAP infrastructure, RHN Satellites, Webmail, Linux installs, and many other services.
  • Upgraded the Cyrus IMAP implementation that supported over 100,000 users to new hardware, the latest Realm Linux version, and the most current Cyrus IMAP software.
  • Built Xen and KVM based virtual machines for optimal use of physical hardware.
  • Trained users, help desk staff, and other system administrators on a regular basis on topics such as configuration management with Puppet, RPM packaging, RAID and LVM usage, and deploying Realm Linux.
  • Wrote and maintained documentation and best practices guidelines for various topics in Linux administration.
  • Started and organized NC State University’s FOSS Fair, an annual unconference style event for topics in Free and Open Source Software, beginning in 2009.

Systems Programmer I

NC State University College of Physical and Mathematical Sciences, Raleigh, NC | 2001 - 2006

  • Took on the responsibility as the project lead for NCSU Realm Linux.
  • Managed the campus wide install base of Realm Linux at over 1,000 machines.
  • Deployed and managed a Red Hat Network Satellite server and its supporting Oracle 9iR2 database.
  • Served as contact point for campus regarding Linux security issues, bugs, enhancements, and features for Red Hat style Linux distributions.
  • Designed, deployed, and administered a 100+ node Beowulf Cluster based on RHEL, Sun Grid Engine, and MPI.
  • Built supporting infrastructure for the Beowulf including fiber channel storage arrays, deployment of Brocade FC switches, and Cisco 3750 network switches.
  • Supported the Beowulf users as they created live hurricane prediction models submitted to the National Weather Service.
  • Participated in the server room design process for the room that housed several Beowulf Clusters and other servers.

Systems Administrator

NC State University Department of Physics, Raleigh, NC | 2000 - 2001

  • Worked with faculty and graduate students to troubleshoot problems and identify solutions.
  • Tested and deployed Realm Linux and other Linuxes.
  • Gained experience with Solaris, AIX, IRIX, and ULTRIX.

Research

NC State University Department of Chemistry, Raleigh, NC | 1999 - 2000

  • Created professional quality video with Linux and developed a process to master recordings onto LaserDisc and produce VHS tapes from the master.
  • Wrote C code to generate models of a molecular “bridge” using graph algorithms. This code was used to design molecules that identify and bond to cancer cells and leave normal cells untouched.

Instructor

Sandhills Community College, Pinehurst, NC | Summer of 1999

Education

B.S. in Computer Science
North Carolina State University, Raleigh, NC
May, 2002

Activities and Honors

Practical Operations Podcast – operations.fm

  • Co-host of the Practical Operations Podcast with more than 100 episodes recorded featuring best practices in the Operations, DevOps, SRE, Observability, and infrastructure fields.

Professional Awards and Associations

  • Gertrude Cox Award for Innovative Excellence in Teaching and Learning with Technology winner for the Realm Linux project.
  • Triangle Linux Users’ Group (TriLUG) member.
  • Former President of the NC State University Linux Users’ Group.

Musician

  • Tenor section leader for the North Carolina Master Chorale.
  • Vocalist for St. Michael’s Episcopal Church and the Raleigh Convocation Choir of the Episcopal Diocese of North Carolina.
  • Board member of the Raleigh Convocation Choir.