LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.    

Resume

  May 30, 2021

Jack Neely
jjneely at gmail dot com
https://linuxczar.net
https://github.com/jjneely

Summary

Large scale systems architect with twenty years of experience in complex, heterogeneous environments, including designing, coding, deploying, and operating highly available and scalable cloud services. Subject matter expert on Visibility and Observability. Architect of a Prometheus solution ingesting more than 8 million data points per second. Proven ability to learn from, integrate with, and mentor others across a wide cultural span.

Goals

A Systems Architect position utilizing a computer science background, data science skills, strong coding and operating skills, and knowledge of computer architecture to create highly efficient and robust cloud solutions.

Professional Experience

DevOps Observability Architect

Palo Alto Networks, Raleigh, NC | December 2020 - Present

Technical lead for the Observability initiative in the Prisma Cloud (public cloud security) division.

  • Built a roadmap of design goals and milestones to implement a unified Logging, Metrics, and Tracing Observability platform for the entire division.
  • Designed and built a global Grafana, Prometheus, and Thanos metrics and monitoring system to create a single pane of glass dashboarding and alerting environment for Kubernetes clusters across regions. Handled compliance-based partitions for deployments in China and US Government regions.
  • Created a migration plan away from a SignalFx metric solution with custom agents to a Cloud Native approach with Prometheus, Spring Framework, and Micrometer.
  • Initiated an Alert Review process to address alert fatigue and discover ongoing and unnoticed customer-facing issues.
  • Kicked off prototyping of Loki to integrate logs and events into the Grafana single pane of glass monitoring environment. Fostered planning to move a vendor-based log and event setup to a more secure and cost-effective solution. Visualized and created metrics from the log and event data.

Senior Operations Engineer

42 Lines, Inc., Raleigh, NC | December 2013 - December 2020

Systems Architect, 42 Lines. Supported a young SaaS product and scaled operations and reliability with modern load balancing and telemetry. February 2020 - December 2020.

  • Constructed a scalable load balancing solution using AWS Network Load Balancers, Auto Scaling Groups, and HAProxy that is able to manage stateful user sessions for existing custom software and newly developed SaaS products.
  • Built relationships with partner companies to create a referral network, working with the marketing team.
  • Created an array of presentations and webinars showcasing Service Reliability Engineering best practices and how to use observability tools to make better business decisions.
  • Gathered data for capacity planning and costs per customer by extracting meaningful value from log events in Elastic Stack and deploying a Prometheus based metrics solution.
  • Introduced Prometheus and Grafana and built dashboards around the Four Golden Signals to visualize SaaS product behaviors.

Consulting for Mutations Limited, Los Angeles, California. Built and maintained a cloud agnostic Kubernetes ecosystem as a production environment for an IoT startup. November 2019 - February 2020.

  • Integrated visibility services from DataDog into Amazon Elastic Kubernetes Service, including management of all log sources and of metrics from many different sources and protocols.
  • Deployed and maintained mission critical services in Kubernetes via Helm Charts including Confluent Kafka endpoints and streaming data manipulation with Confluent ksqlDB.
  • Managed 3 Kubernetes clusters for multiple development, staging, and production environments.

Consulting for Fitbit, Inc., San Francisco, California. Designed and implemented a Prometheus and Thanos observability platform ingesting 8 million data points per second. October 2014 - November 2019.

  • Implemented Thanos as a solution for Prometheus clustering and long term storage of data in GCS. Worked with upstream developers in Go to fix and merge TSDB block repair routines, bug fixes for pointer math, and several command line options to help build a migration path from a large Prometheus environment.
  • Worked with the client’s teams around the globe to sunset all StatsD and Graphite metric instrumentation in favor of Prometheus. Taught the software engineering teams the Prometheus APIs and libraries for Python, Go, and Java as required.
  • Shifted the client’s entire Prometheus monitoring stack from bare metal hardware in IBM SoftLayer into Google Cloud Platform. Containerized all components.
  • Designed, planned, and implemented a distributed Prometheus based monitoring and telemetry infrastructure for a service oriented architecture supporting more than a hundred teams and more than 8 million samples per second.
  • Patched Prometheus’s Histogram routines in Go to ensure buckets always increase monotonically when estimating quantiles to handle scrape consistency issues.
  • Designed, managed, and upgraded a client’s Graphite and Grafana cluster supporting more than 30 million incoming metrics per minute, 300 terabytes of storage, and over 130 million unique time series.
  • Contributor to the Open Source Graphite project with merge access. Wrote multiple Python patches to increase the efficiency of Graphite and improve its failure modes.
  • Coded tools in Go to manage large Graphite clusters including rebalancing metrics and merging duplicate metrics. Code was an order of magnitude faster than similar Python tools.
  • Architected a solution to ingest more than 2.5 million StatsD metrics per second and feed aggregate metrics to Graphite.
  • Coded StatsRelay as an Open Source StatsD load balancer using Google’s Jump consistent hashing algorithm in Go. A single instance could handle 700,000 UDP packets per second.
  • Replaced the Etsy StatsD Node.js daemon with Statsite, written in C. Patched Statsite to add configuration options and to set socket options required for higher throughput. Contributed additional patches to fix bugs in the event driven architecture.

Consulting for the Academy of Art University, San Francisco, California. December 2013 - October 2014.

  • Built Nagios/Merlin monitoring systems to achieve single pane of glass monitoring for infrastructures spanning the globe and multiple cloud providers. Multiple bug fixes in C submitted and accepted by the OP5 team developing Merlin.
  • Migrated an in-house Amazon EC2 provisioning system away from Chef to an Ansible based system.

Operations and Systems Specialist

NC State University Office of Information Technology, Raleigh, NC | April 2006 - November 2013

  • Architect of NC State University’s Linux deployment. Continued as project lead of NCSU Realm Linux, supporting over 2,000 workstations and servers and more than 100,000 users.
  • Technical lead for the deployment of Red Hat Enterprise Linux throughout the University of North Carolina System’s 16 universities.
  • Designed automated tools to better support Realm Linux including hands-off installs using PXE and Red Hat Kickstart.
  • Designed and built a configuration management solution using Bcfg2 that was used by system administrators for a campus of 100,000 users. The solution also worked in situations where IT groups needed to be partitioned from one another. Planned and implemented a migration of this system to Puppet.
  • Coded and implemented a dynamic Kickstart system for all of campus using Python, Genshi Templates, and XMLRPC.
  • Created and maintained many RPM packages including OpenAFS packages.
  • Built an RPM package build system using Subversion and Mock.
  • Contributed to a number of Open Source projects such as MoinMoin, Yum, Anaconda, Bcfg2, Up2date, and others.
  • Wrote production quality PAM modules in C to implement LDAP based authorization.
  • Upgraded and took primary responsibility for the campus Kerberos authentication system. Moved the campus off of the Kerberos 4 protocol.
  • Upgraded and took primary responsibility for NC State University’s public NTP service.
  • Implemented an inexpensive load balancing and high availability solution using LVS, Keepalived, and spanned network VLANs through the data centers. This system load balanced NC State’s main website, LDAP infrastructure, RHN Satellites, Webmail, Linux installs, and many other services.
  • Upgraded the Cyrus IMAP implementation that supported over 100,000 users to new hardware, the latest Realm Linux version, and the most current Cyrus IMAP software.
  • Built Xen and KVM based virtual machines for optimal use of physical hardware.
  • Trained users, help desk staff, and other system administrators on a regular basis on topics such as configuration management with Puppet, RPM packaging, RAID and LVM usage, and deploying Realm Linux.
  • Wrote and maintained documentation and best practices guidelines for various topics in Linux administration.
  • Started and organized NC State University’s FOSS Fair, an annual unconference style event for topics in Free and Open Source Software, beginning in 2009.

Systems Programmer I

NC State University College of Physical and Mathematical Sciences, Raleigh, NC | 2001 - 2006

  • Took on responsibility as project lead for NCSU Realm Linux.
  • Managed the campus wide install base of Realm Linux at over 1,000 machines.
  • Deployed and managed a Red Hat Network Satellite server and its supporting Oracle 9iR2 database.
  • Served as contact point for campus regarding Linux security issues, bugs, enhancements, and features for Red Hat style Linux distributions.
  • Designed, deployed, and administered a 100+ node Beowulf Cluster based on RHEL, Sun Grid Engine, and MPI.
  • Built supporting infrastructure for the Beowulf including Fibre Channel storage arrays, deployment of Brocade FC switches, and Cisco 3750 network switches.
  • Supported the Beowulf users as they created live hurricane prediction models submitted to the National Weather Service.
  • Participated in the server room design process for the room that housed several Beowulf Clusters and other servers.

Systems Administrator

NC State University Department of Physics, Raleigh, NC | 2000 - 2001

  • Worked with faculty and graduate students to troubleshoot problems and identify solutions.
  • Tested and deployed Realm Linux and other Linuxes.
  • Gained experience with Solaris, AIX, IRIX, and ULTRIX.

Research

NC State University Department of Chemistry, Raleigh, NC | 1999 - 2000

  • Created professional quality video with Linux and developed a process to master recordings onto LaserDisc and produce VHS tapes from the master.
  • Wrote C code to generate models of a molecular “bridge” using graph algorithms. This code was used to design molecules that identify and bond to cancer cells and leave normal cells untouched.

Instructor

Sandhills Community College, Pinehurst, NC | Summer of 1999

Education

B.S. in Computer Science
North Carolina State University, Raleigh, NC
May, 2002

Activities and Honors

Practical Operations Podcast – operations.fm

  • Co-host of the Practical Operations Podcast with more than 100 episodes recorded featuring best practices in the Operations, DevOps, SRE, Observability, and infrastructure fields.

Professional Awards and Associations

  • Gertrude Cox Award for Innovative Excellence in Teaching and Learning with Technology winner for the Realm Linux project.
  • Triangle Linux Users’ Group (TriLUG) member.
  • Former President of the NC State University Linux Users’ Group.

Musician

  • Tenor section leader for the North Carolina Master Chorale.
  • Vocalist for St. Michael’s Episcopal Church and the Raleigh Convocation Choir of the Episcopal Diocese of North Carolina.
  • Board member of the Raleigh Convocation Choir.