Our Approach to Building Security Tooling


Introduction

Most “security tools” today are composed of code that consumes an API and applies predefined logic to identify issues. This is generally accomplished by:

  1. Fetching a subset of the endpoints exposed by the service / API being audited (i.e. the information required for the evaluation logic, such as a list of the EC2 instances deployed in an AWS account, as well as their configuration)
  2. Storing the data retrieved
  3. Evaluating this data to produce “findings” (this is the added value provided by the tool)

Integrating third-party tools into our monitoring platform isn’t always straightforward, as each tool comes with its own interfaces, output formats, and operational requirements.

Additionally, tools are usually designed to fetch only the data deemed useful at a given point in time, which often means just the data required to evaluate the currently implemented findings. This severely limits how useful they can be in other contexts, such as during incident response.

Motivated by these limitations, we have taken a different approach to building our monitoring platform and in-house tools, which has come with several advantages. This blog post describes our tooling development strategy, in the hope of encouraging a similar approach across the security industry.

Our Approach

Rather than fetching a subset of known-useful endpoints, we generalize the consumption of APIs in order to capture all the available data, independently of its perceived value at a point in time. For example, this could mean making requests to every single endpoint of the GitHub API. The data is then stored in its original format, including both the request and the response. Once the data is stored locally (we typically call this a snapshot), we can run queries against it and build findings on top of those queries.
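To make this concrete, here is one possible (purely illustrative) shape for a single snapshot entry, pairing the request that was made with the raw response that came back:

;; Illustrative only: the exact keys are not our actual schema.
{:service  "github"
 :request  {:method :get
            :path   "/orgs/example-org/repos"}
 :response {:status 200
            :body   "<the raw JSON body, stored verbatim>"}}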

This approach presents a number of benefits, which we’ll see throughout the rest of this post.

Let’s dig into some specifics.

Generating Snapshots

We mentioned that we generalize the consumption of the API - what does this mean? Well, APIs follow a more or less strict format. If you can write some code that understands that format, then you can consume all of it with relatively little human intervention. Let’s look into the types of APIs we usually encounter.

Simple APIs

These are usually REST APIs that may have some level of definition without a formal specification. We consider these “simple” because there are few relationships between endpoints, which allows for relatively straightforward collection.

A good example of this is the API exposed by Kubernetes clusters. Clusters expose the /api and /apis endpoints, which detail the resources available in the cluster in a RESTy manner. The Kubernetes cluster API is quite simple (endpoints are self-contained and multiple requests aren’t required to fetch resources), which makes it possible to use the /api and /apis endpoints to fetch all the resources.

The following figure shows the process a tool could follow to consume a Kubernetes cluster’s API:

Simple API

In the above, the first request discovers the available endpoints/resources, and the second one fetches the details for the cluster’s namespaces. Following this process for all the resources returned by the initial request would allow consuming all the data exposed by the API.
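As a rough sketch of what this could look like in Clojure, assuming the cluster API is reachable through kubectl proxy on localhost:8001 and using clj-http (both assumptions, not a description of our actual implementation):

(require '[clj-http.client :as http])

(def base-url "http://localhost:8001")   ; assumes `kubectl proxy` is running

(defn listable-resources
  "Discover the resources exposed under a group version, keeping only those
  that support the list verb."
  [group-path]
  (->> (http/get (str base-url group-path) {:as :json})
       :body
       :resources
       (filter #(some #{"list"} (:verbs %)))))

(defn snapshot-group
  "Fetch every listable resource under a group version, keeping the request
  path alongside the raw response."
  [group-path]
  (for [{:keys [name]} (listable-resources group-path)
        :let [path (str group-path "/" name)]]
    {:request  path
     :response (:body (http/get (str base-url path)))}))

;; e.g. (snapshot-group "/api/v1") would fetch namespaces, pods, services, ...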

Formal APIs

These are APIs that publish definitions based on standards such as OpenAPI or GraphQL. Going back to the Kubernetes API example, in addition to the /api and /apis endpoints, clusters also expose an /openapi/v2 endpoint which provides an OpenAPI specification. Making an API request to this endpoint returns information similar to what we saw previously:

{
  "swagger": "2.0",
  "info": {
    "title": "Kubernetes",
    "version": "v1.25.3"
  },
  "paths": {
    "/api/v1/": {
      "get": {
        "description": "get available resources"
      }
    },
    "/api/v1/namespaces": {
      "get": {
        "description": "list or watch objects of kind Namespace",
        "responses": {
          "200": {
            "description": "OK",
            "schema": {
              "$ref": "#/definitions/io.k8s.api.core.v1.NamespaceList"
            }
          },
          "401": {
            "description": "Unauthorized"
          }
        },
        "x-kubernetes-action": "list",
        "x-kubernetes-group-version-kind": {
          "group": "",
          "kind": "Namespace",
          "version": "v1"
        }
      }
    }
  }
}
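As a minimal sketch (again using clj-http, and assuming the same kubectl proxy setup as above), a generic consumer could fetch this specification and issue a request to every parameterless path that supports GET:

(require '[clj-http.client :as http]
         '[clojure.string :as str])

(defn openapi-get-paths
  "Extract the paths of an OpenAPI v2 spec that expose a GET operation and
  take no path parameters (parameterised paths need the relationship handling
  discussed below)."
  [spec]
  (for [[path ops] (:paths spec)
        :let [p (name path)]
        :when (and (:get ops)
                   (not (str/includes? p "{")))]
    p))

(let [spec (:body (http/get "http://localhost:8001/openapi/v2" {:as :json}))]
  (map #(http/get (str "http://localhost:8001" %) {:as :json})
       (openapi-get-paths spec)))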

While the format is different, the complexity of parsing this information remains low. Things change significantly when there are relationships between endpoints. For example, in order to retrieve the details of all the repositories in a GitHub organization using the public REST API, you’d need to make two API calls:

  1. GET /orgs/{org}/repos, to list the repositories in the organization
  2. GET /repos/{owner}/{repo}, to fetch the details of each repository returned by the first call

To accomplish this automatically, you’d need to consume the API’s definition and use it to build a relationship graph, defining which endpoints return data that needs to be used as input for other endpoints. The resulting graph looks something like this:

GitHub REST API
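One way to represent such a relationship graph is as plain data, where each endpoint declares where the values for its path parameters come from. The shape below is purely illustrative (it is not our actual schema):

;; Illustrative only: each endpoint declares where its path parameters come
;; from, either a seed value supplied by the operator or a field of another
;; endpoint's response. A consumer can then call endpoints in dependency order.
(def github-relationship-graph
  {:list-org-repos {:path   "/orgs/{org}/repos"
                    :inputs {:org [:seed :org]}}
   :get-repo       {:path   "/repos/{owner}/{repo}"
                    :inputs {:owner [:list-org-repos :owner :login]
                             :repo  [:list-org-repos :name]}}})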

Complex APIs

These are APIs that don’t publish definitions or don’t behave consistently. They also tend to have very loose conventions, with frequent exceptions. A good example of this is AWS, where each service has its own set of endpoints, and these endpoints follow differing structures. To further complicate things, AWS API endpoints have considerable, and sometimes complex, relationships.

For example, in order to identify EC2 instances that have instance profiles with IAM policies that grant administrative-level permissions, you’d need to call the following API endpoints:

  1. ec2 DescribeInstances, to list the instances and the instance profiles attached to them
  2. iam GetInstanceProfile, to resolve each instance profile to its IAM roles
  3. iam ListAttachedRolePolicies and ListRolePolicies, to enumerate the policies attached to those roles
  4. iam GetPolicyVersion and GetRolePolicy, to retrieve the policy documents and determine whether they grant administrative permissions

The call graph would look something like this:

AWS API

While the complexity of the call graph is similar to the GitHub API example above, AWS does not publish a definition of its APIs for all services in one convenient location. Supporting these APIs therefore requires more human intervention than simple or formal APIs do. But for high-value APIs such as AWS, the effort is definitely worth it: where else can you get the details of everything deployed in an account?
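As a sketch of the kind of chaining involved, here is what the first hop of that call graph could look like using Cognitect’s aws-api (illustrative only, not our actual collector):

(require '[cognitect.aws.client.api :as aws])

(def ec2 (aws/client {:api :ec2}))

;; First hop: list instances and pull out the instance profiles attached to
;; them. Subsequent hops would resolve each profile to its roles
;; (iam GetInstanceProfile), then to the attached policies and their documents.
(def instance-profiles
  (->> (aws/invoke ec2 {:op :DescribeInstances})
       :Reservations
       (mapcat :Instances)
       (keep (fn [instance]
               (when-let [profile (:IamInstanceProfile instance)]
                 {:instance-id (:InstanceId instance)
                  :profile-arn (:Arn profile)})))))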

Querying Snapshots

As mentioned above, both our investigation and “findings” logic leverage query languages rather than custom logic. At Latacora, our primary programming language is Clojure, and the main technologies we use to query snapshots are Datalog (via DataScript) and Specter.

Thanks to our choice of query language, which is both expressed as structured data and intrinsically supports expanding “functions” (via Predicate Expressions), we can express complex logic with simple helpers. What does this look like? Well, let’s say we want to write a query that returns all AWS EC2 instances that have IMDSv1 enabled. The Datalog query would look something like this:

(require '[datascript.core :as d])

;; The aws-call rule (which binds ?api to the response of the given AWS API
;; call in the snapshot) is supplied via the rules argument bound to %.
(d/q '[:find  ?instance-id ?vpc
       :keys  instance-id vpc
       :in    $ %
       :where (aws-call "ec2" "describe-instances" ?api)
              [?api :reservation ?reservation]
              [?reservation :instance ?instance]
              [?instance :instance-id ?instance-id]
              [?instance :vpc-id ?vpc]
              [?instance :metadata-option ?metadata]
              [?metadata :http-tokens "optional"]]
     db
     rules)

In the above:

  - db is the snapshot loaded into a DataScript database
  - aws-call is a rule (passed in via %) that binds ?api to the parsed response of the given AWS API call
  - the remaining clauses navigate from reservations to instances and keep the instances whose metadata options have http-tokens set to “optional”, i.e. instances where IMDSv1 is still allowed

Another example would be identifying every API call (i.e. path in the snapshot) where a specific IP address appeared. Here Specter would be our tool of choice. A simple query like this one would suffice:

(require '[com.rpl.specter :as specter])

(def INDEXED
  "A path that visits the values of a map and collects the corresponding keys."
  [specter/ALL (specter/collect-one specter/FIRST) specter/LAST])

(def INDEXED-SEQ
  "A path that visits the elements of a sequence and collects their indices."
  [(specter/view #(map-indexed vector %)) INDEXED])

(def path-finder
  "Finds the entries matching `pred` in a deeply nested structure of maps
  and vectors, and collects the path on the way there."
  (specter/recursive-path
    [term-pred] p
    (specter/cond-path
      (specter/pred term-pred) specter/STAY
      map? [INDEXED p]
      coll? [INDEXED-SEQ p])))

(specter/select (path-finder (fn [val] (.equals "1.3.3.7" val)))
                snapshot)

In the above, snapshot is the snapshot loaded into a map.

These are just a few examples of the data manipulation we perform routinely against snapshots. Since the tools are not tied to the underlying data structure being queried, they can be changed at any time, or more suitable tools can be used for specific tasks (e.g. we love using Meander for data transformation).
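As a small, hypothetical example of the kind of transformation Meander makes easy (the keys below assume the same snapshot shape as the Datalog query above, and are not our real schema):

(require '[meander.epsilon :as m])

;; Hypothetical: pull instance-id / vpc pairs out of a describe-instances
;; response in the snapshot.
(defn instances-with-vpcs [describe-instances-response]
  (m/search describe-instances-response
    {:reservation (m/scan {:instance (m/scan {:instance-id ?id
                                              :vpc-id      ?vpc})})}
    {:instance-id ?id :vpc ?vpc}))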

Replaying Snapshots

While we’re very proud of our in-house tooling, other excellent tools exist that implement functionality we want to use, such as generating visual representations of cloud environments. While we could implement the same features ourselves (and we sometimes do), it would require a lot of work for limited added benefit. We’d rather run these tools against our snapshots instead of the actual APIs.

That way, we’d:

  - benefit from these tools’ functionality without having to reimplement it
  - avoid making additional calls to the actual APIs, or granting these tools credentials to do so
  - be able to run them against any snapshot, including historical ones

Since we’ve pulled all the information out of the provider, we have the ability to “play back” this information by faking an API endpoint that returns the information in the same format it was originally ingested in. This isn’t always trivial (and in some cases quite impractical), but it’s definitely a strategy we’ll want to implement more broadly in the future.
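A minimal sketch of the idea, assuming a snapshot keyed by request path and using Ring with Jetty (both the snapshot shape and the handler are illustrative, not our actual replay implementation):

(require '[ring.adapter.jetty :as jetty]
         '[cheshire.core :as json])

;; Assumed snapshot shape: a map from request path to the stored response body.
(defn replay-handler [snapshot]
  (fn [request]
    (if-let [stored (get snapshot (:uri request))]
      {:status  200
       :headers {"Content-Type" "application/json"}
       :body    (json/generate-string stored)}
      {:status 404
       :body   "not captured in this snapshot"})))

;; Point the third-party tool at http://localhost:8080 instead of the real API.
;; `snapshot` here is the map described earlier.
(defonce replay-server
  (jetty/run-jetty (replay-handler snapshot) {:port 8080 :join? false}))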

Tying it all together

Good tooling is useless if you don’t have a means of acting on its findings. At Latacora, we’ve built a scheduling, reporting, and alerting pipeline that integrates all of our tools. At a high level, it looks something like this:

Tooling Pipeline

The “reporting pipeline” is the central piece of our persistent monitoring capability. It ingests the tooling output daily; in addition to generating the point-in-time reports we share with our clients, it creates JIRA tickets for findings that need to be validated by an engineer and potentially escalated to the impacted organization.

A distinctive aspect of this reporting pipeline is the approach we take to modeling our knowledge of our clients’ infrastructure and applying it to the tooling output. Once again, we take a generalist approach and describe this logic as Clojure code, divided into two “models”.

As these models are code, they can be overlaid on our tooling output, which results in an accurate representation of each environment’s overall security posture.
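As a rough, hypothetical illustration of what such a model could look like (the names and shapes below are invented for this example and are not our actual implementation), a model might simply annotate raw findings with what we already know about an environment:

;; Hypothetical model: findings we have already reviewed and accepted for a
;; given client, expressed as plain data.
(def accepted-risks
  #{{:finding :imdsv1-enabled :instance-id "i-0123456789abcdef0"}})

(defn apply-model
  "Overlay the model on raw findings, marking the ones that are already accepted."
  [findings]
  (map (fn [finding]
         (assoc finding
                :accepted? (contains? accepted-risks
                                      (select-keys finding [:finding :instance-id]))))
       findings))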

Conclusion

We hope this blog post has provided valuable insights to security tool developers and has inspired a shift in the way they approach tooling development. If you have any questions or would like to get in touch, you can reach out at hello@latacora.com.