Cb ThreatHunter _search() function does not yield the correct num_available results #239

@crahan

The current implementation of the _search() function in the Query class in cbapi/psc/threathunter/query.py yields an incorrect number of results for event searches because the rows argument is set as part of the REST API call. The _count() and _search() functions both use the same arguments and the same urlobject when making the REST calls, except that _search() also specifies rows as 100 (the default batch size, set via args['rows'] = self._batch_size).

To test I updated the _count() function to the following:

    def _count(self):
        args = {"search_params": self._get_query_parameters()}

        log.debug("args: {}".format(str(args)))

        js = self._cb.post_object(
            self._doc_class.urlobject.format(self._cb.credentials.org_key), body=args
        ).json()

        self._total_results = int(js.get("response_header", {}).get("num_available", 0))
        self._count_valid = True

        print(f'_count args: {args}')
        print(f'_count json: {js.get("response_header")}\n')

        return self._total_results

And I updated the while loop in the _search() function to the following:

        while still_querying:

            url = self._doc_class.urlobject.format(self._cb.credentials.org_key)
            resp = self._cb.post_object(url, body=args)
            result = resp.json()

            print(f'_search args: {args}')
            print(f'_search json: {result.get("response_header")}\n')

            self._total_results = result.get("response_header", {}).get("num_available", 0)
            self._count_valid = True

            results = result.get('docs', [])

Both print the arguments sent to the urlobject URL and the response header that is sent back from the REST API. Now we run the following script (please note this uses #237 to fix the 403 issue with the len() function):

from cbapi.psc.threathunter import CbThreatHunterAPI
from cbapi.psc.threathunter.models import Event

cb = CbThreatHunterAPI(profile='redlab2')
guid = 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c'
query_result = cb.select(Event).where(process_guid=guid).and_(event_type="crossproc")
print(f'len() count: {len(query_result)}')

i = 0

for t in query_result:
    i += 1

print(f'Iteration count: {i}')

Output:

❯ python test.py
_count args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}

len() count: 1051
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100}}
_search json: {'num_found': 13910, 'num_available': 251, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 251
Iteration count: 251

The only difference between the _count() and the _search() function arguments is the rows value. It's set to 100 by default due to args['rows'] = self._batch_size and causes the platform to return a lower value for num_available than when rows is not specified at all.

In addition, even with the incorrect value for num_available (251 in the example above), the expectation would be that the while loop runs at least 3 times, since the requested row size is 100: 251 results would arrive as batches of 100, 100, and 51. Instead, the API returns all 251 results in a single response, resulting in a single iteration of the while loop. The Iteration count line near the end of the output shows how many results were actually returned: 251 in this case.
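To make that expectation concrete, here is a minimal sketch of the arithmetic. Note that expected_batches is a hypothetical helper written for this issue, not something that exists in cbapi:

```python
def expected_batches(num_available, rows):
    """Return the batch sizes a client would expect to receive when
    paging through num_available results with a page size of rows.

    Hypothetical helper for illustration only; not part of cbapi.
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    full, remainder = divmod(num_available, rows)
    # 'full' complete pages, plus one short page if anything is left over.
    return [rows] * full + ([remainder] if remainder else [])


print(expected_batches(251, 100))   # [100, 100, 51]
print(expected_batches(1051, 500))  # [500, 500, 51]
```

Neither of these matches what the platform returned in the runs shown in this issue.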

When we comment out args['rows'] = self._batch_size, in order to not send the rows argument in the _search() REST call, we see the following output instead:

❯ python test.py
_count args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}

len() count: 1051
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}, 'start': 501}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}, 'start': 1001}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
Iteration count: 1500

By not specifying a rows argument in the _search() request, the platform now reports the same num_available value (1051) as it does for the _count() function, and as expected it loops more often, with a default result set of 500 per iteration. Surprisingly though, instead of returning batches of 500, 500, and 51 as indicated by num_available, it actually returns 3 batches of 500 results, going beyond the reported value of num_available. The difference between num_found and num_available appears to be arbitrary, and it's actually possible to retrieve more results than num_available says the platform can send back.

In summary, _search() for events is broken in its current form, and part of the cause appears to lie on the REST API side, in how it handles search requests that contain a rows parameter.

  1. Adding rows to a request should not impact num_available; it should only change the number of results returned for that particular request.
  2. Requesting results where the default result set size (i.e. 500 from our testing) goes beyond the value of num_available (e.g. starting at 1000 with a default result size of 500 when num_available is 1051) should yield 51 results and not 500. num_available should properly indicate the hard limit of retrievable results and not cause confusion by returning results beyond that value.
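The two points above can be sketched as a tiny model of the behaviour I would expect. This is only an illustration of the expectation, not the real REST API; expected_page and its parameters are names I made up, and I'm assuming a zero-based start offset for simplicity:

```python
def expected_page(total_available, start=0, rows=500):
    """Model of the expected paging behaviour (hypothetical, not the API).

    - num_available is independent of rows.
    - The number of returned results is clipped at num_available.
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    start = max(start, 0)
    returned = max(0, min(rows, total_available - start))
    return {"num_available": total_available, "returned": returned}


# Point 1: changing rows only changes the page size, not num_available.
print(expected_page(1051, start=0, rows=100))
# Point 2: the last page is short instead of running past num_available.
print(expected_page(1051, start=1000, rows=500))
```

With num_available at 1051, the second call models the case from point 2 and yields 51 results rather than 500.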

With the change to the v2 search at scale API, it's unclear how this will be resolved, but at this time event search results retrieved using the Python API are highly inconsistent and cause a lot of confusion, especially because of the difference between what the _count() function returns (and thus what len() on a query object returns) and the results actually retrieved (significantly fewer in our testing: 251 vs 1051).
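For anyone who wants to check their own queries for this mismatch, something along these lines should do, assuming only that the query object supports len() and iteration the way query_result does in the script above. compare_counts and FakeQuery are hypothetical names used for illustration; FakeQuery merely stands in for a real query so the snippet runs without a Cb connection:

```python
def compare_counts(query):
    """Return (reported, iterated, match) for any sized, iterable object.

    'reported' is what len() claims; 'iterated' is how many items
    iteration actually yields. Hypothetical helper, not part of cbapi.
    """
    reported = len(query)
    iterated = sum(1 for _ in query)
    return reported, iterated, reported == iterated


class FakeQuery:
    """Stand-in whose len() disagrees with the number of items it yields."""

    def __init__(self, reported, actual):
        self._reported = reported
        self._actual = actual

    def __len__(self):
        return self._reported

    def __iter__(self):
        return iter(range(self._actual))


print(compare_counts(FakeQuery(1051, 251)))  # (1051, 251, False)
```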

Update: for giggles I forced args['rows'] = 100000000 in _search() and, as if by magic, num_available changes to 13956 (which is actually all available results) and I'm able to retrieve all events in batches of 500. I've cut off the output below for brevity, but I hope this helps clarify the point that the value of rows causes issues in the API (and raises the question of why num_available is set to such incredibly low values by default).

❯ python test.py
_count args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13956, 'num_available': 1097, 'total_segments': 2145, 'processed_segments': 2145}

len() count: 1097
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 1001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 1501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 2001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 2501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 3001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
...
