Description
The current implementation of the _search() function in the Query class in cbapi/psc/threathunter/query.py yields an incorrect number of results for event searches because the rows argument is set as part of the REST API call. The _count() and _search() functions both use the same arguments and the same urlobject when making the REST calls, except that _search() also specifies rows as 100 (the default batch size, set via args['rows'] = self._batch_size).
To test I updated the _count() function to the following:
def _count(self):
    args = {"search_params": self._get_query_parameters()}
    log.debug("args: {}".format(str(args)))
    js = self._cb.post_object(
        self._doc_class.urlobject.format(self._cb.credentials.org_key), body=args
    ).json()
    self._total_results = int(js.get("response_header", {}).get("num_available", 0))
    self._count_valid = True
    # Debug output added for this test
    print(f'_count args: {args}')
    print(f'_count json: {js.get("response_header")}\n')
    return self._total_results
And I updated the while loop in the _search() function to the following:
while still_querying:
    url = self._doc_class.urlobject.format(self._cb.credentials.org_key)
    resp = self._cb.post_object(url, body=args)
    result = resp.json()
    # Debug output added for this test
    print(f'_search args: {args}')
    print(f'_search json: {result.get("response_header")}\n')
    self._total_results = result.get("response_header", {}).get("num_available", 0)
    self._count_valid = True
    results = result.get('docs', [])
Both print the arguments sent to the urlobject URL and the response header that is sent back from the REST API. Now we run the following script (please note this uses #237 to fix the 403 issue with the len() function):
from cbapi.psc.threathunter import CbThreatHunterAPI
from cbapi.psc.threathunter.models import Event

cb = CbThreatHunterAPI(profile='redlab2')
guid = 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c'
query_result = cb.select(Event).where(process_guid=guid).and_(event_type="crossproc")
print(f'len() count: {len(query_result)}')

# Count how many results iteration actually yields
i = 0
for t in query_result:
    i += 1
print(f'Iteration count: {i}')
Output:
❯ python test.py
_count args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
len() count: 1051
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100}}
_search json: {'num_found': 13910, 'num_available': 251, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 251
Iteration count: 251
The only difference between the _count() and the _search() function arguments is the rows value. It's set to 100 by default due to args['rows'] = self._batch_size, and it causes the platform to return a lower value for num_available than when rows is not specified at all.
In addition, even with the incorrect value for num_available (251 in the example above), the expectation would be that the while loop iterates at least 3 times, as the requested row size is 100: 251 results would arrive in batches of 100, 100, and 51. Instead, the API returns all 251 results in one response, resulting in a single iteration of the while loop. The Iteration count near the end of the output shows how many results were actually returned: 251 in this case.
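To make the expected paging concrete, here is a minimal sketch of the arithmetic. The helper name expected_batches is hypothetical and not part of cbapi; it only models what a client would expect if the platform honoured rows.

```python
def expected_batches(num_available, rows):
    """Return the batch sizes a client would expect when paging
    through num_available results, rows at a time."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    full, remainder = divmod(num_available, rows)
    # Full batches first, then one partial batch if anything is left over
    return [rows] * full + ([remainder] if remainder else [])


print(expected_batches(251, 100))   # [100, 100, 51]
print(expected_batches(1051, 500))  # [500, 500, 51]
```

With rows set to 100 and num_available at 251, that is three requests, not the single response of 251 results seen above.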
When we comment out args['rows'] = self._batch_size, in order to not send the rows argument in the _search() REST call, we see the following output instead:
❯ python test.py
_count args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
len() count: 1051
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}, 'start': 501}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}, 'start': 1001}
_search json: {'num_found': 13910, 'num_available': 1051, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
Iteration count: 1500
By not specifying a rows argument in the _search() request, the platform now reports the same num_available (1051) as it does for the _count() function, and as expected it loops more often, with a default result set of 500 per iteration. Surprisingly though, instead of returning 500, 500, and 51 results as indicated by num_available, it actually returns 3 batches of 500 results, going beyond the reported num_available. The difference between num_found and num_available appears to be arbitrary, and it's actually possible to retrieve more results than num_available says the platform can send back.
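Until the platform behaviour is clarified, a client could at least keep iteration consistent with the reported count by stopping at num_available. The following is only a sketch of that idea, not the cbapi implementation: fetch_page is a hypothetical callable standing in for one REST request, overreturning_backend is a stand-in that imitates the over-returning behaviour seen above, and the zero-based start offsets are an assumption of the sketch.

```python
def iter_capped(fetch_page):
    """Yield results page by page, but never more than num_available.

    fetch_page(start) is a hypothetical callable that returns
    (num_available, docs) for a single request.
    """
    start = 0
    yielded = 0
    while True:
        num_available, docs = fetch_page(start)
        if not docs:
            break
        for doc in docs:
            if yielded >= num_available:
                return  # ignore anything beyond the reported limit
            yield doc
            yielded += 1
        if yielded >= num_available:
            break
        start += len(docs)


def overreturning_backend(start, num_available=1051, page_size=500):
    """Stand-in backend: always returns a full page, even past num_available."""
    return num_available, list(range(start, start + page_size))


print(sum(1 for _ in iter_capped(overreturning_backend)))  # 1051
```

This only hides the inconsistency on the client side; it does not explain why num_available differs from num_found in the first place.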
In summary, _search() for events is broken in its current form, and the cause appears to lie at least partly on the REST API side, in the way it handles search requests that contain a rows parameter.
- Adding rows to a request should not impact num_available; it should only change the number of results which are returned for that particular request.
- Requesting results where the default result set size (i.e. 500 from our testing) goes beyond the value of num_available (e.g. starting at 1000 with a default result size of 500 when num_available is 1051) should yield 51 results and not 500. num_available should properly indicate the hard limit of retrievable results and not cause confusion by returning results beyond that value.
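The two expectations above can be modelled in a few lines. This is a toy model of the behaviour being requested, not the real API: expected_response and its parameters are made up for illustration.

```python
def expected_response(all_docs, start=0, rows=500):
    """Toy model of the requested semantics: num_available does not
    depend on rows, and a page never extends past the available results."""
    docs = all_docs[start:start + rows]
    header = {"num_available": len(all_docs)}
    return header, docs


events = list(range(1051))
for rows in (100, 500):
    header, docs = expected_response(events, start=1000, rows=rows)
    # num_available stays at 1051 for both values of rows,
    # and only the remaining 51 results come back
    print(rows, header["num_available"], len(docs))
```

Under this model the last page is 51 results regardless of rows, and the reported limit is the same for every request.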
With the change to the v2 search at scale API, it's unclear how this will be resolved, but at this time event search results retrieved using the Python API are highly inconsistent and cause a lot of confusion, especially because of the difference between what the _count() function returns (and thus what len() on a query object returns) and the results that are actually retrieved (significantly fewer in our testing: 251 vs 1051).
Update: for giggles I forced args['rows'] = 100000000 in _search() and, as if by magic, num_available changes to 13956 (which is actually all available results) and I'm able to retrieve all events in batches of 500. I've cut off the output below for brevity, but I hope this helps clarify the point that the value of rows causes issues in the API (and raises the question of why num_available is set to such incredibly low values by default).
❯ python test.py
_count args: {'search_params': {'q': 'process_guid:N5DFVQXD\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'N5DFVQXD-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp'}}
_count json: {'num_found': 13956, 'num_available': 1097, 'total_segments': 2145, 'processed_segments': 2145}
len() count: 1097
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 1001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 1501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 2001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 2501}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
_search args: {'search_params': {'q': 'process_guid:ORGKEY\\-01319b03\\-00000cc4\\-00000000\\-1d64c634925f39c AND event_type:crossproc', 'cb.process_guid': 'ORGKEY-01319b03-00000cc4-00000000-1d64c634925f39c', 'fl': '*,parent_hash,parent_name,process_cmdline,backend_timestamp,device_external_ip,device_group,device_internal_ip,device_os,process_effective_reputation,process_reputation,ttp', 'rows': 100000000}, 'start': 3001}
_search json: {'num_found': 13956, 'num_available': 13956, 'total_segments': 2145, 'processed_segments': 2145}
_search result count: 500
...