python - Looping through a paginated API asynchronously
I'm ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I have code that functions as follows:
while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # process each document here
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is slow, and I'm wondering if there is a way to loop through the pages asynchronously. I have been recommended to take a look at twisted, but I'm not entirely sure how to proceed.
What you have here is a case where you don't know up front what to read next unless you call the API. Think of it like this: what can you do in parallel?
I don't know exactly how much you can do in parallel or in how many tasks, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or limits
- the data processing of one page/batch can be done independently of any other
What is slow is the I/O, so you can split the code into two parallel running tasks: one reads the data, puts it in a queue, and continues reading unless it hits the limit or an empty response, or pauses if the queue is full.
The second task then takes data from the queue and does something with it.
So you can call one task from another.
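As a rough illustration of this first, queue-based split (using plain threads rather than the Celery setup described below; the starting URL, the limit, and the per-item processing step are placeholders):

import queue
import threading

import requests

API_URL = 'https://example.com/api/documents'  # hypothetical starting URL
LIMIT = 1000                                   # hypothetical page limit

def reader(q):
    # Task 1: read pages and put them in the queue until the limit or an empty response.
    url, count = API_URL, 0
    while url and count <= LIMIT:
        response = requests.get(url).json()
        q.put(response['documents'])    # blocks (pauses) if the queue is full
        url = response.get('more_url')  # missing 'more_url' ends the loop
        count += 1
    q.put(None)                         # sentinel: no more pages

def processor(q):
    # Task 2: take batches from the queue and process them.
    while True:
        documents = q.get()
        if documents is None:
            break
        for item in documents:
            pass  # process each document here

q = queue.Queue(maxsize=10)  # bounded queue so the reader pauses when it gets ahead
t = threading.Thread(target=reader, args=(q,))
t.start()
processor(q)
t.join()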
The other approach is to have one task call the other one right after the data is read, so the execution runs in parallel but slightly shifted.
How would I implement it? With Celery tasks, and yes, with requests.
For example, the second approach:
import requests
from celery import task  # old-style Celery task decorator

@task
def do_data_process(data):
    # do something with the data
    pass

@task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or call the worker and try to do it in another process
    # do_data_process.delay(data)
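To start the chain you would enqueue the first page (the URL below is just a placeholder):

parse_one_page.delay('https://example.com/api/documents?page=1')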
How many tasks run in parallel depends on the limits you add to the code; you can have workers on multiple machines and use separate queues for parse_one_page and do_data_process.
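A minimal sketch of what those separate queues could look like, assuming old-style Celery routing settings and hypothetical queue and module names:

# celeryconfig.py (hypothetical module name)
CELERY_ROUTES = {
    'tasks.parse_one_page': {'queue': 'pages'},
    'tasks.do_data_process': {'queue': 'processing'},
}

# Run a dedicated worker per queue, possibly on different machines, e.g.:
#   celery -A tasks worker -Q pages --concurrency=4
#   celery -A tasks worker -Q processing --concurrency=8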
Why this approach, and not twisted or async?
Because you have CPU-bound data processing (JSON parsing, working with the data), and for that it is better to have separate processes, which Celery is perfect for.