python - Looping through a paginated API asynchronously
I'm ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I currently have code that functions roughly as follows:

    import urllib2
    import simplejson

    while c <= limit:
        if not api_url:
            break

        req = urllib2.Request(api_url)
        opener = urllib2.build_opener()
        f = opener.open(req)
        response = simplejson.load(f)

        for item in response['documents']:
            # process each document here
            pass

        # follow the link to the next page, if there is one
        if 'more_url' in response:
            api_url = response['more_url']
        else:
            api_url = None
            break
        c += 1

Downloading the data this way is really slow, and I was wondering if there is any way to loop through the pages in an async way. I have been recommended to take a look at Twisted, but I am not entirely sure how to proceed.
What you have here is that you do not know up front what will be read next unless you call the API. Think of it like this: what can you do in parallel?
I do not know how much you can do in parallel and for which tasks, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or limits
- the data processing of one page/batch can be done independently of the others
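What is slow is the IO, so you can split your code into two tasks running in parallel. The first one reads data, puts it in a queue, and continues reading unless it hits the limit or an empty response, or pauses if the queue is full.
Then there is a second task, which takes data from the queue and does something with the data.
So you can call one task from another.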
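The reader/processor split described above can be sketched with only the standard library. This is a minimal illustration in Python 3 (the question's code is Python 2), not the answer's own implementation: `FAKE_API`, `fetch_page`, `reader`, `processor` and `run` are hypothetical names, and the in-memory dictionary stands in for the real HTTP calls so the sketch runs without network access.

```python
import queue
import threading

# Hypothetical in-memory stand-in for the paginated API (an assumption made
# so the sketch runs offline). Each page may point to the next via 'more_url'.
FAKE_API = {
    "page/1": {"documents": [1, 2, 3], "more_url": "page/2"},
    "page/2": {"documents": [4, 5], "more_url": "page/3"},
    "page/3": {"documents": [6]},
}

def fetch_page(url):
    """Stand-in for the real HTTP request and JSON decoding."""
    return FAKE_API[url]

_DONE = object()  # sentinel telling the processor that no more pages follow

def reader(start_url, pages, limit):
    """Follow 'more_url' links and put each page on the queue."""
    url, count = start_url, 0
    while url and count < limit:
        response = fetch_page(url)
        pages.put(response)             # blocks (pauses) while the queue is full
        url = response.get("more_url")  # None on the last page ends the loop
        count += 1
    pages.put(_DONE)

def processor(pages, results):
    """Take pages off the queue and do something with each document."""
    while True:
        page = pages.get()
        if page is _DONE:
            break
        for item in page["documents"]:
            results.append(item * 10)   # placeholder for the real processing

def run(start_url, limit=1000, max_queued=2):
    pages = queue.Queue(maxsize=max_queued)  # bounded queue between the tasks
    results = []
    workers = [
        threading.Thread(target=reader, args=(start_url, pages, limit)),
        threading.Thread(target=processor, args=(pages, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Calling `run("page/1")` walks all three fake pages. Threads only help here while the processor is waiting on IO or doing light work; error handling (for example a failed request in `reader`) is left out of this sketch.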
The other approach is to have one task that calls the other one immediately after the data is read, so their execution runs in parallel, just slightly shifted.
How would I implement it? As Celery tasks and, yes, requests.
E.g. the second one:
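    import requests
    # `task` is Celery's task decorator; the exact import depends on
    # the Celery version in use

    @task
    def do_data_process(data):
        # do something with the data
        pass

    @task
    def parse_one_page(url):
        response = requests.get(url)
        data = response.json()

        # schedule the next page before processing this one
        if 'more_url' in data:
            parse_one_page.delay(data['more_url'])

        # and here do the data processing in this task
        do_data_process(data)
        # or call a worker and try to do it in another process
        # do_data_process.delay(data)

How many tasks run in parallel is up to you. If you add limits to your code, you can have workers on multiple machines and separate queues for parse_one_page and do_data_process.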
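To illustrate what "adding limits" could look like without a Celery setup, here is a minimal standard-library sketch in Python 3 of the same shifted execution: each page is handed to a worker pool as soon as it is read, while the loop goes on to fetch the next one. It swaps Celery for `concurrent.futures`, so it is an analogy rather than the answer's method. `FAKE_API`, `fetch_page`, `process_page`, `crawl`, `max_pages` and `max_workers` are all hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the paginated API (an assumption made
# so the sketch runs offline).
FAKE_API = {
    "page/1": {"documents": [1, 2, 3], "more_url": "page/2"},
    "page/2": {"documents": [4, 5], "more_url": "page/3"},
    "page/3": {"documents": [6]},
}

def fetch_page(url):
    """Stand-in for requests.get(url).json()."""
    return FAKE_API[url]

def process_page(page):
    """Placeholder for the real per-page processing."""
    return sum(page["documents"])

def crawl(start_url, max_pages=1000, max_workers=4):
    """Read pages one after another, processing each one in the pool.

    max_pages caps how many pages are read; max_workers caps how many
    pages are processed at the same time.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        url = start_url
        while url and len(futures) < max_pages:
            page = fetch_page(url)
            # hand the page off and continue with the next page right away
            futures.append(pool.submit(process_page, page))
            url = page.get("more_url")
    # results come back in page order because futures were stored in order
    return [future.result() for future in futures]
```

For example, `crawl("page/1", max_pages=2)` stops after the second fake page. If the processing turns out to be CPU-heavy, `ProcessPoolExecutor` offers the same `submit` interface and may be the better fit.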
Why this approach, and not Twisted or async?
Because you have CPU-bound data processing (decoding the JSON, then working on the data), and for that it is better to have separate processes. Celery is perfect for them.