postgresql - Efficient DB solution for system task tracking


I'm working on a data tracking system. The system is a multiprocess application written in Python, working in the following manner:

  1. Every S seconds it selects the N most appropriate tasks from the database (currently Postgres) and finds the data for them.
  2. If there are no tasks, it creates N new tasks and returns to (1).

The problem is the following: I have approx. 80 GB of data and 36 M tasks, and queries against the tasks table get slower and slower (it is the most populated and most heavily used table).

The main performance bottleneck is the task tracking query:

```sql
lock table task in access exclusive mode;

select *
from task
where line = 1
  and action = any(array['find', 'get'])
  and (stat in ('', 'cr1')
       or stat = 'error' and (actiondate <= now() or actiondate is null))
order by taskid, actiondate, action desc, idtype, date asc
limit 36;
```

The table definition:

```
                                 Table "public.task"
   Column   |            Type             |                    Modifiers
------------+-----------------------------+-------------------------------------------------
 number     | character varying(16)       | not null
 date       | timestamp without time zone | default now()
 stat       | character varying(16)       | not null default ''::character varying
 idtype     | character varying(16)       | not null default 'container'::character varying
 uri        | character varying(1024)     |
 action     | character varying(16)       | not null default 'find'::character varying
 reason     | character varying(4096)     | not null default ''::character varying
 rev        | integer                     | not null default 0
 actiondate | timestamp without time zone |
 modifydate | timestamp without time zone |
 line       | integer                     |
 datasource | character varying(512)      |
 taskid     | character varying(32)       |
 found      | integer                     | not null default 0
Indexes:
    "task_pkey" PRIMARY KEY, btree (idtype, number)
    "action_index" btree (action)
    "actiondate_index" btree (actiondate)
    "date_index" btree (date)
    "line_index" btree (line)
    "modifydate_index" btree (modifydate)
    "stat_index" btree (stat)
    "taskid_index" btree (taskid)
```

The query plan:

```
 Limit  (cost=312638.87..312638.96 rows=36 width=668) (actual time=1838.193..1838.197 rows=36 loops=1)
   ->  Sort  (cost=312638.87..313149.54 rows=204267 width=668) (actual time=1838.192..1838.194 rows=36 loops=1)
         Sort Key: taskid, actiondate, action, idtype, date
         Sort Method: top-N heapsort  Memory: 43kB
         ->  Bitmap Heap Scan on task  (cost=107497.61..306337.31 rows=204267 width=668) (actual time=1013.491..1343.751 rows=914586 loops=1)
               Recheck Cond: ((((stat)::text = ANY ('{"",cr1}'::text[])) OR ((stat)::text = 'error'::text)) AND (line = 1))
               Filter: (((action)::text = ANY ('{find,get}'::text[])) AND (((stat)::text = ANY ('{"",cr1}'::text[])) OR (((stat)::text = 'error'::text) AND ((actiondate <= now()) OR (actiondate IS NULL)))))
               Rows Removed by Filter: 133
               Heap Blocks: exact=76064
               ->  BitmapAnd  (cost=107497.61..107497.61 rows=237348 width=0) (actual time=999.457..999.457 rows=0 loops=1)
                     ->  BitmapOr  (cost=9949.15..9949.15 rows=964044 width=0) (actual time=121.936..121.936 rows=0 loops=1)
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..9449.46 rows=925379 width=0) (actual time=117.791..117.791 rows=920900 loops=1)
                                 Index Cond: ((stat)::text = ANY ('{"",cr1}'::text[]))
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..397.55 rows=38665 width=0) (actual time=4.144..4.144 rows=30262 loops=1)
                                 Index Cond: ((stat)::text = 'error'::text)
                     ->  Bitmap Index Scan on line_index  (cost=0.00..97497.14 rows=9519277 width=0) (actual time=853.033..853.033 rows=9605462 loops=1)
                           Index Cond: (line = 1)
 Planning time: 0.284 ms
 Execution time: 1838.882 ms
(19 rows)
```

Of course, all the fields involved are indexed. I'm thinking in two directions:

  1. How to optimize the query, and whether that will give me a meaningful performance improvement or not (currently it takes approx. 10 seconds per query, which is unacceptable for dynamic task tracking).
  2. Where and how it would be more effective to store the task data — maybe I should use another DB for such purposes, such as Cassandra, VoltDB, or some other big-data store?

I think the data should somehow be preordered so that the actual tasks can be fetched as fast as possible.

And please keep in mind that the current volume of 80 GB is a minimum rather than a maximum for such a task.

Thanks in advance!

I don't quite understand the use case, but it doesn't look to me like your indexes are working well. The query seems to be relying mostly on the stat index. I think you need a composite index on (action, line, stat).
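Such a composite index might be sketched as follows (the index name is my own choice, and the additional partial index on `line = 1` is my suggestion, not from the original post — the query always filters on that value, so a partial index would stay much smaller than the 9.6 M-row `line_index`):

```sql
-- Composite index matching the query's equality filters.
CREATE INDEX task_action_line_stat_idx
    ON task (action, line, stat);

-- Optional alternative: a partial index restricted to line = 1,
-- which excludes rows the query never touches.
CREATE INDEX task_action_stat_line1_idx
    ON task (action, stat)
    WHERE line = 1;
```

After creating an index, running `ANALYZE task;` and re-checking the plan with `EXPLAIN (ANALYZE)` would show whether the planner actually picks it up.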

Another option is to shard the data across multiple tables, splitting on a key with low cardinality. I don't use Postgres, so I don't think looking at another DB solution is going to work better unless you know exactly what you're optimizing for.
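In recent PostgreSQL versions (10+), splitting on a low-cardinality key can be expressed with declarative partitioning. A minimal sketch, assuming `line` is such a key (the table and partition names here are illustrative only):

```sql
-- Sketch: list-partition the task table on the low-cardinality
-- column 'line', so queries with line = 1 touch only one partition.
CREATE TABLE task_partitioned (
    number     varchar(16)  NOT NULL,
    idtype     varchar(16)  NOT NULL DEFAULT 'container',
    line       integer      NOT NULL,
    action     varchar(16)  NOT NULL DEFAULT 'find',
    stat       varchar(16)  NOT NULL DEFAULT '',
    actiondate timestamp
    -- ... remaining columns as in the original table ...
) PARTITION BY LIST (line);

CREATE TABLE task_line_1 PARTITION OF task_partitioned FOR VALUES IN (1);
CREATE TABLE task_line_2 PARTITION OF task_partitioned FOR VALUES IN (2);
```

With partition pruning, the `line = 1` predicate would confine the bitmap scans to `task_line_1` instead of the whole 36 M-row table.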

