ã¯ããã«

éçºç°å¢ã§ã¯äžç¬ã§çµããDBååŸããååŸå
ããªã³ãã¬ïŒãŸãã¯å¥ç°å¢ïŒã«å€ããã ãã§ããç°åžžã«é
ãããšããçµéšãããŸãããïŒ
ãããã¯ãŒã¯ã®ã¬ã€ãã³ã·ãå å³ããŠãç°åžžã«é
ãâŠ
ã¡ã¢ãªãCPUã«äœè£ãããã®ã«ïŒåã®Fetchä»¶æ°ãåŒãäžããŠãéããªããªãâŠ
ãã®èšäºã§ã¯ãç§ãçŽé¢ããäžå¯è§£ãªé床åé¡ãšããã®åçãªè§£æ±ºæ¹æ³ãå ±æããŸãã
èæ¯
ããPythonäžã®åŠçã§ãDirect Connectçµç±ã®Oracle DBããïŒåä»¶è¶
ã®ããŒã¿ãæœåºãããããåŠçããããŸããã
ããŒã¿éãå€ãããã100äžä»¶ããšã«æœåºã»ãã¡ã€ã«åºåãããŠããŸãããããããã®åŠçæéãç°åžžã«é·ããæ¥åèŠä»¶ãæºãããªãæãããããŸããã
çµè«ããèšããšãPythonäžã§ cx_Oracle (ãŸãã¯åŸç¶ã® python-oracledb) ã®ãã£ãäžã€ã®ãã©ã¡ãŒã¿ãèŠçŽãã ãã§ããã®åŠçã¯2å以äžé«éåããŸããã
ãã®åå ç¹å®ãŸã§ã®ããã»ã¹ãšãéµãšãªã£ããã©ã¡ãŒã¿ arraysize ã«ã€ããŠã玹ä»ããŸãã
ããã«ããã¯ã®ç¹å®ïŒç¯äººã¯ DB Fetch Time
ãŸãã¯ã©ãã«æéãããã£ãŠããã®ããã³ãŒãã«ãã°ãä»èŸŒãã§èšæž¬ããŸããã
åŠçã¯å€§ãããDBããããŒã¿ãååŸããCSVã«æžã蟌ã¿ããS3ã«ã¢ããããŒããã®3ã€ã«åãããŸãã
以äžãã1ãã¡ã€ã«ïŒ100äžä»¶ïŒãããã®åŠçæéã瀺ãããã°ã®æç²ã§ãã
INFO:root: - Total DB Fetch Time: 0:02:13.958024 <-- (å
šäœã®80%以äžïŒ)
INFO:root: - Total CSV Write Time: 0:00:18.858070
INFO:root:[1/161] Uploading file to S3 finished. Elapsed: 0:00:06.067239倿ããäºå®:
åŠçæéã®å€§åã data_rows = _cur.fetchmany(...) ã®åŠçãã€ãŸãDBããã®ããŒã¿ååŸã«è²»ããããŠããŸããã
CSVæžã蟌ã¿ãS3ã¢ããããŒãã¯ããã«ããã¯ã§ã¯ãããŸããã§ããã
åå ã®æ·±æãïŒãªã FETCH_ROW_COUNT ãå¢ãããŠãéããªããªãã£ãã®ãïŒ
ãã®çµæãåããç§ãã¡ã¯ãŸã FETCH_ROW_COUNTïŒäžåºŠã«ååŸããè¡æ°ïŒã1äžä»¶ãã2äžä»¶ãžãšå¢ãããŠã¿ãŸãããCPUãã¡ã¢ãªã«ã¯ååãªäœè£ããã£ãããã§ãã
ããããåŠçé床ã¯å šãå€ãããŸããã§ããã
ããã§ããã¥ã¡ã³ããèªã¿èŸŒã¿ãarraysize ãšãããã©ã¡ãŒã¿ã®ååšãšãfetchmany ã®çã®æåã«æ°ã¥ããŸãã
fetchmany(N): PythonåŽããæçµçã«N件欲ããããšåºãåœä»€cursor.arraysize: DBãšã®1åŸåŸ©ã®éä¿¡ã§ãå éšçã«äœä»¶ãŸãšããŠéã¶ããæ±ºããææ®µïŒéæ¬çšã®ç®±ã®ãµã€ãºïŒ
arraysize ãæå®ããªãå Žåãcx_Oracle ã®ããã©ã«ãå€ïŒéåžž100ä»¶ïŒã䜿ãããŸãã
ã€ãŸããã³ãŒãã§ fetchmany(20000) ãå®è¡ããŠããPythonã®è£åŽã§ã¯ã100ä»¶ãã ããããšãã现ããéä¿¡ãDBãšã®éã§ 200å ãçºçããŠããã®ã§ãïŒ
å®ã¯ããã®ãéä¿¡åæ°ã®å€ããããããéçºç°å¢ã§ã¯çéã ã£ãã®ã«æ¬çªã§ã¯çµãããªããšããçŸè±¡ã®æ£äœã§ããã
ããã«ããã¯ã®æ£äœïŒãããã¯ãŒã¯é å»¶ïŒã¬ã€ãã³ã·ïŒ
äž¡ç°å¢ã®æ±ºå®çãªéãã¯ãéä¿¡1åãããã®åŸ ã¡æéãã§ãã
- éçºç°å¢ïŒåäžVPCå
ã®RDSïŒ:
ãããã¯ãŒã¯é å»¶ãã»ãŒãŒãã200åŸåŸ©ããŠãäžç¬ã§çµãããããåé¡ãé èœãããŠããŸããã - æ¬çªç°å¢ïŒDirect Connectçµç±ã®ãªã³ãã¬DBïŒ:
ç©ççãªè·é¢ãããã1åŸåŸ©ãããã®é å»¶ãçºçããŸãããã®ãé å»¶ à èšå€§ãªåŸåŸ©åæ°ããç©ã¿éãªããèŽåœçãªããã«ããã¯ãšãªã£ãŠããŸããã
解決çïŒãã£ã1è¡ã®è¿œå
åå ããããã°è§£æ±ºã¯ç°¡åã§ãã
cx_Oracle ã®ã«ãŒãœã«ãäœæããçŽåŸã«ãarraysize ã fetchmany ã§æå®ãããµã€ãºãšåãå€ã«èšå®ããã ãã§ããã
def execute_for_get_row(...):
_cur = _conn.cursor(...)
# ååŸãããä»¶æ°ãèšå®ïŒäŸ: 20,000ä»¶ïŒ
batch_size = int(os.environ.get('FETCH_ROW_COUNT', 20000))
# â
â
â
远å ãã1è¡ â
â
â
# 1åã®éä¿¡ã§éã¶ãµã€ãºïŒarraysizeïŒããåŠçãããä»¶æ°ã«åããã
_cur.arraysize = batch_size
_cur.execute(sql)
return _curBefore / After ã®æ¯èŒ
# â Before: ããã©ã«ãã®arraysize(100)ã§200åéä¿¡
_cur = _conn.cursor()
data = _cur.fetchmany(20000) # å
éšã§200åŸåŸ©çºç
# â
After: arraysizeãèšå®ããŠéä¿¡åæ°ãåæž
_cur = _conn.cursor()
_cur.arraysize = 20000 # ãã®1è¡ã远å
data = _cur.fetchmany(20000) # 倧å¹
ã«åŸåŸ©åæ°ãæžå°æ€èšŒïŒéçºç°å¢ã§ã©ããã£ãŠãé å»¶ããåçŸãããïŒ
ä¿®æ£ã³ãŒããã§ããŸããããéçºç°å¢ïŒåäžVPCå
ã®RDSïŒã§å®è¡ããŠããå
ã
éãã£ãããå¹æãæž¬å®ã§ããŸããã
ãæ¬çªã§æ¬åœã«éããªãã®ãïŒãã確èªãããããéçºç°å¢ã§ç䌌çã«æ¬çªã®ãããã¯ãŒã¯ç¶æ³ãåçŸããããšã«ããŸããã
å¥ãªãŒãžã§ã³ãžã®ãé è·é¢æ¥ç¶ããã¹ã
åäžãªãŒãžã§ã³å ã§VPCãåããŠãAWSã®ããã¯ããŒã³ãéãããŠé å»¶ãåçŸã§ããªãã£ãããããããŠç©ççã«é¢ãããªãŒãžã§ã³ïŒç±³åœæ±éšãªã©ïŒã«ãã¹ãçšDBãäœæããVPC Peeringã§æ¥ç¶ããŸããã
çµæïŒ
æåŸ
éãïŒïŒïŒãä¿®æ£åã®ã³ãŒãã§ã¯åŠçã«æ°æéããããæ¿é
ç°å¢ããåçŸã§ããŸããã
ãããŠä¿®æ£åŸã®ã³ãŒããæµãããšãããåçãªé床åäžã確èªã§ããèªä¿¡ãæã£ãŠæ¬çªé©çšã«é²ãããšãã§ããŸããã
åçãªæ¹åçµæ
ãã®1è¡ã远å ããŠãæ¬çªç°å¢ã§ãžã§ããå®è¡ããŸããã
- ä¿®æ£å: åå1æã«éå§ â 8æéããŠãçµãããªãâŠ
- ä¿®æ£åŸ: åå1æã«éå§ â 3æ52åçµäº (4æé以äžççž®ïŒ)
arraysize ãé©åã«èšå®ããããšã§ããããã¯ãŒã¯é
å»¶ã«ããããã«ãã£ãåçã«æ¹åã§ããããšã蚌æãããŸããã
arraysize èšå®ã®ãã¹ããã©ã¯ãã£ã¹
æšå¥šå€ã®ç®å®
arraysize ã®é©åãªå€ã¯ãã¡ã¢ãªãªãœãŒã¹ãšããã©ãŒãã³ã¹ã®ãã©ã³ã¹ã§æ±ºãŸããŸãã
- å°èŠæš¡ããŒã¿ïŒã1äžä»¶ïŒ: 5,000ã10,000
- äžèŠæš¡ããŒã¿ïŒã100äžä»¶ïŒ: 10,000ã50,000
- å€§èŠæš¡ããŒã¿ïŒ100äžä»¶ãïŒ: 50,000ã100,000
ã¡ã¢ãªæ¶è²»ã®èšç®æ¹æ³
# ã¡ã¢ãªæ¶è²»éã®ç®å®
# arraysize à 1è¡ãããã®ãã€ãæ° = äžæçãªã¡ã¢ãªäœ¿çšé
# äŸ: arraysize=20,000ã1è¡ããã1KB â çŽ20MBã®ã¡ã¢ãªäœ¿çš
# å®å
šãªèšå®äŸ
_cur.arraysize = min(batch_size, 50000) # äžéãèšããŠå®å
šæ§ãç¢ºä¿æ³šæç¹
- ãµã€ãºã倧ãããããããšã¯ã©ã€ã¢ã³ãåŽã®ã¡ã¢ãªæ¶è²»éãå¢ãããããã¢ããªã±ãŒã·ã§ã³ãµãŒããŒã®ã¡ã¢ãªãªãœãŒã¹ãšã®ãã©ã³ã¹ãèæ ®ããŠãã ãã
- OOMïŒOut of MemoryïŒãšã©ãŒãçºçããå Žåã¯ã
arraysizeãå°ããããããããããµã€ãºèªäœãèŠçŽããŠãã ãã
ä»ã®DBã©ã€ãã©ãªã§ãåæ§ã®èšå®ãå¿ èŠ
ãã®åé¡ã¯ cx_Oracle ã«éã£ã話ã§ã¯ãããŸãããä»ã®ããŒã¿ããŒã¹ã©ã€ãã©ãªã§ãã倧éããŒã¿ãæ±ãéã«ã¯ããã©ã«ãèšå®ã®ãŸãŸã ãšããã©ãŒãã³ã¹ãåºãªãã£ãããã¡ã¢ãªäžè¶³ã«ãªã£ããããçœ ããããŸãã
SQL Server (pyodbc) ã®å Žå
Oracleãšåæ§ã« cursor.arraysize ã§å¶åŸ¡ããŸãããããã©ã«ãå€ãã1ãã§ããããšã«æ³šæããŠãã ããã
èšå®ãå¿ãããš1è¡ããšã«ãããã¯ãŒã¯éä¿¡ãçºçããOracle以äžã«å£æ»
çãªé
å»¶ãçºçããŸããå¿
ã倧ããªå€ãèšå®ããŸãããã
PostgreSQL (psycopg2) ã®å Žå
psycopg2 ã®éåžžã®ã«ãŒãœã«ïŒã¯ã©ã€ã¢ã³ããµã€ãã«ãŒãœã«ïŒã¯ãexecute() ãå®è¡ããç¬éã«å
šããŒã¿ãã¡ã¢ãªã«èªã¿èŸŒã¿ãŸãã
ãã®ããã倧éããŒã¿ãæ±ãå Žåã¯ã¡ã¢ãªäžè¶³ã«ãªãå±éºæ§ããããarraysize ãèšå®ããŠãéä¿¡åæ°ã®åæžã«ã¯ãªããŸããã
倧éããŒã¿ãæ±ãéã¯ãå¿
ãããµãŒããŒãµã€ãã«ãŒãœã«ïŒååä»ãã«ãŒãœã«ïŒãã䜿çšããŠãã ããããã®å ŽåãitersizeïŒããã©ã«ã: 2000ïŒãšãããã©ã¡ãŒã¿ã§ãããã¯ãŒã¯åŸåŸ©æã®ååŸä»¶æ°ãå¶åŸ¡ããŸãã
# â éåžžã®ã«ãŒãœã« (倧éããŒã¿ã«ã¯äžåã)
# execute() æç¹ã§å
šä»¶ååŸããããããã¡ã¢ãªãå§è¿«ãã
cursor = conn.cursor()
# â
ãµãŒããŒãµã€ãã«ãŒãœã« (ååãæå®ããŠäœæ)
# itersize åã ãå°åãã«ãããã¯ãŒã¯éä¿¡ããŠååŸãã
cursor = conn.cursor(name='my_cursor_name')
cursor.itersize = 10000 # â
ããã§ããããµã€ãºãæå®
cursor.execute("SELECT ...")MySQL (mysql-connector-python) ã®å Žå
MySQLå
¬åŒãã©ã€ãããããã©ã«ãïŒbuffered=TrueïŒã§ã¯ã¯ãšãªçµæãå
šä»¶ã¡ã¢ãªã«èªã¿èŸŒã¿ãŸãã
1åä»¶ã®ãããªããŒã¿ãæ±ããšå³åº§ã«ã¡ã¢ãªäžè¶³ã«ãªããããã«ãŒãœã«äœææã« buffered=False ãæå®ããŠãããŒã¿ãã¹ããªãŒãã³ã°ã§ååŸããããã«èšå®ããå¿
èŠããããŸãã
çµè«ïŒåŠãã ããš
ä»åã®çµéšããåŠãã ããšã¯3ã€ã§ãã
arraysizeã¯å¿ é :
cx_OracleïŒåŸç¶ã®python-oracledbãåæ§ïŒãpsycopg2ïŒitersizeïŒã§å€§éããŒã¿ãæ±ãéããã®èšå®ã¯å¿ é ã§ããäžè¬çã«ã¯1äžã10äžä»¶çšåºŠãç®å®ã«ãã¡ã¢ãªãªãœãŒã¹ãšã®ãã©ã³ã¹ãèæ ®ããŠèšå®ããŠãã ããã- é
å»¶ã¯éçºç°å¢ã§ã¯é èœããã:
ã¯ã©ãŠãå ïŒåäžVPCïŒã§ã®éçºã§ã¯ãéä¿¡ãéãããŠãéä¿¡åæ°ã®å€ãããšããåé¡ãèŠããªããªããŸãã - æ§èœãã¹ãã¯ãé
å»¶ããåçŸãã:
æ¬çªããããã¯ãŒã¯è¶ãã®DBãªããéçºç°å¢ã§ãå¥ãªãŒãžã§ã³ã«DBã眮ããªã©ããŠãæå³çã«é å»¶ãåçŸãããã¹ããéåžžã«éèŠã§ãã
ãã£ã1è¡ã®ã³ãŒãããæ°æéã®åŠçæéãåæžã§ããããšããããŸãã
ãã®èšäºããåãããã«ãããã¯ãŒã¯è¶ãã®DBæ§èœã«æ©ã誰ãã®å©ãã«ãªãã°å¹žãã§ãã
åèãªã³ã¯
æ¬èšäºã®å·çã«ããããåç §ããå ¬åŒããã¥ã¡ã³ãã§ãã
- Oracle (python-oracledb):
Tuning Fetch Performance
arraysizeã倿Žããããšã§ãã©ãŠã³ãããªããåæ°ãæžããããã©ãŒãã³ã¹ãåäžãããæ¹æ³ã«ã€ããŠã - PostgreSQL (psycopg2):
cursor.fetchmany
Server-side cursors
ããã©ãŒãã³ã¹ã®ããã« arraysizeïŒãŸãã¯itersizeïŒå±æ§ã䜿çšããã®ãæåã§ãããšããèšè¿°ã - SQL Server (pyodbc):
pyodbc Wiki: Cursor features
arraysize ã®ããã©ã«ãã 1 ã§ããããšã«ã€ããŠã®èšè¿°ã - MySQL (mysql-connector-python / PyMySQL):
MySQL Connector/Python: cursor()
PyMySQL: SSCursor
ããã©ã«ãã§ã¯å šä»¶ã¡ã¢ãªã«èªã¿èŸŒã仿§ã§ããããšãšã倧éããŒã¿åŠçæã®èšå®ïŒbuffered=Falseã SSCursorïŒã«ã€ããŠã