sql导致生产服务器OOM事故

367 阅读7分钟

现象

线上某个服务有接口非常慢,通过监控链路查看发现,中间的 GAP 时间非常大,并且导致用户提现下单失败啦

企业微信截图_17339054039144.png

原因分析

先从监控链路分析啦一波,发现请求打到服务上,处理啦8,9秒,猜测是不是请求量过大,负载大,通过QPS监控查看,只有调用提现下单接口才会有问题,CPU突然飙升,同时频繁的GC,且时间长,但是看这段时间请求量不是很大,且这台机器因为Heap满了都down啦 企业微信截图_17339053437003.png

问题排查

1、查看错误日志发现没什么有用的信息就是OOM,但是发现有个SQL查询很不正常,看到这个SQL,我脑子都疼 能写出这个SQL,本人也是很佩服的,作者本人是写不出来的 #问题SQL explain select g.* , IFNULL(f.shipment_no , '') as shipment_no, IFNULL(f.shipment_company , '') as shipment_company from ( select ( select GROUP_CONCAT( a.commodity_name, '' ) from t_b2b_contract_commodity a where a.contract_id = b.id ) as commodity_name, b.contract_no as contract_no, b.sys_contract_no as sys_contract_no, b.trade_scene_code as trade_scene_code, b.id as contract_id, b.buyer_english_name as buyer_english_name, b.contract_amount_currency as contract_amount_currency, b.contract_amount as contract_amount, DATE_FORMAT( b.created_date, '%Y/%m/%d %T' ) created_date, c.settlement_exchange_available_amount as settlement_exchange_available_amount, b.buyer_country_code as buyer_country_code, DATE_FORMAT( b.signed_date, '%Y/%m/%d %T' ) as signed_date, b.store_website as store_website from t_b2b_contract b, t_b2b_contract_available_amount c where b.id in ("6499", "6498", "6497", "6538", "6537", "6505", "6504", "6503", "6502", "6501", "6500", "6507", "6506", "7076", "6620", "6569", "6568", "6621", "6652", "6625", "6624", "6622", "6697", "6690", "6664", "6654", "6653", "6649", "6626", "6648", "7230", "6698", "6693", "6692", "6659", "6658", "6657", "6656", "6724", "6723", "6722", "6694", "7115", "6934", "6891", "6889", "6887", "6886", "6817", "6816", "6815", "7117", "7116", "6892", "6822", "6821", "6820", "6819", "6818", "7032", "6930", "6893", "6826", "6825", "6824", "6823", "6894", "6868", "6867", "6866", "6865", "6864", "7029", "6970", "6869", "6935", "6931", "6872", "6871", "6870", "6969", "6968", "6897", "6895", "6874", "6873", "6932", "7265", "7075", "6971", "6967", "6966", "7034", "7033", "7031", "7321", "7107", "7078", "7077", "7081", "7080", "7079", "7110", "7109", "7108", "7322", "7240", "7237", "7183", "7182", "7181", "7180", "7179", "7178", "7319", "7233", "7232", "7236", "7235", "7234", "7472", "7239", "7238", "7619", "7471", "7268", "7267", "7266", "7317", "7316", "7621", "7539", "7418", "7416", "7415", "7414", "7622", "7538", "7421", "7419", "7417", "7699", "7697", "7426", "7425", "7424", "7423", "7422", "7420", "7470", "7469", "7468", "7467", "7427", "7712", "7708", "7531", "7530", "7537", "7535", "7534", "7533", "7532", "7696", "7629", "7628", "7627", "7626", "7625", "7624", "7623", "7536", "7631", "7711", "7630", "7636", "7635", "7634", "7633", "7632", "7710", "7695", "7694", "7692", "7716", "7715", "7714", "7713", "7709", "7878", "7877", "7870", "7869", "7879", "7867", "7866", "7885", "7882", "7880", "7887", "7886", "7945", "7944", "7893", "7892", "7891", "7889", "7888", "7884", "7946", "7943", "7942", "8581", "7953", "7952", "7951", "7950", "7949", "7948", "7947", "7983", "7982", "7981", "7980", "8578", "8101", "8100", "8099", "8097", "8096", "8095", "8094", "8093", "8092", "8106", "8105", "8104", "8103", "8102", "8114", "8111", "8109", "8108", "8107", "8351", "8166", "8113", "8112", "8352", "8165", "8164", "8163", "8868", "8361", "8356", "8355", "8354", "8353", "8359", "8357", "8867", "8589", "8363", "8362", "8360", "8840", "8368", "8367", "8366", "8365", "8369", "8371", "8577", "8370", "8580", "8579", "8842", "8584", "8583", "8582", "8588", "8587", "8586", "8585", "8590", "8843", "8593", "8592", "8591", "8841", "8849", "8848", "8847", "8846", "8845", "8844", "8852", "8851", "8850", "8857", "8856", "8855", "8854", "8853", "8858", "8863", "8862", "8861", "8860", "8859", "8866", "8865", "8864", "6370", "6370", "6370", "6370", "6370") ) g left join ( select d.* from t_b2b_contract_shipment d where id in ( select max(id) from t_b2b_contract_shipment e group by e.contract_id )) f on f.contract_id = g.contract_id

看下SQL执行计划

执行计划有一个type类型是ALL且rows行是400多万,当时我都震惊啦,因为合同数据表都没有这么多的数据, 且SQL是递归查询,咱们真的佩服这个兄弟的勇气啊,table类型是DERIVED:(衍生表),在from列表中包含的子查询会被标记为DERIVED,MySQL会递归执行子查询,把结果放在临时表中,递归400万的衍生表的JVM垃圾回收对象,字段,没有回收完,用户继续点击操作,足够导致4核8G的服务器OOM

企业微信截图_17339062267371.png

问题总结

其实就是因为两张表(t_b2b_contract b,t_b2b_contract_available_amount c)没有用外键关联,且没有做数据隔离,你查询数据不传当前用户的标识么,还有提现需要可结汇金额大于零吧,还有这种分组,函数为什么要在SQL里执行,为什么不查询过滤符合的SQL来在java代码里算,过滤,你想要的最终结果

企业微信截图_17339066968194.png

问题解决

先临时解决衍生表400多万数据的问题,先投产保证用户可以提现操作,恢复服务器,解决SQL导致的OOM 1、把衍生表的查询条件,需要关联的索引外键 2、排除不满足的数据 3、数据隔离-查询本用户的 4、需要结汇的用户,且可结汇额度大于零 企业微信截图_17339074301789.png

总结

其实SQL很简单,一点都不难,但是就是因为马虎,不细心,导致写出有问题的SQL,建议大家SQL内部需要索引,排除过滤掉不符合的数据,尽量不用SQL函数,可以用java代码处理

学习交流

大家可以多给作者留言和关注,谢谢,我会持续更新一些好的知识和帖子