Links: original article · original paper · partial open-source code · dataset introduction
To reproduce: in `from CLHLS.get_3year_data import get_period_data`, `CLHLS` is the name of the folder containing the scripts (reference). The SPSS `.sav` files are first converted to CSV in R:
```r
library(foreign)
write.table(read.spss("inFile.sav"), file = "outFile.csv", quote = TRUE, sep = ",")
```
The notebook Descriptive Data.ipynb handles the data preprocessing. It relies on Python's enumerate() function, the .columns attribute, which returns a DataFrame's column labels (reference), lists, and the map() function: map() applies a function over a sequence. Its first argument is a function that is called with each element of the sequence in turn, and the results are collected into a new list (in Python 3, map() returns an iterator, hence the list() wrapper below).
```python
if start_year != 2011:
    three_years_later_columns = list(
        map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index, basic_columns))
else:
    three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
```
This builds the variable names for the three-year follow-up. Before 2011, the follow-up survey's age field is named vage_<year>; from the 2011 wave on, it is named trueage_<year>, which is why the mapping of x depends on the `if start_year != 2011:` branch.
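The suffix rule can be sketched as a small standalone helper (the function name here is mine, not the repo's):

```python
def three_years_later_columns(basic_columns, start_year):
    """Build follow-up column names: one-digit year suffix before 2010,
    two-digit afterwards; before the 2011 wave the follow-up age column
    is 'vage_<yy>' rather than 'trueage_<yy>'."""
    end = start_year + 3
    year_index = '_' + str(end)[-1] if end < 2010 else '_' + str(end)[-2:]
    if start_year != 2011:
        return ['vage' + year_index if c == 'trueage' else c + year_index
                for c in basic_columns]
    return [c + year_index for c in basic_columns]
```

For the 2002 wave this yields 'vage_5' for age, while for the 2011 wave it yields 'trueage_14'.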
```python
for personal_stable_variable in list(map(lambda x: x + year_index, ['a2', 'f1', 'a1', 'a41', 'a43'])):
    three_years_later_columns.remove(personal_stable_variable)
total_columns = basic_columns + three_years_later_columns
```
Next, keep only the elders who were still alive and not lost to follow-up three years later. For example, dth11_14 tracks survival between the 2011 baseline and the 2014 follow-up; a value of 0 means the respondent was still alive in 2014.
```python
questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1', 'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                          'd75', 'd86', 'd12', 'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c',
                          'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b',
                          'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8
```
For the remaining variables, the value 8 (and its longer variants) encodes a missing answer. The replace() calls below turn those numeric missing codes into NaN.
```python
variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
new_df = df.copy()
for var in variable_needed_processed:
    new_df[var] = new_df[var].replace(8, np.nan)
    new_df[var] = new_df[var].replace(88, np.nan)
    new_df[var] = new_df[var].replace(888, np.nan)
    new_df[var] = new_df[var].replace(8888, np.nan)
    new_df[var] = new_df[var].replace(9, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(9999, np.nan)
```
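The eight chained replace() calls (note the repo repeats 99 twice and never maps 999, which looks like a typo) can be collapsed into a single vectorized replace. This helper is my own sketch, not the repo's code:

```python
import numpy as np
import pandas as pd

MISSING_CODES = [8, 88, 888, 8888, 9, 99, 999, 9999]

def mask_missing_codes(df, skip_columns=()):
    """Replace CLHLS numeric missing codes with NaN in one pass, leaving
    columns where 8/9 are legitimate answers untouched."""
    out = df.copy()
    cols = [c for c in out.columns if c not in set(skip_columns)]
    out[cols] = out[cols].replace(MISSING_CODES, np.nan)
    return out
```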
The code below computes the missing rate of each variable: df.isnull() flags missing cells, df.isnull().any() picks out the columns that contain at least one missing value, and new_df.loc[:, new_df.isnull().any()].isnull().sum() counts the missing values in each such column; dividing by the total row count gives the missing rate.
```python
print('Missing values situation')
print(
    (new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df)).apply(lambda x: str('%.2f' % (x * 100)) + '%'))
for col in new_df.columns[new_df.isnull().any()]:
    mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
    new_df[col].fillna(mode, inplace=True)
```
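The mode lookup via value_counts().sort_values().index[0] is equivalent to pandas' built-in Series.mode(); a minimal sketch of the fill step (my own helper, not the repo's):

```python
import pandas as pd

def fill_with_mode(s):
    """Fill NaNs in a Series with its most frequent value."""
    return s.fillna(s.mode().iloc[0])
```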
Then drop respondents who could answer none of the 23 MMSE questions, i.e. remove those rows. `new_df = new_df[new_df.isnull().sum(axis=1) < 23]` keeps anyone who answered at least one question; the prints before and after report how many respondents with no MMSE answers were removed.
```python
MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b', 'c31c', 'c31d', 'c31e',
                  'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))
for var in MMSE_questions:
    new_df[var] = new_df[var].replace(8, np.nan)
    new_df[var] = new_df[var].replace(88, np.nan)
    new_df[var] = new_df[var].replace(888, np.nan)
    new_df[var] = new_df[var].replace(8888, np.nan)
    new_df[var] = new_df[var].replace(9, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(9999, np.nan)
print('Samples before drop current MMSE missing : ', len(new_df))
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop current MMSE missing : ', len(new_df))
for var in MMSE_questions_later:
    new_df[var] = new_df[var].replace(8, np.nan)
    new_df[var] = new_df[var].replace(88, np.nan)
    new_df[var] = new_df[var].replace(888, np.nan)
    new_df[var] = new_df[var].replace(8888, np.nan)
    new_df[var] = new_df[var].replace(9, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(99, np.nan)
    new_df[var] = new_df[var].replace(9999, np.nan)
print('Samples before drop later MMSE missing : ', len(new_df))
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop later MMSE missing : ', len(new_df))
```
Finally, count the respondents aged 60 or older and compare with the counts before filtering.
```python
print('Before drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))
first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]
print('After drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))
```
Preprocessing overview: stage one includes four non-overlapping three-year waves of survey data collected between 2002 and 2014, together with the required column variables (survey variables). The detailed participant-selection flow is shown in Figure 1. All participants were followed up every three years (one wave).
```python
def get_data_between_start_to_end(df, start_year):
    short_year = str(start_year + 3)[-1] if start_year + 3 < 2010 else str(start_year + 3)[-2:]
    variable_list = []
    index = 0
    minux_index = len(short_year) + 1
    # collect baseline columns until the first follow-up column ('_<yy>' suffix) appears
    for i, variable in enumerate(df.columns):
        if variable[-minux_index:] == '_' + short_year:
            index = i
            break
        variable_list.append(variable)
    print(index)
    # then collect follow-up columns until the next period's variables begin
    for j in range(index, len(df.columns)):
        if df.columns[j] == 'dth' + str(start_year + 3)[-2:] + '_' + str(start_year + 6)[-2:] \
                or df.columns[j][-3:] == '_' + str(start_year + 7)[-2:]:
            break
        variable_list.append(df.columns[j])
    print(variable_list)
    return df.loc[:, variable_list]
```
Data preprocessing and statistics: based on variables such as dth02_05, count how many respondents survived, died, or were lost to follow-up in each three-year period.
```python
"""
Data preprocessing and statistics
"""
def get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14):
    data11to14.dth11_14 = pd.to_numeric(data11to14.dth11_14, errors='coerce')
    data08to11.loc[data08to11[data08to11.dth08_11 == 2011].index, 'dth08_11'] = 0
    observations = []
    survivor = []
    death = []
    missing = []
    df_dict = {'df1': data02to05, 'df2': data05to08, 'df3': data08to11, 'df4': data11to14}
    df_var_dict = {'df1': 'dth02_05', 'df2': 'dth05_08', 'df3': 'dth08_11', 'df4': 'dth11_14'}
    for df in df_dict.keys():
        print(df)
        print(df_dict[df])
        print(len(df_dict[df]))
        counter = df_dict[df][df_var_dict[df]].value_counts()
        observations.append(len(df_dict[df]))
        survivor.append(counter[0])   # 0 = alive at follow-up
        death.append(counter[1])      # 1 = died
        missing.append(counter[-9])   # -9 = lost to follow-up
    return pd.DataFrame({'start year observations': observations,
                         'survivor': survivor,
                         'death': death,
                         'missing': missing},
                        index=['2002 to 2005', '2005 to 2008', '2008 to 2011', '2011 to 2014'])

situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
print('Number of investigated elders situations : ', '\n', situation)
```
Filter down to respondents who are still alive, then select the needed columns, removing the time-invariant personal-information questions from the three-year follow-up set.
```python
def get_wave_data(df, start_year):
    basic_columns = ['a1', 'trueage', 'a2', 'f1', 'f41', 'f34', 'a41', 'a43', 'a51',
                     'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                     'd71', 'd72', 'd75', 'd81', 'd82', 'd86', 'd91', 'd92', 'd31', 'd32',
                     'd11b', 'd11c', 'd11d', 'd11e', 'd11f', 'd11g', 'd11h', 'd12',
                     'e1', 'e2', 'e3', 'e4', 'e5', 'e6',
                     'g15a1', 'g15b1', 'g15c1', 'g15d1', 'g15e1', 'g15f1', 'g15g1', 'g15h1', 'g15i1', 'g15j1',
                     'g15k1', 'g15l1', 'g15m1', 'g15n1', 'g15o1',
                     'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b', 'c31c',
                     'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    if start_year != 2011:
        three_years_later_columns = list(
            map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index, basic_columns))
        print(three_years_later_columns)
    else:
        three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
        print(three_years_later_columns)
    for personal_stable_variable in list(map(lambda x: x + year_index, ['a2', 'f1', 'a1', 'a41', 'a43'])):
        three_years_later_columns.remove(personal_stable_variable)
    total_columns = basic_columns + three_years_later_columns
    death_index_var = 'dth' + str(start_year)[-2:] + '_' + str(start_year + 3)[-2:]
    live_df = df[df[death_index_var] == 0]
    print(live_df)
    wave_data = live_df[total_columns]
    print(wave_data)
    for column in wave_data.columns:
        wave_data[column] = pd.to_numeric(wave_data[column], errors='coerce')
    return wave_data
```
Compute each variable's missing rate, fill the missing values with the column mode, and remove respondents who could answer none of the MMSE questions.
```python
def replace_abnormal_and_fill_nan(df, start_year):
    questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1', 'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                              'd75', 'd86', 'd12', 'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c',
                              'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b',
                              'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
    total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8
    variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
    new_df = df.copy()
    for var in variable_needed_processed:
        new_df[var] = new_df[var].replace(8, np.nan)
        new_df[var] = new_df[var].replace(88, np.nan)
        new_df[var] = new_df[var].replace(888, np.nan)
        new_df[var] = new_df[var].replace(8888, np.nan)
        new_df[var] = new_df[var].replace(9, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(9999, np.nan)
    print('Missing values situation')
    print(
        (new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df)).apply(lambda x: str('%.2f' % (x * 100)) + '%'))
    for col in new_df.columns[new_df.isnull().any()]:
        mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
        new_df[col].fillna(mode, inplace=True)
    MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b', 'c31c', 'c31d',
                      'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))
    for var in MMSE_questions:
        new_df[var] = new_df[var].replace(8, np.nan)
        new_df[var] = new_df[var].replace(88, np.nan)
        new_df[var] = new_df[var].replace(888, np.nan)
        new_df[var] = new_df[var].replace(8888, np.nan)
        new_df[var] = new_df[var].replace(9, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(9999, np.nan)
    print('Samples before drop current MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop current MMSE missing : ', len(new_df))
    for var in MMSE_questions_later:
        new_df[var] = new_df[var].replace(8, np.nan)
        new_df[var] = new_df[var].replace(88, np.nan)
        new_df[var] = new_df[var].replace(888, np.nan)
        new_df[var] = new_df[var].replace(8888, np.nan)
        new_df[var] = new_df[var].replace(9, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(99, np.nan)
        new_df[var] = new_df[var].replace(9999, np.nan)
    print('Samples before drop later MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop later MMSE missing : ', len(new_df))
    return new_df
```
Combine the steps and keep respondents aged 60 and over.
```python
def get_period_data():
    print('Reading data ...')
    data_02to18 = pd.read_csv('./data/clhls_2002_2018_utf8.csv', low_memory=False)
    data_05to18 = pd.read_csv('./data/clhls_2005_2018_utf8.csv', low_memory=False)
    data_08to18 = pd.read_csv('./data/clhls_2008_2018_utf8.csv', low_memory=False)
    data_11to18 = pd.read_csv('./data/clhls_2011_2018_utf8.csv', low_memory=False)
    print('Getting 3 year periods data ...')
    data02to05 = get_data_between_start_to_end(data_02to18, 2002)
    data05to08 = get_data_between_start_to_end(data_05to18, 2005)
    data08to11 = get_data_between_start_to_end(data_08to18, 2008)
    data11to14 = get_data_between_start_to_end(data_11to18, 2011)
    situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
    print('Number of investigated elders situations : ', '\n', situation)
    print('Selecting needed variables ...')
    first_wave = get_wave_data(data02to05, 2002)
    second_wave = get_wave_data(data05to08, 2005)
    third_wave = get_wave_data(data08to11, 2008)
    fourth_wave = get_wave_data(data11to14, 2011)
    print('Filling missing values ...')
    first_wave_new = replace_abnormal_and_fill_nan(first_wave, 2002)
    second_wave_new = replace_abnormal_and_fill_nan(second_wave, 2005)
    third_wave_new = replace_abnormal_and_fill_nan(third_wave, 2008)
    fourth_wave_new = replace_abnormal_and_fill_nan(fourth_wave, 2011)
    print('Before drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
    second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
    third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
    fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]
    print('After drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    return first_wave_new, second_wave_new, third_wave_new, fourth_wave_new
```
Part two computes three scores: the CI (cognitive impairment) score, the depression score, and the ADL score.
```python
new_df = df.copy()
current_questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b', 'c31c',
                     'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
three_year_questions = list(map(lambda x: x + year_index, current_questions))
questions = current_questions + three_year_questions
```
The apply() method (reference): the function passed to apply() here is calculate_CI_score, which can compute both the baseline CI score and the CI score at the three-year follow-up:
```python
def calculate_CI_score(x, periods='current', start_year=None):
    if periods == 'current':
        questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b', 'c31c',
                     'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    elif periods == 'three_years_later':
        year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
        questions = list(map(lambda q: q + year_index,
                             ['c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c21a', 'c21b', 'c21c', 'c31a', 'c31b',
                              'c31c', 'c31d', 'c31e', 'c32', 'c41a', 'c41b', 'c41c', 'c51a', 'c51b', 'c52',
                              'c53a', 'c53b', 'c53c']))
    CI_score = 0
    for question in questions:
        if question[0:3] != 'c16':
            if x[question] == 1:
                CI_score += 1
        else:
            # c16 (naming foods) contributes up to 7 points
            food_score = x[question] if x[question] <= 7 else 7
            CI_score += food_score
    return CI_score
```
Compute the baseline CI score and the CI score at the three-year follow-up:
```python
new_df['current_CI_score'] = new_df.apply(lambda x: calculate_CI_score(x), axis=1)
new_df['later_CI_score'] = new_df.apply(
    lambda x: calculate_CI_score(x, periods='three_years_later', start_year=start_year), axis=1)
```
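As a toy illustration of the row-wise apply used here (the data is made up; only the c16 capping rule mirrors calculate_CI_score):

```python
import pandas as pd

toy = pd.DataFrame({'c11': [1, 2], 'c16': [5, 9]})

def toy_ci(row):
    # 1 point for a correct c11 answer, plus the c16 food-naming count capped at 7
    score = 1 if row['c11'] == 1 else 0
    return score + (row['c16'] if row['c16'] <= 7 else 7)

toy['CI'] = toy.apply(toy_ci, axis=1)  # axis=1 passes each row to the function as a Series
```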
Once scoring is done, the new score columns are added and the underlying question columns are dropped. Pandas methods that would otherwise modify the data and return a new object usually take an optional inplace parameter. If set to True (the default is False), the original object is modified directly: the name keeps pointing at the same memory, whose contents change in place.
```python
print(frame.drop(['a']))             # drop row(s) by index label
print(frame.drop(['Ohio'], axis=1))  # drop column(s) with axis=1
```
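The contrast between the copying and in-place forms, on a toy frame of my own:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1], 'b': [2]})
dropped = frame.drop(['a'], axis=1)      # returns a new frame; `frame` is unchanged
frame.drop(['a'], axis=1, inplace=True)  # mutates `frame` itself and returns None
```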
This work predicts cognitive impairment, which is defined as an MMSE score below 18, so a function labels whether each respondent is impaired:
```python
def get_CI_label(x, periods):
    if periods == 'current':
        return 0 if x.current_CI_score >= 18 else 1
    if periods == 'three_years_later':
        return 0 if x.later_CI_score >= 18 else 1
```
In the code below, the key line for removing rows (and hence respondents) is `new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True)`:
```python
print('Before drop Depression missing : ', len(new_df))
new_df.drop(questions, axis=1, inplace=True)
new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True)
print('After drop Depression missing : ', len(new_df))
```
The depression score is computed only for respondents who answered every depression question; for everyone else, depression_score is left as NaN, and the dropna call above then removes those respondents.
```python
df['a1'] = df['a1'].apply(lambda x: 0 if x == 2 else 1)
```
Based on the cell value x, a1 (gender) is recoded to 0 for female and 1 for male, presumably the author's convention. The function process_personal_variant_variable(df, start_year) then continues recoding other variables in the same fashion.
```python
def get_training_df(df1, df2, df3, df4):
    drop_list = ['d72', 'd75', 'd82', 'd86', 'd92']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation', 'd71': 'smoke',
                        'd81': 'alcohol', 'd91': 'exercise', 'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird', 'd11d': 'read_news_books',
                        'd11e': 'raise_domestic_animals', 'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio',
                        'd11h': 'social_activity', 'd12': 'tour_times',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer', 'g15j1': 'prostate',
                        'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson', 'g15m1': 'bedsore',
                        'g15n1': 'arthritis', 'g15o1': 'dementia'}
    later_year = [2005, 2008, 2011, 2014]
    for df, year in zip([df1, df2, df3, df4], later_year):
        year_index = '_' + str(year)[-1] if year < 2011 else '_' + str(year)[-2:]
        drop_list_later = list(map(lambda x: x + year_index, drop_list))
        rename_dict = rename_dict_base.copy()
        for key in rename_dict_base.keys():
            later_key = key + year_index
            rename_dict[later_key] = rename_dict[key] + '_3years_later'
        df.drop(drop_list + drop_list_later, axis=1, inplace=True)
        df.rename(columns=rename_dict, inplace=True)
        ind = -2 if year < 2011 else -3
        for col in df.columns:
            if col[ind:] == year_index:
                df.rename(columns={col: col[:ind] + '_3years_later'}, inplace=True)
    return pd.concat([df1, df2, df3, df4], axis=0)
```
Next, construct the CI score change variable; numeric variables are standardized first. Handling class imbalance, a feature-engineering concern, is the key point of this code.
```python
def get_total_df(wave1, wave2, wave3, wave4):
    df1 = wave1.copy()
    df2 = wave2.copy()
    df3 = wave3.copy()
    df4 = wave4.copy()
    total_df = get_training_df(df1, df2, df3, df4)
    # keep only respondents without dementia at baseline and at follow-up
    total_df = total_df[total_df.dementia == 0]
    total_df = total_df[total_df.dementia_3years_later == 0]
    del total_df['dementia']
    del total_df['dementia_3years_later']
    scaler1, scaler2, scaler3, scaler4 = StandardScaler(), StandardScaler(), StandardScaler(), StandardScaler()
    total_df['depression_score'] = scaler1.fit_transform(np.array(total_df['depression_score']).reshape(-1, 1))
    total_df['depression_score_later'] = scaler2.fit_transform(np.array(total_df['depression_score_later']).reshape(-1, 1))
    total_df['tour_times'] = scaler3.fit_transform(np.array(total_df['tour_times']).reshape(-1, 1))
    total_df['tour_times_3years_later'] = scaler4.fit_transform(np.array(total_df['tour_times_3years_later']).reshape(-1, 1))
    total_df['CI_diff'] = total_df['later_CI_score'] - total_df['current_CI_score']
    return total_df
```
The code below reports, within each level of a categorical variable such as gender, the counts and percentages of respondents with and without cognitive impairment (see the Groupby reference). `.iloc[:, 1]` reads the second column of the grouped counts, and the population is classified by later_CI crossed with var.
```python
def get_categorical_describe(total_df, var):
    grouped = total_df.groupby(['later_CI', var]).count().iloc[:, 1]
    not_CI = grouped[list(map(lambda x: True if x[0] == 0 else False, list(grouped.index)))]
    CI = grouped[list(map(lambda x: True if x[0] == 1 else False, list(grouped.index)))]
    res_notCI = [0] * len(total_df[var].unique())
    res_CI = [0] * len(total_df[var].unique())
    for i in range(len(not_CI)):
        res_notCI[i] = str(not_CI.iloc[i]) + '(' + str('%.2f' % (not_CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
        res_CI[i] = str(CI.iloc[i]) + '(' + str('%.2f' % (CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
    df = pd.concat([pd.DataFrame(res_notCI), pd.DataFrame(res_CI)], axis=1)
    df.columns = ['not_CI', 'CI']
    return df
```
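A minimal illustration, on toy data of my own, of the two-level grouping this function relies on:

```python
import pandas as pd

toy = pd.DataFrame({'later_CI': [0, 0, 1, 1, 1],
                    'gender':   [0, 1, 0, 1, 1]})
# MultiIndex Series: (later_CI, gender) -> count of rows in that cell
counts = toy.groupby(['later_CI', 'gender']).size()
```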
For a numeric variable var, compute the Mean(SD) within each later_CI group; the key line is `mean_ = total_df.groupby('later_CI').mean()[var]`.
```python
def get_numerical_describe(total_df, var):
    mean_ = total_df.groupby('later_CI').mean()[var]
    std_ = total_df.groupby('later_CI').std()[var]
    res = [0] * len(mean_)
    for i in range(len(mean_)):
        res[i] = str('%.4f' % mean_[i]) + '(' + str('%.4f' % std_[i]) + ')'
    return pd.DataFrame({'not_CI': res[0], 'CI': res[1]}, index=['Mean(SD)'])
```
By default, pandas keeps its output within the screen: if a table has many rows, it truncates the middle and shows only part of them. The display.max_rows option controls the maximum number of rows shown; for example, to display 200 rows: `pd.set_option('display.max_rows', 200)`.
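For example:

```python
import pandas as pd

pd.set_option('display.max_rows', 200)   # show up to 200 rows before truncating
print(pd.get_option('display.max_rows'))
pd.reset_option('display.max_rows')      # restore the default
```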
```python
def get_var_describe_test(df, var):
    population = list(df.groupby(var).count().iloc[:, 1].sort_index())
    population_rate = list(map(lambda x: str(x) + '(' + str('{:.2f}'.format(x / sum(population) * 100)) + '%' + ')', population))
    value_index = list(df.groupby(var).count().iloc[:, 1].sort_index().index)
    current_CI_count = df.groupby([var, 'current_CI']).count().iloc[:, 1]
    CI_now_index = list(map(lambda x: True if x[1] == 1 else False, current_CI_count.index))
    not_CI_now_index = list(map(lambda x: True if x[1] == 0 else False, current_CI_count.index))
    CI_now_population = list(current_CI_count[CI_now_index])
    not_CI_now_population = list(current_CI_count[not_CI_now_index])
    for i, value in enumerate(not_CI_now_population):
        if value / population[i] == 1:
            CI_now_population.insert(i, 0)
        not_CI_now_population[i] = str(not_CI_now_population[i]) + '(' + str('{:.2f}'.format(not_CI_now_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_now_population):
        CI_now_population[i] = str(CI_now_population[i]) + '(' + str('{:.2f}'.format(CI_now_population[i] / population[i] * 100)) + '%' + ')'
    later_CI_count = df.groupby([var, 'later_CI']).count().iloc[:, 1]
    CI_later_index = list(map(lambda x: True if x[1] == 1 else False, later_CI_count.index))
    not_CI_later_index = list(map(lambda x: True if x[1] == 0 else False, later_CI_count.index))
    CI_later_population = list(later_CI_count[CI_later_index])
    not_CI_later_population = list(later_CI_count[not_CI_later_index])
    for i, value in enumerate(not_CI_later_population):
        if value / population[i] == 1:
            CI_later_population.insert(i, 0)
        not_CI_later_population[i] = str(not_CI_later_population[i]) + '(' + str('{:.2f}'.format(not_CI_later_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_later_population):
        CI_later_population[i] = str(CI_later_population[i]) + '(' + str('{:.2f}'.format(CI_later_population[i] / population[i] * 100)) + '%' + ')'
    return pd.DataFrame({'value': value_index, 'population': population_rate,
                         'CI now': CI_now_population, 'not CI now': not_CI_now_population,
                         'CI 3 years later': CI_later_population, 'not CI 3 years later': not_CI_later_population})
```
At the end of part one, each variable is also put through a hypothesis test. Note that Python's built-in dict values() method returns all values of a dictionary.
```python
def get_pvalues(total_df):
    var_list = ['agegroup', 'nation_group', 'education', 'martial_status', 'economy_status', 'ADL']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation', 'd71': 'smoke',
                        'd81': 'alcohol', 'd91': 'exercise', 'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird', 'd11d': 'read_news_books',
                        'd11e': 'raise_domestic_animals', 'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio',
                        'd11h': 'social_activity',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer', 'g15j1': 'prostate',
                        'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson', 'g15m1': 'bedsore', 'g15n1': 'arthritis'}
    descrptive_vars = var_list + list(rename_dict_base.values())
    numerical_vars = ['depression_score', 'tour_times']
    var_name = []
    pvalues = []
    for var in descrptive_vars:
        kf_data = np.array(total_df.groupby(['later_CI', var]).count().iloc[:, 0]).reshape(2, len(total_df[var].unique()))
        kf = chi2_contingency(kf_data)
        var_name.append(var)
        pvalues.append(kf[1])
    for var in numerical_vars:
        a = total_df[total_df.later_CI == 0][var]
        b = total_df[total_df.later_CI == 1][var]
        t_res = ttest_ind(a, b)
        var_name.append(var)
        pvalues.append(t_res.pvalue)
    df_out = pd.DataFrame({'Variable': var_name, 'p-values': pvalues})
    return df_out
```
The p-values are computed differently for the two kinds of variables: the categorical descrptive_vars use the chi-square test of independence (chi2_contingency), while the numerical_vars use an independent-samples t-test (ttest_ind).
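A self-contained sketch of the two tests on toy data (the numbers are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

# Chi-square independence test: rows = later_CI (0/1), columns = category levels
table = np.array([[30, 10],
                  [20, 20]])
chi2, p_chi, dof, expected = chi2_contingency(table)

# Independent two-sample t-test on a numeric variable split by later_CI
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.5, 3.5, 4.5, 5.5])
t_res = ttest_ind(a, b)
```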
End of part one.
```python
def get_current_diff_variable_list(total_df):
    current_variables = []
    later_variables = []
    for col in total_df.columns:
        if col[-12:] == '3years_later':
            later_variables.append(col)
        if col[-12:] != '3years_later':
            current_variables.append(col)
    for var in ['current_CI_score',
                'later_CI_score',
                'current_CI',
                'later_CI',
                'depression_score_later',
                'ADL_later',
                'CI_diff']:
        current_variables.remove(var)
    later_variables.extend(['ADL_later', 'depression_score_later'])
    person_basic_variables = ['gender', 'residence', 'education', 'borned_prov', 'nation_group']
    current_variables_without_personal = current_variables.copy()
    for var in person_basic_variables:
        current_variables_without_personal.remove(var)
    current_variables_without_personal.sort()
    later_variables.sort()
    print('Creating diff variables based on current variables and 3 years later variables ...')
    for x, y in zip(current_variables_without_personal, later_variables):
        total_df[x + '_diff'] = total_df[y] - total_df[x]
    print('Done!')
    diff_var = list(map(lambda x: x + '_diff', current_variables_without_personal))
    current_variables.remove('dementia')
    diff_var.remove('dementia_diff')
    return current_variables, diff_var
```
Logistic regression is then fitted with `model = sm.Logit(y, x.astype(float)).fit()`. This uses the statsmodels library, which differs from the ones in earlier tutorials and is somewhat more involved; it is worth studying for anyone doing this kind of research. What is fitted here is a multivariable logistic regression.
```python
total_df_for_logit = total_df.copy()
for col in total_df_for_logit.columns:
    total_df_for_logit[col] = pd.to_numeric(total_df_for_logit[col], downcast='float')
```
The logistic regression function itself:
def logistic_regression(input_df, label_column, categorical_variable_list=[], CI_alpha=0.05, standardized_diff=False):
    df = input_df.copy()
    if standardized_diff:
        diff_var = list(filter(lambda x: x[-4:] == 'diff', df.columns))
        if diff_var:
            print('Standardizing diff variables ...')
            for var in diff_var:
                # collapse each diff variable to the sign of the change: -1 / 0 / 1
                df[var] = df[var].apply(lambda x: x if x <= 0 else 1)
                df[var] = df[var].apply(lambda x: x if x >= 0 else -1)
            print('Standardize diff variables done!')
    x_col = [column for column in df.columns
             if label_column not in column and column not in categorical_variable_list]
    y_col = [column for column in df.columns if column == label_column]
    if 'borned_prov' in x_col: x_col.remove('borned_prov')
    if 'agegroup_3years_later' in x_col: x_col.remove('agegroup_3years_later')
    if 'agegroup_diff' in x_col: x_col.remove('agegroup_diff')
    x = np.array(df[x_col])
    if categorical_variable_list:
        # one-hot encode categorical variables, dropping the "_0" level as the reference category
        onehot_cat_var = pd.get_dummies(df[categorical_variable_list[0]], prefix=categorical_variable_list[0])
        drop_ref_col = list(filter(lambda x: x[-2:] == '_0' or x[-4:] == '_0.0', onehot_cat_var.columns))
        onehot_cat_var.drop(drop_ref_col, axis=1, inplace=True)
        for variable in categorical_variable_list[1:]:
            tmp = pd.get_dummies(df[variable], prefix=variable)
            drop_ref_col = list(filter(lambda x: x[-2:] == '_0' or x[-4:] == '_0.0', tmp.columns))
            tmp.drop(drop_ref_col, axis=1, inplace=True)
            onehot_cat_var = pd.concat([onehot_cat_var, tmp], axis=1)
        onehot_cat_var_values = np.array(onehot_cat_var)
        x = np.hstack([x, onehot_cat_var_values])
    y = np.array(df[label_column].astype(float)).reshape(-1)
    model = sm.Logit(y, x.astype(float)).fit()
    # exponentiate the confidence bounds of the coefficients to get CI on the odds-ratio scale
    para = pd.DataFrame(np.exp(model.conf_int(CI_alpha)), columns=['95%lower', '95%upper'])
    para['coef'] = model.params
    para['OddRatio'] = np.exp(para.coef)
    x_name = x_col.copy()
    if categorical_variable_list:
        x_name.extend(onehot_cat_var.columns)
    para['Variable'] = x_name
    para['pvalues'] = model.pvalues
    # count the number of samples behind each (dummy) variable
    var_value_count = []
    for vari in x_name:
        if vari[-1] == '0':
            if vari[-4] == '-':
                var_value = float(vari[-4:])
                var_name = vari[:-5]
                print(vari)
                var_value_count.append(len(df[df[var_name] == var_value]))
            else:
                var_value = float(vari[-3:])
                var_name = vari[:-4]
                var_value_count.append(len(df[df[var_name] == var_value]))
        else:
            var_value_count.append(len(df[df[vari] == 1]))
    para['Num of samples'] = var_value_count
    para = para[['Variable', 'Num of samples', 'coef', 'OddRatio', '95%lower', '95%upper', 'pvalues']]
    # refit on standardized predictors to rank feature importance by |coefficient|
    std_x = np.array((df[x_col] - np.mean(df[x_col])) / np.std(df[x_col]))
    if categorical_variable_list:
        std_x = np.hstack([std_x, onehot_cat_var_values])
    model2 = sm.Logit(y, std_x.astype(float)).fit()
    importance_df = pd.DataFrame(para['Variable'])
    importance_df['coef'] = model2.params
    importance_df.sort_values(by='coef', ascending=False, key=np.abs, inplace=True)
    importance_df['importance'] = range(1, len(importance_df) + 1)
    importance_df.drop('coef', axis=1, inplace=True)
    importance_df.set_index('importance', inplace=True)
    res = {'summary': model.summary(),
           'params': para,
           'feature_importance': importance_df}
    return res
The 95% confidence interval computed in the logistic regression above is used here inside OR(). The OR value, the odds ratio (优势比, also called 比值比), is an important concept in epidemiology: it reflects the strength of the association between an exposure and an outcome (see: logistic regression and OR values). When we speak of studying "the effect of an exposure on an outcome", the "outcome" in the linked example is "whether the subject has diabetes", which generally corresponds to the dependent variable Y mentioned earlier.
def get_OR(x):
    return '%.2f(%.2f,%.2f)' % (x['OddRatio'], x['95%lower'], x['95%upper'])
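This formatter is meant to be applied row-wise to the regression output. A sketch with hypothetical values (the column names match the `para` table built above):

```python
import pandas as pd

# hypothetical rows mimicking the regression output table
para = pd.DataFrame({'OddRatio': [1.2345, 0.876],
                     '95%lower': [1.01, 0.70],
                     '95%upper': [1.51, 1.10]})

def get_OR(x):
    return '%.2f(%.2f,%.2f)' % (x['OddRatio'], x['95%lower'], x['95%upper'])

para['OR(95%CI)'] = para.apply(get_OR, axis=1)  # axis=1: pass each row to get_OR
print(para['OR(95%CI)'].tolist())  # ['1.23(1.01,1.51)', '0.88(0.70,1.10)']
```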
The code also examines the association of the diff variables (rows of output2 whose variable does not appear in output1):
output2.loc[list(set(output2.index) - set(output1.Variable)), :].sort_values(by='Variable')
Filter out the diff variables with p-value below 0.05:
significant_pos_diff = output2.loc[diff_variables_pos, :].query('pvalues<0.05')
significant_pos_diff
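`DataFrame.query` filters rows with a string expression evaluated against the columns. A toy sketch with made-up diff-variable names and p-values:

```python
import pandas as pd

output2 = pd.DataFrame(
    {'pvalues': [0.01, 0.20, 0.03]},
    index=['smoke_diff', 'alcohol_diff', 'exercise_diff'])

significant = output2.query('pvalues < 0.05')
print(list(significant.index))  # ['smoke_diff', 'exercise_diff']
```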
The later parts still need further study.
# note: the original used x[:-4], which leaves a trailing underscore
# ('exercise_diff' -> 'exercise_'), so its membership test for the full
# name 'exercise_diff' could never match; check the truncated name instead
need_divide_diff = list(map(lambda x: x[:-4], list(significant_pos_diff.index)))
if 'exercise_' in need_divide_diff:
    need_divide_diff.remove('exercise_')
Finally, the article compares the results of several algorithms, including SVM, on accuracy and other metrics; it also analyzes the association between longitudinal behavioral changes and cognitive impairment.