Encoding Protein Sequences into Integer Representations Using K-mer Binary Mapping

  • seq="MKTLGEFIVEKQH", k=4.
  • Kmers: ['MKTL', 'KTLG', 'TLGE', 'LGEF', 'GEFI', 'EFIV', 'FIVE', 'IVEK', 'VEKQ', 'EKQH']
  • Encodings: [5057, 15385, 49556, 6470, 37986, 17954, 25124, 8771, 9268, 17226]
  • Check:
    Kmer    K-mer Binary Mapping   Encoding  Check
    "MKTL"  0001 0011 1100 0001    5057      ✔️
    "KTLG"  0011 1100 0001 1001    15385     ✔️
    "TLGE"  1100 0001 1001 0100    49556     ✔️
    "LGEF"  0001 1001 0100 0110    6470      ✔️
    "GEFI"  1001 0100 0110 0010    37986     ✔️
    "EFIV"  0100 0110 0010 0010    17954     ✔️
    "FIVE"  0110 0010 0010 0100    25124     ✔️
    "IVEK"  0010 0010 0100 0011    8771      ✔️
    "VEKQ"  0010 0100 0011 0100    9268      ✔️
    "EKQH"  0100 0011 0100 1010    17226     ✔️
  • Code:
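As a minimal sketch of how the encodings in the check table can be reproduced: the 4-bit residue codes below are inferred from the table rows and cover only the residues that occur in the example sequence, so the dictionary is an assumption and not a complete amino-acid mapping. The names RESIDUE_BITS, kmers, encode_kmer and encode_sequence are hypothetical helpers chosen for illustration.

```python
# 4-bit codes inferred from the check table; only residues present in
# "MKTLGEFIVEKQH" are listed (assumption: the full mapping may differ).
RESIDUE_BITS = {
    "M": 0b0001, "L": 0b0001,
    "I": 0b0010, "V": 0b0010,
    "K": 0b0011,
    "E": 0b0100, "Q": 0b0100,
    "F": 0b0110,
    "G": 0b1001,
    "H": 0b1010,
    "T": 0b1100,
}


def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode_kmer(kmer):
    """Concatenate the 4-bit code of each residue into a single integer."""
    value = 0
    for residue in kmer:
        # Shift the accumulated bits left by 4 and append the next code.
        value = (value << 4) | RESIDUE_BITS[residue]
    return value


def encode_sequence(seq, k):
    """Encode every k-mer of seq as an integer."""
    return [encode_kmer(km) for km in kmers(seq, k)]


# Print each k-mer with its 16-bit pattern and integer value (k = 4).
for km in kmers("MKTLGEFIVEKQH", 4):
    code = encode_kmer(km)
    print(km, format(code, "016b"), code)
```

Because several residues share a code in the table (for example M and L, or I and V), distinct k-mers can map to the same integer under this mapping; the sketch keeps that behavior and does not try to make the encoding unique.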