# Character Encoding
Character encoding is an unavoidable issue in programming. Whether you use Python2, Python3, C++, or Java, I think it is well worth getting the concepts behind character encoding straight. This article covers the following topics:
- Basic concepts
- A brief introduction to common character encodings
- Python's default encoding
- Character types in Python2
- The root cause of UnicodeEncodeError & UnicodeDecodeError
# Basic concepts
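- Character
In computing and telecommunications, **a character is a unit of information; it is the general term for all kinds of letters and symbols**, including the scripts of various countries, punctuation marks, graphic symbols, digits, and so on. For example, a Chinese character, an English letter, and a punctuation mark are each one character.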
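- Character set
**A character set is a collection of characters**. There are many character sets, and each contains a different number of characters. Common ones include the ASCII, GB2312, and Unicode character sets. The ASCII character set has 128 characters in total, made up of printable characters (such as upper- and lower-case English letters, Arabic numerals, and the space) and control characters (such as carriage return and line feed). The GB2312 character set is the Chinese national standard character set for Simplified Chinese and contains simplified Chinese characters, general symbols, digits, and so on. The Unicode character set aims to contain every character used in the world's languages.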
- Character encoding
**Character encoding means mapping each character in a character set to a specific binary number** so that computers can process it. Common character encodings include ASCII, UTF-8, and GBK. In general, **character set** and **character encoding** are often treated as synonyms. For example, ASCII means not only "a collection of characters" but also "an encoding"; in other words, **ASCII denotes both a character set and its corresponding character encoding**.
The table below summarizes these concepts:
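| Concept | Description | Examples |
| --- | --- | --- |
| Character | A unit of information; the general term for all kinds of letters and symbols | '中', 'a', '1', '$', '￥', ... |
| Character set | A collection of characters | ASCII character set, GB2312 character set, Unicode character set |
| Character encoding | Maps the characters in a character set to specific binary numbers | ASCII encoding, GB2312 encoding, UTF-8 encoding |
| Byte | The unit of data storage in a computer; an 8-bit binary number | 0x01, 0x45, ... |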
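
The distinction between a character, its number in a character set, and its encoded bytes can be sketched in a few lines. This is a minimal illustration that assumes Python3 (the transcripts in this article use Python2 elsewhere); the variable names are only for illustration.

```python
# One character, its code point in the Unicode character set,
# and the bytes produced by two different character encodings.
ch = "中"

code_point = ord(ch)             # number assigned to the character by Unicode
utf8_bytes = ch.encode("utf-8")  # the same character encoded as UTF-8
gbk_bytes = ch.encode("gbk")     # the same character encoded as GBK

print(hex(code_point))    # 0x4e2d
print(utf8_bytes.hex())   # e4b8ad
print(gbk_bytes.hex())    # d6d0
```

The character is the same in all three lines; only the byte representation depends on the encoding chosen.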
# A brief introduction to common character encodings
Commonly encountered character sets and encodings include ASCII, GBK, Unicode, and UTF-8. Here we mainly introduce ASCII, Unicode, and UTF-8.
## ASCII
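Computers were born in the United States, where English is used, and in the English-speaking world text is nothing more than combinations of English letters, digits, and some common symbols.
In the 1960s, the United States established a character encoding scheme that defined the mapping between English letters, digits, and some common symbols and binary numbers. It is called ASCII (American Standard Code for Information Interchange).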
For example, the upper-case letter A is 01000001 in binary (decimal 65), the lower-case letter a is 01100001 (decimal 97), and the space character (SPACE) is 00100000 (decimal 32).
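
These values are easy to check. The snippet below is a small sketch, assuming Python3; `ascii_bits` is a hypothetical helper name introduced only for this example.

```python
def ascii_bits(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    return format(ord(ch), "08b")

for ch in ["A", "a", " "]:
    print(repr(ch), ord(ch), ascii_bits(ch))
# 'A' 65 01000001
# 'a' 97 01100001
# ' ' 32 00100000
```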
## Unicode
ASCII defines encodings for only 128 characters, which was enough in the United States. But computers later spread to Europe, Asia, and the rest of the world, and the world's languages differ enormously, so ASCII is far from sufficient to represent them. As a result, different countries and regions created their own encoding schemes, such as GB2312 and GBK in mainland China and Shift_JIS in Japan.
Although each country and region can define its own encoding scheme, computers in different countries and regions then produce all sorts of garbled text (mojibake) when they exchange data, which is nothing short of a disaster.
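
Mojibake is simply what happens when bytes written with one encoding are read with another. The sketch below assumes Python3 and picks GBK and Latin-1 as an arbitrary pair of mismatched encodings.

```python
original = "中文"

gbk_bytes = original.encode("gbk")     # the sender writes GBK bytes
garbled = gbk_bytes.decode("latin-1")  # the receiver wrongly assumes Latin-1

print(garbled)                         # ÖÐÎÄ

# The bytes themselves are intact, so reversing the wrong step
# and decoding with the right encoding recovers the text.
recovered = garbled.encode("latin-1").decode("gbk")
print(recovered == original)           # True
```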
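What can be done? The idea is simple: unify all the world's languages under a single encoding scheme. That scheme is called Unicode. **It assigns a unique binary code to every character of every language**, which makes it possible to process text across languages and platforms. Pretty neat, right?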
Unicode 1.0 was released in October 1991, and the standard has been revised continually ever since, with each new version adding more characters. At the time of writing, the latest version is 9.0.0, published on June 21, 2016.
The Unicode standard writes codes as hexadecimal numbers prefixed with `U+`. For example, the Unicode code of the upper-case letter "A" is `U+0041`, and that of the Chinese character "严" is `U+4E25`. For more character tables, see [unicode.org](http://www.unicode.org/) or a dedicated [Chinese character table](http://www.chi2ko.com/tool/CJK.htm).
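
The `U+` notation can be produced and reversed with built-in functions. This is a small sketch assuming Python3; `u_plus` is a hypothetical helper name.

```python
def u_plus(ch):
    """Format a character's Unicode code in U+XXXX notation."""
    return "U+{:04X}".format(ord(ch))

print(u_plus("A"))    # U+0041
print(u_plus("严"))   # U+4E25

# chr() goes the other way, from the number back to the character.
print(chr(0x4E25) == "严")   # True
```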
## UTF-8
Unicode looks perfect and unifies everything. However, it has one big problem: wasted resources.
Why? In order to represent every script in the world, Unicode started out with two bytes per character; when two bytes turned out not to be enough, it moved to as many as four. For example, the Unicode code of the Chinese character "严" is hexadecimal `4E25`, which is 15 bits in binary (100111000100101), so at least two bytes are needed to represent it, while other characters may need three or four bytes.
Here is the problem: if the old ASCII characters were represented the same way, a great deal of storage would be wasted. For example, the upper-case letter "A" is 01000001 in binary and needs only one byte. If Unicode uniformly used three or four bytes for every character, the leading bytes of "A" would all be `0`, which wastes storage.
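
The waste is easy to see with a fixed four-byte encoding. The sketch below assumes Python3 and uses the `utf-32-be` codec (big-endian, no byte order mark) as the fixed-width representation.

```python
fixed = "A".encode("utf-32-be")   # always four bytes per character
print(fixed)                      # b'\x00\x00\x00A'

# For plain English text the fixed-width form is four times larger.
text = "hello"
fixed_size = len(text.encode("utf-32-be"))
ascii_size = len(text.encode("ascii"))
print(fixed_size, ascii_size)     # 20 5
```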
To solve this problem, UTF-16, UTF-32, and UTF-8 were built on top of Unicode. Only UTF-8 is discussed below.
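UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes per character. For example, ASCII characters are still encoded in one byte, Arabic, Greek, and similar scripts use two bytes, common Chinese characters use three bytes, and so on.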
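
To make the variable length concrete, here is a hand-written sketch of the encoding rule, assuming Python3. The bit patterns in the comments come from the UTF-8 definition (RFC 3629), not from this article, and `utf8_encode` is a hypothetical name; real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 (sketch; no validity checks)."""
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ["A", "α", "ع", "严", "😀"]:
    encoded = utf8_encode(ord(ch))
    # The built-in codec should agree with the hand-written version.
    print(len(encoded), encoded.hex(), encoded == ch.encode("utf-8"))
# 1 41 True
# 2 ceb1 True
# 2 d8b9 True
# 3 e4b8a5 True
# 4 f09f9880 True
```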
We therefore say that **UTF-8 is one of the implementations of Unicode**. Other implementations include UTF-16 (a character takes two or four bytes) and UTF-32 (a character takes four bytes).
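
The byte counts of the three implementations can be compared directly. This sketch assumes Python3 and uses the little-endian codec variants so that no byte order mark is added to the output; `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

for ch in ["A", "α", "严", "😀"]:
    print(encoded_sizes(ch))
# {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```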
# Python's default encoding
The default encoding of Python2 is ascii, and that of Python3 is utf-8. You can check it as follows:
- Python2
```python
Python 2.7.11 (default, Feb 24 2016, 10:48:05)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
```
- Python3
```python
Python 3.5.2 (default, Jun 29 2016, 13:43:58)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
```
# Character types in Python2
Python2 has two string-related types, str and unicode, whose common parent class is basestring. A str string can be in any of several encodings (ascii by default, but also gbk, utf-8, and so on), while a unicode string is written in the form `u'...'`. The figure below shows the relationship between str and unicode:

Conversion between the two kinds of strings can be summarized as follows:
- To convert a UTF-8 encoded string 'xxx' to the Unicode string u'xxx', use the `decode('utf-8')` method:
```python
>>> '中文'.decode('utf-8')
u'\u4e2d\u6587'
```
- To convert u'xxx' to the UTF-8 encoded 'xxx', use the `encode('utf-8')` method:
```python
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
```
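
For comparison, here is a sketch of the same round trip in Python3. It assumes the usual Python3 naming, where the text type is `str` and the byte type is `bytes`, so `encode` lives on `str` and `decode` lives on `bytes`.

```python
text = "中文"                  # str: Unicode text in Python3

data = text.encode("utf-8")   # str -> bytes
print(data)                   # b'\xe4\xb8\xad\xe6\x96\x87'

restored = data.decode("utf-8")   # bytes -> str
print(restored == text)           # True
```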
# The root cause of UnicodeEncodeError & UnicodeDecodeError
When writing programs in Python2 you will often run into UnicodeEncodeError and UnicodeDecodeError. Their root cause is this: **if code mixes str and unicode strings, Python implicitly uses the ascii encoding to try to encode the unicode string or to decode the str string, and that is when these errors are likely to occur**.
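
Both errors can be reproduced by making the ascii conversion explicit. The sketch below assumes Python3, which does not perform the implicit conversion itself; the explicit `decode("ascii")` and `encode("ascii")` calls stand in for what Python2 attempts behind the scenes.

```python
utf8_bytes = "你好".encode("utf-8")   # bytes outside the ASCII range

decode_error = None
try:
    utf8_bytes.decode("ascii")        # decoding non-ASCII bytes as ascii
except UnicodeDecodeError as exc:
    decode_error = exc

encode_error = None
try:
    "世界".encode("ascii")            # encoding non-ASCII text as ascii
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(decode_error).__name__)    # UnicodeDecodeError
print(type(encode_error).__name__)    # UnicodeEncodeError
```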
Here are two common scenarios that are worth committing to memory:
- When an operation involves both str and unicode strings, Python2 always decodes the str to unicode first and then performs the operation, which easily raises UnicodeDecodeError.
Let's look at an example:
```python
>>> s = '你好' # str type, UTF-8 encoded
>>> u = u'世界' # unicode type
>>> s + u # an implicit conversion happens, i.e. s.decode('ascii') + u
Traceback (most recent call last):
File "