Turkish characters in Python

user3303613 · Jun 4, 2014

I am playing around with the Twitter API, but I have several questions regarding the encoding of Turkish characters. Here is the code I'm working with:

# -*- coding: cp1254 -*-
import sys
import csv
import tweepy
import locale
import string
locale.setlocale(locale.LC_ALL, "Turkish")

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

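f = open("tweets.csv", "wb")
wr = csv.writer(f, quoting=csv.QUOTE_ALL)  # create the writer once, not once per query
for q in [list of search queries]:
    # Up to 20 recent Turkish-language tweets for this query, as UTF-8 byte strings
    a = [tweet.text.encode("utf-8") for tweet in tweepy.Cursor(api.search, q, result_type="recent", include_entities=True, lang="tr").items(20)]
    wr.writerow(a)  # write the tweets, not the query string
f.close()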

Basically, I run the Search API over a list of search queries and write the resulting tweets to a CSV file that I open in Excel. However, no matter what I do, the Turkish characters in the tweets come out replaced by other characters. I've tried several things (setting the locale, adding the .encode("utf-8") part, etc.), but I still don't know how to fix it.

Here is what I am talking about:

what is written: DÃ¼n akÅŸam Ãœlker Arena

what I want it to write: Dün akşam Ülker Arena

What I don't understand is that ü, Ü and ş are all regular letters once I set the locale to Turkish, yet they still end up replaced in the output.

Answer

Sabuncu · Jun 6, 2014

I duplicated your code on my system (Windows 7, with Office 2010) and I got it working. I used your code but I simplified the search query as follows:

search_results = api.search(q="canan1405", count=10)
for tweet in search_results:
    print tweet.text.encode('utf-8')

I pulled tweets from the 'canan1405' user as they contained Turkish characters. (Hope she doesn't mind!)

I simply redirected the output of my script to a file, as follows:

python so_24038317.py > tweets.csv

At this point, the tweets.csv file contains Unicode characters encoded as UTF-8, so the file itself is fine. If I double-click on the file as you did, Excel apparently decodes it with its default (non-UTF-8) encoding and displays garbage characters, much like in your case.
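To illustrate why correctly encoded text can still look like garbage, here is a minimal Python 3 sketch (the scripts in this thread are Python 2). It assumes the viewer decodes the UTF-8 bytes as Windows-1254, the Turkish ANSI code page; the code page actually used on a given system may differ.

```python
# Python 3 sketch: reproduce the garbling that appears when UTF-8 bytes
# are decoded with a single-byte code page (assumed here: cp1254).
text = "Dün akşam Ülker Arena"

utf8_bytes = text.encode("utf-8")        # what the script writes to the file
garbled = utf8_bytes.decode("cp1254")    # what a cp1254 reader displays
print(garbled)                           # prints: DÃ¼n akÅŸam Ãœlker Arena

# The bytes were never damaged: reversing the wrong decoding restores the text.
recovered = garbled.encode("cp1254").decode("utf-8")
print(recovered == text)                 # prints: True
```

Each Turkish letter occupies two bytes in UTF-8, which is why every one of them shows up as a pair of unrelated characters.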

Instead of double-clicking on the csv file, use the following steps to import the file:

  1. Start Excel.
  2. Click the "Data" tab on the ribbon.
  3. Click the "From Text" icon in the "Get External Data" group.
  4. Locate the CSV file and click the "Import" button.
  5. A wizard will be displayed. In my case, it guessed the file's encoding correctly (see the "File origin:" drop-down).

You can complete the rest of the wizard's steps, but they are optional. The file then displayed correctly.
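If you would rather keep double-clicking the file, a commonly used alternative (not part of the steps above, and untested on this exact Excel version) is to write the CSV with a UTF-8 byte order mark, which Excel generally uses to detect the encoding. The following is a minimal Python 3 sketch; the sample rows and the output path are hypothetical stand-ins for real tweet texts and your own file name.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for tweet texts returned by the API.
rows = [["Dün akşam Ülker Arena"], ["Oyy şirin kedi"]]

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "tweets_bom.csv")

# "utf-8-sig" writes the BOM bytes EF BB BF before the data;
# newline="" lets the csv module control the line endings itself.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)

# Read the raw bytes back to confirm the BOM is in place.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3] == b"\xef\xbb\xbf")
```

The import wizard remains the safer route when you cannot control how the file was written.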

As far as I can tell, it contains (and correctly displays) the following Turkish characters:

ş, Ğ, İ, ğ, ı, ç
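A quick way to check this without relying on Excel is to scan the decoded text for Turkish-specific letters. The sketch below is Python 3; the helper name turkish_letters_in and the sample string are illustrative assumptions, not something from the original script.

```python
# Letters of the Turkish alphabet that fall outside plain ASCII.
TURKISH_LETTERS = set("çÇğĞıİöÖşŞüÜ")

def turkish_letters_in(text):
    """Return the Turkish-specific letters found in text, sorted by code point."""
    return sorted(set(text) & TURKISH_LETTERS)

# Illustrative sample; in practice this would be the decoded file contents.
sample = "Dün akşam Ülker Arena"
found = turkish_letters_in(sample)
print(found)  # prints: ['Ü', 'ü', 'ş']
```

If the text had been decoded with the wrong encoding, the result would typically be empty or contain unexpected letters.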

Note that the character immediately after the string "Oyy şirin kedi" is an emoticon, not a Turkish letter, so it may not display properly even when the encoding is correct. Hope this helps.