I am playing around with the Twitter API, but I have several questions regarding the encoding of Turkish characters. Here is the code I'm working with:
# -*- coding: cp1254 -*-
import sys
import csv
import tweepy
import locale
import string
locale.setlocale(locale.LC_ALL, "Turkish")
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
f=open("tweets.csv", "wb")
for q in [list of search queries]:
    a=[tweet.text.encode("utf-8") for tweet in tweepy.Cursor(api.search, q, result_type="recent", include_entities=True, lang="tr").items(20)]
    wr=csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow(a)
Basically, I run the Search API over a list of search queries and write the resulting tweets into a CSV file that I open in Excel. However, no matter what I do, the Turkish characters in the written tweets are replaced with other characters. I've tried several things (setting the locale, adding the .encode("utf-8") part, etc.), but I still don't know how to fix it.
Here is what I am talking about:
what is written: Dün akşam Ülker Arena
what I want it to write: Dün akşam Ülker Arena
What I don't understand is that ü, Ü and ş are all included in the locale's letters when I set the locale to Turkish, yet Python still substitutes them.
I reproduced your code on my system (Windows 7, with Office 2010) and got it working. I kept your code but simplified the search query as follows:
search_results = api.search(q="canan1405", count=10)
for tweet in search_results:
    print tweet.text.encode('utf-8')
I pulled tweets from the 'canan1405' user as they contained Turkish characters. (Hope she doesn't mind!)
I simply redirected the output of my script to a file, as follows:
python so_24038317.py > tweets.csv
At this point, the tweets.csv file contains Unicode characters encoded as UTF-8. If I double-click on the file as you did, the default Excel display shows garbage characters much like in your case.
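The garbage is what UTF-8 bytes look like when they are decoded with a single-byte Windows code page. Here is a minimal Python 3 sketch of that effect. It assumes Excel falls back to the Turkish ANSI code page (cp1254) on a Turkish-locale system, which I have not verified for every setup; the sample string is the one from the question.

```python
# Sketch (Python 3): reproduce the garbling by decoding UTF-8 bytes as cp1254.
# Assumption: the reader (Excel here) uses the system ANSI code page, cp1254.

original = "Dün akşam Ülker Arena"

utf8_bytes = original.encode("utf-8")   # the bytes the script wrote to tweets.csv
garbled = utf8_bytes.decode("cp1254")   # what a cp1254 reader would display

print(garbled)

# The bytes themselves are intact: reading them back as UTF-8 restores the text.
restored = garbled.encode("cp1254").decode("utf-8")
print(restored == original)
```

In other words, the file is not damaged and neither the locale setting nor the .encode("utf-8") call is at fault; only the encoding chosen when the file is opened is wrong.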
Instead of double-clicking on the csv file, import it through Excel's Text Import Wizard (Data tab, From Text) and choose 65001 : Unicode (UTF-8) as the file origin in the first step.
You can complete the rest of the steps for the wizard but they are optional. The file then displayed correctly.
As far as I can tell, it contains (and correctly displays) the following Turkish characters:
ş, Ğ, İ, ğ, ı, ç
Note that the character immediately after the string "Oyy şirin kedi" is an emoticon that Excel may not be able to render, not a corrupted Turkish character. Hope this helps.
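If going through the import wizard every time is inconvenient, another commonly used workaround is to write the file with a UTF-8 byte order mark, which Excel generally uses to detect the encoding when the file is double-clicked. This is a sketch under assumptions, not what the script above does: it targets Python 3 (where the csv module works with text directly), and write_tweets, the output file name and the sample rows are hypothetical stand-ins for the real search results.

```python
import csv


def write_tweets(path, rows):
    """Write rows of text to a CSV file encoded as UTF-8 with a BOM."""
    # "utf-8-sig" prepends the byte order mark EF BB BF to the file.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the tweet.text values.
    sample = [["Dün akşam Ülker Arena"], ["Oyy şirin kedi"]]
    write_tweets("tweets_bom.csv", sample)
```

Note that the text is passed to the writer as str, without a manual .encode("utf-8"); the encoding is applied once, by open().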