Turkish characters in python

Question 1

python twitter tweepy turkish

user3303613 · Jun 4, 2014 · Viewed 7.2k times · Source

Answer

I reproduced your code on my system (Windows 7, with Office 2010) and got it working. I kept your code but simplified the search query as follows:

search_results = api.search(q="canan1405", count=10)
for tweet in search_results:
    print tweet.text.encode('utf-8')

I pulled tweets from the 'canan1405' user as they contained Turkish characters. (Hope she doesn't mind!)

I simply redirected the output of my script to a file, as follows:

python so_24038317.py > tweets.csv

At this point, the tweets.csv file contains Unicode characters encoded as UTF-8. If I double-click on the file as you did, Excel does not treat it as UTF-8, and the default display shows garbage characters much like in your case.
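To see where the garbage comes from, here is a minimal sketch in Python 3 syntax (the code above is Python 2). It assumes Excel falls back to the Windows Turkish code page, cp1254, when it opens a CSV by double-click; that choice of code page is an assumption about the system, not something confirmed above.

```python
# The text the tweet actually contains.
original = "Dün akşam Ülker Arena"

# What ends up in tweets.csv: the same text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Assumption: the bytes are read back with the legacy code page cp1254,
# so each two-byte UTF-8 sequence turns into two unrelated characters.
misread = utf8_bytes.decode("cp1254")
print(misread)

# Reading the same bytes as UTF-8 recovers the original text.
recovered = utf8_bytes.decode("utf-8")
print(recovered)
```

The bytes in the file are fine; only the encoding used to read them is wrong, which is why the fix is on the Excel side.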

Instead of double-clicking on the csv file, use the following steps to import the file:

Start Excel.
Click the "Data" tab on the ribbon.
Click the "From Text" icon in the "Get External Data" group.
Locate the CSV file and click the "Import" button.
A wizard will be displayed. In my case, it guessed the file encoding correctly; check that the "File origin:" drop-down shows a UTF-8 entry.

The remaining wizard steps are optional. After the import, the file displayed correctly.

As far as I can tell, it contains (and correctly displays) the following Turkish characters:

ş, Ğ, İ, ğ, ı, ç

Note that the character immediately after the string "Oyy şirin kedi" is an emoticon rather than a Turkish letter; it is still valid UTF-8, but Excel may not have a glyph to display it. Hope this helps.
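If going through the import wizard every time is inconvenient, a commonly used alternative is to write the file with a UTF-8 byte order mark, which Excel generally uses to detect the encoding on double-click. The following is a minimal sketch in Python 3 syntax, not a tested recipe for Excel 2010; the write_tweets helper and the sample rows are hypothetical and stand in for the tweet texts collected by the script.

```python
import csv


def write_tweets(path, tweets):
    """Write one tweet per row to a CSV file that starts with a UTF-8 BOM."""
    # "utf-8-sig" prepends the bytes EF BB BF before the first character.
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for text in tweets:
            # Wrap the string in a list so it becomes a single cell;
            # passing a bare string would write one cell per character.
            writer.writerow([text])


# Hypothetical sample data in place of real search results.
write_tweets("tweets.csv", ["Dün akşam Ülker Arena", "Oyy şirin kedi"])
```

In Python 3 the tweet text stays a str and the file object does the encoding, so the explicit .encode('utf-8') call is not needed.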

Question 2

I am playing around with the Twitter API, but I have several questions regarding the encoding of Turkish characters. Here is the code I'm working with:

# -*- coding: cp1254 -*-
import sys
import csv
import tweepy
import locale
import string
locale.setlocale(locale.LC_ALL, "Turkish")

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

f=open("tweets.csv", "wb")
for q in [list of search queries]:

     a=[tweet.text.encode("utf-8") for tweet in tweepy.Cursor(api.search, q, result_type="recent", include_entities=True, lang="tr").items(20)]
     wr=csv.writer(f, quoting=csv.QUOTE_ALL)
     wr.writerow(q)

Basically, what I'm doing is running the search API by iterating through a list of search queries and then writing the tweets into a CSV file that I open in Excel. However, no matter what I do, the Turkish characters in the tweets come out replaced with other characters. I've tried several things (setting the locale, adding the .encode("utf-8") part, etc.), but I still don't know how to fix it.

Here is what I am talking about:

what is written: DÃ¼n akÅŸam Ãœlker Arena

what I want it to write: Dün akşam Ülker Arena

What I don't understand is that ü, Ü and ş are all among the locale's letters once I set the locale to Turkish, yet Python still substitutes them.
