Thursday, December 24, 2015

Hello Twitter: Use Java to get sample data from Twitter

One thought that often strikes me when I try to build a POC for myself is where to get some web data for my tests. For a while I was lazy and just downloaded a few hundred rows from here and there, but then I decided to end the quest. The following steps will help you get sample Twitter data if you are looking for some.

Twitter provides APIs to download data from a stream that Twitter refers to as the public stream. There are other means, such as the firehose, that we are not going to look at.

Getting data from Twitter is a two-step process:

  • Get authorization tokens from Twitter to access Twitter data
  • Use a programming or scripting language to authorize and read streaming data from Twitter
Get Authorization Tokens from Twitter
We need four things:
  • Consumer Key
  • Consumer Secret
  • Access Token
  • Access Token Secret
Head to https://apps.twitter.com
Create an account if you don't have a Twitter account; otherwise log in.
Create an application, provide the details, and give a dummy site if asked.
Head to the "Keys and access tokens" tab. You will see the consumer key and consumer secret. You may have to click a button to generate and see the Access Token and Access Token Secret.
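
The program later in this post keeps the four values in constants. If you would rather keep them out of the source file, one option is a small properties file. The following is only a sketch under assumptions: the class name TwitterCredentials and the property names (consumer.key, consumer.secret, access.token, access.token.secret) are made up for illustration and are not part of any Twitter or Scribe API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the four values copied from the "Keys and access tokens" tab. */
final class TwitterCredentials {
    final String consumerKey;
    final String consumerSecret;
    final String accessToken;
    final String accessTokenSecret;

    private TwitterCredentials(String consumerKey, String consumerSecret,
                               String accessToken, String accessTokenSecret) {
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
        this.accessToken = accessToken;
        this.accessTokenSecret = accessTokenSecret;
    }

    /** Reads the four values from properties text; the property names are assumptions. */
    static TwitterCredentials load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new TwitterCredentials(
                require(props, "consumer.key"),
                require(props, "consumer.secret"),
                require(props, "access.token"),
                require(props, "access.token.secret"));
    }

    /** Fails early with a clear message when a value is missing or blank. */
    private static String require(Properties props, String name) {
        String value = props.getProperty(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing property: " + name);
        }
        return value.trim();
    }
}
```

You could call TwitterCredentials.load(new FileReader("twitter.properties")) at startup (the file name is again just an example) and pass the fields wherever the constants are used.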

Write a Java program to use the authorization tokens and get the data

POM dependencies required

<dependency>
    <groupId>org.scribe</groupId>
    <artifactId>scribe</artifactId>
    <version>1.3.7</version>
</dependency>
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20151123</version>
</dependency>

TwitterStreamReader.java

package twitter.data.stream;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Calendar;

import org.scribe.builder.*;
import org.scribe.builder.api.*;
import org.scribe.model.*;
import org.scribe.oauth.*;

import org.json.*;

public class TwitterStreamReader {

    private static String STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json";
    private static String CONSUMER_KEY="<your consumer key>";
    private static String CONSUMER_SECRET="<your consumer secret>";
    private static String ACCESS_TOKEN="<your access token>";
    private static String ACCESS_TOKEN_SECRET="<your access token secret>";
    private static String TRACK_KEYWORD= "NEWS USA";
    private static String OUTPUT_FOLDER="/Users/devashis/TwitterData/"; //Give full path
    private static int    NUM_OF_ROWS=10; //Set -1 to run forever
public void getData() {
    try {
        //Create the OAuth service
        OAuthService service = new ServiceBuilder()
                .provider(TwitterApi.class)
                .apiKey(CONSUMER_KEY)
                .apiSecret(CONSUMER_SECRET)
                .build();

        //Create the token
        Token twitterAccessToken = new Token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET);

        //Create the request
        OAuthRequest request = new OAuthRequest(Verb.POST, STREAM_URL);
        request.addHeader("version", "HTTP/1.1");
        request.addHeader("host", "stream.twitter.com");
        request.addHeader("user-agent", "Twitter Stream Reader");
        request.addBodyParameter("track", TRACK_KEYWORD);
        request.setConnectionKeepAlive(true);

        //Sign and send the request
        service.signRequest(twitterAccessToken, request);
        Response response = request.send();

        // Create a reader to read Twitter's stream
        BufferedReader reader = new BufferedReader(new InputStreamReader(response.getStream(), "UTF-8")); //Read the stream as UTF-8 rather than the platform default

        //Change the delimiter below if you want a different delimiter
        String delimiter = "|^|";

        //Prepare the file for writing to
        SimpleDateFormat dt = new SimpleDateFormat("yyyyMMddHHmmss"); //HH = 24-hour clock, so AM and PM runs get different file names
        System.out.println(OUTPUT_FOLDER+ dt.format(Calendar.getInstance().getTime())+".txt");
        File file = new File(OUTPUT_FOLDER+ dt.format(Calendar.getInstance().getTime())+".txt");
        file.createNewFile();
        FileWriter fw = new FileWriter(file.getAbsoluteFile());
        BufferedWriter bw = new BufferedWriter(fw);


        String tweet;
        int counter=0;
        while (((tweet = reader.readLine()) != null) && (counter != NUM_OF_ROWS)) {
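            //Skip the blank keep-alive lines that the stream sends between tweets
            if (tweet.trim().isEmpty()) continue;
            //Uncomment the following line if you want the entire JSON and decide on what data you want
            //System.out.println(tweet);
            //Some JSON parsing to get the data required
            JSONObject obj = new JSONObject(tweet);
            //Skip messages that are not tweets (for example limit notices)
            if (!obj.has("created_at")) continue;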
            String createdAt = obj.getString("created_at");
            String id = obj.getString("id_str");
            String tweetText = obj.getString("text").replaceAll("[\\t\\n\\r]"," ");
            String userId = obj.getJSONObject("user").getString("id_str");
            String screenName = obj.getJSONObject("user").getString("screen_name");
            int followerCount = obj.getJSONObject("user").getInt("followers_count");
            String outRow =createdAt+delimiter+id+delimiter+tweetText+delimiter+
                    userId+delimiter+screenName+delimiter+followerCount;
            bw.write(outRow);
            bw.newLine();
            System.out.println(outRow);
            counter++;
        }

        bw.close();

    } catch (Exception e) {
        e.printStackTrace();
    }
}
}
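
Each row written by the loop above joins six fields with the |^| delimiter. If you later read the file back in Java, keep in mind that both | and ^ are special characters in regular expressions, so a plain split("|^|") will not do what you expect. Below is a minimal sketch of one way to split a row safely; the class name TweetRowParser is made up for illustration, and it assumes the tweet text itself never contains the delimiter.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for reading back rows written with the "|^|" delimiter. */
final class TweetRowParser {

    // '|' and '^' are regex metacharacters, so the delimiter is quoted literally.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote("|^|"));

    /** Splits one output row into its six fields, in the order they were written. */
    static String[] parse(String row) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = DELIMITER.split(row, -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException(
                    "Expected 6 fields but found " + fields.length);
        }
        return fields;
    }
}
```

The returned array holds createdAt, id, tweetText, userId, screenName and followerCount, in that order.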
To play with JSON data directly in Pig and Hive, I later changed the code above to write each tweet's raw JSON to the file (instead of parsing it), so that Pig and Hive can consume the JSON files directly.
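
The modified program is not shown in this post, so the following is only a sketch of what the raw-JSON variant of the loop could look like, not the exact code I used. The class name RawJsonCopier and the method copyLines are made up for illustration. It copies one JSON document per line, skips blank keep-alive lines, and treats a negative row limit as "run forever", like NUM_OF_ROWS above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical helper: writes the raw JSON of each tweet to the output, one per line. */
final class RawJsonCopier {

    /**
     * Copies up to maxRows non-blank lines from in to out and returns how many were written.
     * A negative maxRows means no limit.
     */
    static int copyLines(Reader in, Writer out, int maxRows) {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        int written = 0;
        try {
            String line;
            // Check the limit first so no extra line is consumed once it is reached.
            while ((maxRows < 0 || written < maxRows)
                    && (line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // blank keep-alive line, not a tweet
                }
                writer.write(line);
                writer.newLine();
                written++;
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

In getData() this would replace the parsing loop: pass an InputStreamReader over response.getStream() as the reader, a FileWriter for the output file as the writer, and NUM_OF_ROWS as the limit.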



