One thought that often strikes me when I try to do a POC for myself is: where do I get some web data for my tests? For a while I lazily downloaded a few hundred rows from here and there, but then I decided to end the quest. The following steps should help you get sample Twitter data if you are looking for some.
Twitter provides APIs to download data from a data stream that Twitter refers to as the public stream. There are other means, such as the firehose, that we are not going to look at.
Getting data from Twitter is a two-step process:
- Get authorization tokens from Twitter to access Twitter data
- Use a programming or scripting language to authorize and read streaming data from Twitter
Get Authorization Tokens from Twitter
We need four things:
- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret
Head to https://apps.twitter.com
Create an account if you don't have a Twitter account; otherwise, log in.
Create an application, provide the details, and give a dummy website URL if asked.
Head to the "Keys and access tokens" tab. You will see the Consumer Key and Consumer Secret. You may have to click a button to generate and see the Access Token and Access Token Secret.
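Once you have the four values, one option is to keep them out of the source file. The following is only a minimal sketch of that idea, not part of the original program: the class name TwitterCredentials and the environment variable names (TWITTER_CONSUMER_KEY and so on) are made up for illustration, and you can name them however you like.

```java
import java.util.Map;

/** Holds the four OAuth values from the "Keys and access tokens" tab. (Hypothetical helper.) */
final class TwitterCredentials {
    final String consumerKey;
    final String consumerSecret;
    final String accessToken;
    final String accessTokenSecret;

    private TwitterCredentials(String consumerKey, String consumerSecret,
                               String accessToken, String accessTokenSecret) {
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
        this.accessToken = accessToken;
        this.accessTokenSecret = accessTokenSecret;
    }

    /** Builds the credentials from a map such as System.getenv(). Variable names are assumptions. */
    static TwitterCredentials fromMap(Map<String, String> env) {
        return new TwitterCredentials(
                require(env, "TWITTER_CONSUMER_KEY"),
                require(env, "TWITTER_CONSUMER_SECRET"),
                require(env, "TWITTER_ACCESS_TOKEN"),
                require(env, "TWITTER_ACCESS_TOKEN_SECRET"));
    }

    /** Fails early with a readable message if a value is missing or blank. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing credential: " + name);
        }
        return value.trim();
    }
}
```

With the variables exported in your shell, TwitterCredentials.fromMap(System.getenv()) would give you an object whose four fields can be passed wherever the keys and tokens are needed.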
Write a Java program to use the authorization tokens and get the data
POM dependencies required
<dependency>
    <groupId>org.scribe</groupId>
    <artifactId>scribe</artifactId>
    <version>1.3.7</version>
</dependency>
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20151123</version>
</dependency>
TwitterStreamReader.java
package twitter.data.stream;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Calendar;

import org.scribe.builder.*;
import org.scribe.builder.api.*;
import org.scribe.model.*;
import org.scribe.oauth.*;
import org.json.*;

public class TwitterStreamReader {
    private static String STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json";
    private static String CONSUMER_KEY = "<your consumer key>";
    private static String CONSUMER_SECRET = "<your consumer secret>";
    private static String ACCESS_TOKEN = "<your access token>";
    private static String ACCESS_TOKEN_SECRET = "<your access token secret>";
    private static String TRACK_KEYWORD = "NEWS USA";
    private static String OUTPUT_FOLDER = "/Users/devashis/TwitterData/"; // Give full path
    private static int NUM_OF_ROWS = 10; // Set -1 to run forever

    public static void main(String[] args) {
        new TwitterStreamReader().getData();
    }

    public void getData() {
        try {
            // Create the OAuth service
            OAuthService service = new ServiceBuilder()
                    .provider(TwitterApi.class)
                    .apiKey(CONSUMER_KEY)
                    .apiSecret(CONSUMER_SECRET)
                    .build();

            // Create the token
            Token twitterAccessToken = new Token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET);

            // Create the request
            OAuthRequest request = new OAuthRequest(Verb.POST, STREAM_URL);
            request.addHeader("version", "HTTP/1.1");
            request.addHeader("host", "stream.twitter.com");
            request.addHeader("user-agent", "Twitter Stream Reader");
            request.addBodyParameter("track", TRACK_KEYWORD);
            request.setConnectionKeepAlive(true);

            // Sign and send the request
            service.signRequest(twitterAccessToken, request);
            Response response = request.send();

            // Create a reader to read Twitter's stream
            BufferedReader reader = new BufferedReader(new InputStreamReader(response.getStream()));

            // Change the delimiter below if you want a different delimiter
            String delimiter = "|^|";

            // Prepare the file for writing to (HH = 24-hour clock, so names do not repeat after noon)
            SimpleDateFormat dt = new SimpleDateFormat("yyyyMMddHHmmss");
            String fileName = OUTPUT_FOLDER + dt.format(Calendar.getInstance().getTime()) + ".txt";
            System.out.println(fileName);
            File file = new File(fileName);
            file.createNewFile();
            FileWriter fw = new FileWriter(file.getAbsoluteFile());
            BufferedWriter bw = new BufferedWriter(fw);

            String tweet;
            int counter = 0;
            while (((tweet = reader.readLine()) != null) && (counter != NUM_OF_ROWS)) {
                // The stream sends blank keep-alive lines; skip them instead of failing to parse
                if (tweet.trim().isEmpty()) {
                    continue;
                }
                // Uncomment the following line if you want the entire JSON and decide on what data you want
                // System.out.println(tweet);

                // Some JSON parsing to get the data required
                JSONObject obj = new JSONObject(tweet);
                // Skip messages that are not tweets (for example delete or limit notices)
                if (!obj.has("text")) {
                    continue;
                }
                String createdAt = obj.getString("created_at");
                String id = obj.getString("id_str");
                String tweetText = obj.getString("text").replaceAll("[\\t\\n\\r]", " ");
                String userId = obj.getJSONObject("user").getString("id_str");
                String screenName = obj.getJSONObject("user").getString("screen_name");
                int followerCount = obj.getJSONObject("user").getInt("followers_count");

                String outRow = createdAt + delimiter + id + delimiter + tweetText + delimiter
                        + userId + delimiter + screenName + delimiter + followerCount;
                bw.write(outRow);
                bw.newLine();
                System.out.println(outRow);
                counter++;
            }
            bw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
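If you later read the output file back in Java, keep in mind that String.split treats its argument as a regular expression, and the characters in "|^|" all have a special meaning there. Below is a small sketch of reading one row back, assuming the default delimiter used above; the class name TweetRowParser is just for illustration and is not part of the original program.

```java
import java.util.regex.Pattern;

/** Splits one output row back into its six fields. (Hypothetical helper.) */
final class TweetRowParser {
    // Must match the delimiter used when the file was written
    static final String DELIMITER = "|^|";

    /**
     * Pattern.quote makes split treat the delimiter as literal text.
     * The -1 limit keeps trailing empty fields instead of dropping them.
     */
    static String[] parseRow(String row) {
        return row.split(Pattern.quote(DELIMITER), -1);
    }
}
```

The fields come back in the order they were written: created_at, tweet id, text, user id, screen name, follower count. A tweet whose text itself contains "|^|" would still split into too many fields, so pick a delimiter that is unlikely to show up in your data.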
To play with the JSON data directly in Pig and Hive, I changed the code above to write each JSON message to the file as-is (instead of parsing it), so that Pig and Hive can consume the JSON files directly.
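That change is not shown here, but a rough sketch of what the loop could look like follows. It is an illustration under assumptions rather than the exact code I used: the class name RawTweetWriter and the method copyRawLines are made up, and it simply writes one JSON message per line, skipping the blank keep-alive lines the stream sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;

/** Copies raw JSON lines from the stream to a writer, one message per line. (Hypothetical helper.) */
final class RawTweetWriter {

    /**
     * Writes up to maxRows non-blank lines from reader to out.
     * Pass a negative maxRows to keep going until the stream ends.
     * Returns the number of lines written.
     */
    static int copyRawLines(BufferedReader reader, Writer out, int maxRows) throws IOException {
        int written = 0;
        String line;
        // Check the row limit before reading so no extra line is consumed
        while ((maxRows < 0 || written < maxRows) && (line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // blank keep-alive line, nothing to store
            }
            out.write(line);
            out.write('\n');
            written++;
        }
        out.flush();
        return written;
    }
}
```

In the program above, reader would be the BufferedReader built from response.getStream() and out would be the BufferedWriter for the output file, with NUM_OF_ROWS passed as maxRows.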